All right, guys, thanks for coming today. This is the second week in our Seven Databases in Seven Weeks series. So today, I'm really excited to have Mike Zwilling. He is an old-school database developer. Like Garth is an old-school file system guy, Mike is an old-school database guy. He cut his teeth in the early days working on Shore back in Wisconsin. Back then, the answer to the problem wasn't just "let's buy more RAM" as we can do now. He actually had to work really hard to make the systems work reasonably well. He worked there for six years. And then he's been at Microsoft working on the SQL Server team. He's one of the original guys that took the original Sybase code that Microsoft got and converted it into SQL Server. He's been with them about 20, 25 years, right? 19 years. So what he's talking about today is a new table and storage engine in SQL Server called Hekaton, which is sort of their in-memory-optimized OLTP system, right? OK, yeah. Great, thank you, Andy. It's great to be here and have a chance to talk about this. I love talking about it. I like it so much, my talks can run on and on, so I want to welcome questions throughout and encourage the interaction. I've hopefully removed anything that's product-specific from the talk, because I'm not here to talk about the product so much as the technology, how it's built, and the history around it. This picture just shows some of the people who worked on the project throughout the years. And it was not only a great engineering experience, it was great collaboration with Microsoft Research in a number of areas. So is this going? Page down. There we go. So just quick background on me. Andy kind of covered this. I got my master's at the University of Wisconsin way back in '89, and then I joined the research staff in the database systems group there, building database systems, including Shore. And it's neat to see that folks from CMU have taken the Shore project and really made it a more real system, and certainly a much more modernized and faster system. I joined Microsoft at the end of '95, and was really fortunate to work on basically rebuilding the database system from the ground up. And when we did that, we looked at more modern architectures back then and said, well, we want something that'll last 10 to 15 years. And so we built that. Then around 2006, 2007 (that release came out in 2008), we're like, hey, it's kind of approaching 15 years, and we really ought to be thinking more deeply, architecturally, about where the database system should be going, given hardware trends, given application trends, and I'll talk about that. That led to an incubation project in 2008. Near the end of 2008, we formally started a small incubation team, the Hekaton project, combined with some folks from Microsoft Research, including Paul Larson. I've worked with Microsoft Research throughout my career, and that's been a neat experience. So let me jump into the talk. We'll start out with a bit of motivation around hardware, software, and application trends. This is a systems audience; you get this, so we'll go through it fairly quickly. Then we'll get into the architecture of Hekaton and how it fits into the SQL Server database product. To be clear, Hekaton itself is not really a database. You might call it a database storage engine combined with a compiler. But the database itself is SQL Server.
And this is just basically a component within the larger SQL Server database system. So we'll go through that: its core architecture, how it's integrated, things like that. I'll cover some customer and performance results. I'll talk a little bit about the future, more indirectly than directly telling you what our future plans are. And then I'll review it all, and I have a bunch of links to papers and where you can get more information on the system. So why did we do this? Why did we set out to build a new transaction processing system at the heart of SQL Server? One is applications and users' expectations: across the board, low latency in almost all dimensions. Whether you're swiping on your phone or running big analytical queries, people want more and more. On the phone, you want instantaneous human response time. On big analytic queries, you don't want hours, you want seconds. And in transaction processing, you're really talking machine time, and you want microseconds. This photograph is just emblematic of that. It's a laser interconnect network in Frankfurt for high-frequency stock trading. You see extreme microwave or laser networking systems set up to reduce latency point to point, because light basically travels in a straight line through one of these networks, versus fiber optics, where not only is the cable not straight, but within the cable the light's bouncing around. So people are going to extremes to reduce latency at all scales. That's the demand from the application space. And then the hardware trends, whether it's SSDs, very large main memories, non-volatile memory, or high-performance networking: all kinds of things are coming together to say, hey, things have shifted a lot over the years. And how is the database system going to evolve or adapt to that? So ultimately, we set out to build a high-performance, main-memory-optimized engine that's deeply integrated into SQL Server and architected for both these application trends and the hardware trends. So a quick reminder of these hardware trends. This slide's a little dated; if someone has more recent CPU trends, that'd be great to know. But basically, the green line shows Moore's law, more transistors. For a while that's shown up especially as lots of cores, but what we're seeing more now is systems on chip, GPUs, things like that, integrated right into the CPU. So you're getting horsepower in that sense. But you're certainly not getting aggressive clock rate increases, because basically the CPUs would melt at the power consumption. They basically topped out at 100 watts in a socket and aren't going up above that. And meanwhile, memory prices have dropped. The bottom line for transaction processing, which is often short little pieces of work, is that new CPUs won't run them dramatically faster than old CPUs. And it's hard to parallelize really short operations. You can do a lot with parallelization over big queries, but a lot less over small, short transactions. So SQL Server has basically two core in-memory components. The middle one here, in-memory data warehousing, is basically column store technology, and that's been in the product since the 2012 release. It's not just in-memory: the column stores are all on disk and are generally all managed on storage. But I call them in-memory technology because they are optimized for feeding the CPU. They do happen to reduce IO a great deal, but that wasn't their primary goal.
And then there's the in-memory OLTP technology, which is the focus of this talk. Let's jump ahead into the architecture. I'm going to walk through these architectural pillars, if you will. The first one speaks right to performance in main memory. And the main principle is that if you want to go fast, your data's in memory. Sometimes customers, users, people will talk about main memory systems as, well, they're fast because they eliminate IO. And it's like, no, that's not why they're fast. If you want to go fast, you put lots of memory on a machine, and any decent database system will eliminate IO, because it has a big buffer cache. It'll cache everything. The key to an in-memory system is that it relies on the fact that the data is going to be in memory, and architecturally it's built differently because of that. The main thing is, hey, you can use direct pointers to rows. Now, rows are in memory. You don't have to go through indirections into buffer caches. Indexes help you locate data. Well, if data's in memory, then there's no reason to have indexes on disk. Yes, everything has to be durable on disk, and we'll talk about how that's achieved. But while processing data, it's all in memory, and therefore indexes are an in-memory concept. There's no buffer pool. And a common term in databases is write-ahead logging: the notion that if you modify anything, you have to log the modification before you do it, because you never know when that modification might get flushed out of memory because there's insufficient space. Well, that doesn't exist in an in-memory system, so you don't have to build write-ahead logging. And all the checkpointing and transaction logging that you typically do can actually be more stream-based, because you're not depending on high-performance, low-latency IO. It's great to have it, and it's still kind of important for the transaction log, but it's not critical to the performance of the system in general. One of the goals, then, is basically: give me the speed of an in-memory hash table, an in-memory cache, but in the database. And this isn't necessarily a new idea. I mean, the H-Store work that Andy worked on was professing this at basically the same time we were working on trying to achieve it, even a little bit before. So this is, in essence, a critical pillar. But it's not enough. It's not enough to say the core engine and the indexes and the transaction processing system are efficient. If you really want to go fast... and I guess I should say where Hekaton came from. Hekaton is basically a derivation of the Greek word for 100. And I think the early H-Store work was trying to achieve, hey, 100 times faster in a database system. We looked at that and said, OK, we're going to name the project after that goal. And that goal is an aspirational goal. We didn't achieve 100x. But it's important to focus on that 100x in the sense that it forces you to think about everything in the core execution path of the transaction. If you say 2x, well, you can leave a lot of inefficiencies in the system. But at 100x, you have to remove 99% of what you're doing. And a key piece of achieving that performance shows up in the second pillar, which is that when the database system is running queries and executing the SQL language, it needs to be doing that at native code execution speed.
Basically, the database system is highly flexible about schema, about changes, about modifications, with a bunch of code that's special-case logic for rare conditions, often evaluated at runtime. And the key to performance here is not to do that: to compile it into native code and, wherever possible, push decision-making to compilation time, with the goal being that your queries, your business logic, and your stored procedures run at native code speed. So those two pillars speak to the efficiency, and we'll go into those more. The next one is, well, you do have a proliferation of cores, so you'd better be able to run well and scale well across all the cores in the system. And the key principle around that is the assumption and observation that conflicts are rare. Typical lock-based systems, including the way SQL Server's existing row store engine is built, are all very pessimistic. They assume that if I'm changing something in the database, you're going to modify the same thing at the same time, or want to read it, and conflict with me. And the reality is conflicts are very rare, and the faster you run, the rarer they are. So the system uses multi-version, optimistic concurrency control, but does it supporting the full set of transactional isolation levels, as you would expect from a database system: it's serializable. Yes, it will retry if it fails; it assumes no failure, and if it fails, it retries. And the question is, does it have a fallback mechanism to a locking-based system? No, it does not. We actually built one. In fact, it was an intern on our engineering team, Spyros Blanas; he's now a professor at Ohio State. He built a version of Hekaton with a full locking-based system. We had kind of the grand idea that we could combine those two, but we haven't done it. Haven't needed to do it, I guess, is the way to say it. Yeah, it switched to a locking-based system, and it was designed to be able to work flexibly with both. You could combine them even on the same tables and same indexes. So from a transaction concurrency point of view, it's optimistic concurrency control. And then from a multi-threading point of view, the system uses lock-free data structures throughout, and we'll talk about what those data structures are later. The key, the end result basically, is that your transaction runs beginning to end with nothing to stop it, nothing to block it. It never acquires locks. It doesn't have any latching mechanisms with other threads, and it doesn't do any IO, except at the very end, when you want to commit your transaction, we do have to write the log. But that's after you've done all the work of the transaction. So there's no blocking within it, and there's no context switching. That's one of the goals of the system: within a transaction, it's not doing that kind of stuff. Of course, it will context switch if it turns out to be a long-running transaction and all that. We want the common case not to require that. And then finally, when we were building all this, we had the option of building a separate database product. But in talking with all the users, and in the way we thought about doing the engineering, we decided to ultimately build it into SQL Server. It's just another capability, another feature of the database system.
It leverages SQL Server's parser, security subsystem, backup/restore system, and high availability system. We didn't need to rewrite all of those; we could just leverage what the product already has, because none of those are in the critical path of processing a transaction. You might say the high availability system is, but the majority of the code in the high availability system isn't about the latency of replicating a transaction; the complexity in it is managing all the failure cases and the distributed systems part of things. So building it in was a huge challenge, because SQL Server is a massive code base and it's very complex. But at the same time, it got us a lot of this stuff nearly for free, and it really helps with our customers and users, because they're basically managing and using the same product they always have. So let's drill into some of these. Actually, before we do that, I want to give you the overall picture of how this fits into the database system as a whole. If you look at this diagram, we've got applications running. They connect over a network into the session management layer of the database server. This is SQL Server.exe. As typical requests come in, they have to be parsed; there's a catalog tracking all the tables, and a query optimizer. The end result, if it's a request we've never seen before, is we compile it all into an execution plan. And if it's one we've seen before, we've cached those execution plans and we rerun them. This is all how SQL Server works today. But those execution plans (we call it compilation in the product today) aren't compiled into native code. It's compilation into something you can interpret, a tree you can walk and interpret. So we have SQL Server's language, it's called T-SQL, and the interpreter runs against the storage engine. The storage engine has access method APIs to access indexes and tables. Those indexes and tables are all cached in a buffer pool. And then there are two broad storage containers, something we call filegroups. There's a data filegroup that stores all your indexes and tables, and a transaction log. And I show the tables and indexes here, meaning every modification to a table or an index is stored in the transaction log. That's how the system basically works. We introduced an engine in here to manage what we call memory-optimized tables and indexes. It has its own storage format, which we'll talk about. This is not drawn to scale. If this were drawn to scale, this engine is probably about this big compared to the entire code base of the system. And that's the value of leveraging everything that's already there and not building it all from scratch. So you can mark tables as memory optimized, and they're going to be managed by this engine, those tables and their indexes. Their storage is going to be down here in the memory-optimized filegroup, and they will have log records. But you'll notice, as I mentioned earlier, indexes aren't logged and they're not stored. They're an in-memory concept only. And we have an option also called non-durable tables, where the schema is durable but the row content is not. In this example, I've made the green table non-durable, so it has no artifacts on disk or in the log whatsoever. It's fully atomic, consistent, and isolated, but not durable. And this turned out to be a really popular option.
And it's something we prototyped and gave to customers early to play around with, before we had built the storage side of things. We were giving it to them early so they could get the look and feel of how the system might work. And they were telling us, we like this as is. Don't give us storage. We would take this as a feature. Oh, OK. So we turned it into a feature also. Yeah, it's the equivalent of temp tables, temporary stuff. Maybe they're bulk loading a bunch of data, and they're going to destage it into the column store engine or somewhere else, but they want to absorb it much more quickly. They would just use a non-durable table. Basically, when they can refresh the data from somewhere else, this is a nice option. But even web shopping carts, you can use it for that if you want. It's a cache. The simplest way to access the data in this engine, from the user's perspective, is to just go through the interpreter. The query optimizer and compiler know how to build plans that will execute against this data just as well as that data. And you can do that, and you get the high concurrency and efficiency of this engine, but you don't get the efficiency of native code generation. That comes from incorporating the Hekaton compiler. The execution plan trees that the compiler and optimizer in SQL Server generate can be fed into this compiler (I'll show that a little later on) to generate natively compiled stored procedure libraries that we can load into SQL Server and run directly. Those run directly linked against this engine, and for the most part execute entirely in this space and never touch any of that while they're running. OK, so when things are running in this space, it's generally 20 to 40 times more efficient than running over in that space. And that efficiency is not even about scale efficiency. It's about taking fewer CPU cycles to do all the work. Real applications, though, see 2 to 30x, and I'll talk about that spread a little bit later. There's a variety of reasons for it; I'll explain some of that now. One is that when you make this core execution path basically free relative to everything else, the whole communication stack, everything from the app and the operating system to the session management, gets a huge magnifying glass over it. There's always a next bottleneck, right? And the next bottleneck, for certain cases, is going to be the network communication stack. We haven't tried to really improve that too much. We've done some. But we believe we can take the compiler technology and actually apply it to that whole space, and handle request processing with specialized code for requests rather than generic code. Likewise, we've done a lot to reduce contention on the transaction log, which is a typical source of contention in database systems. But ultimately, if you're making modifications and you're committing transactions and you're waiting for that commit to be hardened to some durable storage, you still want low latency there. You're still going to want an SSD there at least. And that will still add latency that we can't hide by reducing instructions; it still exists. So that limits some of the speedup you will see. So that gives you the overall picture. Now I'm going to drill into some of those pillars: how things are arranged in memory, how the concurrency control works, a little bit about how the compiler works. So what I show in memory here are rows. Their schema is a name and a city. These are versions of rows.
This is a multi-versioning system. So rows have two timestamps, a begin timestamp and an end timestamp. In this case, if you look at John, John is currently living in Beijing since timestamp 200. The infinite timestamp on the end means this is the live row in the system. John used to live in Paris from timestamp 100 to 200. There were a couple of versions of a row for Susan, but Susan isn't a live record in the database anymore: originally Susan in Brussels, updated to Susan in Beijing, and then someone must have deleted the Susan row, because its end timestamp is set and there's no newer version. Basically, in a multi-versioning system, a transaction gets a timestamp when it starts, and it sees all versions live as of that timestamp. If you add indexes into the picture, we have two forms of indexes: hash indexes and range indexes. You'll notice that the rows have index pointers in them. So collisions in a hash bucket chain are just rows linked together. Same thing with duplicates at the leaf level of the B-tree: the rows are just chained together. And all versions are effectively duplicates, in that sense. Especially in the hash table, you won't see any replication of the row's data in the index. There's no need for that; it's simple enough to get it by following a pointer. But if you look at a disk-based system, you'll often see columns included in the index just to reduce the cost of having to go reference another index. Here it's as simple as just following pointers. The range index is based on what's called the Bw-tree. It's a lock-free B-tree developed by Microsoft Research. I think you guys had a talk a couple weeks ago from MemSQL, and they talked about skip lists. Skip lists look like the natural lock-free solution for range indexing. It's one we seriously looked at as well, and it was in our plans to do it. But then I was talking with David Lomet at Microsoft Research on some stuff, and he said, hey, I've got this lock-free tree structure I'm building. Oh, OK. So we put them head to head, prototypes of skip lists against prototypes of the Bw-tree. And the Bw-tree was quite a bit faster, a factor faster, mainly because of cache efficiency in the CPU. So we went with it. And I think that's probably the fastest case I've seen of research technology getting into a product, especially something so fundamental and core to a database system as its access method. So that was fun to get that happening.
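To make the version layout and visibility rule concrete, here's a minimal C sketch of what was just described. The struct layout, field names, and two-index assumption are all invented for illustration; this is not SQL Server's actual row format.

```c
#include <stdbool.h>
#include <stdint.h>

#define INFINITY_TS UINT64_MAX

/* Invented layout for illustration; not SQL Server's actual row format. */
typedef struct RowVersion {
    uint64_t begin_ts;                 /* commit time of the creating txn */
    uint64_t end_ts;                   /* commit time of the deleter/updater,
                                          or INFINITY_TS for the live version */
    struct RowVersion *index_link[2];  /* one link per index on the table:
                                          hash-bucket chain or Bw-tree
                                          duplicate chain */
    char payload[];                    /* the row itself, e.g. (name, city) */
} RowVersion;

/* A transaction reads as of its begin timestamp and sees exactly the
 * versions whose lifetime contains that timestamp. */
static bool is_visible(const RowVersion *v, uint64_t read_ts)
{
    return v->begin_ts <= read_ts && read_ts < v->end_ts;
}
```

In the slide's example, John's live version (begin 200, end infinity) is visible to any transaction reading at timestamp 200 or later, while the Paris version is visible only to readers with timestamps from 100 up to, but not including, 200.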
Let me jump ahead. So that's the basics of how things work in memory. If questions come up later, I've got a slide that kind of animates the lock-free pieces. But in the interest of showing you the whole system, I'm going to move on to the non-blocking execution. I think I've probably said most of this; sometimes I include slides like this just for readers of the slides to get the picture if they can't hear me talk about it. Let's talk about the lifecycle of a transaction, which I think is the next slide. The lifecycle of a transaction basically is: transactions begin, they get a timestamp, and then they do their processing. They read rows that are visible to them based on their timestamp. And the versioning system (I'll talk about garbage collection a little later) ensures that versions they're supposed to see will always be around for the lifetime of that transaction. That's why they don't need any locking and are free to just follow pointers. So they do their normal processing. What they remember, depending on the isolation level, is a pointer to everything they read, information about all the scans they did, and a pointer to everything they wrote, all the new versions they created. Then they proceed into the commit phase. They enter pre-commit, where they acquire an end timestamp. This end timestamp establishes the serialization point of the transaction. You'll hear database systems talk about serializability and serialization histories: basically, we want to make it appear as if every transaction executed in isolation at a particular point in time. And Hekaton does this regardless of your isolation level: a transaction has a point in time at which it commits, and that is independent of every other transaction before or after that point. So it gets this end timestamp. Now it has to confirm, because it's been optimistic, that there are no conflicts with what it did. Someone else modifying a row that it read is the main conflict, or someone inserting something into a range it scanned that it didn't see but should now have seen. So it goes through this validation phase. If it completes that validation phase, it proceeds to write redo information to the transaction log about all the versions it created and the lifetimes it ended. Once it does that, we set its state to committed, and at that point, it's irrevocably committed. And then we go back (I showed all those timestamps on the rows) and fix up the end timestamps on all the rows. While the transaction's running, if it attempts to modify a row, which will create a new version, it sets a tentative end timestamp, which is basically a transaction ID, on the row. There's a whole paper that explains how this all works, and I have references for that later. But we have to go fix up those timestamps, and then we can terminate the transaction. Let's see if I have anything on dependencies. The interesting thing that can happen here is that the minute you get the end timestamp, you're going to either commit or abort based on validating that your transaction is still valid. If you commit, it means any transaction starting in this time frame must see all your changes. And if you don't commit, any transaction in the future must not see any of your changes. But you're in this ambiguous state: you don't know if you're going to commit or not, but you know what your timestamp is. And you've created all these versions that all future transactions, including those starting while you're in this small window, are obligated to see. So what we do is a bit of speculation. We allow other transactions that begin in this time frame to see your changes and take a dependency on your outcome, so they don't have to wait for your validation to complete. Normally, that all goes fine: you commit and you clear all the dependencies. But if you roll back while other people have dependencies on you, you will take them down. They will have to roll back also. It looks a lot like speculative execution in the CPU. But we don't take it as far as you could. You could take it much further and say, I'll speculate on everything: every modification I'm making, if you read it, you see it, and you take a dependency on my outcome. We could build the system to do that, but we didn't want to cascade speculative failures. So we haven't pushed that, and we haven't really needed to.
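Here's a rough C sketch of that commit path, reusing the RowVersion and is_visible definitions from the earlier sketch. All the names are invented, and scan validation, logging, and timestamp fixup are reduced to stubs; it's meant to show the ordering of steps, not a real implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TXN_ACTIVE, TXN_VALIDATING, TXN_COMMITTED, TXN_ABORTED } TxnState;

typedef struct Txn {
    uint64_t begin_ts;          /* fixes the snapshot this txn reads */
    uint64_t end_ts;            /* acquired at pre-commit: serialization point */
    _Atomic TxnState state;
    RowVersion **read_set;      /* versions read (repeatable read/serializable) */
    size_t n_reads;
    /* scan set and write set elided */
} Txn;

static _Atomic uint64_t global_clock = 1;

static bool revalidate_scans(Txn *t)     { (void)t; return true; } /* elided */
static void write_redo_log(Txn *t)       { (void)t; }              /* elided */
static void fixup_row_timestamps(Txn *t) { (void)t; }              /* elided */

bool txn_commit(Txn *t)
{
    t->end_ts = atomic_fetch_add(&global_clock, 1);  /* serialization point */
    t->state  = TXN_VALIDATING;

    /* Read validation: every version we read must still be visible as of
     * our end timestamp, i.e. nobody ended its lifetime underneath us. */
    for (size_t i = 0; i < t->n_reads; i++) {
        if (!is_visible(t->read_set[i], t->end_ts)) {
            t->state = TXN_ABORTED;      /* aborts dependents too */
            return false;
        }
    }
    /* Serializable only: repeat the scans to catch phantoms. */
    if (!revalidate_scans(t)) { t->state = TXN_ABORTED; return false; }

    write_redo_log(t);                   /* new versions + ended lifetimes */
    t->state = TXN_COMMITTED;            /* irrevocable from here on */

    fixup_row_timestamps(t);             /* swap tentative txn IDs for real ts */
    return true;
}
```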
So in the interest of time, I'm going to skip over this. This basically covers the types of validation we must do depending on the level of isolation we need to achieve. If you're at snapshot isolation, there's no validation required at all, meaning you see a snapshot of the database and you don't care if it's changing underneath you. If you want repeatable read, we just have to verify that whatever you read hasn't been changed. That's relatively straightforward. For full serializable, the main difference is we have to avoid phantoms. We have to avoid the case where I scan a range of data and don't expect there to be any new content in that range that I would miss. So we have to validate that no new content has arrived in that range. What's nice about this compared to locking-based solutions is that in locking-based solutions, the higher your isolation level, the more other people pay, because you're acquiring locks that prevent them from executing or from seeing that data. Whereas with this, yes, the transaction that requests a high isolation level pays, but the other transactions in the system don't pay any more than they would have. So that works out well. There's a link here, and a link at the end of the slides, for the gory details of how this is achieved. It's a multi-versioning system that creates old versions, meaning it creates garbage. And usually when I use the terms garbage collection and high performance, people don't put those two things together. Garbage collection systems are slow. They cause blocking. They cause stalls. What's nice here, compared to typical programming language garbage collection systems, is that those are based on reachability: can I reach data? If I can reach data from the root of some object, then I have to keep the data around. And it's been a challenge for years to achieve that with high performance and without blocking. Fortunately, with a multi-versioning transaction system, it's not about reachability. It's about visibility. If a row version isn't visible to any transaction, it's garbage, and it's pretty easy to determine whether it's garbage or not. And so Hekaton's garbage collection system has the properties you really want out of a garbage collection system. One, it's non-blocking: no one ever blocks on garbage. It's cooperative: any thread that encounters an old row that's garbage can remove it and get it out of the way, out of the linked lists, so other threads don't encounter it. It's incremental, so you can do a little bit at a time. It can be done in parallel: you can have lots of threads doing it, basically all the worker threads in the system. And it's self-throttling. Sometimes systems will build separate garbage collection threads, separate from the threads doing all the work. That leads to the problem of, hey, how many worker threads should I have? How many garbage collection threads should I have? What Hekaton does instead is that worker threads, after they're done processing a transaction, are obligated to go collect some garbage. And if there's more garbage around, they're obligated to collect more and more of it. That kind of self-throttles the system: if it's creating lots of garbage, the worker threads are cleaning it up, rather than depending on complex separate scheduling of garbage collection. So garbage collection shouldn't be a serious concern. I mean, there are some hard engineering problems in building it, but it shouldn't block execution, and it shouldn't be a heavy consumer of CPU resources.
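The visibility test behind that is simple enough to sketch in a couple of lines, again reusing the RowVersion definition from the earlier sketch. How the system tracks the oldest active transaction globally is assumed here, not shown.

```c
/* A version is garbage once its lifetime ended before the oldest
 * still-active transaction began: no present or future transaction can
 * ever see it, so any worker thread that stumbles on it can unlink it
 * from the index chains and free it. */
static bool is_garbage(const RowVersion *v, uint64_t oldest_active_begin_ts)
{
    return v->end_ts != INFINITY_TS && v->end_ts < oldest_active_begin_ts;
}
```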
The biggest downside of a multi-versioning system is a long-running transaction. A long-running transaction prevents garbage collection, and the system suffers massively from a variety of problems. So basically, don't do those kinds of things. I mean, we could get more sophisticated. If I just start a transaction and then go off to lunch, the system will hold hostage row versions in the database. They're going to be stuck in memory, just like in a pessimistic system I'd be holding locks and preventing log truncation and all kinds of other bad things. But the reality is that that transaction only holds hostage certain row versions, not all row versions, and we could build a system that can collect around a long-running transaction. Yeah, but typically what the system will do is abort it. If it's causing that much consumption of memory resources, it'll get aborted. Correct. Yes, and that's typically what locking-based systems do too. If you're a transaction and you've been holding locks for two hours, you're going to get killed. If you're blocking others, same, yes. People get trained not to do that kind of behavior. Which brings us to compilation, and how compilation works, I should say, not how query execution necessarily works. I mentioned that we have this mode called interop. That lets the regular T-SQL interpreter access memory-optimized tables. But the alternative option is something we call natively compiled stored procedures, where we convert the procedure into C code, actually run it through a C compiler, and get a DLL. I'll walk you through that in a bit here. So here you see three broad architectural pieces. One I just call the SQL engine. This is everything SQL Server already has for parsing, schema management, query optimization, constant folding in the expression system, all kinds of stuff you would expect in the front end of a compiler. Then you have the Hekaton compiler, which just knows how to compile abstract trees into C. And then here you have the Visual C compiler, which knows how to compile C into machine code. So basically, on the front end, we modified SQL Server to take its execution plan and generate a MAT. That's a mixed abstract tree. We call it mixed because it's got imperative logic (the T-SQL language has while loops and if statements and things like that) combined with the query plans, the set-based query plan operator trees, and expression trees. We generate a single tree representing all of that. That's a pretty inconvenient form to generate C from. So we go through a layer of compilation that converts it into pure imperative logic, a form you can walk in a single pass and generate C code from. We feed that to the compiler and get C code out of it. The other piece is we also need the schema. We need to know what the format of the row is, how many of those index pointers should be in there, and what the hash functions and key comparison functions are. All of that's generated separately. We go right to the PIT, the pure imperative tree, and generate basically a separate DLL for the schema that has all of those functions: key comparison, row formats, all of that for the execution.
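Purely as an illustration of the shape of the output, here's what a compiled point lookup (something like SELECT City FROM T WHERE Name = @name through a hash index) might boil down to. Every name here is invented and the real generated C is far more involved, but it shows why this runs at native code speed: hash, walk a chain of direct row pointers, check visibility, compare the key. It reuses RowVersion and is_visible from the earlier sketch.

```c
#include <string.h>

typedef struct HashIndex {
    RowVersion **buckets;
    size_t       nbuckets;
} HashIndex;

static size_t hash_string(const char *s)      /* toy hash, for illustration */
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* The schema DLL supplies key-comparison functions like this one; here
 * we just assume Name is the first, NUL-terminated column. */
static bool name_equals(const RowVersion *v, const char *name)
{
    return strcmp(v->payload, name) == 0;
}

const RowVersion *lookup_by_name(const HashIndex *ix, const char *name,
                                 uint64_t read_ts)
{
    size_t b = hash_string(name) % ix->nbuckets;
    for (const RowVersion *v = ix->buckets[b]; v; v = v->index_link[0]) {
        /* Bucket collisions and old versions share one chain; skip
         * whatever this transaction isn't supposed to see. */
        if (is_visible(v, read_ts) && name_equals(v, name))
            return v;   /* direct pointer to the row; no buffer pool */
    }
    return NULL;
}
```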
And so then basically we represent an entire stored procedure, with many queries and expressions and everything in it, as just a C function that you have to execute. An interesting thing I don't talk about in the slides: in the database world, there's a common problem called SQL injection attacks, the ability of malformed SQL to exploit bugs in the higher levels of a SQL database system and inject SQL that was never intended to run. The concern here is that you could have C injection attacks, which could be even more severe. So we have many layers of protection against that, built in from day one. Basically, none of the human-written text that comes into the front end ever appears in this C code. It's all completely abstracted away. So that's an interesting challenge from the compiler space. We also have to be careful about the code we generate. We are generating C code, but if someone writes a while loop computing pi, we can't translate that directly into a trivial while loop in C computing pi, because that would lock out a core of the system. The code generated here has to actually provide all the capabilities people expect from the database system, which is that I can still control it. I can kill runaway executions. I can see what's happening. And we can't have random exceptions and things like that. So it actually gets quite complex to generate code that is completely safe and behaves as if you had a simple interpreter running.
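As a hedged illustration of that last point, a generated loop can't just be a bare C loop; it has to cooperate with the server so runaway work stays killable. Something in the spirit of the following, with all names invented:

```c
#include <stdbool.h>

typedef struct ExecContext {
    volatile bool abort_requested;   /* set by the server to kill the query */
    /* parameter values, error status, etc. elided */
} ExecContext;

/* Code emitted for something like: WHILE @i < @n SET @i = @i + 1 */
bool run_generated_loop(ExecContext *cx, long n)
{
    for (long i = 0; i < n; i++) {
        if (cx->abort_requested)     /* periodic check keeps it controllable */
            return false;            /* unwind cleanly; no stray exceptions */
        /* ... emitted loop body goes here ... */
    }
    return true;
}
```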
So I've talked about the core pieces around efficiency and compilation and concurrency. But it's a database system, and people expect their data to be available despite crashes and failovers and a variety of other things that can go wrong. So how do we make it all durable? What I think is interesting about the challenge here is that in a traditional row store engine, where you're caching disk pages in memory, what's in memory and what's on disk are basically identical. If it's consistent in memory, it's consistent on disk, barring bugs in the IO system or things like that. What's in memory is just a cache of what's on disk. Anytime you want to validate the consistency of the database, examine what's on disk, or bring it all into memory and examine it there; it's all the same thing. In this system, and I think in probably most typical main memory systems, what's on disk looks nothing like what's in memory. Memory is not just a cache of what's on disk. It's a format optimized for processing in the CPU, not a format optimized for reducing IO. So you've basically built two systems, and you'll see that in some of the diagrams I'll go through. This is purely for durability, the ability to handle a crash or failover and things like that. Right, power failure. You could mirror it, and you can say, yeah, the mirror is on an independent power supply or in an independent data center and all that. And you could argue about some of this, but from a customer perspective, the notion that it's just in memory, even if it's in the memory of multiple machines, is too scary, much too scary. So what does the system do for durability? In memory, we've got rows and indexes, like we talked about, and we're doing operations. When we do operations that modify data, we log it. What we log is old versions and new versions. That's what goes into the transaction log. And ultimately, we want to produce something on storage. Now typically, what systems have done is snapshot the contents of memory and write them to disk every so often. We wanted to avoid that, especially with very large main memory systems out there. The notion of flushing the entire contents of memory, multiple terabytes, to disk periodically, we'd prefer not to do that, especially if most of it hasn't changed. But even if, say, only part of it has changed, that gets you back into tracking changes all over memory and then needing an efficient way to figure that out and write a differential checkpoint. We don't want to get into that complication either; it pushes extra overhead into the main execution path of the system. So instead, what we have is a separate checkpoint process that's scanning the log, streaming data from the log into different streams. And you ask, why would you do that? What it's doing is pulling new versions into temporally organized streams of data. So all row versions from, say, timestamps 100 to 199 are here, and after that file fills up, we'll start another file, and another file, just with all the row versions. Of course, row versions get deleted or get modified and become invalid. So right alongside each of these data files is a separate stream of the IDs of deleted versions. Again, just streamed out as we scan the log. And this is what's checkpointed. Once we scan through the log and produce these streams, we can go ahead and truncate the log, as a typical database would, because that's exactly what checkpoints are supposed to let you do. So what happens if you crash and you've got these files? On the right-hand side here, we've crashed, there's nothing in memory. What do we do? The first thing we do is stream the delta files into memory and build hash filters. Then we stream the row versions through those hash filters. Any row that hasn't been deleted gets loaded into memory and put back into its indexes. This gives us a nice streaming, parallel load of the database during recovery. It's actually just a hash filter. We're not even sorting. In terms of the range indexes, we're just inserting into the indexes as we bulk load. It's like we're bulk loading the database, yeah. So we use all the cores in the system, and we stream through all the SSDs or disks in the system. And then we apply the transaction log to catch up on everything that hadn't been pushed into a checkpoint file. Now, I don't have an animation on this, but if you ran this in perpetuity, these files would just grow and grow, because I'm not actually deleting any data, just recording the deletions. Ultimately, recovery time would go through the roof, because most of these early versions (this is kind of meant to be drawn to scale with how much is deleted) would get mostly deleted, and when they become mostly deleted, it's just wasted space and it hurts recovery time. So another thing that happens in the background is we'll take adjacent data files that have had a lot of deletion, scan their delta files, produce the filter, scan the data files, and produce a new data file covering that entire range. So we might consolidate four data files with lots of deleted rows into a single data file with no deleted rows. Well, a small fraction, right.
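Here's a hypothetical sketch of recovering one data/delta file pair as just described: build a filter of deleted row IDs from the delta file, then stream the data file and bulk-load the survivors. File layouts and names are invented; running one such pair per worker thread gives the streaming, parallel load the talk describes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define FILTER_SLOTS (1u << 20)

typedef struct {
    uint64_t id;          /* row version ID; assumed nonzero here */
    char     payload[64]; /* invented fixed-size row image */
} RowOnDisk;

static uint64_t *filter;  /* open-addressed set of deleted row IDs */

static void filter_add(uint64_t id)
{
    size_t i = (size_t)(id % FILTER_SLOTS);
    while (filter[i]) i = (i + 1) % FILTER_SLOTS;
    filter[i] = id;
}

static bool filter_contains(uint64_t id)
{
    for (size_t i = (size_t)(id % FILTER_SLOTS); filter[i];
         i = (i + 1) % FILTER_SLOTS)
        if (filter[i] == id) return true;
    return false;
}

static void insert_into_indexes(const RowOnDisk *r) { (void)r; } /* elided */

void recover_checkpoint_pair(FILE *delta, FILE *data)
{
    filter = calloc(FILTER_SLOTS, sizeof *filter);

    uint64_t id;
    while (fread(&id, sizeof id, 1, delta) == 1)
        filter_add(id);                /* phase 1: build the deleted-ID filter */

    RowOnDisk r;
    while (fread(&r, sizeof r, 1, data) == 1)
        if (!filter_contains(r.id))    /* drop versions deleted later on */
            insert_into_indexes(&r);   /* bulk-load survivors into memory */

    free(filter);
}
```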
And basically, you see a couple of patterns here. One is that all IO is stream-based. And you say, well, why would you do that? You've got SSDs and all that. Well, this happens to be the way to do things for the cloud also. Stream-based IO, even with SSDs, is basically simpler and requires less from the underlying storage system. Yeah, yeah. And the other property you'll see here is no update in place. We never modify these files; they're immutable. The green files are immutable once they're full, and the red files are append-only. Just like we never modify anything in memory, we create new versions; same thing with files. Right, right. When we compact files, we'll delete them and rely on the new replacement file. So that's what's happening there. That's local crash recovery. But no one will buy that alone. I mean, there's no such thing as durability if it's only on a single machine. It's got to be across machines. So on the left side here, I show basically what was happening before. You can have many replicas in this system. The replicas are maintained by the transaction log. The transaction log is sent to as many replicas as you have. And basically we're doing two things on the remote side. We're running redo to maintain the in-memory state; the in-memory rows and indexes are fully queryable for read-only queries on the secondaries. And we're running that same checkpoint algorithm on the secondary. It's doing whatever it's doing; it may be behind what's here, but basically it's maintaining the replica of storage. So we're maintaining a replica of memory and a replica of storage. You could imagine systems where you maintain an in-memory replica but not a storage replica, if you wanted to scale out more queries but didn't need all the storage replicas. You could also imagine the system maintaining just the storage replica for durability, without maintaining the in-memory replica. We don't leverage any of that architectural flexibility, but it's in there. I think I've covered that; this slide just summarizes what I said. Performance. I'll go real quick on that. This outline is not drawn to scale; we're basically 90% done. A few microbenchmarks. In this one, I've got a while loop running queries inside the engine, no communication with the outside world. And I compare regular SQL Server row store, with all the data in memory, against the memory-optimized engine. So none of these performance comparisons are about reducing IO; everything's in memory in both tests. And you see basically 10 to 20x. Another microbenchmark is a loop just doing updates and inserts to rows, and you see 20x in that inner loop. You say, well, what do customers get? Here are some examples. I'm not going to go through this in detail. We've got some customers getting 8 to 11x. This one is basically lots of random inserts coming in; they're collected in the main memory database tables and then destaged to other tables later on, en masse, after those rows cool off. This one is ASP.NET session state, so shopping cart state, things like that, stored within the database system. They went from 15,000 to 450,000 batch requests per second. This one is a currency exchange trading system. They got about 2x on throughput, but the main thing they wanted was the latency reduction for some of their larger business transactions, which went from four seconds down to one second. So, a range of real-world performance gains.
If I were to summarize those gains: there are applications that fit really well with the model Hekaton has, and within its constraints. Hekaton is a V1 engine inside the product, and it has restrictions. There are things it can't do; I'll talk about that. But if you can live within those restrictions, and you have a decent amount of computation you want to do on your data with every exchange, and you're not really chatty (grab a row, return to the app, grab a row, return), you can get 25x. We've seen apps that have a lot of data ingestion, and are read-mostly after that, get 10x. By legacy app here, I mean an application where you use Hekaton like this: you have SQL Server, you have a database, it's got many tables, many indexes. You have a few hotspots, so you pick those hotspot tables and make them memory optimized. You take your hotspot pieces of stored procedure logic and convert them to native code. And you work around a few limitations in the product, and maybe you get 5x overall for your application from that. And then this is an interesting one. TPC-C is a classic database benchmark. I am not claiming anything about TPC-C performance as an official benchmark. We have not done an official benchmark. In fact, you cannot do an official benchmark with a main memory system, because of the scaling rules for the benchmark. The point here is that SQL Server has been optimized to run this benchmark and a few other benchmarks. For 20 years, right? And to even get a 1% gain on that benchmark takes heroic effort, right? Well, we can get about 2x running the core in-memory part of that workload. So at one level, that's sort of disappointing. On the other hand, people think, yeah, but you had to work for years to get 1%. So, yeah, Garth? Yeah. I guess the question is, for an existing transaction workload like this, do we understand what the minimal amount of work to get it done is, in terms of instructions? And I don't know of anything in that space. And also, the benchmarks have evolved, so they get harder. TPC-E is a much harder benchmark than TPC-C in terms of using a richer set of capabilities. Anyway, right? And if you look at what we're running here, this is a very chatty app. Relatively, nah, relatively chatty. Relatively chatty. It has extreme update rates, versus typical applications, which don't have such extreme update rates. But. And you guys are running it with stored procedures, right? Yeah, natively compiled stored procedures, right. So this is basically the future work slide, and then I've got references. This is a high-level view of SQL Server 2014. This is the engine that's been in it for a long time: a typical row store engine, extremely rich in capabilities and features, everything from XML to user-defined functions to myriad indexing options and all that. And I show on the outside the column store engine and the main memory optimized engine. Again, not drawn to scale, not drawn anywhere near to scale, these newer engines. And as much as they are all in the same product, and you can use them within the same database (you can have tables stored in this engine and this engine and that engine), there are still seams. I can't say which of those seams we're focused on most right now, but you can imagine that we want to eliminate the seams between these engines. And I'll give you a sense for that.
Today, AlwaysOn, which is the high availability system, backup, restore, security, administrative deployment: these are the same across all of these engines. So in that sense, administrators, DBAs, can see it as a seamless whole. There are a lot of things that work across these engines, but with limitations. Transactions work, but there are limitations on the isolation levels you can use across all of these. You can run queries across all of these, but you can't do natively compiled queries across all of them. You can run stored procedures that span all of them. There are some things that work only across two engines. The ability to combine range indexing and column store indexing exists across these two engines, but not over there. And that would be an obvious area where we'd like to have consistency across these. Readable secondaries, the ability to run queries on a secondary machine: you notice that applies to the memory-optimized engine, like I mentioned, but not to the column store engine. And then there are some things that are engine specific, like batch query execution, which is in the column store engine and could clearly be useful elsewhere, but it's not there yet. Natively compiled stored procedures: I mean, there's nothing magic about that compiler; it could target these other engines. There are some challenges, though. There are a lot of nice properties in the main memory engine, where you can just use pointers to rows and rows don't move around. That's not true for this engine, for example. And then blobs, blob storage. They're not typically something you focus on in an OLTP system, or in a column store for data warehousing, but blob is a very popular data type, it has a range of uses, and it's not available in all the engines. So, clearly lots of interesting problems for us to solve across these engines, and even within these engines there's lots of growth in capabilities we need to do. And finally, there's a common pattern in these in-memory systems. A lot of what I said about the OLTP engine applies to the column store engine. The fact that in-memory and storage formats are different, because basically you're optimizing for CPU efficiency instead of IO efficiency. Of course, a column store has great IO efficiency, I'm not saying it doesn't, but there's very much an in-memory piece of it. You see immutable data structures, both in memory and on disk; that makes a lot of things simpler. It adds some overhead, but it makes a lot of things simpler. IO done in bulk: the DW system and now the Hekaton system both do bulk streaming IO, that's what they focus on, if they do any IO at all. Optimistic concurrency control. And a fundamental point, which goes way back to Mike Stonebraker and others pointing it out: you aren't going to go 10 or 100 times faster by modifying existing code. It's there for a good reason. I like to tell people, when we compare and say, hey, we're 20 times faster than the original engine in SQL Server: it's not that that original engine is slow. It's actually a highly optimized engine for an architecture that focuses on disk-based storage, and I know of no way to make it faster than it is. It's really well done, and that's true of most of the relational database systems out there. But there are certainly ways to get a lot better performance if you focus on the main memory side.
And so we rewrote the existing code for the critical path, and we kept the rest, to keep the database and the product looking like customers expected it to look. So I think that's what I wanted to cover. And then I have a bunch of references. There's a great overview in SIGMOD 2013; it's pretty much this talk. There's a relatively new paper on how we do the compilation, in the IEEE Data Engineering Bulletin. The Bw-tree paper is in ICDE 2013; there's a good talk there too. The concurrency control mechanisms: you won't hear the word Hekaton or SQL Server inside this paper, because it was too early, but we were describing them back in VLDB 2011. David DeWitt gives a real fun keynote at our user group conference on how it all works. And these last ones are more customer oriented, but they give you a feel for how we expose this to customers, how we explain it, and how developers use it. All right, I should have paused more often to ask people for questions, but certainly I can do it now. Big nagging question on that: if you have transactions that go after the Hekaton memory-optimized tables and also the regular disk-based tables, what can they do? The biggest challenge is that there are restrictions on the isolation levels they can use, mainly because SQL Server already has a snapshot-based mechanism for its row store, but that's a pessimistic snapshot, versus the optimistic multi-version snapshot that Hekaton uses. So, for example, they can't use snapshot on both. They can say, I want Hekaton to be serializable and the regular row store to be snapshot, or vice versa, but they can't do snapshot-snapshot, which is one of those kind of frustrating things, you see. Typically, for better or worse, a lot of apps use the read committed isolation level in the regular SQL Server row store, and snapshot in Hekaton, which is Hekaton's default mode. Those work fine together and are perfectly compatible. And if you're doing statement-level transactions where you're joining across these two, a table in one engine and a table in the other, we make that look fine, even in read committed. So common application patterns work well, but the really desirable one, snapshot-snapshot, doesn't work, because of optimism versus pessimism. We're working on that. No, this feature is available in the high-end version of SQL Server, Enterprise Edition. It's also available in the Developer Edition, so you can develop with it. It's not available in Express right now. All right, I think it's time to thank Mike again. Thank you. Thank you.