All right, guys, thanks for coming. Today is the last day of our seven databases in seven weeks — which is sort of six databases; the FoundationDB one had to be canceled on us, but that's OK. But we're really excited today, because we have Ryan Betts, who's the CTO of VoltDB. Now, I know what you're thinking: oh, it's an executive guy, what is he going to be able to tell me? But hold on. Ryan is not your typical executive. He's not the stuffed-shirt guy, right? He was actually engineer number two at VoltDB in the very early days back in New England, around when they were turning the research system into what became VoltDB. But now he's moving on up the executive ladder — he's now the CTO, and he's been there for about two years. And he's here to give a retrospective of some of the things they've done at VoltDB dealing with real customer problems. So with that, thanks a lot, Ryan.

Thanks, Andy. Is this microphone on my shirt on? Should I use this one either way? That one's the mic that goes on the video; that one's for the room. OK, I have to stand real still and use this mic. All right, sounds good. So thank you, Andy. It's a pleasure to be here. Thanks for having us. The series looks really awesome — in fact, I wish I could have seen more of it. I know the videos are online, and I'm going to have to watch them at some point. We saved the best for last. Thank you. So as Andy said, my name is Ryan. I'm one of the original engineers on the VoltDB project. I'm going to talk a little bit about what VoltDB is, and then about some of the problems that we had to solve after we started shipping VoltDB to real customers — which is maybe a slightly different point of view from some of the talks that you've heard. We're not going to dig a ton into database theory, but I'll describe what makes Volt unique, and then we'll walk through what we had to change.

So first of all, what is VoltDB? Who here knows about the H-Store research? Anyone? All right. So VoltDB is the productized version of H-Store. H-Store was a Stonebraker paper that Andy and a bunch of others worked on, which set forth a specialization of a database for online transaction processing. (It's not really a big Stonebraker paper.) Yeah, it depends on how you look at it — it depends on who you ask. It turns out the paper actually was instantiated in code by a bunch of people who know neither Andy nor Mike, but that's OK. So what is H-Store? H-Store asked: if I want to specialize a database to be really good at transactions, what would I do? If you follow the Stonebraker history, it followed the Vertica / C-Store research and implementation, which asked the inverse question: if I want to be really good at OLAP, what do I do? I use columns, I use compression. So this is the opposite: if I want to be really, really fast at transactions, what do I do? So this is the result of that research: a SQL relational database, clustered from the beginning, fault tolerant. The core of VoltDB is open source — you can find it at github.com/VoltDB — in memory, highly parallel, used for operations and analytics. So let's take a step back. It's 2008. I should also mention that the slides I'm presenting are a mixture of my slides and slides from John Hugg, who's engineer number one at Volt and a great colleague, so I want to give him credit. He kind of laid out the topics here today, and I really like this talk.
It's 2008. The world is fresh and new. There's an election going on. No one thinks that in-memory databases are really the next big thing. In fact, in 2008, if I went and talked to someone and said, hey, I have this really awesome in-memory database for you, they'd look at me and say, well, what's going to happen when it crashes? No one says that anymore; the world is kind of a different place. Looking forward from the H-Store research, though, there were a couple of major trends that were pretty easy to forecast and that have come true in retrospect. The first is that multi-core is obviously the future — CPUs are going to get more and more parallel — and memory is going to get larger and cheaper, and you can exploit these trends to create a database that's extremely good at transactions, that's fully in memory, that's clustered, and that's cloud-friendly. Specifically, if you look at traditional row stores, they're not very cloud-friendly. They're scale-up systems. They require a lot of disk I/O, that disk I/O requires specialized hardware, and specialized hardware works against cloud deployment. They also have fault-tolerance mechanisms that aren't necessarily friendly to cloud deployment. So basically: we're going to put the database in memory, we're going to build a scale-out system, and we're going to make it really fast at transactions, as opposed to dataset-wide analytics.

If you look at this, the major qualm — the big question mark — wasn't cloud or multi-core or scale-out versus scale-up; the major question mark was all about in-memory. I can't tell you how many times I would go talk to someone about an in-memory database and they would really question that. There were a lot of questions around: how are you going to make it durable? How am I going to afford this? What workloads really fit in memory? You can still argue today about which workloads fit in memory and which don't, but no one has any prejudice off the top of their head against an in-memory system anymore. VoltDB is a fully durable system — I'm going to talk about our durability model. It has an incredibly rich durability model: synchronous durability on a per-transaction basis to disk, with multiple fsyncs across all replicas before a response goes back to a client. At the same time, all of the data is in memory, which is absolutely critical to the performance that we're trying to gain.

A second major qualification around VoltDB was external transaction control. If you look at modern online transaction processing applications, they're all generating work from other middleware systems, or from distributed sensor systems or machine-to-machine systems. They're not really driven by operators typing at a green screen. I'm not sure the term green screen has a lot of meaning here, but if you are older, you may have gone to book a plane reservation or something, and you'd actually visit someone in person and they would type at a computer — they were typing essentially database transactions to make a reservation for you. No one uses transaction processing databases that way anymore. Now you go to Kayak, and it automatically generates this transactional load. It's high-frequency. So the way that you interact with these database applications is really different. And if you consider the change in that interaction pattern, you can begin to reformulate how you manage the context, or the scope, of what a transaction is. So in VoltDB, we removed client-side transaction control.
You can't have a program that begins a transaction, has multiple conversational round trips back and forth with the database, and then decides to commit or roll back. One input from the application is exactly one transaction in Volt. So I would say that these were the two main stumbling blocks — the two key questions where we weren't sure the market would accept us. As I mentioned, in-memory has since been readily adopted by lots of other people; in fact, in-memory is kind of a hot new topic. And external transaction control is something that a lot of other people have also given up. Essentially all NoSQL systems have no external transaction control: one interaction with the database is effectively one transactional context, whatever that means — whatever level of atomicity they're giving you. And other systems too — for example, I believe you had a MemSQL talk. MemSQL has also given up external transaction control; they don't offer any commit or rollback control to a client or to an application. So this has become a common way to scale. This is a great paper, by the way — if you haven't read the H-Store paper, "The End of an Architectural Era" is a good summary of what the marketplace looked like and what the technology trends looked like.

So what did we build, coming from these assumptions and from that research? We built an in-memory transactional store. We decided to keep the limitation of no external transaction control, and we adopted a really interesting concurrency model, which I'm going to talk through. Basically, we made all access to data in VoltDB single-threaded. There are no concurrent index or table structures in Volt; it's all single-threaded access. That's one of the key insights about Volt. So let's talk through what that looks like. I like to explain Volt to technical audiences this way — I think it works out pretty well. Volt is remarkably simple, so I need you all to clear your minds of complexity. It's kind of a Zen moment we're going to have. Picture that I give you a simple task. I say: I want you to write a program that can execute as many relatively short-running SQL statements in a period of time as you possibly can. Don't worry about anything else — the only challenge is how many SQL statements you can run in a unit of time. You might come up with a pretty simple program that's just a very tight loop evaluating prepared statements. At the top of this loop, you generate some parameters for a prepared statement, and over the course of the loop, you evaluate all of the SQL necessary to complete that prepared statement. So: a single-threaded loop running prepared statements. This is actually the core of VoltDB. Of course, that's not really enough to make a database. I only have one loop — one thread doing this. What happens? How do I utilize the rest of the cores in my machine? How do I scale this horizontally? In VoltDB, it's really simple: we partition data across multiples of these loops. We take your table and we shard it row-wise across a bunch of these different SQL executors. If anyone's familiar with NoSQL systems — they're all distributed hash table systems that hash values or column families or whatever by a key — VoltDB also hashes and distributes data using a consistent hashing algorithm in the same way, but we distribute rows by the value in a particular column.
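To make that loop-plus-routing picture concrete, here is a minimal toy sketch — my illustration, not Volt's actual internals — of a single-threaded executor that exclusively owns one partition's data, fed by hashing the partitioning column's value:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy sketch of the loop described above. One single-threaded executor owns
// one partition of the data outright; commands reach it by hashing the value
// of the partitioning column.
public class TinyExecutor implements Runnable {
    // A "command" is just a prepared-statement name plus its parameters.
    public record Command(String procName, long partitionKey, Object[] params) {}

    private final Queue<Command> inbox = new ConcurrentLinkedQueue<>();
    private final Map<Long, Object[]> rows = new HashMap<>(); // this partition's slice

    public void submit(Command c) { inbox.add(c); }

    @Override
    public void run() {
        // The tight loop: pull one command, run all of its SQL, repeat.
        // No locks, no latches -- this thread is the only one that ever
        // touches `rows`, so data access needs no coordination at all.
        while (!Thread.currentThread().isInterrupted()) {
            Command c = inbox.poll();
            if (c != null) {
                rows.put(c.partitionKey(), c.params()); // stand-in for real SQL work
            }
        }
    }

    // The "command router": hash the partitioning value to pick an executor.
    public static TinyExecutor route(TinyExecutor[] partitions, long partitionKey) {
        return partitions[Math.floorMod(Long.hashCode(partitionKey), partitions.length)];
    }
}
```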
And so we take a table and we shard it across a bunch of these different SQL executors, and each executor is in a tight loop, just pulling a SQL statement and running it. Now that we have a sharded system, you have to be able to route work to the appropriate shard, so there's a very simple component I'll call a command router. It looks at an incoming command and says which particular shard — we use the word partition — has the data relevant to this command, and it's going to run it there. And then of course you need to make this durable, because at the moment this is all in memory. How do you make it durable? You produce a journal — a log, essentially — of the incoming commands on a per-shard basis. Once you do this, you can begin to scale out the ability to run millions of SQL statements per second in this eval loop. You can add multiples of these evaluators to scale across a server, and you can continue the scaling pattern across multiple servers. You can replicate these little SQL eval-loop boxes, which each hold some data, for fault tolerance; and to achieve that replication, you tee the incoming commands at the command router off to the individual SQL evaluators. So when a VoltDB cluster is running, at the heart of that cluster there's basically a single-threaded SQL evaluator which has exclusive access to the table and index data that represent some shard of the database. It's receiving incoming commands — individual SQL statements, or requests to run a stored procedure from an external application. Those commands are being routed to the appropriate partitions, and the commands are being teed to all of the replicas of that partition. Because everything in VoltDB is strictly serializable and single-threaded, and because we enforce determinism in our SQL planner, you can run all of these commands independently, in the same order, and end up at the same deterministic result: I start from a known starting state, I apply a deterministic function, I reach a deterministic result. So this is basically how VoltDB works. This is the core of our concurrency model, and this concurrency model is the thing that really separates VoltDB from other database systems.

So what's the result? The result is that I can build a three-node Volt cluster that can operate at roughly 100,000 transactions per second on a per-node basis. I can replicate that in an active-active, synchronous fashion across the cluster, with round-trip response times back to a client at 99th or 99.9th percentile latencies of about one to two milliseconds. Another way to look at this is that every Volt node can saturate one or two gigabit Ethernet ports with incoming stored procedure invocations — relatively small pieces of information: here are 20 or 30 parameters, and the name of the stored procedure to run with those particular parameters. It can saturate gigabit Ethernet on the inbound side; it can execute those in a fully synchronous, fully durable, serializable, ACID fashion; and it can generate responses back to an application in milliseconds. We've also — and I'm going to talk about this a little bit later — built in an export function, so that we can connect these incoming streams to a downstream system like an OLAP system. So you can put a transactional system that's running at streaming speeds in front of a high-speed data source.
You can transact against the incoming events using our ACID SQL model, and you can take that transformed data — or the data you've transacted against — and push it downstream, in a parallelized fashion across the cluster, to a downstream target for long-term persistence or other OLAP workloads. (Yeah, we do. There's a mechanism to roll back transactions that involve multiple nodes. That's correct.) So there are really two core transaction types. I'll repeat the question: since we're sharding a table's data, how do we maintain transaction control across the shards? There are two main transaction types in VoltDB, and we call them single-partition transactions and multi-partition transactions. The single-partition transactions are the ones that scale to hundreds of thousands or millions of transactions per second on a cluster. Those are distributed, as I mentioned, by essentially teeing the incoming command to all of the replicas, and they'll reach the same deterministic result at each node — they share the same fate at each place. We have various consistency checks built into the system: essentially, any time we pass a SQL write or any kind of mutation to the SQL executors, we hash the inputs to that mutation at each site, we communicate those hashes back to a transaction coordinator, and we compare those hashes before we respond back to a client. So we make sure that, for example, someone can't do something non-deterministic in SQL. The most popular non-deterministic thing people try in Volt is something like SELECT * LIMIT 3 — which is basically "give me three random tuples from the database" — and then they try to produce a write based upon those tuples. SELECT * LIMIT 3, let me delete the second one: that's really not deterministic. So we have consistency checks that catch that at a number of different tiers in the system. The planner will catch those and issue determinism warnings, and at execution time we'll catch it and do a hard stop of the system if it manages to slip through. So the single-partition transactions all share the same fate through this determinism scheme.

The multi-partition transactions aren't nearly as sexy: they basically get two-phase committed across the cluster. What's interesting is that within the multi-partition camp, it's really important to break them up into read-only versus read-write transactions. In Volt, really the only reason you would do a read-write transaction in a multi-partition fashion is to update replicated data. The large fact tables in Volt are all sharded across the cluster, but we also support replicated tables — dimension tables — that are complete at every partition instance, every shard. To update those replicated tables in a consistent way, you use a multi-partition write transaction. So the main use of multi-partition, or global, writes in VoltDB is to consistently maintain or update dimension data, which changes at a rate that's orders of magnitude slower than the incoming event stream that's being processed. You might update your metadata ten times a second at most in a network of devices — you're updating device state or installing new devices; you're maintaining a library of installed devices.
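Coming back to the determinism point for a moment, here's what the trap and its fix might look like as a Volt-style Java stored procedure. The table and column names are invented and I'm writing the API from memory, so treat it as a sketch:

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Sketch of the non-determinism trap described above, using Volt's Java
// stored-procedure API; the schema is invented for illustration.
public class TrimOldSessions extends VoltProcedure {

    // BAD: without an ORDER BY, "LIMIT 3" returns three arbitrary rows, so
    // each replica could pick different rows. The planner flags this, and
    // the runtime hash checks would catch a divergent write.
    public final SQLStmt pickAnyThree =
        new SQLStmt("SELECT session_id FROM sessions LIMIT 3;");

    // GOOD: a total order over a unique key makes every replica pick the
    // same rows, so the subsequent delete is deterministic.
    public final SQLStmt pickOldestThree =
        new SQLStmt("SELECT session_id FROM sessions ORDER BY session_id LIMIT 3;");

    public final SQLStmt deleteOne =
        new SQLStmt("DELETE FROM sessions WHERE session_id = ?;");

    // partitionKey exists only so the router can direct this invocation.
    public VoltTable[] run(long partitionKey) {
        voltQueueSQL(pickOldestThree);
        VoltTable rows = voltExecuteSQL()[0];
        while (rows.advanceRow()) {
            voltQueueSQL(deleteOne, rows.getLong(0));
        }
        return voltExecuteSQL(true);
    }
}
```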
That library of installed devices changes much less rapidly than, for example, reading every-15-minutes smart-grid meter readings from 53 million meters. In that case, every 15 minutes you're getting 53 million reads or writes — that's a whole other order of magnitude. So you design your partitioning scheme in Volt to scale the single-partition workload — to make that incoming event stream single-partition — and you use multi-partition writes mostly for updating dimension data.

Multi-partition reads, however, are a slightly different story, and I'll talk to that now. One thing that VoltDB does that's really cool is that it maintains in-memory materialized views that are transactionally consistent. As you change data, you can have a view derived off of a table, and we'll maintain essentially streaming aggregates of the state of that table in the view. What we're doing there is amortizing the cost of maintaining those aggregates across the ingest workload — the single-partition incoming workload, which we can scale horizontally — and that makes reading global state essentially a constant-cost operation. So we have these one-shot multi-partition transactions that just do a distributed read, usually from a view, to pull back some grouping; then a top-level executor regroups, re-sorts, and returns that result back to the user. A really common VoltDB application pattern is to have some fast event source — let's keep using the smart-grid meter example; all of the smart meters in the UK over the next few years will be pushing data through a system that uses Volt. That's all single-partition writes, which scale horizontally. At each partition I could be maintaining a running aggregate of some state on that table of meter readings, and then I could be using global, distributed multi-partition reads to summarize that state across the cluster. I can handle millions of incoming writes per second, and I can handle five to ten thousand of the distributed reads per second. So you have the analytics workload at a dashboard, or human, speed, versus the write workload, which is happening at a machine, or device, speed.

There's a second question, somewhat related: you have to pick the column you shard by, and there are going to be queries that come in that don't identify that column — do you just have to broadcast them to all the shards? They get broadcast. But — the question is, how do you route queries to data? In VoltDB, the main interface to the database is through Java stored procedures; a Volt stored procedure is Java for business logic and SQL for data access. When you define the procedure, you include some metadata that identifies which parameter of the procedure corresponds to a partitioning column in the database, and that becomes the input to the hashing function that routes the procedure invocation. If you don't identify that parameter, then it becomes a multi-partition query, which is what I was talking about. In many cases people will, to some extent, pre-join data in Volt: they partition multiple tables on the same key, so that related data is directed to the same partition.
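Schematically, that co-partitioning plus routing metadata might look like the sketch below. The schema is invented, and the @ProcInfo annotation is the older way of declaring the partitioning parameter (newer versions do this in DDL instead), so take it as illustrative rather than canonical:

```java
import org.voltdb.ProcInfo;
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Assumed (invented) schema, with both tables partitioned on the same column
// so related rows land in the same partition ("pre-joined"):
//
//   CREATE TABLE devices (device_id BIGINT NOT NULL, meta VARCHAR(256));
//   CREATE TABLE events  (device_id BIGINT NOT NULL, payload VARCHAR(256));
//   PARTITION TABLE devices ON COLUMN device_id;
//   PARTITION TABLE events  ON COLUMN device_id;
//
// The annotation tells Volt that parameter 0 is the partitioning value, so
// the command router can hash it and send this invocation to one partition.
@ProcInfo(partitionInfo = "EVENTS.DEVICE_ID: 0", singlePartition = true)
public class RecordEvent extends VoltProcedure {
    public final SQLStmt readMeta =
        new SQLStmt("SELECT meta FROM devices WHERE device_id = ?;");
    public final SQLStmt insertEvent =
        new SQLStmt("INSERT INTO events (device_id, payload) VALUES (?, ?);");

    public VoltTable[] run(long deviceId, String payload) {
        // Both statements touch only this partition's slice of the data,
        // because both tables are partitioned on device_id.
        voltQueueSQL(readMeta, deviceId);
        voltQueueSQL(insertEvent, deviceId, payload);
        return voltExecuteSQL(true);
    }
}
```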
So, for example, let's say that my application is wifi access points in a shopping center. As I walk through the shopping center, everyone's phone is announcing itself to those access points. You can track people's movements through the shopping center by tracking their phones' associations with those access points. So you have really two main tables of information in this system: one table that's the metadata about the locations of all of those different access points, and then essentially a growing fact table of all the associations that phones have made to those points. In VoltDB we would probably partition that by putting, for example, the access point's device ID in both tables, and then we would guarantee that all of the data for any particular device was always within the same partition. We would essentially pre-join that data.

So how do these things work out? In-memory: pretty good. No one complains about it anymore, and I'd say it's sort of a done deal. There are other, faster flash and non-volatile memory technologies coming to market, which I think will only further solidify the value of in-memory systems. No external transaction control: I think this is actually one of the hardest things about Volt. There are two things in Volt that really restrict it to particular use cases, or make it unsuitable for certain others — external transaction control is one, and the single-threaded execution model is the other. By giving up external transaction control, you basically give up large portions of the traditional business application market, which is built on traditional row stores. Anyone that has a traditional business application using, say, an ORM cannot just magically port it to Volt, because that application expects to have external transaction control. So I'd say that's been one major drawback of giving it up. However, in the applications where Volt really shines — high-speed transacting against almost unlimited-velocity inputs — this is a huge win, because in all of those cases people compose systems more as a set of services, and our stored procedure set is essentially a transactional API to data. You can look at a VoltDB database in many ways as providing a service: you define some set of stored procedures, those procedures are the operations — the API — of your service, and every invocation of a stored procedure runs some business logic, with the full capabilities of an ACID SQL system behind it, and returns a result. So the system begins to look a lot like a service, an SOA-oriented middleware piece, and that's worked out very well for us.

So, single-threaded and parallel. This is, I think, the other major thing about Volt that really positions it for and against certain applications. Being single-threaded is awesome at high-throughput, short-running workloads. You simply can't beat this; it's the most efficient way to use your CPU. The CPU never stalls on a lock, you avoid all kinds of cache-coherency pain, and there's no deadlock possibility. The amount of overhead that's stripped out of the system — both at the physical execution layer, in the management of caches and memory coherence, and at the database transaction management layer — is huge when you move to the single-threaded model, and it enables a really incredible amount of throughput.
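Coming back for a second to what "no external transaction control" means for an application: there is no begin/commit API at all. A minimal client sketch — host and procedure names are placeholders — looks roughly like this:

```java
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

// Each procedure call IS one transaction; there is nothing to begin or
// commit from the client side. Host name and procedure are placeholders.
public class CallExample {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("volt-node-1");

        // One round trip, one ACID transaction, one response. If the
        // procedure throws, the whole invocation rolls back; there is no
        // conversational window to hold open.
        ClientResponse resp = client.callProcedure("RecordEvent", 42L, "hello");
        if (resp.getStatus() == ClientResponse.SUCCESS) {
            System.out.println("committed");
        }
        client.close();
    }
}
```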
However, it also means that if I begin an operation that runs for a long time, I fully consume that partition. It's not fair. It's fast, but it's not fair. If I have a long-running query, that query gets exclusive access to the CPU that manages the data it needs, and all of the other short-running queries behind it will queue. This makes Volt particularly well-suited to places where you have really high-velocity inputs with really fast, repeated application requests, and poorly suited to workloads that mix longer-running, OLAP-style queries — where you're manipulating or joining or scanning or filtering large portions of the database — with really short-lived transactions. However, I have to say, it's an absolute joy to work on single-threaded C++ code. It actually makes development relatively fast in many ways.

All nodes the same. Like I mentioned, my colleague John Hugg made these slides. John loves the fact that everything in Volt is purely peer-to-peer, shared-nothing state. I have slightly different feelings about this, but in the end I don't think it matters. It turns out that if you're selling people relatively small clusters of three to five nodes, having all of the operations of the cluster fully contained in that node set is really nice. If you want to run Volt, you download a tarball, untar it on your nodes, and you have basically successfully installed Volt — unlike other NoSQL systems, where maybe you need to run ZooKeeper and a Hadoop cluster and a management cluster and a Nimbus node, or a NameNode and DataNodes; all these different functions make things harder. At the same time, it turns out that we combine into one server both the storage and SQL execution responsibilities and the transaction management responsibilities, and those two functions don't really scale on the same axes: you don't need as many nodes to do transaction management as you need to store data. So I think if you were to build an extremely large system — if you wanted to target clusters of hundreds or a thousand nodes — I would probably build a system with split responsibilities: some nodes responsible for transactions and some nodes responsible for data. However, the core Volt use case tends to be clusters between three and 20 nodes in size, and being able to hide all of the complexity of different node types from users turns out to be a nice commercial advantage.

Performance above all else. This is kind of another way of restating the single-threaded and no-external-transaction-control conversation. I think this is actually pretty great. I have always looked for opportunities to make things that are really different from everything else in the market — differentiation is absolutely king — and VoltDB is unbelievably differentiated on its ability to perform transactions. I think all of the value in Volt comes from the fact that we put throughput and performance first. You can use Volt to solve problems that you can't solve with any other system, because of the trade-offs that we made for performance. At the same time, we've given up the ability to be general-purpose — but that fits my personality.

Safety through active-active. This is a huge win for Volt. Volt has a synchronous, active-active high availability and fault tolerance model that's really remarkably fast and remarkably efficient.
However, no one saw the value of this until we made the system durable to disk. In the original H-Store research, it was proposed that you could run in-memory systems with in-memory replication, and that would be sufficient — you wouldn't need to make data durable to disk. It turns out that doesn't pass muster with anyone. So one of the very first things we had to change in Volt was adding a durability log, which turned out to be very low overhead. But once we did that, people really became able to appreciate the replication model. And now fault tolerance and high availability are just required things.

SQL support. When you go to build a new commercial database, you sort of have a problem: there are two, maybe three, avenues you're trying to pursue simultaneously in order to build a commercial product. You're trying to build more SQL support, which takes a lot of time and development effort — other people have been at it for 20 years, especially in the row-store space. You're trying to build an efficient distributed system that's reliable and fault-tolerant, highly available, with nice performance and latency guarantees, in cloud and other unknown environments. And you're trying to build a commercial entity around some particular use case. So when you release to the public, it's a question of how much SQL you need to have. When we shipped VoltDB, the SQL it supported was essentially basic CRUD operations and some basic filtering, and we've been building more and more SQL since. A really interesting aspect, if you build databases for a living: if you tell people you've built a transactional row store and it's SQL-based, they will expect massive amounts of SQL. If you tell people, hey, I built this key-value store and it just stores documents, they will have a much more limited desire for SQL. So there's an expectation-setting paradox here. VoltDB has always been more interactive and more expressive than those NoSQL systems — we distribute data and shard and have fault tolerance similar to them — and at the same time, we've often suffered criticism over SQL support. However, five or six years down the road, our SQL support is much, much richer than it was when we originally launched.

So what did we have to add from that basis? I want to walk through a little bit of what it looks like not just to build an academic project, but to actually build a product. The first thing we had to add was durability. No one had any interest whatsoever in a memory-only system — it just wasn't interesting to people. So we built a durability mechanism into Volt. It turns out that the way we do durability is quite clever and works out really well for us: we don't journal changes to tuples. Instead, we log the incoming commands. You have an application that's sending SQL statements or stored procedure invocations to the database, and we write down those incoming commands. That means you can estimate how fast this might be by matching the I/O throughput of the network to the I/O throughput of the disk: basically, as much data as you can send over the network — at least at gigabit Ethernet speeds — you can write to disk. So as we receive the incoming transaction stream, the stored procedure stream, from the network, we're basically writing it straight to disk.
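As a rough sketch of that command-logging idea — log the invocations, not the tuple changes, and group-commit them with a single fsync — something like the following, with an invented format that is certainly not Volt's actual on-disk layout:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Toy command log: append serialized procedure invocations, then
// group-commit them so many transactions share one fsync.
public class CommandLog {
    private final FileChannel channel;
    private final List<ByteBuffer> pending = new ArrayList<>();

    public CommandLog(Path path) throws IOException {
        channel = FileChannel.open(path,
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // A serialized invocation is the procedure name plus parameters -- a few
    // hundred bytes, even if the transaction it describes mutates half the
    // database.
    public synchronized void append(byte[] serializedInvocation) {
        pending.add(ByteBuffer.wrap(serializedInvocation));
    }

    // Group commit: write everything queued since the last sync, then pay
    // for a single fsync. In a synchronous-durability mode, callers waiting
    // on durability would be released only after force() returns; an async
    // mode would ack first and leave a bounded tail at risk.
    public synchronized void groupCommit() throws IOException {
        for (ByteBuffer b : pending) {
            channel.write(b);
        }
        pending.clear();
        channel.force(false); // fsync the data (not file metadata) to disk
    }
}
```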
And then, to truncate that log periodically, we can create a copy-on-write snapshot of the entire cluster state in a transactional way. Yes? You just log the mutating commands to disk, then? Right, just the mutating commands. Do you return to the client before you log those commands? That's configurable. First, the commands are being logged at every replica, so there are multiple copies of the log being made. And you can run VoltDB in either synchronous or asynchronous durability modes. In synchronous mode, we will wait until there's an fsync acknowledged by the operating system at all replicas — so the data is presumably on disk, if your disk caches are configured correctly. In asynchronous mode, you can set a desired time window, and we'll leave some tail at risk. It turns out that different modes are valuable to different applications. So in synchronous mode, do I understand that things run at the speed of the disk logging, essentially? In terms of latency, yes. These things are all group-committed to disk: you batch a bunch of them together and do one write. So you pay a latency penalty much more than a throughput penalty — you can achieve very high throughput while syncing to disk. At the same time, you're not writing a lot of data to disk: you're basically writing one input record for every transaction. That transaction might change vast amounts of data — it might delete half the database; it's kind of a weird example, but it could — and you're still only going to write down a few hundred bytes. So a key-value update workload isn't going to achieve memory-speed latency? No, right, right. You get away with this because of the determinism? Right, yeah — that plays well with the determinism in the system. We can produce these logs in these independent places, and replaying that log is basically just like receiving the commands from the application in the first place. That's correct, yeah. You don't apply any kind of compression to reduce the disk writes? Yeah, we pass all the logs through a compressor — I forget the algorithm that's currently in there; Snappy, I think. Which I think makes a lot of sense.

The other thing that we added was elastic clustering — the ability to expand a cluster on the fly. VoltDB actually does this in a really cool way. If you add a node to a Volt cluster, we build an index at each node of the partitioning keys, and then we adjust the consistent-hash ring to slowly trickle tuples over — deleting them from a source partition and inserting them into a destination partition in a fully transactional way. So you can actually rebalance new or old data onto new nodes fully transactionally, as a background operation.

There's another really interesting feature that we added, and I'm going to talk to it a little bit here — it's a little bit missing from this presentation, but it's absolutely core to Volt — and that's something we call Volt export. Volt was originally incubated within Vertica; in fact, all of the original Volt employees worked for Vertica for about a year. And of course, Vertica is this OLAP system and Volt is this OLTP system, and it would make sense to try to combine them in some way, so you can get best-in-class analytics against large amounts of data and best-in-class transactional capability against fast incoming data streams. How do you couple these two systems together appropriately?
And Volt export came out of that desire. So let me describe a little bit about how Volt export works, and then at the end of the presentation I'll talk about the things that people build with it. In a VoltDB schema, you can declare an export table. Any data that you insert into that table essentially enters a queue, and a connector framework can push the data in that queue to a downstream system like HDFS, an OLAP system like Vertica, or a distributed queue like Kafka or RabbitMQ. So what does this mean? It means you can use VoltDB to transact against incoming data. You could, for example, build five-second session windows out of clickstream data off of a web application: every five seconds, you bundle up some summary of that information, you use a SQL INSERT to write it to an export table, and Volt will, in the background, push that in a streaming fashion — essentially continuous trickle-loading — to HDFS or to an OLAP system. You can use VoltDB this way to bring in a hundred-odd megabytes of data on the front interface of the system, transact against it at 100,000 transactions per second in the middle, and stream out another hundred megabytes or so of data at the back end. So you can start building real-time ETL capabilities off of this. In fact, when you look at really fast OLTP problems, they start to look a lot like streaming problems, because you're interacting with an unending stream of data and trying to transact against each individual event. You naturally need some way to flow that incoming stream out of your transaction-processing system and into the system you use for long-term analytics, exploration, and other data science activities — and VoltDB export provides that conduit.

We've also added JSON capabilities to Volt. You can store JSON data in a VARCHAR field; in fact, you can index into JSON data using a SQL function in Volt as part of an index declaration. And we use materialized views for a lot of analytics cases. I've talked about some of these performance numbers — it's really hard to overstate how surprisingly fast Volt is on hardware. One of the very first things we did with Volt: someone came to us and said, I have this tiny data set, but it changes really rapidly, and Oracle can only change this data set maybe a thousand times a second. It turns out that even back in 2009, a crappy laptop running Volt could change that data set 30,000 times per second. Once you give up all of the locking and latching, and once you dedicate yourself to the single-threaded processing model, you get extreme efficiency out of your CPU and you can change data at an amazingly fast rate. It turns out that reads and writes have roughly the same performance profile; however, we do load-balance reads across all the replicas in the system, so you don't need to involve every replica to compute the result of a read. So while reads and writes have about the same latency, you can do two or three times the number of reads as writes, depending upon your replication factor. I'm just gonna go a little bit faster here, just kind of watching the time.
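Since export keeps coming up, here's roughly what the pattern looks like from the application side. The schema and the EXPORT TABLE declaration are illustrative — the exact DDL syntax has changed across Volt versions:

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Sketch of the export pattern. Assumed (invented) DDL:
//
//   CREATE TABLE clicks_out (user_id BIGINT, window_start TIMESTAMP,
//                            click_count BIGINT);
//   EXPORT TABLE clicks_out;
//
// An INSERT into an export table enqueues the row for the downstream
// connector (HDFS, Vertica, Kafka, ...) instead of storing it in memory.
public class FlushSessionWindow extends VoltProcedure {
    public final SQLStmt summarize = new SQLStmt(
        "SELECT COUNT(*) FROM clicks WHERE user_id = ?;");
    public final SQLStmt pushDownstream = new SQLStmt(
        "INSERT INTO clicks_out (user_id, window_start, click_count) " +
        "VALUES (?, ?, ?);");

    public long run(long userId, long windowStart) {
        voltQueueSQL(summarize, userId);
        VoltTable t = voltExecuteSQL()[0];
        long count = t.asScalarLong();

        // Same transaction: summarizing the window and enqueueing the
        // summary for export commit (or roll back) together.
        voltQueueSQL(pushDownstream, userId, windowStart, count);
        voltExecuteSQL(true);
        return count;
    }
}
```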
So what could possibly go wrong? I mean, it's an incredibly simple system: everything's in memory, there are no disks, all the I/O problems are gone, there are no concurrent structures that need to be managed. Should be good to go, right? It turns out that the first challenge we faced isn't really surprising to anyone who's written a lot of C++ servers: memory fragmentation totally sucks in production systems. And even though we had put a little bit of effort into this in Volt, it turns out it wasn't nearly enough. The first person who tried to run Volt in a production environment over the course of a week would always just OOM the system on memory fragmentation, and you could actually create really evil workloads in Volt that would OOM the system in a matter of minutes, once you understood some of the flaws in the memory management design. So our first major challenge was to come back and rework the storage model. This actually took a fair amount of time, and we approached it in two different ways. Oh — the fun part of this slide is actually not visible. Another thing about building a commercial system with a group of engineers: as you may know, you probably have more than one opinion. On a number of these challenges, there really wasn't a lot of agreement in the team on how to solve them. It was kind of a stressful situation: there are multiple design choices, and picking the right one is hard. This is one example where we basically did two different things. We decided to manage fragmentation in our table storage using a compaction algorithm: we allocate data in two-megabyte blocks, you can specify an occupancy percentage for each block, and if a block falls below that, we look for blocks we can merge. On the other hand, the team fixing memory fragmentation for indexes took a hole-filling approach: if you remove an item from an index block, they take the item that was last in the block and fill the hole that was just created. This works out really well in Volt because all rows in Volt are fixed size — we use out-of-band storage for varchar data, which is also pooled — so hole-filling works well. Who has a favorite approach? If you were faced with this problem, would you pick the tuple method or the index method? Come on, raise your hands if you like the index method. All right, you turn out to be the winners in this particular design challenge. The index method, while it has some penalty — you move more data on large deletes — wins, because compaction creates these surprising latency moments, and in this kind of production system, consistent latency is absolutely key. Once you enter a compaction event, all of a sudden you can introduce hundreds of milliseconds of latency; it comes as a surprise, and that matters. And the overhead of the hole-filling scheme scales away horizontally: it might make writing data a little bit more expensive, but that totally scales — I can just add another node to a three-node system and get back another 30% of capacity. So it turns out indexing wins here.
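The hole-filling scheme is easy to picture with a toy example. Because entries are fixed size, a delete is repaired by moving the last entry into the hole — one move, no compaction pass; this is my simplification, not Volt's index code:

```java
import java.util.Arrays;

// Toy version of hole filling: fixed-size entries mean a delete can be
// repaired by moving the last entry into the hole. No compaction pass,
// no surprise latency spike -- just one move per delete.
public class HoleFillingArray {
    private final long[] slots;  // fixed-size entries (stand-ins for index nodes)
    private int count;

    public HoleFillingArray(int capacity) {
        slots = new long[capacity];
    }

    public void add(long value) {
        slots[count++] = value;
    }

    // Delete by index: O(1), moving exactly one entry regardless of how much
    // has been deleted before. The trade-off vs. block compaction is paying
    // a little on every delete instead of a lot, occasionally.
    public void removeAt(int i) {
        slots[i] = slots[count - 1]; // fill the hole with the last entry
        count--;
    }

    public long[] snapshot() {
        return Arrays.copyOf(slots, count);
    }
}
```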
Crisis two: latency. The original system in Volt managed transaction control by grouping transactions into time-bounded windows. I have all these different SQL executors, and I have transaction initiators at the top of the system — every node has a transaction initiator. The way the system worked originally is that at every SQL executor eval loop, you would wait until you had heard from every transaction initiator in a time window. This is all communicated over TCP/IP, so once you had done that, you knew you'd seen all the ordered messages up to that point in the past; you could then merge them into a timeline order and execute that merged window. This works really well if you're providing really high-volume, distributed, asynchronous load to the cluster. No one does that — especially not when they're first trying the system. If you try a system for the first time, what's the program you write? You configure a three-node cluster, you write a single-threaded Python application, you point it at one node, and you say: just synchronously execute these transactions. It turns out that when you do that, you have to force a heartbeating algorithm to close the window, because you're only hearing from one initiator naturally — you're not getting any information from the other initiators until you receive a heartbeat. We heartbeated the system every five milliseconds. That meant your average single-transaction round-trip latency was about two and a half to three milliseconds, and if you had a single synchronous client sending a transaction and waiting for a response, sending a transaction and waiting for a response, you would get about one transaction every three milliseconds, which is not a lot. That was the first major problem. The second major problem: since that heartbeating turned out to be really important, as you scaled systems horizontally, you were doing N-squared heartbeating — every node needs to heartbeat every other node in the system. By the time you build a 20-node cluster, you would spend about 20 to 30% of your CPU managing the freaking heartbeats. It's pretty horrible.

All right, so what you couldn't see on the index slide is that it says "John was right." That actually seriously hurts me to say out loud, especially when I know I'm being recorded. What it says at the bottom of this slide is "Ryan was right" — I got this one right. And at the bottom of the durability slide: another of our absolutely critical team members, Ariel, had called the durability mechanism design, and he was completely right in that regard, despite resistance from some of the rest of us. So we moved VoltDB's transaction initiation scheme to a much simpler scheme that didn't involve time windows or trains. It works a lot like — who here has heard of the Raft consensus protocol? Anyone? You should totally read the Raft paper; it's awesome. It's kind of a simpler way to get the same result as Paxos. And it turns out that we shipped an implementation of this about a year before the paper was written. So we redesigned our transaction ordering system, and it works a lot like Raft — we never wrote a paper on what we did, so I just tell people to read the Raft paper, because it's kind of similar. Essentially, we designated one transaction leader for each partition — for each replica set — and it just communicates transaction order to its downstream replicas. And this completely cured a lot of the latency problems.
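A toy sketch of that per-partition leader idea — one leader stamps a total order on commands and streams them to deterministic replicas. This is my simplification, not Volt's implementation:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// One designated leader per partition assigns a total order and forwards.
// No time windows and no all-to-all heartbeats just to close a window --
// an idle partition simply has nothing to order.
public class PartitionLeader {
    private final AtomicLong nextSeq = new AtomicLong();
    private final List<Replica> replicas;

    public PartitionLeader(List<Replica> replicas) {
        this.replicas = replicas;
    }

    // Every command for this partition flows through the leader, which
    // assigns the next sequence number and forwards it. Replicas apply
    // strictly in sequence order and, being deterministic, agree on state.
    public void submit(byte[] command) {
        long seq = nextSeq.getAndIncrement();
        for (Replica r : replicas) {
            r.apply(seq, command);
        }
    }

    public interface Replica {
        void apply(long sequence, byte[] command);
    }
}
```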
The other latency work that was done was on some of the internal operations of the system. Some of the critical use cases for VoltDB involve authorizing mobile phone calls: when you place a call on a prepaid phone, the network needs to decide on an ongoing basis whether you have a balance that should permit that call or not. It turns out that if you want to sell these mobile subscriber services into non-first-world markets, people want to virtualize for cost, and that necessitates a really fast transactional database. It also means you need five-nines latencies under 50 milliseconds: 99.999% of your transactions need to complete in under 50 milliseconds. There was a ton of work that went into Java garbage collection and other engineering work to make that the case.

So finally, scale-out. Even having removed all of this lock contention, scale-out in Volt is still not perfect. Volt scales really well up to maybe eight to 16 cores. After that, you actually do better to start more processes and segment at the operating-system level across those cores. This turns out to be a little bit tricky. You eventually hit bottlenecks in transaction management and some I/O management layers — network interrupt scaling, some other things like this. So it's not so much a problem with the concurrency model of access to data as it is with the networking layers. It's something we haven't fully solved yet. I think the easiest solution is probably to let people partition their machines into multiple database processes, and to provide tooling to do correct replica placement across that partitioning. (You've still got 15 minutes — don't feel like you have to rush.) No, I'm okay.

So let me talk a little bit about space efficiency in Volt. This comes back to the storage layer. When you write a row store, you can't really compress the data, because you want easy access to tuples: you don't want to have to reconstitute tuples, or make it difficult to find the location of a tuple in memory, by decompressing a lot of data. So how does VoltDB reduce the footprint? We moved to a model where all rows are static length, and if you have varchar fields that are more than 64 bytes, they're stored out of line in pools. And we've gone to a lot of work to make indexing overhead and other structures in the system as small as possible. This actually works in Volt's favor in another way: deserializing tuples in Volt is really easy because they're fixed length, and reading ahead a certain number of tuples in a block is nice because they're fixed length. It allows us to be relatively efficient with memory — but what we don't do is in-memory compression.

A lot of people who run production systems care a lot about write amplification these days, so it's interesting to look at what data Volt needs to write to disk to maintain durability versus what, for example, an LSM — a log-structured merge-tree — system might need to write. For VoltDB to maintain durability, every incoming write transaction needs to be written to disk. Then, to truncate that log periodically, we put the system into a copy-on-write mode, and in the background we flush the copy — we call that a snapshot — to disk. That snapshot truncates the log. To restore the system to a durable state after a catastrophic failure, you first replay the snapshot and then you replay the log of commands. So the amount of writing you have to do here varies largely with how often you have to truncate the log. At the moment, we always create a full snapshot; we don't do any kind of incremental snapshotting.
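The sequencing matters more than the mechanics here, so here's a deliberately naive sketch: snapshot first (copy-on-write in the real system; a plain copy here), and only then discard the command log the snapshot covers:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Toy snapshot/truncate cycle. A real system would flip the storage layer
// into copy-on-write mode and stream blocks in the background; here the
// "snapshot" is just a full copy, to show the ordering of the steps.
public class SnapshotCycle {
    private final Map<Long, String> table = new HashMap<>(); // in-memory state
    private long lastDurableSeq;                             // command-log position

    public void apply(long seq, long key, String value) {
        table.put(key, value);
        lastDurableSeq = seq;
    }

    public void snapshotAndTruncate(Path snapshotFile, Path commandLog)
            throws IOException {
        // 1. Capture a transactionally consistent image (copy-on-write in
        //    the real thing) along with the log position it covers.
        Map<Long, String> image = new HashMap<>(table);
        long coveredSeq = lastDurableSeq;
        Files.writeString(snapshotFile, coveredSeq + ":" + image);

        // 2. Only once the snapshot is safely on disk can the command log up
        //    to coveredSeq be discarded. Recovery = load the snapshot, then
        //    replay any log entries after coveredSeq.
        Files.deleteIfExists(commandLog);
    }
}
```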
That full-snapshot approach works out fine for databases that are maybe a terabyte in size, spread across a lot of nodes, where the data is changing really rapidly anyway. It's a little less suitable if you have very large databases with really big hot/cold splits. If you have 10 to 20 terabytes of data spread across your cluster and 80 to 90% of that data hardly ever changes, then you really don't want to pay the write pain on the disk of serializing that data over and over as that log-truncation method runs. LSMs have a slightly different trade-off. Are people generally familiar with log-structured merge trees? Okay, so you're probably familiar with the trade-offs they make. The key thing here is that Volt never needs to go back and compact its log; it just fully writes a new snapshot every time. For small, heavily mutating datasets, that's desirable; for large datasets that don't change a lot, it's a little less desirable.

So let me talk a little bit about what problems people solve with VoltDB — although I'm happy to answer any other questions around architecture or design that might come up. VoltDB is used in a lot of production deployments. If this talk lasts roughly an hour, over the course of it there will be about 300 million Volt transactions done in production environments. That number adds up pretty fast when you have systems doing one to two billion transactions a day. All of these applications fall into three main categories, and I think this is actually interesting from a forward-looking perspective, because it informs what might be interesting to build — what people are assembling in practice. It turns out there are a lot of applications that have some high-speed data source. They could be smart-grid systems, distributed sensor systems, or location and iBeacon systems in a retail environment; it can also be online gameplay, clickstreams from web applications, tick streams from financial applications, or ad-tech display streams. All of these situations have some really high-velocity data input; you want to do some piece of work against it to personalize, or to build a system of engagement with the customer or the end user or the end device; and that personalization or recommendation is going to be informed by the result of some historic analytic process. This really necessitates a split system. If I'm gathering petabytes of data from an online game, I want to be able to analyze that large data set to look for ways to segment my player population. I want to be able to classify my players, and I want to be able to use that classification in real time to improve their gameplay experience — so eventually I can sell them something, right? And so VoltDB is used on the front end of that system in many cases. VoltDB can take the classification results derived from your analytic system, host or cache them in memory, and make fast personalization decisions informed by the current context: what has the user done over the last second or two, combined with the user's classification from their historical data, to make a per-segment personalization response that has some ability to improve engagement. Basically, think of it this way: I like puzzle games that are easy, because I tend to spend all of my hard thinking on Volt, and when I'm not thinking about Volt, I want something easy and fun to do.
Other people really prefer puzzle games that are really challenging. So if I come to an online game and the game is too hard for me — he says, non-egotistically — then I might walk away. I'm not going to spend money in that game; I won't become an engaged customer or download the add-ons or buy extra levels. I'm going to stop playing. My counterpart who likes really hard games has just the opposite problem. So the challenge for the game manufacturer is to identify me as someone who likes easy games and my colleague as someone who likes hard games, and to make the game easier for me so that I stay and play and spend money, and harder for my colleague so that they stay and spend money. Basically, the game has to tune itself to our different gameplay behaviors. To do that, they have to be able to identify me as someone who likes easy games from all of their historical data and my gameplay activity, and my colleague as someone who likes hard games, and as we play, they need to use that classification result to tune the gameplay experience for each of us individually as the game goes on. The front part of that system is a transaction-processing problem — making a decision based on a small number of rows — and the analytics part is an OLAP problem.

Another common example of where VoltDB is deployed is in eliminating batch "ETL" processes. I put ETL in quotes here because this is not traditional ETL. There are lots of real-time data sources that need to be filtered or enriched or sessionized — these are common operations against these feeds. For example, a customer uses VoltDB to do real-time fraud detection against ad networks. It turns out that a billion or two dollars a year is fraudulently paid out to people who buy advertisements and then run a botnet to click their own ads and get paid for it. I'm not suggesting that any of the smart people in this room go try to make money doing this. So there's a whole other suite of ad-tech companies out there trying to use machine-learning techniques to identify this fraud. They instrument the JavaScript in each browser and send this instrumentation data back to a platform that tries to determine whether an ad was clicked by a person or by a bot. Reassembling that data feed from the browser into a coherent record — it arrives chopped up, sent in pieces, and you need to put it back together; kind of a fancy upsert problem — is something you'd like to be able to do in a streaming fashion. And that upsert problem is a very simple transaction problem.

Another example is filtering location data. American Apparel does a lot of real-time inventory through RFID. They have a lot of RFID sensors, and RFID tags on all of the inventory in their stores, and they would like to eventually be able to track the movement of all of the merchandise within a store. In particular, they would love to know whether the socks always move with the underwear. If I walk into the store and pick up a pair of socks, then pick up a T-shirt, wander around the store, and leave without the socks, they would love to be able to get me to buy that pair of socks — real-time systems of engagement. How can they entice me to buy these socks? Maybe they send me an ad on my phone for socks when I walk in the door, or something. Now, it turns out that RFID creates really messy input — a lot of redundant input. You don't care that a pair of socks sat on the shelf now, and that a second later it's still on the shelf, and a second later it's still on the shelf. In your historical warehouse, you really only want to record its movement when it moved.
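That filtering step is a natural single-partition Volt procedure. Here's a sketch with an invented schema — note getTransactionTime(), the deterministic clock a Volt procedure uses instead of the wall clock, and UPSERT, which exists in newer Volt versions (older ones would select-then-insert-or-update):

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Only record a reading when the tag actually moved. Partitioning on tag_id
// (schema invented) makes each reading a single-partition transaction.
public class RecordIfMoved extends VoltProcedure {
    public final SQLStmt lastLocation = new SQLStmt(
        "SELECT shelf_id FROM tag_state WHERE tag_id = ?;");
    public final SQLStmt upsertState = new SQLStmt(
        "UPSERT INTO tag_state (tag_id, shelf_id) VALUES (?, ?);");
    public final SQLStmt recordMove = new SQLStmt(
        "INSERT INTO movements (tag_id, shelf_id, seen_at) VALUES (?, ?, ?);");

    public long run(long tagId, long shelfId) {
        voltQueueSQL(lastLocation, tagId);
        VoltTable prev = voltExecuteSQL()[0];

        // Redundant reading: same shelf as last time -- drop it.
        if (prev.advanceRow() && prev.getLong(0) == shelfId) {
            return 0;
        }
        // The tag moved (or is new): update state and keep the event.
        // getTransactionTime() is deterministic across replicas.
        voltQueueSQL(upsertState, tagId, shelfId);
        voltQueueSQL(recordMove, tagId, shelfId, getTransactionTime());
        voltExecuteSQL(true);
        return 1;
    }
}
```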
That means you want a system that can filter out all of that redundant information. So that real-time filtering is another common application. VoltDB actually does this for a bunch of companies that manage marathon runners. If you go run a marathon or a big road race like the Philadelphia road race — or if you've ever watched anyone who runs these races or triathlons — social media updates of a runner's progress can be driven across the course. Those are all driven by proximity-based sensors; filtering that data and driving those social media updates is another common real-time application, and it involves a surprisingly large number of transactions, because the sensors are kind of crappy.

And then finally, operational analytics. If you have all of these real-time data feeds, you want some transparency into them so that you can manage and operate them. On a smart-meter system, for example — the meters report to a concentrator, and the concentrator reports to the database infrastructure — you want to be able to identify concentrators that have failed. So there's a lot of simple grouping and time-series and counter analytics that you want to run essentially against real-time data — key-performance-indicator metrics against some kind of online platform. That's the third common Volt use case.

In all of these cases, it's combining the capability of the system to do thousands or hundreds of thousands or millions of transactions per second, with the ability to make a real-time decision against each incoming event, with the capability of then pushing this incoming data stream, after it's been processed, downstream to something like HDFS or Vertica. So without writing much application code at all, you can take HDFS and Hadoop, or Vertica; you can run a k-means clustering algorithm in the OLAP system; you can invoke bulk loaders — without writing code, either through a Hadoop output format or through a Vertica UDx — to push that classification result back to Volt. And you can write 15 or 20 lines of Java and a little bit of SQL in Volt to do real-time event classification against each cluster as data arrives, hundreds of thousands of times per second. That's a common way the systems are used in combination.

So that is the end of my prepared talk. I'm happy to answer any questions people might have. I hope this was of interest — I think it's a little bit different from an academic deep dive into particular data structures or techniques, but I thought it'd be fun to talk about what it's like, and some of the challenges that occur, when you build a production system from a piece of academic work. So thank you very much for your time. Happy to take any questions.

So, who's your chief competitor? Right, right — straight to the point. Who is VoltDB's chief competitor? The answer's always Larry Ellison. It's always him. So my answer to this is a little wishy-washy, which I apologize for. It turns out that Volt actually competes in different paradigms. When we compete for situations where people just want fast lookup, we compete a lot with companies like Aerospike.
And then finally, operational analytics. If you have all of these real-time data feeds, you want some transparency into them so that you can manage and operate them. On a smart meter system, for example, where the meters report to a concentrator and the concentrator reports to the database infrastructure, you want to be able to identify concentrators that have failed, right? There's just a lot of simple grouping, time series, and counter analytics that you want to be able to run against real-time data — key performance indicator metrics against some kind of online platform. This is kind of the third common VoltDB use case.
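That concentrator example maps to a simple aggregate query. Here is a hedged sketch of it as a VoltDB read procedure — the meter_reports table and all of its columns are assumptions for illustration, not something from the talk:

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Hypothetical read-only procedure: find concentrators that have gone quiet.
// Assumes a table meter_reports(concentrator_id, meter_id, reported_at).
public class StaleConcentrators extends VoltProcedure {

    public final SQLStmt stale = new SQLStmt(
        "SELECT concentrator_id, MAX(reported_at) AS last_report " +
        "FROM meter_reports GROUP BY concentrator_id " +
        "HAVING MAX(reported_at) < ?;");

    public VoltTable run(long cutoffMicros) {
        // Any concentrator whose newest report predates the cutoff has
        // presumably failed (or lost its uplink) and should raise an alert.
        voltQueueSQL(stale, cutoffMicros);
        return voltExecuteSQL(true)[0];
    }
}
```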
In all of these cases, it's combining the capability of the system to do thousands or hundreds of thousands or millions of transactions per second, and to make a real-time decision against each incoming event, with the capability of then pushing that incoming data stream, after it's been processed, downstream to something like HDFS or Vertica. So without writing any application code at all, you can feed either HDFS and Hadoop or Vertica. You can run a K-means clustering algorithm against the OLAP system. You can invoke bulk loaders, without writing code, either through a Hadoop output format or through a Vertica UDx, to push that classification report back to Volt. And you can write 15 or 20 lines of Java and a little bit of SQL in Volt to do real-time event classification, assigning each event to a cluster as data arrives at hundreds of thousands of events per second. So that's a common way the systems are used in combination. And that is the end of my prepared talk. I'm happy to answer any questions people might have. I hope this was of interest. I think it's a little different from an academic deep dive into particular data structures or techniques, but I thought it would be fun to talk about what it's like, and some of the challenges that occur, when you build a production system from a piece of academic work. So thank you very much for your time. Happy to take any questions. So who's your chief competitor? Right, straight to the point. Who is VoltDB's chief competitor? The answer's always Larry Ellison. It's always him. So my answer to this is a little wishy-washy, which I apologize for. It turns out that Volt actually competes in different paradigms. When we compete for situations where people just want fast lookup, we compete a lot with companies like Aerospike. But when we compete in some of these ETL and fast streaming applications, we compete against much less functional systems that are open source, like Storm or Spark. And so that's a whole other talk, but you can break Volt into two different paradigms of use: people who use it primarily for more traditional operational lookup, which is one set of competitors, and people who use it for streaming systems, which is kind of another set of competitors. What's the history of the data for the customers in that first class, who are using it for traditional data storage? Like, how old are the databases? How old are the databases — meaning, how long is the data stored in the system? Yeah. So there are a few customers that use VoltDB as a system of record, but it's much more likely that customers use VoltDB to store some interesting billing period's worth of work. They might store a month's worth of work, or one billing cycle's worth of work, in the database. But for the bulk of the data in these modern systems, the system of record is quickly becoming the OLAP store, which is a real reversal from traditional database use. The row stores are becoming... well, people know the term systems of engagement? There's a really great paper on systems of engagement. Not to get too businessy, but the idea is that as businesses evolve, they need to find ways to interact with customers when customers want to be interacted with. You know this intuitively: it's much more successful if someone sends you an alert or an advertisement or a notice or an offer right when you want it, versus just sending mail to your mailbox that you throw away immediately. So there are all these different systems being built to make engagement with your customer more immediate and more personalized. And those systems, it turns out, generate tons of data. You use the transactional portion of that system for the personalization problem, but all of the history behind it — the value of that history — is in data science. So you want to pull that history into some system where you can analyze it using different statistical methods. And because of that, the history ends up stored in your OLAP system, and that OLAP system eventually becomes your system of record. That's a common trend we see. How do you handle load balancing across partitions under a skewed workload? How do we handle load balancing across partitions under a skewed workload? Some of that pain gets pushed to the user: the user needs to pick a partitioning value that has high cardinality and doesn't have a lot of skew. But it's important to consider skew from two different perspectives. There's skew in the number of transactions that need to be run, and there's skew in the volume of data that needs to be stored. On the first one, the number of transactions that need to be run, this is going to sound kind of hokey, but it's true: Volt is fast enough that total transaction throughput is almost never the concern. Almost always, the concern is how much data I can put in memory. So there's almost always headroom in transactional throughput. The only situation we've seen where that's not the case is applications where you would really like to partition a financial application on ticker symbol, and certain ticker symbols are just so much hotter than others that it's a bad partitioning strategy. And, as I mentioned, we've added these elastic capabilities so that you can rebalance work across new nodes. That's done using a very traditional consistent hashing scheme. I'm sure you've already learned about consistent hashing — you can study that anywhere (a minimal sketch also appears at the end of this transcript) — and it's not novel in Volt. Yes? But does that mean that Volt doesn't deal so well with, say, terabytes of data that can't fit into memory? Yeah. There are multi-terabyte Volt clusters, but most applications that are suitable to Volt seem to be between 10 gigabytes and five terabytes in size. Now, depending on who you ask, five terabytes is either a tiny amount of data or a huge amount of data. In fact, I have a little bit of insight into the distribution of Vertica customers by data size, and there are a ton of OLAP applications against one to two terabytes of data. At which point, five terabytes of real-time transactional data is a lot. But that's kind of the sweet spot. All right, let's thank Ryan. Before we go, there are actually VoltDB T-shirts up here. We have men's and women's cuts, so everyone can take whatever they want, right? All right, so let's thank Ryan. Thank you. Thank you.
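The consistent hashing mentioned in that rebalancing answer is, as Ryan says, standard material you can study anywhere. For reference, here is a minimal, generic Java sketch of a hash ring with virtual nodes — explicitly not VoltDB's implementation, just the textbook idea:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;

// Generic consistent-hash ring with virtual nodes. Adding a node only remaps
// the keys that fall between it and its predecessor on the ring, which is
// what makes elastic rebalancing cheap relative to rehashing everything.
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int vnodesPerNode;

    public HashRing(int vnodesPerNode) { this.vnodesPerNode = vnodesPerNode; }

    public void addNode(String node) {
        for (int i = 0; i < vnodesPerNode; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < vnodesPerNode; i++) ring.remove(hash(node + "#" + i));
    }

    /** Route a partitioning key to the first node at or after its hash. */
    public String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
        Long h = ring.ceilingKey(hash(key));
        return ring.get(h != null ? h : ring.firstKey()); // wrap around the ring
    }

    // Derive a 64-bit ring position from the first 8 bytes of an MD5 digest.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always available on the JVM
        }
    }
}
```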