We've got about 15 minutes, and it's a pretty short talk because I imagine there'll be a few questions. What I'd like to talk about is some of the design challenges and implementation challenges that we faced and solved in building M7. M7 is our HBase-compatible NoSQL database, and it provides some really surprising characteristics.

Before we do that: I'm Ted Dunning. I'm the guy in the hat; that's how you can tell. Sometimes I take it off, but it's still nearby. I work at MapR. I also work on a lot of open source software, either through Apache or unofficially; I've been involved in open source since before dirt, back when all we had was rocks. There are some hashtags up there. It's not tonight, it's today, but it was tonight when I was writing this; I hadn't checked at the time. So, hashtags: NoSQL and MapR.

MapR is fast, and I'll show you why. First of all, MapR has a history of just implementing things that run fast. For instance, last fall we set the MinuteSort record: 1.5 terabytes in 59 seconds on about 2,000 nodes. Of course somebody then put 2,200 nodes on it and edged that number up a little bit, and one of our customers put 300 nodes on it and pushed it even further, but that one isn't official so I'm not allowed to quote it. The fact is that MapR runs faster on less hardware, and the scaling factors get bigger and bigger the more overhead there is in ordinary Hadoop distributions, and that overhead can come from scale or from applications. When you're scaling to very large sizes, you start having overhead just due to the size of the cluster, scheduling, things like that. So the advantage that MapR has expands from roughly 2x to, as we've seen unofficially (I'm supposed to take that out), about a 7x advantage in hardware-for-hardware speed. And at the same time, MapR has a history of providing lights-out sorts of capabilities: snapshots, transactionally correct updates, and so on.

So that leads us to the situation with NoSQL, because these advantages in raw MapReduce compute, while still presenting a read-write file system and the ability to do transactionally correct snapshots, really ought to be leveraged. And so the prologue to this talk is that HBase is actually really, really good. At its core, its architectural vision, based on Google's Bigtable, is really good; it's the execution that a lot of times isn't so good, and that has to do with the tawdry day-to-day implementation details of actually getting it to work. A lot of the problems have been due to a kind of bad marriage. I was there when Jim Kellerman and Mike Stack first presented HBase to the Hadoop community. There were two or three dozen people there, and the problems that they had that first day are some of the problems that they still have years later. And that is that HDFS is intended to solve the problems of large-scale MapReduce, and it explicitly puts out of scope the problems of read-write, real-time update. HBase is trying to compensate for that, and it's a bad marriage because HDFS is the only high-performance open source file system for Hadoop. There are many other file systems for Hadoop, but HDFS is really the one that HBase by its nature has to target, and yet HDFS by its own mission does not intend to support HBase all that well, and indeed there have been problems.
And so in part two I'm going to do an implementation tour of how we used the advantages, the framework and platform advantages that we had accessible, to drive an HBase API, an HBase-like implementation, to very, very high performance levels. I won't talk about all of them, but I'll cover many of the tricks that we were able to use to drive the overheads down and drive simplicity into the code, and therefore get very high speed. And then I'll give some honest-to-God results. These are out of our QA team; they run all the time. So that's the program.

The first question is: there's a lot of NoSQL out there, why in the world would you want to do another one? Why does HBase even exist? Well, the fact is HBase is quite common, especially in the Hadoop environment. About half of our customers run HBase. There are issues with that, because a very large fraction of our trouble reports are due to HBase. But the core advantage is that as your data gets bigger, the ability to scan data coherently becomes more and more important. HBase's basic table architecture is rows, divided into column families, ordered lexicographically, so that row one, row two, row three are sorted, and then they're divided across hardware. The fact that they're sorted according to key means that you can do sequential access, and very large data sets are best accessed that way. If you're going to do any sort of collective operation on a large data set, you desperately want to be reading it from disk in a sequential fashion.

I sometimes assign an interview problem where I ask people what it would take to update 1% of a terabyte. You've got one disk sitting there, you've got 100-byte records, and I want to update 1% of them. The naive algorithm, which goes and finds each record and updates it in place, takes about a month. Isn't that horrible? It's 10 milliseconds to seek, 10 milliseconds of rotation for the read, and 10 milliseconds of rotation for the write, times roughly a hundred million records to update. And the scaling you get from many spindles only divides the problem by a reasonably small constant; it doesn't fix the problem. Whereas if we copy the entire database, say 10 gigabytes at a time, into memory, make the changes, and write it back out to the same disk, it takes three to six hours. So by doing a hundred times more work, reading and writing the entire database, we get the result more than a hundred times faster. That's the value of scanning, and that's the value that HBase preserves even in a parallel environment.

HBase also provides, architecturally, a strong consistency model. That means that when a write returns, at least in the main API, the write has been written to multiple machines, and every reader who reads after that write happens will see that value. Furthermore, at the row level it's atomic, which means that every reader will either see that update in its entirety or not see it at all. Now in the NoSQL world, one of the first things people let go of is joins, but very quickly they tend to let go of the strong consistency model too, either purposefully, in order to gain resiliency across wider networking situations as in Cassandra, or inadvertently, because they just don't quite get how hard it is to implement parallel systems. HBase by design keeps that assumption, keeps that strong consistency, and that allows the application developer to make strong assumptions about the data. And that's a very valuable thing. Scan works. It does not have to broadcast. It does not have to scan out of order.
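To put numbers on that 1%-of-a-terabyte interview problem, here is a back-of-the-envelope sketch using the figures from the talk; the ~100 MB/s sequential transfer rate used for the streaming case is my assumption, not a number from the talk.

```java
// Back-of-the-envelope comparison: random in-place updates vs. rewriting the
// whole database by streaming. Seek/rotation figures are from the talk; the
// 100 MB/s sequential rate is an assumed, illustrative value.
public class UpdateCost {
    public static void main(String[] args) {
        long records = 1_000_000_000_000L / 100;          // 1 TB of 100-byte records = 10 billion
        long updates = records / 100;                     // update 1% of them = 100 million
        double perUpdateSeconds = 0.010 + 0.010 + 0.010;  // seek + rotation (read) + rotation (write)

        double naiveDays = updates * perUpdateSeconds / 86_400;
        System.out.printf("naive in-place updates: about %.0f days%n", naiveDays);

        double sequentialBytesPerSec = 100e6;             // assumed ~100 MB/s streaming rate
        double rewriteHours = 2 * 1e12 / sequentialBytesPerSec / 3_600; // read it all, write it all
        System.out.printf("full streaming rewrite: about %.1f hours%n", rewriteHours);
    }
}
```

The two numbers come out around 35 days versus roughly 6 hours, which is the hundred-fold gap the talk describes.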
If you have a short scan, it will typically go to one server and it will be very, very fast (there's a small client-side sketch of this at the end of this passage). The ring-based NoSQL databases, which inherently depend on a hashed key for good balancing and good spreading across a cluster, just aren't going to do sequential scans that way. Now, there are many applications that don't care, but there are also very many applications, especially as the data gets bigger, where you want not just point updates but collective analyses on ranges of data. So HBase has a real advantage there. It also scales automatically, and it scales well: it spreads data out as you add new nodes, so it can grow or shrink dynamically and fairly efficiently. It's also integrated with Hadoop, and integrated tightly with MapReduce.

These are all great, but there are issues. For instance, crash recovery takes too long; the mean time to recovery is much too long. It may lose edits after a crash. .META. may not come back; .META. is just where the table metadata lives, nothing big in it, but if it isn't available you have a real problem. The region assignment algorithms are also very, very complex. There are whole classes of problems with the implementation of HBase. Point-in-time recovery, actual point-in-time recovery, doesn't exist. There are snapshots, but as was said in the development discussion, they should really be called fuzzy snapshots, not real point-in-time snapshots: data that was written before the snapshot may not appear, and data that was written after the snapshot may have been included. So they're kind of a quick referential backup; it's more like cp -rl in Linux. It isn't really a snapshot. There are complex backups, there are bottlenecks, there are manageability issues.

So there's this theoretical entity of HBase, which is wonderful, and here are some of the JIRAs that are problematic. When you have failures, because HDFS and HBase are separate, the data may not wind up where the region is assigned, so you have non-local data access. You may not use disk space very well, because HBase doesn't have control over which spindles are used. There's a limited number of tables: you can only have several tens of tables, maybe a hundred, and if you have a thousand developers, that's maybe not what you want. And there are lots of operations that are so invasive that you need to do them manually.

There's one called compaction. The performance of HBase is critically dependent on log-structured merge trees. What these do is keep data in memory as long as possible; when they need to flush it, they sort it and write out a contiguous file. But then you wind up with all of these parallel files, and to find a data element you have to probe vertically through those files, in some sense. Your read performance starts to suffer, so you need to merge some of those files, and there are obsolete versions of data cells in there, so merging them also reclaims disk space. That merge process is called compaction. And because HDFS, which is storing the data, is over there at arm's length, and even relative to HDFS the data lands in files whose placement HDFS doesn't really control, you can get I/O storms very easily, and these compactions can therefore compromise latency dramatically. So compaction is typically managed manually. Splitting is another one. In the benchmarks that I show later, the M7 system will be doing automatic splitting, but HBase is unable to complete the benchmarks if we turn on auto-splitting, so the standard practice is to split manually.
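For reference, this is roughly what that kind of short, key-bounded scan looks like through the classic (0.94-era) HBase client API, which is the same API that M7 tables expose. Because rows are stored in lexicographic key order, a bounded scan normally lands on a single region server. The table name, column family, and row keys here are made up purely for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// A short, key-bounded scan over [startRow, stopRow). Rows come back in key
// order, and the scan typically touches only one region server.
public class ShortScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");   // hypothetical table name
        try {
            Scan scan = new Scan(Bytes.toBytes("user42#2013-06-01"),
                                 Bytes.toBytes("user42#2013-07-01"));
            scan.addFamily(Bytes.toBytes("d"));      // hypothetical column family
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```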
You also have to manually merge if you want to do the opposite of splitting. Basic administration often means that you have to take a table down. And there's slow crash recovery. Take that JIRA there, 1111: it's been superseded now by 5843, and it depends on all of these other JIRAs. This is an example of the complexity of the problems that HBase faces. To fix the simple problem that HBase takes tens of minutes to come back after a failure, all of these things have to be fixed, and only about 20% of them have been fixed so far. And see the HDFS ones in there? Those are the real kickers, because HBase doesn't drive HDFS; HDFS has a different master. So it's going to be very hard to solve all of these issues in order to drive that recovery time down. This is the fundamental issue with HBase.

And at its heart, the problem is distributed systems. In HBase we have an HBase master, region servers, we have ZooKeeper, we have HDFS, we have the name node. All of these are themselves distributed systems, except the name node, to its shame. And these distributed systems are each hard to get right, and collectively they're enormously hard to get right. For one thing, the failure modes that you have to consider are at least 2 to the n. Every component could be in a failure mode or an operational mode, so if you have n components, you have 2 to the n failure scenarios. Clearly you aren't going to address those explicitly by examination of cases; you have to address them logically, and sometimes formally. But that's hard to do when you have other people changing other code with very loose API contracts. And there is no distributed transaction framework. That's because HDFS never intended to do transactions, never intended to have strong consistency. Well, that falls back on HBase. Something as mundane as Java garbage collection can just wipe out your consistency model, because not only do you have failure modes, you have partial failure modes where somebody may have just checked out for a while: they didn't really fail, they're going to be back, and they're going to expect to keep on working, or they may simply be slow because of I/O competition. So this is really, really difficult.

As an example, here's just region assignment. All of these messages are required just to do the correct thing, and if we're going to do a correct design, we have to consider the cases where any or all of these threads of control, on any or all of these systems, go wrong.

And at its heart, it's HDFS. HDFS splits files across unsynchronized data nodes with a single point of bottleneck, the name node, where all metadata updates have to go. This is the fundamental problem. In order to get high update rates, you have to do the updates without involving the name node. The name node is one machine; it can update the file system at about 500 operations per second, and so we dare not tell it that we're updating the file system at 100,000 changes per second. We would have an enormous problem. So this architecture, which is okay for MapReduce, is fundamentally flawed for distributed systems that depend on coordination. There are also memory limitations and things like that. Block locations are mediated by the name node, and the name node keeps track of block reports, which are another source of I/O storms over the network. You also have failure modes which can freeze the system. So you have real problems there, scalability problems of many sorts. And the fundamental issue comes down to four design parameters.
These design parameters: the unit of I/O should be 4 to 8K; that's what God and physics intend, right? That's just how disks work. There's a unit of chunking; that's a good measure of how much parallelism we're going to have in I/O at the node level, not just at the spindle level, and that wants to be about 100 megabytes: big enough to make a difference, but small enough to move quickly. We also want a unit of resynchronization, and we want that to be gigabytes, because we want it to make a difference not just at a file level but at a cluster level; we want to be able to move things around and resynchronize them in large units. So we have three units there. And we also want a unit of administration, and we want that to be a large fraction of the cluster. In HDFS, essentially all of these numbers, especially the ones on the left, are glued together in the HDFS block size. So that's another issue we're going to have to face: there's the I/O size, the chunk size, the resync size, and the administrative size, and the HDFS block size encodes most of them.

So by getting rid of that, by distributing the name node, and in particular by introducing an intermediate concept — files are broken into pieces, and the pieces are put into what we call containers, which are large, tens-of-gigabytes structures that are then the units of replication — you're going to see how we can make those give us the things we want. And in particular, here's where it starts to get interesting. See there: there's a B-tree data structure primitive inside the container. That's the first major, major trick here. It's built into the file system itself. These are transactionally updatable, primitive data structures; it's not just an array of bytes three levels of abstraction down, it's B-trees that are exposed by the file system. And we can hold millions of these things. In the benchmark I mentioned at the beginning, the file system is capable of creating, writing, closing, opening, reading, closing, and deleting four and a half million files in 20 seconds. So we can update the file system metadata very fast.

So two architectural limits are now lifted. One, we can make transactional updates at a very, very high rate in a globally visible way. And two, we have built into the file system not just files but actual database structures, database primitives. The containers are replicated, and there are transactional updates to them. So what we're going to do is implement a region entirely inside one container, where we have those micro-transactions: all the files, all the WALs, all the B-trees, Bloom filters, range maps and everything go into that one container. We get read-write replication, either chain-shaped or star-shaped; metadata or data can be replicated in either way. And we have transactionally correct updates to these containers. Transactionally correct means that a write that has committed before a snapshot will appear in that snapshot, and a write that occurs after the snapshot will not; it's guaranteed that there's a before-and-after ordering available there. And the fact that all writes are synchronous is something that we can use.
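Just to pin down the vocabulary before going on, here is a tiny schematic of those units and of what lives inside a container. The class and field names are mine, not MapR's, and the 30 GB container figure is only an illustrative stand-in for "tens of gigabytes".

```java
// A schematic, not MapR's code: just fixing the vocabulary and the rough
// sizes from the talk. All names and the specific container size are mine.
final class ContainerSketch {
    // The four design parameters, which HDFS fuses into one block size:
    static final long IO_UNIT_BYTES   = 8 * 1024L;                // 4-8 KB: what disks and pages want
    static final long CHUNK_BYTES     = 100L * 1024 * 1024;       // ~100 MB: node-level I/O parallelism
    static final long CONTAINER_BYTES = 30L * 1024 * 1024 * 1024; // tens of GB: unit of replication/resync
    // (the fourth unit, administration, is a volume spanning a large fraction of the cluster)

    // A region lives entirely inside one container, so its WALs, B-tree
    // primitives, Bloom filters and range maps are updated together under one
    // micro-transaction and replicated along one chain (or star) of nodes.
    long containerId;
    java.util.List<String> replicationChain;   // first entry is the primary copy
}
```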
The basic process is: a client goes first to the CLDB, the container location database, which allocates containers that are then sprayed across the cluster, and from there all updates go direct, not involving any central resource, involving only that replication chain. And in failure mode, the application doesn't see any of the failures: if we have a failure, the CLDB rearranges the replication structure, and between one transaction and the next we again have a consistent view from the application.

So here are some results. This is a tiny cluster, and this is data that's over a year old. You see a constant 15,000 or so ops per second, and these are metadata operations, out to a billion files. The circle over there, expanded here, is ordinary Hadoop distributions, ordinary Hadoop file systems. So this is the raw stuff that I'm talking about today, and it's how we're going to reach very, very high NoSQL performance. So here we go; here's the outline. These are the major factors that I've identified, and away we go. Yeah, question?

"So M7, is that a complement to HBase, or is it an equivalent to HBase within the MapR distribution?"

So the question is: what is HBase? That is the basic question, and it's a really confusing thing, because MapR runs all the standard ecosystem components. MapR now has tables built into the file system as a primitive object that exposes an HBase API. It can also run HBase at the same time, and in fact an HBase application can access both native tables and HBase tables at the same time via the same API. So what is HBase and what is not becomes very, very tricky to say correctly. Strictly, the code from the Apache Software Foundation is HBase; that is HBase. Now, by common nomenclature, an HBase application is one which uses the HBase API to talk to HBase, but at this point an HBase application can also be a MapR table application: exactly the same API, except for coprocessors, which are a dangerous concession anyway. So the way I'd wind up saying it: the MapR native tables are not HBase; they exhibit the HBase API.

"Do you intend to make it part of the open source distribution?"

Well, the problem is that open source evolves. It proceeds by small, evolutionary steps. If I present a patch to Mahout or ZooKeeper or wherever, I need to have a reasonably bounded patch: I say what it does, people have to read it and understand it, and it has to fit into the current system. If we were to do this, we would have a patch that's about a million lines of C++ spanning HDFS, MapReduce, and HBase. It can't be done. Now, we can expose this technology in talks like this. We can expose this technology by contributing to Apache Drill, which is a Java-based system to build high-performance SQL. But it's just infeasible to expose this technology otherwise into these Apache projects. We can give back in other ways. We can find problems in systems, especially since our file system has such different performance characteristics that race conditions often get stretched very differently than on HDFS; it's actually a contribution to run at different speed constraints. But actually contributing it — like I said, a million-line patch in the wrong language with no way to manage it is just an infeasible thing to do. It's not possible. Happy to talk about it, happy to describe it, happy to work with people on it, but it just isn't feasible.

So, MapR is revolutionary where open source is evolutionary. Where open source succeeds is by starting at some revolutionary point and then taking small steps where everything always works; where independent efforts succeed is by taking a large discontinuous step, doing what open source doesn't do. They're just two different axes. There's no real reconciliation possible, except after a period of time.
Like when Google publishes its papers and suddenly Hadoop exists: that's the sort of thing that can spawn an open source project. Okay, did somebody else raise a hand?

"So is the code available, but just not Apache?"

No, it's not even available. I mean, it would be silly to pretend that's open source; that's just bullshit, and we don't bullshit. It's what we did. We don't want people messing with it, and we're not going to be accepting patches. The task executive in there has a three-nanosecond context switch time. It manages its own stacks and its own threads internally, it has its own queuing system, and basically its own I/O system. You don't want normal folks messing with that; it's just going to be really bad. And so to pretend that it's open source by saying "here, you can look at the code" is just disingenuous. We're not going to do that. We're going to be very cautious about updates to it. We have an outrageous internal testing program, and we have an outrageous pre-implementation review process, to make sure we don't screw it up, which would be really easy to do. And it isn't a matter of just making mistakes in code: there are very subtle interactions across threads, across cores, and across cache consistency. That three-nanosecond context switch time could easily blow up by a couple of orders of magnitude if somebody just accesses some other memory incorrectly and pollutes the cache. Suddenly things that are in the L1 cache, that are known to be in the L1 cache and assumed to be in the L1 cache, aren't in cache at all, and wham, two orders of magnitude slowdown. These are very, very subtle issues; the internals are not suitable for general tinkering. Now, the abstraction above it is suitable for general use, and it's important that we maintain that. Sorry. But I'm happy to talk about it, to say what we learn as we learn it.

So starting with MapR-FS was natural for us, and it provided huge advantages. We started with C++, not Java, and you'll see why in a moment. We have absolute control over positioning; we have control over cache coherency, alignments, and a lot of other things. I'll say a bit more about that. We have a fully lockless design at the low levels: once a data element hits a queue, once it's on its way to disk, there are no more locks. It's locked to one core and it just slides straight through, very fast. That's where we have the custom queue executive with the very, very fast context switch time. It's locked to a particular core of the machine, wired to the memory around it, and it becomes very, very stable and very, very tuned to the memory structure over time. We had to reimplement the RPC layer; I'll talk more about that, there are some exciting things there, and that is one example where we are pushing stuff out, pushing it into Drill.

We also cut out hops. In HBase, the client talks to the region server, which talks to HDFS, which talks to the file system. In MapR, the client talks to the file system, which has control of the disks. So there are network hops and communication hops and serialization steps that are avoided there. And we hybridize the basic log-structured merge tree with the B-trees that we have as primitives in there. That allows us to use, at the lowest level, mutation operations on the file system. Those are not a good thing at large scale, but at small scale we can use them to get substantial benefits. I'll talk about that later.
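Since log-structured merge trees keep coming up, here is a minimal, generic sketch of the LSM read path: probe the in-memory store, then each flushed, immutable sorted file from newest to oldest. This is exactly why un-compacted files hurt reads and why compaction exists. It is an illustration of the general idea only, not MapR's or HBase's actual code.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Generic LSM-style read path. The more flushed, un-compacted files there are,
// the more probes (and potential disk seeks) each read pays, which is what
// compaction limits. Purely illustrative; ignores tombstones and versions.
class LsmRead {
    final NavigableMap<String, String> memStore = new TreeMap<>();
    final List<NavigableMap<String, String>> flushedFiles; // immutable, newest first

    LsmRead(List<NavigableMap<String, String>> flushedFiles) {
        this.flushedFiles = flushedFiles;
    }

    String get(String key) {
        String v = memStore.get(key);
        if (v != null) return v;                       // newest data wins
        for (NavigableMap<String, String> file : flushedFiles) {
            v = file.get(key);                         // in reality: Bloom filter + index + seek
            if (v != null) return v;
        }
        return null;                                   // not present anywhere
    }
}
```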
We also adjusted the sizes and fan-outs and the number of layers in the system to reflect the physics of what we have. And then, in a few places where we had been silly, we fixed that: our initial implementations were considerably slower than we expected, and it turned out to be very, very simple silliness. Anyway, here's the catalog. Most of these things we get almost for free just by implementing inside the file system, rather than two levels above the file system.

Talking about hops, for instance: as I mentioned, in HBase the client goes to the region server, the region server goes to the data node, and it may have to go to a data node on a different system from where the region server is. Here, we go to the file server directly. And that has indirect advantages too, because data cannot exist without a file server in a MapR system, and because the data on a particular disk will only ever be touched or served by the file server on that machine, the problem of data locality is gone. You get to the file server which holds the primary copy of that table, and it has control. If that data is moved, you inherently go to a different file server, and that file server is where the data is. So all of the complexity of coordination about who has which region, which region server is handling it, which HDFS data node you're talking to, is gone. All of that complex sequence diagram I showed earlier is just gone. So lesson zero: implement the tables in the file system. That was the first step.

Now along with that came this "why not Java" question. And I have to say right off the bat — my name is Ted Dunning and I'm a Java bigot, I love it, I think you can do extraordinary things with Java — but when you start looking at really low-level performance code, it's really hard for it to touch C++. Take, for instance, an array of structures, something that simple. There is no struct in Java; there are only objects. Objects involve something like 40 to 50 bytes of overhead, so if I have a structure of two integers, eight bytes of data, I'm going to have 40 bytes of overhead. Now I can reimplement those as multiple parallel arrays and some indexing magic, but then to get that struct out, to hand somebody a reference to those two integers, I wind up creating an object, so I pay that penalty either in allocation or in static size. C++, on the other hand, is going to have an array of bytes, pointers into those bytes, and no overhead on that. Except mental overhead: I've got to deallocate this, I've got to allocate it correctly. That's hard, that's work. It's work you don't have to do in Java, but it's work you have to do if you want that level of performance. Now Drill has taken some of this and built some very, very interesting capabilities around how Java can provide a fiction of this, but the fact is that in C++ you feel the bits between your toes; it's very direct.
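To make the array-of-structures point concrete, here is a small sketch of the two options in Java. The overhead figures in the comments are typical JVM numbers, not measurements of any particular system, and the class names are mine.

```java
// Option 1: an array of small objects. The array itself holds references
// (4-8 bytes each), and every Pair, once allocated, also carries an object
// header (commonly 12-16 bytes) plus padding on top of its 8 bytes of payload.
class Pair {
    int first;
    int second;
}

class ArrayOfObjects {
    Pair[] pairs = new Pair[1_000_000];   // 1M references; each Pair is a separate heap object
}

// Option 2: parallel primitive arrays ("struct of arrays"): no per-element
// headers, contiguous memory, cache-friendly. The catch, as the talk says,
// is that handing element i back as one "struct" means either allocating an
// object after all, or passing the two ints around by hand.
class StructOfArrays {
    int[] first  = new int[1_000_000];
    int[] second = new int[1_000_000];

    int firstOf(int i)  { return first[i]; }
    int secondOf(int i) { return second[i]; }
}
```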
Also consider that when data is arriving on the wire, we want it to land in memory exactly once; we never want to copy it. In Java that's really hard. If you use protobufs, for instance, you have a data structure which represents the bytes that have arrived, so they're already in memory, but then to reference those bits, they get copied out. And if I have an envelope structure which says here's the header and here's the payload, I copy out that envelope structure, then I copy out the buffer of the data I want, and then I copy the data out of that. So again, the Java object structure, which pays such huge dividends in programmer productivity, kills us on performance once we go to extreme limits. And again, we've since taken that experience and moved it into Drill and shown how it actually is possible to do this in Java.

We also have a problem if you want to core-lock a process: you have multiple threads of control, each one on a physical core, you want to lock each thread there, and you want to absolutely control which data elements are in cache and which are not. Java is abstracted away from that; it's essentially impossible to core-lock different threads, so that just isn't going to happen. I've seen that the Disruptor is able to move data through an in-memory queue from one core to another in about 50 nanoseconds. But in that time, we can do 17 context switches. And again, it's purely that really low-level, almost assembly-language view of the hardware that allows this. Core locking, lock-free, zero-copy queues: you'd have to violate all of the virtues of Java, you'd effectively have to write C++ in Java. So instead of being an advantage, Java becomes a disadvantage, a huge showstopper. And of course there's the question of GC; garbage collection in a real-time system is a major issue. Our system doesn't have that problem, because it manages memory explicitly.

Now, we also have to make use of our knowledge of the system. The MapR system breaks a table into tablets; for convenience, we call those regions. A tablet is then broken into partitions, and partitions are broken into segments. Only the regions are user-visible; all of this other stuff is internal to the implementation. What we do is this: the fan-out at the tablet level is hundreds of thousands, an enormous number of tablets. At the partition level, it's thousands. At the segment level, it's adjusted so that we have a very small number of segments per partition, so that we can keep all of the active partitions and their segment maps in memory, even in L1 cache. Secondly, segments themselves are structured so that updates to a segment involve a single disk operation; they're about 1 to 2 megabyte updates. So what we have is different kinds of fan-outs, different appropriate sizes, at different levels of the hierarchy. It's absolutely critical that we never do an update that takes, say, 1.1 rotations, because that costs us a whole extra rotation. We also never want to do a 0.2-rotation update, because that wastes 80 percent of the rotation; it's just a wasted cycle. So sizing this is critical for performance, and splitting widely at the top is also critical. Having enough layers, but not a variable number of layers, so that we can control the sizing at the bottom and the fan-out at the top, is critical to our performance.

"What's the difference between a tablet and a partition, then?"

These are just different data structures that have different ways of indexing their contents and accessing their individual constituents. The table structure is merely key ranges, and it has a file identifier, a primitive file identifier: it says, in that container over there is a tablet. The tablet has something similar, but it no longer has non-local references.
It refers primarily to partitions in the same container — not always, but almost always. Each partition is on a single storage pool, in a single container. Tablets can be spread around a little bit more, but usually they wind up on a single node; there's a very, very strong preference for that, and there are processes that ensure it. So you can view it as: at the tablet level, we break down to a single node. Now, there are many more tablets than there are nodes, so many tablets wind up on one node, but each tablet stays on one. And if we want to rewrite one segment, we do that pretty much in place, and that's where the read-write file system becomes a huge advantage.

"How do you set these parameters for differences in hardware — say we're on a chip which has an L1 cache twice the size of another chip?"

So, we adapt to that. We measure, we observe, we probe it, and then we adapt. Differences in L1, L2, L3 cache sizes are rampant, and differences in memory hierarchy bandwidths change those parameters too. Now, at the disk level the world is a little bit simpler. Within a factor of 2 or 3, disks all rotate at about the same speed, they have about the same size cylinders, and they're striped roughly the same way within our storage pools, so we can treat spinning disks as one kind of thing; there aren't really parameters that need to be adjusted there. Then when we run on SSDs — and I don't know that I have any numbers for that here — we can simply keep pretending the spinning-disk constraints still exist. They don't hurt us at all, because it's still a reasonable way to update an SSD; SSDs do have costs for non-contiguous small updates. So it turns out to be okay not to adapt at that level. It's at the processor level, the really tight bit-twiddling sort of thing, that it really matters, and especially which key structures we cache in memory.

Being able to control cache and memory residency — to force key structures into memory, secondarily to prioritize column families that are locked in memory, and thirdly to keep data in memory as convenient — is absolutely critical. That allows single-disk-read access, and it can be a factor of five difference in performance all by itself. And it's unfortunate for the HBase side of the world, because again, at arm's length, HBase cannot force that. It cannot force the disk subsystem to cache different data structures differently. It can do an fadvise on HBase or HDFS data in general, but it cannot say: this is a B-tree index, it must be in memory, and that over there is data, it may be cached. It just can't say that, because it's too far away. When we're in the same image, it's easy. But that's a critical performance thing, and it's also a critical scaling factor, because it means that we can keep the critical data in memory. As data expands far beyond memory, we really have to prioritize and keep the right bits in memory.

Another factor is that updates and reads in our system go to the primary copy; you gain performance by sharding more narrowly rather than by replicating reads. That means we actually get a huge advantage, because we cache only on the master, not on the replicas, and the replicas therefore have their memory free to cache the things that they are master for. It effectively triples the amount of memory that we have in the cluster.
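Before the wrap-up, here is a schematic pulling together the tablet/partition/segment layout and the caching rules from the last few minutes. The class and field names are mine, purely illustrative, and the fan-out figures are the rough orders of magnitude from the talk.

```java
import java.util.List;

// Schematic of the layout and caching rules just described -- not MapR's code.
// A table fans out into an enormous number of tablets (the user-visible
// "regions"); each tablet lives in one container and fans out into thousands
// of partitions; each partition has a few segments sized so one update is a
// single contiguous 1-2 MB write. The small index structures (key ranges,
// segment maps) are what get pinned in memory, and caching happens only on
// the primary copy of each container, leaving replica memory free for the
// containers that node is primary for.
class TableLayoutSketch {
    static class Table     { List<Tablet> tablets; }         // fan-out: hundreds of thousands
    static class Tablet    { List<Partition> partitions; }   // fan-out: thousands; one container, usually one node
    static class Partition { List<Segment> segments; }       // few segments; index fits in cache
    static class Segment   { byte[] startKey; long fileId; } // rewritten in place, 1-2 MB at a time
}
```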
The amount of unique data in memory is the key performance indicator for a cluster of a given size; nothing else compares to how many rows you can keep in memory. And by differentiating the caching on the primary and the secondaries — again, not a thing that HBase can control, but a thing that we can control — we gain a factor of three in memory. Now, we don't normally run memory-borderline tests; we normally run memory-resident, SSD-resident, and hard-disk-resident tests, because we want to test the extremes. But at the borderline, that makes a huge difference to scalability and performance.

"You mentioned SSDs. Are you saying there are too many categories of storage rates?"

Yeah — I'm not sure I quite follow the question, and I'm being told that we need to hop along, so I'm going to move on.

Physics, good ideas — that kind of ground-up implementation was critical. He's cutting me off, but I'm going to show the performance numbers anyway; these slides will be online very shortly. These numbers are throughput, MapR versus HBase. These numbers are latency; you can't even see the MapR latency. Here's a mixed load, again with different arrangements, and you see the same basic idea. Here's a magnified view — this is the HBase latency, just magnified by 20x. It's all kinds of things; it's lack of control, it's the fact that HBase can't control its own universe. And there's the MapR latency. In fact, these latency spikes are roughly 10x worse than they appear, because these are averages over 10 seconds: typically you have a few transactions with high latency, so the actual latency there is much higher; it's being diluted by at least 10x by the averaging.

There's our recap. You can get in touch with me. I'm sorry that we only have 45 minutes, but that's the money slide: we can produce much higher performance numbers by controlling our own universe, and this is available right now, anytime you need it. We can step outside and do some more questions out there if you'd like. I assume there's another speaker coming in; that's why I'm leaving.