Hello. I am Mark Callaghan. I work on small data, or transaction processing, at Facebook. We're jealous of the hype that big data gets, so we're trying to bring small back. My focus is storage efficiency: we have a lot of data to store, so a few of us focus on reducing the amount of storage hardware we need. I also named the project, MyRocks, so I'm very happy about that contribution to the effort.

RocksDB is a log-structured merge tree. It started out as a fork of LevelDB, but it has changed so much that at this point it's its own project. MyRocks is a MySQL storage engine: we implement the MySQL storage engine API on top of RocksDB. Many people have tried to do this, and it is hard to implement the storage engine API, but we are doing it. It runs in production today. A few months ago we announced that it was running on about 5% of the database servers in one of our data centers. We've gone significantly beyond that; we just haven't been specific about how much progress we have made. So this is something that's running in production. It's real, and it will eventually be usable for people beyond Facebook.

Sergei Petrunia is here, and I always like to acknowledge his contribution. He is a significant contributor to MyRocks; we would not be here without his effort, and I appreciate that.

A brief editorial. For a few years there, doing web-scale MySQL, I got a lot of advice from people about what I should be doing or what I should be using. I don't get that much advice anymore. There were some dark years, and there were issues with the product that earned it that reputation, but we're at a very different point right now. If you just look at features, there were several talks today about group replication and synchronous replication in MySQL. Not losing commits is a really big deal.
If you look at performance in MySQL 5.7 and MySQL 8, with what they've done for InnoDB, it's a really big deal. In terms of usage, I gave a community talk a few years back, and at the end someone said, well, you're just using MySQL because it's legacy. That was an interesting opinion, but we are competing for, and winning, new workloads at Facebook. I'm going to be vague, because I haven't been cleared to talk about all of these workloads, but one that was disclosed in public was when we moved messaging off HBase, and the reason was to improve quality of service. We are competing for additional workloads, so we're in a good place. Externally, if you look at what the cloud providers are doing, Amazon with Aurora and RDS, MySQL is a highly competitive product in the cloud; once you have highly available storage, it's much easier to scale and operate MySQL. And finally, if you look at the community, one of my proxies for its success is the DB-Engines rankings, where MySQL is likely to become the number one database this year, which means passing Oracle; it's already ahead of SQL Server. So I think we're in a really good place.

There are two questions I get frequently, or two questions I try to answer in terms of marketing: why MyRocks, and when to use it? For why, my focus is storage efficiency. I can go more or less technical, but really it's the best space efficiency, meaning we use the least amount of space. And it's not just an implementation thing: the algorithm we use is a log-structured merge tree with leveled compaction, which in the worst case will use about 10% more storage than optimal. I'm not aware of another database algorithm that can beat that 10% overhead; with a B-tree you're likely to have about 50% overhead.
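Before turning to write efficiency, those space-overhead numbers can be sanity-checked with back-of-the-envelope arithmetic. This is an illustrative sketch, not code from MyRocks; it assumes the common size ratio of 10 between adjacent LSM levels, and a B-tree whose leaf pages average one-half to two-thirds full:

```python
# Back-of-the-envelope space amplification, illustrating the claims above.
# Not MyRocks code; just arithmetic under common assumptions.

def lsm_space_amp(num_levels: int, size_ratio: int = 10) -> float:
    """Total size of all levels relative to the largest level,
    for leveled compaction with a fixed size ratio between levels."""
    sizes = [size_ratio ** i for i in range(num_levels)]
    return sum(sizes) / max(sizes)

def btree_space_amp(avg_leaf_fill: float) -> float:
    """Space used relative to space needed, when leaf pages
    average avg_leaf_fill full (typically 1/2 to 2/3)."""
    return 1.0 / avg_leaf_fill

print(lsm_space_amp(4))        # ~1.111: about 10% overhead
print(btree_space_amp(2 / 3))  # ~1.5: about 50% overhead
```

With a size ratio of 10, the largest level holds roughly 90% of the data, so everything above it adds only about 10% on top, which is where the "10% more storage than optimal" worst case comes from.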
By better write efficiency I mean: for a given workload, how much data do you write to storage per transaction? If you run iostat while running a benchmark, with RocksDB and MyRocks we write less back to storage per transaction.

Good read efficiency, where good is the vague term. When the project started, we tried to set some goals: at what point would MyRocks be usable? I guessed that we wanted to compress 2x better than compressed InnoDB, and to do that without being too much slower. So we accepted that we might be slower in terms of query response time; we just wanted to be good enough.

The last point for why MyRocks: it has benefits for both SSD (NAND flash) and disk. The benefit for SSD is that compression is better, so you will use less SSD, which can be a big deal if you're purchasing the SSD; and better write efficiency compared to something like InnoDB means SSD endurance is going to be less of an issue. For disk, using less capacity is not always a big deal: if you're doing transaction processing on top of a disk array, you might not be using all of the disks, so you might have spare capacity. The benefit for disk is that better write efficiency means we use less of the disk's write capacity, so we save more of the IO capacity for reads, and we can do more queries per second because we're more efficient on the write side.

For the user database workload, we actually use about half the space compared to compressed InnoDB and about one quarter of the space compared to uncompressed InnoDB, so we met the initial requirement for the project: 2x better compression than compressed InnoDB. The last point was a surprise: the write rate to storage with MyRocks for this production workload is about 10% of the write rate to storage with InnoDB.
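To illustrate what a 10x difference in write rate means for flash endurance, here is a quick sketch. The drive rating and write rates below are hypothetical numbers chosen for illustration, not measurements from our production workload:

```python
# Illustrative arithmetic: how a sustained write rate translates into SSD
# lifetime. The endurance rating (3 PB written) and the write rates are
# hypothetical; only the 10x ratio reflects the claim in the talk.

SECONDS_PER_YEAR = 365 * 24 * 3600

def ssd_lifetime_years(rated_tb_written: float, write_mb_per_sec: float) -> float:
    """Years until a drive's endurance rating is consumed at a given write rate."""
    tb_per_year = write_mb_per_sec * SECONDS_PER_YEAR / 1_000_000
    return rated_tb_written / tb_per_year

# Same workload, same drive; one engine writes 10x less to storage
# per transaction, so it burns endurance 10x more slowly.
innodb_like = ssd_lifetime_years(3000, write_mb_per_sec=100)
myrocks_like = ssd_lifetime_years(3000, write_mb_per_sec=10)
print(innodb_like, myrocks_like)  # the second lifetime is 10x the first
```

The point is simply that endurance scales inversely with the write rate: cut the per-transaction writes by 10x and the same device lasts 10x longer at the same transaction rate.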
So InnoDB is writing ten times as much, in kilobytes or megabytes per second, to the SSD. SSD endurance is much more of an issue with InnoDB, or rather, switching to MyRocks makes it less of an issue.

For when to consider MyRocks: I've been slowly expanding the marketing claims I make for it. Number one is, if you're using InnoDB and the database is larger than memory, my goal is for MyRocks to be competitive with InnoDB. And it's still a goal at this point; there are performance problems in MyRocks. I do a lot of performance testing, and I also respond to reports. I think Valeri is probably here; most recently Valeri reported a performance problem which I cannot reproduce, which is funny, because the worst thing you can do in support is to immediately tell the customer "I can't reproduce it, it's not a bug." I initially made that joke when Valeri mentioned the problem, but I actually can't reproduce it. So we have performance problems. I know of some of them, and the more I know about, the more we can fix. We're trying to target query latency similar to what InnoDB has, or better, so the goals are expanding.

Progress. Yoshinori speaks after me, which means he can correct all the mistakes I make; he is definitely more of an expert on this topic. From a marketing perspective, efficient performance is how I'd like to describe MyRocks: we want storage efficiency with good enough performance, and we got that. We deployed it in production for the user database workload, which is the most important MySQL workload that we run, and the deployment is continuing, moving fast with correctness. We started ports to Percona Server and MariaDB Server. This is a big deal to me: I want people to use MyRocks, not just Facebook people, and for that to happen it needs to be in a proper distribution with support.
And that's possible now that MyRocks is getting into Percona Server and MariaDB Server.

For 2017: documentation. Not even better documentation; it might be safer to say we need documentation. I frequently get reminded of things I forgot about, in part because a lot of the details are in internal discussion groups. We need more production deployments, and we are competing for them internally. I know of at least one other company, which I won't name in public, that is potentially using MyRocks today; they have a lot of MySQL talent to make that happen, though. Hopefully it will be usable. I can't claim it will be GA in MariaDB Server or Percona Server, that's someone else's decision, but it will be released in some form that can be used. Then we have a lot of performance improvements, and I have more details to publish in the performance comparisons. The last point is features: we want to expose more of the optimizations we have in RocksDB via SQL. An example is time to live (TTL), so data can age out without you having to explicitly delete it; that's likely to be the first RocksDB optimization we expose in SQL.

So, efficiency. I've made some strong claims. I publish a lot of performance results, and I try to explain the results I share. I don't always manage it; sometimes I publish, people ask, and I have to go back and revisit. But at a high level, in terms of space, read, and write efficiency: why is space efficiency better for a log-structured merge tree than for a B-tree? The first reason is fragmentation. The leaf nodes of a B-tree will be one-half to two-thirds full when subject to a randomly ordered sequence of updates. If they're one-half to two-thirds full, you're wasting one-third to one-half of your space in the buffer pool, in memory, and on disk. With a log-structured merge tree, the space overhead is about 10%, rather than one-third to one-half. The second reason is fixed page sizes.
If you use compression with InnoDB, your page size on disk is fixed. So if you have 2x compression configured, with a 16KB page in memory and an 8KB page on disk, and a 16KB page compresses down to 5KB, you still have to use 8KB on disk. That's another source of wasted compression. Next, per-row metadata: with InnoDB it's 13 bytes; with RocksDB it's six or seven bytes, and for most of the data, say 90% of it, we usually compress that down to zero bytes. For small rows that's a really big deal. Finally, key prefix encoding is applied in memory for uncompressed blocks and on disk with RocksDB, so keys, and therefore indexes, take up less space with RocksDB.

Why is write efficiency better? Well, if you're using more space with a B-tree, you have more pages to write back; that's point number one. Point number two: we tend to operate databases configured so that the working set does not fit in RAM. The reason is that we have really good NAND flash, and if you have fast storage and your working set is cached in RAM, you're either using storage that's too fast or you have too much RAM. That's not always true, but since we have fast storage we want to use it, we want to do reads from it, and we try to use as little RAM as possible. The worst case for a B-tree in this kind of configuration is writing back pages with only one modified row per page; in that case the write amplification is the size of the page over the size of the row. A log-structured merge tree doesn't have that problem: we only write modified rows, and there are no pages on the write path. And finally, the doublewrite buffer in InnoDB doubles the write rate to storage.

For read efficiency: we have more data in cache, thanks to key prefix encoding and no fragmentation, so the cache hit rate is better and reads are faster. A bloom filter is especially effective if you have a workload that occasionally tries to read keys that don't exist.
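Two of the points above reduce to simple arithmetic. This is an illustrative sketch using the page and row sizes from the examples, not measurements from a real system: a compressed B-tree page is rounded up to the fixed on-disk page size, and the worst-case B-tree write amplification is the page size divided by the row size.

```python
import math

# Illustrative arithmetic for the fixed-page-size and write-amplification
# points above; the sizes are from the talk's examples, not measured.

def on_disk_size(compressed_bytes: int, disk_page_bytes: int = 8 * 1024) -> int:
    """Fixed on-disk page sizes: a compressed page is rounded up
    to the configured on-disk page size."""
    return math.ceil(compressed_bytes / disk_page_bytes) * disk_page_bytes

# A 16KB page that compresses to 5KB still occupies 8KB on disk,
# so 3KB of the compression win is wasted.
print(on_disk_size(5 * 1024))  # 8192

def worst_case_btree_write_amp(page_bytes: int, row_bytes: int) -> float:
    """Worst case for a B-tree: write back a whole page to persist one row."""
    return page_bytes / row_bytes

# E.g. a 16KB page holding one modified 128-byte row.
print(worst_case_btree_write_amp(16 * 1024, 128))  # 128.0
```

An LSM avoids both costs: blocks are written at their compressed size, and only modified rows, not whole pages, go down the write path.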
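To show how that works, here is a minimal stdlib-only Bloom filter sketch. This is not RocksDB's implementation, just the idea: a bitmap plus a few hash functions, where a negative answer is definitive and never requires a storage read.

```python
import hashlib

# Minimal Bloom filter sketch (not RocksDB's implementation): a bitmap
# plus k hash functions. A False answer means "definitely not present",
# so no storage read is needed; True means "go check storage".
class BloomFilter:
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive k positions by hashing the key with k different prefixes.
        for i in range(self.num_hashes):
            h = hashlib.sha256(i.to_bytes(4, "little") + key).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"user:42")
print(bf.may_contain(b"user:42"))   # True (no false negatives, ever)
print(bf.may_contain(b"user:999"))  # almost certainly False
```

Bloom filters never give false negatives, only a small rate of false positives, so lookups for missing keys usually cost zero IO.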
With a B-tree you might have to read a leaf page from storage to learn a key doesn't exist; with a bloom filter we avoid that read. Finally, we spend less on writes, so we have more IO capacity to spend on reads. And the last point is something called read-free index maintenance: for non-unique secondary indexes, there's no leaf page to read. InnoDB does read-modify-write for secondary index maintenance; MyRocks is write-only.

I feel like I have a personal brand at stake. Prior to MyRocks I didn't really have a product; I was always on the using side of a product, so it was easy to write about problems that I wanted to get fixed. I market bugs. Now I'm on a team that has a product, MyRocks, so the question is: am I being honest? It's a challenge; it's not as easy to be open about problems when you're on the product side. I'm working on it, but there are problems.

MyRocks is too hard to tune. The workaround for now is that we're improving the defaults. At this point I claim there are two options you need to set: the block cache size and the number of threads to use for compaction. Today we're not good at guessing those based on the hardware capacity, but if you set those two options you'll get good enough performance. I'm going to skip the detail on some slides because I'm talking too long.

We had an external user, an early evaluator, report a problem with sysbench: oltp.lua has some potentially long range scans, and under concurrency those were slow. We fixed the performance problem and doubled QPS. We had borrowed some code from InnoDB, and that code was a little bit broken; InnoDB didn't notice because they weren't using the feature we were using. It was a performance counter that was supposed to be sharded so that different threads would not compete on the same memory location when updating it. The brokenness meant all threads were updating array element zero, causing memory-system stalls; fixing that doubled range scan throughput. The next one is group commit.
If you turn on binlog crash safety, which is a good thing to have on, we lose 5% to 20% of throughput depending on your workload. This is in discussion, and we have plans to fix it this year; there's no code available yet.

Large transactions. All uncommitted changes are buffered in memory by MyRocks. If you try to insert one billion rows in one transaction, by default you're going to need enough memory to buffer that entire transaction. We have a commit-early option where, internally, we can commit before you call commit, which helps for bulk load. Today we have a limit on the maximum number of modified rows per transaction; we're switching that to a limit on the maximum amount of memory per transaction. And we have design discussions in progress to get better at large transactions. This is one of the challenges: RocksDB is meant for small transactions, but when you're building a generic database engine you need to support a variety of workloads, so this is one source of diversity we need to get better at handling.

When I report performance, I want to explain performance. The first number is throughput, transactions per second, on MySQL 5.6: MyRocks was significantly better than InnoDB with and without compression. I was a little surprised by that, but to me it's not the big deal. The second column, from iostat, is how many random reads from storage per transaction: about the same, all of them doing about one read per transaction. The third column is a big deal: how many kilobytes are written back to storage per transaction? InnoDB is 15 to 20 times worse, for the reasons I previously described. It's not always 15 to 20 times, it depends on the workload, but in this case it was pretty bad. CPU per transaction was similar. Database size: MyRocks is about half the size of compressed InnoDB and about one fourth the size of uncompressed InnoDB. The last column is one way of looking at quality of service.
What's the 99th percentile response time for a frequent transaction? Lower is better here: MyRocks was one millisecond, InnoDB was six milliseconds. So when I try to explain performance, I use this kind of graph or table: throughput, quality of service, and hardware efficiency. In this case MyRocks did a lot better.

The last numbers, and then I'm done: the value of write efficiency. Two results showing why MyRocks does better thanks to better write efficiency; doing less on writes leaves more for reads. The first result is the insert benchmark with an in-memory database, so there are no reads from storage. InnoDB in 5.7 is certainly faster than in 5.6, but the interesting result is that when you go from fast SSD to slow SSD, InnoDB loses about half its throughput, while MyRocks loses much less than half. I was happy with that result. Before that I had a LinkBench result on similar hardware, going from fast SSD to slow SSD to disk. The interesting result there is that InnoDB loses much more throughput than MyRocks as the storage gets slower. So again, better write efficiency allows better read performance.

That's all I have. We're going to hold questions until the end of Yoshinori's talk, so you can ask them then.