All right, good. That's great. Cool. So, hi guys. My name is Omar. I work at TikTok, where I lead a team that mostly deals with payments, which is why we rely a lot on databases, particularly MySQL. We use MySQL a lot. And in this talk I wanted to talk a little bit about InnoDB. I'm curious, how many of you in the audience have heard of InnoDB? Awesome, that's good. And how many of you have used it? Probably everyone, right? Okay, cool. Pretty significant. Nice. All right. So let me give you a little bit of historical background. What is InnoDB? InnoDB is a storage engine for MySQL and also MariaDB, which is technically also MySQL. Since MySQL 5.5, which was released 13 years ago, InnoDB replaced MyISAM as the default storage engine. But it's interesting to ask yourself: how come MySQL has multiple storage engines? That's an interesting design decision. From MySQL's perspective, it's designed to work against a storage-engine abstraction, and different engines make different trade-offs: you can build a storage engine meant more for OLTP workloads or more for OLAP workloads, while the interface and some common components stay the same. That's the design philosophy here, anyway. So what does a storage engine do exactly? There are two big parts. One, it manages the in-memory structures, so that the pages your queries touch stay in memory as much as possible. Two, it manages the on-disk data structures. InnoDB has a bunch of data structures it uses to make all this work; it also manages the write-ahead log, among other things. I won't have time to go through everything because this is a lightning talk, so I'll just give the high-level picture.
So, as you guys probably know, in terms of the storage hierarchy, the fastest thing you have is CPU registers, and probably the slowest thing you have is network storage; hard drives and SSDs are somewhere in between, and DRAM is fairly fast as well. Non-volatile storage in general is relatively slow. I think everyone knows that. It's also cheap, so if you're storing tons of data, you probably want to store it on non-volatile storage, and you're probably okay with higher latencies. This is one of my favorite slides, and there's a website for it as well. I'm a bit of a performance nut, and if you're interested in performance, this is something you must know: what is the latency of a cache hit? L1 and L2 caches are in your CPU; they're almost as fast as register lookups. DRAM is around 100 nanoseconds. SSD, you can see, is a few orders of magnitude slower, HDD is another few orders of magnitude slower still, and network storage is the slowest of the lot. If you convert this to timescales we can understand: if a cache hit is one second, then DRAM is about 100 seconds, and SSD is about four hours. So fetching from your cache takes one second versus waiting four hours if you fetch from SSD, which is supposedly quite fast, and HDD would be about three weeks. This is something you must be aware of when designing any system that deals with a lot of data. Okay. Another thing we need to be mindful of: random access on non-volatile storage will almost always be slower than sequential access, so storage engines and DBs generally have to maximize sequential access. They tune their algorithms around this, because it's the fundamental bottleneck; it's not about algorithmic complexity, it's about making sure everything is sequential. All right. So the first concept inside InnoDB is the page.
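The "human timescale" analogy from the slide can be sketched in a few lines. The nanosecond figures below are illustrative, order-of-magnitude values chosen to reproduce the talk's analogy (cache hit = one second, DRAM = 100 seconds, SSD = hours, HDD = weeks), not measurements of any particular hardware.

```python
# Rescale rough storage latencies so that a cache hit corresponds to one second.
# All figures are illustrative order-of-magnitude numbers, not benchmarks.

LATENCIES_NS = {
    "L1/L2 cache hit": 1,          # ~1 ns, close to a register lookup
    "DRAM access": 100,            # ~100 ns
    "SSD random read": 15_000,     # tens of microseconds
    "HDD seek": 2_000_000,         # milliseconds
}

def humanize(ns: int, base_ns: int = 1) -> str:
    """Rescale so that `base_ns` (a cache hit) corresponds to one second."""
    seconds = ns / base_ns
    if seconds < 120:
        return f"{seconds:.0f} s"
    if seconds < 7200:
        return f"{seconds / 60:.0f} min"
    if seconds < 86400:
        return f"{seconds / 3600:.0f} h"
    return f"{seconds / 86400:.0f} days"

for name, ns in LATENCIES_NS.items():
    print(f"{name:18s} -> {humanize(ns)}")
```

Running this prints the talk's analogy: the cache hit maps to one second, DRAM to 100 seconds, the SSD read to hours, and the HDD seek to weeks.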
When you organize data, you organize it in pages. The reason for doing this is that when you write to non-volatile storage, the APIs provided are normally block-based: you work with blocks of data. To optimize for that, you want to structure the way you save data into blocks, because if you save one byte, you're actually writing the whole block, and the hardware page is normally four kilobytes. You don't want to do byte-by-byte operations, because then you're wasting a lot of throughput. So what databases do is organize storage into these pages. Page sizes can vary: database pages are usually between 512 bytes and 16 kilobytes, and MySQL's InnoDB uses 16 kilobytes. The page layout has a header in front that gives you metadata and bookkeeping information about the page, and then, usually, what's called a slot array. This is the slotted page design: the slots are pointers to where the actual tuples containing your data are kept. The reason for this design is that it makes it easy to support tuples of variable length, because tuples aren't always going to be fixed length. That's why the slotted page design is used. Another thing about InnoDB: the pages I just described usually have the same representation on disk and in memory, and you want to keep them in memory as much as possible, because, like I mentioned, you want fetches to be fast, and there's a huge latency difference between memory and disk. So normally InnoDB uses an LRU-based algorithm to keep pages in memory, and the larger your buffer pool is, obviously, the higher the speed-ups you're going to get, which makes sense.
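Here's a toy sketch of the slotted page idea just described. Real InnoDB pages have a much richer header and trailer; the made-up header format below just shows the core layout: fixed-size slots growing from the front, variable-length tuple data packed from the back.

```python
# Toy slotted page: a header, a slot array of 2-byte offsets growing forward,
# and tuple data growing backward from the end of the page. Fixed-size slots
# let you locate variable-length tuples. Not InnoDB's real page format.

import struct

PAGE_SIZE = 16 * 1024          # InnoDB's default page size

HEADER_SIZE = 4                # made-up header: reserved bookkeeping space

class SlottedPage:
    def __init__(self):
        self.buf = bytearray(PAGE_SIZE)
        self.num_slots = 0
        self.free_end = PAGE_SIZE      # tuples are packed from the back

    def insert(self, tup: bytes) -> int:
        slot_off = HEADER_SIZE + self.num_slots * 2
        if slot_off + 2 > self.free_end - len(tup):
            raise ValueError("page full")
        self.free_end -= len(tup)
        self.buf[self.free_end:self.free_end + len(tup)] = tup
        # The slot stores where the tuple starts; its end is implied by the
        # previous tuple's start (real systems store the length explicitly).
        struct.pack_into("<H", self.buf, slot_off, self.free_end)
        self.num_slots += 1
        return self.num_slots - 1

    def read(self, slot: int) -> bytes:
        start = struct.unpack_from("<H", self.buf, HEADER_SIZE + slot * 2)[0]
        end = PAGE_SIZE if slot == 0 else struct.unpack_from(
            "<H", self.buf, HEADER_SIZE + (slot - 1) * 2)[0]
        return bytes(self.buf[start:end])

page = SlottedPage()
a = page.insert(b"alice")
b = page.insert(b"bob, with a longer tuple")
print(page.read(a), page.read(b))
```

The point of the indirection is that tuples can be moved or compacted within the page without changing their slot numbers, which is what the rest of the system refers to.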
That's also why, if you give MySQL some memory, it will never give it back to you: it keeps this buffer pool and wants to use it as much as possible. So, to give you an idea of how the whole flow works: when you send a query, it goes to the execution engine. The execution engine comes up with a query plan, deciding how to execute the query. Once it has decided, it talks to InnoDB, which determines which pages it needs to fetch. Those pages might still be on disk, not in the buffer pool yet, so they're fetched from disk and loaded into the buffer pool; the query operates on them, and the result goes back to the user. Okay, so one thing about InnoDB in particular is that it's ACID-compliant. It follows ACID. How many of you know what ACID means? All right, cool. All right, that's great. So the A means atomicity, and the key idea is that with database transactions, either everything happened or nothing happened. You can imagine this is quite important for many applications. You know, the NoSQL movement was quite hip ten years ago, and eventually most NoSQL systems ended up adding transactions again, because transactions are critical for many, many use cases. Especially for something like payments, you can imagine transactions are extremely critical: if you have two database operations, you want to make sure either both succeeded or nothing happened, and we need atomicity to support that. Yeah, so the way atomicity works in InnoDB is that if you just fire a SQL statement without a transaction, it'll be auto-committed by default, and you can tune this if you want to. And then for transactions, there are COMMIT and ROLLBACK statements for you to commit or roll back. Another part of ACID is the C, consistency.
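Atomicity in miniature. The example below uses Python's built-in sqlite3 rather than MySQL/InnoDB so it runs anywhere, but the semantics shown, either both updates land or neither does, are the same guarantee InnoDB gives via BEGIN / COMMIT / ROLLBACK. The crash simulation and account names are made up for illustration.

```python
# Either both updates in the transfer commit, or neither does.
# sqlite3's connection context manager commits on success and rolls back
# on an exception, mirroring COMMIT / ROLLBACK.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(amount: int) -> None:
    try:
        with conn:  # opens a transaction; commit on success, rollback on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,))
            # Simulate a crash between the two writes:
            if amount > 50:
                raise RuntimeError("crash mid-transfer")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,))
    except RuntimeError:
        pass  # the rollback already undid the debit

transfer(80)   # "crashes": alice keeps her 100, bob stays at 0
transfer(30)   # succeeds: balances become 70 / 30
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

Without the transaction, the failed first transfer would have debited alice without crediting bob, exactly the half-done state atomicity rules out.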
What that usually means is: if you write data to InnoDB and then read it again, you get back what you wrote. That's consistency. You can imagine this can be quite tricky, because the server can crash while it's running, and if it crashes you might lose consistency: from the application's point of view you wrote the data, but it hasn't actually reached disk yet. So how is this managed? InnoDB maintains a structure called the doublewrite buffer, which is kind of a backing copy: pages are written to that buffer first and then to their final location on disk, so if either write is lost or torn, you can recover from the other. Separately, InnoDB uses a write-ahead log: if you crash, the log is replayed so you can recover. I won't go into this in too much depth. All right. One other very fascinating area that I'm a big fan of is isolation, the I in ACID. What does isolation mean? It controls the extent to which a transaction is exposed to other concurrent transactions, right? Because if you think about it, you could have the simplest database possible: single-threaded, with no protection at all. If you have a single-threaded database running on one core and you give it one query at a time, you don't need any locks, you don't need any protection for concurrency. This model can work, and some databases actually do that: Redis is quite famous for it, and VoltDB is designed with the same idea in mind. The problem, however, is that you also lose a lot of concurrency. Concurrency is quite important for databases, especially those that deal with disk, because with a single thread you'd spend a lot of time spinning, waiting for disk I/O. So we want to maximize concurrency, especially on modern hardware with lots of cores.
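The write-ahead-log recovery just mentioned can be sketched in a few lines. This is not InnoDB's redo-log format, just the core idea: the intended change is appended to the log and fsynced *before* the data pages are touched, so after a crash the log can be replayed to reach a consistent state. File name and record format are made up.

```python
# Minimal write-ahead log: durable append first, replay on restart.
# Replaying is idempotent, so it is safe to reapply records that did land.

import json, os

LOG_PATH = "wal.log"
if os.path.exists(LOG_PATH):
    os.remove(LOG_PATH)        # start fresh for this demo

def log_write(key: str, value: int) -> None:
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())   # record is durable before any data page is

def replay(data: dict) -> dict:
    """On restart, reapply every logged change."""
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            for line in f:
                rec = json.loads(line)
                data[rec["key"]] = rec["value"]
    return data

log_write("balance:alice", 70)
log_write("balance:bob", 30)
# Imagine the process crashed here, before flushing the data pages.
recovered = replay({})
print(recovered)
```

The log writes are sequential appends, which ties back to the earlier point: the recovery machinery is built out of the access pattern the hardware is fastest at.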
The problem, however, is that when you have concurrency and multiple transactions going at the same time, all operating on shared buffers and shared storage, a lot of interesting problems can happen. For example, your transaction is ongoing, and between your operations another transaction writes to the same object you're reading, and you get a dirty read. Or you might get non-repeatable reads, or phantom reads. I won't go into these concepts in too much detail because we don't have much time, but feel free to talk to me more about this afterwards. Isolation is very, very important, and InnoDB has four isolation levels; it's very important for application developers to understand which one to use and what makes sense, because this decision is left to the application. Right? So, the strongest level is called serializable. The guarantee here is that the results are as if the transactions executed one by one. In actuality they don't: InnoDB implements two-phase locking to still allow some concurrency, but it guarantees that the results you get are correct. The default, though, is actually repeatable read. Repeatable read does have an issue: you might get phantom reads. That can happen, but it's the default. The reason is performance: with serializable you need to do a lot more locking, and performance is going to suffer. If you can't live with phantom reads, you should use serializable. By default, most people don't; most people use repeatable read. One of the interesting things about this is that it's not very visible. You probably wouldn't notice it unless you dive into your data and see, oh wait, maybe the numbers don't add up, or something went wrong here.
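Here is a toy illustration (not InnoDB's actual MVCC implementation) of the non-repeatable read anomaly and why repeatable read is usually implemented with snapshots: transaction T1 reads twice, and between its reads T2 commits a change. Reading the live store gives two different answers; reading a snapshot taken when T1 started does not.

```python
# Two interleavings of the same schedule: T1 reads x, T2 updates x, T1 reads
# x again. Live reads expose the change (non-repeatable read); snapshot reads
# hide it (repeatable-read-style behavior).

store = {"x": 1}

# READ COMMITTED-style behavior: T1 reads the live data each time.
t1_first_read = store["x"]
store["x"] = 2                    # T2 commits an update between T1's reads
t1_second_read = store["x"]
print(t1_first_read, t1_second_read)   # values differ: non-repeatable read

# REPEATABLE READ-style behavior: T1 reads from its own snapshot.
store = {"x": 1}
t1_snapshot = dict(store)         # snapshot taken when T1 starts
first = t1_snapshot["x"]
store["x"] = 2                    # T2's commit doesn't touch the snapshot
second = t1_snapshot["x"]
print(first, second)              # same value both times
```

A phantom is the same kind of surprise, but for row *membership* rather than row values: a second identical SELECT matches rows that another transaction inserted in between.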
I mentioned phantoms in a few places; what I mean by a phantom is that your transaction selects some set of objects, and while your transaction is running, someone else inserts a new object that matches your criteria. So this kind of thing can happen. Read committed is another isolation level provided by InnoDB where the performance is better, but once again, it allows phantoms and also allows non-repeatable reads. A non-repeatable read means you might not get the same result when you read the same data again, because someone else mutated that data in between. For repeatable read, InnoDB normally uses snapshots: it copies data over and keeps different versions of it. In read committed and below, transactions are actually touching the same live data. And read uncommitted is basically "anything can happen". It's basically MongoDB: you don't get any transaction isolation, any of these problems can happen, and if you're using this, why are you even using MySQL? All right. Yeah. Then let's briefly talk about durability, because I'm also running out of time. As I mentioned earlier, InnoDB has a doublewrite buffer, and this helps us get durability. The doublewrite buffer is a storage area where InnoDB flushes pages: remember, there's a buffer pool, and when a page is dirty, it gets written back to disk. The doublewrite buffer helps you recover, because the page is actually written twice: once to the doublewrite buffer and once to its actual location. One of the nice things is that there are optimizations to minimize the cost of repeated fsyncs, so fsyncs end up not being as expensive, because you can batch everything into a large sequential chunk. With that, I'm out of time. Just nice. So, do you guys have questions? Thank you for the talk.
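The fsync-batching point can be sketched as a "group commit" toy: instead of one fsync per commit, buffer several commits' log records and flush them with a single fsync, amortizing the expensive sync across a batch. The class, file name, and batch size below are illustrative, not InnoDB's actual implementation.

```python
# Group commit sketch: one fsync covers a whole batch of commits.

import os

class GroupCommitLog:
    def __init__(self, path: str, batch_size: int = 8):
        self.f = open(path, "ab")
        self.batch_size = batch_size
        self.pending = 0
        self.fsyncs = 0

    def commit(self, record: bytes) -> None:
        self.f.write(record + b"\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.f.flush()
            os.fsync(self.f.fileno())   # one sync covers the whole batch
            self.fsyncs += 1
            self.pending = 0

log = GroupCommitLog("redo.log", batch_size=8)
for i in range(100):
    log.commit(f"txn {i}".encode())
log.flush()
print(log.fsyncs)   # 13 fsyncs for 100 commits instead of 100
```

The batched records are also written sequentially, so this optimization stacks with the sequential-access principle from earlier in the talk.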
I'd like to ask if there are any main differences between the InnoDB engine and other engines, or is it just very many small optimizations that make it better in some way? What is the secret? Are you comparing against another MySQL engine, or against any other storage engine? Another MySQL engine. Right, there is a huge difference. With MyISAM, I believe the philosophy is quite different: it's not as optimized for OLTP workloads in particular, as I understand it, whereas InnoDB is very much targeted towards OLTP: small transactions, but lots of them. So there is a huge, huge difference. The code base is completely different, and the design and architecture are also completely different. Now, if you compare InnoDB to other database engines, for example Oracle, which has its own storage engine, or DB2 from IBM, which has its own as well? Architecturally, definitely comparable. But those are enterprise products with much stronger performance guarantees, because they can spend a lot more engineering manpower. InnoDB is open source, so it doesn't get that much love. Yeah. Thank you. So I guess one of the other popular engines coming up in the MySQL world is MyRocks. Yes, that's a good one. So how do you contrast the two? Well, one of the things I know is that MyRocks is optimized for disk space: the trade-off is you may get slower reads and faster writes, and you need less disk space to store the same amount of data. However, certain types of queries are a lot slower in MyRocks, so you have to weigh the trade-off a bit carefully. We use MyRocks in situations where we have a lot of data and we want to optimize for disk space; MyRocks is pretty good for that. Hi.
I understand that the InnoDB storage engine is optimized for handling transactions in OLTP workloads. But if you have workloads where you need to handle, let's say, 100K TPS, or, for example, 100 TB of data, how do you do that with the InnoDB storage engine? I mean, if you're bottlenecked on a single machine, you can shard it. Obviously, especially if you have a heavy transactional workload, you probably can't get that much TPS on one machine. So there are solutions for sharding. There are open-source solutions; for example, YouTube had one, I forgot the name. Yes, Vitess. Vitess is pretty good for that. In our company, we have our own internal one. So I think most companies will do sharding of some kind. Well, that's all the time we have now. So I guess this is the most important time of the day: it's lunch. Thank you, everyone, for joining this morning.
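The sharding answer above can be sketched as a minimal routing function, the kind of thing a layer like Vitess provides on top of many MySQL instances: given a row's key, decide which shard it lives on. The shard names and count here are made up for illustration, and real systems use range- or consistent-hash-based schemes so shards can be added without remapping everything.

```python
# Hash-based shard routing: every query for the same key goes to the same
# MySQL shard, spreading load and data across machines.

import hashlib

SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(key: str) -> str:
    # Hash the key so that consecutive IDs spread evenly across shards.
    digest = hashlib.sha256(key.encode()).digest()
    idx = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[idx]

# Routing is deterministic: the same key always maps to the same shard.
print(shard_for("user:42"), shard_for("user:42") == shard_for("user:42"))
```

The catch, and the reason sharding layers are whole products, is that transactions spanning two shards lose the single-node atomicity InnoDB provides and need extra machinery such as two-phase commit.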