OK, can everyone hear me OK? I think that's a yes. All right, hi. My name is Morgan. I work for PingCAP as a product and community manager. We build a database called TiDB, which I also contribute to. It speaks the MySQL protocol. So here's what we're going to talk about today. This is FOSDEM, so I do want to focus on the tech side, but I'll start with a single-slide introduction to TiDB and PingCAP, since for some of you it may be a fairly new project. Then we'll get into the technical details, and later in the talk I'll cover benchmarks and MySQL compatibility. Because it sounds very promising, right? It's MySQL compatible. How hard is that to do? Turns out it's pretty hard, so we'll go through the details of exactly what's achievable and what's not. So, a little bit of overview. PingCAP develops TiDB commercially. The company and the project were started in 2015. Our three founders come from working as infrastructure engineers at some of China's largest websites. TiDB has been open source since day one under the Apache 2 license. The storage component, and we'll get into what that is in a second because it's a layered architecture, is a CNCF project; it was donated to the CNCF in 2018. The company expanded to North America in the middle of 2018, which is why it might be fairly new to you, and that's when I joined the team. But it's absolutely massive in China. I just got back from Beijing, where we had our annual developer conference with just under 800 attendees. Two other quick stats that are kind of interesting: there are over 300 production deployments of TiDB storing 15 petabytes of data, and there are over 200 contributors, I think just over 230, to just the TiDB server part of the project, which to my knowledge makes it the most contributed-to relational SQL database on GitHub, which is awesome. And the contributor count has been growing steadily since 2015.
OK, so let's get into the technical details. I'll start with a few disclaimers, quick points to help you understand it. TiDB speaks the MySQL protocol, but it's not based on any MySQL source code. It's a re-implementation: not a storage engine, not a fork, not even developed in the same language as the MySQL server. The MySQL server is written in C++; TiDB is developed in Go, and the storage part is developed in Rust. It's not a NoSQL database either. It's not loosey-goosey in the guarantees that it provides as a system; it fits into the buzzword that we call NewSQL databases. Just know that it's ACID and it's consistent, and it uses this layered architecture of components. The TiDB server is just the SQL processing layer, which is completely stateless, and the storage is what we call TiKV. Both of these are horizontally scalable, and you can scale them independently, which is really nice. Because the SQL processing is stateless, it's very easy to scale in cloud environments. The last clarification is that while it speaks the MySQL protocol, it's aiming to optimize for slightly different workloads than MySQL does. MySQL has very good support for OLTP, the simpler range of queries at high throughput. We're trying to optimize for the simpler queries and the analytical queries at the same time, in this emerging space that's called HTAP. So with those disclaimers, let's look at the architecture, a sort of minimal setup of TiDB. We have the TiDB server, which as we said is the SQL processing part that's stateless, and then we have TiKV, which is doing the storage. A few things to point out: TiDB will show you a global view of all of your data. But if you imagine a table as sort of like A to Z in the alphabet, A to C would be stored in one Region. That's our word for a shard.
D to F might be stored in another Region. And you can see the copies of the data are replicated inside of the TiKV system; we use Raft internally to replicate. I'll use some MySQL terminology and say that each one of these TiKV servers is both a master and a slave at the same time, but for different Regions of the data. So if we compare to MySQL replication, one thing you'll notice is that with this design it's easy to add extra nodes, because you can just incrementally shift 100-megabyte chunks of data onto new servers. And you get better utilization by each node being both a master and a slave. The only component I haven't mentioned yet is the PD server. PD, the Placement Driver, is like the manager of the whole cluster. It can also do things like notice that one of these Regions is particularly hot; say the Region 2 leader is particularly hot, it can move the leadership so that TiKV node number two is now the leader. It has several things it can do to rebalance: Region splits, Region merges, identifying hot Regions. It's a nice property to have the data split up into these chunks, with every node both master and slave at the same time. And it is a strongly consistent system. As I mentioned, if you modify data that exists on multiple Regions, it'll use a two-phase commit internally to make sure it updates consistently as you cross those Region boundaries. OK, so that was the simplified architecture. Actually, I'll mention one more thing. I like to think of this as a layer of storage and then a layer of protocols on top, where we just happen to be using MySQL; this is TiDB speaking MySQL to your application. With an architecture like this, one nice property is that you can add additional protocols on top. So we have a project called TiSpark, which allows you to have a Spark view into your data without having to route through TiDB.
And we've had community contributors contribute a Redis protocol to be able to store data in TiKV. And we're building native drivers so that you can connect to TiKV directly as well. So it's actually a similar architecture to MySQL NDB Cluster, though the use cases are a little bit different, and this sort of native driver would be analogous to something like the NDB API. OK, so having mentioned that, I did say that HTAP is the set of workloads we're optimizing for. This is the next generation of architecture for TiDB: to have our row storage in TiKV, which uses RocksDB internally to store the data, and then have TiFlash for columnar storage. Because we have our own TiDB server with an optimizer, we'll be able to figure out whether a query is best routed to the row storage or to the columnar storage inside of TiFlash, and we can do this on a per-query basis. We can already do some analytics that aren't possible inside of MySQL, even before this next-gen architecture, because TiDB already has parallel query, hash join, hash aggregation, and sort-merge join. So when we get to benchmarks, you'll see that it's already quite capable. But this next generation will help with more specific analytics queries, and it also relieves load, because you can have a bit of a quality-of-service issue if you're running analytics queries on the same data where you're running OLTP queries. OK, so let's deep dive on the TiDB server. If you're familiar with MySQL, I'd say it's actually fairly similar. A query comes in via the MySQL network protocol, it has to be parsed, it has to be optimized to figure out the best way to retrieve the data, and then it has to retrieve the data using the plan that the optimizer decided was best. Just a couple of small differences.
The MySQL optimizer doesn't have to figure out network cost; it just has to do CPU and memory costs. So TiDB has to do a little bit of extra work there. And as I said, TiDB has a few more capabilities in terms of execution methods, with its parallelism and hash joins and so on. And then it's retrieving the data from TiKV. MySQL has the same separation with the storage engine API; this is just a little bit more separated, because it's a completely different layer of technology. All right, so let's deep dive into TiKV. TiKV is the storage. And it says here that it's receiving requests from a client, which in this case would be the TiDB server; that communication is internally via gRPC. A simple way of thinking of TiKV is that it's RocksDB with a Raft layer on top to make sure it's replicated. That is a little bit of a simplification, because for efficiency we have some extra features. What I didn't show on my previous slide is that we can push down parts of the query execution into the TiKV coprocessor. So imagine a hypothetical query, say SELECT COUNT(*) FROM my_table. A naive way of doing that would be to retrieve all of those rows from TiKV, pull them up into TiDB, and then add up the total. An optimized way would be to ask each of these TiKV instances to give you the count it has, and then stream-aggregate that total inside of TiDB. And that is a lot closer to how it works with this coprocessor API. It understands many of MySQL's built-in scalar functions and aggregate functions, which allows us to effectively push down part of the execution of a SQL query into the storage directly. And it minimizes network transfer, which is certainly useful. Now, some tools in terms of what we offer. You can replicate from MySQL directly to TiDB. You can also replicate from TiDB to either another TiDB server or to MySQL.
Our tool to replicate out of TiDB is our binlog tool, and our tool to replicate from MySQL into TiDB is something we call DM, Data Migration. We also have an optimized loader that converts a MySQL mydumper file directly into TiDB and does things like disable the redo log. Currently we can restore about 100 gigabytes an hour, and we think we'll be able to increase that number over time, which helps your migrations significantly. OK, so let's talk use cases. It's sort of a generic product: it speaks the MySQL protocol, it looks like MySQL. I know you're all technical people; you want to know what's the best case and what are the catches. I'd say that today the two best use cases for looking at TiDB are these. First, you're approaching the maximum capacity of a single MySQL server, and you're trying to think about what's next: should I shard or shouldn't I shard? Depending on the vintage of your application, that may be easy or it may be difficult. Or you may decide as an organization that it adds extra complexity to your application, and complexity can make teams hard to scale, right? It's hard to scale on an organizational basis once you've invested all of this infrastructure into an app that's quite specific to you. The other use case that's kind of interesting: let's say you've already sharded MySQL, you've invested in your application to do that, but you have a hard time doing analytics on your data. The traditional way you handle this is you might use MySQL for your OLTP, and then use ETL, maybe within the hour or the half hour, to extract this information from MySQL and push it into some other system. If instead you have the data replicating directly into TiDB, you can query it with less lead time. Let me expand just a little bit on the first point. What's particularly interesting about approaching the maximum size is the case where you're a platform business, where there are many ways that you access the data.
Think of cases like an online marketplace: I have buyers and I have sellers. If I choose to shard by the buyers, then I can run queries very effectively on the buyers, but I might not be able to run queries on the sellers. That's a bit different by contrast to, say, a SaaS application, where maybe all of my tenants are very disparate, I guess would be the word to describe that. So particularly in these platform scenarios, I think it's very useful not to have to figure out all of the nuances of how you would implement sharding, and just to have a global view of the data. So I want to describe a customer that fits this use case. I don't believe they're in Brussels, but they're certainly very large in Berlin: Mobike. Mobike has nine million of these smart bicycles that you can unlock with your phone, across the 200 or so cities they operate in around the world. I think of unlocking and locking as a transaction, but they also want to run analytics: they want to find bikes that might be faulty, and they want to do that immediately afterwards. That sort of analytics to find a faulty bike is a query on the dimension of the bike, but if they want to find fraud, that's on the dimension of the user. So having a more global view of the data is more useful. You can't necessarily shard by city either, because they might want to allow users to move between cities. So being able to have a global view of the data, and to transact and then do analytics, is a really powerful use case. And obviously, to the point of how much data they're storing, this exceeds the size of a single MySQL server, so having this global system that can grow is very useful. OK. So yeah, I know benchmarks is a topic that interests you.
I want to show you some independent benchmarks that were run by Alex Rubin at Percona. I think they're really interesting, but I want to describe the methodology a little first, because it's not actually a recommended setup for TiDB. In this case, Alex set up a single MySQL server and a single TiDB server. TiDB is a distributed database; this is not something we're intending or optimizing for. But nonetheless, I do think it's interesting. The data set size is about 70 gigabytes. It's a real-life data set: he's imported US flight on-time statistics. To start with, he's showing the response time for running some analytics queries. This is response time, so a lower number is better. And you can see that TiDB, true to my claim, is good at these analytic-style queries. It routinely beats MySQL, except in the cases where you have a very low CPU count. So far, so good. The next benchmark is testing a single-row select, a point lookup, with sysbench. This is a throughput test, not a response-time test, so higher numbers are better. We can see MySQL is red, and it's routinely beating TiDB by five to ten times. I'd like TiDB to do better, but I think this is expected behavior. When you've got this layered architecture, it's a lot harder to compete with MySQL on its home turf of OLTP queries where the data set is entirely in memory. So I think the benchmark is useful. It's very useful because, if you are considering the first case of moving entirely to TiDB, it's a microbenchmark that shows you the worst situation you may encounter, and you can use it from that perspective. But if you're doing purely single primary-key lookups, which I don't think is necessarily indicative of all applications, you should expect this to be slower in TiDB. And the next benchmark has some commonality with the previous query.
The previous query was a point lookup; this is like a point-write query, writing an individual row. I think Alex did a good job of providing a disclaimer for why InnoDB was performing so much better here. When the data is in memory, RocksDB is not necessarily expected to beat InnoDB. And with TiDB being distributed, it doesn't have the concept of disabling the binary log, so it's not quite an apples-to-apples comparison. But I'll limit my disclaimer to that and say I don't expect that we should be performing as badly as this. This was really great feedback, so thank you to Alex and Percona. I think this is something we'll be working on in the next year. And with that, I'll get to compatibility, and then I'll see if we have time for questions. OK, so compatibility. We like to be transparent. I don't think it benefits anybody if we say that it just drops in perfectly, and then you try to use it and you find some issues. So here's the manual page. I'll just show a quick video if I can, to scroll through and show you an example of some of the sorts of things that are missing. OK, I apologize for that; for some reason it's not loading, but I do have a summary in my slides as well. The basic summary is that joins, subqueries, DML, DDL, all of the basic stuff works. You can query data as it's located in multiple shards, or as we call them, Regions. And we support all SQL modes that are available in MySQL, which I think is an interesting feat in itself, because it's incredibly difficult to do. We don't implement things like stored procedures, triggers, and events, and we don't have current plans to; that's actually quite a lot of effort. But we do plan to implement things like CTEs, views, and window functions. This is very important to our analytics story, even though our current compatibility target is MySQL 5.7.
We do plan to rebase to MySQL 8 in our upcoming 3.0 release. There are some nuances to this story; some features work a little bit differently. Two of the ones that I think will probably be the hardest are that auto-increment allocates in batches, and that optimistic locking means a commit can actually return an error. TiDB also has some recommendations around how many rows you update in a single commit. And my last point is just that most of the tools you're used to do work, but be reasonable in your expectations; innotop won't work, for example, and that's obviously very specific to InnoDB. And that's it. Thank you. [Audience question] Sure, I'll repeat the question: the question was, what's the isolation level? We report it as the same as MySQL, which is repeatable read, but internally it's snapshot isolation, SI. We also support read committed, but we recommend that you use repeatable read, which is SI. It's not the same repeatable read as in InnoDB; it's not the same. OK, do we still have time for questions? Any other questions? OK. Thank you.