The Carnegie Mellon Quarantine Database Talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. Thank you, everybody, for coming. This kicks off the new semester of the Quarantine Database Talks. We're excited today to have Sugu, the co-founder and CTO of PlanetScale. Prior to starting PlanetScale, he helped build Vitess at YouTube, the scale-out MySQL infrastructure that they're now commercializing at PlanetScale. So again, we want to thank the Stephen Moy Foundation for Keeping It Real for sponsoring these sessions. And we'll do what we do every week: if you have any questions, please unmute your mic, say who you are, and just interrupt Sugu at any time. We want this to be as interactive as possible. Okay. Beautiful. The floor is yours. Go for it. Thank you very much, Andy. All right. Actually, Andy may not remember me. I think I first met him at a Percona conference when he was talking about electric sheep. Oh, okay. Yeah, okay. I'm terrible, yes. Sorry. Yeah. But yes, good to see you again. So a little bit about me. As Andy mentioned, I am the co-creator of Vitess. I have been working on it since 2010, so it's about 10 years now. In 2018, I left YouTube to start PlanetScale, mainly because Vitess adoption was growing and there was a need for a dedicated company to help work on it. I will talk more about that as we go. A fun fact about me: I was hired by Elon Musk. He probably doesn't remember me anymore, but he did hire me at X.com, and we later merged with PayPal. That was a fun adventure. I'm a big fan of consensus algorithms. This is also a little-known fact: I came up with the FlexPaxos idea right around the same time that Heidi Howard and Dahlia Malkhi came up with it, and we kind of came together. If you actually go read their paper, you will see my name quoted in there. I am not an academic at all. I'm a pure engineer.
I will tell you the story about how that happened. I'm actually right now working on what I would call a more unified approach, because there are things that can be generalized beyond FlexPaxos. I'm publishing a blog series; I'll tell you why I'm so excited about this and how I ended up working in this area. I'm also a big fan of Go. I've been working in Go pretty much since the time it was announced, and we decided to implement Vitess in Go. Vitess was the first project to go into production using Go, and for the longest time, we were kind of its highest-credibility showcase. Being able to say that every time you watched a video, you were using Go code set aside any doubts people had about Go's ability to run in production. I gave a presentation called "How to write a parser in Go." If you Google for how to write a parser in Go, mine will be the first hit, which is kind of cool, although I have no formal compiler background or knowledge. I did work on a C++ compiler back in 1987, I think, a long time ago. A proud moment was the last blog post that came out of Google open source. Rob Pike quoted me directly, saying, if you want to know why Go is good, please watch Sugu's video about it. It's super cool, you know, Rob Pike talking about my video. Anyway, so that's some cool stuff. Even though I'm an engineer, I am a fan of some academic people. I'm a fan of Stonebraker. I'm a fan of Eric Brewer. I'm a fan of Leslie Lamport, obviously. And I have also seen Daniel Abadi and watched many of Andy's presentations. So that's a little bit about me. Sorry, you had to say you like Mike and me like that. Sort of, no, Stonebraker. I have been a fan of Stonebraker since before, I would want to say, you knew him. I've been a fan of Stonebraker since 1997, when Informix acquired Illustra. I was at Informix when they bought Illustra, and I've been a big fan of his DataBlades.
But yes, actually, the Vindex part of Vitess is kind of a steal from the DataBlade idea, which is where you can plug in sharding schemes. So that's directly inspired by the Illustra project. Okay, so what is Vitess? Vitess is a cloud native database. What that means is that Vitess can comfortably run in Kubernetes. What is cool about that is that people are still afraid to run storage on Kubernetes, and Vitess is kind of a pioneer in this area. It is massively scalable; I will talk about what massive means, and by massive, we really, really mean massive. Highly available: five nines of availability is something that Vitess can comfortably deliver, no problem. And it is MySQL compatible, which means that it uses MySQL underneath and also speaks the MySQL protocol. So as far as you're concerned, it appears as if you're talking to a giant MySQL database. Actually, this is kind of a marketing slide: this is "what will Vitess do for me." Under the covers, I would say Vitess has three pillars. One is the query serving part, meaning sharded query serving: you give it a query, it knows how to send that query into a sharded system and get you the results back. On the other side is a cool innovation that we call VReplication, which we use for real-time materialization, resharding, and all kinds of data migration workflows that you can build on top of it. And the third is the cluster management part, and that is where my interest in consensus algorithms came about. I never knew about Paxos for the longest time. I think I might have learned about Paxos in 2013 or '14, and I started studying it. The one question that popped up in my mind was: well, people are saying that unless you run Paxos, you cannot really have distributed durability. But the fact of the matter is, at YouTube, we had master databases that were connected to something like 75 replicas.
Out of those, there were 15 that we used to fail over to, and we used to fail over quite frequently from one to another. And I hadn't heard of us losing data. So on one side, somebody is saying Paxos is the only way. On the other side, we are running this thing at massive scale, millions of QPS, and while we do have outages and stuff once in a while, we are not really losing data. That's when I got interested and started studying. And then I realized that we were running things with a different balance, and that's when I came up with the FlexPaxos idea: maybe Paxos can be improved. And then later, I get an email from Heidi and then Dahlia, and they say, hey, we are working on the same thing, we should join forces. What I'm doing now in that area is that I've found a few more things we were doing that don't exist in Raft or Paxos. So I'm coming up with something a little more general; I found some principles that were driving this design. I've started a blog post series about that, which is under PlanetScale. I just posted the first one, which is "Consensus algorithms at scale." Feel free to follow it if you are interested in this subject. All right. Continuing on: Vitess is my present. Vitess is a thriving open source project. It's actually growing; more and more adopters are coming on board. But more important than these numbers, the most exciting part is who the people using Vitess are. You can see there's an impressive list of companies, and some of these are really, really large scale. But the most staggering one by any measure is JD.com. Just to give you a number: on their last single-day sale, which is their equivalent of Black Friday, they finally revealed their peak QPS: 35 million QPS.
When people talk about millions of QPS per day... this I actually had to confirm multiple times, and they said, well, that's what we run, that's how much we serve. How many machines is that? It's all on Kubernetes, which is another miracle. About two years ago, they told me they were running about 40,000 pods of VTTablets. So my guess is that when they served that QPS, they might have been closer to about 100,000 nodes or so. But they're not the only one. Slack is pretty much fully migrated to Vitess, and I'm pretty sure they comfortably serve a seven-figure QPS. And there's Square, which is also a pretty big workload. From what I heard, they have grown manyfold in scale since the time they migrated to Vitess. So this is the Vitess architecture. I'll highlight just a few pieces here. First: what principles did we use to build this architecture? I would say we used three principles: simplicity, loose coupling, and survivability. Simplicity means there should be as few components as possible, and no less. In this case, we have VTGate, which is one component; we have VTTablet and MySQL, which is another; and there's vtctld. In the serving path, there is only VTGate and VTTablet. Loose coupling was actually the hardest part: each piece does its own work, and we minimize the surface area between components so that most of them work on their own. And survivability is basically a way of saying that there is no single point of failure: any component can fail, and the system will continue to operate no matter what. So those are the three principles. The way we have done this is that VTGate is mostly stateless, which means you can scale it up and down.
It speaks the MySQL protocol. You can put a load balancer in front, or the app server itself can do its own load balancing. It can connect to any VTGate, and to that app server, the VTGate will make this entire sharded system in the back appear as a single database. It will give the semblance of it. I mean, it's a leaky abstraction of some sort, but generally, if you're just sending queries and doing transactions and stuff, it'll look like a single database. The VTTablet itself is tightly coupled with MySQL. It's kind of a minder of that MySQL: it maintains a connection pool, and it does DBA-type things, like making sure that if a query runs for too long, it gets killed. It also does housekeeping work like taking backups and doing restores. So it's like a built-in DBA for each MySQL instance. And the way this system is tied together is that every time a VTTablet comes up, it registers itself with the topo server. The topo server could be etcd or ZooKeeper or Consul. As soon as it registers, the VTGates, which are watching, discover that there's a new VTTablet, subscribe to it, and start sending traffic. So this is, at a high level, how the system works. So how do these VTGates know how to split a query, how to send a query to each shard? There is metadata called the VSchema in the VTGate that I'll talk about. It is parallel to a schema, but the VSchema describes your sharding scheme, how your tables are sharded, and that's the information it uses to route queries. I'll talk a little bit more about that, since that is what this talk is about. Any questions so far? There's a big data replica in the shard. What is that? Oh, yes. That is actually something we had to do at YouTube. There were workloads that were directly served from the web, where users would interact.
We used to send many of those queries to replicas, not to the master database. But there were internal workloads running OLAP queries, and those OLAP queries would constantly hit these replicas. As soon as an OLAP query hit a replica, it would start lagging, and again, it would cause an outage. So what we did was categorize some of these MySQLs as being just for running OLAP queries, where you can slam any query you like into it. If it lags, we don't care, but it won't affect the main website queries coming from the users. That is what this big data replica is. And so they're being replicated through the MySQL log stream or what? It doesn't need to be 100% accurate; it can go stale. Correct, it can go stale. It may be lagging by half an hour or so. We had less tolerance for the actual replica queries: 10 seconds was the tolerance there. Whereas the big data replica can lag all you want; it's mostly for daily reports and that kind of stuff. And this is what we are doing at PlanetScale: basically wrapping Vitess in a warm blanket around the three pillars I talked about. We take over two of the pillars: we take care of resharding and materializations, and we take care of cluster management. The query serving path remains 100% open source. The idea is that you don't have to be locked in. If you don't like us, you can always pack your bags and move, but we hope you'll like our service so much that you'll continue to use us. And we make all this available like an RDS, which means that you come in, you click a few buttons, and you have a Vitess instance up and running. You don't have to worry about bringing it up. We give you an endpoint, and then you start sending queries. So very briefly, that's what we are doing with PlanetScale. We are actually the major contributor to Vitess, and we love our community and continue to push the project along.
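The VSchema he mentioned a moment ago, the metadata that tells VTGate how each table is sharded, looks roughly like this for a users table hash-sharded on user_id. This is a sketch of the format from memory; consult the Vitess documentation for the exact fields:

```json
{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        { "column": "user_id", "name": "hash" }
      ]
    }
  }
}
```

With this in hand, a query like `select * from users where user_id = 42` can be routed to exactly one shard without VTGate knowing anything else about the table.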
Beyond that, is YouTube still the other major contributor to Vitess? Not anymore. YouTube has not contributed to Vitess in two years. Since you left. Yes, I left. Even within YouTube, in the last year, all of 2017, I was the only person maintaining Vitess until I left in 2018. That was another reason why I had to leave: it was not scalable for a single person to maintain such a huge project. Beyond PlanetScale, who is the second biggest contributor? I would say there is a race. I have about five engineers contributing, and there's a race among all of them. As a company, I would say Slack is probably the second biggest contributor. Square is another big contributor. Others come and go: GitHub came in, did a slew of contributions, and went away. Nozzle did a slew of contributions. They keep coming and going. The difference is they are all selfish contributors: they need something, so they come in, do their features for themselves, and then they are happy. PlanetScale is the only unselfish contributor. Cool. All right. Seven hours. What does seven hours mean? It's how long it's going to take for me to explain how this query processing works. Obviously, I was kidding, but there is some truth to it. Somebody asked me about what we call the V3 design: "I want to contribute to Vitess. Can you explain how this query processing works?" I said, okay, I'll explain it. I started explaining, did one session, did another session, and it ended up being seven full sessions. But I would recommend you not watch it, because as you can see, the first session has 1,600 views, and the last session has 161 views. That is how boring these talks are. So it is not worth watching, but it does explain end to end how Vitess query processing works.
So I will try to package this up into something more palatable in the next few minutes. Oh, my God. This is not a good time for this to happen. It's okay. Maybe you're logged into YouTube with your Gmail account and it kicked you out? No, this is my monthly "you have to log in again." It just happened exactly at this time. Cool. All right. Okay, we are back on track. Obviously, it's not going to take seven hours. To explain how this thing was built, I have to go back in time and talk to you about where we were. I don't know how many of you recognize this image. If you don't, I would highly recommend you Google for "MongoDB is web scale." You will find an extremely hilarious video. It was hilarious because we were laughing at MongoDB then, but they laughed all the way to the bank is all I can say. Give me a shirt too. I have the shirt where the one guy says MongoDB is web scale. I have a shirt. You have a shirt. Okay. Yeah. But more importantly, it's important to know who I was in 2014. In 2014, I was an applications engineer. I was helping YouTube scale. I did not know anything about databases. I knew how to optimize queries. I knew how to do joins. I knew subqueries. I could do a correlated subquery if you put me on the spot. That's where I was. And the way we had sharded YouTube was to put the sharding logic in the application, where the application knew where to send a query. It would get a user ID, look it up, and say, okay, I'm going to send this select to that particular shard. So this is where we were. And the state of the art was not very advanced either. Distributed databases did not exist at this time. No, there was no CockroachDB. There was no Spanner. Distributed databases have existed since the 1970s. Oh, yes. Practical, I mean. In production. And open source. And open source. Yes. By the way, nothing has been invented in the last 40 years.
We have just been recycling those ideas in better form. Reimplementing IBM. Exactly. And you will see some of those things as I go. But the bottom line is that I was approaching this problem from the outside, and that is a disclaimer of sorts: I probably made a bunch of mistakes, and if you find them, please point them out; do not hesitate to call my work a piece of crap. I'd love it, because at the end of the day, if you find something that can be fixed, I want it fixed. But this is more like: this is where we came from, this is how we did it. So you may see some big holes in what I have done, but the fact is we did get it working, and that's the cool part. The way I reasoned about this is: if the app is sending a query to a shard, there is something the app is using to figure out where to send it. How does the app know where to send a query? That was the first question I asked myself. The app receives a request from the web, it has a user ID, and based on the user ID, it applies a hashing algorithm and sends the query to a shard. Well, if you're doing that, you're also including that user ID in a where clause. So I said, what if we flipped this around? What if you just gave me the query, and I saw that you have "where user_id = x"? I would apply the hashing and send the query for you. This idea was all super new to us; we were systems engineers who were worried about scalability. So the question became: oh, well, you got this query working, how about something else? How about videos? How about joins? I don't know. We should explore this idea. That's what you're thinking. And how are you going to write something generic that will work at YouTube scale?
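The "flip it around" idea above can be sketched as follows: instead of the app hashing the user ID and picking a shard, the router extracts the user ID from the where clause and applies the hash itself. This is a minimal sketch under stated assumptions: the regex-based extraction stands in for walking a real parse tree, SHA-256 stands in for the actual Vitess hash vindex, and the two-shard "-80"/"80-" layout is the smallest example of Vitess-style keyspace ID ranges.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"regexp"
	"strconv"
)

// extractUserID pulls the value of "user_id = N" out of a query.
// A real planner walks the AST; a regex is enough for this sketch.
var userIDRe = regexp.MustCompile(`user_id\s*=\s*(\d+)`)

func extractUserID(query string) (uint64, bool) {
	m := userIDRe.FindStringSubmatch(query)
	if m == nil {
		return 0, false
	}
	id, err := strconv.ParseUint(m[1], 10, 64)
	return id, err == nil
}

// keyspaceID hashes the sharding key into a 64-bit keyspace ID.
// (Vitess's hash vindex uses a different hash; SHA-256 is a stand-in.)
func keyspaceID(userID uint64) uint64 {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], userID)
	sum := sha256.Sum256(b[:])
	return binary.BigEndian.Uint64(sum[:8])
}

// shardFor maps a keyspace ID onto named shard ranges: with two
// shards, "-80" owns the lower half of the space and "80-" the upper.
func shardFor(ksid uint64) string {
	if ksid < 0x8000000000000000 {
		return "-80"
	}
	return "80-"
}

func main() {
	q := "SELECT name FROM users WHERE user_id = 42"
	if id, ok := extractUserID(q); ok {
		fmt.Printf("user %d -> shard %s\n", id, shardFor(keyspaceID(id)))
	}
}
```

The point of the flip is that the hashing logic now lives in one place, so every app that sends the same where clause gets the same routing for free.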
Those were the challenges. So I started studying. You know, if your dad asked you, can you draw me a house? This was my picture of how a database worked, my first diagram. I said, well, this is how a database works in my eyes. There's the CPU, there's RAM, there is a schema that tells you what tables exist, and there are tables, which are physical files, and there are index files. This was my view of how a database works. When a query comes in, the CPU looks up the schema and figures out if that query is valid. Using the schema, it knows the indexes; the indexes have pointers into the tables, and there are constraints. Then it loads things into RAM as it reads them, applying locking logic as needed for your transactions and such. So this was my high-level view. Of course, there are logs and other things I'm not concerned with, because the reason I drew this picture was to compare it with what I had in Vitess. So this is how Vitess works. Vitess has this VTGate, as you saw there, right? Behind VTGate, there's a bunch of shards, and then there are some lookup databases that point to these shards. What if I took the schema idea and built a VSchema that described how this sharding works? And now I'm saying, oh my God, these two pictures are starting to look alike. So maybe this can be the start of a distributed database. This is essentially how it was born. During those days, we debated using a centralized row cache. We never implemented it, but this architecture does allow for it. Maybe someday we will. Did you not do it for engineering reasons, or was there something fundamental? Because getting a row cache right is tricky. Oh, yes. We actually had a row cache built, but it was at the VTTablet level. The problem was that going through two hops to reach a cache didn't make sense.
So we deprecated that, with the intention of rebuilding it on this side. But we don't know if it'll actually work, whether it's solvable. It felt conceptually similar to this, so we think maybe there is hope there, but we haven't spent time figuring out how it would work. I do have hope, because I got a row cache working at the VTTablet layer. I think it will work as long as it's an eventually consistent cache; that's how I got it working. You have to basically accept that you are doing a dirty read. But it was a cache that was invalidated by subscribing to the binlog stream of the database. All right. So this idea gave me hope. And then what I decided to do was: I am going to learn how databases work. Guess what? There is not much information to read, or at least I could not find much. I mean, there were a lot of theoretical treatments of how an engine works, but they were too theoretical for me. I tried to read the MySQL code base, the Postgres code base, the SQLite code base. The problem is that the people who designed these engines didn't really write a design document about how they did it, or it was hard to find. I think I did find some literature on SQLite. But at the end of the day, my conclusion was that they were too centered on how the database itself worked and weren't applicable to what I was trying to build. So, okay, we must go deeper. I went deeper and started studying relational algebra. I actually got some good insights from it, but it wasn't enough. It was enough to get me started, but I had to come up with my own version, and that's essentially what I ended up with. That was my outcome. By the way, it took me about a month to absorb all this knowledge. I was mostly staring into space.
It was kind of funny. And I came up with this design document, which you can find. Let me see if I can pull it up. This is basically my self-brainstorm about what SQL is and what the problems are. See, there's "SQL for dummies," this is how a query works. And eventually I said: you need nine operators to satisfy any select query. This is very driven by, very influenced by, the select statement, but it is practical: if you have to send a query to a database, using these nine operators, you can satisfy any query. There is probably overlap between them, but practically speaking, I felt these nine were needed. And the other thing I noticed, the thing that got me really excited about the relational algebra part, is that these operators are all interchangeable. You can do a select before a filter, a filter before a select. I started building commutativity and associativity rules about them, because they all took results as input and produced results as output. That got super exciting. The only exception was the table scan. The table scan did not take a result as input. So at the leaf nodes of all these trees were the table scans that produced your first result, after which you applied these transformations. But there was a twist, which is the scalar primitives. SQL is not just relational operators; it has scalar primitives. And the thing that really, really sucked for me here was subqueries. A subquery can be reduced to a scalar, and my head just exploded. I could not contain this combined complexity. I just left it aside, and I will tell you how I managed to work with it. So essentially my disclaimer is: this may not be the best approach, but this is what I did to get this working in Vitess. But before I go into how I broke this problem apart, I had to give myself some design constraints.
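The shape he describes above, where every operator takes results in and produces results out, with table scans as the only leaves, can be sketched as a small interface. The type names and the single Filter operator here are illustrative stand-ins, not the actual Vitess op codes:

```go
package main

import "fmt"

// Result is a set of rows; every operator consumes one and produces
// one, which is what makes the operators freely composable.
type Result struct {
	Rows []int
}

// Operator is the common shape of the relational primitives: filter,
// select (projection), sort, aggregate, and so on.
type Operator interface {
	Apply(in Result) Result
}

// Filter keeps only the rows matching a predicate.
type Filter struct {
	Keep func(int) bool
}

func (f Filter) Apply(in Result) Result {
	var out Result
	for _, r := range in.Rows {
		if f.Keep(r) {
			out.Rows = append(out.Rows, r)
		}
	}
	return out
}

// Scan is the exception: it takes no Result as input. It sits at the
// leaf of the tree and produces the first Result.
type Scan struct {
	Table []int
}

func (s Scan) Produce() Result { return Result{Rows: s.Table} }

func main() {
	scan := Scan{Table: []int{1, 2, 3, 4}}
	even := Filter{Keep: func(r int) bool { return r%2 == 0 }}
	// Operators chain freely because input and output are both Results.
	fmt.Println(even.Apply(scan.Produce()).Rows)
}
```

Because input and output types match, reordering operators (the commutativity and associativity rules he mentions) is just rearranging calls in the chain.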
I hope I'm not running out of time. Okay, I think I still have time. Cool. The first design constraint, and some people have questioned me on this, but I think it has paid off, is that queries must be recognizable. If the application sent a query and Vitess decided that the query can essentially run unmodified on a single shard, or can be sent to all shards, then the query the shard receives should mostly look like the query the user sent. The other one is what I call a stable plan: if I determined that a query is potentially a multi-shard query, then even if all the rules for that query can be satisfied by a single shard, the plan I built was still a multi-shard plan. People challenged me on that: why would you make a query that could be satisfied by a single shard into a multi-shard plan, just because in principle it logically touches two potential shards? The reason was actually simple. You may be running this query in a test environment where there are only four shards, and during your test, it would say, ah, I can satisfy this query by running it in a single shard. Your production may have 256 shards, so suddenly your profile is going to look very different when your query runs in a fully sharded environment. Just to support that use case, I made the decision that it is safer to always treat a query that could potentially go multi-shard as a multi-shard query, because you don't want to get burned when you take that query into production. This is complex, right? What is the cutoff? Yes. The way I would do it is: the Vitess model is that we define a keyspace, and we have keyspace IDs, and every row has a keyspace ID. If two rows have the same keyspace ID, they are guaranteed to be in the same shard.
If they have different keyspace IDs, even if they happen to be in the same shard, I treat them as if they could be in different shards. That was the tipping point. But then when you execute it, you would actually use... Oh, it will go to the same shard, but from a transactional perspective, from a join perspective, it won't push that query down into a single join inside that shard. It will fetch one result and then fetch the other, because in production, you may actually end up doing it that way, and you want that behavior to be visible even when you are testing. Somebody asked a question: just to clarify, if a write query resolves to two different keyspace IDs, will it execute as a distributed transaction without full atomicity guarantees? So at the transaction level, it still merges them together as a single transaction, yes. But most likely, if you take that into production, you are likely to split it into two transactions; they will become different transactions, a distributed transaction. And Vitess does not support 2PC very well yet. We have an implementation, and some people who have tested it say it is good, but I haven't seen anyone use it in production. I won't go too much into the reasons, but the main one is that Vitess allows you to group related things into a single shard, and that's how most people have avoided distributed transactions in Vitess. The other decision I made is that planning has to be a single pass: I start with the query, and I go, go, go, and by the end, I am done; my plan is built. I never trace back, I never redo. This was actually a weakness of mine. The reason I did it this way was not actually a design constraint; it was how I implemented the symbol table.
When I tried to understand the symbol table, I realized that it has so many scopes. When you analyze the from clause, that is a scope; when you analyze the on expression, that is a scope; the where clause has its own scope. So the way I designed the symbol table was that it evolves over time. The symbols it resolves are always based on the latest scope it has entered. Once you have entered the latest scope, you cannot ask it about something earlier: when you're resolving the symbols of a group by, you cannot go back to the scope you were in when you were analyzing the from clause, because when you analyzed the from clause, the symbols you saw were very different. So to simplify the design of the symbol table, I made it a single pass, which means that as I kept analyzing, I analyzed in a very specific order where the symbol table kept growing and changing its behavior, and the only way I could do that was by just moving forward and never going back. I don't know if that makes sense. It is probably a mistake. One of our newest people, Andres, is saying, no, you didn't need to do this; I'll give you an alternate design. So I'm waiting for his proposal to make this a multi-scoped symbol table. And the bigger complication I was dealing with, that I was still trying to wrap my head around, was: what does it even mean to have a subquery? The subquery references symbols, those symbols can refer to the outer query, and that subquery can be inside the where clause. If it's inside the where clause, it has its own scoping rules. This is all I could deal with. I said, okay, I'm locking this down, keeping this simple. And this allowed me to get past the problem of resolving symbols. Another cool thing I did, I think this was cool,
though now we are changing it, is the question that I would ask myself: if you just threw a query at me, can I describe what it does without knowing the schema? The answer is mostly yes. Except if you did something like select star from a join b, or select col from a join b, I wouldn't know whether that column came from a or b. But if you said select a.col from a join b, then I would know what it is. So I said the query has to be self-explanatory, which means that if there is ambiguity, you need to resolve that ambiguity by qualifying the column names with the table name. Using that, the constraint I made was that VTGate does not need to know the schema of the target shard. It needs to know what to do with a query without knowing the schema. And we got this working. It works. There are some queries that still fail, and there are some experiences that are not great. For example, if you did an insert, you had to give the column list; otherwise, I wouldn't know what goes where. So there were some things that were difficult. So at VTGate, an SQL query shows up, and then you send another SQL query to the shard? The SQL query comes from the app server, it goes to VTGate, and VTGate has no knowledge of what these VTTablets have. So you're not manipulating the SQL query, the way the MemSQL guys, and I think the Citus guys, do something similar? You're basically just routing it, like, you should go here. Exactly. Yeah. I am trying to figure out where to send the query, and can I figure out where to send the query without knowing the schema? Yeah, okay. So that was the constraint. We got about 90% of the way there. For the other 10%, the application people had to rewrite their queries in such a way that VTGate could figure out what to do with them without knowing the schema.
But now we are changing this: the VTTablets are going to start sending their schema to the VTGates so that they can be smarter about how to route queries. All right. And finally, this was my final out: if it was too difficult, I said, well, I don't handle it. So we had this escape hatch, and there is still a bunch of queries that we don't handle. We actually just made a decision that we have come far enough that we can basically shoot for the finish line. We are going to make 100% of the queries work, because we know how to make all of them work using this architecture. So we are going for the finish line here. Okay, so back to here. What strategy did I use? The only replacement I made was that a scan is replaced by what I call a route. What is a route? A route is a special primitive that works like a scan, except that it can do all nine of those operations, because at the end of the day, what we have underneath is a full relational engine, right? Which means that you can make a route do the work of any of those nine primitives. What are the restrictions? A route can only send a select query. It can only send SQL, which means you cannot give it primitives; it doesn't know how to deal with primitives. It can only execute a select statement against a database. And the scope of a route is limited to a single database. So with that, I replaced scan with route. Initially, I only implemented join and left join. Later we implemented select, aggregate, and sort. We still have not implemented filter, merge, and scalar. And you would be surprised at how many of the existing workloads we can handle. The reason is that we have MySQL behind us. Anything that we don't know how to do, we outsource to MySQL and make MySQL do the work; only what MySQL cannot do, we do, right?
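The route primitive's restrictions can be captured in a tiny sketch. This is not Vitess code, just the shape of the idea: a route carries SQL text scoped to one database, and everything pushed into that SQL gets outsourced to MySQL.

```python
# A route is scan-like but can stand in for any relational primitive,
# because all it does is hand one SELECT string to one database.
from dataclasses import dataclass

@dataclass
class Route:
    keyspace: str  # a route never spans databases
    query: str     # it carries SQL text, never sub-primitives

    def execute(self, send):
        # `send` stands in for the shard-level executor
        return send(self.keyspace, self.query)

r = Route("user", "select id from a where id = 5")
rows = r.execute(lambda ks, q: [("row-from", ks)])
```

Because the route's payload is just SQL, the full relational engine underneath (MySQL) does the heavy lifting, which is why join and left join alone got so far.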
So we have been able to get away with VTGate not being a full engine for all these years, with all these people going into production and running with us, just because we have a full relational engine behind us. So this is kind of my poor man's view of how a traditional engine does its planning. There is SQL. It converts it to an AST, takes the AST, builds some relational op codes, and then runs through an optimizer and builds some DB-specific op codes that are not in the original relational list. The way I changed this was: I start with SQL and build an AST. The way the Vitess AST is designed, SQL and AST are reversible. I can take SQL, build an AST, and from the AST I can go back to SQL. But from the op codes you could not go back to at least a decent SQL, because they were a little more unwieldy. Even if you could, I didn't have the confidence that the SQL one would generate, the correct SQL representing the operation of those op codes, which may have selects from selects and that kind of stuff, would be something an engine would optimize well. So the way I designed it was that I would go from SQL to AST, and while I built the op codes, I would keep the AST elements that brought about those op codes together with them. And at the end, when I did the final generate, that AST would generate back the SQL. So that was the high-level approach. In reality, there was only one such hybrid op code, which was the route. Everything else was pure relational op codes. That's a realization we had much later. Okay. So the other question was: when I got a query, how much of the work should we do and how much should we give to MySQL? The answer here was simple. If MySQL can do it, give it to MySQL. I mean, 25 years of tuning has gone into making MySQL really, really good. So do not reinvent the wheel.
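The reversibility property mentioned above can be illustrated with a toy AST. This is a deliberately minimal sketch, nothing like the real Vitess parser: it only shows the generate direction, where an AST node carried along with the op codes can print itself back out as SQL at the end.

```python
# Toy AST node for a one-table SELECT that can regenerate its SQL.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Select:
    columns: List[str]
    table: str
    where: Optional[str] = None

    def to_sql(self):
        sql = f"select {', '.join(self.columns)} from {self.table}"
        if self.where:
            sql += f" where {self.where}"
        return sql

ast = Select(["a.id", "a.a1"], "a", "a.id = 5")
ast.to_sql()  # "select a.id, a.a1 from a where a.id = 5"
```

Keeping the AST fragment next to each op code is what lets the final route emit clean SQL instead of reconstructing it from unwieldy op codes.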
Give the work to MySQL as much as possible. I call this pushdown. If it is single-shard, or a single keyspace and everything is there, you push the whole thing down. The only time you break a query up is if it's a cross-shard join, in which case you have to split the work and only do the leftover wiring-up yourself, which is the third strategy. So, determining the target shards. There are some easy answers, right? If somebody sends me a select query for an unsharded database, which means there is only one shard, it's a no-brainer: just send it to the unsharded route. If you send select star from a sharded table, this is also a no-brainer: basically send the query to all shards and return all the results. Very simple. Here is another simple one. The way the VSchema works, it allows you to define what people in the key-value world call a sharding key, but we are relational people, right? I call it the primary index, the primary Vitess index, which is like a primary key, which means it has all the cool properties of a primary key: every table has to have one, it has to have a unique value, it has to be non-null, all those cool things. Which means that the value of the primary index must be specified on an insert, because that will be used to identify where this row is going to live, and that's where the row goes. But there is a difference, because at the end of the day, a primary index is not the primary key. That's an extremely important property. It is the place where the row lives, but in reality the primary key could be different. And why is that important? Because if you have a master-detail relationship, like users and their orders, you want the orders to live with their users. The order table has an order ID and a user ID. The user table has a user ID. So the primary index for the user table will be user ID.
The primary index for the order table will also be the user ID, even though it's not the primary key. That will ensure that the orders of a user live with their user. So if I then issue a join of a user with an order, the entire join can be sent to that shard. That is basically the secret behind Vitess sharding. Does any query turn into multiple queries? When you go to your own op codes, are you breaking that apart and routing individual things, or is it always either unsharded or these three modes? That's a very good question: how much massaging of the query are you doing? I try to do zero massaging. My goal is to preserve the original query as much as possible. The SQL comes in; I'll actually show you an animation at the end. The thing tries really, really hard: do I really, really have to break this query up into two parts? If I don't have to, I am not going to do it. If there is, for example, a subquery, and it is correlated, and the correlation is on the primary index, that subquery need not be broken out. And for a join, it's not just the primary index: Vitess has a primary index, and it also has secondary indexes, which are cross-shard indexes, and those can be unique. If they are unique, that means a join on a unique secondary index is still a single-shard query. Vitess works very, very hard at making sure that your query is never broken into parts. Only when forced to does it do so. Somebody actually threw me a challenge. Looking back, I think they would have won, but they didn't win that time. It was: oh, do you think you can figure out every query that I believe is a single-shard query? I said, yeah, I think I can do it.
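To make the colocation point above concrete, here is a small sketch. The hash function here is illustrative only, not Vitess's actual vindex function; the point is simply that when users and orders both shard on user_id, they land on the same shard and a join on user_id never crosses shards.

```python
# Both tables use user_id as their primary index, so a user's row and
# all of their order rows hash to the same keyrange shard.
import hashlib

SHARDS = ["-80", "80-"]  # two keyrange shards, as in a Vitess split

def shard_for(user_id):
    # first byte of a hash of the sharding key picks the keyrange
    h = hashlib.md5(str(user_id).encode()).digest()[0]
    return SHARDS[0] if h < 0x80 else SHARDS[1]

uid = 42
user_shard = shard_for(uid)                        # where the user row lives
order_shards = {shard_for(uid) for _ in range(3)}  # orders key on user_id
assert order_shards == {user_shard}                # join stays single-shard
```

Had the order table used order_id as its primary index instead, the user-to-orders join would generally have to span shards.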
But the use cases that they had, Vitess actually satisfied; there were just some cases where the application knew more than the metadata in the VSchema. Can you give an example of what aspect made it difficult to route? So the aspect that made it difficult to route: there are foreign key relationships that are maintained by the application, for example. The application knows that this is a foreign key to that, and if you don't specify that as a relationship, then Vitess doesn't know. Pinterest actually had that situation. What they did was they created a fake unique index on those two columns that never actually resolved, and they used that index only for joins. And they managed to convince Vitess of those joins. That's a cool thing they did. All right, I have five minutes left, let me rush quickly. These are some more examples; I'll skip through this. Let me give you a cool animation. Here is a query where the input is a select statement, and I build an AST from it. The way this AST is represented: the FROM clauses are at the bottom, this is your ON clause, this is your WHERE clause. On the top right is your select list, and these are the post-processing clauses: GROUP BY, HAVING, ORDER BY, and LIMIT. So this is the AST that we built, and this is the process I go through. The best model I can describe to you is: think of this as a Pachinko machine, where you take these AST elements and throw them into this Pachinko box of primitives. The AST elements try to make their way into one of the leaf nodes. If they get stuck, they become a primitive themselves. If they don't get stuck, they will eventually end up in one of these routes. So in this case, I start with my analysis. I say, okay, here is my select. I'm starting with the FROM clause. And the FROM clause is a recursive one, because a join has a left-hand side and a right-hand side.
So the left-hand side says: I have analyzed your A. What can I derive from A? The only thing I can derive is that it's basically a full scatter query of all rows of the table. That's all you can do right now with the information you have. Okay, so let's build a select statement which scans all rows from A. So my route is an op code, a primitive. As you know, this is a super primitive that can perform any SQL. And yes, it's a scatter, which means you send this query to all shards. User is the name of the database. The same analysis happens for B, and B says: here is a route, a full scatter scan of table B. The two actually meet at the join. So the join says, oh, I need to put these two together. At that point, the join looks at the join condition, and it says a.id is equal to b.id. This is a join on a primary index, which means these two routes can be merged into one. So it says, all right, I'm going to put them together and push this join down. And now I've rewritten the select AST, where the join of A and B, along with the ON clause and the select, have become one. This has now become a single route. What if I had a join that was not on a.id equal to b.id, but on some random column, some completely unrelated expression? Then what it would have done is the Pachinko machine, right? It tries to push it down, but the two routes cannot be merged. Therefore, I now have to create a new primitive, a join primitive, which will then have to perform this join on a.id equal to b.id. This is where another one of the coolest inventions we made comes in. Should this join primitive evaluate a.id equal to b.id itself? We said no. What we are going to do is push this entire ON expression down into the RHS, the right-hand side of the query. So I'm going to push this a.id equal to b.id through, and that route is going to do it.
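The merge-or-join decision just described can be sketched as follows. This is a hypothetical illustration, not Vitess code: if two scatter routes in the same keyspace join on the primary index, they collapse into one route; otherwise VTGate builds a join primitive that feeds values from the left route into the right query as bind variables.

```python
# Merge two routes when the join is on the primary index; otherwise
# fall back to a VTGate-level nested loop join primitive.
class Route:
    def __init__(self, keyspace, query):
        self.keyspace, self.query = keyspace, query

def plan_join(left, right, join_col, primary_index):
    if left.keyspace == right.keyspace and join_col == primary_index:
        # single-route case: push the whole join down as one SELECT
        return Route(left.keyspace, "select * from a join b on a.id = b.id")
    # cross-shard case: nested loop join, one right-hand fetch per left row
    def join_exec(run_left, run_right):
        out = []
        for row in run_left(left.query):
            # each left row supplies the bind variable for the right query
            out.extend(run_right(right.query, {"a_id": row["id"]}))
        return out
    return join_exec

a = Route("user", "select id from a")
b = Route("user", "select id from b")
merged = plan_join(a, b, "id", "id")        # merged into one route
split = plan_join(a, b, "other_col", "id")  # becomes a join primitive
```

The bind-variable handoff in the fallback path is the same mechanism described later for resolving symbols like a.n across routes.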
But this is invalid right now, which is why a.id is highlighted in red: there is no a.id in this route. So we say we'll resolve it later, but for now we push it down. I call this the rightmost rule, which means that any sub-expression that needs to be pushed down into a route is always pushed to the rightmost route that refers to it. And here I did something else important. I was going to say this is now a single-shard query, because it's essentially b.n equal to a value, but no, it's not a single-shard query. It's a lookup query: there is a secondary index, so I need to use that secondary index. So this plan has improved; it is not a full scatter anymore. It basically looks up a secondary index and then uses that here. Now I push a.id equal to 5. And a.id equal to 5, as you see, did not end up on the right-hand side, because the rightmost rule says that a.id equal to 5 belongs to the left-hand route. And as soon as I push this, I say, hey, I've added a constraint, which means I can probably improve the plan. My plan, which used to be a scatter query, now becomes a single-shard query. I'm accelerating because I'm unfortunately short on time; a lot of time could be spent on this. Then I go to the select list, and I just keep pushing the expressions down. In this case it's actually not a straightforward pushdown: a.id and a.a1 go to the left, but b.b1 goes to the right. And this is essentially a nested loop join. It's going to fetch rows from the left-hand side, and it is going to supply the value for a.n to the right-hand side. And the final wiring I have to do is, after I have built all this, I go through these things and look for symbols that are unresolved.
And when I look at an unresolved symbol, I say, oh, a.n is unresolved. Okay, let's go find out where that a.n is. By the rightmost rule, I know that it's one of the left-hand side expressions. So I go look at the left-hand side expressions and say, okay, do you have a.n? You need to supply that to me. We find it in Route 0, so Route 0 provides a.n. And the join says: when you get a row from here, populate the value for a.n, we call it a bind variable, a_n, and supply it there. And then this executes the join. It's not very efficient if it's a pure nested loop join, but that's what we do for now. We are going to make some improvements, one of which is to change this to a block nested loop join in one case. There are cases where, if this constraint is likely to scan the entire table on the right-hand side, why not do a sorted merge? That's another thing we are looking at. Hash join is another one we are going to do. So now that Andres, who has done these things before, is here, he's suggesting all these cool improvements. Just to understand: this would execute as two queries, plus something in Vitess doing the nested loop join? So the way this works, the green stuff is VTGate. The query comes to VTGate, and it says, I'm going to do a join. This is actually a horrible plan if it's an OLAP query; it's fine if you're fetching one or two rows, because for every row it gets from the left, it does a lookup on the right. And it turns out there are people who want to fetch just one or two rows like this. So VTGate is doing more than routing, right? Yes, it's doing a join. It's actually both a router and an engine. It works hard at being a router, and it's a very horribly inefficient engine, but one that can still satisfy your query. And VTGate is doing all this without knowing the schema, right?
And any other information regarding the underlying shards and so on? That it knows. VTGate has the VSchema. It knows which shards there are and the shard ranges for each, so it knows that a query has to be sent to this shard. It knows, for example, where a table lives: it knows that a table is in this keyspace or that keyspace. And based on the WHERE clause, it knows that it needs to send this query to that shard of the keyspace. That is what it knows. And that's the reason why you see a.id here: if you had just said id, VTGate wouldn't know whether that id came from a or b. Actually, there is a little bit more. I talk about the symbol table, but I would say it is not worth going into. This part is actually what killed me: the data model. The juggling of three data structures, three different worlds of data structures pointing to each other, was, I would say, extremely painful. But this is essentially the simplified data model of what I had to use to build this engine. I can answer questions more specifically here if you want. But if you think it through from basic principles, each of these relationships makes sense. So that is all I had, presentation-wise. Some links here. I'm open to questions. Okay, awesome. I will applaud on behalf of everyone else. So Hardek, there was a question you had during the talk. You want to ask that now? That person's still here? Yeah. So my question is not related to the query plan, but to when you initially touched on the big data part. Basically, what is your view on using MySQL for OLAP versus using databases which are optimized for OLAP? And at what scale do you say, okay, don't do it in MySQL anymore, it is not cost effective? So actually, that's an awesome question.
You can actually quote me saying completely contradictory things about this at different points in my life. There was a time when I said OLAP queries are best served by OLAP systems, columnar systems. And at some point I said, why, you can use MySQL to serve any OLAP query, and I proved that at YouTube: we actually used to run MapReduces on these MySQL instances and satisfy anything that people wanted. Well, it is true. So in other words, you can use MySQL to do OLAP queries, but I don't think it was the best use of MySQL, not in all those cases. But today, looking back, the answer is that there is actually a category of queries that even OLAP systems cannot satisfy, because of the fact that they are columnar. The big problem they have is they cannot do real time. If you're doing high-QPS throughput and you want real-time reporting, you cannot get that from an OLAP system. This is where the second pillar of Vitess comes in. What Vitess can do is materialize your OLAP data into, I'm trying to find the correct term, aggregated results, in real time. For example, it can keep a table up to date with your total sales per product. That kind of stuff Vitess can do. Those use cases actually fall between a full OLTP system and a full OLAP system. There's an in-between category, and it's a big gap, and many people actually hate both worlds, because they can get neither an OLTP system nor an OLAP system to answer those queries. Those, I believe, fall to us, where we can fill that gap with real-time rollups. That's the word I was looking for. It can do real-time rollups for you, using which you can build dashboards and stuff like that. But there is a lot of overlap, right? Using these real-time rollups, it encroaches into OLAP territory.
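The real-time rollup idea can be sketched in a few lines. This only shows the shape of the technique, not how Vitess's materialization actually works: as OLTP rows stream in, an aggregated table is updated incrementally, so reads are instant instead of requiring a scan.

```python
# Keep a "total sales per product" table up to date as events arrive,
# instead of aggregating the raw rows at query time.
from collections import defaultdict

rollup = defaultdict(float)  # product -> running total of sales

def apply_event(product, amount):
    rollup[product] += amount  # incremental update on every new row

for product, amount in [("a", 10.0), ("b", 5.0), ("a", 2.5)]:
    apply_event(product, amount)

rollup["a"]  # 12.5, available in real time for a dashboard
```

A columnar OLAP system would answer the same question by scanning, which is fast but not real time under a high-QPS write load; the rollup trades scan power for instant freshness.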
So if a single system can meet all your requirements, you can avoid the complexity of running a separate OLAP system. Also, one more benefit I feel is that if you shard enough, you can actually get the power of parallel processing for OLAP queries. Correct. I think the bottom line is efficiency. Speed is not a problem: if you can provision for it, you can shard it wide and it can go screaming fast. It's the efficiency. The OLAP engines are going to do compression and a bunch of things that MySQL is not going to do well. I mean, you could use the MariaDB ColumnStore stuff, but I don't know how that compares against Vertica or the bunch of other specialized OLAP systems. All right, we have time for one more question. I apologize for being late because of the class. Anybody else? Okay, yeah, I got one question. Go for it, Mayna. So is the AST transformed back to SQL at VTGate? Yes, we actually convert SQL to AST and back in multiple places in Vitess. VTGate is only one of them. VTTablet has its own query processor and does its own things, because it can do cool stuff. There are features I didn't talk about that require VTTablet to parse the query too. So we convert back and forth between SQL and AST all the time in Vitess, in multiple places. Is transforming back to SQL required by the MySQL instance? Yes, MySQL only understands SQL. So we have to convert it back to SQL, and MySQL is going to do its own parsing. But MySQL's parsing is hyper-efficient; it is an almost negligible cost compared to all the other work it does. So it has never bothered us. Got it, thanks. So this sounds similar to what MemSQL does: a query shows up, they figure out how to break it up and send it to the shards.
And then they convert it back to SQL, but they have their own SQL dialect when they send it to the shard nodes that tells them to do things like don't send data back to the coordinator, in your case the VTGate, but send it to other nodes in the cluster to do sideways information passing. So I will ask you one last question that I ask everyone else. How stupid are your users? And you can be as vague as you want or very specific. Are you shocked at the stupidity of people using Vitess? Or do you find that your users are somewhat sophisticated, because they need to know they need something like Vitess before you even start talking to them? They come in all categories. But the way I would qualify stupid is: many of these users come in dealing with legacy decisions that were made before. Very often, the request feels stupid. But then you go back and ask, why are you doing this at all with your query? And they say, I can't change it. It's been there and I have to make it work. That's fair. Okay, awesome. Again, I thank you for staying late with us. I really appreciate this. This is super awesome. This is good exposure to what Vitess is doing. It's interesting to see what you can do purely at the router level, before you touch the data system. This is a different way to think about data. This is exciting. Again, thank you for being with us.