All right, it's another quarantine talk, because again, we're still under quarantine. We're excited to have Karthik from Yugabyte here with us speaking today. He is the co-founder and CTO of Yugabyte. Prior to this, he did his undergrad at IIT Madras and a master's degree at UT Austin. He was one of the original developers of Cassandra at Facebook, and was also on the team that brought HBase into Facebook. Now he's the co-founder of Yugabyte, and they're working on making Postgres actually be distributed. The way we'll do this talk is that, like every week, if you have any questions, unmute yourself, interrupt, say who you are and where you're coming from, and then ask your question. Okay? All right, Karthik, the floor is yours. Go for it.

All right, thank you. Thanks, Andy. Hey guys, thanks for all being here. Maybe we just jump right into it. The rough overview of how the talk will go is as follows: I'll give a five-minute-or-less intro to what Yugabyte is, for folks who may not be aware, and then I'm going to go right into every decision point we hit in building the database. A lot of it, as you will see, is centered around Spanner versus Aurora, and for each decision point we'll go into what the architectural backing was and how we built the database. So again, like Andy said, if there are any questions, please feel free to interrupt. I have a lot of slides, including an extra section, and I'm happy to go into any area in more detail, but I'm planning on just touching on the various aspects of the system. I'm also assuming most folks have a fairly good understanding of the space, given you're students of the area, so if there's any place where I'm glossing over too much, please let me know.

It's not all my students — we get random people off the internet.

Okay, great. Then people should feel free to interrupt; I think that's the moral of the story. It'll help everybody stay awake, including me. Yes, okay.

All right, distributed SQL. So what is distributed SQL? Instead of explaining this over and over again, we thought we'd make it a simple phrase. It is the SQL with transactions that you all know from traditional RDBMSs, with three extra properties. Resilience, which means failures should not affect you, or at least not a whole lot. Scalability: add more nodes in order to do more aggregate processing, whether that's the number of queries you're able to handle, total storage volume, etc. And geographic distribution, which can be as simple as a multi-zone deployment where you're able to survive zone failures, or more complicated, like a multi-region, multi-cloud, hybrid deployment where you have a variety of goals: you want to survive an entire region failure, or maybe get data closer to your users to serve it with low latency. So that's distributed SQL in a nutshell.

With that definition, Yugabyte is built to be a distributed SQL database with a focus on high performance, which means C++ throughout the stack, the ability to do low-latency serving (or relatively low-latency serving), and the ability to deal with larger amounts of CPU, RAM, disk, etc. It is an OLTP database, though.
I just want to call that out — I don't think I put it on a slide because I took it for granted — but it is an OLTP database, not an OLAP one. And it's cloud native: run anywhere, with no external dependencies, so Kubernetes, VMs, bare metal, what have you. And the database itself is fully open source, just like Postgres. So our aim is to take Postgres-like functionality and build it for the cloud native era.

Okay, very high-level architecture. When deciding to build Yugabyte, we split it into two layers. There's a query layer on top — it's actually a pluggable query layer which supports multiple APIs, and we support two: the YSQL API, which is inspired by Aurora in that it reuses PostgreSQL, and the YCQL API, which is Apache Cassandra compatible in its wire protocol and has some extra features. I'm not going to talk about the YCQL API; I'll focus only on the YSQL API, but it is built as a pluggable query layer. The lower half of the database, the storage layer, is inspired by Google Spanner. It's called DocDB, and it is a distributed transactional document store.

Let's keep going — again, a very high-level overview. Assume that's what a single-node Postgres looks like. You can separate it into the logical query layer — parsing, analysis, read/write planning, the executor, the postmaster, all of those processes — and then the layers and components that deal with disk, like the write-ahead-log writing component, the background flusher, vacuum, etc. What Yugabyte does is retain the upper half and replace the lower half with DocDB. That's step one: you can execute a query on any node, and it is able to distribute work to and read from any other node below. As step two, we said: hey, why don't we take these query plans that Postgres generates and start pushing them down to the lower layers, in order to do execution closer to the disk? If you're doing, for example, a max or a sum or a count, you can push the sub-operations down to the respective subsets of the data and do the processing there (a small sketch of this idea appears below). And the third phase is: we already have most of what Postgres offers in terms of table statistics; now those need to work well not only with larger, distributed tables, but also based on geographic location and network latency — and we also need to rewrite query plans to actually work well in a distributed system. So those are the three steps. None of them is ever complete — that's the journey of a database — so they're at different levels of maturity, in that order: the first step is the most mature, the third step the least mature, relatively, but all three are in play.
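To make that pushdown idea concrete, here is a minimal sketch in plain Python — illustrative only, not YugabyteDB's actual code; the tablet layout and function names are made up. Each tablet computes a partial aggregate close to its data, and the query layer only combines the partial states:

```python
# Hypothetical per-tablet data; in reality these live on different nodes.
tablets = [
    [3, 7, 1],     # rows owned by tablet 1
    [9, 2],        # rows owned by tablet 2
    [5, 5, 5],     # rows owned by tablet 3
]

def tablet_partial_sum_count(rows):
    """Runs close to the data, on each tablet server."""
    return sum(rows), len(rows)

# The query layer ships partial states around, not raw rows.
partials = [tablet_partial_sum_count(t) for t in tablets]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
print("SUM =", total, "COUNT =", count, "AVG =", total / count)
```

The same decomposition works for MAX, MIN, and COUNT; AVG needs the (sum, count) pair per tablet rather than a per-tablet average, which is exactly why the pushed-down operator can differ from the top-level one.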
Okay, so that was the intro — like I said, very brief. Now, why did we build it? We started Yugabyte in 2016. To paint you a picture: Spanner was out, Aurora was already out, clouds were taking off for OLAP, but enterprises were very afraid of the cloud for OLTP — they were like, I don't think it plays in our future — while the tech companies were all embracing the cloud for OLTP as well. That's roughly how things stood back in 2016 when we started. So why build a database? That's really the one question that anybody building a database needs to answer. The journey was full of decisions, all sorts of decisions, because the landscape is fairly nuanced and fairly advanced, so it's been an interesting journey for us.

The first decision was: hey, there are already a lot of these — Spanner, Aurora. What are we building? Are we just building an open-source version of one of them, or is it going to do anything better? So, trying to figure that out: let's say we had to go back to the drawing board and build the perfect distributed SQL database. What would we do? We knew we wanted SQL. We wanted an API that was growing fast, so people would actually use us, and we found that Postgres was an incredibly popular API — that's no secret, everybody's picking Postgres, so there's no real revelation there. And on the other side there were Aurora and Spanner. Aurora — and these are just their own definitions — is a highly available, MySQL- and PostgreSQL-compatible relational database, and Spanner is the first horizontally scalable, consistent relational database service. It turns out — people may have their differences of opinion — that Aurora is way more used than Spanner. That became clear very quickly.

It's still true today.

It's still very true, I think. And Spanner is actually a far more complicated piece of technology as well. I'm not putting down the Aurora guys — they've done great work — but Spanner is harder.

So it's more expensive too.

Yes, it needs a lot of hardware; it's very expensive. Yes, seriously. Yeah, okay.

So we said, okay, let's really compare these. That was the high-level comparison, but if you start looking at each of these rows: what did Aurora do that Spanner did not, and vice versa, and why did Aurora get picked? Well, the first thing is the query layer. Aurora reuses PostgreSQL and MySQL as its query layer — it said, let me start with that code base, let me not build something new. Google Spanner said: I'm going to build something new; it's a distributed system; you need to learn to deal with a distributed system. The second thing is what you get from that: the feature depth supported by Aurora, thanks to its reuse of the code base, is really high — it supports most of the RDBMS features you would want — while Google Spanner's was pretty low when it started out. In fact, I don't know if folks know this, but it didn't have an INSERT statement; you actually had to use APIs. And they added foreign keys, for example, just a few months ago — and these are considered pretty foundational to SQL. Fault tolerance: they both do fault tolerance; that's the thing Aurora exists for, to make things operationally easier and fault tolerant. Now, on the horizontal scalability and distributed transactions front, Google Spanner really shines. It's able to spread data across a bunch of nodes and use all of their CPUs, whereas Aurora sends all of the reads and write transactions to a single head
node, which has to deal with all of it. And finally, the global transaction replication paradigm: Aurora is asynchronous, so you can do async replication to remote databases, while Google Spanner is synchronous, with read replicas. The read replicas are still part of the same cluster — it still sends them data — they're just not part of the Raft or Paxos voting group; they're observers.

So the most interesting things about Aurora, the ones that made it take off like a rocket, are the feature set and the ecosystem adoption. It just really supports the ecosystem. Even though you may have to change some things because it's a different design point, most of your learnings as a developer, and most of the stuff you're trying to do in an app, just work. We absolutely wanted to retain that, and we said we're going to build Yugabyte to not give up anything, if possible — get all the good stuff into the right-hand column, if possible — and see if we can design a system that does that. That meant we wanted to keep the reuse of Postgres/MySQL — we picked Postgres, of course, and we'll talk about that in a second — and we wanted to offer most of the features, almost all of the features, or at least a very high fraction of the features that Postgres has. We wanted to keep Spanner's horizontal scalability and distributed transactions. And we didn't want to give up on asynchronous replication either, because of the old school: we are an open source database, but also a commercial company, and if you go look at most of the RDBMS deployments out there, they use GoldenGate and master-slave replication to decrease both read and write latency, accepting a little bit of eventual consistency across the two geographies. So we wanted to keep that, and we wanted to keep read replicas and synchronous replication for the forward-looking applications as well. That pretty much formed our design doc — what we had to build.

What is the default replication scheme you use in Yugabyte today?

Inside a cluster, it's synchronous. Across clusters you can do asynchronous replication — there's no default in that sense; you have to set it up. Two independent Yugabyte clusters can asynchronously replicate data to each other, and within a cluster you can add nodes that are read replicas; we support all of that. But the simplest deployment paradigm is a three-node cluster with replication factor three. That's the simplest recommended deployment pattern.

And presumably the bulk of your customers are using synchronous?

Yeah. Nobody uses Yugabyte without synchronous replication today — it wouldn't make sense. In addition to synchronous, there are people using both read replicas and async replication, but always in addition. If you don't need synchronous replication, a traditional RDBMS is really solid, so we don't have much of a play there.

Okay, so decision number two. Now that we decided this is what we want: do we reuse or rewrite Postgres? It's interesting — Andy was just mentioning the cycle that projects go through. It's not an easy decision.
We actually started down the rewrite path. We spent five months on it, and then said: you know what, we're not going anywhere; we've got to throw this away and start again. That kind of gives you the answer on which way we went. But let's say you log into psql and you do \d. What do you think the query behind that command looks like? Originally we thought: hey, we support a few data types, we support tables, bingo — add one data type every month, and in n months we'd have most of it completed. Well, it turns out the query to list tables looks like that [on the slide]. I was like, okay, man, I didn't know about that many CASE statements in SQL — I've never written one personally. And it has aliasing and WHERE clauses and all sorts of built-in functions. We realized it was going to take us years just to support this one list-tables command, which is not very encouraging as a company.

Then we peeled the onion a bit more, and there's a heck of a lot of functionality that people really care about from the RDBMS world: a dizzying array of expressions, operators, and built-in functions; expression and partial indexes; all the way to stored procedures, triggers, user-defined types, extensions, foreign data wrappers, what have you. So we said: we're not getting there in time; we're not going to be able to build all of this if we rewrite the whole thing. So we decided to reuse the whole thing — reuse the code base — and the very permissive license made the case for us to pick Postgres over MySQL. Retrospectively it was a good decision, because Postgres's SQL support is actually much more advanced than what MySQL has, from what I understand.

So Yugabyte today supports most of the features shown on the previous slide, in a distributed, scalable fashion. Now, it may not all be performant, because Postgres does various things, and we have hit a lot of curveballs — which makes for an interesting talk in itself: what are the things an RDBMS does that don't work in a distributed world? We had to go block and tackle each one of those, and we're still doing a bunch of them, but we are able to support the feature set. So that's good.

I agree, that's an entire talk. Can you give an example from the last slide — from your checklist, which one was the hardest to make distributed?

Let me see. Those items themselves were not hard — it's always the stuff beneath the covers. Take stored procedures, for example — or actually, it doesn't matter.
It happens in a bunch of places. Postgres has a bunch of system catalog tables — for example, a table called pg_statistic. Take any function call you make: Postgres doesn't know up front whether it is a built-in function or not. If it's built-in, it can cache that list; if it's user-defined, it needs to look it up in the catalog. Postgres just does the lookup, because it's local. Now translate that to our world: every one of those lookups becomes a remote lookup. And Postgres, because it's local, can do negative caching — remember "this doesn't exist" — until, of course, it does. In a distributed system, negative caching is much harder, because while you're negative caching, somebody could have introduced the type on another node, and now you're screwed. So that's one example: a lot of queries going to one node end up hitting a central lookup for the same name, because the distributed workload keeps looking up the same function, and the catalog happens to reside on a single node. Now you're in the question of: do you distribute the catalog and keep the caches consistent, or do you do the lookup every time? (A small sketch of the caching issue appears below.) There are a number of such examples where, unintentionally, bad things happen in a distributed system. There's actually a blog I wrote about this for pushdowns too — if we hadn't done them the right way, by changing the Postgres execution plan, they'd have had a huge impact. You can search for "five pushdowns Yugabyte" and find an account of five different cases, including how they speed up performance.

So what does your catalog look like? Is it just the regular Postgres tables, or something you extract?

It's Postgres-like tables that sit on top of DocDB tablets, which are replicated and highly available. And that was another interesting realization: we were just creating one tablet per table — I mean, how bad could that be? It turns out... can anybody guess how many tables and relations Postgres has in its catalog? Can anybody guess? Oh, you guys know it. Okay, never mind. Not a fun audience, come on. It's 279 or something like that. The database was barely able to keep up with its own catalog on day one. So we had to build a feature to multiplex all of these into a single tablet, because they each have maybe 20 or 30 rows.

But a bunch of them are views, right? Like pg_tables is a view over pg_class. It's not like you materialize all of them.

No, we don't need to materialize them all, but it's still a big number, each with just a handful of rows. So, anyway.
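On that negative-caching example: here is an illustrative sketch, with entirely made-up names, of one common way to make a cached "does not exist" answer safe — trust it only while a catalog version number is unchanged. This is a general technique sketched for clarity, not necessarily how YugabyteDB resolves it:

```python
class Catalog:
    """Stand-in for the (remote, single-node) system catalog."""
    def __init__(self):
        self.version = 0
        self.functions = {}

    def create_function(self, name):
        self.functions[name] = True
        self.version += 1              # every DDL bumps the version

class CatalogCache:
    """Per-node cache that also remembers misses (negative caching)."""
    def __init__(self, catalog):
        self.catalog = catalog
        self.cached_version = catalog.version
        self.negative = set()          # names known not to exist

    def lookup(self, name):
        # A cached miss is only trusted while the catalog version
        # matches; otherwise the negative entries may be stale.
        if self.cached_version != self.catalog.version:
            self.negative.clear()
            self.cached_version = self.catalog.version
        if name in self.negative:
            return None                # served locally, no remote trip
        found = self.catalog.functions.get(name)
        if found is None:
            self.negative.add(name)
        return found

catalog = Catalog()
cache = CatalogCache(catalog)
assert cache.lookup("my_udf") is None  # miss, now negatively cached
catalog.create_function("my_udf")      # DDL happens on another node
assert cache.lookup("my_udf") is True  # version bump invalidates the miss
```

In a real system the version check itself has to be propagated cheaply (e.g. piggybacked on existing messages), which is the crux of the distribute-the-cache versus always-look-it-up trade-off described above.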
So here's what we did. We took Postgres, reused the upper half, and ripped out the lower half, replacing it with our own storage. DocDB is the lower half — a distributed document store — and we map each table to a document table and each row to a document, and so on, underneath. And we make changes to the various parts of the upper half of Postgres — this diagram is all pretty, but reality is nowhere close; the changes are not restricted to those small boxes, it's a lot more invasive than that. In any case, the theory was: make these changes so that Postgres natively runs on top of a distributed system. The aim is that your app can connect to any node; the nodes are able to talk to each other and push down work or pull up data, whatever they need to do in order to execute; and if a node dies, you're okay, because the system underneath is fault tolerant and replicated like Google Spanner, and the system above is essentially stateless, so you can continue your queries on a different node.

Okay, that was as far as getting the database running. Decision three: we still needed to figure out how to do single-row replication, and we picked Raft. We had implemented Paxos before, in our lives at Facebook, for a couple of things — and this was before Raft was well accepted and a thing — and Raft is far simpler, so we went with it. The additional advantage of Raft, from our point of view, is that it formalizes membership change, which is pretty critical when you're dealing with adding nodes, removing nodes, changing machine types — a number of operations that have to be zero-downtime.

Inside Yugabyte, every logical user table is split into a bunch of tablets. How the split is done is something we give you control over: you can do hash-based sharding or range-based sharding; we give you both (a rough sketch is below). Each shard — a tablet — is then replicated onto as many nodes as the replication factor. So with replication factor three, you take a tablet, put it on three nodes, and use Raft to replicate the data; Raft internally elects a leader, all the writes hit the leader and go to a majority — just standard Raft. This gives you single-row linearizability.
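Here is a rough sketch of that: hash-sharding keys into tablets and placing each tablet's Raft group on RF = 3 nodes. The two-byte hash and the round-robin placement are illustrative stand-ins, not YugabyteDB's actual policy:

```python
import hashlib

NUM_TABLETS = 4
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
RF = 3   # replication factor

def tablet_for_key(primary_key: str) -> int:
    """Hash sharding: a stable hash of the key picks the tablet."""
    digest = hashlib.md5(primary_key.encode()).digest()
    return int.from_bytes(digest[:2], "big") % NUM_TABLETS

def replicas_for_tablet(tablet_id: int):
    """Naive round-robin placement; a real placement policy also
    balances leaders, zones, and load."""
    return [NODES[(tablet_id + i) % len(NODES)] for i in range(RF)]

for key in ["user:1", "user:2", "user:42"]:
    t = tablet_for_key(key)
    print(key, "-> tablet", t, "-> replicas", replicas_for_tablet(t))
```

Range sharding would replace tablet_for_key with a lookup against sorted split points, which is what makes range scans cheap at the cost of potential hot spots on sequential keys.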
Now, the thing about Raft — at least what the paper says — is that you need to establish your leadership before you serve reads, which effectively looks the same as reaching out to all your peers, collecting votes, and then serving the read. That is incredibly slow, and we said it's not going to work, especially for the geo-distributed aspect of the database. So one of the changes we had to make was to implement leader leases, which the Raft paper actually calls out in a single line — it says, "or you could do this" — but it turns out to be a heck of a lot of work to actually get right. We implemented it, with a lot of interesting learnings, while still making sure we're reasonably defensible against clock skew and all of that.

We looked at how to use the monotonic clock, and we did a bunch of work to make sure we're resistant to clock skew. So our reads are also handled by the leader, by virtue of the lease. Among the other enhancements needed is the use of monotonic clocks: your software clock can jump, due to an NTP adjustment for example, so we use monotonic clocks and transmit only deadline deltas, as opposed to absolute times. Everybody computes time locally, and we only adjust for clock drift within a small time window, which is much safer than treating clock skew as a forever-accumulating thing. (A small sketch of the lease idea is below.)

So what is that window?

It's two seconds — the lease time is two seconds. We're just correcting for clock drift within a two-second window. And this is different from the time it takes to detect a failure, which is a number of missed heartbeats; the two seconds is the lease establishment period. The missed-heartbeat threshold for us is six missed heartbeats of 0.5 seconds each, so it's about a three-second failover time. So you're still fine, in the sense that you don't hold on to the lease and screw things up for long when that situation hits.

Okay. Group commit was the other thing, without which you're a sitting duck in the water. You have to do a lot to improve performance by combining operations, and that means inside a Raft batch you have to order operations and do a bunch of work to make it correct. So: group commit. And that's just a couple of the things — there's a whole bigger list out there.

You seem a bit flip about this — "oh yeah, it's Raft, it's a couple of things." It's a major engineering undertaking. That's not easy.

No, it's not easy — I am being a bit flip about it. We spent a lot of time on this, to be honest. I have a partially written design doc that I intend to complete; it's in our GitHub repo, and I'm happy to share some of it. It is genuinely difficult, because some of these things are hard even to test for — it's so probabilistic, you just have to keep exposing yourself to many, many hours of run time and induced failures. So yes — thanks, Andy, for keeping me straight.
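On the lease mechanics just described, here is a minimal sketch, assuming leases are granted as durations (deadline deltas) that each node anchors to its own monotonic clock, so that wall-clock jumps never enter the protocol. Illustrative only:

```python
import time

LEASE_SECONDS = 2.0   # the two-second window mentioned above

class LeaderLease:
    def __init__(self):
        self.expires_at = 0.0   # on this node's monotonic clock

    def grant(self, duration_s: float):
        # The peers granted "valid for duration_s seconds" -- a delta,
        # not an absolute timestamp -- so we anchor it locally.
        self.expires_at = time.monotonic() + duration_s

    def can_serve_reads(self) -> bool:
        # The leader may serve reads locally only while unexpired;
        # after that it must re-establish leadership the slow way.
        return time.monotonic() < self.expires_at

lease = LeaderLease()
lease.grant(LEASE_SECONDS)
print("serving reads locally:", lease.can_serve_reads())
```

The real protocol also has to bound clock-rate drift between granter and holder within the window, which is part of why this is described as "a heck of a lot of work to get right."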
Anyway, the next decision. So far, if you've followed along: we reuse Postgres, and we use Raft for single-row consistency. Now, how do we store data on a single node? We wanted to optimize for SSDs, which meant RocksDB was the thing to work with, and we knew the RocksDB code fairly intimately: when we were at Facebook, we were the team behind HBase, and RocksDB took LevelDB as a starting point, took HBase's algorithms, and put them inside LevelDB so that an HBase-like system could run on SSDs — for Facebook's MySQL tier. We had tried HBase directly under the MySQL tier, and GC was giving us hell because SSDs were really fast, so we decided Java was not the way to go. So the question was: do we use RocksDB as a black box or not?

After looking at what we needed to do, we needed a number of enhancements for RocksDB to work in the unique way we were using it. A single node can host multiple tablets, as shown below, and each tablet is backed by its own RocksDB, so we needed many RocksDB instances to coexist on a single node, share resources amongst themselves, and still behave nicely as parts of a coherent system. That was issue number one. Issue number two: we wanted to optimize RocksDB for the access patterns coming from the layer above. Instead of using it as a plain key-value store — which doesn't understand, say, a column delete, or a document delete, or a TTL expiry of data — we wanted to get more invasive and make those changes. And that wasn't all: we also wanted to convert the key-to-value store into a key-to-document store. And we wanted to take out RocksDB's write-ahead log completely, because we had the Raft log on top — what's the point of a double write-ahead log? But if you take out the write-ahead log, you take out MVCC, because they're tied at the hip, so we hoisted both MVCC and the write-ahead log up a layer. At that point, reusing RocksDB as a black box was clearly not a good idea, so we integrated that component deeply into the database and started making changes — and we still are, to some extent, but all of what you see here is done.

Okay, so DocDB's local store effectively deals in documents. It takes a document and puts it into some format, and because it's a multi-API store — it caters to multiple APIs — it has different ways of laying out data. You could have an init marker; you may not need an init marker because you'll replace the whole sub-document; or you may not replace the whole sub-document. There are various games we play with the exact representation of whatever you're modeling on top.

This is a slightly more in-depth view of how data encoding is done at the DocDB level. At that level, everything looks like a document key and a document. Because you want the ability to insert, upsert, or replace at a fine-grained level — in addition to changing a whole sub-document or document — the way it's written is: a doc key, then its sub-keys (its attributes), then the hybrid time, the time at which this piece of the document was written, which ties into Raft in some sense, because hybrid time is what we use (we'll talk about that in a bit), and finally the value itself, also encoded. Depending on the API this flows through, much of the value serialization is a pass-through from the system on top: if you're talking PostgreSQL, the value is encoded in PostgreSQL's own serialization format, so we don't keep serializing and deserializing. However, for the primary-key part — the doc key — we do re-serialize, because we want to maintain a byte-sortable order that is natively understood by the system. (A simplified sketch of this encoding is below.)
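A very simplified illustration of that byte-sortable layout: (doc key, subkey, hybrid time) flattened into one key. Real DocDB has its own type codes and separators; this sketch only shows the ordering trick of inverting the timestamp so newer versions of an attribute sort first:

```python
def encode_key(doc_key: str, subkey: str, hybrid_time: int) -> bytes:
    # Invert the timestamp so that, under plain byte order, the newest
    # version of a (doc_key, subkey) pair sorts before older ones.
    inverted_ht = (2**64 - 1) - hybrid_time
    return (
        doc_key.encode() + b"\x00" +
        subkey.encode() + b"\x00" +
        inverted_ht.to_bytes(8, "big")
    )

k_old = encode_key("user:42", "email", hybrid_time=100)
k_new = encode_key("user:42", "email", hybrid_time=200)
assert k_new < k_old   # newer version sorts first within the same attribute
```

Because all versions of all attributes of one row share the doc-key prefix, a single seek lands on the row and a short scan covers it — the "operate on a bunch of KVs at a time" property discussed next.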
Why store a document, though? Why not a tuple? What advantage are you getting — what aspect of a JSON-like document are you exploiting in the system, versus a tuple?

There are a few things. Inside a document we can do indexes and short range scans and a few other things — that's number one. Number two, there are cases where you have to delete a sequence of entries in the store below with a single write, as opposed to multiple writes — for example, you may have an array you want to delete, or an entire sub-document you want to suppress. Those are the reasons we went down the document path: we wanted the ability to understand, skip, seek, or operate on a bunch of KVs at a time, as opposed to a pure tuple, where those things get more complicated.

Hold on — tuples would be less complicated, because you know exactly what they are; you don't have to go look at the document and figure out what schema it actually has.

Okay, I see your question. In that sense, it is like a tuple — the schema is enforced on top. DocDB itself doesn't know about schema; that depends on the query layer you come from, and the storage layer is neutral. There's no notion of schema down there; that's handled above. What we're really trying to say with "document" — and it is a loose notion of a document — is that these are not tuples that are completely independent of each other. The way they're laid out, the way they're serialized and written, they have a larger collective meaning, as opposed to just being isolated tuples. And similarly, which pieces of this representation you put into a bloom filter, and which pieces you put into your index blocks — things like that start to matter.

That's orthogonal to whether it's a tuple or a document, though.

You're right. At a high level you could describe this as tuples, too. Maybe the way to put it is that there is a relationship between a sequence of these tuples, as opposed to them all being independent.

I get it — you want to do the physical denormalization.

Exactly. Exactly. Yeah, exactly.

All right, so that's the storage layer. The next decision was: be like Spanner, but don't be like Spanner — don't require atomic clocks. People don't have those lying around. So where do atomic clocks come in? With Raft and the monotonic clock, we're already fine on linearizability for the single-row-key path. It really comes up only when you start doing multi-tablet operations, like a SELECT *. Any time you hit multiple tablets inside a single operation — call it a transaction that updates or reads across multiple tablets — that's when this starts to matter. One thing we did inside Yugabyte early on is distinguish between single-row and multi-row access paths; they're treated differently.
The single-row access path takes what's called a fast path and goes straight to the tablet it needs to update. The multi-row path takes the more generic distributed-transactions path. We'll look at how distributed transactions work, but folks may already know why this matters: you need a notion of time correlation across nodes. When you're operating across keys that belong to different shards — which themselves may live on different nodes — you perform updates that go to different Raft groups, but you need to tie them to some notion of time. I'm calling it physical time, but it's some notion of time, and all nodes need to agree on it — that's where some sort of global time service comes in.

Using an actual physical clock — there are too many overloads of "time" here — a hardware-based solution would mandate a couple of atomic clocks stuck into your system. An atomic clock, or a TrueTime-style service, is nothing but a highly available, globally synchronized clock with very tight error bounds — seven milliseconds in the case of Spanner, or thereabouts — where you're okay waiting out that uncertainty. It turns out most people will tell you they don't have atomic clocks: if you tell me I need an atomic clock, I'm going to find a different database. And most physical clocks people do have are not synchronized tightly: NTP's synchronization window is about 150 to 200 milliseconds. If you're building a high-performance database that takes a minimum of 150 milliseconds to serve a response, you're not going to get very far.

So we adopted hybrid logical clocks: we let NTP do its thing — 150, 200 milliseconds — and inside that synchronization window we tack on a logical component to capture causality. It's almost like a Lamport clock: you track the causality of events. Couple that with this: every time any node communicates with any other node, for anything, they exchange time, so their logical and physical times stay loosely correlated. And for every transaction we're very fine-grained about exactly which nodes it touches and what causal chain it needs — we're as good as we can be at detecting when we don't need to establish a causal relationship, without sacrificing correctness — and we minimize the number of places where we can't tell whether there is a causal chain or not. It still happens in a few places, and there we treat it as a conflict and restart. That's really the logic of how we go about it: a standard hybrid logical clock (see the sketch below), and the rest is figuring out how to avoid conflicts as much as possible.

Is your HLC based on the Buffalo guys' one, the one CockroachDB also uses, or do you have something different?

No, it's based on the same one — the Buffalo one. Yeah, same thing.
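Since it came up, here is a compact hybrid logical clock with the standard send/receive rules in the style of that Buffalo (Kulkarni et al.) paper — a didactic sketch, not YugabyteDB's implementation:

```python
import time

class HLC:
    def __init__(self):
        self.physical = 0   # highest physical time seen, in microseconds
        self.logical = 0    # tie-breaker within one physical tick

    def _wall_us(self) -> int:
        return int(time.time() * 1_000_000)

    def now(self):
        """Called for local events and before sending a message."""
        wall = self._wall_us()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1
        return (self.physical, self.logical)

    def update(self, msg_physical, msg_logical):
        """Called on receipt of a message carrying the sender's HLC."""
        wall = self._wall_us()
        if wall > self.physical and wall > msg_physical:
            self.physical, self.logical = wall, 0
        elif msg_physical > self.physical:
            self.physical, self.logical = msg_physical, msg_logical + 1
        elif msg_physical == self.physical:
            self.logical = max(self.logical, msg_logical) + 1
        else:
            self.logical += 1
        return (self.physical, self.logical)

a, b = HLC(), HLC()
t1 = a.now()
t2 = b.update(*t1)   # b's clock now causally follows t1
assert t2 > t1       # tuple comparison: causality is preserved
```

The key property is that timestamps stay within the NTP uncertainty of physical time while still ordering causally related events — exactly the "logical component inside the synchronization window" described above.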
Okay, the next major decision was how we were going to do distributed transactions. Now you have some sort of software-defined global clock — but so what? There's still an actual algorithm to figure out. We picked a Google Spanner-like design, as opposed to a Percolator-like design, because we wanted to be geographically distributed. It's based on a few assumptions. Basically, the system uses two-phase commit with hybrid logical clocks to commit data. Distributed transactions where the clock itself is distributed are more scalable than going to a central timestamp-issuing authority, and they're better for multi-region deployments, which was the other big thing. Built into this were our assumptions: first, global deployments will get very popular, so a system where a timestamp oracle hands out time values will become a problem in those deployments. Second, clock synchronization in clouds will get better: Google built TrueTime, and the other clouds were not going to sit around. That's what we hypothesized when we started, and they haven't — Amazon and Azure now have time sync services — and our bet is that as these services get built out, it keeps getting better and better. And obviously, no atomic clocks required, which we already covered.

Okay. Yugabyte's distributed transactions are fully decentralized: there's no single point of failure, any node can act as a transaction manager, and every ongoing transaction is tracked in a transaction status table, which is itself a distributed table. A transaction has three states: pending, committed, or aborted. Reads are served only from committed transactions; an aborted transaction is ignored; and if a transaction is pending, you can't serve its writes in reads directly — what you can do depends; you run through the conflict resolution rules, based on the isolation level and so on.

Speaking of isolation levels: Yugabyte supports serializable and snapshot isolation. I'm assuming folks are familiar with these — if not, please ask — but serializable detects read-write as well as write-write conflicts, while snapshot isolation detects write-write conflicts only (a small sketch of the difference is below). In the mapping, SERIALIZABLE in Postgres obviously maps to serializable, and the rest — repeatable read, read committed, read uncommitted — all map to snapshot isolation. So the Postgres default maps to snapshot, which is the default in Yugabyte as well.
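A small sketch of that difference, with transactions reduced to read and write sets; the example pair is the classic write-skew shape that snapshot isolation permits and serializable rejects:

```python
def conflicts(txn_a, txn_b, isolation):
    """Each txn is (reads, writes) as sets of keys."""
    reads_a, writes_a = txn_a
    reads_b, writes_b = txn_b
    ww = bool(writes_a & writes_b)                          # write-write
    rw = bool((reads_a & writes_b) | (writes_a & reads_b))  # read-write
    if isolation == "snapshot":
        return ww
    if isolation == "serializable":
        return ww or rw
    raise ValueError(isolation)

t1 = ({"x"}, {"y"})   # reads x, writes y
t2 = ({"y"}, {"x"})   # reads y, writes x
print(conflicts(t1, t2, "snapshot"))      # False: no write-write overlap
print(conflicts(t1, t2, "serializable"))  # True: read-write overlaps
```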
Did anybody ask you for read committed, or a lower level?

No, not so much, actually. And read-only transactions are lock-free — because we use MVCC and so on, we don't have to take even a shared lock, let alone an exclusive one.

So here's a simple distributed-transactions code path. Say the client wants to update a couple of keys: k1 = v1, k2 = v2. This is the theoretical view, not the optimized practical one. The request hits any node, which acts as the transaction manager. The manager creates a status record — replicated using Raft — to track that the transaction is in progress. (In the real world, the transaction manager and tablet servers one and two are colocated on the same node, because it picks a local transaction tablet, and the transaction status record is pre-created, so that round trip goes away.) It then starts writing provisional records, to in effect acquire locks on each of the rows it wants to update. It makes each of those updates, each in turn replicated using Raft. Finally, it flips a bit in the transaction status table that says: this transaction can be served now — it's committed. At that point two things happen. It responds to the client saying the transaction succeeded — assuming, of course, it didn't conflict with any provisional record already written. And in the background it starts applying the provisional records, turning them into finalized records. That matters because any read that hits a provisional record — which is in the pending state — has to look up the status of the transaction: committed, aborted, or pending. If it's pending, the read applies the conflict resolution rules; if committed, it serves it; if aborted, it skips it. You don't want that extra lookup to go on forever, so step number six flushes the outcome from the status tablet into each of the tablets, so they can serve reads by themselves. And the timestamp at which the transaction becomes visible is the timestamp at which the bit was flipped to committed: step number four determines the commit time of the transaction. (A toy version of this flow is sketched below.)
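A toy version of that commit path — a status record that starts pending, provisional records per tablet, and commit as a single status-bit flip. Every structure here is an in-memory stand-in for what is really a Raft-replicated tablet:

```python
status_table = {}                 # txn_id -> "pending"|"committed"|"aborted"
tablets = {"t1": {}, "t2": {}}    # tablet -> {key: (value, owner_txn or None)}

def begin(txn_id):
    status_table[txn_id] = "pending"         # step 1: replicated via Raft

def write_provisional(tablet, key, value, txn_id):
    tablets[tablet][key] = (value, txn_id)   # steps 2-3: provisional records

def commit(txn_id):
    status_table[txn_id] = "committed"       # step 4: the atomic flip

def read(tablet, key):
    value, owner = tablets[tablet][key]
    if owner is None or status_table[owner] == "committed":
        return value
    return None   # pending/aborted: conflict-resolution rules apply instead

begin("txn-7")
write_provisional("t1", "k1", "v1", "txn-7")
write_provisional("t2", "k2", "v2", "txn-7")
assert read("t1", "k1") is None    # not visible while pending
commit("txn-7")
assert read("t1", "k1") == "v1"    # visible on every tablet after one flip
```

The background apply (step six) would then rewrite each (value, "txn-7") pair as (value, None), so future reads skip the status lookup.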
Okay. So that's how a synchronous distributed transaction happens — single-region, multi-zone, or multi-region, it doesn't matter. Now I'm going to go to cross-cluster async replication, which is unique to Aurora. We thought we should bring it into this world, because a lot of RDBMS users still like this functionality. We call it xCluster — every other word seems to be overused, so we gave it a distinct name people can refer to the feature by. Cross-cluster, or xCluster, replication allows two independent Yugabyte clusters to replicate data one way or both ways, whichever way you want. The assumption, obviously, is that the schemas on both clusters are synchronized externally — that's a limitation today, something we hope to fix as we work on the feature more. DDL replication is also something we hope to do, but today that's not the case. This architecture is also the building block for our change data capture feature, where an external system can arbitrarily subscribe to changes. There are nuances: CDC is a more generalized feature than cross-cluster replication. With CDC you can get changes in three formats — the after-row image, just the deltas, or the before-and-after images — whereas xCluster replication only needs the deltas and doesn't care about the others. But it's the same building block that supports that feature too.

Okay, so — guarantees. The first thing we guarantee is in-order delivery of the changes for a given row to a tablet. This is the guarantee for non-distributed transactions; we'll talk about distributed transactions later. Obviously, when two operations that don't touch the same rows occur on two different tablets, they are not sequenced relative to each other when they arrive at the replica cluster. That means you could set x = 1 and then y = 2, and in the replica cluster see y = 2 first and then x = 1. Our belief is that it's still usable in this form, because it's a distributed system and people will explicitly synchronize the changes they really care about. That's an okay trade-off for a distributed system.

The second guarantee is at-least-once delivery of changes: on failures we transparently retry. The way it works underneath, the tablet leader — the query coordinator — translates the changes into the set of attributes and documents that need to be applied, and those applications are idempotent: you can apply a change repeatedly, because it carries its own timestamp. So there's nothing wrong with at-least-once delivery for this internal consumption. For generalized change data capture it has implications, but not for this usage.

Next is monotonic updates: once a consumer has received a change for a row, it will never receive an older change for that same row — you only get newer changes. Periodically we send a no-op to keep bumping up the timestamp up to which you have seen data, so you know how far your target system — your sink — has caught up.

And next, transactions and conflict resolution. This is the case where your write is not a single row but a distributed transaction. Either way, single row or not, the last writer wins. You could write x = 1 in one cluster and x = 2 in the other — you could have a bi-directional replication stream — and both will assign their own timestamps, irrespective of the order in which the operations happened. The one with the higher timestamp wins. (A small sketch of these consumer-side rules is below.)
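A small sketch of those consumer-side rules — last-writer-wins by timestamp, which also yields monotonic updates per row, since anything older than what was already applied is dropped. Illustrative only:

```python
applied = {}   # row key -> timestamp of the newest change applied so far

def apply_change(key, value, ts, store):
    last = applied.get(key)
    if last is not None and ts <= last:
        return False      # stale change: dropped, keeps updates monotonic
    store[key] = value    # higher timestamp wins, whatever the arrival order
    applied[key] = ts
    return True

store = {}
apply_change("x", 1, ts=10, store=store)   # written on cluster A
apply_change("x", 2, ts=8, store=store)    # written on cluster B, but older
assert store["x"] == 1                     # last writer (by timestamp) wins
```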
We picked this to simplify how apps are developed, as opposed to exposing conflicts to the end user and having them write conflict resolution rules and so on. If there is a constraint violation on the sink — say an offending write that violates a unique constraint — then the transaction apply fails; it will not be applied, and those updates are dropped. Exposing the error to the end user is an orthogonal matter, but you have to make a decision, and today we just drop it. We decided against exposing conflict resolution to end users because the system just gets too hard — it's not a practical model; it's not how people build apps today, given the programming frameworks they have.

And we are working on handling distributed atomicity. On the sink cluster — the target — if one row that is part of a larger transaction lands, it's put into a waiting-to-commit state; it's not committed immediately. Once all of the operations belonging to that transaction, across all the tablets on the target cluster, have been applied, the transaction is allowed to become visible and moves forward. Some of these features are still in progress — specifically this last one, handling distributed atomicity across the asynchronous replication stream.

Okay, so the key components: every node on the source cluster has a producer, and every node on the target cluster has a consumer. These are all part of the same process, but logically those are the roles they play. I won't go into too many details, because I want to wrap up soon, but in any case: there's a CDC stream, and there's the notion of bootstrapping, where a brand-new target cluster needs to catch up on — bootstrap — the data from the source cluster. A unidirectional replication looks like this: you can replicate from an n-node cluster to another n-node cluster, no problem. All of the P's are producers on the source.
All of the C's are consumers on the sink, and they form their own pairings based on which node has which tablet; they discover who they should send data to, and so on. If data goes to the wrong node, it can be forwarded, but ideally we'd like them paired correctly for the most part — if there's a failure and for a while you're going to the wrong node, that's okay, but eventually you want the system to correct itself. So this is about establishing a unidirectional stream from one side to the other, with the assumption, again, that the table DDLs look the same and have been externally synchronized, at least for now. Bi-directional is just two independent unidirectional streams, with protection against the echo-chamber problem — where you send something and it keeps coming back and going back, ad infinitum. That's the only extra protection you need in this case (a sketch of it appears at the end of this section).

The metadata and checkpoints are all stored inside the source cluster. Each source cluster has a set of system catalogs, in addition to what Postgres has: a UUID for each stream, the registered subscribers, and checkpoints per subscriber. It's a coarse checkpoint — that's where the at-least-once comes in; sometimes you can get repeated data. And the replication is done in batches.

Two quick questions. Are you doing a push or a pull model?

It's a pull model, but we could do either. We talked about this, and we're doing pull right now — it's really just about where things sit; there's a subscriber and a producer. I'll go back to this diagram: if we moved all the consumers to the other side and colocated them with the producers, it would look like a push model. But we're not doing that; we're doing a pull model right now.

And what binary serialization format are you using — protobufs, flatbuffers? What do you use to send the data?

The binary format for this is protobuf; it's protobuf internally. We have been talking to end users about CDC and how they want it exposed, and protobuf is not great for that kind of end-user consumption model, so we're thinking of extending SQL itself to allow it. But inside the cluster, protobuf is the far more performant way for us to transfer this data.

Okay, cool. Awesome.

And bootstrap doesn't go through protobuf — bootstrap is just disk file segments that move, for the initial catch-up when you have data in a cluster and you first start the async replication. It's almost like a backup and restore.
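Before moving on to testing, one note on that echo-chamber protection: the usual fix — sketched here with made-up names, since the talk doesn't spell out YugabyteDB's exact mechanism — is to tag every change with its origin cluster and never re-publish a change that was replicated in:

```python
def publish(change, local_cluster, outbound):
    """Only changes originating locally go on the outbound stream."""
    if change["origin"] != local_cluster:
        return                      # replicated-in change: never echo it back
    outbound.append(change)

outbound_a = []
local_write = {"key": "x", "origin": "cluster-a"}
replicated_in = {"key": "y", "origin": "cluster-b"}   # arrived via xCluster
publish(local_write, "cluster-a", outbound_a)
publish(replicated_in, "cluster-a", outbound_a)
assert outbound_a == [local_write]   # the echo is suppressed
```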
Okay, so I'll go into how we test Yugabyte — what do we do? We have a fairly sophisticated setup; we've put a lot of effort into testing. When I was writing this slide, I realized it almost starts rivaling how much work goes into the database itself. In any case, we have a reasonably long CI/CD pipeline: once you put your enhancements, fixes, etc. in, they go through this long pipeline before a release candidate comes out the other side, after which specific manual testing is done, depending on the exact features being delivered. You can try to automate it all, but that never happens in reality; you can only do so much.

Some of the key aspects: we have a lot of component unit tests, plus sanitizers and deterministic failure tests — known failures we want to encode, like "what happens when this node goes down, and this write happens, and this message goes out before it acks." These are processes, or threads, inside a single system that we quiesce or kill or otherwise do the bad things to that would happen in practice, so we're sure we hold up against the known bad cases we want to protect against. When a new bug comes in, we try to encode it into this kind of test. And sometimes you have to inject code into critical paths — test-only code paths that get flipped on only in this unit-test mode. We've done a bunch of that kind of stuff.

The first three boxes here run on every diff upload. One of our engineers writes a diff and uploads it — he has controls to selectively choose which subsets of tests run, but by default they're all selected — and the system auto-bids for a spot instance on GCP or AWS (right now we're using GCP — we moved from AWS, but we're fine with either), does Spark-based parallelized testing, and reports the results and failures back onto that diff, so you have real guarantees about the unit tests having been run.

What does "Spark-based parallel testing" mean?

Oh, it's just that we have so many unit tests now that they'd take many hours to run. To parallelize, we had to change our unit test framework: instead of one unit test having a lot of subtests and taking a long time, we chunk it into many shorter-running unit tests, distribute them across a set of nodes, and have every node run its subset in parallel, so the whole thing finishes quickly.

So you're basically using Spark as a poor man's Kubernetes?

Exactly — it's more of a parallelization framework, a MapReduce-style parallelization framework. If you have tests one to a hundred: one to five run here, six to ten run there, and so on, in parallel through Spark. (A toy version of the chunking is sketched below.)
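A toy version of that test chunking, using a local process pool as a stand-in for the Spark cluster — the shape is the same: split the suite into chunks, run the chunks on parallel workers, gather the results:

```python
from concurrent.futures import ProcessPoolExecutor

def run_test(test_name: str) -> tuple:
    # Placeholder for invoking one actual test case / binary.
    return (test_name, "PASS")

def run_chunk(chunk):
    """One worker runs its whole chunk of tests serially."""
    return [run_test(t) for t in chunk]

if __name__ == "__main__":
    tests = [f"test_{i:03d}" for i in range(100)]
    workers = 5
    # Round-robin split: worker 0 gets tests 0, 5, 10, ...
    chunks = [tests[i::workers] for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = [r for part in pool.map(run_chunk, chunks) for r in part]
    print(len(results), "tests finished")
```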
And is this the manual part — the developer says, "I'm going home for the day, run this for me"?

This is automated. He uploads a diff and doesn't do anything else; the whole thing happens, and the results are out before the code reviewer is on it.

No, but I'm saying they have to manually upload the diff.

The diff has to be manually uploaded, yes. The developer says: here's a diff, I'm submitting it for code review — which is of course mandatory for us. Once the diff is uploaded (based on some tags he can suppress some of these, but I'm giving you the default flow), the tests are automatically kicked off and the first three stages run: it bids for a spot instance, runs the whole thing, collects the results, puts them in a database, and links them to the diff, and says it's ready to go. So by the time the reviewer is on the diff, he can check the status of the tests, and it takes just a few minutes, not hours and hours — that was the Spark point.

Okay, cool.

Right. The other thing is that we're a C++ system. C++ equals high performance, but C++ also equals a heck of a frustrating time debugging memory issues in production. So we run all of the address sanitizer and thread sanitizer variants, using a couple of different compilers and different OSes — including Mac, which is a pain in the ass; ask me some other time and I'll tell you how hard it is to put Mac into a CI/CD pipeline. We do all of that, and it has saved our rears quite a few times.

We also run Jepsen in a loop. It's non-deterministic, so it's more exposure-based, but we do it because it's just safe. And finally, we have end-user-style integration tests, which are more black-box: install the database, bring it up, create, shrink, expand, run a workload — the basic operations you would qualify. These are nowhere near as intense as the earlier ones in terms of how deeply they penetrate and test, but they're really useful for qualifying high-level user scenarios.

And — the slide says 10+, it's a lot more now; it's an older slide — we have a bunch of workloads we keep running through performance regression testing, to make sure our performance doesn't drop by more than a certain amount. Which is another pain in the ass if you're doing it on cloud machines, but that's a different problem.

What is that amount?

10%.

That's a lot, though, if you're looking at some workloads. That's what the HANA guys told me too: 10% is the threshold.

Yeah — 10% could be a bad regression, right? So, anyway. As an example of the volume of tests, and why we had to build this Spark-based pipeline: we run a hell of a lot of tests. This is just the tests for YSQL, and we have an equivalent amount for the RocksDB set of tests, an equal amount for our Raft-based tests, and so on — you can see how they add up. And this is our coverage of PostgreSQL's own regression tests, which we have ported over. It's not that we cannot do more; it's just time — how much can you do in the time you have?
So as we port, this is our current state; we've now gone up a bunch from the previous percentages, but our aim is to get almost every test. Also note that the Postgres tests have to be ported in multiple flavors for us, because it's a distributed system with hash sharding and range sharding and colocation, just so many different flavors in which Yugabyte runs (there's a sketch of how those flavors multiply after this exchange). So these are just some of the things that we do.

With all that, I'll just say we're still early. We're a fast-growing project, and it's been fun working with a lot of customers and users. If any of you are enthusiasts or interested: stars on GitHub, joining our community Slack, and so on are all appreciated. Still early days for us. So yeah, that's all I had.

Okay, awesome, thank you. Since we're online, I'll applaud for everyone else. I have a bunch of questions, but I'll open it up to the floor first. If anyone has questions, unmute yourself, say who you are and where you're coming from, and ask. Okay, nobody? It's you, Andy, then.

Did you guys do any micro-benchmark testing for the smaller components, or is it only the regressions for the end-to-end system?

For performance, you mean specifically? No, we don't do too many micro-benchmarks for performance regressions. Now that you mention it, it's not a bad idea, but we find that they're noisy. We started off with 10% and had to bump that up, for some of them to 75%, because for the really short things, even with Google Benchmark, we're seeing a lot of variation, and we don't know whether it's the machines or the test itself.

What were the complications getting Macs into your CI pipeline? Because we actually struggle with that as well.

Yeah, for starters, every other flavor of OS is easy to get, through Docker, Kubernetes, VMs, or something; Mac is not, so that's one thing that's been hard. The second issue, so the way we do this is honestly a janky system, which works for us but not a hundred percent reliably. The best solution we've come up with is that we got a bunch of Mac Minis, hooked them up, made them part of the intranet, and made that part of the CI/CD, and that setup doesn't always respond reliably.

That's exactly what we did too. I don't think anybody on this call is from Apple: Apple does not give out any hardware, they don't do donations, they're very cheap.

Yeah, and then there are the Mac versions and what you install; it's just hard to spin up a pristine Mac environment. Yes.
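Going back to the point a few paragraphs up about porting each Postgres regression test in multiple flavors: one way to express "same test, several sharding flavors" is GoogleTest value parameterization. The flavor names, fixture, and test body here are hypothetical stand-ins, not YugabyteDB's actual harness:

```cpp
#include <gtest/gtest.h>
#include <string>

// One ported Postgres regression test, instantiated once per flavor.
class PgRegressTest : public ::testing::TestWithParam<std::string> {
 protected:
  void SetUp() override {
    // A real harness would create the schema with flavor-specific DDL,
    // e.g. PRIMARY KEY (k HASH) vs PRIMARY KEY (k ASC), or a colocated DB.
    flavor_ = GetParam();
  }
  std::string flavor_;
};

TEST_P(PgRegressTest, ForeignKeyChecks) {
  // ... run the ported test script against a cluster configured with
  // `flavor_` and diff the output against the expected file ...
}

INSTANTIATE_TEST_SUITE_P(AllFlavors, PgRegressTest,
                         ::testing::Values("hash_sharded", "range_sharded",
                                           "colocated"));
```

Each new flavor multiplies the suite, which is part of why the test counts on the earlier slides grew so large.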
All right, do you guys offer Yugabyte as a service yet, or is that on the roadmap?

We're in beta with Yugabyte as a service. On the commercial side we have two offerings. We have a platform product, the first one we built, which is used by a lot of our commercial customers. It's something you can install in your AWS or GCP or whatever account, and it converts that into a completely managed service, so you get a DynamoDB-like experience: it can spin up nodes, set up security groups, do the whole thing for you. That's the one our larger enterprises gravitate towards. And we just recently announced a managed offering for the cloud. We have a free tier, and that's all that's available to people right now, because we really have to figure out the pricing and usage; there's a whole list of auxiliary things to figure out, more on the business side than the core tech side. But yeah, we do have one now, and we're on the path to getting it into GA as a paid offering.

Okay. And then the most obvious comparison for you guys would be something like Citus. I wonder if you can contrast the design decisions that you made versus what they do. I think the answer would be, you know, they were strictly shared-nothing, whereas you sort of split the storage and execution layers up.

Yeah, it's interesting with Citus. In functional spirit we'd be closer to CockroachDB; in terms of Postgres reuse we're closer to Citus, so you can think of us as a hybrid between the two. Citus reuses the lower half of PostgreSQL, which means the storage is exactly the same as PostgreSQL's: you annotate a shard key, it splits the data across different PostgreSQL instances, and you go through one of these coordinator nodes, which deals with transactions across those instances, and the replication is still Postgres's own async replication. That's my understanding of Citus, anyway. With Yugabyte, the storage format is completely different, it's Yugabyte's own format, and the entire cluster actually looks like a single database, as opposed to sharding it across many databases and stitching them together on top. That's not what Yugabyte does; it's one logical database. But it reuses the upper part of Postgres more than the lower part in order to offer the functional richness. So those are the trade-offs.

Okay. All right, my last question, and I'm going to ask all the database vendors the same question going forward, so you unfortunately are the first one. If you want me to cut anything out of the YouTube video, we can do that, so be as candid as you want. How stupid are your users? Like, how surprised are you by the bullshit you have to deal with from them?

Yeah, I'm only pausing because it's very hard to answer that question. As database enthusiasts, builders, people knowledgeable about databases, we think one thing; I did the same when I started building Yugabyte, or any of the databases I worked on, for that matter, it was the same with Cassandra and HBase. But man, people do so many ridiculous things that you'd be surprised. Some of the things that didn't work for us are what I tell people when they say they want to start an infrastructure company and ask:
"What should I do?" Right? Don't go around asking people if they want consistency, if they want transactions, if they want their data to be correct. Depending on who they are, they're going to say absolutely yes or absolutely no, and neither answer means anything.

And people don't understand the nuances of the limits of physics. When we say we're a distributed database, it's: oh, you're distributed, give me low-latency writes, low-latency reads, consistency, everything. No, you can't do all of that at once. But I thought you said you're a high-performance distributed database, what do you mean you can't do that stuff? So there's another whole area of education there.

The third big realization, one we got a little early on thanks to our experience with HBase and Cassandra, is to never build a different API from what the world already knows, because it's hard enough explaining what you did differently about the database; it's incredibly hard if you're also introducing a different API. Unless of course you're MongoDB; they're the only ones who were able to pull that off, but in some sense you're then not building a database for the first ten years, you're really evangelizing how to think about documents or whatever it is. So that's been real for us; we have two APIs, and fortunately that has worked out.

So here's another tidbit, one of the behind-the-scenes mistakes we made. We had three APIs. The third API was a simple one we built only to make sure we were able to support multiple APIs, and it was the Redis API, because it's really simple: you just read off a socket and do stuff, and it works really well backed by a database. Redis plus a DB, you're going to combine the two anyway. So we said, for the use cases where you don't need too many complex operations, it's great, why don't we just give you this thing? And that thing got so complicated. Everybody kept comparing it to a cache and asking questions about all sorts of craziness, so eventually we said, you know what, we don't support this anymore; this is not an API available to you. There are only two APIs, that's it.

Got it. So I have to update the encyclopedia, because it says you're compatible with Cassandra, Postgres, and Redis.

Yeah, we dropped it; I mean, we tell people it's in maintenance mode, meaning it's not something we're working on, because it's just too much cognitive friction.

All right. So I think you could say that, from your characterization, your customers sound like they're in the middle. They're not, you know, tripping over their own feet, but they're also not like the VoltDB people, who were quite sophisticated, because most people don't need five million transactions a second, so that crowd was sort of self-selecting. You guys are squarely in the middle, which is something to feel good about.

Yes, yes, thanks. Yeah. Okay, awesome, guys. Thank you again, Karthik, for doing this. We really appreciate you spending your afternoon with us.