The Carnegie Mellon Quarantine Database talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. The pandemic continues, so let's start with databases. Today we're excited to have Matt Freels. He's the co-founder and former CTO of Fauna, and now he's the chief architect. Prior to that he was a technical lead at Twitter; that's where they learned about distributed databases and said, "okay, let's go build a better one," and that's how Fauna got started. As always, if you have any questions, please unmute yourself, interrupt at any time, say who you are and where you're coming from, and ask Matt a question. We want this to be interactive. And as always, we'd like to thank the Stephen Moy Foundation for Keeping It Real for sponsoring this event. Okay Matt, the floor is yours, go for it.

Thank you. All right, I'm super happy to talk to you all here. This is the first time I've given a talk over Zoom, and I've heard everyone else say how weird it is not having any sort of feedback, and I completely agree.

That's why I want people to interrupt you and ask questions as we go along, so you don't talk for an hour wondering whether anybody is listening. At the very least I'll be interrupting, because I have questions.

That sounds great. Okay, so I came here to talk to you about Fauna, which is a distributed database that we built at Fauna Inc. The title is "Lessons learned building a Calvin-based system." If you're familiar with Calvin, it's a distributed transaction protocol — you can think of it as an alternative to two-phase commit — and we can get right into it. I'll focus on that, try to give a brief overview of Fauna to provide some context as to why we made the decisions we did, and then get into the protocol and some of the details there as well.

So to start: what is Fauna? The short of it is that Fauna is a general-purpose OLTP database as a service. Our goal was to build a database as an application back end, based on the experience that Evan, my co-founder, and I had at Twitter building these sorts of systems for Twitter itself. To give you the shape of it: it's a NoSQL database, and the main core primitives are documents plus global secondary indexes. One thing that's unusual about Fauna is that it's not based on SQL — obviously, as a NoSQL document database it isn't — but we have our own query language called FQL. It's inspired by programming languages: it unifies querying, transactions, and stored procedures into one language that you program as an embedded DSL in your application code. And then recently, last year, we added a GraphQL API, which provides easier interop, especially in front-end environments, and is especially suited for the basic CRUD use cases you may want to start with before jumping into FQL for more advanced logic.
The other interesting aspect of Fauna is that it is, I would say, serverless. I know that's kind of a weird, scary buzzword — I find it reminiscent of "cloud" back in the day — but the real idea behind serverless, the way we think about it, is that it's a system designed around removing the node, or the server, as an abstraction in your application architecture. For Fauna, that means we deliver it as a global API that's directly accessible over the open internet, protected by keys, and that supports cheap, sessionless connections and requests. You don't provision nodes; you provision abstract resources like databases, and consumption itself is metered in terms of what we call ops — abstract reads, writes, and compute.

The theory behind this is that we see the general industry moving toward a new architecture we like to call client-serverless. The idea is that as your end-user clients — your mobile apps, your browsers — become more powerful, more of the presentation and even much of the business logic can be pushed to the edge as far as possible, and then the other primitives — databases, compute, things like SMS — become APIs that also get pushed toward the edge, so the benefits we've long had for static assets in the form of CDNs become available to these other APIs.

Another key goal is to empower application developers. We started out as application developers who wanted a database to suit our needs, so we were very concerned with making it as easy to use as possible, to enable productivity for folks building applications that consume databases, and to cater to those needs. That starts with some design choices: for example, we try to design FQL to guide the user toward predictable performance, rather than relying on query optimization, which can sometimes feel like magic to deliver the performance the user expects. Reads in Fauna go through an index; there's no declarative querying that can silently fall back to a table scan. We have filtering primitives and the like, but you know what you're getting into when you use them, and there are ways to explicitly opt into more efficient querying.

Another big thing that wasn't part of the original plan, but has made a lot of sense as we've developed the functionality and really drove a lot of the investment in transactions, is this idea of strong consistency. I like to borrow a term that Rust seems to have popularized: fearless concurrency. I think that, ultimately, rich transactional semantics are a productivity tool more than anything else.
As an application developer you don't want to be a database developer and think about caching, where your data is located, consistency, and concurrency; ideally you focus on your business logic. And that simplicity comes through in the interface: transactions are implicit in Fauna. You send a query — you make a request — and that is the scope of the transaction. The interface is designed to be as transparent as possible in that sense.

So, getting into the architecture. To sum up what I've talked about, the key points are that the system is secure by default, we expect clients to access it over the open internet, and there are high-latency connections between regions in the system. As we pursued strong consistency, we started looking around for multi-partition transaction protocols that we thought would let us provide those features with as few trade-offs as possible. The goal was to minimize overall latency, because latency is ultimately the enemy in the kind of system we're trying to build. There's no getting around the speed of light, and no getting around the fact that clients can potentially be very far away from any of the regions the system is deployed in, so the way to reduce latency is to reduce round trips.

Not really in scope for this talk, but probably interesting later, is the interaction model between the client and the server. Our query language is designed to pack as much as possible into a single request, ship that to the database, let it churn on it, and then get a result back — rather than your typical session transaction, which tends to be very chatty between client and server. That works great if your client is very close to your database; it's not so great if that interaction is happening over a high-latency connection.
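To make that concrete, here's a rough sketch of the difference in round trips. The `begin`/`read`/`write`/`commit` and `submit` functions and the query shape are hypothetical stand-ins, not Fauna's actual driver API — the point is only how many network hops each style costs over a high-latency link:

```python
# Hypothetical client helpers, not Fauna's driver API. Over a high-latency
# link, round trips dominate, so packing the whole transaction into one
# request matters far more than server-side execution time.

RTT_MS = 100  # assumed client <-> database round-trip time

def chatty_transfer(begin, read, write, commit):
    """Session-style transfer: begin, 2 reads, 2 writes, commit = 6 round trips."""
    begin()                            # 1
    a = read("accounts/alice")         # 2
    b = read("accounts/bob")           # 3
    write("accounts/alice", a - 10)    # 4
    write("accounts/bob", b + 10)      # 5
    commit()                           # 6
    return 6 * RTT_MS                  # ~600 ms of pure network latency

def single_request_transfer(submit):
    """Whole transfer shipped as one expression: a single round trip."""
    query = {
        "let": {"a": {"get": "accounts/alice"}, "b": {"get": "accounts/bob"}},
        "do": [
            {"update": "accounts/alice", "balance": {"subtract": ["a.balance", 10]}},
            {"update": "accounts/bob", "balance": {"add": ["b.balance", 10]}},
        ],
    }
    submit(query)                      # server evaluates the whole expression
    return 1 * RTT_MS                  # ~100 ms
```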
But the focus of what I want to talk about today is the big idea in Calvin — minimizing the coordination needed to run a multi-partition transaction — and how we could adapt that to this environment. It's a little different from the goals of the paper: at least when I first came across it, the focus seemed to be on high throughput, as opposed to specifically honing in on what we saw as the advantages in a high-latency environment.

Because they built that off the H-Store work, right? It was "how can you do better than what H-Store can do," and H-Store was all about performance, so that made sense.

Yeah, exactly. So, I realize Calvin probably needs an introduction, or at least a refresher. The original paper, I believe, was published in 2012, and it was Alexander Thomson, Daniel Abadi, and some others who were responsible for it. There was also the "Case for Determinism" paper, which I think was 2011, and then the Calvin paper itself in 2012.

Yeah, Thomson was the lead author, and he won the Jim Gray dissertation award for it, in like 2015 or 2014 — I don't remember when.

Oh, I didn't realize that. That's cool.

Okay, so: the big idea in Calvin is minimizing coordination by predetermining order. Calvin was, I think, one of the first systems to really promote this idea of deterministic transactions. Compare it to your classic two-phase commit, which mixes the evaluation of the transaction — determining what effects are going to be committed — with the process of coordinating across nodes, all in an initial step followed by a commit. Calvin, at its core, flips that: a transaction is submitted to the log and ordered, and then afterwards the outcome of that transaction is determined through deterministic replay of the transaction's effects.

Going forward, I'm going to zip through, hopefully as quickly as possible, Fauna's overall transaction protocol — which does a bit more — and then point out the Calvin-specific aspects and how we use them. In Fauna, the life of a transaction has three stages. In the first stage, we evaluate the actual query-language expression. Then there's a transaction-effects sequencing step, where the transaction gets committed to the log. And finally, once that's done, there's deterministic effect application, where we apply the effects of that transaction and discover the outcome. I'll walk through that in more detail. One note on terminology: over the years of developing the system, our terms have drifted a bit from the paper's. The paper talks about the sequencer, the scheduler, and storage; in Fauna we have three roles — the coordinator, the log, and storage — which roughly correspond to the labels in Calvin, but not quite.

And you probably hear dogs in the background. The dogs are on cue — I knew that was going to happen.
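Here's the skeleton of those three stages as I understand them — a toy sketch with made-up types and role objects, not the real system:

```python
from dataclasses import dataclass

@dataclass
class TxnEffects:
    """What evaluation produces and what gets logged: intents, not data pages."""
    reads: dict     # key -> last-modified timestamp observed during evaluation
    writes: dict    # key -> new value, buffered locally (nothing sent yet)
    tentative_result: object = None

def life_of_a_transaction(coordinator, log, storage_replicas, query):
    # Stage 1 (coordinator): evaluate the query at a snapshot; purely local
    # to one replica, no global traffic.
    effects = coordinator.evaluate(query)

    # Stage 2 (log): append the effects to the replicated log. This is the
    # only step that needs cross-region consensus; it fixes the transaction's
    # position in the total order.
    position = log.append(effects)

    # Stage 3 (storage): every replica deterministically replays the log in
    # that order and independently reaches the same commit/abort outcome.
    outcomes = [replica.apply(position, effects) for replica in storage_replicas]
    return outcomes[0]  # all replicas agree, so the first answer is the answer
```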
All right, so: the life of a transaction. Consider your typical three-replica, three-partition cluster. A client comes in and sends a query request to a random node in a random replica — this happens to be replica number one — and that node adopts the coordinator role for this request. I failed to mention on the last slide that the coordinator role is a stateless, transient role. Right now, as implied in this graphic, the roles are effectively virtual nodes in the system; ultimately I see Fauna growing up into more of a service-oriented architecture, and this is a role we'd pull out.

The coordinator is fully responsible for evaluating FQL (or GraphQL). The first thing it does is choose a snapshot time: either the client passes in the timestamp of the last transaction it has previously processed, or the node itself uses its own clock — which we call the reclock — based on wall time minus a delay, which we'll get into later. Based on that snapshot time, it evaluates the query. Consider this query — let's imagine your typical bank transaction, where you're transferring some amount from one account to the other. The first thing we need to do is read the initial balances of those two accounts, and let's say in this example the documents representing those accounts happen to be on data partitions owned by nodes one and two in this replica.

I should point out that our data partitioning is essentially identical to the way Cassandra does it: each node in each replica has a number of tokens, those get distributed around a token ring, the tokens correspond to ranges, and keys of documents and index terms get hashed into that ring. That's how the system assigns ownership.

Back to the coordinator. The coordinator does very typical query-language interpretation and evaluates the reads. The reads come back, and they contain the value of the read at the snapshot time plus the last-modified time — the timestamp of the transaction that last modified that document or index term. Additionally, writes are buffered locally inside the query-evaluation context; there's no transmission of write effects at this point, it's all completely local. The result of query evaluation is a tentative result value — which we could potentially just send back to the client, depending on what happens later — along with a tuple of the read keys plus their last-modified timestamps, and the set of write effects that we want to apply to the system, if we can.
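A minimal sketch of what the coordinator's evaluation step produces for the bank-transfer example, under the description above. The helper names, the clock-delay value, and the `storage_read` signature are assumptions for illustration:

```python
import time
from dataclasses import dataclass

CLOCK_DELAY_MS = 500  # illustrative only; the real delay is a tuning detail

@dataclass
class EvaluationResult:
    tentative_result: object
    read_set: dict    # key -> last-modified txn timestamp returned with each read
    write_set: dict   # key -> new value, buffered locally during evaluation

def evaluate_transfer(storage_read, last_seen_ts, src, dst, amount):
    """Coordinator-side evaluation of the bank-transfer example.

    storage_read(key, snapshot_ts) -> (value, last_modified_ts) is assumed to
    be a read against the data partitions owning each document."""
    # Snapshot time: the later of the client's last-seen transaction and the
    # node's delayed wall clock.
    snapshot_ts = max(last_seen_ts, int(time.time() * 1000) - CLOCK_DELAY_MS)

    src_balance, src_mod = storage_read(src, snapshot_ts)
    dst_balance, dst_mod = storage_read(dst, snapshot_ts)

    if src_balance < amount:
        raise ValueError("insufficient funds")

    # Writes are only buffered here; nothing is transmitted until the
    # effects tuple is submitted to the log.
    return EvaluationResult(
        tentative_result={"transferred": amount},
        read_set={src: src_mod, dst: dst_mod},
        write_set={src: src_balance - amount, dst: dst_balance + amount},
    )
```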
All right. Now we're getting on to the log role, and it's worth pointing out that while there are essentially three stages in a commit, the only stage of transaction processing that requires any sort of global communication is what happens in the log stage.

The coordinator submits the transaction object — that effects tuple — to a random log partition. Currently, Fauna tries to send a transaction to the closest log leader: if there happens to be a log leader in that replica, it'll send it there; otherwise it'll choose one of the other replicas. The node with the log-leader role for a particular partition is responsible for batching transactions on a heartbeat and then submitting them to consensus. The heartbeat is literally just a thread that ticks every 10 milliseconds, grabs all pending transactions as a batch, and submits them to the log. It also chooses the wall time — I'll get into how we use time later — and writes out an essentially tentative epoch ID based on that wall time. Note that wall time can obviously go forwards or backwards, especially across a node restart, so we don't rely on the timestamps chosen at this point being ordered; that happens later, and I'll talk a bit about how we deal with it.

Anyway, the batch gets submitted to a global consensus process. We use essentially a modified version of Raft that allows for quorum-based acceptance, so that we don't have to bottleneck on the leader for acknowledgement, which saves us a hop in that global communication path. Once the batch has been acknowledged and committed by the consensus members, we proceed. The key thing is that durability is achieved at this point: transaction processing will proceed, and an outcome will be determined, even if the client times out and goes away at the top end of the process.

This stage is important to pay attention to, because I think it's core to how Calvin works. There are way too many arrows going on, so I didn't make them all black, but you'll notice there are some other red arrows in this log layer: while our partition one is ticking and generating batches of transactions for an epoch every heartbeat, so is every other partition leader in the cluster. You have all these log leaders ticking and publishing batches — which may be empty — every 10 milliseconds, and it's critical that the system continue to do this, because it's that process that actually drives time forward in the system and allows storage nodes to process the effects.
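A toy version of the log-leader behavior described here — batch on a 10 ms heartbeat, stamp a tentative, untrusted epoch from wall time, and publish even empty batches. The `consensus_commit` callback stands in for the modified-Raft quorum write and is not a real API:

```python
import threading
import time

HEARTBEAT_SECONDS = 0.010  # the 10 ms tick described above

class LogLeader:
    """Sketch of the log-leader role for one log partition."""

    def __init__(self, partition_id, consensus_commit):
        self.partition_id = partition_id
        self.consensus_commit = consensus_commit  # placeholder for the quorum write
        self.pending = []
        self.lock = threading.Lock()

    def submit(self, txn_effects):
        # Coordinators call this; the transaction just waits for the next tick.
        with self.lock:
            self.pending.append(txn_effects)

    def tick_forever(self):
        while True:
            time.sleep(HEARTBEAT_SECONDS)
            with self.lock:
                batch, self.pending = self.pending, []
            # The tentative epoch ID is wall-clock based and is NOT trusted to
            # be monotonic; final ordering comes from the log's total order.
            tentative_epoch = int(time.time() * 1000)
            # Empty batches are published too: the ticking itself is what
            # drives time forward and lets storage nodes make progress.
            self.consensus_commit(self.partition_id, tentative_epoch, batch)
```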
On to the storage layer. In original Calvin terms, everything in the log layer essentially falls into the sequencer role; this next part is the role of the scheduler. Each node in the storage role has to receive a filtered batch from each log partition for the given epoch it's currently processing, and it's important that it pulls together a complete set. If a batch is missed — for example, say one log node dies — then before the storage node can continue processing, it has to find a copy of that batch somewhere else. If we have time we can talk about this later, but one interesting point is that, for efficiency's sake, it's important to think of this process as a streaming pipeline of messages that get pushed all the way down; in the case of failure, the system falls back to essentially a request/response model, where it starts requesting missing batches if it doesn't receive them through the normal happy path.

Once a storage node has a filtered batch for a given epoch from every log partition, it's at that point that it can actually generate a transaction timestamp for each transaction it's going to apply. At the sequencing stage we take the transactions in that order and sequence them based on the keys each transaction touches: two transactions that don't have any key overlap can execute in parallel, while any two that do have key overlap are ordered by their transaction ID.

One interesting note: the literature on Calvin talks about reordering transactions based on, say, the number of covered keys, to try to increase parallelism. In practice it doesn't actually matter, because 10 milliseconds is such a small window that in the normal course of operations it makes more sense to just focus on the efficiency of the pipeline and on minimizing processing than to try to reorder transactions within that 10-millisecond window. Any contention there ends up being more of a problem; basically, the reordering doesn't buy you anything, because you can't reorder outside that 10-millisecond boundary anyway, and since the rest of the process is pipelined, you're going to end up waiting on contention whether you can reorder or not. Reordering inside a batch is of limited usefulness.

How long did it take you to figure that out? Did you try to implement it and say "this sucks, let's not do it," or did you just do the math and it didn't work out?
Yeah — first of all, we didn't do it, and then we decided it wasn't worth it, based on the throughput numbers we were getting. We did some experiments there, and honestly we could probably explore it a little more, but I would say the initial implementation of any sort of reordering — the analysis of the pipeline — actually caused it to be slower than just shoving everything through as fast as you could.

I also imagine it's somewhat workload-dependent, so I'm guessing that for the workloads you see from real production applications, it doesn't matter.

Yeah, exactly. Of more concern is something like one hot key that a lot of transactions involve — that's the more dominant case. The kind of workload reordering would help with is where you have one big bulk transaction running through and you're trying to interleave a bunch of small ones with it; in practice those are rare enough that it's not really worth considering, and there are better ways to handle it, like breaking up a large transaction into smaller sub-transactions and interleaving them with the other requests going on in the system, at a higher level.

Got it, thanks.

All right. Once we've gone through sequencing, storage nodes can move on to actually applying the transaction effects. In our example we had the two-document transaction: we have to recheck the last-modified time of each document to make sure it hasn't been modified between when the coordinator originally read it and determined the tentative write effects, and this check happens in serial order. In other words, we're using Calvin to perform optimistic concurrency control. If none of the documents or index terms has been modified between the read and this point in transaction processing, then we haven't violated serializability, we're allowed to apply the effects, and the writes are applied to storage.

The coordinator only waits for one response: the first storage node that finishes processing its copy of the transaction sends back, "here's the transaction timestamp that was actually generated, and here's whether or not we could actually apply the effects." Based on that, the coordinator can either retry the transaction — we have a limited amount of transparent retry in the face of contention — or, if the overall deadline has been exceeded or there's too much contention on one of the keys involved, we kick a failure back to the client.
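Sketched out, the deterministic check-and-apply step at each storage node might look roughly like this (plain dicts standing in for the versioned store and the last-modified metadata; this is my reading of the description, not Fauna's code):

```python
def apply_transaction(txn_id, effects, last_modified, store):
    """Deterministic OCC check-and-apply at a storage node.

    `effects` is the (read_set, write_set) tuple the coordinator logged;
    `last_modified` maps key -> timestamp of the last transaction that wrote
    it; `store` is the node's local copy of the data. Every replica runs this
    same logic in the same order and reaches the same verdict."""
    read_set, write_set = effects

    # OCC validation: if anything we read has since been rewritten, the
    # coordinator's buffered writes may be based on stale values -- abort.
    for key, observed_ts in read_set.items():
        if last_modified.get(key, 0) != observed_ts:
            return (txn_id, False)  # reported back so the coordinator can retry

    # Validation passed: applying the buffered writes cannot violate
    # serializability, so install them and stamp them with the txn id.
    for key, value in write_set.items():
        store[key] = value
        last_modified[key] = txn_id
    return (txn_id, True)
```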
So, do customers have to specify where the replicas are located ahead of time, or do you just always replicate? How does that work?

Right now, no — customers don't have any control, and everything is geo-replicated. How documents are distributed across the nodes within a replica is also completely abstracted. There is some control over how indexes are partitioned, but one of the interesting things we've found is that users just don't care. There's a calculus here — and this is a work in progress; I don't think we've fully gotten it right — where we provide an API that disincentivizes users from caring about how their data is laid out until they really get into the performance weeds, and more often than not we end up caring before they do, because we want to maintain the resilience of the system. So I don't necessarily have a good answer on how much control we should give, versus how we incentivize users to lay out their data as efficiently as possible.

Maybe you'll get into this, but what percentage of your customers are hitting the database from an application that's always in the same location? As opposed to, say, one day you see a bunch of requests coming from Europe and the next day they're coming from South America, for the same database. How often does that happen?

We don't really see that as much. Applications tend to be deployed in two different models. One is more traditional, where someone is talking to Fauna from their back end. But one of our major customers has deployed Fauna for content personalization for a CDN, so their end users' browsers are hitting the service directly, and in that case, for that one application, we get requests from everywhere. Those are the two workloads we tend to see. Sometimes there's a mix too, and we encourage it: if you have some complex compute, that's a good thing to put in a server or a serverless environment — behind AWS Lambda or whatever is most convenient — and then of course it's an option to hit the database directly from your client if that makes sense.

Cab, you have a question? Why don't you unmute yourself.

Yeah, a question about the chart. I see replicas one, two, and three; replica one is on nodes one, two, and three. Are replica two and replica three also on nodes one, two, and three, or are they on four, five, six, seven, eight, nine?

Oh yes, sorry, that's an oversight on my part. Yes, they're on nodes four through nine, so it's a nine-node cluster.
And for the purposes of discussion, consider replicas one, two, and three as being geo-distributed.

So what is the average commit time for a transaction in Fauna, across your entire fleet, in this environment?

It's highly dependent on the topology, but right now, globally, I think it's around 150 milliseconds on average. It's dominated by that Raft commit process. One interesting thing: one of the surface disadvantages of Calvin is that it has this notion of an epoch, so there's a latency floor on transaction commit where you're literally just waiting for the next batch to tick over — on average, about five milliseconds. The trade-off is that in order to eliminate that, we'd have to go with something like a two-phase-commit process, and that has its own disadvantages in terms of the overall impact on the system. I'll flip over to the next slide, but for comparison, a read-only transaction at serializability only ever has to hit one replica — even for the same key, the same document, or the same set of documents, you're going to get a serializable answer whether you go to replica one, two, or three. Of course you may get stale reads — you won't necessarily see the latest effects — but you get that serializability guarantee, and you can opt into higher levels if you want. I have a slide on this, so I'll just say it now: you can also pass the read token across clients to get explicit coordination, if you do have a case where an actor is bouncing from one replica to another.

I think we have a question from Alexei.

Sure, hi. I would like to know if you support interactive transactions — like select from table one, then maybe select from table two, and then write to table three, in different statements and in different requests.

That's a good question. We currently don't. I would say that's partly a design choice and partly a lack of resources at this point, but there's no architectural reason why we couldn't support them; it would just require a very different client interaction model from what we currently support, which is very request-oriented. For example, I've talked about SQL support, which I expect we'll get to eventually, and of course we'll have session transactions at that point. One interesting aspect, though, is that the nice thing about non-interactive transactions is that there's less leakage in the abstraction. For example, transparently retrying in the face of contention is something you can't really do with session transactions, because the application code ends up being aware of those retries.
It's interesting, because one of the things I've learned talking about Calvin off and on over the years is that there's a strong notion in the community that Calvin doesn't support dependent transactions, or non-deterministic transactions. The reality is that, yes, the core protocol doesn't, because you have to know your read and write sets ahead of time, but there are pretty straightforward ways around that. One of them is covered in the paper: using reconnaissance reads to discover the read and write set, and then committing on the condition that it hasn't changed.

Okay, so that's a brief rundown of what the protocol itself is. Now I want to talk about some interesting tidbits and details that I touched on in passing with the chart, but that are worth highlighting.

The first one: we perhaps blithely talk about timestamps — snapshot time, transaction time, and so on. The reality is that timestamps in Fauna are not timestamps; they're logical transaction IDs. They're generated by the log process, and they correspond to the transaction's position in the total order of the system. However, because we're trying to generate those epoch indexes as close to real time as possible, there ends up being a best-effort correlation with real time, which is very convenient for app developers and for ourselves: we can use them as timestamps, because they're close enough, even though in actuality we're providing a harder guarantee with our timestamp-slash-transaction-ID generation process.

The second thing — I touched on this too — is that Fauna provides some nice options for tunable consistency. The baseline is serializable reads. Stale reads can be a problem, but there are ways for users to deal with that: we also guarantee read-your-writes, or session consistency, by default, and there are options to gain back stricter guarantees, all the way up to full strict serializability, based on what's going on in your application. What we found is that strict serializability for writes and serializability for reads is the right trade-off between latency and consistency, because most of the time — especially for the applications we had in mind, your typical 90-percent-read, 10-percent-write application workload — you want reads to be fast, and you want writes to be interactively fast; but a system that biases toward predictable, stronger consistency for writes is susceptible to fewer application-level bugs, whereas you can be a bit more relaxed on reads.
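As an illustration of how that last-seen timestamp (the read token) can surface to applications, here's a hypothetical client wrapper — not the real Fauna driver — that carries the token along to get monotonic session reads, and can hand it to another client for explicit coordination:

```python
class SessionClient:
    """Hypothetical wrapper showing the read-token idea, not Fauna's driver."""

    def __init__(self, send_request, read_token=0):
        self.send_request = send_request  # assumed transport: (expr, min_snapshot) -> (result, txn_ts)
        self.read_token = read_token      # highest transaction timestamp observed so far

    def query(self, expression):
        # Ask any replica for a snapshot at least as new as what we've already
        # seen, which gives monotonic (session) reads without global coordination.
        result, txn_ts = self.send_request(expression, min_snapshot=self.read_token)
        self.read_token = max(self.read_token, txn_ts)
        return result

    def handoff(self):
        # Pass this token to another client (another service, another browser
        # tab) when an actor hops replicas and needs explicit coordination.
        return self.read_token
```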
The model the application developer has to keep in mind is one that's very similar to what you get in a concurrent system where you're dealing with multi-threading.

Again, as I said before, Calvin at its core doesn't support non-deterministic transactions, but it's very straightforward to layer them on top. The baseline in our system is using Calvin for optimistic concurrency control. One of the interesting things — we haven't really taken advantage of this yet, and I'm looking forward to being able to — is that there's a large space for optimization within the context of the protocol to avoid the contention bottlenecks of OCC: by performing query analysis to determine when you have a static read and write set, or by providing the application developer specialized functions they can opt into — specialized write effects that don't involve the OCC mechanism.

I think the Dynamo guys are pushing that approach as well. I don't know if they do query analysis, but there's a way to sort of declare the read/write set ahead of time. You're basically proposing that, right?

Yeah. I think the fast way to do it is to expose it to the developer: provide a special function and say, "this is your atomic-operation function; you can do whatever you want here, but you can't escape this little box, and whatever you do in this box gets pushed directly to storage nodes and executed." I'd like to take it a little further and see what we can do via static analysis to determine it ahead of time, but there's only so much that's possible, and I do want to be careful that we make those trade-offs explicit to the user rather than being too magical — especially since the difference between a transaction that goes through OCC in a highly contended environment and one that can rely on the more pessimistic, locking nature of Calvin is pretty stark, so we want to make that trade-off clear. And I believe FoundationDB is adding some of these enhancements too: they're adding atomic operations to push targeted effects closer to storage and reduce coordination in the system.
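To sketch the "little box" idea: if the write effect is a small, self-contained operation logged up front, every storage node can execute it deterministically against whatever the current value is, so there's no read snapshot to re-validate and no OCC abort under contention. The operation format here is entirely made up, and Fauna's actual feature may look nothing like this:

```python
def apply_logged_operation(op, store, last_modified, txn_id):
    """Apply a pre-declared atomic operation at a storage node, bypassing OCC."""
    key = op["key"]
    if op["kind"] == "increment":
        # Commutes with itself, so concurrent increments never conflict.
        store[key] = store.get(key, 0) + op["amount"]
    elif op["kind"] == "put":
        store[key] = op["value"]
    else:
        raise ValueError(f"unknown operation kind: {op['kind']}")
    last_modified[key] = txn_id

# Example: two contended increments to the same counter both apply cleanly,
# where the OCC path would have forced one of them to retry.
store, last_modified = {}, {}
apply_logged_operation({"kind": "increment", "key": "page_views", "amount": 1},
                       store, last_modified, txn_id=101)
apply_logged_operation({"kind": "increment", "key": "page_views", "amount": 1},
                       store, last_modified, txn_id=102)
assert store["page_views"] == 2
```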
Okay, so on to some interesting bits about Fauna's assumptions about storage, which I think are worth pointing out. One of the pitches of Calvin — which was certainly very appealing to me, coming at this not through databases but through distributed systems and applications — was this notion of modularity. When you read Calvin, it's an extremely elegant paper, because it completely separates out the log replication function from storage, and builds the system on top using black-box abstractions for each of those components. However, what we found at the storage layer is that by assuming a bit more, we made the system overall more resilient.

The first assumption was relying on storage to be MVCC. The key benefit is that in vanilla Calvin, replicas proceed essentially in lock step: a node can't fall behind, because if it does and the state of its peers gets too far ahead, the node has no way to catch up — the state it relies on for reads goes away or gets replaced by newer versions. So in Calvin it's pretty clear that the unit of failure is the entire replica: if one node goes down in a replica, effectively the whole replica has to halt until that node is recovered or its work fails over. By relying on MVCC at the storage layer, nodes can advance based on the most recent state of data in the system — potentially reading from other replicas and so on — while the existing history for each key allows any stragglers to catch up. One disadvantage of Calvin, at least as described in the paper, is that the slowest node in the system ends up being the bottleneck for transaction processing. In Fauna we can eliminate that by being able to fail over reads at a granular level to other replicas, and then let stragglers catch up.

The second interesting assumption was requiring transaction-effect application to be at-least-once — in other words, idempotent. The nice thing about that is that, because we also gate reads on our notion of the last applied transaction, writes to storage itself need to be neither durable nor atomic. A transaction can partially write and then the node can fail — it can even persist some of that to disk — but since we never actually published that transaction as fully, successfully applied, that partial data is unavailable to read. The next time the node comes back up and resumes transaction processing, as long as it can reapply those partial effects safely, we end up with a correct copy of the data at the end, and we haven't produced any corrupt reads or anything like that. And of course the nice thing is that checkpointing can be totally async and opportunistic, based on memory pressure and load on the system, and there's no need for any sort of local write-ahead log or local durability. Our demands on the storage implementation itself are much lighter in that sense; the log layer is essentially the global write-ahead log for the entire system.
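A toy MVCC store showing why, under those two assumptions, local writes need to be neither durable nor atomic: partially applied effects stay invisible until the transaction's watermark is published, and re-applying them after a crash is harmless. This is an illustration of the idea, not Fauna's storage engine:

```python
class VersionedStore:
    """Toy MVCC store with a visibility watermark instead of a local WAL."""

    def __init__(self):
        self.versions = {}        # key -> list of (txn_ts, value), ascending
        self.applied_through = 0  # watermark: last fully applied transaction

    def apply(self, txn_ts, writes):
        for key, value in writes.items():
            history = self.versions.setdefault(key, [])
            # At-least-once: re-applying the same transaction after a crash
            # simply skips versions that already landed.
            if not history or history[-1][0] < txn_ts:
                history.append((txn_ts, value))
        # Only once every write has landed does the transaction become visible.
        self.applied_through = max(self.applied_through, txn_ts)

    def read(self, key, snapshot_ts=None):
        # Never serve past the watermark; older versions are what let a
        # straggling replica keep serving reads while it catches up.
        limit = min(snapshot_ts, self.applied_through) if snapshot_ts else self.applied_through
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= limit:
                return value
        return None
```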
All right, so that's what I had on Fauna itself, at least the core technical details. What I wanted to go into next is to take a little step back and talk more generally about some of the patterns we found important for building the system to be resilient — things that go beyond the need for a protocol that guarantees consistency and liveness.

The first theme is that it's a bunch of small things, applied to as many parts of the system as possible, that add together to make your system more resilient. The first one is making sure the unit of failure is as small as possible. Like I said, in original Calvin one of the things we wanted to improve on immediately was this notion of failover happening at the replica level. And not to call out FoundationDB, but in FoundationDB, if you lose a node, the entire system has to reconfigure itself in order to continue processing. Those systems don't have any single points of failure, but it increases the operational burden a bit when the failure of a single node in the system can amplify out to the rest of it. Simple things like being able to restart a process, or doing a rolling upgrade, become a lot easier if your unit of failure is as granular as possible. We spent a lot of time operating eventually consistent systems like Cassandra, and Twitter in the pre-Manhattan days, which was sharded MySQL, and that aspect of eventually consistent systems — that failure is very granular — was extremely advantageous, so we wanted to maintain it as much as possible.

The second thing that has been pretty meaningful is avoiding failover itself as much as possible. The interesting thing about Fauna is that the only component in the system with any notion of failover is consensus, where we have leaders and leader election, because it's based on Raft. There's no getting around that, but I think we've minimized it as much as possible, and every other part of the system has no failover and no recovery mode in the course of normal operations. What this means is that the system running in a degraded state and the system running in an optimal state have profiles that are as close together as possible, which again eases the operational burden.
The third point: when you've been working in distributed systems for a while, you very quickly learn that retry ends up being your enemy, because if it's not controlled, it amplifies load exactly when the system is under duress. So we've pervasively applied a strategy of using hedged requests, and scorekeeping internally, so that the system can tolerate gray failures without producing those spikes in internal load when failure happens. And of course back pressure is important. Especially in a system that's pipelined like this, it's really important for every component of the pipeline to be able to push back on what's upstream of it, in order to keep itself from being overloaded.
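For reference, the hedged-request pattern mentioned here looks roughly like this — a generic sketch in the spirit of "The Tail at Scale"; the replica objects, hedge delay, and timeouts are placeholders, and this isn't Fauna's internal code:

```python
import concurrent.futures as cf

HEDGE_AFTER_SECONDS = 0.05  # assumed tail-latency threshold before hedging

def hedged_read(replicas, key, timeout=1.0):
    """Send the read to the preferred replica; if it hasn't answered within
    the hedge delay, fire the same read at one backup replica and take
    whichever returns first. Unlike a retry loop, this bounds the extra work
    to a single duplicate request instead of amplifying load under duress."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replicas[0].read, key)]
        done, _ = cf.wait(futures, timeout=HEDGE_AFTER_SECONDS)
        if not done and len(replicas) > 1:
            # Hedge: duplicate the request to a backup replica.
            futures.append(pool.submit(replicas[1].read, key))
        done, _ = cf.wait(futures, timeout=timeout, return_when=cf.FIRST_COMPLETED)
        if not done:
            raise TimeoutError(f"no replica answered for {key!r}")
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)  # don't hold the caller hostage to the straggler
```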
The second theme, I would say, is that there will be bugs, despite our best efforts. We spent a ton of time and a lot of work getting Fauna to work in the context of Jepsen; we have a Monte Carlo-based testing harness for our consensus algorithm, and so on. You test, you follow all the good engineering practices, but the reality is that there are still bugs in the system, and we have to be resilient to them. I was looking for things to confirm my biases here, and it made me recall a paper that showed up in The Morning Paper a while back — an empirical study of the correctness of formally verified systems. The conclusion was interesting: obviously, formally verified systems have far fewer bugs than those without any formal verification, but there were still some pretty critical bugs that showed up in the shim parts of the system that weren't under formal verification. And you have experts writing these systems, all working very hard to make them as correct as possible; the reality is that bugs slip by, and oftentimes they're things that feel very obvious in hindsight, where a mistaken or misplaced assumption leads to a critical bug.

The other thing I'd say about this is that it's the logic bugs. The issues we've had in our system have usually not been cases where something failed in a weird and wonderful way — we lost a node, or we hit some hardware failure we'd never seen before. It's almost always been an edge case that got through testing, where the failure-handling aspects of the system didn't work. And that's because logic bugs tend to be correlated: if you have a bug that affects your encoding implementation, it's going to affect every partition that touches the poisoned key, or something like that.

If I were to take away any lesson that I wish I'd internalized earlier, it's this: the actual reliability of your system depends largely on how bug-free it is, how good your monitoring is, and how well you protect against the issues and problems it does have. That's from a great post by Jay Kreps — it's a few years old at this point — and his major point is this notion of defense in depth: investment in operations ends up being, in some ways, your biggest bang for the buck. So aside from testing, the strategy for making these systems more resilient essentially comes down to good operational practices: investing in robust observability, for one. A big one that we learned late — I wish I'd learned it earlier in my career — is this notion of reducing the blast radius of failures in the system, and reducing unnecessary coupling as much as possible, because that way, if something does happen, you're not affecting your entire customer base, you're only affecting a subset. So there are things outside the core system you build that help increase resiliency, and I think that's an important one. The last thing I'll say is about detecting permanent failure — something I think was popularized by Dynamo and Cassandra — this notion that there's always a potential for permanent failure, so it's important to build the functionality to detect it, report it, and then ideally correct it, as best you can, in an automated fashion.

Are your observability tools all in-house, or are you relying on open-source packages? Is it a mix of things? Is there any one sort of library you've found to be most effective or most important?

Oh yeah — there are some really good services these days. Datadog is the obvious one, and we use Datadog. It started off as just basic metrics reporting, and we'd bounced around from our own homegrown stats-collection framework to a couple of services before landing on Datadog. Not to end this on a Datadog plug, but it's really become a whole suite of observability tools — they have metrics, logging, distributed tracing, and so on — and it's based on open APIs, or close to open APIs, so it's pretty easy to use them or an equivalent service, or to mix and match. They've been a great one-stop shop; it would be my go-to choice if we were to do this again, but there are plenty of other options out there too.
The other aspect, which I don't really have a good answer for, is that some of the practices around what you monitor, what you log, and so on are things you just learn over time. I don't know of any good resources that provide an opinionated framework for that — but I'm also not the most up to date, so that's fair. Anyway, that's all I had.

Okay, so at this point, any more questions? I will clap on behalf of everyone else, because it's a pandemic and we're over Zoom. I'll open the floor if anybody has additional questions.

So, sorry, I have a question because it's nagging at me. You said you don't see benefits from reordering transactions in practice, which is kind of surprising to me. Do you support secondary indexes, and especially unique indexes, in any way?

Yes, we do support unique indexes.

Okay. What I find strange is this: suppose multiple transactions are batched into a 10-millisecond batch. If you don't support reordering, that means your pipeline could stall, because if applying a transaction involves checks and reads across nodes, then without reordering you'd have to wait for reads from other nodes before applying the effects of the next transactions, or even before starting to read the data for them. So what I find surprising is that if you don't support any reordering and just perform transactions in their coordinated order, you'd lose latency between these steps. Don't you see this problem in practice?

In practice, no, and I'd say for two reasons. The first is that you only gain the benefits of reordering when transactions have overlapping key sets: if two transactions don't overlap, they can execute as quickly as possible. And the way Fauna works, we actually pipeline transaction execution across epochs too, so storage nodes are racing as far ahead in the transaction pipeline as they can, and as long as there's no overlap, the system parallelizes transaction application as much as possible. On the flip side, the reason I said we didn't get any benefit out of reordering wasn't that ordering couldn't help in certain cases; it's that the 10-millisecond window in which you can reorder transactions is small enough that there's just not a lot of wiggle room. So even if we were to implement reordering, there's not much you can do, and in practice the compute cost of doing the analysis and then doing the reordering within that window wasn't worth it — it was better to just keep transactions moving as quickly as possible.
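That scheduling rule — no key overlap means parallel, overlap means log order — can be sketched as grouping an epoch's transactions into connected components over their key sets. This union-find version is just an illustration of the rule, not Fauna's scheduler:

```python
def schedule_epoch(transactions):
    """Group an epoch's transactions by overlapping key sets.

    `transactions` is a list of (txn_id, keys) in log order. Transactions in
    different groups touch disjoint keys and can be applied in parallel;
    within a group they must be applied in txn_id (i.e. log) order."""
    parent = {}

    def find(x):
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Union every transaction with each key it touches; transactions sharing
    # any key end up in the same connected component.
    for txn_id, keys in transactions:
        for key in keys:
            union(("txn", txn_id), ("key", key))

    groups = {}
    for txn_id, keys in sorted(transactions):   # keep log order within groups
        groups.setdefault(find(("txn", txn_id)), []).append(txn_id)
    return list(groups.values())

# Example: t1 and t2 share account "b", so they form one ordered chain, while
# t3 is independent and could be applied in parallel with that chain.
print(schedule_epoch([(1, {"a", "b"}), (2, {"b", "c"}), (3, {"x"})]))
# -> [[1, 2], [3]]
```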
That's not to say there aren't things that are woefully inefficient about our implementation that we need to improve — it's something I'd expect us to continually revisit — but it wasn't a benefit we saw in practice. One thing that might change that is that as we push further and further into atomic operations, we might see more benefit there. But at the very least, with the OCC mechanism we have, the most significant source of contention ends up being things like writes to single documents.

Is that because the latency between nodes is small enough, or is there something else?

Yeah — the latency between nodes within a replica, for transaction processing, is low, and that's the way the system is designed: the reads required at that final step of transaction application should be fast, to maintain throughput. I didn't get into it, but it's talked about a bit in the Calvin paper: one of the strategies for keeping throughput up here is making sure those reads are as fast as possible. One strategy the paper discusses is pre-warming caches — sending reads ahead of time to pull transaction data off disk and make it available for these fast, critical reads — and I think techniques like that reduce the latency you incur in that step. Also, because these reads can be fed off the pipeline itself, they're transmitted as soon as a node knows that one of its peers needs a piece of data in order to process some transaction; it doesn't need to wait for that node to request it. So that streaming aspect of this part of the system also helps minimize latency here.

Okay, awesome.

We're out of time — thank you for doing this. One of my students, Abby, requests that you send videos and pictures of your dogs; she's very adamant about this. Abby, where do you want him to send those things?

I don't know, but I heard them, so I was like: please, dogs.

Okay, I will do that. One of them is a corgi-lab mix and he's ridiculous.

All right, that sounds good. Okay Matt, thank you for doing this, I really appreciate it.