I'm extremely happy today to have Kyle Kingsbury from the Jepsen Project. Jepsen Project, a Jepsen company? What are you? Both? From Jepsen.io, how about that. Kyle did his undergraduate in physics at Carleton College, and in 2013 he founded the Jepsen Project. It's sort of my personal opinion that Kyle does the highest quality research in distributed systems, concurrency control, and consensus protocols outside of academia today. His write-ups on the distributed databases that have hired him to basically torture their systems with his tool are phenomenal. Look, if I had the time and the energy, I would teach a seminar course on just reading these write-ups. That's not going to happen, alas, but it's amazing work. So Kyle is actually somebody I've been wanting for a long time to come give a talk at Carnegie Mellon. It never really fit into the categories of seminars we've been having. This upcoming fall we were going to do distributed systems, and Kyle would have been a showcase speaker for that, but here we are: it's a pandemic and it is what it is. We're super happy for Kyle to be here. So the way we will do this is that everyone should remain muted, and if you have any questions, interrupt Kyle at any time, but be sure to say who you are and where you're coming from, so that everyone knows who everyone is. And so, before we begin Kyle's talk, does anybody have any questions? Okay, so with that, go for it, Kyle.

Okay, hello everyone. My name is Kyle Kingsbury and I do freelance database safety verification. This is some research that I'd like to talk about, which has been a collaboration with Peter Alvaro from UC Santa Cruz, a longtime friend and mentor. This project is called Elle, and its job is to verify whether or not databases are serializable, or repeatable read, or snapshot isolated, and to do so in a way which is actually practical. We'll talk about why that's hard in the course of the talk. This talk was originally given last fall at High Performance Transaction Systems (HPTS). Those of you who are English speakers may notice that there's something wrong with this word, "transaction." In fact, there's a little bit missing from the letter A, and the I and the T have switched places. Now, we as humans are really good at looking at collections of letters and deciding: is this a legal word or not? We can spot these transpositions, these illegal sequences, because our brains have some model of how letters are supposed to fit together. Wouldn't it be cool if you could do that for database transactions automatically?

For instance, one of the classic problems in my life is that somebody comes to me with a database. They unbox it from DBs-R-Us, they get all excited about the fancy packaging, they open up the manual and it says it's serializable, and they say: well, is it? And I say: well, I don't know. We should try it out and see. But how do you actually try a database? How do you actually find out if it is serializable, or snapshot isolated, or has some other property? One way to do that is to have yourself and a bunch of your friends coordinate to make requests against the database. So I could issue some requests and get some responses; concurrently, someone else could issue requests and get responses. And then we'd all write down what we saw and combine those operations into some sort of data structure, like a concurrent log over time.
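(To make that concrete, here is a minimal sketch, in Python rather than the Clojure that Jepsen and Elle actually use, of what one recorded transaction in such a log might look like. The field names here are my own invention for illustration, not Elle's actual schema.)

```python
from dataclasses import dataclass, field

@dataclass
class Txn:
    """One observed transaction: the micro-ops we sent and what came back."""
    id: str
    process: int            # which client/session issued it
    invoked_at: float       # wall-clock time the request was sent
    completed_at: float     # wall-clock time the response arrived
    ops: list = field(default_factory=list)  # e.g. ("r", "x", [1, 2]) or ("append", "x", 3)

def concurrent(a: Txn, b: Txn) -> bool:
    """Two transactions are concurrent iff their time windows overlap."""
    return a.invoked_at < b.completed_at and b.invoked_at < a.completed_at
```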
So here time flows down. We've got our four participants, our four actors, across the top of the screen. The first participant issues T1, while the third and fourth issue T4 and T6, and so on and so forth. Now, with this structure, we know when every transaction began and completed, which gives us the concurrency windows: which operations are concurrent and which aren't. It also tells us the per-process order, so we know that T1 preceded T2. And we know what transactions happened, at least from the perspective of outside users of the database: we know every RPC that we issued and every response that we got back.

Now, in order to figure out if the history is serializable, we would need to find some path through all those transactions that completed successfully, such that along that path, the transactions look like they executed in linear order. So I might try to find T3, T4, T5, T6, T7, T2, T1; maybe that's an order that makes sense. You'll notice that we're jumping between timelines all over the place. That's okay, because serializability doesn't actually say anything at all about what time you execute a transaction. It just says that you have to find some order of transactions. So this is fundamentally an ordering problem: we're looking for an order in which things are legal. And the trivial solution to serializability verification is simply to write down every single transaction you observed, in every possible order, and then to examine each one of those orders and try it, one by one. So: can I execute T1 against the empty state? No, let's cross that off. Now let's try T2, T1. Can I do that? Well, sure, T2 applies in the empty state, and maybe T1 applies on top of that. And then when I get to T3 I decide: oh, shoot, I made a mistake. Maybe T3 reads some value that T1 also wrote, and they don't see the same thing. That implies that T3 can't have happened right after T1, and we have to cross that off. So we keep going for a while, until we've found some possible order in which all the transactions could be executed. And if that's legal, we know: ah, there is a serial execution of these transactions, this history is equivalent to that order, and it is therefore serializable. Problem solved, right?

The problem, of course, is that the number of permutations of these transactions is n factorial. Well, not exactly, but it scales that way. And n factorial is a really bad number for those of us who are computer scientists. It's the sort of number that makes you want to leave the field, in fact.
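(A toy sketch of that factorial brute-force check, in Python, assuming transactions are just lists of register reads and writes. This is purely illustrative; no real checker is built this way, for exactly the reason just given.)

```python
from itertools import permutations

# A transaction is a list of micro-ops: ("w", key, value) or ("r", key, observed-value).
def legal_in_order(txns):
    """Replay txns serially from the empty state; True if every read matches state."""
    state = {}
    for txn in txns:
        for op, key, value in txn:
            if op == "w":
                state[key] = value
            elif state.get(key) != value:  # a read that saw the wrong value
                return False
    return True

def serializable_brute_force(txns):
    """O(n!) search: is there ANY serial order in which all reads make sense?"""
    return any(legal_in_order(order) for order in permutations(txns))

# Example: T2 must precede T1 for T1's read of x = 1 to make sense.
t1 = [("r", "x", 1), ("w", "y", 2)]
t2 = [("w", "x", 1)]
assert serializable_brute_force([t1, t2])
```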
So, not to be deterred, I built another tool for solving this problem, called Gretchen. It's a library that was influenced by a technique described by Cerone, Bernardi, and Gotsman in a 2015 paper. They say: instead of trying to find all the orders, let's decompose serializability into three properties. There's an internal consistency property, which says that inside of each transaction, you need to observe values that are consistent with your own prior reads and writes. There's an external consistency requirement, which says that you need to see the thing that was written by the previous transaction, or the one before that, or the one before that: whichever one most recently wrote some value, that's the one you need to see. And there's total visibility, which says that the order of transactions needs to be a total order; it can't be partial.

Now, we can solve internal consistency in linear time. We just take each transaction, play its pieces forward one by one, and see: do you read the values that you wrote earlier, or that you read earlier? For external consistency, we have to build some sort of dependency graph between transactions. If I read the value three, I know that it had to come from some other write of three. So I'll build this read-dependency constraint problem, I'll express it in conjunctive normal form, and then I just need the constraint that the order I'm solving for is total. This sounds a lot like a constraint programming problem, so I'm going to pop this whole sucker into Gecode and hopefully it will spit out an answer. So, to tell if transaction one precedes transaction two: if I see a read of x = 3 in transaction two and a write of x = 3 in transaction one, I know that transaction one had to execute before transaction two in the order. That's a constraint I can use to reduce the search space. But if there are two transactions that write x = 3, well, now I have a problem: I can't tell which one of them came before. So I have to generate this sort of gross Boolean expression, like: either T1 is less than T2, or T3 is less than T2, and you don't know which. So this becomes a kind of ugly hypergraph problem. The other issue is that even when it's not that ugly, it's essentially reducible to SAT solving, and that means it's NP-complete. So this approach works, but it breaks down after about a hundred transactions or so. It also has this terrible feature that even when it tells you yes or no, the history is serializable or not, it tells you nothing about why. If it finds a solution, it can spit that out. But if it doesn't find a solution, doesn't find an order, it just says: I couldn't solve these constraints. Sorry, boss, it's your problem.

So, are you aware of anybody, any company or major data system, that was using a SAT solver to do exactly what you're proposing here? Like, you could do the transactions in windows, so that you don't have to look at your entire schedule.

For serializability, you would technically have to look at every transaction over the entire history, because there's no time constraint. For strict serializability, yeah, you can box it with time windows, and then it scales with process concurrency. The problem is that as processes crash over time, concurrency rises, because those crashed processes are essentially concurrent for all time. And yes, I should note that there are two serializability-solving problems here. One of them happens inside the database, to provide serializability: you're checking for potential conflicts, right? And that problem is a lot easier to solve than external verification, where we're trying to look at the whole history and decide if it's serializable or not. I'm talking about verification from the outside of the database.

Yeah, got it. Keep going.

So let's try something else. If we can't solve this problem directly via SAT, maybe we can try cycle detection. If you can find a cycle in the transaction dependency graph, then you know that there cannot possibly be a total order. And serializability is defined in terms of the existence of some order of transactions. So let's go ahead and try to solve this related problem instead.
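(A minimal sketch of that reformulation, assuming for the moment that we somehow already know the dependency edges; the graph shape here is hypothetical.)

```python
def find_cycle(edges):
    """Depth-first search for a cycle in a dependency graph.
    edges: dict mapping txn -> set of txns it must precede.
    Returns a list of txns forming a cycle, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in edges}
    stack = []

    def visit(t):
        color[t] = GRAY
        stack.append(t)
        for nxt in edges.get(t, ()):
            if color.get(nxt) == GRAY:           # back edge: found a cycle
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[t] = BLACK
        return None

    for t in edges:
        if color[t] == WHITE:
            found = visit(t)
            if found:
                return found
    return None

# T2 -> T4 -> T3 -> T2 is a cycle, so no serial order can exist.
g = {"T1": {"T2"}, "T2": {"T4"}, "T4": {"T3"}, "T3": {"T2"}}
print(find_cycle(g))  # ['T2', 'T4', 'T3', 'T2']
```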
For instance, say I have these two transaction graphs, where T1 happens before T2, T2 happens before T4, and T3 happens before both T4 and T2. The transaction cluster on the left is serializable, because I can execute T1, T3, T2, T4 and not contradict any arrows. But the one on the right is not serializable: there exists a cycle between T2, T4, and T3, and that means there can never be a total order which doesn't contradict the dependency graph at some point. So humans can look at these graphs and tell where the cycles are, and computers can do that too, right? So let's try to map our problem into this world of dependency graphs.

Our secret weapon is a thesis and a paper from Atul Adya, circa 1999; the paper was a collaboration with Barbara Liskov. It defines the usual SQL transactional isolation levels in terms of dependency graphs and cycles over those graphs. There are three types of transaction dependencies to worry about. One of them is a write-read dependency: transaction one writes some value of x, call it x_i, and transaction two reads x_i. Well, then we know that transaction two had to come later, in order to observe that value x_i. The second option is a write-write dependency, where transaction one writes x_i, transaction two writes x_j, and x_j replaces x_i in the version history. This version order is a construct that Adya introduces; it's sort of a priori knowledge that version one precedes version two. But we don't actually have that information from the outside, so this is something we'll have to come back to. There's also a read-write dependency, a so-called anti-dependency, where transaction one reads some value and transaction two installs the very next value in the history. And that, too, relies on this notion of a version order. So if we want to get these dependencies, we have to get the version order out of the database somehow, even though it's invisible, and even though it might not really exist: there's no requirement that the database know what this order is. We just need to prove that one could exist, in order to show that the history is serializable.

Now, if you do have these transactions and these dependencies, you can generate what's called a direct serialization graph, comprised of write-write, read-write, and write-read edges, and we can then solve for cycles. If we see a cycle where every edge is a write-write edge, writes after writes, we call that a G0 anomaly, and this is something that basically every SQL isolation level is supposed to prohibit. And in fact there are more of these definitions. We have G1a and G1b, which are non-cyclic anomalies: that's where you read a value that was written by an aborted transaction, or read a value from the middle of a transaction that was later replaced. If there's a cycle comprised of write-write and write-read edges, that's G1c, cyclic information flow. There's G-single, also known as read skew, where the cycle has exactly one read-write edge. And if you have an arbitrary number of read-write edges, that's G2. So we can find these anomalies, and in fact we can categorize histories and say: ah, this thing exhibits G-single but not G1c, if we can find these dependencies precisely. That's the goal.

Great. So where do we get an object history, in order to compute the serialization graph?
After all, what we have as observers of a database is something like: transaction one wrote x = 1 and wrote y = 1, and transaction two read x = 1 and wrote y = 2. Is this serializable? Well, we don't know if that read of x = 1 was produced by that specific write of x = 1, or if there was some other transaction which also wrote x = 1. So we're going to need to prove that there's only one write that could have resulted in that value. And we also don't have the version order between y = 1 and y = 2; we don't know which one overwrote the other. So there are two possible object histories here. In the version where y = 1 comes first, this is legal; it's serializable. In the other version, where y = 2 comes first, there's a cycle, and it's therefore non-serializable. So we're going to have to figure out how to reconstruct that version order somehow.

So I want to decompose this into two properties. One of them I'll call recoverability, not to be confused with database crash recovery. The idea is that if you have a read of some value, like you observe x and see that it's currently three, that value is recoverable if we can trace it to a specific write of three. So I can prove: okay, this is the transaction that produced the value I read; I've recovered it from the history. That's a special property, and we get it by forcing some sort of uniqueness relationship between the arguments of our writes and the resulting values. We need the arguments to be unique, so that we can prove the values themselves are unique. Now, that's not a horrible constraint, right? Generating unique values for writes is pretty easy; it shouldn't really break our test. It's still a very general type of history.

But there's this other problem, which is that when you write to a register, you destroy its history, right? If you see a register containing four, you have no idea what happened before. You don't know what versions x_i and x_j were when you observe x_k. So it would be nice if we could reconstruct those values somehow. Let's call this property traceability: when you see some value x_k, you know specifically that it came from the write of x_k, and before that, the write that produced x_j, and before that, the write that produced x_i, all the way back to the initial state. If you can trace the evolution of the system for some data type, that gives you a prefix of the version order.

So, what you're building up to is that you're going to show that your tool can create that for you in, sort of, user land. But if it's a multi-version system and you don't run the background garbage collection, you would still have this lineage, right? Or do you care about, like, within the transaction: if it didn't model updates to an object, you may not have that lineage within the transaction?

So here's the trick: we don't actually care whether it's a locking or a multi-version system whatsoever. What we're going to show is that, regardless of the concurrency control implementation, if there were to exist a total order of versions, as there must for a serializable execution, then that order must be consistent with our observations.

Okay. And locking and multi-version are not mutually exclusive. Keep going.

Yes, sorry. Optimistic or pessimistic concurrency control: we do not care. Single-version, multi-version: we don't care.

All right, awesome. Keep going.

So, consider four possible data types.
Let's generalize from registers and see if we can find a different data type that works better. Up top, imagine a register: if you have a register containing five and you write three, the register's value is replaced by three. But if we use a counter, which is incremented by writes, then writing three on top of five results in eight. This may not seem like a huge advantage, but if we always increment by positive numbers, then the values are going to be monotonically increasing, and that might help us infer some information about which versions happened first. If we use a set of values, and every write adds to that set, then we can prove whether any read happened after or before a write just by looking to see whether that write is included in the set of values. So long as the set is grow-only and never forgets, we can show that a set containing {1, 2, 3} had to happen after the write of 3, again assuming writes are unique. And finally, think about a list. If you have the list [1, 2] and you write 3, and your writes always append to the end of the list, then you know specifically which write produced the value [1, 2, 3]. It must have been a write of 3, because that's the last thing in the list; that's the definition of list append. There's a sense in which you can undo and replay the consequences of writes.

Think about it visually. A register is totally connected in a graph which relates versions to new versions via writes. If I have version zero, I can go to version one by writing one, to version two by writing two, to version three by writing three, and so on for every possible value. You can get from anything to anything else in one hop, and this is awful when you're trying to reconstruct the causal history of some value. If you have a counter, and you always increment monotonically by positive numbers (or all negative numbers, I suppose), then given two values, you know they have to be causally related if one is bigger than the other. But for a counter, we can't show whether the write of 1 resulted in the value 0, 1, 2, or 3, because it could have been any of those; there's no bijection there. So this doesn't give us recoverability, although it does give us some sort of partial constraint over versions. For sets, I can show that, say, a write of 1 had to precede the set {1}, the set {1, 2}, and the set {1, 2, 3}. But because sets are unordered, I can't tell precisely which versions preceded some set: if I read {1, 2, 3}, I don't know if the version {1, 2} or the version {2, 3} happened previously. It could be either one. So working with sets requires that we have a read of every single cardinality, in order to fix one specific trace, one specific path, through that possible graph of sets. But a list structure has this wonderful property: if you read, say, [1, 3, 2], you know the most recent write was of 2, and it came from the version [1, 3]. And before that was the version [1], and before that, the empty list. So given any value, you can always figure out exactly what values and what writes preceded it. And that is the traceability property we need in order to reconstruct the dependency graph. Given the read [1, 3, 2], we can infer all the previous values. That's traceability.
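(Here is that inference as a tiny Python sketch, my own toy code rather than Elle's, under the stated assumption that every appended value is unique, so each version has exactly one parent.)

```python
def trace(read_value):
    """Given one observed list, e.g. [1, 3, 2], recover the whole version
    history and the append that produced each version."""
    versions = [read_value[:i] for i in range(len(read_value) + 1)]
    return [{"appended": nxt[-1], "from": prev, "to": nxt}
            for prev, nxt in zip(versions, versions[1:])]

for step in trace([1, 3, 2]):
    print(step)
# {'appended': 1, 'from': [], 'to': [1]}
# {'appended': 3, 'from': [1], 'to': [1, 3]}
# {'appended': 2, 'from': [1, 3], 'to': [1, 3, 2]}
```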
How do folks feel about this?

I see where you're going; I'm excited.

Okay. This is from Xanthes: Yeah, it sounds like you're building a log, and that gives you the recoverability and traceability properties that let you work out the dependencies.

That's a great way of thinking about it. What we're doing, instead of changing a single value in place, is keeping an audit trail, a log of operations. It turns out that lists with append, logs which insert at the end or the beginning, strings with concatenation: there are lots of data structures that are isomorphic to this list-append property. And in fact, text with the SQL concat function does this precisely, with characters. So in the SQL standard, there is a data type that we can use to exploit this relationship.

Okay, let's do some examples; I know this has been a little bit mathy. Imagine that we're choosing lists with append, and you get transaction one and transaction two: the first one appends 1 to x and 1 to y, the second appends 2 to x and 2 to y. We have no idea if this is serializable or not, because we don't know the version orders. It could be non-serializable, if y's 2 preceded its 1 while x's 1 preceded its 2; or it could be serializable, if both keys went in the same order. Now let's add a third transaction, which reads the state of x and sees [1, 2], and reads the state of y and sees [2, 1]. Now we know the order of those writes, and the order of the versions, and we can prove that there are write-write dependencies between those two transactions that form a cycle, specifically a G0 cycle. This is a violation of dirty writes. Cool, right? By adding this extra read, and by choosing a data type that encodes history rather than forgetting it, we can now reconstruct at least write-write edges.

And we can go farther. Let's say we have transaction one, which reads x = [0, 1] and appends 2 to x, and transaction two, which appends 1 and then 3 to x, and we happen to have a read of x at some point in time which includes [0, 1, 2, 3], or any longer version of that string. We then know that 2 had to precede 3, which gives us the write-write edge on the right. And because the write of 1 appears in the final position of transaction one's read of [0, 1], we know that T2 also had to precede T1. So T2 comes both before and after T1: there's a cycle. This is G1c, cyclic information flow. This cannot be a repeatable read history.

For G2, we do the same thing, just with reads and writes. Say I have a read of x = [1, 2], and I'm choosing these in order, without loss of generality, just to make it easy to read. T1 reads x = [1], and then T2 has this append of 2 to x. Well, because the append of 2 doesn't appear in T1's read of [1], or alternatively because we have this read of [1, 2], either way we can prove that T1 had to precede T2. But conversely, if T2 read the empty value of y, and then T1 wrote something to y, we know that T1 had to happen after T2. So again we have a cycle. This one is an anti-dependency cycle; it's therefore a violation of repeatable read. Cool.

Well, this is all very well and good. How do we concretely do it? You get a whole bunch of transactions from the database, you throw them into a big bag, and then you compute these dependency graphs by building those relationships. You look at all of the lists, you form a partial order over the lists (or over the sets, as the case may be), and you try to extract some maximal fragment of the dependency graph. Now, we may not get every single edge, right? Like T12 here: it probably has dependency edges in any underlying system, but we may not have observed those dependency edges.
Maybe we didn't get a read that can prove anything about T12. But so long as we read periodically, and so long as our reads observe every value that's been recently-ish written, we should get enough information to cover most of the history. There should only be this small suffix at the end of the history, still evolving, that we don't get to see.

To be clear, where you're going with this is that your tool is generating these transactions, right? Your tool is generating these queries. This is not like you're connecting to an existing application and observing what it's doing, because you have to synthetically generate the queries that do the concatenation on the string, so that you can check whether they violate the total order.

Yes, yes. So we're restricted in our choice of data type and our choice of history: our histories have to be over data types like lists with append.

So why would you not know that T12 read something?

For instance, imagine that we wrote some value of x in T12, and then we never do a read that observes x with T12's write in it. We might not know whether T12 happened before or after T13, because we don't get to see how the values turned out.

Because you're fuzzing, and the queries are random, there's no guarantee that you'll hit something that would observe T12's write.

Even if the queries were deterministic, the database is allowed to execute them arbitrarily, because it's serializable. So it could, for example, buffer T12's write and apply it after T13's, even if they weren't logically or physically concurrent; they could be reordered as if they were concurrent.

Got it, got it, okay, good.

Yeah, although, we'll get there, there's a little trick we can do. Okay.
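(Pulling those pieces together, a toy sketch of the edge inference for a single key, under the same assumptions as before: unique appends, and a final read that fixes the version order. The data shapes here are made up for illustration; Elle's real implementation is in Clojure and considerably more careful.)

```python
from collections import defaultdict

# Say a final read of key x saw [1, 2]: the version order is [] < [1] < [1, 2].
version_order = [[], [1], [1, 2]]
writer = {1: "T1", 2: "T2"}   # unique appends let us map value -> writing txn
reads = {"T3": [1]}           # T3 read x = [1]

edges = defaultdict(set)      # txn -> set of txns it must precede

# ww: the txn appending each version precedes the txn appending the next one.
for prev, nxt in zip(version_order[1:], version_order[2:]):
    edges[writer[prev[-1]]].add(writer[nxt[-1]])

for txn, seen in reads.items():
    # wr: a reader comes after the txn whose append produced what it read.
    if seen:
        edges[writer[seen[-1]]].add(txn)
    # rw (anti-dependency): a reader precedes the txn installing the NEXT version.
    i = version_order.index(seen)
    if i + 1 < len(version_order):
        edges[txn].add(writer[version_order[i + 1][-1]])

print(dict(edges))  # {'T1': {'T2', 'T3'}, 'T3': {'T2'}}, acyclic: T1 < T3 < T2
```

Any cycle among edges inferred this way would be an anomaly.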
So, once we have this graph, we apply Tarjan's algorithm, from 1972, to find the strongly connected components of the graph. This runs in linear time: the cost is vertices, which is the number of transactions, plus edges, which is roughly the number of key-value pairs in the system. And that's great, because this linear-time algorithm lets us identify small chunks of the huge graph which contain anomalies. So we might find these two components. In the one on the left, we can see a cycle with a write-write edge on one side, and a read-write edge plus a write-write edge on the other. This gives us both a G0 and a G-single, depending on which of those paths we choose. So we'll just pick a vertex and do some breadth-first search, constrained so that we're looking only at write-write edges first: that finds G0. Then we look for write-write and write-read edges, and so on and so forth. On the right-hand side we do exactly the same thing, some breadth-first searching. We use BFS to find a minimal example; we want short cycles, even though there could be many, many cycles encoded in any one of these components. In practice this works pretty well: most cycles are on the order of two or three transactions. Although frequently we find strongly connected components on the order of several hundred transactions, and for those, BFS tends to blow up. However, all is not lost: we can always find some cycle by simply doing a depth-first search, in linear time. We're guaranteed to come back to ourselves if we just keep following edges. Why? Because the component is strongly connected: everything reaches everything else. So there's a wonderful property here: the worst case is that we find a cycle that's not minimal, but we do it in linear time, and the best case is that we find an actually minimal example of a cycle.

With those cycles, we can then explain exactly what happened. Elle can automatically generate textual explanations and show you, visually, on a graph with little arrows pointing to the writes and reads, the dependencies. Oh, T1 happened before T2 because T2 observed T1's append of 3 to x. T2 preceded T3 because T2 didn't observe T3's append of 5 to y. And maybe T3 happened before T1 because T3 appended 7 to z while T1 appended 4, and we know these appends happened in that order because later on we saw 7 and then 4 in some version of z. And therefore there's a contradiction; there's a cycle here. These are human-readable explanations that you, a regular human being, can look at and say: ah, yes, this is a contradiction. That's the goal of all of this.

Now, with this basic tool, we can add additional constraints. For example, going back to that concurrent history of processes: if I add edges to the dependency graph for each individual process's operations, so here the person on the far left executes T1 and then T2, and I put an edge from T1 to T2, and I then observe a cycle including that edge, I know that the system has to violate some kind of session guarantee, if we identify processes with sessions. Or it might violate something like a per-process order, like sequential consistency: if we can observe some cycle where different processes disagree on the order of the values of some single object in the database, that's a sequential consistency violation. We can also do a real-time order. We take advantage of that real-time precedence you were talking about: we look at the concurrency windows and say, ah, T1 ended before T3 even began, therefore T3 must execute later. And if we verify this augmented graph, then we've checked for strict serializability, or we can check for strong snapshot isolation. So we can lift any one of these cycle-based consistency models into a version which respects per-process order, or a version which respects real-time order, just by adding these edges to the graph and running the same cycle detection. All of this is automated.
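(A sketch of where those extra edges might come from, reusing the toy invocation/completion timestamps from the first sketch. Again, these are illustrative shapes, not Elle's actual representation.)

```python
# Strict serializability: if T_a completed before T_b was even invoked,
# T_a must precede T_b. Add a real-time edge for every such pair.
# Txn tuples here: (id, invoked_at, completed_at, process).
txns = [("T1", 0.0, 1.0, 0), ("T2", 1.5, 2.0, 0), ("T3", 0.5, 3.0, 1)]

realtime_edges = {(a, b) for (a, _, a_done, _) in txns
                         for (b, b_start, _, _) in txns
                         if a != b and a_done < b_start}

# Session order: within one process, each txn precedes the next one it issued.
by_process = {}
for tid, start, _, proc in sorted(txns, key=lambda t: t[1]):
    by_process.setdefault(proc, []).append(tid)
session_edges = {(a, b) for ids in by_process.values()
                        for a, b in zip(ids, ids[1:])}

print(realtime_edges)  # {('T1', 'T2')}
print(session_edges)   # {('T1', 'T2')}
```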
If you have a timestamp returned from a transaction, like the system actually tells you, "I started at timestamp three and ended at timestamp seven" in database-world time, you can use those to recover the start-ordered serialization graph and check snapshot isolation as defined by Adya. What we actually do for snapshot isolation, because most databases don't tell you this, is to look at read-write edges in cycles. You can look for what are called dangerous structures, from the serializable snapshot isolation work, which consist of a pair of read-write edges adjacent to each other in the cycle. If you see a cycle which does not contain one of those structures, then you know it violates snapshot isolation. Finally, we can use the version order: if the database tells you "this is version one of x," you can encode that in the graph, and then prove something like: the database's beliefs about its versions weren't consistent with real-time order, or weren't consistent with process order, or weren't consistent with the serialization order. And sometimes that is the case: the database is mistaken about what the versions of an object are.

We check for other anomalies as well. In addition to all these cycles, we look for aborted reads. If you ever see two lists which disagree, where neither one is a prefix of the other, then you know they had to diverge from some common point, and there cannot be a total order between them. The lack of a total order implies that there is no version order in the Adya formalism. So either no Adya interpretation of this history exists, or one of these reads actually saw an aborted write; you can infer the aborted read as the most charitable explanation. You can also observe aborted reads directly: some write of 1 to x aborts, and then you see 1 inside of a committed read. Well, you know that had to be an aborted read. And if you see anything that descends from that aborted write, you can infer there was, not precisely an aborted read, but something very much like it: there had to be some sort of write which incorporated aborted state into its new value. So even if a committed transaction reads something that's aborted and writes it back into the database, promoting aborted state into committed state, that's still illegal. It's still essentially an aborted read, and we can find that too. We can also look for duplicate elements; this happens often when databases do internal retries. We look for garbage values, where the database hands back things we never put in; this happens when there are memory errors or serialization bugs. It also happens when clients connect to the wrong session: a client opens a connection, issues an RPC request, and gets back an RPC response intended for some other client, because the database messed up.

How often does that happen?

Twice, in my most recent test on Redis.

Oh, Redis, okay, sure. All right. But in terms of, like, a memory error, where you do a write and get garbage back, not the RPC example: how often have you seen that one?

It's happened in a couple of databases. One of them was FaunaDB, which is based on Calvin. There was a serialization error where... oh wait, no, I'm mistaken; that was a different thing. I don't remember which one it was, might have been Dgraph. You'd put in a number, and through, like, a JSON parsing thing, the number that came back would have the wrong sign, or a digit in the wrong position, something like that.

Interesting, okay, yeah.

And then we look for internal inconsistencies, which we check just as you'd expect. If you see a single transaction that writes x = 3, does a bunch of operations on x, and then reads x ending in 2, you know this can't possibly have been internally consistent: that read saw some value that didn't result from the most recent write. And we can check this for any pair of writes and reads, or reads and reads.

So, I've told you this wonderful story. The question, of course, is: does it work? And, if you're familiar with my work, it does. TiDB, versions 2.1.7 through 3.0.0-beta.1, claimed to be snapshot isolated, but we found cases of G2, G-single, lost updates, and aborted reads, all now fixed. TiDB also claimed that select-for-update would prevent write skew. In fact, it did exhibit write skew, but only on newly created records, and the reason is that the lock-acquisition mechanism could only lock records that existed, not the potential existence of a record that was about to be created.
YugaByte DB, version 1.3.1, claimed to be serializable, but we found G2-item when primary nodes were paused or crashed. FaunaDB, version 2.6.0, exhibited an internal consistency violation: when you read a value from an index, you could fail to observe your own prior writes. Redis-Raft, in development builds, claimed to be serializable, but we found cases of split-brain, data loss, data corruption, and stale reads; along with others, there were in total 21 issues in those builds. This is normal, they were early development builds, and all of them are now fixed. And then Dgraph, version 1.1.1, claimed to be snapshot isolated, but we found cases of internal consistency violations and read skew: instead of seeing nice snapshots of the database, you'd get some sort of cross, where you see something from a new transaction and something from old state. Also now fixed. And Postgres 12.3, the venerable Postgres: it turned out there was actually a G2-item anomaly which could happen under rare conditions in normal operation. So even without faults, if you did enough transactions against Postgres and you got this particular trio of transactions, where one updates the next and another does some sort of careful select on two elements, you could observe this cycle of dependencies. That was a big deal when it was announced.

I forget whether you said this or not: how long did it take for these four or five systems? In this case, the Postgres one sounds like you had to run it for a while before you hit it.

Postgres, I mean, I got lucky. Initially I got a lot of errors, but it was all my fault, because I'd designed the test wrong. But after maybe five minutes or so of testing with the correct execution, I started to see it, and I could get one of these every couple of minutes.

So for the other systems: some of them sound like they'd be very immediate, while others might be more obscure?

Yeah, sometimes these errors crop up as soon as you run one transaction. Like, in FaunaDB, I got through ten transactions, not even concurrent, and it immediately threw up an error for internal consistency. Although FaunaDB's other concurrency mechanisms were very robust; that particular error just wasn't something we'd covered in previous tests. Others, like some of those YugaByte bugs, only showed up when you had specific patterns of process crashes and partitions, and that took a few hours to get to.

Got it, got it.

The final thing I want to say is that this is fast. This is a graph of Elle, the checker I've been talking about, versus Knossos, the linearizability checker I wrote for Jepsen, which does a basically state-of-the-art-ish search over the concurrent graph of histories. Knossos scales exponentially with the amount of concurrency. So you can see that, for different lengths of histories, all of them shoot up to the sky after a few processes get added to the system. Once you hit 40, 50, 60 concurrent processes, Knossos basically throws up its hands and starts taking something like 2^32 steps to execute. Elle, on the other hand, is happy to keep running up to hundreds of processes, and in fact we test experimentally with thousands. In terms of history length, Knossos also tends to blow up, and the reason is that the more operations you have over time, with processes allowed to crash, concurrency slowly rises.
So that state space tends to grow, whereas Elle is basically linear in the number of operations. You just throw it a bunch of ops, it doesn't really care how many there are, it just chugs along and it's done. So this is a breakthrough for me, because practically speaking, I can now test hundreds of thousands of operations over a long concurrent test, with lots of crashes, where before I could only test for a few seconds, maybe a hundred operations tops.

If you remember one thing from this talk, it's that Elle is like GEESE. It is General: it works over all kinds of patterns of transactions, which can be short or long, any mix of reads and writes, just as long as the writes are unique and they're over these particular data structures, over lists, over sets, that sort of thing. It is Effective: it finds errors in real databases, lots of them. It's Efficient: it runs in linear time in the number of operations and dependency edges. It is Sound: every time Elle tells you there's an anomaly, there is one. At least, we informally believe it is the case that every object history which could have given rise to the experimental observations you got must contain that same type of anomaly in its dependency graph. And finally, Elle, unlike many other checkers, is Explainable. It points to a few specific transactions and says: the problem happened here. This is where your debugger needs to look; this is where you might find a concurrency problem. And it can justify that to humans, categorizing anomalies automatically. That means you can run one test which verifies a whole slew of concurrency properties, and it will spit out exactly which properties it didn't satisfy, telling you: this is at most repeatable read, and we know it is definitely not, say, snapshot isolated.

I want to say thank you to Andy Pavlo and everyone who's come to listen to this talk. In particular, I want to thank Peter Alvaro, my collaborator, and Asha Kareem and Kit Patella, two people I talked to a lot in the evolution of Elle over the last few years; Kit also wrote the initial implementation of Elle's checker. At this point, I think I will turn it over to further questions. I should note, before we start, that one thing that's not in this talk is predicates. We've only talked about keys and values; predicates are an open problem, and I really want to figure out how to do them. All right, thanks.

Thank you, awesome. I will clap for Kyle. So again, the way we do this is: unmute yourself, say who you are and where you're coming from, and then ask your question.

Hey Kyle, thanks for the talk. This is Alex, from MIT. I have a small question, although I think I'm being a bit evil here. Suppose I'm an evil guy, and I build a database that provides serializability, but only for, say, the list operations in SQL. And because lists can get very long, I use some tricks for, like, fixed-size list data, and those tricks don't always work correctly. In this case, if your SQL queries replace all the reads and writes with list operations, wouldn't Elle miss such scenarios?

So, Alex works with me on another project. I think your question is: his examples are all about lists, but what if the system is implemented differently for another data type, and therefore would not fall under the jurisdiction of Elle?
Yes, you are absolutely right that the choice of API, whether we're using native lists, versus strings, versus registers, will affect the concurrency control mechanisms to some extent. What I've found is that, in practice, a lot of the databases I test use the same concurrency control mechanisms for all data types. But you're right that some don't; some have optimizations, and in those cases we won't find errors unless we test those specific APIs. The good news is that you can still run Elle with a regular register; it's just that the inferences you can draw are more limited. So the Dgraph test for Elle uses registers instead of lists: it's always in-place replacement of elements. (Under the hood it's not, but that's the illusion.) And then we infer version orders from constraints like process order: you assume every key is sequentially consistent, and from that you derive a version order graph. So that allows some degree of inference, and it still finds bugs, even though it's not as complete as lists.

I see, yeah, thanks for the answer.

Hi, this is Lin; I'm a PhD student here at CMU. By the way, this is very interesting work, and it's a great talk. My question: since you are detecting these isolation levels by formulating dependencies and detecting cycles, et cetera, I wonder whether you can also reduce the constraints, reduce the edges, to detect lower isolation levels, such as repeatable read or read committed. It sounds like you can just detect all of them. And if that's the case, is there any isolation level that is not compatible with the way you're doing this?

Yes. So inside of Elle, there's actually a formal model of isolation levels, and it includes all of the implication relationships between levels, and it also includes a family of anomalies. Some Elle tests can detect more anomalies than others: some tests can find everything up to repeatable read; others can only reach up to, you know, snapshot isolation or something like that. So our discrimination is not perfect. In particular, we have no formal model for predicates, so Elle can't tell you the difference between repeatable read and serializable given only individual key-value reads. We can still probe toward serializability, and the way we do it is by expressing reads and writes as predicate updates in the actual client. So when you say "write x," I'm actually saying something like: set the value of every register which currently has something related to x in it. I'm using that as a proxy for the primary key, and then using it to infer whether or not the secondary indexing mechanism works correctly, which gives us some measure of predicate safety. But yeah, you're right: we can infer a lot, but there are some limits to what we can infer, and I'm trying to expand that over time by getting more and more sneaky about how we find anomalies.

Right. And also, just to be clear: for, let's say, repeatable read, it sounds like you can check a constraint that is somehow a little bit higher than repeatable read, but not exactly repeatable read?

What we check is exactly the property of repeatable read as defined by Adya in his formalization.

Okay.

That's not necessarily the colloquial understanding of repeatable read, but it's also the one that makes the most sense, I think. It's the best definition. Best in academic circles, I mean; I've had this argument with the Postgres team, and they're very much opposed to it.
Right, fair enough. But, you know, academics like it.

Yes. Sounds great. Anyone else? Steven, go for it.

Hi, this is Steven. I have a question: when you test this, is it a multi-process, single-node system, or a multi-node, single-process system?

Both. Typically these databases are multi-threaded on the operating system, and then I'm running many copies of them across different nodes. The Postgres case is actually a single node; sorry, a single Postgres install, which of course has multiple processes internally. But our approach is insensitive to this: it doesn't care whether or not you have a concurrent distributed system. In fact, it works over in-memory data structures, so if you wanted to use this to test, say, shared-memory implementations or transactional memory stuff, it would work for that too.

So I have two questions; I'll ask the first one, and if anybody wants to go again, they can interrupt. It's sort of related to Alex's question about: oh, if it's not lists or strings you're concatenating to, do you miss that? And I agree with you, I think it's rare to have specialized cases for different data types; different usage patterns, sure, but not different data types. So I guess my question is: what would it take to support, or what kinds of things would Elle miss, other than just predicates? Things like: if I create an index, is that index guaranteed to be serializable? Or if I have sequences, or other things that don't look like regular queries, but where there's some logic internal to the database that's updating state. Do you have a sense of how hard it would be to support those different sorts of operations? I guess, you know, DDL would be the first thing I would target.

So the interesting news there is that DDL tends to be non-transactional, so we immediately find problems. But yes, I've done DDL tests.

They're slowly adding more and more support for transactional DDL.

Yes, yes, which would be great, and we could check more of that. So, the index question: for designing tests over indices, what I tend to do is have a secondary key field, maybe not really a key, but some sort of value in the schema. Then I'll do primary key queries, and I'll also have a secondary key query, and that secondary key query either does a table scan, or an index, or some sort of join, or some other weird thing. And I'll flip them across different tables as well, and do weird hashing tricks to change their orders. So you can complicate the implementation of the test in a way that still looks like it's over keys and values, but which actually winds up exercising database subsystems like predicates and indices. That lets us find problems like that predicate-locking issue where, I forget, was it TiDB or YugaByte, which failed to lock a record which didn't exist yet. That's kind of a predicate, right?

Yep.

So that's one option. One thing we can't do well is deletes, because all of the objects we've talked about so far are grow-only, or they have uniqueness constraints. It's hard to model something that gets destroyed, and deletions are notoriously difficult in distributed systems, so I want to have a better story there.

Again, if it's multi-version, you could recover the tombstoned tuple, right? You could say, all right, but you won't know who deleted it, and you don't know why.
Yeah, maybe there could be some sort of metadata associated with it. Or maybe you could bound it cleverly by doing some sort of temporal thing, where you say: okay, it had to happen in this time window. You assume the key is linearizable.

I was thinking every delete has to be an update and then a delete, so you keep track of who actually deleted it.

That might actually solve it.

Okay, so again, I'll open it back up to the audience, if anybody else has any other questions. Otherwise, I'll keep going. Kyle here is a big deal; if you guys aren't going to take advantage of him, I am. Screw you all. So let's say I have, hypothetically, a group of plucky master's students working on an academic project at a well-known school that rhymes with, you know, Marnegie Mellon, right? How hard would it be for them to take the open-source version of Elle and point it at a new database system? What is the effort to actually do this?

Surprisingly little. You'd use both Elle and Jepsen. Jepsen provides the database automation, the installation, getting the OS set up; it does the scheduling of queries, the generation of queries; and it handles the recording of the history, the analysis process, and all the error handling. The actual amount of work you have to do to write the test is probably 150 lines of Clojure, tops. Maybe less if it's a simple database.

But you already have a Postgres test. So say this database is supposed to be Postgres-compatible, in theory...

You could, in theory, plug it right in, yeah. Anything that's JDBC-style. The dream, of course, is that all SQL implementations are alike, but what I've actually found in practice is that I have to rewrite the test for each SQL database, because every one of them interprets SQL differently.

Of course, yeah, yeah.

Even for simple things like updates and primary keys.

But it sounds like maybe the bare minimum you'd need to support Elle would be SQL-92, with string concatenation, and begin, commit, and abort.

Yes. And in fact, Elle also handles, so, sets: I've described them, but haven't fully implemented them; that's my current research project. I know how it's going to work; it's just a matter of writing a bunch of really complicated code. But registers and lists are both there already. So if you have anything that looks like a register, which is like a key-value store, you're good to go. Elle can also be used standalone. If you have your own test suite which talks to a database and records a history, you can shove it into something like JSON, and you can write a tiny binary around Elle as a library that reads that JSON history and analyzes it. So there's no reason you have to use Jepsen if you don't want to. I like it, it's convenient, but if you have other tools, you can use them too.

But Elle is generating the queries? Like, could you generate the queries yourself, or could you ask Elle to give you a schedule?

The only thing that matters is that the history you give to Elle has that uniqueness property, and that it's over lists. As long as it obeys that, which is pretty easy for you to do, it doesn't matter where the history came from; Elle will tell you whether it's serializable or not.

Got it, okay, that makes sense to me. And then, is it like Jepsen, where someone should implement this in their CI pipeline and run it all the time? I guess it is, but does it make sense to run Elle 24/7, like a fuzzer?
I have been doing that on databases, and I've found the results to be very productive. Often it doesn't need to run 24/7; often it finds a bug in a matter of seconds to minutes, given the right schedule. So a lot of the tricky part of my work isn't really running Elle anymore, which is nice, because it used to be that I had to design every single checker and every single schedule by hand. I would stare at a bunch of transactions and hand-prove that some invariant holds, like: the total of all account balances is constant. And then I would design tests around that specific pattern of transactions, where I would say: okay, what is a canonical example of a long-fork anomaly? It's this particular pattern of reads and writes; I'm going to repeat that pattern over and over again and hope that I get lucky. But maybe that approach doesn't find the randomized, generated version, where, oh, if you did an extra read at this time, you'd expose a bug. That's what happened with Postgres: they had existing tests that covered G2, but they didn't have one that tried this weird trio of transactions the way that I did. So I've found the generative testing approach is really good at that.

Got it, okay.

Sorry, this may not have been a direct answer to your question.

No, no, it absolutely was; it's all gold. It also makes me think that we should be trying out different JDBC drivers, just to see if anything funky comes out of that.

Yeah, oh yes. So a lot of the work is in designing the faults that you inject into the system, and trying to make sure that you're actually creating the sorts of changes in behavior that you want. Did my fault actually cause a failover, or was it transparently papered over by some sort of long lag time? That's a key problem. Also just trying lots of variants of SQL statements, and all the different ways you can convince the database to look like a key-value system. The formalism is this beautiful abstract map of keys to values, but the real database is this thorny thing with lots of different types and optimizations, and you want to hopefully cover more of them.

Okay, so for the Postgres one, it's not just strings: I mean, you support lists, or arrays, and strings. But, sort of again back to your question: you don't care that, say, you can't support floats, because you can't append floats? I guess you could append to an array, but...

Yeah, for floats I would check something more like a register, I think. Or I would have to find some sort of float encoding. For example, maybe I could do something with prime factors, where every operation multiplies by a prime factor, if I could show that those are closed under floating-point arithmetic, which, oh my God. But for integers you could imagine doing that and getting a set, right? You can make a set isomorphic to a bit pattern.

Right. Yeah, yeah. Via addition or via products. So you could do something like that.

Okay, it is 5:30. We've had Kyle here for an hour, which has been awesome. Last chance: any other questions?