The Carnegie Mellon Quarantine Database Talks are made possible by the Stephen Moy Foundation for Keeping It Real, and by contributions from viewers like you. Thank you. All right, the pandemic rages on, so let's talk about databases. We're super happy to have Markus Pilman today from Snowflake to talk about FoundationDB. Markus joined Snowflake in 2016. Prior to that, he got his PhD from ETH Zurich under Donald Kossmann, who now runs all of MSR. So, you know, that's awesome; databases go everywhere. He recently moved to South Dakota with his wife, so he's living on a large plot of land that's not really a farm, but that happened before the IPO; this is something he's been planning for a while. Again, the way we'll do this is that if you have any questions for Markus, please unmute yourself, say who you are and where you're coming from, and ask your question; feel free to interrupt at any time during the talk, this is meant to be interactive. And of course, as always, we want to thank the Stephen Moy Foundation for Keeping It Real for sponsoring this event. All right, Markus, the floor is yours. Go for it.

Thank you, Andy. Thanks for the introduction. Hi, everyone. So I have worked on distributed systems for a relatively long time, for my age at least. I did research in distributed key-value stores and then I joined Snowflake, and I want to talk a bit about that. But more specifically, I want to talk about testing a distributed database, or more generally, testing distributed systems. I think those of you who ever had the privilege to do that know that testing a distributed system is probably a better experience than sticking a fork into your eye, but not by a whole lot, right? It's pretty miserable and it's a difficult problem. But before I go there, I want to give some amount of context here.
So first I want to say a few words about FoundationDB, then a few words about FoundationDB within Snowflake, and then we go into the dirty stuff. FoundationDB, for those of you who haven't heard of it yet, is a distributed key-value store which provides very strong transactional guarantees: it is strictly serializable across its whole key space. And strictly here means that we also guarantee causal reads, which means that if you write something and commit, and someone else reads that thing after your commit, they are guaranteed to see your write immediately. And that is not something that merely serializable systems have to provide; if you look at the definition of what serializable means, you are allowed to read old state, so stale reads would be okay. The way FoundationDB achieves that, and I won't go too much into details here because there's an excellent talk on YouTube from Evan Tschannen who explains the whole architecture, so if you're interested I would invite you to watch that, is that it uses a mix of optimistic concurrency control and snapshot isolation. The way to think about it is that a database has a state, and whenever you commit a transaction you bring it into a new state. So you can think of it as kind of a queue or log, and you read at certain points in that log. When you start a transaction, you basically just figure out what the current state is and you stop time there, and whenever you read from the database you only read from this state and never anything from afterwards; that's the snapshot isolation part. You take a snapshot of the database. But obviously we cannot stop the world while you are running a transaction, so other transactions will commit and they will roll the state forward. And eventually you want to commit, so you try to write your own state. So this will give you a write version, right?
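A minimal sketch of this optimistic scheme might look like the following. The names here (`Resolver`, `tryCommit`, the flat history list) are all invented for illustration; the real FoundationDB resolver is far more elaborate. A transaction reads at its read version, and at commit time the resolver checks whether anything committed after that version intersects the key ranges the transaction read:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of FoundationDB-style optimistic conflict
// detection; all names are invented for illustration.
struct ReadRange { std::string begin, end; };  // half-open [begin, end)

class Resolver {
  uint64_t version_ = 0;
  // Committed (version, key) pairs; a real system truncates old history.
  std::vector<std::pair<uint64_t, std::string>> history_;

 public:
  uint64_t readVersion() const { return version_; }

  // Returns the commit (write) version, or 0 if the transaction
  // conflicts and must be aborted.
  uint64_t tryCommit(uint64_t readVersion,
                     const std::vector<ReadRange>& reads,
                     const std::vector<std::string>& writes) {
    // A conflict exists if any key committed after our read version
    // falls inside a range we read from our snapshot.
    for (const auto& [v, key] : history_) {
      if (v <= readVersion) continue;  // visible in our snapshot: fine
      for (const auto& r : reads)
        if (key >= r.begin && key < r.end) return 0;  // abort
    }
    uint64_t writeVersion = ++version_;
    for (const auto& k : writes) history_.emplace_back(writeVersion, k);
    return writeVersion;
  }
};
```

Note that the check is read-write, not write-write: a transaction aborts only if someone wrote into something it read, which is what rules out the write-skew anomalies that plain snapshot isolation allows.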
I mean, FoundationDB calls these the read version and the write version. And everything that happened in between is now your potential conflict. So if you can verify that nothing in there violates any consistency guarantees, then you can write; otherwise it will abort your transaction. One simple way of doing that is to make sure that nothing you read during your transaction was written in that window. So you basically just have a bunch of key ranges that you read while your transaction was running, and you check whether anything in that window wrote into one of your key ranges. If that's not the case, then you can commit; otherwise you abort. Which is a bit different from traditional snapshot isolation, where you check for write-write conflicts; but then you get into these write-skew issues, et cetera, and FoundationDB doesn't have them. So what do we use it for? Snowflake doesn't provide an OLTP kind of transactional database. If you look at the Snowflake paper or some of our blog posts, you will see a picture like this that presents the whole Snowflake architecture. The way to think about it is that when you as a customer write a query, it goes into this cloud services layer, that's what we call it, which has stuff like the SQL compiler, the infrastructure manager, some security, the Web GUI, endpoints for ODBC and JDBC drivers, all these good things that you need. Then you have your warehouses, which are basically just clusters of machines, and they execute these queries. So the SQL compiler will generate an execution plan, one of these clusters will execute that plan, and it will read and write data from storage, and storage in Snowflake's case is Amazon S3 or Azure Blob Storage, depending on which cloud provider you're running on.
But this layer above needs to be able to store metadata, and we want to access this metadata in a low-latency kind of way, and this is where FoundationDB is sitting. The way we typically say this is: the SQL compiler is the brains, because it implements most of the logic; the virtual warehouses, or the execution platform, are the muscles, because they do all the work; and FoundationDB is kind of the heart keeping it all together. So that means everything that needs a transactional workload, for all those kinds of things, we use FoundationDB. That includes metadata, stuff like usernames, password hashes and their salts, security groups, encryption keys, query history, schema definitions, but then we also use it for all our transactions. Remember that Amazon S3 doesn't give you any transactions; it's an eventually consistent storage, so you need to implement them on top of it. Snowflake implements some form of snapshot isolation, it uses Lamport clocks, and it stores file locations; I will go into that briefly. But then also think of a Snowflake region as a huge distributed system with thousands of machines running all the time. We need to maintain the machine topology, we need to decide who is alive and who is dead, and we need to do service discovery, and for all those things we use FoundationDB. So one quick example, and this is simplified to the point where it's not really correct anymore, but it should give an intuition of how Snowflake works. Our customer data is written into files on Amazon S3, and it's written in a columnar format, so we use something like a PAX layout, and in order to figure out where your data is we have some metadata in FoundationDB, and this metadata will point you to your data.
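As a toy illustration of this metadata indirection: the `MetadataStore` class below, and its `addFile`, `filesFor`, and `commitSwap` methods, are all hypothetical names, and the real system keeps this mapping in FoundationDB rather than a `std::map`, with the file contents living in S3:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of the metadata indirection described above.
// MetadataStore stands in for FoundationDB; all names are invented.
class MetadataStore {
  // table name -> list of immutable data-file locations (e.g. S3 keys)
  std::map<std::string, std::vector<std::string>> files_;

 public:
  void addFile(const std::string& table, const std::string& file) {
    files_[table].push_back(file);
  }

  // Read path: ask the metadata store which files belong to the
  // table; the caller then fetches those objects from blob storage.
  std::vector<std::string> filesFor(const std::string& table) const {
    auto it = files_.find(table);
    return it == files_.end() ? std::vector<std::string>{} : it->second;
  }

  // Copy-on-write update: the new file was already written to S3
  // under a fresh name; committing is just repointing the metadata
  // entry. Readers never observe a half-written file.
  void commitSwap(const std::string& table, const std::string& oldFile,
                  const std::string& newFile) {
    for (auto& f : files_[table])
      if (f == oldFile) f = newFile;
  }
};
```

The point of the indirection is that a data file only becomes visible once the metadata points at it, so a failed update leaves at worst an orphaned file, never a corrupted table.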
So now if you want to read something, you first go to FoundationDB and ask for my files, and you get back a list of pointers, basically, and now you can fetch your data from S3. If you want to update, we do a copy-on-write, so in a first step we rewrite the whole file. This is also why, if you do single-point updates in Snowflake, it probably won't be super fast: we will rewrite the whole file, and that's maybe 16 megabytes in size. Nobody will be able to read that yet, because, as you can see, there's no pointer from FoundationDB to this thing. And the reason is that S3 has some weird consistency guarantees: basically, if you read something for the first time and it was only created but never updated, then you're guaranteed to read the most recent state; however, if you read something that doesn't exist and someone else then creates it, you are not guaranteed to see it immediately, because then you have caching going on and these kinds of things. So we write this file, and now if the transaction fails, nothing bad happens; we waste some space that we can clean up later. But at commit time we can atomically move this pointer to the new state. And this is probably the most simplified version of how Snowflake works, but it should give you an idea of how important FoundationDB is to this whole architecture. It is one of the most critical pieces here, because if FoundationDB ever becomes unavailable, we cannot process any queries; in fact, users cannot even log into the Web GUI, like nothing works anymore. If we lose data, which has never happened so far, fingers crossed, then we lose customer data, because the encryption keys are in there, and nobody will ever be able to read that data again. Obviously we still have backup systems et cetera in place, but it would be really bad if we corrupted the FoundationDB database, and it would make many people unhappy. To give you an idea of how critical that is, we had incidents where, for
example, farmers in Minnesota were not able to repair their tractors anymore, because the system for ordering replacement parts relied on some Snowflake database somewhere; so if it goes down, they cannot order any replacement parts. So these databases are important pieces for many companies, not only for us.

Presumably you're running one FoundationDB cluster per customer, right? No, for the entire fleet we're running a single cluster per deployment, and in most cases each region will have one deployment; our larger regions will have more than one deployment. So within one deployment there are multiple customers? So a customer has multiple deployments, and then within a deployment you have a single FoundationDB? No, no, it's a multi-tenant thing, right? Think of, for example, Snowflake Switzerland; I have to take this as an example as a Swiss person, right? Snowflake Switzerland is exactly one FDB cluster running, and all customers will use that together. Okay, that makes sense, got it. All right, keep going, it's good.

Okay, so that means we have to test this properly. Now, I like this quote from Dijkstra. Dijkstra was not a big fan of testing in general; he believed that you have to verify that your stuff is correct, and we don't do that. But he said that program testing can be used to show the presence of bugs, but never to show their absence. And this is very true: unless you test the whole input surface of your program, you will never be able to prove through testing that your program is correct, and testing every single possible input is usually impossible except for very, very simple stuff that is typically not very interesting. And then when you go to distributed systems, distributed systems are super hard, and I just want to give you two examples of how a distributed system can fail. In these examples, assume that you have bugs; without bugs, obviously, it wouldn't fail. And one of the
surprisingly harder ones is distinguishing between a slow network and a machine that is dead; you cannot really do it. It's like if you have a friend and you send them a message: if they never reply, eventually you have to assume they are dead, or they don't want to be friends with you anymore, like getting ghosted or whatever. So imagine you have these two machines, and one machine is trying to send a message to the other machine, but because networks are weird and unreliable, this packet gets lost or delayed somewhere and it doesn't arrive. The only thing we can do about this is that after a while we just assume maybe this other machine is not there anymore, maybe it had a hardware fault or something, so we mark it as down and we assume it doesn't participate anymore. However, this is typically not enough, because what can happen is that many seconds later, and a second is a very long time in any such system, this message arrives, and if this machine doesn't know that it's supposed to be dead, then it could continue to participate in some protocols et cetera, and therefore it could break stuff, and then it can start fires and make people sad, and generally it won't be a good experience if your code doesn't handle this correctly. A more interesting one is message reordering. Let's say you are sending out two messages; if you send two messages to another machine, we all know that we cannot expect them to arrive in the same order. But because we are humans, right, we write in our code: send message A, then send message B. Now, if the receiver thinks that this ordering is part of the contract, then stuff can happen, right? So we send the first message, the first message disappears, the second message immediately goes to the second machine, and then some time later the first message arrives. Now you have broken an implicit contract that was never actually guaranteed, and again bad things will happen. Now, why is this such a hard problem?
it all boils down to the fact that a distributed system is not a pure function. We have some randomness: whenever you run on hardware, you have randomness. This is also true for disks, right? A disk can break, it can return wrong results, these kinds of things. But the randomness in itself isn't that bad; debugging a randomized data structure isn't that hard, we can do that. For distributed systems this is somehow harder, and there are two other problems here. One is that failures are very rare. If you rely on message ordering, you will get away with it in 99.999% of cases, right? It will work fine in your testing environment, but then you go to production, you get unlucky, and you lose your customer's data, and something like that can destroy your business. And if you see it happening, you might see a specific failure only once in your lifetime. So you'd better get your tracing right from the beginning, because otherwise there's just no way you can figure out what just happened. You can also not debug it, because changing the timing will change the behavior. And here is the difference between a normal randomized system, like a randomized binary tree, and a distributed system: we don't control the entropy. This randomness comes from the universe, or from wherever, but it doesn't come from us; we don't control where these events happen. So how do we solve this? There are multiple ways you can try to deal with that, and FoundationDB chose one, and I really like this solution, and I haven't yet seen anything better; if you have, then please let me know, I would be very interested. So what we are doing is deterministic simulation, and what that means is that instead of having this external source of randomness, we want to control the randomness ourselves. We want to make sure that we control when a fault happens, and that way we can also reproduce it as often as we
want to. There are three main ingredients to make this happen. The first one is single-threaded concurrency. I did a lot of googling before I prepared this talk to find a good definition of concurrency, and the problem is that people don't seem to agree on what it means. So for the purpose of this talk we will use my understanding of concurrency, and this will be by definition correct, and we'll just run with that, and maybe Andy needs to change his exams or something. The second part is simulated implementations of all our external communication, because we cannot control randomness if we use a real network. And the third one is determinism. I want to go through all three of them. So, concurrency versus parallelism. Parallelism, on a very naive level, just means that multiple things run at the same time. A classical parallel program which uses blocking IO could look a bit like this: you have three threads of execution. Let's say this one makes some complex calculations, and then after a while the kernel decides that it shouldn't get any CPU time anymore, so it gets an interrupt, and after a millisecond or so it gets scheduled again by the kernel and continues to run. This guy might start reading from a disk, so it gets blocked until the disk replies, and then continues. Or you can have network stuff, whatever. So what happens is that you consume some amount of CPU time on each of your threads of execution, and you sleep in between. Concurrency is more the notion of breaking up your execution into small pieces and then switching between them. You can do both; there are many systems that are highly concurrent and parallel. But FoundationDB does only single-threaded concurrency. So what we do is, whenever we have to do something that will take a long time and block, like a disk read or sending a network message, instead of blocking the process we do an asynchronous call and we immediately schedule another
process, or, as we call them, actor. So we do something like cooperative multitasking. If you're old enough, like me, then you remember the times of Windows 3.1, where you had cooperative multitasking and only one CPU, so you never had parallelism, and sometimes you would have one of these blue windows just hogging your CPU, never giving up control, and then you could start moving your mouse around, which would cause the kernel to interrupt, and everything would start working again. Which is why old people like me, if stuff gets slow, automatically start moving the cursor around, and other people are like, are you crazy, this won't do anything. So this is roughly what we do, and there are many ways you can implement this; this is not any kind of rocket science, a lot of software does this. However, we do it in a somewhat special way, and the reason is that at the time FoundationDB was implemented, there were no coroutines in the C++ standard. What people typically still do in C++ for this is use Boost ASIO, which is an open-source library that gives you event-based programming. There, if you want to do a system call or anything that would normally block your thread of execution, you make your call and you pass a callback, and when your system call returns, there's some main loop that will call back into your callback. People often refer to that as callback hell, because this does not result in pretty code and it will not be very readable; I mean, you can make it kind of pretty, but it's not a nice way of programming. So FoundationDB has its own programming language called Flow, and what that does is, first of all, it implements stackless coroutines, so these are basically still callbacks, and it adds a few keywords to C++ like actor, wait, waitNext, et cetera.
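To make the cooperative, single-threaded style concrete, here is a toy scheduler. This is not Flow; `Scheduler`, `schedule`, and `run` are invented names, and real actors yield via futures rather than raw callbacks, but the shape is the same: instead of blocking, a task registers a callback to run at a later (virtual) time, and a single loop drives everything:

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy single-threaded cooperative scheduler; everything here is
// invented for illustration, Flow's actual runtime is much richer.
class Scheduler {
  using Task = std::pair<double, std::function<void()>>;
  struct Later {
    bool operator()(const Task& a, const Task& b) const {
      return a.first > b.first;  // earliest wake-up time first
    }
  };
  std::priority_queue<Task, std::vector<Task>, Later> queue_;
  double now_ = 0.0;  // virtual clock in seconds

 public:
  double now() const { return now_; }

  // "Sleeping" never blocks: it just registers a callback to run
  // delaySeconds of virtual time from now.
  void schedule(double delaySeconds, std::function<void()> task) {
    queue_.push({now_ + delaySeconds, std::move(task)});
  }

  // Run tasks one at a time on a single thread, jumping the clock
  // directly to each task's wake-up time.
  void run() {
    while (!queue_.empty()) {
      auto [when, task] = queue_.top();
      queue_.pop();
      now_ = when;
      task();
    }
  }
};
```

Because the loop advances the clock directly to each wake-up time, an hour of simulated sleeping costs essentially nothing in real time, which is exactly the property the simulator exploits later in the talk.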
If you're familiar with C#, Python, or JavaScript, and I'm sure there are other languages that do this, they typically call it async/await: you can mark a function as async, and then you can call it and just place an await in front of it, and it will feel like you're doing blocking programming, but behind the scenes the system uses callbacks, or maybe stackful coroutines, to make all of this happen. This is implemented in a program called the actor compiler that is part of the FoundationDB code: it takes one of these Flow files and generates C++ code out of it, and then we compile that down to machine code. This is an example of how it looks; this is an actor that we actually use. You can mark your function with the actor keyword, which you cannot do in normal C++, the compiler would yell at you. A Future, and I don't want to go too much into asynchronous programming, is basically just a result that hasn't happened yet, so you can wait on it. And a Promise is something that you can send to another actor; it's not a stream, it's a single-assignment variable. So this code looks very readable, even though if you did it with callbacks it probably wouldn't. We just call wait on this future, and wait is a keyword here, not an actual C++ function, and unless this future is ready it will basically go back into the main loop; it just registers a callback, and as soon as this future has a value, this actor gets woken up again, and then we just iterate through all our promises and send this value to all of them. And at the end, the reason we need this Void type is because you cannot have a Future of void with a lowercase v; that's just C++, that's how it is, so that's kind of an ugly hack you can see here. Oh yeah, so that's the easy part; now we have the programming model. Was Flow part of the original version of FoundationDB, before Apple bought
them, or is this something you guys added after the fact? No, this has been there from day zero. I might even add that the simulator, including Flow, was written before FoundationDB: the original company spent the first two years of its existence just writing a simulator, and once they had a simulator, they started to write the database on top of it. Okay. So the next part is simulated implementations of stuff in general, of anything that could be non-deterministic. System calls in general are non-deterministic, right? Have you ever asked for the time? That's a very simple example: asking the system for the current time is not a deterministic thing, and the behavior also changes between kernel versions, just to make this even harder. And then the network is the obvious example: sending a message, service discovery, whatever. So what we have is these interfaces, like INetwork, IConnection, IAsyncFile, which are just pure abstract C++ classes that provide functionality for the network or for file system operations, et cetera. And because FoundationDB needs to run on macOS, Windows, and Linux, you kind of have to go down this route anyway, because this will look different on every system, and you have an implementation for each operating system; for example, this one here implements kernel AIO. And now you can just do the same in the simulator: when you want to send a packet from one actor to another, instead of sending it over the network, you just pretend that you are the network, but you actually just copy bytes from one memory region into another. This is actually surprisingly simple; this codebase is relatively small. And then the third ingredient, and this one is a bit trickier to get right, is determinism. We want everything we do to be deterministic. So what does that mean?
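Before getting to determinism, the interface swap just described can be sketched in a few lines. The names below are invented and drastically simplified; the real INetwork, IConnection, and IAsyncFile interfaces are much larger:

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of the interface trick: database code talks to
// a pure-virtual INetwork, and the simulator swaps in its own
// implementation. All names here are simplified for illustration.
struct INetwork {
  virtual ~INetwork() = default;
  virtual void send(const std::string& destination,
                    const std::string& payload) = 0;
};

// The simulated network never touches a socket: "sending" is just
// handing bytes to the receiver's callback. A real simulator would
// also inject seeded random delays, drops, and reorderings here.
class SimNetwork : public INetwork {
  std::function<void(std::string, std::string)> deliver_;

 public:
  explicit SimNetwork(std::function<void(std::string, std::string)> deliver)
      : deliver_(std::move(deliver)) {}

  void send(const std::string& dst, const std::string& payload) override {
    deliver_(dst, payload);  // copy bytes from one memory region to another
  }
};
```

In production the same call sites would get an implementation backed by real sockets; under simulation they get something like SimNetwork, so the database code itself cannot tell the difference.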
When we start a simulation run, and a simulation run is just some set of workloads that we run against a FoundationDB cluster, we start by generating a random number, and that will be our input, our seed, and we feed it to a deterministic random number generator. Now, whenever we make a decision, for example when you send a network message, we typically don't just copy the memory; instead, we might introduce a small delay, or we might even just close the connection, because these things can happen, so we might introduce a failure. We make a lot of random decisions, and we use this deterministic random number generator to make them. Then, at the end of the test run, we return the last random number as an output; it's just like calling next on our deterministic random number generator one more time. And if you get everything right, then two runs with the same input will generate the same output, because you generate the same number of random numbers. So now you have a pure function, which is really, really nice, because your test description plus your random seed will do the same thing no matter how often you run it, and you can basically unit test this thing. That is something you cannot do with any other database that I'm aware of. In theory, if you wanted to prove that at least your workload is correct in the context of your simulation, you would just test every single seed there is, but obviously that would require two to the power of sixty-four runs or something, so we are not going to do that. So how do we make use of this? There is obviously other data that gets created: we have trace files, which are probably the most important ones for us to look at, but we also generate things like disk queues and B-tree files, and that's almost it. At the end of the run we can go through our
trace files and then make a decision whether the run is considered a success or not, and that is a hard problem in itself. A simple example: if you inject too many failures, if you just say all my disks break, then you're going to lose data, right? So that's not a very interesting test failure, because we don't promise that we will survive an event where all your disks decide to give up at the same time. But if you have, for example, an assertion failure, then we would consider that a test failure, and therefore it could point to a bug in the software. There are some difficulties. I talked about what makes a program non-deterministic, and there is a surprisingly large number of things that will make your software non-deterministic. The most obvious one is if you call into a non-deterministic random number generator: if you generate a true random number somewhere and then you do an if this modulo two equals zero, then it will do something different every time you run it. Time is non-deterministic, because you cannot rely on your clock to be deterministic or to run at a certain speed. CPU instructions take different lengths of time; some are faster than others. If you ask anything about your system state, like how much disk space do I have, we don't have any control over that: this simulator might run on a notebook, and someone might create a Word document, and after that the disk space consumed will be different. The memory footprint: we rely in certain parts on malloc and free, and depending on the implementation, different things can happen. Obviously disk latencies: we don't control the file system. And also other things: if you ever read uninitialized memory, anything can happen; if you read memory that you didn't initialize before, it will have some random value in there. And even though everybody on the FoundationDB team is trained to see these things in the code before they happen, we still get it wrong from time
to time, and someone might introduce a memory bug or something that makes this non-deterministic. So what we do is we run roughly 5% of our tests twice with the same seed and compare the unseeds at the end, and that way, if someone introduces a non-determinism, we will eventually catch it, and someone has to debug that, and debugging this is not a great experience. How often are these things introduced? Is it every commit, or every new feature? This sounds painful to debug; how often do you have to do it these days? It's surprisingly rare that we see them; I would say every few months, so it's a rare thing. And what is typically the turnaround time to stop the non-determinism? Weeks, worst case? Yes, we had that once; we once had a non-determinism that took a long time. In most cases it boils down to running Valgrind on it, and Valgrind will catch it, because in most cases it's just a memory bug, but I remember one case where it took several weeks of debugging until we found the problem. Hey Markus, can you explain what the unseed is? Yes. So we have this deterministic random number generator, and it's basically just a pure function: you feed it a random number and it will generate another number, and the output will look kind of random, and you can do that as often as you want, and you always take your previous output as input again, so you run in this loop and you can generate as many random numbers as you want. Now, let's say you call this ten times; if you start off with the same number, then the eleventh number will be the same in both cases, and that is basically our unseed. We assume that if our test is deterministic, we will generate the same number of random numbers; we might generate 10,000, obviously it will be much more than ten. And so just after we finish our test, we call our deterministic random number generator one more time, and that will
be our unseed; it is just another random number. So we can run the same test twice and compare this last random number, and if everything goes well, they are the same. Thanks. Cool. So, time. I quickly want to talk about time; that's an interesting topic. In simulation, we simulate machines, we simulate the network, we simulate disks, we simulate data centers, and we do all of that with only one CPU core, so that's one problem. The other problem is that we cannot rely on the system time, because, as I said, it's not deterministic. So we model time ourselves, and the way we do that is we start by assuming that each task, so if you think back to that picture I showed you, with multiple colors of things running one after the other, we assume that all of them take exactly zero seconds to complete. This is obviously not correct, but in many cases it's a good enough model, as long as you are not CPU bound. Then, whenever a task sleeps, we roll time forward. The most trivial example is if you write some code that says something like wait, sleep for one second: we just have a value in memory, a double in seconds, and we simply increment it by one. And we sleep for multiple reasons. When we do network IO, as I said before, if you send something, you do your send, but in the background we have another actor that basically just sleeps for a random amount of time and then delivers the network packet to the receiving actor. Disk IO, kind of the same thing: our simulated disk interface just uses blocking IO, so we write to the disk and it kind of immediately returns, but then we sleep for some amount of time to simulate the actual disk latency. And then there are also background tasks, like some garbage collection stuff and a few other things, that run every second or so. So whenever we do that, we roll time forward, and one of the really nice properties of that is
if you have a bug that causes your cluster to stop making progress, you can just keep rolling time forward, because as long as you don't consume any CPU, you are just incrementing a value in a double. So a timeout of something like 36,000 seconds, or an hour, I don't remember exactly, roughly an hour until we time out, that hour can go by in milliseconds if everything is just waiting. For certain test scenarios, our simulations actually run much faster than real time; for some other tests, where we are more CPU bound, they might be slower. But because we inject so many faults, it quite often happens that many machines are just waiting on timeouts and cannot make progress, because they first need to mark other machines as degraded and things like that. There are also major limitations to testing like this. One that has bitten us pretty badly in the past is that engineers start to rely on it: we have engineers who work in a way where they just write down some code, run 100,000 tests, and if those tests succeed, they assume their code is correct. And surprisingly, in most cases this works: if you manage to get all your tests succeeding, there's a pretty good chance you have written something correct. But that also means that some of the designs you will find in FDB are something that only a madman would implement in any other system. Can you give an example, or do you want to give an example?
Yes, I can give you a great example, and luckily we caught that one in testing, though sadly not in simulation testing. So FoundationDB has its own metadata, right? Stuff like where the shards are written to, which storage team is responsible for which data. If you lose this metadata you're pretty much screwed, because now you don't know where your data is; you lose your cluster, and it's very hard to recover from something like that. The way this works is very, very clever, but in my opinion not clever in a good way. We have special processes called proxies that are responsible for committing transactions: a client sends a transaction to a proxy, the proxy coordinates with other processes to do conflict resolution, and then it writes the transaction to a distributed log. But this metadata is not on the storage servers. Instead, each proxy has a copy of all the metadata in memory, and additionally we write it into the distributed transaction log. The way this is implemented is that when you commit something, the first proxy writes it into an in-memory key-value store, which has a disk queue as its durable storage, and then sends it to all the other proxies; the other proxies also write it into an in-memory key-value store with a disk queue as durable storage, but this disk queue is shared by all of them. What we do to get around this is rely on determinism: only the committing proxy actually writes to disk, while all the other proxies generate all the same disk writes but, right before writing, just throw them away. Now, if all of them do exactly the same thing, this works great. But if one of them does anything non-deterministic, they end up with different ideas of what the disk queue looks like. Initially nothing bad happens, but as you keep writing they start to overwrite that state, and they do it in a very ungraceful, very bad way: they will write with a wrong offset, for example. Then eventually something in your system crashes, you need to recover, you need to read your disk queue again, and at that point it isn't readable anymore, so you have lost your whole state. There was a bug with exactly this, where one proxy did not commit something but everybody else thought it was committed, and instead of just corrupting that one piece of data, it made the whole metadata unreadable. As soon as this happened, the cluster was basically gone; we could never recover from it. And you can break it in other surprising ways: for example, just by making snapshotting in the in-memory key-value store non-deterministic, which is not an unreasonable thing to do, right? If you are not familiar enough with FoundationDB you might do that, and then be surprised that things fail in a really bad way. So that's one of the more complicated examples; I think it's a very clever solution, but it's not a good solution, and something more naive would probably work better. I know this was a complex example, I'm sorry about that; maybe I should have thought of a better one first. It made sense. Okay, good.

Another problem (that's what this point is supposed to mean) is that we find more bugs the longer we test, right? We are running hundreds of thousands of tests per day, probably in the millions, and Apple is doing the same thing. But sometimes someone introduces a bug so subtle that you need 200,000 tests to find it in the first place. So now it's very hard to keep your main branch clean: you can run 20,000 tests before you commit, no problem, but that might not actually find the bug. So after a while you end up with something like three failures out of 500,000 runs (at that rate, a 20,000-run pre-commit suite catches the bug only about 11% of the time), and now the question is: who's going to debug this? It probably won't be the person who introduced the bug, because we have no idea who
that was. The solution is that we have a few people who are very, very good at debugging this, and once a month or so they sit down for two days, find these very rare failures, and fix them. Do they want to do this, or is it like picking the short straw? Currently it's mostly one person who is extremely good at it, and this person doesn't work for Apple, so we have been happy to just hand these failures to him. But since very recently we have a rotation: all the most senior people on the team are in the rotation, and once a week you have to spend a day debugging these kinds of failures.

There is obviously a risk that our models are wrong. Real systems can have failure modes that we don't simulate, and then you will be surprised when they happen in production. One example: there is a very rare bug in certain disks where, when you write something, the disk acknowledges the write but doesn't do anything at all. If something like that happens, you're in a pretty tough spot, because all your checksums will still be correct, but you end up in a weird situation where, say, three storage servers hold different copies of the same data. We also have chaos testing, obviously, with real clusters where we inject failures, but it kind of has the same problem. So, sadly, this point here means we do sometimes find problems in production; luckily, they have always been problems that caused availability loss, never any data corruption or anything like that. Oh, I guess my camera is overheating. You're back. I'm back, for now.

Another thing is that the simulator assumes the CPU is infinitely fast, which is obviously not true. This is typically okay as long as we are IO-bound, but once in a while someone introduces code that doesn't run in constant time, and this can cause weird problems. Because everything runs in a single thread, if you hog the CPU for two seconds, other machines will assume you're dead, because you stop heartbeating, all these kinds of things. Typically this won't cause major issues (we test very well for things not responding quickly), but it can make the system unstable if you hit it in production. We have certain things we do to catch this: whenever we run a task, we count CPU cycles, and if we go above a certain threshold, we write a trace line saying "this was a slow task, maybe you should look at it". You can get those just because of a context switch in the operating system, but if you see a certain task being slow very often, even in simulation, you probably should look at it.

The last one is really hard: gray failures. Because of the strong consistency guarantees, we don't guarantee progress; if the whole network goes down, obviously, what are we going to do? We don't make progress. So most of our tests inject certain kinds of failures and then, at the end, just verify that everything is able to come back up and all the data is still in a consistent state, not what happens in between. But we can run into real networking issues. For example, we have one leader process that coordinates a lot of things and gets elected through a Paxos-like algorithm; we saw problems where this leader would be elected on a machine that couldn't talk to any other machines except the ones responsible for the election. We had problems where a single disk being extremely slow, but sadly not dead, would cause the whole system to slow down. These are obviously problems we need to fix, because we want as much availability as possible, but simulation is not a great way of finding them, and so we use chaos
testing for these kinds of things.

So, very quickly, this is what it looks like: we have one single machine, and this single machine runs everything. It runs all the clients, it runs all the servers; it has simulated disks, simulated machines, simulated data centers, simulated everything. And this allows us to inject failures quickly. I quickly want to show what one test looks like. We have workloads that are compiled into the binary. I'm not sure you can read this, it's a bit small. I can see it. Okay. So these workloads: for example, this one runs on the client, and each client executes the transactional workload. What it does is that the data forms a cycle, where each value is the key of another key-value pair, and if you follow all of them, you should walk in a circle. That is pretty easy to verify as long as nobody else writes to the database, so at the end of the test run we can check that we still have a cycle, and if we don't, it typically means something in the transaction subsystem is broken. Then we have a workload that introduces clogging, that is, random network partitions between random nodes. Rollback forces the situation where you commit something and the commit then gets rolled back because of a failure; when that happens, storage servers might need to forget about the commit, so this has to be tested. And Attrition just kills machines: this one kills up to 10 machines, but makes sure at least 3 survive, because if you kill all of them you just become unavailable or lose data; it also brings them back (in this particular case we reboot them). This is one of the simpler tests I could find, but the benefit is that because we have these workloads, we can combine them however we want. This is essentially a test description, and all the workloads run in parallel; you can also define certain ones that run in succession.

These are all the disasters we simulate. Some are baked into the simulator and happen all the time; some are triggered through a workload. As I said: broken disks, broken machines, clogging, nukes (processes that lose all their data at once), and dumb system administrators: we try to break our own configuration and make sure we don't lose data. If you change the configuration to something invalid, you might run into a situation where the cluster becomes unavailable, but once you fix your configuration it should come back up and be happy. Not all of these are fixed, obviously; we know of companies that managed to configure their cluster in a way that broke it. This one is incredibly hard; basically, system administrators are more creative at breaking stuff than we were, and in those cases we might need to fix things on our side.

Also, because every simulation test runs in one process, we have a global view of our data, so we can verify that a transaction that was acknowledged to the client never gets rolled back. At the end of the test we always remove all the workloads and make sure the cluster quiesces, so that no phantom workload keeps running. And very quickly, since we're actually out of time: I mentioned this before, we have a Kubernetes cluster that runs these tests hundreds of thousands of times. Why do I have this slide again? I don't know.

There is some future work. We would like to use our CPU time more efficiently: we try to identify tests that are better at finding bugs than others and run those more often. This might be a machine learning problem, or it might just be very simple statistics, although these days it seems statistics and machine learning are the same thing; I don't know, I'm too old for that. We also want to get better at finding blind spots; one idea might be to have a dedicated
team that only works on simulation testing and doesn't communicate directly with the engineers. We have some ideas for how to achieve this.

The last thing I want to finish with: when we found that horrible bug I mentioned before, one of our founders told me that it's always easy to blame testing or code review whenever something goes horribly wrong, but a properly architected system should not run into catastrophic failures even if there is a bug. And there is truth to that: if your architecture is robust, then even when something misbehaves horribly in an unexpected way, you might survive with a black eye. These are Byzantine failures, in some regard, and we can make certain things more robust against that kind of failure. And that's it. Sorry for running late; I'm still here for questions if you have any.

Absolutely. I'll clap on behalf of everyone else. I'll just say, Marcus, I think next time don't go cheap on the camera; you should get one that doesn't overheat. I've never heard of a camera overheating; I'm assuming it's not just a webcam. It's a bit more complicated: it's a mirrorless camera, the sensor is pretty large, and because of that it can overheat. Okay. Again, we'll open it up: do you have any questions for Marcus?

Hi, I'm Chad, I'm from CMU. Thanks for the cool talk. Just to confirm what I think I saw: is the infrastructure for doing this type of testing part of the open-source FoundationDB? It looks like it's in tests/slow; I saw the clog tests in the TOML files. Yes and no. Sadly, not everything. The simulator is open source, so you can check out the FDB source code and run a simulation test on your MacBook. The infrastructure part, which we call Joshua (the part that sets up a cluster for you and is then able to run a million of these tests, because you don't want to run a million tests on your notebook; you can, but it will take a very long time), that part is not open-sourced yet. And we don't own it; Apple owns it, and we have access to it through a source-code agreement. But the plan is to get it open-sourced.

Okay, here's a question from someone on chat; it's late where they are, so they can't unmute themselves. They ask: what if there is another actor-model-based database which is written in C? How hard would it be to port Flow from C++ to C? Basically, there's an existing database based on the actor model, but it's written in C; would it be a major rewrite to introduce Flow and make it work in C? I mean, yes: we rely a lot on C++ functionality, so it would basically mean rewriting the whole thing. Okay, we can go on.

Hi, I'm Juan, one of the PhD students. I was going to ask: have you looked at Mozilla's rr before? It seems to share a lot of the principles; if you control the randomness, then you can get determinism. And it also has some functionality, maybe I didn't catch whether you use it, where it can do chaos-mode scheduling: some threads get higher priority and some threads get lower priority. I'm just curious whether you have explored rr. You mean the Mozilla thing? Yes, we did. We found that rr is much slower and more memory-intensive than our simulator; if a simulation run consumes 8 gigabytes of memory, it probably won't run on my notebook. That's probably the major drawback, just the speed. People have used it, and they have run the simulator inside rr, but we have never run an actual cluster within rr, so I don't know how well that would work and cannot really comment on it.

Okay, I think we're over time, so we'll stop here. Again, Marcus, thank you so much for spending time with us; this has been super insightful. I think I've never heard of anybody doing the kind of testing you're doing in FoundationDB, so this is super exciting.