Welcome to the talk on the hunt for etcd data inconsistencies. Data inconsistencies are a pretty unique animal, so finding them is more of an art than a strict science, at least for now. Hopefully, after the talk you will have a better understanding of the topic and be able to find inconsistencies in your own systems. I'm Marek Siarkowicz, I work at Google, and I'm one of the etcd maintainers. The topic for today: I would like to define what data inconsistencies are, what tools we use to hunt them, and how etcd has adapted to find the problems that we had with inconsistencies. At the end I will do a short demo showing you how it works in practice.

etcd implements so-called distributed consensus. It means that multiple etcd server processes work as a single unit, a cluster, that can consistently respond to user requests: every user observes the same data, and any write by one user can be observed by all users. So what is an inconsistency? An inconsistency is when one of the instances breaks loose and starts spilling nonsense. No matter how cool your three-headed dragon is, it stops working when one of the heads goes berserk.

To give you a more concrete example, here is a real production case that I noticed and tried to document, where an etcd inconsistency caused problems for the Kubernetes cluster using it. In this case, Kubernetes nodes were flapping between Ready and NotReady status. There were random failures and random timeouts. Authorization sometimes worked and sometimes didn't. Add-ons were crashing. So how do you know what is happening? Basically, because of the architecture, one of the API servers was totally misbehaving. And what was the root cause? It was exactly one missing write. As you see on the graph, missing one write can cause the revision of etcd to totally diverge. This is because of how Kubernetes uses etcd, and its revision is crucial for Kubernetes correctness. To explain: the revision is like a global counter for each change that happens in the cluster, and revision is used by Kubernetes for optimistic concurrency control (see the sketch below).

The example above was unrelated, but the reason I'm talking to you today is the state of etcd 3.5.0. The release was done after a long time with a lot of changes in the project, including a total change of maintainers and the loss of a lot of knowledge that was unwritten and never passed on between maintainers. The work that motivated etcd to look into the problem further was two inconsistencies, or correctness issues. The first was data inconsistency on crash: your etcd server instance could be happily running, and then it OOMs, or your VM runs out of memory, or it is killed, and etcd could make an incorrect write that results in this one instance being inconsistent with the other members. In the second case, apparently no one had noticed that etcd, for a long time, didn't provide durability in some cases. And we still mostly do not have a full explanation for multiple reports that are currently very hard to reproduce and understand. We are trying to build tools to understand them, but at the current state we have multiple reports that are unconfirmed, unverified, and also not fully understood. The underlying problem is that etcd doesn't have a test suite capable of detecting this class of issues. I expect most projects and databases don't have these kinds of tests, but unlike most databases, they are not the building block for a whole ecosystem.
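Coming back to the revision mechanism mentioned above: here is a minimal sketch of revision-based optimistic concurrency using the etcd clientv3 API. The endpoint, key, and value are illustrative, not from the talk:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is illustrative; point this at a running etcd.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	key := "/registry/pods/default/example"

	// Read the key and remember the revision at which it was last modified.
	get, err := cli.Get(ctx, key)
	if err != nil || len(get.Kvs) == 0 {
		panic("key missing or read failed")
	}
	rev := get.Kvs[0].ModRevision

	// Optimistic concurrency: the put succeeds only if nobody modified the
	// key since the revision we observed. This is why a single lost write,
	// which makes revisions diverge between members, breaks Kubernetes.
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
		Then(clientv3.OpPut(key, "updated-spec")).
		Commit()
	if err != nil {
		panic(err)
	}
	fmt.Println("write applied:", txn.Succeeded)
}
```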
So no matter what anyone says, etcd is the building block for Kubernetes, which is the building block for the whole cloud native infrastructure. We cannot deny it. We need to do better as a project; we cannot let this slide.

So how do we hunt for inconsistencies? What is the prior art? Previously, etcd was using functional tests. Functional tests are pretty simple: they run an etcd cluster, they do some failure injection, and maybe, when they finish, they verify whether the state is consistent between the members. There were multiple problems with them. The main one: they were written by one person, and when this person left the project, no one knew how to use them and no one understood them. They ran, but most of the time they flaked, a lot, and they did not solve the issue end to end. They just gave us a signal that was very flaky.

If you go to the state of the art, you can find Jepsen. Kyle Kingsbury is a person who is really interested in the correctness of databases. He built a whole project and community that validates the safety of distributed databases: it checks whether the guarantees they claim to give actually hold, or whether the database is lying and selling more than it delivers. Jepsen was used successfully multiple times to validate etcd. But this solution is also not perfect for an open source project that is written in Go, has limited capacity, and doesn't want to learn a new language, be forced to run on AWS, or learn a whole new domain, the domain of distributed systems testing, which is known to be pretty hard. The Jepsen tests were also never designed to run in CI: if etcd changes anything in its API and that breaks the tests, it cannot be easily verified. Someone needs to run them manually every time we make a change, so they cannot be used for continuous validation. And in the end, maybe this is a personal opinion, but we couldn't even get CNCF to pay the owner of the system to help us.

So what are the tools that we need, what are the requirements? What we need to build is something that is adequate. By this I mean that it not only reproduces all the historical issues that we know of, easily and every time, so that we can always use them to validate correctness, but is also able to exercise all the generic properties of etcd and find new issues that we haven't thought of. The tools also need to be strict. They cannot just run and return nothing, leaving us to wonder whether they work. Every time we validate killing a machine or killing a process, they need to verify that it actually happened and that etcd responded, because if someone changes some code and it no longer kills the process, the process lives on and we silently stop validating. That is what happened with the functional tests: in the end they didn't validate anything, because sometimes they didn't actually run the disruption they were meant to (see the sketch below). And the tools need to be accurate: if there is a failure, we need to be able to attribute it. Is it a problem with the test, or is it a problem with etcd?
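To illustrate that strictness requirement, here is a minimal sketch, with hypothetical names, of the kind of check that makes a silently skipped disruption fail the test instead of letting it pass:

```go
package robustness

import (
	"os/exec"
	"testing"
	"time"
)

// assertProcessExited fails the test unless the disrupted process actually
// died. Without a check like this, a refactor that stops killing the
// process leaves the test green while validating nothing.
func assertProcessExited(t *testing.T, cmd *exec.Cmd, timeout time.Duration) {
	t.Helper()
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case <-done:
		// The disruption really happened; it is safe to keep validating.
	case <-time.After(timeout):
		t.Fatalf("expected process %d to exit after disruption, but it is still running",
			cmd.Process.Pid)
	}
}
```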
Without that, we flip a coin when interpreting a failure, and someone spends a lot of time trying to reproduce or understand the issue instead of getting a clear signal about who is to blame. The tools also need to be maintainable. They need to be runnable by everyone, all the time: on each PR, and every day for a long time. Part of etcd testing uses the end-to-end suite, which is just a wrapper for starting a process with some configuration, and reusing this existing framework saves us a huge amount of maintenance cost, because we just need to run etcd. We don't need magic; we just need an etcd instance and a check that it behaves correctly.

With that, I want to move to robustness, which is how I want to define what we are trying to validate. Robustness is the ability of the system to maintain correctness under any condition. Correctness is any guarantee that we promise our users. And any condition means any condition: a cloud failure, a process problem, a disk failure, flipped bytes; we don't care, we just need a correct system. To give some detail, here is a high-level overview, with guesses at how frequently the issues could happen, but this is the level of granularity that we want to consider. Outside of normal operations, which are the trivial case, we need to know if there is a problem under normal packet loss, which happens every day. We need to know whether there is an issue during upgrades, and whether we are still correct even if people pick an unsupported or non-obvious upgrade path. We also need to take into account standard failures, like people just shutting down their machines; it is not at all obvious that if the process is killed at any line of code, it will still be correct. And going into more obscure failures that happen pretty rarely, and this number comes more from on-premise experience: when multiple bit flips occur, you can get memory corruption that still satisfies the CRC check, so it looks correct to the process.

What does correctness mean here? etcd promises its users two types of guarantees: one about the key-value API and one about watch. They are separated because the first is request-response and the second is a subscription to changes, so it is delayed by definition. Key-value changes need to be atomic: either a transaction succeeds fully or it never happened. Key-value changes need to be durable: if I make a write, this write is permanent, forever; even if the process shuts down, when I restart etcd it should still have the data that I wrote. And the API needs to be linearizable. I will go into that further, but it basically means that every change is ordered consistently with its real-time counterpart, so you can order the changes by time.
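As a side note, to make the linearizability guarantee concrete: the clientv3 API even lets you opt out of it per read. A minimal sketch, with an illustrative endpoint and key:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Default reads are linearizable: the response reflects every write
	// that completed before the read was issued, ordered by real time.
	linearizable, err := cli.Get(ctx, "example-key")
	if err != nil {
		panic(err)
	}

	// Serializable reads answer from the local member's state. They are
	// faster but may be stale, i.e. ordered before recent writes.
	serializable, err := cli.Get(ctx, "example-key", clientv3.WithSerializable())
	if err != nil {
		panic(err)
	}

	fmt.Println(linearizable.Header.Revision, serializable.Header.Revision)
}
```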
The watch guarantees say that watch gives you a global order: all changes in etcd have a revision number, which is a global counter, and all changes on the watch stream should be ordered by this number. The watch needs to be reliable: it should never drop any events or any part of the stream, so you should see all the revisions. And the watch needs to be atomic, which means that if there is a transaction that changes multiple keys, they should be sent as one unit within one response instead of being split between multiple responses. Basically, every time you get a watch response, you should observe all the events that happened within that revision.

So how do we validate correctness? We can take the part from the functional tests where we take some scenario, run etcd, inject some failures, and send some traffic; but how do we validate the correctness? This is the hard part: what do we read from etcd, the failures, or the traffic to make that decision? This leads to the broader point that we cannot use traditional testing that is scripted. Tests like unit tests, functional tests, and integration tests all follow a trajectory, some script that someone defined, and they can never deviate; the moment the system under test deviates from what was scripted, the test fails. So we can't use those. How do we test generic properties?

For help, here comes exploratory testing. If you have ever done fuzz testing or property-based testing, this is what I'm talking about: not testing a scenario, but testing some invariant, some property of the system. But then the problem is, okay, we have an approach, but how do we do validation? Here comes model-based testing. A model is a simplified implementation that is easy to understand by anyone, should fit in, let's say, a couple hundred lines, and behaves like the full system: the full, fancy, distributed, multi-node system should in the end behave like a simple hundred-line data structure or class. Using model-based testing requires us to collect an operation history from the real system and replay it on the model. If the model represents the correct and desired behavior of the system, replaying the history should give us an answer: is the system behaving as it is supposed to? For etcd: etcd is a simple key-value store, so why can't we use a hash map with a counter? That's all.

But there is another problem: which order? Here we have a set of requests that were executed by multiple users. There are concurrent requests; there are different types of requests. How do we know what order they were applied in? etcd knows, but when we test from the outside we don't know what is happening inside etcd. And here comes linearizability testing, or a linearizability checker, which is a tool that can take the operation history and the model and find the order. For the image here: it basically goes through every operation and draws a line that is consistent with how the model would behave. We don't know in what order the operations happened in etcd; we can just, based on the history, derive one order that could have happened and is correct. If there is no correct way to connect the lines, it means etcd is incorrect. Or at least the test fails, and we can then go in and determine whether etcd or the model is incorrect.

So this gives us a full solution. The etcd robustness test is, as described, the part that starts the cluster, injects some failures, and generates some traffic; then it collects the history of both the key-value operations, which are puts and gets, and the watch responses. The operations, combined with the model, can be passed to the linearizability checker, which gives us the answer: is it correct or is it not?
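Here is a minimal sketch of such a model wired into Porcupine, the linearizability checker used later in the demo (github.com/anishathalye/porcupine). To keep it tiny it models a single key plus the global revision counter; the request, response, and state types are illustrative, and the real robustness test model covers more operations:

```go
package main

import (
	"fmt"
	"os"

	"github.com/anishathalye/porcupine"
)

type request struct {
	op    string // "get" or "put"
	value string // value written by a put
}

type response struct {
	value    string // value observed by a get
	revision int64  // revision returned to the client
}

// state is the whole model: one value and the global revision counter.
type state struct {
	value    string
	revision int64
}

var etcdModel = porcupine.Model{
	Init: func() interface{} { return state{revision: 1} },
	// Step answers: starting from this state, could the system legally
	// return this output for this input? It also yields the next state.
	Step: func(st, in, out interface{}) (bool, interface{}) {
		s, req, resp := st.(state), in.(request), out.(response)
		switch req.op {
		case "put":
			next := state{value: req.value, revision: s.revision + 1}
			return resp.revision == next.revision, next
		case "get":
			return resp.value == s.value && resp.revision == s.revision, s
		}
		return false, s
	},
}

func main() {
	// A tiny history: client 0's put overlaps client 1's get in real time,
	// so the checker may order them either way, as long as some order works.
	history := []porcupine.Operation{
		{ClientId: 0, Input: request{op: "put", value: "a"}, Call: 0,
			Output: response{revision: 2}, Return: 10},
		{ClientId: 1, Input: request{op: "get"}, Call: 5,
			Output: response{value: "a", revision: 2}, Return: 15},
	}

	res, info := porcupine.CheckOperationsVerbose(etcdModel, history, 0)
	fmt.Println("linearizable:", res == porcupine.Ok)

	// The same kind of HTML visualization shown in the demo.
	f, err := os.Create("history.html")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := porcupine.Visualize(etcdModel, info, f); err != nil {
		panic(err)
	}
}
```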
The watch history is simpler to validate, because it is already ordered; we already know the order of the results, so we can just write simple functions to validate it.

When I started this effort, we quickly figured out that no one knows how etcd is used by Kubernetes. So together with Han Kang, I worked on defining the contract, because without an exact contract, how do we know what to test? There were unwritten assumptions, made a long time ago, that are technically correct, but no one ever wrote them down. And we discovered properties like resumability, the watch being resumable, which means that Kubernetes supports bookmarks and uses them to save progress. But for that to be safe, bookmarks, when they are sent, need to guarantee that all the events before the bookmark were already sent. If not, Kubernetes will resume from an incorrect point; it will go either forward or far before the last change. And we discovered that this was broken in etcd in some cases.

So what are the results? As of today, we found three issues, mostly related to different parts. Surprisingly, there was one behavior that no one had ever tested, which is etcd recovering its state from other members, and we found that watch can travel back in time. We found duplicated events. And we found that defrag can cause inconsistency; it was just really, really rare.

So what will we be doing next, what am I working on? We have results; what should we do next? We need to codify the whole contract; we have just touched it, it is not nearly done. We need to make the failure injection much more advanced: we don't yet test disk disconnection, basically disk crashes and the kernel losing data that was not synced. We are already working on bbolt, the embedded key-value store, to have the same test suite, the same testing approach, to find really obscure failures that look like silent data corruption, which is the kind of problem most people never discover. We can also use history validation in every test: every test that goes through the API, whether an API test or an end-to-end test, is doing some operations and generating some history, so we can validate that history and check whether our tests themselves are correct. And the last thing: when we have the contract fully tested on etcd, we can take this contract, implement it in the model, have a model that fully implements the contract, maybe with some changes to make it more efficient, move it to Kubernetes, and have Kubernetes verify against it what is happening. That would speed up Kubernetes testing a lot, because it would not need to start a full etcd instance; it could be just a process-level test.

So let's see how it works in practice. Here I have a ready command that I can run, which basically does the three things described, and it is fast. What is happening here: we are starting an etcd server, we are checking its health, and we inject the fail point. Fail points are comments that maintainers put in the etcd codebase at critical points; with the use of a special library, they can later be used by this test suite to tell etcd: hey, crash at this point. And we have many of those.
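For background, the library in question is gofail (go.etcd.io/gofail): a fail point is declared as a magic comment, and a code generator rewrites it into an injection point that tests can arm at runtime. A minimal sketch of the declaration style; the function and surrounding code here are illustrative, and only the comment format and the raftBeforeSave name come from etcd:

```go
package main

import "fmt"

// saveToWAL stands in for the critical persistence path in etcdserver.
// The magic comment below is how gofail fail points are declared: after
// running the gofail generator, it becomes a hook that a test can arm,
// for example telling the process to panic exactly at this point.
func saveToWAL(entries []string) {
	// gofail: var raftBeforeSave struct{}
	fmt.Println("persisting", len(entries), "entries")
}

func main() {
	saveToWAL([]string{"put key value"})
}
```

The robustness tests then arm the fail point over the HTTP endpoint that gofail exposes in binaries built with fail points enabled, with something like a PUT of the term `panic` to the `raftBeforeSave` path, and afterwards strictly assert that the member really exited.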
If we set the fail point here, raftBeforeSave, which is the point where we call Raft to save, we expect the process to exit. And if it doesn't exit, the test also fails, because we want to verify that the fail point worked; we should be strict here. The member exited as expected, and we have the recorded history, around 2,400 operations here. We also have some average traffic numbers that we can simulate, and we require a minimal traffic QPS, so we not only test the low traffic of someone playing around with etcd, but also verify that the traffic we are generating is as high as we've seen in some real cases, and then validate it. And what the test reports is that the model is not linearizable, which means there is a problem with etcd. Here I specifically took an existing issue, and an etcd version that was vulnerable to this issue, and I reproduced it with one command; it took five seconds. At the end, the test reports all the important information that anyone can use to determine: is this a problem with etcd, or is it a problem with the test suite or the model?

So we can take the URL that is printed here. I have it prepared, but for the demo's sake I will refresh it, and we get this visualization. We can click "jump to first error", and we see that the visualization has reported an issue. The name at the top of the page is Porcupine; this is the linearizability checker that we are using, it gives us this HTML file. It shows the durability issue I described: there was a put request that returned revision 601, then the next put request returned revision 600, and the write was never persisted. All the following requests have a revision lower than what the client recorded, which means the model cannot find any valid connection to the next operation. This gives a proof, which a human can verify and read, that there is a problem.

Okay, so when should you use model-based testing? It is great for testing generic correctness properties. It separates the validation phase from the execution phase, which means we can generate the report and check whether the model is incorrect or etcd is. If etcd is incorrect, we have a proof. If it's the model, we can fix the model and rerun it on the same history; it should then pass, and we can verify that we fixed the issue. And as I mentioned, the model is reusable, so anyone who wants a fast etcd model can plug it in. Of course, not everyone needs 100% correctness; I mean, I hope you do, but I don't know. I simplified the model here; it can get really complicated, mainly because it needs to assume that a client can observe a data loss, and that data loss is not necessarily real: the server can crash before it responds to the client even though it persisted the write. This means, as the next point mentions, that checking linearizability is an NP-complete problem. The state space, and with it the cost of validation, grows exponentially, making the tests pretty fragile if you have any bugs or missing optimizations. So it's not one tool for everything.

If anything in this presentation made you interested in the topic, this is all public. You can go to the etcd code, read it, and contribute new fail points. In the end, maybe I would think about generalizing it, so that not only etcd can be validated. That's the end of the presentation. If you want to give me any feedback, you're welcome to take a screenshot and send it to me. That's all, thank you. Do we have any questions?
Hi, Marek. Thijs here, I have a question: what's the maximum size you test etcd against, in number of objects? Is it 10,000? Is it 100,000?

The size doesn't really matter; it's about QPS. If you have an SSD, the more QPS you simulate, the higher the probability that you will find an issue. So you can even test in memory; it just needs to be persistent, in the sense that you trust the memory to be persistent. So it could be really, really fast; I don't care about the storage, because it could run fully in memory to make the test faster.

Thank you. The mic is on this side, so if you have questions, it would be faster. Okay, sir.

Hi, Marek. Thank you so much. I was just wondering whether you already have any plan for testing network latency and things like that, because I see that right now there are four points that you inject into the code base, but what about testing these other external conditions which are not necessarily in the code, like crashes?

We are testing latency. I mean, latency is not the interesting part, because Raft is validated to be correct there. But on a more interesting point, we've found problems that behave more like data manipulation: if a packet is cut in half exactly at a message boundary, you lose one message, and because it's JSON the rest still parses, so everything looks correct, but one write is missing. I expect this is the case with the missing single write from earlier: the traffic was probably cut in a way that allows a request to go missing, and Raft doesn't account for that. We have already implemented a proxy with a very naive implementation; it's not great, and I would want to improve it into something generic. So there is a custom solution which is not very advanced, but yes, we are simulating network partitions, delays, and packet drops.

Hello. Thank you a lot for your talk. Next question: there was one slide which told us that the non-AWS setup was broken, and the question is, is there a more stable setup, configuration, or something else?

I think... so we needed to fix it, or rather I had a contributor who was trying to use Docker, and because they knew Clojure they fixed some parts of it, and apparently they were able to get results. But because they changed projects, I don't know if they stayed in the etcd community, so I don't know if they contributed the changes to the Jepsen infrastructure setup back for Docker. So AWS apparently worked great, but nothing else is proven to.

Hi there. I'd just like to ask about performance of etcd, with splitting things out. We're running on-prem, where all our etcd is running on one disk, and we've been advised to look at splitting out the WAL and data onto separate drives; because it's all VMs, we can't exactly split that onto separate SSDs. What other performance tuning tips are there?

Most of my work is in correctness, so I cannot say much about performance. There is a talk from KubeCon about tuning etcd; I would encourage you to go watch it. What I usually give as an example is to look at the Kubernetes scalability tests. They support 5,000 nodes. If someone claims that etcd has performance problems, they haven't even read the open source code that gives you a ready etcd configuration; Kubernetes guarantees 5,000 nodes, and a release will not be cut if it doesn't support 5,000 nodes. If you need more, we can ask Kubernetes, but there is a ready configuration. There is no magic.

Thank you.
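For reference, etcd does support the split the questioner describes out of the box: the --wal-dir flag places the write-ahead log on a different disk from --data-dir. A minimal sketch of the same configuration via the embed package, with illustrative paths:

```go
package main

import (
	"log"

	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	// Paths are illustrative. This is the programmatic equivalent of
	// running etcd with --data-dir and --wal-dir pointing at two disks.
	cfg := embed.NewConfig()
	cfg.Dir = "/var/lib/etcd"        // data directory: backend db, snapshots
	cfg.WalDir = "/mnt/ssd/etcd-wal" // write-ahead log on a separate drive

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()
	<-e.Server.ReadyNotify()
	log.Println("etcd is ready")
}
```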
SQLite is famous for having one of the most comprehensive correctness test suites run in CI. I don't know, I wonder whether you looked at it, whether in practice you could get inspiration from it, or whether it's too different.

I watched the presentation by Kyle, I read some code, and I understood some part of it. Collecting the history is the same. What traffic is sent, and what is verified, is a little bit different. In some ways it's more advanced, because it's not as fragile as the model-based testing I described. It validates, let's say, a limited model with some distributed-systems properties. Because of the limited traffic types, the limited operations, I can point you to the etcd reports page: the limited traffic allows you to make some assumptions, like a single register being set, or append-only traffic, and gives you properties that you can verify, for example that with append-only operations a value should always grow, and some traffic-suffix properties. But outside of that, it's more advanced and more reliable; I don't know if it's... yeah, that's all I know. Okay, thank you. If you have any questions, I can still answer them here.