I'm Marek Siarkowicz, one of the etcd maintainers. Unfortunately, Benjamin is not able to attend because of COVID restrictions, but he prepared a video that I will play during the presentation. Today, I want to talk to you about the etcd data inconsistency issues.

Earlier this year, there was an event that shook the whole cloud native ecosystem. The latest etcd release had a critical issue: etcd, the component that powers many cloud native solutions, including Kubernetes, could corrupt your data. Every single administrator needed to take action or risk their system becoming unrecoverable, or even requiring direct byte-level fixes. With the popularity of Kubernetes, this means that millions of clusters needed to mitigate the issue. I think this is a serious problem, and I hope you agree with me. I think the community needs to take conscious, direct action to prevent such issues in the future, because we cannot build a cloud native ecosystem on flaky and unreliable foundations.

So today's session will take the form of a post-mortem review. Doing a blameless post-mortem analysis is at the foundation of SRE culture, and I believe that, as an open source community, we should follow the same approach to make sure we learn from events like this as a whole and improve for the future.

Today's agenda: I will introduce you to etcd, at least enough for you to understand what happened. I will go briefly through data inconsistency, how it can occur, and how etcd protects against it. Then Benjamin will tell you how it actually happened in etcd's case. And at the end, I will share some lessons learned, the actions we as etcd maintainers will take to fix these problems, and maybe some personal learnings.

You may know that etcd is a distributed key-value store that powers Kubernetes, but also other solutions. "Distributed" here means that there are multiple instances that work together as a single unit, a cluster. No matter which instance you talk to, whether you read or write, you should get the same result: if you write to any instance, the write should be visible on all of them. etcd has built-in redundancy and resilience to failure, and this is a hard problem.

To provide those strong guarantees, etcd uses an algorithm called Raft. Raft is a so-called consensus algorithm that allows multiple servers, or instances, or members, to agree on values. Consensus is fundamental in distributed systems like etcd, and it sits at the core of the etcd code: every single decision that etcd makes is negotiated by Raft. What Raft does for etcd is take an unordered set of events, which can arrive at any etcd instance, order them, and ensure that they are available on each of the instances in the same order. The ordered sequence of events that Raft outputs is called the Raft log.

So let's look at how this works in etcd, through three layers: the API, the Raft implementation, and the storage. If there are multiple requests trying to set the same key to different values, which should be resolved first? This is handled by Raft. etcd starts by creating an entry, or message, you could say a Raft proposal, for the value from each client; each server does this for each of its clients, and then passes it to Raft. Raft takes those entries and puts them into one long queue, you could say, giving them a global order. Then each of those events is propagated to every server.
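To make this concrete, here is a minimal sketch in Go of why a totally ordered log gives every member the same state. This is not etcd's actual code; the Entry type and apply function are invented purely for illustration:

```go
package main

import "fmt"

// Entry is a simplified stand-in for a Raft log entry carrying a write.
type Entry struct {
	Key   string
	Value int
}

// apply replays log entries, in log order, onto a key-value state.
// Because every member receives the same entries in the same order,
// every member ends up with the same final state.
func apply(state map[string]int, log []Entry) {
	for _, e := range log {
		state[e.Key] = e.Value
	}
}

func main() {
	// The agreed-on Raft log: three clients raced to set "x",
	// and Raft assigned their proposals a single global order.
	log := []Entry{{"x", 3}, {"x", 1}, {"x", 2}}

	// Each member applies the same log independently...
	member1, member2 := map[string]int{}, map[string]int{}
	apply(member1, log)
	apply(member2, log)

	// ...and both converge on x == 2.
	fmt.Println(member1["x"], member2["x"]) // 2 2
}
```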
Thanks to that stable order of events, if every server applies them in the same order, they will all end up with the same result. So in our case, each server will first set x to 3, then to 1, and at the end, when the entry setting it to 2 is applied, to 2.

Now let's look at how changes to the storage are applied. To make persisting etcd state fast, it needs to be done in multiple stages. The Raft log is stored in memory, so when a new entry is added to it, we can immediately persist it to disk in the so-called write-ahead log, or WAL. This is really fast, because the operation is just appending the new entry, doing consecutive writes to the disk. Apart from the WAL, etcd also stores the database file. The database file contains an aggregated state: all the events squeezed into values, so basically just the last state of every key. It also holds the consistent index. The consistent index is basically a pointer between the storage and the WAL.

Let's look at what happens when etcd starts processing the second Raft entry. First, the entry is applied to the in-memory version of the database, which sets the key's value to 2. At this point, etcd can already return the result of the write to the user, because we have persisted the entry to the WAL and applied the change to the in-memory state. This allows us to make the flush to the database file asynchronous: every five seconds, the state in memory is flushed to disk. When the change is flushed, it updates the value of the key and also moves the consistent index to point to exactly the entry that was applied.

So what, then, is data inconsistency? If consensus is about multiple instances agreeing on a value, data inconsistency is when that agreement breaks. No matter how cool your three-headed dragon is, if one head has a mind of its own and starts returning different results, the whole system is flawed.

So if etcd uses Raft, and Raft is correct, what can happen to cause data inconsistency? As you can see, there are a lot of possible causes. Of course, there are hardware failures: a disk write can fail or not be fully persisted. It can happen, and you should always account for it. There can also be bugs in Raft itself. There can be problems with the Raft log not being consistently applied, so a bug in the apply method. A third class of bugs is the database getting desynchronized from the Raft log: if anything touches the consistent index and writes an incorrect value, etcd could apply some entries multiple times or skip some of them entirely (I'll sketch this failure mode below). And the last one, the reason it's so hard to support upgrades in etcd, is incompatibility between versions. Based on how etcd works, no change can be allowed to make a difference in the state etcd ends up with. If there is any backwards-incompatible request, then during a downgrade the older version of etcd could interpret the entry differently and end up in a different state. This is one of the reasons why etcd doesn't support a version skew larger than one: as maintainers, we don't guarantee, or don't check for, incompatibilities more than one version forward.
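Here is that consistent-index failure mode as a minimal sketch, again with invented types rather than etcd's real code, assuming a simplified WAL of set operations:

```go
package main

import "fmt"

// Entry is a simplified stand-in for a WAL record carrying a write.
type Entry struct {
	Index uint64 // position in the Raft log
	Key   string
	Value int
}

// replay re-applies WAL entries on startup, skipping everything at or
// below consistentIndex, which records the last entry already reflected
// in the database file. If the consistent index is wrong, entries get
// applied twice or skipped entirely, and the member silently diverges
// from its peers.
func replay(state map[string]int, wal []Entry, consistentIndex uint64) {
	for _, e := range wal {
		if e.Index <= consistentIndex {
			continue // already contained in the database file
		}
		state[e.Key] = e.Value
	}
}

func main() {
	wal := []Entry{{1, "x", 3}, {2, "x", 1}, {3, "x", 2}}

	// The database file on disk reflects the log up to index 2, so x == 1.
	state := map[string]int{"x": 1}

	replay(state, wal, 2)   // a correct pointer re-applies only entry 3
	fmt.Println(state["x"]) // 2
}
```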
Of course, etcd has some methods to protect against this. There were two mechanisms up to version 3.5.4, and the newest version introduces a new one. Both methods are disabled by default. If you enable them, etcd will check the database hash when a member joins the cluster; basically, it prevents a divergent member from joining. The second one is a periodic check: etcd will go to the other members and ask them, hey, what values do you have at a specific revision, and they will respond. If anything is wrong with the answers, etcd will raise an alarm and notify the administrators that they should look into it.

Knowing that, let's look at how the data inconsistency actually happened in etcd. For that, I have a video from Benjamin.

[Benjamin's video] Hello, I'm Benjamin Wang, coming from VMware. I cannot come to KubeCon on-site due to China's COVID policy, so I have to deliver the presentation by video; sorry for that. I will do a deep dive into the data inconsistency issue in this video, but I won't go into too many annoying details. Instead, I will present the key points using color diagrams, so as to be clearer and more direct. The issue number is 13766, and I also provided a summary on the issue; please feel free to read it offline.

The issue isn't easy to reproduce: when an etcd cluster is under high load and one member crashes, that member's data might be inconsistent with the other members after it restarts. It might be a little difficult to understand now, but no worry, I will show you the root cause in the next few slides.

First of all, we need to understand transactions in etcd. I believe everyone knows the ACID guarantees of database transactions. In etcd, there are two kinds of transactions: the etcd backend transaction and the boltdb transaction. etcd uses boltdb as its storage engine. When a boltdb transaction is committed, the data is guaranteed to be persisted on disk. But committing a transaction is an expensive operation, because it needs to sync data to disk. In order to get better performance, etcd commits the boltdb transaction periodically instead of on each request; the default interval is 100 milliseconds. An etcd transaction is just a logical concept on top of the boltdb transaction, and one boltdb transaction might span multiple etcd transactions. So when an etcd transaction is committed, it doesn't mean the data is persisted on disk yet; but from the user's or client's perspective, the etcd transaction still guarantees the ACID properties. You may ask: what if an etcd instance crashes after an etcd transaction commits, but before the boltdb transaction commits? The answer is that the data will not be lost, because it is already persisted in the WAL files, and etcd will replay the committed entries from the WAL files on startup.

The boltdb transaction and the etcd transaction are independent, which means they are created and committed in separate goroutines. The only limitation is that they are mutually exclusive: a boltdb transaction cannot be committed in the middle of an etcd transaction. OK, this is the relationship between the etcd transaction and the boltdb transaction; it's the basis for understanding the data inconsistency issue.

The direct reason for the data inconsistency issue is that the value of the consistent index isn't correct: it doesn't match the applied data. The consistent index is simply the latest applied log index. Previously, in etcd 3.4, etcd saved the consistent index into boltdb at the end of each etcd transaction, and there was no issue with that design.
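To sketch that v3.4-era relationship in code: the following is a loose illustration, not etcd's implementation, with a made-up backend type standing in for boltdb and its 100-millisecond commit timer:

```go
package main

import (
	"fmt"
	"sync"
)

// backend imitates etcd's batched storage: applies accumulate in one open
// boltdb-style transaction, which is committed on a timer (every 100ms in
// etcd) rather than once per request, because each commit costs a disk sync.
type backend struct {
	mu      sync.Mutex
	pending map[string]int // stand-in for the open boltdb transaction
	db      map[string]int // stand-in for data already synced to disk
}

// applyTxn is one logical etcd transaction. In the v3.4-era design the
// consistent index is written at the end of every apply, inside the same
// storage transaction as the data, so the two always commit together.
func (b *backend) applyTxn(index int, key string, value int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[key] = value
	b.pending["consistent_index"] = index
}

// commit flushes the whole batch at once, as the periodic commit does.
// The lock makes commit and applyTxn mutually exclusive, mirroring the
// rule that a boltdb commit cannot land mid-way through an etcd transaction.
func (b *backend) commit() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for k, v := range b.pending {
		b.db[k] = v
	}
	b.pending = map[string]int{}
}

func main() {
	b := &backend{pending: map[string]int{}, db: map[string]int{}}

	// Hundreds of logical transactions can ride on one storage commit.
	b.applyTxn(1, "x", 3)
	b.applyTxn(2, "x", 1)

	// In etcd a background goroutine fires this on the timer; here we
	// call it once by hand to stand in for the timer.
	b.commit()
	fmt.Println(b.db["x"], b.db["consistent_index"]) // 1 2
}
```

The point of the design is that, in v3.4, no crash can separate the consistent index from the data it describes, because they always land in the same storage commit.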
But an improvement was added in etcd 3.5.0: a pre-commit hook, which is called each time before a boltdb transaction is committed. etcd no longer saves the consistent index at the end of each etcd transaction; instead, it saves the value in the pre-commit hook. A member may process hundreds of etcd transactions during each boltdb transaction, so this solution reduced hundreds of operations to only one. On paper, it improved efficiency. But the enhanced solution caused the data inconsistency issue. Each pair of updating the consistent index and the following etcd transaction belongs to the same apply workflow and is supposed to be atomic, but after the change it wasn't.

Let's work through an example. Assume etcd has updated the consistent index, and the boltdb transaction is committed right at that moment, so the new consistent index is successfully persisted on disk by the pre-commit hook; but etcd crashes, for whatever reason, before the corresponding etcd transaction starts. Then the etcd member runs into the data inconsistency issue, because the consistent index is saved but the related data isn't persisted yet. In short, the atomicity of the apply workflow was broken, and that is exactly the root cause of the data inconsistency issue.

Finally, we made two changes to resolve this issue. First, we now save the consistent index into a staging variable before each etcd transaction. Second, we added pre-lock and post-lock hooks on the etcd transaction; the post-lock hook is called each time right after an etcd transaction starts, and the consistent index is updated in the post-lock hook. No matter when the boltdb transaction is committed, this guarantees that the consistent index is always consistent with the applied data. But the solution is complicated and harder to maintain, so we added automatic verification to make sure any following changes by any new contributor do not break anything.

Let's put all three diagrams on one slide to compare them again. The enhanced solution isn't correct, because it breaks the atomicity of the apply workflow. Actually, the enhancement wasn't even necessary, because I don't think it really improved performance. Any operation during a boltdb transaction is actually updating data in memory, so the overhead should be negligible. During each boltdb transaction, the related page needs to be loaded into memory only the very first time it is touched; that might have some overhead, but it should also be negligible, because etcd may process hundreds of logical transactions in one boltdb transaction, and only the very first operation in the first logical transaction needs to load the page, so the ratio is very low. Another point is that there are also some other fields saved in the same bucket as the consistent index; even without the consistent index, etcd needs to load the related page anyway. So the consistent index doesn't add any real overhead, and in summary, the enhancement wasn't worth it. The final solution is also too complicated and hard to maintain, so I'm planning to remove all the hooks, including the post-lock hook and the pre-commit hook. Eventually, the solution might be very similar to the original one; of course, I need to evaluate the performance impact. That's all from my side. If you have any concerns or questions, please feel free to reach out to me on Slack or by email. Thank you, bye. [End of video]
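Before we move on to the lessons, here is Benjamin's fix restated as a rough sketch. The names are my simplifications of the hooks he described, not the actual etcd code; the post-lock hook's behavior is inlined as a comment:

```go
package main

import (
	"fmt"
	"sync"
)

// store sketches the fixed design: the apply path stages the consistent
// index in a plain variable before the etcd transaction starts, and the
// index is published into the open storage transaction only after the
// transaction lock is held (what the post-lock hook does in etcd).
type store struct {
	mu          sync.Mutex
	stagedIndex int            // staged before the txn; applies are serialized
	pending     map[string]int // stand-in for the open storage transaction
}

func (s *store) applyTxn(index int, key string, value int) {
	s.stagedIndex = index // stage only; nothing touches storage yet

	s.mu.Lock() // the etcd transaction starts, excluding the boltdb commit
	defer s.mu.Unlock()

	// Post-lock step: the staged index and the data are written under the
	// same lock, so a periodic commit can never persist one without the other.
	s.pending["consistent_index"] = s.stagedIndex
	s.pending[key] = value
}

func main() {
	s := &store{pending: map[string]int{}}
	s.applyTxn(3, "x", 2)
	fmt.Println(s.pending["x"], s.pending["consistent_index"]) // 2 3
}
```

The key property is that the index and its data can no longer be observed separately by the storage commit, which restores the atomicity of the apply workflow.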
Let's now look into the lessons learned. Maybe it was not clear, but what led to the inconsistency was a simple refactor, intended just to make the code clearer, that introduced one edge case that could cause a race. A simple change that resulted in big consequences.

If we look at what went wrong around fixing the issue: this issue was not caught by release qualification. We found out that, with maintainer churn, the new maintainer who took over releases was not aware of all the release qualification that should be done, and it vastly lowered the quality of, and time spent on, qualification. etcd didn't have a test suite that could even catch this class of issues. We were working on one, but that work was unfortunately deprioritized, and the person working on it left the project. There was an already existing data inconsistency detection mechanism available to users, so we could have found the issue earlier and pushed the fix much faster, instead of the overall timeline taking us over a year. But unfortunately, there was no user adoption; the reason was that the feature had never graduated and was still marked as experimental. The code fix alone took us two weeks, and I hope it was not because of us, but the code was complicated enough that we needed multiple attempts to clean it up and catch all the edge cases.

Now I would like to go through how we are trying to address this. I will categorize the action items in a couple of ways. First, we need to build in mechanisms to prevent issues like this, so they never reach anyone's production. If an issue does make it into production, users should be able to detect it quickly and reliably. And when the issue is detected, there should be a way to mitigate the problem. Of course, we cannot do everything at once, and we have limited resources, so we need to prioritize.

The critical part we want to address is etcd testing: we need to be able to prevent regressions and have a system of testing that can handle this class of issues and reproduce them reliably, not randomly. We also need the detection mechanism enabled by default, and for that we need to put much more work into stabilizing it and making its quality good enough that every cluster can run it. For important issues, we need to improve the quality of the tests: instead of flaky, they should be high quality, easy to maintain, and quick to expand to cover new edge cases as we discover them. On the problem of fixing the issue: we should make the apply code easier to understand and to validate for correctness, because it was really hard for us to validate whether what we were doing was correct. Critical etcd issues should never be abandoned; we should have a tracking mechanism and prioritize them accordingly. And etcd should be continuously qualified. On detection mechanisms: detection of data inconsistency should be reliable, and we should also validate the snapshots sent between the leader and followers. On mitigation: before this event, we didn't even have documentation on how to recover from data inconsistency. Now we have that documentation; we need to make sure it works, and we should test it. Long term, we would want etcd to be able to automatically recover from these cases. We have ideas; we just need people to help make them real.
So, with all these ideas and great plans, what is the status of the work? There is a new data inconsistency check that can be enabled now. It's much more reliable, handling slow followers, and it's much cheaper, making it easy enough to run every minute instead of every few hours. And we are already running linearizability tests, which check the general system property that etcd behaves as it should, and with them we can easily reproduce all the issues we encountered this year.

Now, my personal learnings. One thing, with the churn of maintainers: I think contributor documentation is as important as user documentation. Much of the knowledge we had was lost just because maintainers moved to new jobs or changed projects; they cannot be reached by any means other than maybe email. We need to stop depending on one person knowing everything; we need documentation and consistent execution of processes. The second learning is prioritization. There are many contributors that want to give us great new changes and make etcd faster, and we should be able to say no. Does it really help the project to make it 1% faster? It's much better if everyone can trust that the project is reliable, as is even stated on its page. Performance is not the priority for etcd; we should focus on reliability work and make sure we are not trying to maintain too many performance improvements.

And that's all I have prepared for today. Thank you for listening. If you have any questions, there is a mic.

[Audience] Can you summarize all the issues you have seen that caused the inconsistencies in the data? I'm just wondering if you can summarize, at a more technical level, the mix of reasons that contributed to the data inconsistency.

[Marek] So the question was: what is the main source of the inconsistencies we've seen? From what I've seen, and we are gradually fixing them one by one, most of the problems are in the apply code, the code that takes what Raft gives us as the history of what should happen and then applies it to etcd. Making sure the consistent index is updated at the right time, and making sure the code correctly catches errors and handles authorization, which is a different type of error, is the main source of issues we've seen. And we've seen multiple edge cases that needed to be solved independently, because as we were fixing one, we were finding another issue in some other part of the big, huge switch statement that handles all the cases. A more recent issue, with durability, was caused not by a bug in raft itself but by a lack of documentation in raft explaining how to use it, and at some point someone misused raft in some way. So we haven't seen a bug in raft; those are hard to reproduce, and we are building tools so that we could find issues in raft. It's more about etcd's implementation, and there was an implementation bug because one edge case was simply not documented in raft.

[Audience] Hi, this is one of the best post-mortems I have ever heard, so that's the compliment. I have a question for you; it's not directly related, but maybe you'll have a chance to cover it. We noticed that etcd is very sensitive to latency, meaning that if for some reason latency occurs, etcd has trouble responding in time, and applications notice the issue. Is it a known issue?
[Marek] There were a couple of bugs related to that: for example, latency causing leader elections, or some members being disconnected, rejoining the cluster, and causing more latency or forcing a leader election. To my knowledge, most of them are fixed, but only in the latest versions.

[Audience] Thank you. Hello, I think you mentioned, as one of the potential forward-looking solutions, adding tests that would exercise the code to determine whether it correctly implements the algorithm. Has anyone attempted to write those tests? It seems like a very difficult thing to test given the complexity; it's not just a simple unit test. So I'm curious: A, has anyone attempted to write them, or B, are there any existing tests that are similar, in terms of determining whether the algorithm results in consistency across many different cases?

[Marek] The answer is yes. If you're interested in that, there is the Jepsen project, which is maintained by one person who does a lot of contract work for databases. From time to time, major etcd releases were verified with Jepsen. There are two main issues for etcd. First, we don't have maintainers that know Clojure, and I don't think we will get them to work on it, so we would need to build a lot of things by ourselves. Second, Jepsen was not built to be run as part of continuous integration; it's something you can run occasionally, and maybe you'll find an issue. We want something we can run consistently as part of our CI, automated, so people don't need to think about where the tool is that protects them or how to qualify a release. etcd should be qualified out of the box.

Any other questions? Okay, thank you, wonderful seminar.