Hi, my name is Ulinic, and for the past four years I've done various jobs related to software engineering: being part of a blockchain startup, developing deep learning models for my master's thesis, all the way to working around Kubernetes — creating operators, managing them, managing clusters, migrating clients to Kubernetes, and so on. Currently I'm employed at a company called GRIEI, and today I'm going to present "Kubernetes Operators: Safety First with Model Checkers".

Let's begin slowly, with some motivation. You have Kubernetes handling everything for you — it handles storage, networking, and so on. So of course one would assume that with Kubernetes, building operators is easy, right? Yet you can Google operator horror stories: the Redis operator crashing everything, known issues in the MongoDB operator. You would assume that Kubernetes operators are simple, that they don't have a complex architecture — something like a Postgres operator shouldn't be that complicated. Yet you need a whole chart just to describe it. So we've established that Kubernetes doesn't solve all our problems for us; building operators still has a learning curve and adds its own complexity.

Let's look abstractly at how we build software. We build a product which has some quality — quality for our users, for consumers, for every stakeholder. The more time we invest in our software, and in turn pay developers to build it, the more the quality should monotonically increase, up to some maximum value. That's the first graph. As we gain more experience building software, we can deliver better quality software in a shorter time. We achieve this via better tools — for example, better IDE support, better debuggers, better observability. We can have better processes — for example, how you commit code, what checklist you need to fill in before you commit, code reviews, and so on. Or there can be external factors outside the software that make writing quality software easier — for example, some regulation changes, so there is something you no longer have to handle yourself.

This brings us to the typical, conventional approach for improving the quality of our software. This is what you would normally do when building a distributed system, or any kind of software. First, you have a design, some informal design specification. You talk with your colleagues and you have a deep design review: does this make sense in this context? Are there any obvious shortcomings to this design, like it will crash the minute you put it in production? You try to catch as many of these design bugs as early as possible. Then you code it and you run static code analysis, where you have tools, you annotate your code somehow, and things might or might not blow up. You push it and you stress test it, or you inject faults in places — for example, if you have a network service, you force a request to time out or fail and see how your operator behaves. You write unit tests for much smaller pieces. And finally you do code reviews, where the implementation is checked by your peers for obvious defects and possible improvements.

These are all the conventional approaches we know and love; most of us have come across them. Yet they are not enough. There are more tools, more approaches, and this talk introduces one of them: model checkers. In essence, a model checker is relatively straightforward — a brute-force search over all possible system states.
In some kind of formal language, you describe your design as a state machine. You have the state of your system. You define a transition function: from which state of the system you can go where next, possibly non-deterministically. You define your assumptions — for this design to be right, what needs to hold? Those are the invariants, rules and properties you want to check. All of that is what you check once you have a model, and the tool performs, more or less, a brute-force search over all states reachable from the initial state: do those invariants, rules and properties hold? It's that simple. It's mostly based on set algebra: you have a set of states, a set of reachable states, and predicates over those states — are they true or false on every reachable state? So far so good.

If you take one thing away from this talk, please read this paper: "Use of Formal Methods at Amazon Web Services", from 2014. Most of this talk is based on that white paper. It's amazing, it's very well written, it covers both the pros and the cons in more depth, and it contains a cool story I'm going to retell and paraphrase. So how did they come to the conclusion "hey, we need formal methods, we need a model checker, this is amazing"? It reads like a crime thriller. In the paper, when the authors talk about personal experience, they refer to themselves just by their initials: CN and TR. The first one to come across these formal model checkers was CN, Chris Newcombe.

So why is building these new systems super hard? Well, to a first approximation — as a NASA researcher put it — accidents are almost always the result of incorrect estimates of the likelihood of one or more things. When we build some system, some program, we have trouble estimating how likely some error is; we assume the likelihood is close to zero. How likely is some scenario if it's one in a million? One in a million sounds pretty rare, but if your service handles a million requests per second, it happens approximately every second. Or: how many times would this happen over the lifetime of your service or solution, which can be measured in years or even decades? There are GDS — global distribution systems — still running on very old hardware, and there are mainframes from the 80s or even the 60s still around; somebody still maintains that Fortran or COBOL code. Over such lifetimes the likelihood just adds up.

In the end CN, our main actor and protagonist, was dissatisfied with the quality of the services he had written. Despite all the best practices, they had bugs, they had production outages, they had everything we know and hate. Yet he had dismissed formal methods because of various myths about them. He thought they take a long time to learn, which is true for some of them — I had a class in university about formal methods, and some of them were pretty complicated and considerably less powerful and less applicable than this one. A second misconception is that only a small fraction of real-life problems fit this paradigm. So he thought his return on investment for learning model checkers would be small or insignificant. Which brings us to the third point: we engineers have pretty limited time in our careers. We start working at some age — most of us in our twenties, after we finish university — and finish working around the time we turn 67 and retire, hopefully happily retired.
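To make the earlier point about states, transitions and invariants concrete, here is a toy spec of the sort a model checker consumes. This is my own illustration, not something from the talk — the module name ToyCounter and the bound of 3 are invented. It has one state variable, an initial-state predicate, a transition relation, and an invariant that TLC, the TLA+ model checker, checks in every reachable state. (The demo later in the talk uses PlusCal, which translates down to exactly this kind of TLA+.)

    ---- MODULE ToyCounter ----
    EXTENDS Naturals

    VARIABLE counter    \* the entire state of this toy system

    \* the initial state
    Init == counter = 0

    \* the transition relation: either bump the counter while it is
    \* still below 3, or leave it unchanged
    Next == \/ /\ counter < 3
               /\ counter' = counter + 1
            \/ UNCHANGED counter

    \* a predicate on states; TLC checks it in every state reachable
    \* from Init by taking Next steps
    CounterBounded == counter <= 3

    Spec == Init /\ [][Next]_counter
    ====

You then point TLC at Spec and ask it to check CounterBounded; it enumerates the reachable states — here just counter in 0..3 — and confirms the invariant holds in each of them.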
So if you spend, say, two weeks learning some method, there's an opportunity cost — maybe you could have learned something else instead, maybe Haskell or whatever. Therefore return on investment matters. He also thought formal methods were impractical, hard to use, with horrible tooling — which isn't true at all. The tooling is actually fine; not perfect, but fine.

And then a mysterious paper appears. Its title: "Using Lightweight Modeling to Understand Chord". Chord is a distributed hash table algorithm, a pretty famous one, and people thought, hey, the design was thoroughly peer reviewed, everybody had read the papers, everybody agreed it was correct. Then somebody — a woman, actually, Pamela Zave — decided to model it using Alloy, one of the model checkers. And when she modeled it in Alloy, she found that the design has a bug in it: it doesn't actually have all the safety properties she wanted from a distributed hash table. And this work stood the test of time — there was a ten-year test-of-time award at SIGCOMM 2011.

Our main protagonist, CN, read it and was like: ooh, this is cool, this can help me write better systems, better software. He evaluated Alloy, compared it against what he needed a model checker to do on his problems, and found its expressivity limited — Alloy has a similar story to TLA+, but it's a bit more restricted. So he looked further and further, and he found a paper called "Fast Paxos" by none other than Leslie Lamport, a Turing Award winner and also the inventor of TLA+. Paxos is a seminal paper in distributed systems history, a super important one. And there it was: a model of Paxos, a TLA+ specification for Paxos — a design, just a more formal design, which you can actually check up to some state size. Okay, this is good; this is something we can use.

And he was like: yes, we need to use this. He evangelized it — please, folks, use it, it's amazing, try it out, let me know what you think. And people were like: okay, I'm busy, I have a pull request to review, something, something. Remember those misconceptions from five minutes ago? Other engineers held them as well. So he thought: I need some really hard problem to prove this on, so that somebody else becomes as interested and enthusiastic as I am. He found a project called DynamoDB, which at the time was in its infancy. And here our second actor comes into play: engineer TR. He enters the story with DynamoDB. He had performed all the classical testing approaches, but still had issues and wasn't super happy with how things were going. A couple of weeks with TLA+ — just a couple of weeks — he learned it, he tried it out, and he found a bug in his design requiring 35 high-level steps to reproduce, plus two more bugs. All after just a couple of weeks of TLA+. These bugs would have surfaced in production within a few months under high load, because 35 high-level steps sounds like a lot, but when you're dealing with a huge number of requests per second, it actually isn't. Remember the NASA quote from earlier about underestimating the likelihood of really bad things occurring? In his retrospective in the paper I mentioned, he wrote: I wish I had known this sooner; this is amazing; it would have helped me design better systems faster. And it spread — many other teams started using this technique.
So they wanted to present this to other engineers. Now it's a marketing problem: we know we have a good product, an amazing technique that helps us write higher quality software in less time. But "formal methods" is terrible marketing — do not touch it, don't go there, this is complicated, this is what hardware people do, not software people. So they called it debugging designs, a much friendlier name than formal methods. And to make it even clearer to the average programmer, even a novice one: it's exhaustively testable pseudocode, which is exactly what it is. You have a kind of pseudocode in a formal language, and you can test whether your assumptions hold in the system you're designing. Amazing marketing. And they used it on S3, DynamoDB, EBS, an internal distributed lock manager, whatever — you have complicated systems, you use it, it's amazing. And they found multiple bugs, further design improvements, even some performance improvements. We'll talk about the benefits at the end of this history section.

Now let's go to Microsoft. Remember the TLA+ inventor? He wrote a book, Specifying Systems, in 2002, where he introduced this tool he created, TLA+. Around the same time he joined Microsoft Research and did various things there, but nothing that directly involved model checking with TLA+ in production systems. Then AWS went: hey, we're using formal methods, look at us — that was April 2015. So internally at Microsoft they were like: hey, we have the guy who invented this stuff, so maybe something should happen. There was some back and forth, and on December 26, 2015 — at least according to my research — Satya Nadella, the CEO of Microsoft, sent an email along the lines of: TLA+ is awesome, do whatever you can with it. So, okay, let's do it. They held a bunch of internal trainings, a lot of knowledge sharing and dissemination, and they use it a lot: in Service Fabric, in Azure Batch, in storage, networking, and more. And in all these instances TLA+ has uncovered safety violations, even in designs they were most confident about. Even after multiple senior and principal engineers had looked at the code and the designs, they couldn't find the bugs that this relatively simple technique found. I mean, I sound like a snake-oil salesman — "the one tool your doctors don't want you to know about" — but it really does help you build better software.

And others have used it as well: Intel for hardware, Microsoft for Xbox and Azure, AWS as already mentioned. There is OpenComRTOS, a real-time kernel that runs in satellites, in space — you need code that keeps running in space for a long time — and they managed to reduce their code base by a factor of ten because TLA+ helped them design the system better. It's in Elasticsearch from version 7.0 onwards. MongoDB even has specifications for their distributed machinery. So it is used.

Now let's cover the benefits. The first one is improved design quality. How do you get improved design quality just by writing something down? Exactly that: when we write things down, we usually think them through more, because we have to make our assumptions explicit.
It's like writing down what you mean so that other people understand you: you need to be more explicit. In doing so you uncover assumptions you held that are not necessarily true, so you get better design quality and fewer bugs. Second, because you can verify assumptions, invariants and rules, you can perform optimizations you wouldn't otherwise dare to make, because you'd be afraid of breaking safety properties. For example, sometimes you don't need to fully synchronize some operations; you can do them asynchronously, which is a performance win, because synchronizing things is expensive — meetings are expensive — but the safety properties of the whole system still need to hold, so you need to know whether it's actually safe to relax some assumptions, to drop some constraints. Third, by improving design quality and having fewer bugs, you can improve time to market, because with these kinds of tools you can quickly iterate on a design, then a new design, then another. And last but not least, documentation. As you can see on the next slide: how do you explain your system design to another engineer — or to yourself, after a few beers and a couple of years, in a different team?

We have three usual approaches. One is an informal model, where you write prose: this is how it works. You can do it RFC-style, with MUSTs and SHOULD NOTs and so on, but it's still prose — not pseudocode — describing what the system should do. It isn't very complex to read, but it's also not very precise, because there can be ambiguities. At the other extreme is the code, which is super precise — it is exactly what your system is doing — but there are tens of thousands of lines of it, and it can be really hard to read. A model specification, the model checker's language, falls in between: it's a bit more complex than informal prose, but it's precise, because you can check it. When a new engineer joins the team, they can look at this more precise model specification and figure out what assumptions the system makes. And as the system evolves over time, you update the model spec, keep everybody in sync, and you can actually analyze whether a proposed design change is safe.

So here is an example, inspired by Kubernetes. You have two objects. On the left is an object named bar, for which the operator will create some object foo via generateName: the operator creates the object, and the Kubernetes API server fills in a name like foo-something for it. If we want to model an operator that does this, we start by importing some stuff — typical programming language 101. We define some variables, just like in everyday work: a bar object, which has a name and records that it hasn't created its foo object yet (we want to remember that), and a list of the foo objects that exist in the system. As you can see, the syntax is a bit different from your usual C- or Python-like languages, but it's nothing you couldn't learn in a day or two; it's not that difficult. Now, when you want to check properties of a system, there are usually two kinds of properties you care about: safety properties and liveness properties.
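As a rough illustration of that setup, before we get to the properties: here is my own minimal PlusCal sketch, not the exact spec from the slides — names like OperatorSetup, fooObjects and AtMostOneFoo are invented.

    ---- MODULE OperatorSetup ----
    EXTENDS Naturals, Sequences

    (* --algorithm operator
    variables
      \* the bar resource: it has a name and remembers whether its
      \* child foo object has been created yet
      bar = [name |-> "bar", created |-> FALSE],
      \* the foo objects that currently exist in the system
      fooObjects = <<>>;

    define
      \* safety invariant: at most one foo object may ever exist
      AtMostOneFoo == Len(fooObjects) <= 1
    end define;

    begin
      \* happy path only: create the foo, then record that it was created
      CreateFoo:
        fooObjects := Append(fooObjects, [generateName |-> "foo-"]);
      MarkCreated:
        bar.created := TRUE;
    end algorithm; *)
    ====

The reconciliation loop itself, including the crash behaviour, is sketched at the end of the talk.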
Safety: nothing bad will happen — you won't lose data, nobody will die, and so on. Liveness: something good will eventually happen — your system will process requests. With these methods you usually focus on safety properties, because they are much easier to define and to test for. Liveness — unless you're working on some hard real-time kernel or hard real-time driver — is really tricky to test and to specify. So we focus on safety, and we define our safety property: no more than one foo object will exist in the system. That's our invariant — just one foo object, that's it.

And we start writing our algorithm. So far so good: we have an atomic start, we have "while not bar.created do ..." — a controller, the reconciliation loop — and at the end of the reconciliation we assert that we have exactly one foo object, only the one we created. So let's see what's in those three little dots, our main reconciliation loop. We have non-determinism — this is interesting. We have a "create object" atomic operation: either you create the foo object, by creating it and appending it to the foo objects, or you reboot during creation. Your operator can crash at any time, and you encode this knowledge into the design, into the model. So either the object has been created, or you crash and burn. Then, in the next step, you either mark "created foo object" as true, or, again, you crash and burn.

Now let's run it. This is a TLA+ runner integrated into VS Code, by the way. And this isn't pure TLA+ — it's something called PlusCal, which Leslie Lamport wrote after TLA+. It has nice syntactic sugar, but underneath it's translated to pure TLA+. So, where were we: we have an error trace. The checker found a sequence of states that violates the invariant we defined — at most one foo object. Let's investigate. We have a single bar object, its created flag is false, its name is foo, there are no foo objects, and our program counter is at "start". Remember, this was the start. A step happens: we are at "create object". We want to create the object — nothing bad yet, everything is fine. We create the object... and then we crash and burn. So instead of marking it as created, we crash, and we repeat the whole loop again: we start over, we reach "create object" again, and we've broken the invariant — we have two foo objects. Why is that? Because in this code we first create the object and only then mark it as created, which is obviously a design error. It's a bug.

With this I finish my presentation. Please ask me any questions you want — I love this discussion. Let's start the Q&A session. Thank you.
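For reference, here is a minimal PlusCal sketch of that buggy reconcile loop — my own reconstruction with invented module and label names (BuggyOperator, Reconcile, AtMostOneFoo), not the exact spec from the demo. Translating it and checking AtMostOneFoo with TLC should reproduce an error trace like the one described above: a crash between creating foo and marking it created, followed by a second create.

    ---- MODULE BuggyOperator ----
    EXTENDS Naturals, Sequences

    (* --algorithm operator
    variables
      bar = [name |-> "bar", created |-> FALSE],
      fooObjects = <<>>;

    define
      \* safety invariant: at most one foo object may ever exist
      AtMostOneFoo == Len(fooObjects) <= 1
    end define;

    begin
    Reconcile:
      while ~bar.created do
        CreateObject:
          either
            \* the foo object is created via generateName ...
            fooObjects := Append(fooObjects, [generateName |-> "foo-"]);
          or
            \* ... or the operator crashes before creating anything
            goto Reconcile;
          end either;
        MarkCreated:
          either
            \* record in bar that the foo now exists ...
            bar.created := TRUE;
          or
            \* ... or crash after creating foo but before recording it:
            \* on restart the loop creates a second foo, violating AtMostOneFoo
            goto Reconcile;
          end either;
      end while;
    end algorithm; *)
    ====

A likely fix, sketched rather than prescribed here, would be to make the reconciler idempotent — for example, checking whether a matching foo already exists before creating another one — instead of trusting the created flag alone.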