Well, hello everybody. I'm very excited that we got our video back, because you don't want to go to the work of making a nice little video and then have it not show up. So today we are doing KBE Insider. We do this show on a monthly basis, on the last Tuesday of the month. Sorry, I'm getting some background noise, and I'm hearing myself on about a five-second delay. Hopefully that will be better now. All right. Like I said a few minutes ago, if we don't have a couple of technical difficulties, we're afraid the show is going to go terribly, so that works out well. I think I had a Twitch stream running in the background that was off mute, which is always annoying.

So, by way of introduction, I'm Langdon White. I am a faculty member at Boston University, and I used to be a Red Hat employee. In my last year or so there I was actually doing another Twitch show, which will be doing its last episode on Wednesday of this week, and which I will be rejoining to check back in and say hi. That show was about containerization and OpenShift, but my primary focus was on working with the field and developers around serverless, Knative, and event-driven architectures. So that's a little bit about me. My co-host today is Josh Wood, who you've seen in prior shows. Do you want to introduce yourself real quick?

Sure, Langdon, thanks. I'm Josh Wood.
I'm currently a developer advocate at Red Hat. My focus is on OpenShift and, particularly interesting for today's guests and the project we'll be talking with them about, Kubernetes operators: how they work at the core of OpenShift to deliver automatic updates and management of foundation software on the platform, and where we see a lot of people building cool new stuff on the scaffolding that the operator pattern and the toolkits around it represent.

Nice. It's always nice when you can have a tool that does its own installation and updating, because then you don't have to figure out how to get it right. We've celebrated this in the Linux world with package managers for a long, long time. It's so painful to use something like Windows, where you have to go manually update things all the time, although it's getting much, much better. So today we have on the show Lalith Suresh. Did I say your name correctly? I hope so. If you want to quickly introduce yourself, I meant to check before the show, but feel free to say it again correctly and I will try to follow.

Sure. Hi everyone, I'm Lalith Suresh, senior researcher at VMware, where I primarily work on cluster management, basically looking at new ways to program cluster managers. The focus of this talk will be some of the work I've been doing on testing cluster managers.

So the emphasis was on the wrong syllable. And Michael Gasch, do you want to introduce yourself?

Sure. Hey everyone, thanks for having me on the show. I'm Michael Gasch. I'm based out of Germany. I hope my audio is good now; we had some technical issues, as you said earlier. As of today I'm still with VMware, but on September 1st I'll be rolling into a new adventure as part of my career. I'll announce it on Twitter, I think maybe tomorrow.
So stay tuned and follow me on Twitter if you're curious where I'm headed.

It's the Insider show. Don't we get the secrets now?

And I forgot to mention that at VMware I worked on event-driven systems as well: Knative, serverless, and, early on, Kubernetes. I tend to do both prototypes and helping our customers with large installations, including OpenShift.

Cool. If you want to follow both of these folks on Twitter, you can find their Twitter handles, as well as ours, in the show notes on Kube by Example. But why don't we get right into it? The first thing we like to ask our guests is what got you into open source. What was the driver that made open source the thing you wanted to work on? Whoever wants to go first, feel free, and we'll try not to step on each other. Mr. Suresh?

Yeah, I can start. This goes back to my undergraduate days, when I was first introduced to things like Linux and FOSS. I used to attend a lot of open source conferences back home during that time. The first time I became a contributor was with the ns-3 project, which is network simulation software heavily used in the network research community. Then I joined the Google Summer of Code program with them, and that was my launching pad into open source. That's where I would say I was trained in writing real-world software. I was a maintainer for a while after that. So that's where I got introduced, and once you get started it's easy to contribute to more projects, and it just rolls from there.

Once you're pulled in, it's hard to escape again. So what brought you to Kubernetes?
Kind of along that route.

Yes, so that started when I joined VMware, actually. I work on distributed systems and networking, and at the time I was looking into things like fault tolerance. There's a lot of reinvention of the wheel in cluster management. For problems like rolling upgrades, the concepts are very similar, except we keep reinventing that wheel in every new system we come across. In one particular project I was looking at how you write these things using declarative programming. We were like, hey, we have this cool technique, and we're sure it makes certain things easier to build the second and third times. So why don't we try building a Kubernetes scheduler on top of all the other things we've built? That was my first introduction to Kubernetes, and that's also where I was talking to Michael a lot, because he knew a lot of the Kubernetes internals that I was running into at the time. Like, why is the Kubernetes API behaving this way? Michael, help me out here. That's how we got started down the rabbit hole that eventually led to Sieve.

That's cool. As you say, one of the things I thought was hilarious: I saw an OSCON talk many years ago that was basically "everything in software had already been invented by 1979," and we've just been redoing it better, cleaner, and faster since then. In a lot of ways, the cloud is a mainframe on steroids, right?
So it just keeps going and going. Oh, go ahead.

Just as an aside, Langdon, when you bring that up I'm wondering what was invented between 1969 and '79 that I'm not thinking of. But Michael, instead of me diverting us down other paths into history, tell us how you got your start in open source, and then, especially if you could do a beautiful job like Lalith did, dovetail that right into the Kubernetes experience.

True, yeah, exactly, because I also had two entry points into open source. The first one was in 2002, when I was part of a training program at a research institute. They were all doing Linux stuff, and I was tasked with migrating the Windows NT domain system and some of the file services to Samba. At the time, Samba was moving from version 2 to version 3, and it was at alpha one, so there were a lot of bugs and issues. I started to use it and give feedback to the great Samba community, and that was my first open source adventure, including Samba and OpenLDAP. After that I built a lot of systems on Linux, a lot of file services and parallel file systems.

Then, many years later, in 2015, I joined VMware. I was going through the onboarding phase, doing all the new-hire stuff, and I saw a talk from some of our colleagues about Mesos and Kubernetes. I got intrigued, even though my role did not require that knowledge at the time. I was just fascinated by the technology and everything they were working on, especially the idea of building a distributed operating system or kernel. That got me into Kubernetes. Because Kubernetes was growing a lot, I focused primarily on the resource management aspect, like scheduling, the Kubernetes scheduler, and how all the resource semantics play into it.
I was lucky to give a talk at KubeCon about resource management as well, and a bit later I met Lalith. It was always fascinating, because he had been working earlier on a lot of the things I was starting to touch on. But at the time he worked on it, it was often proprietary: special frameworks for large-scale institutions. So we met and I was like, okay, maybe we can use Kubernetes, or Istio, or some other open source stuff to take your idea and make it more widely accessible, building on standard and open technology. That built a great relationship between us. I always enjoyed it because he's very, very knowledgeable, especially about distributed systems. Just Google him and you will find a lot of amazing, widely cited papers.

So, if I'm not jumping the gun here, there's a funny story about how we got into Sieve. If you look at what Sieve does, it's very research-heavy; a lot of knowledge and background goes into Sieve to make it do what we intend it to do. At the same time, I was working on an etcd article, because I was fascinated with etcd as well, and it is very important for Kubernetes. I was working on a blog post about Kubernetes and etcd, how the informers and the controllers work, and all these patterns. I was writing this article and Lalith called me and said, "Michael, we have a question on etcd. Maybe you can help us." I was like, this is great, because I was literally just typing the answer to your question in the unreleased blog post. I think it was two years ago.
Maybe, yeah. So that was a long call where we were diving through the code and confirming what we were finding, and that's exactly what you were writing about right then.

This was a very, very funny coincidence. It also got me into collaborating and working a little bit on the Sieve work. But the main brains on the work are definitely Lalith and Xudong.

That's awesome. So one of the questions I'd put to you two is: why is doing work in the open source world better, or more interesting, than the alternative places you might have done development work?

For me personally, especially coming from the academic half of me, I think it's one of the best ways to learn how real-world software works. It's right out there. I really credit a lot of my education in computer science to open source. I never had to wonder what the concepts in the books meant when I could just look at real-world software and see how it's actually implemented, whether it's the JVM or Python or whatever. There's always that aspect of open source that appeals to me: there's all this widely used software where we can see exactly how it's coded. We don't have to guess. The other aspect is that it's just easier to get people on board and collaborate.

Yeah, totally. It's interesting: my experience in a university, though, is that a lot of students are unfamiliar with open source, and I find it kind of mind-blowing. When I was a college student I was always looking for things that were cheap or free, and if nothing else, you get that advantage with open source.
And so one of my quote-unquote missions since I joined academia is to try to make open source more available and more prevalent among university students. So let's move on and talk about Sieve. This is one of those words I learned from reading, so I pronounced it wrong for a long time. Can you tell us a little more about the project? Where does it come from? What does it do?

Yeah, so the genesis of this project was another project, like I said, where I was trying to build a Kubernetes scheduler. I was running into all these little quirks of building just one piece that runs in a larger distributed environment. Every time I got some kind of stale input from the API server, or some jitter or lag in the network, it turned out that these very small interactions with the environment would break my code. And every time, I remember, I would go to Michael and ask, "Why is this happening? What am I supposed to do?" Then he'd tell me about some pattern I was supposed to use, and I'd say, okay, fine. After that project ended I thought: if it takes two of us to write this, I can't imagine it's going to be bug-free any time anyone writes it.

At the time, automatically finding bugs in distributed systems was also a hot topic, and I would see all these papers on it. But I always had the frustration that, yes, you can build something for one system, but it's not broadly useful after that. Then I met the rest of the team at a conference I was attending. This is Tianyin Xu.
He's a professor at UIUC. I ran into him and pitched him the idea: look, I've been programming a bit in Kubernetes, and it seems very hard to write correct software there, to write correct controllers that, no matter what kind of external input they get, will still do the right thing. He's an expert in reliability and testing, so we clicked right there and teamed up. His student interned with us the next summer. That's Xudong. He's the lead PhD student behind this project, pretty much the star of the show, I would say. And that eventually led to Sieve. It was a very organic process. At the time Michael was also looking into things like list/watch and the details around that, and that's how the whole team got formed. Very organic, I think.

And Michael, can you tell us how you felt you got involved? What drew you to the project?
Yes. I already knew Lalith, and almost everything he did or wrote about got famous in one way or another. Famous is relative, but for me it was always very interesting, so I followed his work. I also spent a lot of time with him; we would have lunch and share ideas. So whenever there were Kubernetes questions or debates about how Kubernetes works, as Lalith already said, we just hopped on a Zoom call and chatted about it. Coincidentally, there was a time when Xudong and Lalith had questions about some of the semantics that etcd provides to Kubernetes, the consistency semantics, and also how the informers then work in all these controllers. Since I was writing those blog posts about etcd, I was very familiar with the etcd code base and its consistency semantics. Lalith had one question about the linearizability guarantees in etcd, and whether it provides a logically, monotonically growing order of events or changes in the system. I was literally writing that paragraph in the blog post. So I said, yes, it's there. It's a number, just a counter that goes up: the revision in etcd. And he said, this is great, because we need that in order to make some assumptions for the work that Sieve is doing. Then I asked what the project was all about, and we started working on it. I gave more input and some reviews on the paper, but again, I was more of an advisor from the outside. Lalith and Xudong did the main heavy lifting.

The other bit is that for everything we find, Michael knows at least 15 issues on the Kubernetes GitHub that we can actually reference. So he was always our bridge to the community.
I would say he was more of the insider than we were at the time.

Well, sometimes being on the outside helps, right? It gives you a different perspective. So that's pretty cool. Let me ask a bit more of a background question. Say I come across the project. What do I want to use it for? What does it actually do for me as a deployer or an application author?

Yeah, so the main target audience is controller or operator developers. Let's say you want to write a Cassandra operator. You would use the Operator SDK, or you might build on things like controller-runtime, and you're going to write some code that monitors some kind of custom resource in Kubernetes and then runs some workflow depending on the changes it observes as these resources are created, updated, or deleted. Now this code, this reconcile loop that's running somewhere in your cluster, or maybe outside the cluster, is just one entity in a large distributed system. There are other independent actions happening, and at the same time you can have failures, network partitions, and all kinds of other hiccups. The tricky bit with writing this type of code is that it's very hard to know how your system will behave under any given failure scenario.
For example, most code bases we looked at aren't testing what happens if the controller crashes after every interaction with the API server. But Sieve can do these types of tests automatically for you. You give us some tests. We currently have our own little way of writing these tests, but it could very well be the e2e framework or something like that. Sieve will then explore many executions in which it introduces failures into these tests, and it will automatically flag when an execution where it introduced a failure produced a different result than the failure-free run. This is one of the other innovations in Sieve. It can do that diffing, and then it'll tell you: hey, look, I found some executions where I introduced a failure, and it turns out the cluster looks a bit different than it did before, or it went through some steps that weren't visible in the failure-free run. So it can do this type of reliability testing for you.

Got you. So in a sense it could have been a member of the Simian Army, like Netflix's Chaos Monkey. It's in the same vein, right?

Which is actually a question I wanted to at least run through quickly as we get into Sieve and what it is. If I were telling someone else about this project at a high level, what kind of testing does it do? Would I refer to this as chaos testing? Can I put it into one of those pigeonholes we use as shorthand for which surfaces we're testing and which techniques we're using?

Yeah, so chaos testing tends to be something you would run live, in production. That's not how you would typically run Sieve. It's meant to be a development-time testing tool, right?
So you can think of Sieve more as another layer of tests that you would run, say, in your CI platform, on a cluster of machines where it runs, I don't know, once a week or so. It's more expensive than a typical unit test or integration test, because for each of those tests we're actually running many tests, introducing all kinds of failures. But one key distinction between chaos testing and what Sieve does is reproducibility. If Sieve finds a bug based on a fault it injected, you will have what's called a test plan. It's a self-contained file that Sieve runs, where it knows it's waiting for exactly this event to occur in the API and reach the controller, and then it injects a particular type of failure. All of that is encoded in the file. With just that file you can reproduce exactly that execution over and over again. So that's how it really differs from chaos testing. One distinction is whether it's running in production versus on your laptop or some CI server, and the other is the reproducibility aspect.

Right, so that you can go back, find the bug, and fix the bug. So in a lot of ways it sounds a bit more like unit or integration testing, in the sense that it's a well-prescribed set of tests, and you know when you have introduced a new problem when you go to the next revision. I think that's a component of software development that gets short shrift. You always like to write the first one, but having to maintain it over time is always much, much harder, and nowhere near as much fun. Cool. So have you seen operators using it in the wild yet? Has it been around long enough? Have people been adopting it?
We've been getting a trickle of reports from people who are not us trying to run the tool, so that's always promising. There's still more work we need to do to make it easier to consume, and the ergonomics is one place. Again, we've been talking with Michael about the right way to make it easy for anyone using, say, the Operator SDK or controller-runtime to just come in and generate a project, batteries included, and start testing with Sieve. There's a bit of legwork there. The least automated part of using Sieve is onboarding Sieve onto your project; there's a bit of manual work you need to do first. I think once we iron that out it'll get easier for people to do. But yeah, we are seeing the first trickle of reports coming in.

Do you think there's potential for building support for Sieve, or for tools like it, into things like the Operator SDK or some of the other toolkits that have grown up around building operators? And then a second sidebar, to both of you, which I hope will be useful for the audience: so far in this discussion we have used two words interchangeably, or even neglected the word "controller." I'd like to draw out: what's a controller? What's an operator?
How are they different when we talk about those two things, so that we can move from there to making sure Sieve is something we would use with one or both of those two permutations of the controller concept?

Do you want to take that? You were always the one telling us that, Michael.

Yeah. The good thing is that whatever type of software you're building, whether it's a pure, plain controller or a more advanced operator, Sieve works in both places, because at the end of the day Sieve instruments Kubernetes API and client libraries and calls. Since both of these controller patterns, if I can call them that, use client-go or controller-runtime in some form, whether through Kubebuilder or the Operator SDK, Sieve applies there as well. The test plans might obviously be more complex in the operator world. If you have a system that does backups, snapshots, scaling platforms, or storage volumes, the test plans, and maybe the time it takes to run the tests, will be longer than for a very simple controller that just spins up pods. But I would not draw any distinction about where Sieve would apply.

Funnily, I think everything we've tested so far is an operator, not a controller.

Yep. So it definitely works on that part.
I would say so. And maybe as a concrete example, Josh: I was working a lot with the RabbitMQ team over the last two years, mainly because Knative has this concept of a broker, there are implementations of these brokers, and RabbitMQ happens to be one. So we were relying heavily on the RabbitMQ operator. I don't know if it was Lalith or me, or maybe both of us together, but we thought, okay, maybe that's a good one to test, to see how well the operator is written. One of the bugs we found was an interesting one where, under some circumstances, the RabbitMQ operator would delete the wrong StatefulSet, one that would have been attached to a RabbitMQ cluster that someone deploys through a custom resource definition. We found that bug through Sieve, and the fix was an easy one: using a precondition during a delete. I've not seen a lot of people using them in controllers and operators. But as you may or may not know, when you create, update, or delete objects in Kubernetes, there is an options field you can pass, and not always but often you can pass preconditions, which for example could say: delete that object, but only if it matches this ID. This patch was added to the RabbitMQ operator, and the team was happy that this was fixed, because it was a critical one. If you delete the wrong StatefulSet, obviously, right?

There was another one with volume management.
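As a rough illustration of the delete precondition Michael describes above, here is a minimal Python sketch against a toy in-memory object store. It is not the Kubernetes API and not the RabbitMQ operator's code: `FakeAPIServer`, `expected_uid`, and the object name are hypothetical names made up for this sketch. (In the real API, delete options can carry preconditions such as the object's UID.) The scenario assumed here is a controller holding a stale view of an object that has since been deleted and recreated under the same name.

```python
import uuid
from dataclasses import dataclass, field


class Conflict(Exception):
    """Raised when a delete precondition does not match the stored object."""


@dataclass
class FakeObject:
    name: str
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))


class FakeAPIServer:
    """Toy in-memory object store; a stand-in, NOT the Kubernetes API."""

    def __init__(self):
        self._objects = {}

    def create(self, name):
        obj = FakeObject(name)
        self._objects[name] = obj
        return obj

    def get(self, name):
        return self._objects.get(name)

    def delete(self, name, expected_uid=None):
        obj = self._objects.get(name)
        if obj is None:
            return False
        # Precondition: only delete if the stored object is the one we meant.
        if expected_uid is not None and obj.uid != expected_uid:
            raise Conflict(f"{name}: uid changed, refusing to delete")
        del self._objects[name]
        return True


def stale_delete(use_precondition):
    """Return True if the recreated object survives a delete based on a stale view."""
    api = FakeAPIServer()
    old = api.create("rabbitmq-server")
    stale_uid = old.uid                 # the controller's cached (stale) view
    api.delete("rabbitmq-server")       # meanwhile the object is deleted...
    api.create("rabbitmq-server")       # ...and recreated under the same name
    try:
        api.delete("rabbitmq-server",
                   expected_uid=stale_uid if use_precondition else None)
    except Conflict:
        pass                            # the controller would re-read and retry
    return api.get("rabbitmq-server") is not None
```

Calling `stale_delete(False)` removes the freshly created object, while `stale_delete(True)` leaves it in place, which is the behavior the precondition is meant to guarantee.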
It's again with RabbitMQ. That was one of the most fruitful engagements we've had with a team. Sieve found a bug, and in response they kind of redid how they go about volume management, because there were some bugs around incorrectly managing volumes there as well. Then they actually added a corpus of tests to exercise things the way Sieve would have done it. That was pretty cool to see as well.

So I'm trying to come up with a question that will draw this out. There are a lot of ways I might test an operator that I've built, even directly adopting the example we've been working with so far: with RabbitMQ, I'm going to manage some volumes, and there's the bug you were just bringing up, Lalith. There are a lot of ways I might test that code base if that's what I'm writing. What does Sieve do that's different? What is the real key advantage here over any other test framework I might look at? In my understanding, from the little bit I've played with it, I at least have a clue: I have something that can predictively look at a bunch of points along an execution and then instrument those points with different things that might happen in the underlying API I'm addressing. I think that's somewhere close to the heart of Sieve, and I'm wondering if, for me and for the audience, you could close that circle and tell me exactly why it's different from using a more general-purpose test toolkit, or something I might have heard of in the past or be familiar with.

Of course. Yeah, sure.
So let's take this example. Any operator has a reconcile loop. That loop can be scattered across many files and thousands of lines of code, and there are many points in it where it's actually interacting with, say, the Kubernetes API or something external to the controller. All actions of that nature are asynchronous. That is, you might create a StatefulSet, but there's no guarantee that the pods corresponding to that StatefulSet will have arrived by the time this blob of code finishes. Everything related to how the controller interacts with the environment is what Sieve will perturb. You give it a test workload to run, Sieve runs the test workload, and then when it's running the version with the failures, it perturbs the execution in many ways.

In one of the patterns we introduce, we're basically checking whether your code is making an assumption that your reconcile loop will always run to completion. It'll check for atomicity bugs. In this case Sieve will generate a bunch of plans where, at any point it can, it injects a fault that crashes the controller exactly after it executes one interaction with the Kubernetes API. So after every get, every delete, or every update, it'll introduce a crash. You won't find any off-the-shelf testing framework that will do this for you, and in fact I would say most test suites you can find in open source controller and operator projects will not do this type of crash testing right now. Crashing after every client-go interaction is one thing, but we also do more.
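To make the crash-after-every-API-call idea concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not Sieve's implementation: `FakeAPI`, `Crash`, `explore`, and both reconcile functions are hypothetical names invented for illustration, and the "cluster" is just a dictionary. The harness records a failure-free reference run, then replays the workload once per API call, crashing the controller right after that call, restarting it, and diffing the end state against the reference.

```python
class Crash(Exception):
    """Simulated controller crash (hypothetical, for this sketch only)."""


class FakeAPI:
    """Toy cluster state that counts calls and can crash the caller once."""

    def __init__(self, crash_after=None):
        self.state = {"resize_started": False, "volume_size": 1}
        self.calls = 0
        self.crash_after = crash_after

    def _maybe_crash(self):
        self.calls += 1
        if self.calls == self.crash_after:
            self.crash_after = None      # inject the fault only once
            raise Crash()

    def get(self, key):
        value = self.state[key]
        self._maybe_crash()
        return value

    def update(self, key, value):
        self.state[key] = value          # the write lands...
        self._maybe_crash()              # ...then the controller may die


def buggy_reconcile(api, desired_size):
    # Bug: treats the marker as proof that the resize already happened.
    if api.get("resize_started"):
        return
    api.update("resize_started", True)
    api.update("volume_size", desired_size)


def fixed_reconcile(api, desired_size):
    # Idempotent: compares actual state with desired state on every run.
    if api.get("volume_size") != desired_size:
        api.update("volume_size", desired_size)


def explore(reconcile, desired_size=5):
    """Return the API call numbers after which a crash changes the end state."""
    reference = FakeAPI()
    reconcile(reference, desired_size)   # failure-free run
    failing = []
    for k in range(1, reference.calls + 1):
        api = FakeAPI(crash_after=k)
        try:
            reconcile(api, desired_size)
        except Crash:
            pass
        reconcile(api, desired_size)     # the controller restarts and retries
        if api.state != reference.state:
            failing.append(k)
    return failing
```

In this toy, `explore(buggy_reconcile)` flags the crash after the second call (the marker was written but the resize never happens), while `explore(fixed_reconcile)` reports nothing, because the fixed loop does not assume it ran to completion.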
For example, during the fault-free run we observe which messages the controller is reacting to, and then we selectively skip some of those, because controllers are supposed to be level-triggered. They're not supposed to assume that they'll receive every notification about the edges in your state changes. So we'll force those hidden assumptions out of your code by finding edges in your sequence and making sure the controller doesn't see them. We'll simulate all of those kinds of conditions for you, which is not something any off-the-shelf testing toolkit will do, and it's certainly not something people do by hand either. It just doesn't scale to anticipate all of those cases by hand.

And the hardest thing we do is stale-state testing, where we simulate what happens when you get stale messages from the Kubernetes API, which is currently allowed. There are ways around it, not all perfect. Sieve will actually look at your execution and do the reasoning for this, rather than exhaustively trying out every possibility in a clueless way. It'll reason and say, okay, it looks like this particular message is only meaningful within this time frame, and that's exactly the time frame where it'll give you a stale message.
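The level-triggered point above can be sketched in a few lines of Python. This is a toy model, not Sieve's mechanism: the event names, the `"resync"` tick, and both controller functions are assumptions made up for illustration. The harness drops one notification at a time and checks whether the controller still ends up with the correct view; for simplicity, reading the current state always returns the final state.

```python
def edge_triggered(notifications, read_current_state):
    """Buggy style: rebuilds its view purely from the deltas it happened to see."""
    count = 0
    for kind in notifications:
        if kind == "pod_added":
            count += 1
        elif kind == "pod_deleted":
            count -= 1
    return count


def level_triggered(notifications, read_current_state):
    """Robust style: any notification is only a hint to re-read the current state."""
    observed = 0
    for _ in notifications:
        observed = read_current_state()
    return observed


def explore_dropped_events(controller):
    """Return the indices of events whose loss leaves the controller with a wrong view."""
    events = ["pod_added", "pod_added", "pod_deleted", "pod_added"]
    actual = 2                            # true pod count after all four changes
    wrong = []
    for skip in range(len(events)):
        # Hide one notification; a periodic resync tick still arrives at the end.
        seen = events[:skip] + events[skip + 1:] + ["resync"]
        if controller(seen, lambda: actual) != actual:
            wrong.append(skip)
    return wrong
```

Here the edge-triggered controller is wrong whichever single event is hidden, while the level-triggered one is unaffected, which is the hidden assumption this kind of test is meant to expose.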
So we do three patterns right now: atomicity testing, stale states, and the level-triggered versus edge-triggered kind of states.

It's a bit comparable to what Jepsen is doing in the database world, like Kyle Kingsbury and the team. They can prove that there are bugs in the system, but they cannot prove that they find all of them. For me, Sieve is a tool that you would run for conformance, or in a QA kind of context, because it obviously requires more resources and time than just running a unit test, as you already said. But at the same time, I would also describe it as like the copilot which looks across your shoulder and says, well, you're violating rule number five of writing Kubernetes controllers, which is that you're assuming order, or that you see all the events, and all that stuff. It's like your co-buddy a little bit. Because the systems are so hard, and even understanding Kubernetes and all the semantics, and how the libraries might change across different versions, nobody's an expert in there. Maybe Tim Hockin is, right, and all these smart people. But in general I would say that every line of code that you write in a controller or operator is not bug-free, and so Sieve helps to discover those.

So can I ask a little bit more of a background question? You've put all this effort into Sieve, which has a very narrow use case, quote unquote. Why is this investment in supporting the development of operators such a good idea?
What is it about operators, or controllers in general, that you think is so important? I completely agree with you, Michael, testing distributed systems of any kind is a nontrivial exercise. But obviously you found this particular space compelling enough that you wanted to go and support it. So what is it about that space that you think is so interesting?

I think Kubernetes is a fantastic platform, and the excitement for me really comes from Kubernetes the platform itself. But there's this nice thing that Kubernetes enables: most things running on a Kubernetes cluster are probably third party. Most operators, controllers, and control plane functions that come into any Kubernetes cluster are third party, and they all work pretty much off of the same core libraries, like client-go, controller-runtime, whatever. And that just means that if you have that nice vantage point to do this type of testing, and we were able to find that there is such a clean vantage point from where you can do this type of automatic testing, then all these operators and controllers can benefit from one testing tool, right?
Usually this is quite hard to pull off in an arbitrary distributed system. They usually have very bespoke internal interfaces. You can put in a lot of effort and make that testable, but then there's no reuse potential. But with Sieve we found that this clean separation of state and computation in a Kubernetes cluster allows you to create this testing tool once, and a large part of the ecosystem can actually benefit from it.

Can I ask you to clarify a little bit? Because certainly, in reading material about Sieve before we joined you folks today, I'm like, etcd, that I'm really familiar with. My background's at CoreOS; I wrote a lot of the first couple of versions of the etcd documentation, more or less, well, to the extent I understood etcd. Where you just referred to discovering a point where you could repeatedly do this kind of testing, is that point in the Kubernetes API, or is it based on underlying guarantees from etcd that filter up into that API? Can you help me understand why etcd is as important as it is to the background discussion of this, if I'm largely targeting the higher-level Kubernetes API when I'm doing the actual testing?

So from Sieve's point of view, or from the testing vantage point, the Kubernetes core, the API servers and etcd, is really just a database of objects. So you run some workload.
Let's say you bring up a Cassandra cluster and you tear it down, whatever. At any point in time, if you look at the Kubernetes API, or this database, you see some set of objects, and every little control plane action where you create, modify, or delete an object will give you a slightly different database. So there's this history, this trace of database states, that you can always observe, and the ordering is important, because it can't be the case that every controller is seeing a different order. It has to be one order shared across the entire system, and that's why etcd's guarantees are important. But all we need to trace is, at any given point in time when the controller did X, what was the state of this database, or what set of objects were in the API, before or after each event. And this point where the controllers are interacting with the Kubernetes API is also very simple semantically: you read, modify, delete, or create objects. It's literally just a simple key-value store from that point of view, which has very simple semantics. So it makes it easy for us to reason about things like the lifetime of objects. I know that if a delete went through for a key x, x is no longer going to be in the API after this point in time. We can make these types of assumptions quite easily, unlike with an arbitrary distributed system. If you took Hadoop and tried to do this type of work, it's going to be very, very hard in comparison, because the semantics of its APIs are very hard and opaque, and you don't even know what objects are going out of these internal APIs.
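The "trace of database states" view described here can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a single totally ordered list of operations; `build_history`, `lifetimes`, and the object names are hypothetical and do not come from Sieve or from any Kubernetes library.

```python
def build_history(ops):
    """Return the snapshot of the key-value store after each operation.

    history[0] is the empty store; history[i] is the state after ops[i - 1].
    """
    state = {}
    history = [dict(state)]
    for op, key, value in ops:
        if op == "delete":
            state.pop(key, None)
        else:  # "create" or "update"
            state[key] = value
        history.append(dict(state))
    return history


def lifetimes(history, key):
    """Return (start, end) revision pairs during which the key exists.

    end is exclusive; None means the key is still present at the end.
    """
    spans = []
    start = None
    for revision, snapshot in enumerate(history):
        if key in snapshot and start is None:
            start = revision
        elif key not in snapshot and start is not None:
            spans.append((start, revision))
            start = None
    if start is not None:
        spans.append((start, None))
    return spans


# One globally ordered sequence of control plane actions (made-up names).
OPS = [
    ("create", "pod-x", "v1"),
    ("create", "pvc-x", "bound"),
    ("delete", "pod-x", None),
    ("create", "pod-x", "v2"),
]

HISTORY = build_history(OPS)
print(lifetimes(HISTORY, "pod-x"))
print(lifetimes(HISTORY, "pvc-x"))
```

Because every snapshot sits at a known position in one shared order, a question like "was pod-x present when the controller acted at revision 3?" has a single answer, which is the property the speaker attributes to etcd's ordering guarantees.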
It's very hard to make that type of guesswork. There's really no guessing involved when you're doing this inside Kubernetes, and that's what makes it this effective.

So I was curious a little bit more about the operation in general. I guess what I'm also hearing you say is that you think a major portion of the ecosystem for Kubernetes is, and should be, operators, and so anything we can do to enable their creation and continued flow is better for the Kubernetes platform, because that's a good delivery model for these systems.

Maybe I would add to this that if you look at how Kubernetes has grown and been adopted across the industry, every cloud provider, almost every software company, is using or offering Kubernetes. It has somehow become this de facto API that is offered, and so for software vendors or ISVs it's a nice platform to use, because you can just assume it's ubiquitous, it's always there. One of the interesting projects that I've also been following is the kcp project from Red Hat. I was having early discussions with Stefan Schimanski, who is one of the lead engineers there, because kcp takes Kubernetes to the next level, if you will, taking the core API principles and making it more than just, quote unquote, a container orchestrator. There the same semantics apply: you use the same libraries, and the same concepts and patterns are there, but you might do user management or workspace management across the globe, if you will, using the same Kubernetes API principles. That means that the complexity of these projects is obviously increasing, and so will the demands on software quality and the bugs that you will find. That's why I believe that Sieve can really help to change the game there, in providing and
writing better software on top of Kubernetes. Something that we don't really do quite well in the IT and tech space, I would say, compared to other sectors like civil engineering or aerospace: we are probably still taking baby steps, still crawling, when it comes to the quality of the software that we produce. On average, not everyone, but on average. Imagine I would just go out and start building a bridge and call it a day. That's probably not a good way to do stuff, right? There are way more regulations, way more descriptions and stuff that you have to put in place before you can actually start building a bridge. In computer science, people like Leslie Lamport have advocated for writing specifications first, something like TLA+: write the specification in a non-code-like language, use mathematics to write the algorithm, then prove it, and then write the actual code. But often we just start writing the code, and then if it works and the unit tests pass, sounds like it's good, right? And then these bugs just surface. Some are more complex and subtle to expose, and some just come up and kill the workload. That's why I think Sieve at least helps to write and create better software, especially as Kubernetes adoption is growing.

Interestingly enough, to plug our own show, in our first episode, when we interviewed Clayton Coleman, we talked about kcp, and I'll throw that link in the chat like I've been doing with a bunch of others. I think that control plane idea, having a uniform API for all the things, really does have some serious advantages. But I was curious: you mentioned Sieve as a dev-time tool. Is that what you foresee in the future?
Or is this something that should be running in production? I've been working on distributed systems for a long, long time. A friend of mine and I actually built a Visual Basic remote tool over HTTP, and lied to it about it actually being DCOM, so this was a long time ago, right? And we wrote a way to watch for distributed system communication challenges as the application was running. So we had a dev-time tool, but we also wanted to run it in production. It's the hardest part about distributed systems: you put it in production and all of a sudden you're like, I have literally no idea what path of code was actually traversed, almost by design. So how can we help with that kind of tool as we go forward? Is that Sieve prime, or is that somewhere Sieve might want to go?

So far, I think we're doubling down on the development side. I think you've hit the nail on the head there: you don't know what path you took to hit a bug in production. And I think our philosophy, at least, has been: don't even give a bug a chance to happen in production if you can avoid it. So we are even looking at things like verification right now, because, again, Sieve is a testing tool. Its effectiveness is only as good as the initial happy-path tests that you provide. Sieve will explore all its conditions based on those initial tests, but it's not going to guarantee the absence of bugs, for which you need things like verification. On the doing-things-live side,
I'm involved in some projects where we are looking at, again, this principle that if all of your state is in one globally ordered database, and all state outside of it is ephemeral, it might help with doing some things like debugging. This project I'm involved in is called DBOS. It's with collaborators from Stanford and MIT, and here we are looking into some of these types of questions on how you can attach a debugger to a live system and find out exactly what the state was as of a bug happening. If you can trace all of the information until then, you can do things like record-and-replay debugging live. But it assumes a very different, very restricted programming environment, unlike what we're talking about with things like controllers and operators. Overall, you have to restrict the programming environment in order to be able to do that. Otherwise, with all these data races and whatnot, it gets much harder.

No, I totally understand. So I guess what we should be looking for is increased breadth of Sieve's coverage around dealing with these toolchains, with the hope that there will be fewer bugs in production. Maybe I'm less optimistic than you are. Certainly I have never had a production system that didn't have the occasional bug, so it's always there. But we can get closer every time, right?

Yeah, there's no such thing as bug-free software, even with verification, if you ask me. But there's going to be this quality curve, right?
You can push at it, and you can play with the asymptotics a little bit, but you're not going to...

Yeah, I still remember in college I had a professor who told me that as I became more experienced, I would stop ever having syntax errors. And x number of years later, like 25-ish, I'm still waiting for that day. So, yeah, bugs are a way of life. It's more about being able to identify why they happened and keeping them from happening again. The whole blame-free culture and all that stuff, I think, is really, really important to get software...

An important bit here is also: what is the cost of a bug? I think tools like Sieve are valuable because the cost of a bug in this space is quite large. We are finding bugs where volumes get accidentally deleted, and things like this. You're losing data. There are also security holes that show up. There were some operators where we found that a crash at the wrong time would cause them to silently not configure TLS. So there is quite a price for these types of bugs, and therefore it's okay to spend a bit more resources and development time trying to find them. For other kinds of bugs, where you don't have this much of a price,
perhaps these tools might be overkill, but I think this is the right space for this type of work.

I really like that angle of analysis there, Lalit, because to the extent that we're using operators to manage the platform itself, and foundation software that is a piece of every application we build above that surface, it may not meet any of the technical definitions, but it is in fact tantamount to operating system functions, and has the expense and the damage potential of flaws, bugs, and security vulnerabilities in an operating system surface. So if I'm somebody who wants to push on that quality curve and I'm really interested in Sieve, this is a two-part question. One: if I want to start using it tomorrow to make my operators better, what's the first thing I should do? Two: if I want to start working on Sieve tomorrow to make everyone's operators and controllers better, what is the first thing I should do?

So the first part is what you should do if you want to use it for your operator today. If you go to the Sieve project, you'll find the porting guide on how to bring Sieve into your project. It will also answer your second question, because you'll see a lot of the work that you're doing by hand that could be defaults that you get from, let's say, using an SDK directly. For example, you'll see that you're going to specify a bunch of version info: what Go version are you using, what Kubernetes version are you using, what commit ID are you using for your project to test, and so on. A lot of those defaults, I think, could come straight off the project itself if you integrated this with, say, the Operator SDK, in a batteries-included way where you just say, generate this project to be Sieve-enabled.
And then, for example, Sieve needs some information like, where is the Dockerfile for this project? It also needs some information on the CR itself, so that it can put in some annotations that Sieve will use during testing to find the pod corresponding to the controller. A lot of this is mechanical work that I think we can automate away if we wrote this as a plugin for, say, controller-runtime or Operator SDK or something like this. So if you go through this process once, especially this crowd, you'll see exactly why there's a strong need for us to integrate this with one of these SDKs. A lot of those defaults, and things that you're hard-coding into a manifest file that Sieve will use, can just be assumed easily. And if you want to help us with the Sieve project, this is one of the main things that we need. I think Xudong is working on a wish list that will be up soon, but ergonomics around making it easy to consume in any project is the big thing for us right now.

And when that wish list is done, where do you expect it to be? Is it a set of GitHub issues, or its own doc?

It'll be both. There'll be a pinned issue, and either it'll be self-contained there or there will be a doc for it as well.

Okay. I linked the project and the porting guide in the chat, so people should be able to find it. So that's cool. Josh, did that answer your question? Did you have another question you wanted to ask about it?
No, I think those are exactly the two things I was looking for, and I'm especially delighted by the observation that in a crowd full of people looking for SRE surfaces to automate, as soon as you encounter the first-order problems of manual work, you'll go looking for how to make them not manual the second time you go through.

Yeah, the key to being a good programmer is being lazy, right? The lazier the better. Actually, we're just about out of time, so I wanted to take this opportunity to ask: were there final thoughts or final questions or anything that you wanted to bring up, to either Lalit or...

What's the next big exciting feature that we would expect to see in Sieve? If I wanted to give a five-minute lightning talk tomorrow, I think I could get through the first three minutes: here's what it is, here's why you should care. What's my hook to walk out with, here's the next great thing about it? Would it be integrations with build toolkits and SDKs like Kubebuilder and Operator SDK, and some of those things we've talked about? Is any of that work active? Is there some of that going on that we could take a look at now?

I think the one big thing that we're working on now is not needing the assumption of, for example, controller-runtime. Right now we assume controller-runtime. We're working to remove that assumption and work with just base client-go, and once we do that it gets easier to go upwards as needed for the other frameworks that show up. So this is one big thing.
That's what we're working on now. Going forward, I would say integration with all of these frameworks, as you mentioned, is the big thing that we need to address. This is something we would really appreciate some help on, and this is basically a big call to arms: if the community would like to help us out, come help us with this project. The long-term thing to get excited about is that we are looking at Sieve as not just something to test controllers or operators, but also asking what part of this can extend all the way to the applications that are being managed by these operators. We don't have any visibility into, let's say, what the Cassandra cluster that the operator might manage is itself doing. Is there something that we can do to also make it easier to do this type of testing one level past the operator? That's something we're quite excited about, so that's something to keep an eye out for. So, expanding the scope of this type of testing.

And maybe something for the community: because Sieve found a lot of bugs, and most of them were already closed or have remediation, I think presenting at KubeCon about the bugs discovered and how to remediate them would be a good talk. There have not been a lot of Kubernetes controller related talks lately, but some were presented.
They were highly attended by a lot of folks out there, and that's even though there are two good books that I know of. Josh, you wrote one, and Stefan Schimanski and Michael Hausenblas wrote the other one. But I think more on this, I'd call it the mental model that you have to have in your head, the mindset of writing these controllers. There are a lot of things, maybe the five or ten golden rules, that you have to keep in mind. Relating that to best practices when writing these operators, and sharing the knowledge and the learnings from the Sieve project, is also going to be very helpful. And then obviously driving adoption of Sieve, because people are getting aware of Sieve.

Cool. Those are great things to look forward to, and I'll take that as a little bit of a segue, in that I'll actually be doing an interview or a panel discussion with Ford at KubeCon North America in Detroit in October. We have a bunch of other crazy ideas planned as KBE Insider, so you should definitely join us. We'll also be doing a meet-and-greet event at the OpenShift Commons reception the day before, so definitely join us there. And watch for the video output; I'll give you the teaser of maybe riding around in cars interviewing people, because we'll be in the Motor City. And then next time on the show, at the end of September, so the last Tuesday of September, we're going to be interviewing a guest about Knative [inaudible], and so talking about... Oh, Bill. Hey, Bill is going to be there. Oh, that's going to be great. Nice. Say hey.

Yeah, and obviously I'm a little biased towards it anyway, because I think serverless and event-driven architectures are really the way all software development should be done. So yes, I'm a little biased. But thank you so much for being on the show.
We really appreciate it, and we look forward to seeing you again. Thanks, everybody.

Thanks a lot.

Yeah, thank you. Thanks very much for joining us, folks.