Hi, everyone. Thank you for joining us, and welcome to OpenInfra Live, the OpenInfra Foundation's hour-long interactive show sharing production case studies, open source demos, industry conversations, and the latest updates from the global open infrastructure community. My name is Kristen Barrientos and I will be your host today. As a reminder, we're streaming live on YouTube and LinkedIn, and I will be relaying your questions throughout the show, so feel free to drop them in the comment section and we'll try to answer as many as we can. Before we get started, I wanted to thank all of the OpenInfra Foundation members. Their support makes OpenInfra Live possible, so thank you so much.

Now let me introduce our guests. Today we're joined by a panel of Zuul operators, and they have invited two very special guests from Volvo to talk about their deployment and discuss their operations. Please welcome James Blair, Clark Boylan, Jeremy Stanley, Johannes Foufas, and Moise Aufnitz. Welcome everyone, and I'll hand it off to you, James.

Hi, thank you, and welcome. I wanted to get started today by asking Johannes and Moise: how did you get involved with Zuul?

That was around 2018, I think. We were working on an ECU in the car, we had a bit over 100 developers, and we were running Jenkins as a CI system, a continuous integration system, and we really struggled with integration. We stayed up late nights trying to track down whatever had just merged and stopped everything, trying to find the breaking thing. So at that time, one of the senior managers told me and two other people: you really have to solve this. I didn't know much about CI at the time; I was a developer. So we investigated a lot of things: what's out there? We had heard about gating, so we searched the internet: are there gating systems available somewhere? I remember we tried GitLab, we even tried Phabricator, which I think was Facebook's system at the time, and so forth, but none of them had built-in gating. And then we stumbled upon version 2 of Zuul. We already used Jenkins, so we were familiar with that part. We switched to Gerrit and Git, and there we go. I think since that point, that ECU, that computer in the car, had a green master: it was working. And we built up these DevOps teams; in the beginning we were one, two, three people. We no longer had to spend time weeding out breaking code, code that destroyed the project, and we could focus on building up the CI infrastructure. So that's how it started.

That's great. That's kind of the banner feature of Zuul, the project gating aspect of it, and the project's motto is "stop merging broken code." So it seems like that's exactly what you did.

Yeah. And we discovered that it's not the magic pill in itself: it depends on having good test cases. At that time we spent days sitting down with all the different modules we had: which are your critical functions, what do you do, where are your gating integration tests? Just to protect the functionality of the codebase. But if you have that, then it's a simple thing, really. It's maybe too simple at times, but it really works.
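For readers who haven't seen Zuul's gating in action: a gate pipeline of the kind being described might look roughly like the sketch below, loosely modeled on the examples in the Zuul documentation. The names and values are illustrative, not Volvo's actual configuration.

```yaml
# A minimal sketch of a Gerrit-driven gate pipeline; illustrative only.
- pipeline:
    name: gate
    description: Approved changes are tested here before they may merge.
    manager: dependent     # queues changes and tests them in sequence, so
                           # only combinations that pass together may merge
    trigger:
      gerrit:
        - event: comment-added
          approval:
            - Approved: 1
    success:
      gerrit:
        Verified: 2
        submit: true       # Zuul itself merges the change on success
    failure:
      gerrit:
        Verified: -2
```

The `manager: dependent` setting is what makes this a gate rather than a plain test pipeline: it is how a "green master" stays green.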
If I can drill down on that a little bit: have your developers changed what they do now that they have a gating system? Does it change how they think about testing the software, how they write tests, that sort of thing?

It's very hard to say. If we spin forward to the present, we have a much larger number of users; I loosely counted maybe up to 1,300. So it's hard to answer for everyone. But for the people who really seem to enjoy Zuul, the newer Ansible-based versions, it has changed their behavior a bit, yes. They collaborate in a different way. They use a lot of the dependency functionality between themselves, and they're really, really careful with the tests that they have. The issue they have, maybe, is that when we connect many islands, they want the others to also have gating tests; they say, don't break our gate through external dependencies. That's something we're working on a lot. The major part is our core computer, which you can read about on the internet and on YouTube, and that's where the majority of our developers are. As I talked about in Berlin two years ago, we have 900, almost 1,000 projects in one tenant, and we're now trying to reduce that number for different reasons: it's easier to manage centrally, and easier for those who handle integration. And that, I think, will make it easier to own the gating tests together, in a way.

You mentioned Depends-On and cross-repo dependencies. Do you see that driving greater, better, richer interaction between your software development teams? Is the tool expanding its influence beyond CI and impacting how your developers communicate and work together?

Yeah, in a way, in some teams, I would say. Especially in our own teams: Moise and I belong to two DevOps teams, and internally it's used a lot. Also, when we do fixes to the base jobs that we serve, we use Depends-On a lot; if somebody is stuck on something, you just use Depends-On and they can continue working. That is really great, actually; I think that's one of the greater things. And when we develop things, when we introduce new functionality, new base jobs for instance, it's fantastic to have the Depends-On functionality. So I see it a lot, but not everyone uses it yet, I think. We have one tenant called EPF, electric propulsion; they have their own tenant, and they work with electric drive, electric motors, charging. When they moved to Zuul, one of the big things they told us was: wow, this Depends-On is really great.
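For context, Depends-On is a footer in the Git commit message: Zuul sees the URL of the other change and tests, and gates, the two together, even across different code review systems. A hypothetical commit message might look like this; the subject, URL, and change number are invented:

```
Add torque limiter to the motor control loop

The limiter uses the new platform API, so this change has to be
tested and merged together with the platform change below.

Depends-On: https://gerrit.example.com/c/platform/+/12345
```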
You mentioned tenants, so the implication is that you're using Zuul in a multi-tenant fashion. Do you find that useful for organizing the work between the different groups? How does that functionality in Zuul aid your CI design and deployment?

It helps a lot. I would like to have more tenants than we do have, but we have some boundary conditions that forced us to have one tenant that grew beyond proportion; I'm checking, it's 973 projects. But it has helped to separate things. Our electric propulsion tenant, for instance, has helped tremendously, and it helps with the speed of Zuul: the smaller tenants are faster. The response time, the time from when the Gerrit event arrives until the job starts, is what developers really want to be fast, and the smaller tenants are much faster there. Then we also have one Rust tenant. They have a small little node on the core computer, they have about ten projects, and they do everything in Rust. They're pretty much self-sufficient; they have quite isolated jobs and so forth, and we don't hear much from them. Just the other day something got stuck in one of the schedulers, I think, and we had to do a restart, and then they were happy again. So we have six different tenants. We also have one for large-scale autonomous drive simulations, because they don't interact a lot with the software patches that go into the code; they take that, build simulations, and run those, so they have their own tenant. And then of course we have the Zuul tenant used by us and the DevOps group. We also have a few Zuul admins in electric propulsion, in their tenant, which is really nice; there are three administrators now, so they help out with running things and are pretty much self-sufficient. So yes, it helps, but I would like to use it more if possible.
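Tenants are defined in Zuul's main configuration file, conventionally `main.yaml`. As a rough illustration of the kind of layout being described, with invented tenant and project names:

```yaml
# Illustrative multi-tenant layout; names are invented, not Volvo's.
- tenant:
    name: core-computer
    source:
      gerrit:
        config-projects:
          - zuul-config              # trusted repo with base jobs and pipelines
        untrusted-projects:
          - vehicle/motor-control
          - vehicle/infotainment

- tenant:
    name: electric-propulsion
    source:
      gerrit:
        config-projects:
          - epf/zuul-config
        untrusted-projects:
          - epf/electric-drive
          - epf/charging
```

Each tenant gets its own pipelines and its own view of projects, and, as noted above, a smaller tenant means less configuration to evaluate per event and therefore faster event-to-job times.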
So you mentioned that when you first started using Zuul, you were interested enough in the idea of gating that you even changed the code review and revision control systems you were using. I think a lot of people get attached to their workflows, and not a lot of organizations, or at least their engineering teams, are willing to consider significant upheaval in their workflows. What was that like for the people who were initially adopting it?

In retrospect, maybe it was a hard experience for some of them, actually. We came from Subversion at that time, and I didn't know Git; I was afraid of Git back then. So I read up on it, and I thought, I have to hold courses for the developers. The other team members were pretty busy with other things, so I held courses for all of them, and I tried to study up as much as I could on Git, even the more extreme cases. I used Git and Markdown files to build a Git and Gerrit guide for our developers. And I remember, when I started reading about Git, Linus saying that he deliberately used the same names for different things; I don't know, it was a kind of dig at Subversion at the time. So it was hard for some of us. But after a while, more and more people accepted it, and people actually started to appreciate that shift to Git, and especially to Gerrit. I must say, I really, really enjoyed Gerrit for code review. It was simple, not so fancy-looking, but very, very nice to review in. It took a while, for sure. But now it's not a question anymore; the decision is made that Git and Gerrit are what we use.

Do you use any other code review systems, or just Gerrit?

We use a little bit of GitLab in Zuul. The integration project of our core computer is still in GitLab, so we have that interaction. And we had a few times when team members who had updates to our job configurations used a Depends-On pointing at a change in GitLab. I didn't even know if that would work. I checked the documentation; I think I even pinged James to ask whether it would at least work, because I couldn't find anything in the documentation at the time. But yes, it did work.

I think that's probably one of the real killer features of Zuul at present, something really no other CI/CD system implements: cross-platform, cross-code-review-system change management and control.

Yeah, especially since the platforms themselves are interested in seeing the CI system as both a feature and a vendor lock-in opportunity. With Zuul, we're pretty agnostic about that: we want to test your code no matter where it's hosted. So it's interesting that you're in the situation where you've got some teams on Gerrit and some on GitLab. Honestly, I think that's pretty common in larger organizations; every team has the tools they're comfortable with, and I think it's neat to be able to bridge that gap and actually get integration work done.

Yeah, and those other organizations typically wind up siloed, with the people using one code review platform stuck on one CI/CD system and everybody else on another, and there's no opportunity for crosstalk, or they've invested lots and lots of their own engineering time building some sort of integration bridge between them. So the fact that Zuul can supply that, I think, is really cool.

Let's talk about some more technical stuff here. Can you tell us how you deploy Zuul?

Yes. We're basically using Helm charts. A little bit of background: we have deployed Zuul on a Kubernetes cluster hosted on AWS, using the EKS service, and that cluster hosts our Zuul pods. We deploy those pods through Helm charts, which we keep in one of our repos. Now, we have three clusters, and we're following a very common pattern here where we have one common values file for all three environments, and then three different files for the three different environments. So if we want to override some settings for, say, the dev environment, the stage environment, or the prod environment, we write those configurations in the respective files, and common configuration goes in the common file. It's very common practice to do that with Helm charts, and we're following that pattern as well.
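As a sketch of that pattern, with hypothetical keys standing in for whatever the actual charts expose: later `-f` files override earlier ones when running something like `helm upgrade zuul ./zuul -f values-common.yaml -f values-prod.yaml`.

```yaml
# values-common.yaml — defaults shared by all three environments (hypothetical keys)
scheduler:
  replicas: 1
executor:
  replicas: 1
web:
  replicas: 1
zookeeper:
  replicas: 1

# values-prod.yaml — production overrides layered on top of the common file
scheduler:
  replicas: 6
executor:
  replicas: 10
web:
  replicas: 6
zookeeper:
  replicas: 3
```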
Other than that, we have a pipeline for it: our Zuul jobs themselves are responsible for deploying, updating, and managing our Zuul pods. Kubernetes has a feature called rolling updates, so when we run the Helm upgrade command, what happens in the background is that it takes down one pod at a time. So I would say we don't have complete downtime; we have partial downtime, where one pod is brought down and then another pod is brought up in its place. And as Johannes was saying about the gating pipeline, we have different stages as well: we have different checks in the check pipeline, different checks in the gate pipeline, and finally, when all of our checks have passed, we deploy the Zuul Helm charts in the deploy pipeline. That, in short, is how we deploy our Zuul infrastructure on our clusters.
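A rough illustration of that rolling-update behavior, as a Kubernetes Deployment fragment; the image tag and labels are illustrative, and a real installation might use StatefulSets or different controllers for some components:

```yaml
# Hypothetical Deployment fragment: Kubernetes replaces one pod at a time,
# so an upgrade causes partial rather than complete downtime.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zuul-web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: zuul-web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod down at any moment
      maxSurge: 0         # don't start a new pod until an old one is gone
  template:
    metadata:
      labels:
        app: zuul-web
    spec:
      containers:
        - name: zuul-web
          image: zuul/zuul-web:8.3.1   # illustrative tag
```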
Can you talk a little bit about the scale and the components? You talked about your deployment and how you do rolling deployments; can you share how many schedulers and other Zuul components you use?

Yeah. Right now, at least for production, we have six schedulers, ten executors, and I think six web pods as well. Other than that, the ZooKeeper pod count is three for production. But this count is not the same in the other environments. In the staging environment, for example, we have one executor, one scheduler, one ZooKeeper, and one web pod, and it's pretty much the same in the dev environment as well.

I'd actually like to pause here for just a second; we have a question from the audience. The question is: is Zuul meant to compete with Git, or is it a supplement to Git? Jeremy, do you want to start with that and talk about Zuul's relationship with Git?

Yeah. Probably the most important thing to understand is that Git is fundamental to, and underpins, basically everything Zuul does. All of the references Zuul works on are basically Git references, and the things that trigger it are, for the most part, events from revision control or code review systems that are managing changes, pull requests, or other kinds of merge requests in Git hosting and code review systems. So not only does Zuul work well with Git, it effectively requires Git for its functionality. And a lot of the relationships it builds are effectively Git relationships between different units of work in those code review and revision control systems.

Yeah, and just to add: you'll find that a lot of Zuul users pair Zuul up with a code review system; Gerrit and GitLab have been mentioned, but also GitHub. The idea is that Zuul consumes Git activity from those sources and then drives its continuous integration and deployment using the events, the activity, that happen within those code review systems.

Yeah, and the idea with Zuul as a project gating system is that it is effectively in control of when commits eventually merge to branches within your Git repositories, whatever platform it's integrating with.

So, Moise, I have some more questions about your deployment. When do you decide to upgrade Zuul versions, how do you handle version changes, and how frequently do you upgrade?

Right now we're on 8.3.1, which is basically a couple of versions behind the latest release. We try to upgrade to the latest version, but sometimes we get stuck on other things; still, we definitely try to stay up to date. The whole procedure is that we first test the new version in the dev environment, and then the staging environment. The dev environment is not an exact replica of production, but the staging environment is, so staging is what really tells us whether the new version is correct or not. And we deploy certain jobs in those two environments, dev and staging, just to test whether those jobs are working fine as well. That is our process for testing whether the new images work or not.
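In a layered values setup like the one sketched earlier, staging a release one environment at a time can be as simple as bumping an image tag in each environment's file; the key name and version numbers below are invented for illustration:

```yaml
# Hypothetical per-environment pinning of the Zuul release.

# values-dev.yaml
zuulImageTag: "9.1.0"    # candidate release under test

# values-stage.yaml
zuulImageTag: "9.1.0"    # promoted once dev looks healthy

# values-prod.yaml
zuulImageTag: "8.3.1"    # production stays on the known-good release
```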
We're also very careful; it depends on what the upgrade is. I'm usually the person who says, whoa, wait now, because we're afraid of downtime, basically. We would like to just update continuously, but yeah, we're a bit careful, and maybe it's my fault, I'm the one blocking.

Yeah, but the good thing is that we have partial downtime rather than complete downtime. So let's say something goes wrong in production: the rolling update strategy I was talking about will only take down one pod. If there's an error, it will not bring down all the pods; if we have ten pods, it brings down one, and the remaining nine keep running. So that's a good thing for us.

For the dev and staging environments, you mentioned that you do run CI jobs through there. Is that happening transparently to your users, and are you comparing results between production and staging? I guess I'm curious how you're choosing which things go into those pre-production environments as your canary workloads.

I would say that the jobs that run in production do not run in those two environments, staging and dev. In dev and staging we have, you could say, test jobs, whose results we cannot compare with the results of the production jobs. But in the future we are looking to do exactly that: run some jobs that we can compare between the two environments, or in this case, the three environments.

Do you have separate code review servers in those environments, like a dev Gerrit and a staging Gerrit? So you can test all the interactions without bothering your developers by leaving fake comments on their real changes?

Yes, we do have separate Gerrits in all three environments. Some repositories from particular Gerrits may exist in the other environments as well.

I want to switch hats for a minute and talk a little bit about OpenDev's deployment, since Johannes mentioned that you'd like to deploy continuously one day. OpenDev is where we host the development of Zuul itself; Zuul is of course open source software, and Zuul's own Zuul is in OpenDev. And we do run that almost continuously deployed. Basically, what happens is that once a week, around Friday evening somewhere in the world, we start a complete redeployment of Zuul. Just like Moise was saying, it's a rolling deployment: we restart one component at a time, and whatever the state of the Git tree is at that point in time on that Friday is what gets deployed. So we're getting close to continuous deployment; it's a weekly automatic thing, and we've talked about maybe even doing a continuous rolling restart. One of the things about upgrading Zuul is that jobs frequently take a long time, and one of the components of Zuul is the job executor. If you're restarting that component, you have to wait for all the jobs to finish, and if you do that one executor at a time, it can be a pretty long process. So that's one of the challenges we're looking at in making OpenDev continuously deployed.

But I think the reason I brought all this up is to share a little of the positives and negatives of this approach we have in OpenDev. The positive, of course, is that we get new features fairly quickly, and it happens so often that upgrading is generally a non-event: there's no planning that goes into it, no concern about whether it's safe or not, because nobody is making that decision; it's all done by computers, automatically. And that is nice. The downside, of course, is that we have to be constantly on top of any release notes saying: here's a deprecation, here's something you need to change, here's something you need to do as part of the upgrade. There's a constant stream of those, and we're not batching them up. Zuul's testing of itself is fairly complete; we have a lot of, I think, ridiculously realistic tests in our unit tests to try to avoid regressions, so generally speaking, if a change merges, we're confident enough to deploy it. But we're only human writing and reviewing that code, so occasionally something slips through, and we have to be prepared that any time one of these automatic upgrades happens, we might have to downgrade or do something else to fix it. So it's definitely an approach that people can consider; it's doable, and there's at least one instance doing it. But when the entire production of a company is on the line, I would understand that the inputs to that equation are different than in an open source development community, where in the rare instances something goes wrong, we generally expect people still have their own resources to get their own work done and aren't completely relying on upstream for everything.

I was just going to say, from OpenDev's perspective, one of the things that really enabled us to switch to that constant, or near constant, upgrading model was when Zuul finally dropped its last single points of failure. We're now able to do rolling upgrades with zero downtime and complete data persistence from the old version to the new version, with old and new versions of the different components running side by side as upgrades are happening.
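As a rough sketch of what a rolling executor restart can look like, assuming containerized services; the hostnames, paths, and compose setup are invented, though `zuul-executor graceful`, which stops accepting new jobs and exits once running jobs finish, is a real command:

```yaml
# Hypothetical Ansible playbook for a rolling, zero-downtime executor restart.
- hosts: zuul_executors
  serial: 1                                  # one executor at a time
  tasks:
    - name: Ask the executor to finish running jobs and exit
      # This can block for hours: graceful waits for every job on the
      # executor to complete before the process exits.
      ansible.builtin.command: docker exec zuul-executor zuul-executor graceful

    - name: Pull the current images
      ansible.builtin.command: docker-compose pull
      args:
        chdir: /etc/zuul-executor

    - name: Start the executor again on the new version
      ansible.builtin.command: docker-compose up -d
      args:
        chdir: /etc/zuul-executor
```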
Redundant copies of the services are temporarily taken offline, updated, and then brought back into the cluster, and our users don't notice unless a regression has slipped through testing somehow. For the most part, our users don't even realize that Zuul is being upgraded on a weekly basis, automatically. And the good news for everyone else is that there's at least one installation out there that's fairly close to the tip of trunk, so you can feel pretty confident that when you do get to the point where you're comfortable doing an upgrade, someone else has gone through the process before you and it seems to have worked for them; or if it didn't, the issues should have been addressed by the time you get there.

I really appreciate that you run at the tip of master in the project. It gives us a bit of confidence. And as you say, at times we're a bit nervous: we don't want to interfere if there's the slightest little chance of downtime, we don't want to stress developers who are working a lot and have deadlines and so forth.

If I can jump in here, there was another question that came in from the audience. They were wondering what the migration from Zuul v2, which was Jenkins-based, to Zuul v3, which became Ansible-based, looked like for you. How did you tackle that shift? Because it is a fairly sizable chunk of work to make that jump between Zuul versions.

Yeah, we were a bit hesitant about that upgrade, and I think it took us a long time to do it. But we really needed it. As the car ecosystem developed, we had more dependencies and more teams; we started small scale, but now it's more than ten times bigger, and we saw early on that things would depend on each other, that there would be a lot of dependencies between different teams. So we really needed the functionality where users can freely depend on each other's development and still have the gating principle. Because we thought the gating principle was actually what kept us alive and increased our development speed; that's what it did, in practical management terms. We saw that with all these teams from across the whole company put together, there would be dependencies on each other because of how cars have developed; cars are like IT infrastructures on wheels these days, they've evolved a lot from what they used to be. So that was the reason we did it. But it was very difficult for us; not everyone was familiar with Ansible, for instance. Then again, we really appreciate the inheritance of base jobs, that possibility. That's something we'll try to focus more on on our side: to have more control of the base jobs, defined in one place. Right now we offer base jobs to most developers, but there are variations, and everyone can define their own, and that's something we need to address. We need a library, a separate repository that says: here are the job definitions. We grew organically, and it was a bit of a garage operation in the beginning; we worked a lot and we weren't many people, so we just had to fix things and get it up. But now we clearly see that we need to arrange that. The inheritance and use of base jobs is extremely powerful when you have a large developer community: if you need to upgrade something, you can upgrade it in one place. That's fantastic, actually.
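That "one place" pattern is plain job inheritance in Zuul's YAML; a hedged sketch, with invented job and playbook names:

```yaml
# Sketch of a centrally maintained base job; names are invented.
- job:
    name: base
    description: Common setup and log collection shared by every job.
    pre-run: playbooks/base/pre.yaml
    post-run: playbooks/base/post.yaml
    nodeset:
      nodes:
        - name: builder
          label: ubuntu-jammy

# A team job declares only what is specific to it. Fixing or upgrading
# the base job then updates every job that inherits from it.
- job:
    name: motor-control-unit-tests
    parent: base
    run: playbooks/unit-tests.yaml
    vars:
      module: motor-control
```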
Did you do the migration from v2 to v3 all at once? Was there a day when everyone cut over, or did you have multiple Zuul instances where users would move themselves as they were ready? Because I think we've seen both approaches used by different organizations.

For our master tracks, let's say, we just cut over. But we did deliver things into production during the years we used version 2, and we need to keep those up. In our industry there are legal requirements: you have to be able to update for a very long time. So we still have those old systems, and they need to live for quite a long time, but not for the master development of our new projects. And now that I think of it: as we grew in scale, we really benefit from having multiple schedulers and no single point of failure. In the beginning, everything was fine, but when we crammed, pushed everything into one tenant, for different reasons, a bit uncontrolled, we really came to appreciate the possibility of scaling the system, and of course the rolling updates. Did I miss anything, Moise?

No, that's fine. I think you have covered everything.

I'll mention that on the OpenDev side, we host the Zuul that is Zuul's Zuul, as Jim mentioned earlier, and we had to go through the same transition from v2 to v3. I think one of the very first blog posts on the Zuul website discusses how the evolution of Zuul happened and how OpenDev's Zuul, which hosts Zuul, went through this transition as well. That may also be interesting to read through if you have more interest in the topic of the v2-to-v3 migration.

One of the topics you just touched on, which was also highlighted in the comments, is the distribution of control between your central CI teams and your developer teams. It sounds like you want to change that a little bit: maybe you've got a bit of a Wild West going on, with the teams doing whatever they want, and now you can distill some of what they're doing, centralize some of it, standardize, and share best practices, that sort of thing. When I talk about Zuul, I describe this as a control slider that you can set anywhere along a scale. You can completely wash your hands of it and let developers do whatever they want, giving them full control to write their own jobs from scratch.
On the other end of the spectrum, you could have a central CI team write all of the jobs and not give developers any choice about whether they're even run. Probably neither of those is a great idea, and there's a Goldilocks zone in the middle, where you come up with: here's what we should be doing, here's the best way to go about it, here are some foundational jobs you can build on if you need to add to them. So with Zuul, you can set that slider anywhere you need it. And it's interesting to hear that, as time goes on, you're taking lessons from what the developers are doing, distilling them down, and making things more standardized.

Yeah. Freedom, as you say, is good, but the result of it at scale is support load. It's an issue for our teams to be able to support everything. As I said, we're 17 people in our teams, and then we have other teams with three or four. Handling that massive amount of support, like "the base job doesn't work" or "I inherited the base job and then I did this", it's an equilibrium, and we need to adjust it a bit to be more efficient as a team, considering everyone. I think it's important to try to find this balance, and we definitely have ongoing, planned work to address it.

We've talked a bit about code review systems and that side of the house. Can you tell us a little bit about the test node resources you use? What do you run your tests on? Do you use cloud providers? And since you're developing systems for cars, do you have real hardware that you're testing on? Basically, what are you running your tests on, and how do you integrate that with Zuul?

I think most of our tests are run on AWS nodes. Of the clouds we run on, I'd say 80% is AWS and 20% Azure; those are the two big cloud platform providers we use. Other than that, we do have some static machines on AWS as well, and in the basement we have hardware, bare-metal machines, but they're very limited in number. Mostly, nowadays, we use cloud machines.

Yeah, and we do have, not many, but some of what's called in the industry hardware-in-the-loop setups. There we connect to a bare-metal machine which in turn controls the ECUs of the car, stacked up in a rack or something, and some parts of the code go through those before release. It actually performs quite a lot of complex things, closed-loop feedback of radars and all these things. But as Moise said, it's at a much smaller scale. For comparison, we peak at around 9,000 to 10,000 jobs every 24 hours, and only a small percentage of that is bare metal and these hardware-in-the-loop things. So one of the issues we have is bridging from the massive number of jobs and patches that we run on EC2 machines in AWS, basic unit tests, linters, and all this, to verifying the system as a functioning whole in a much more complex setup. For that, we have started to run ARM-based nodes: we got an AMI from one of the OS suppliers we use, so we can run the same architecture as we have on the main computer in the car, and the same operating system. And we're trying to bridge the very limited capacity of these complex systems to the, I shouldn't say infinite, but we don't hit any limits yet in our cloud usage. So that's something we're trying to address; it's not an easy equation, but we see that when these new things turn up in the cloud, when the computer architecture we ship is supported there, that's tremendously helpful. We can actually bridge from the x86 environment towards the target environment, something ARM-based and completely different. So, yeah, very, very interesting.
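In Nodepool terms, offering such nodes amounts to a label backed by the supplier's AMI on arm64 instance types. A hedged sketch of what that could look like with Nodepool's AWS driver; the region, IDs, and names are invented:

```yaml
# Hypothetical Nodepool AWS provider exposing an arm64 label that matches
# the target architecture of the car's main computer.
labels:
  - name: target-arch-arm64

providers:
  - name: aws-main
    driver: aws
    region: eu-north-1
    cloud-images:
      - name: supplier-arm64
        image-id: ami-0123456789abcdef0    # AMI provided by the OS supplier
        username: ubuntu
    pools:
      - name: main
        max-servers: 50
        subnet-id: subnet-0123456789abcdef0
        security-group-id: sg-0123456789abcdef0
        labels:
          - name: target-arch-arm64
            cloud-image: supplier-arm64
            instance-type: t4g.xlarge      # AWS Graviton, arm64
            key-name: zuul
```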
So you call them racks, not garages. But more seriously, it sounds like you have a really diverse set of test resources: cloud resources in different clouds, bare metal, and different hardware architectures you're dealing with. Have you found that Zuul's use of Ansible helps you communicate and work with all of those together? Does having a tool that's agnostic about what it's talking to make your life simpler?

Yes, it does. In fact, beyond the cloud providers we were just talking about, we've actually started running jobs on pods as well, and Ansible helps us with that too: we can use the same tooling to communicate not only with the two cloud platform providers but also with pods. So we're actually running some jobs on Kubernetes pods, and that's one of our future goals as well: we're looking to bring down our AWS and Azure costs by running most of the jobs on pods. That's the future goal we're targeting.
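Nodepool has a Kubernetes driver for exactly this, handing jobs a pod instead of a VM. A rough sketch, with invented context, label, and image names:

```yaml
# Hypothetical Nodepool Kubernetes provider: lightweight jobs get a pod
# rather than an EC2 or Azure instance, which is typically cheaper.
labels:
  - name: small-python-pod

providers:
  - name: eks-pods
    driver: kubernetes
    context: eks-prod            # kubeconfig context for the cluster
    pools:
      - name: main
        labels:
          - name: small-python-pod
            type: pod
            image: python:3.11-slim
            cpu: 2
            memory: 4096         # MiB
```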
You mentioned that you're using AWS and Azure. Are you using them both for different feature sets, or are you trying to get redundancy by having multiple clouds?

It's partially to do with redundancy: if there's something wrong with AWS, we can deploy the same jobs on Azure. But in most cases, it really depends upon the nature of the job. If the job is communicating with something that's deployed on AWS, then our priority is to run that job on AWS. For example, we have multiple Artifactory instances, and one of them is on AWS; if a job is going to download a huge chunk of data from that Artifactory, we'd look to run that job on AWS. In other cases, if there are resources on Azure that the job is trying to communicate with, our priority would be to run that job on Azure. And again, it is to do with redundancy as well: we have had AWS outages, and in certain cases we did have to shift to Azure.

So you're running jobs, potentially the same jobs in some cases, in multiple providers. We do a ton of that in OpenDev; we've got half a dozen different cloud providers and multiple regions, and we've noticed it can be hard to achieve consistency. One of the things we have done to help is to build consistent images and deploy the same images into all of those providers. Do you take advantage of that kind of functionality yourselves?

Yes, we do. We have a pipeline for that as well, which builds the images for both AWS and Azure. Our base image is built from a CI-images repo, and then on top of that base image we build specialized images for special jobs. That's what we're doing right now, and we do it for AWS and for Azure. And believe it or not, we actually have Windows images too.

We're fighting to get rid of the Windows images, because it's difficult; a lot of our problems have been connecting Zuul to Windows nodes, especially when they were bare metal. I think we're almost rid of our bare-metal Windows servers, at least, and we have always tried not to use them. But some of these systems are traditionally developed for Windows; especially in the garage, as Clark said, some of those frameworks are still on Windows, believe it or not. So we sometimes still need our Windows images too.

So it sounded like maybe you're not using the Nodepool builder to build these images, since you said you build them in a pipeline. You're actually using Zuul to build the images, right?

Yes, we're actually using Zuul to build the images.

I feel like you're maybe a little bit ahead of the curve on this, because I actually have a development specification for Zuul that I've started working on, and hopefully we'll be making more progress on it soon, to move quite a lot of the Nodepool image building into Zuul itself; essentially what you're doing now, with jobs that build the images. And it's kind of interesting, because in the Zuul project we've evolved quite a bit here. We started by doing exactly that: we built images in Jenkins and uploaded them to clouds. Then we found that we actually needed a daemon running all of the time to make that reliable enough across all of the clouds we use, so we developed the Nodepool builder to do that. But then we found that, how should I say this, it makes the image-building process somewhat inaccessible for users: if it's hidden away on a server, it's harder for people to see the build logs and debug issues and things like that. So the next step is to try to bridge that gap by doing the image builds in Zuul, and then also having Zuul responsible for doing the upload; keep the background-daemon aspect for uploads, but move the image building into the foreground for user accessibility.

Yeah, the logs will definitely help, because the users will have more visibility into whether something went wrong, and they can basically debug the issue themselves. So that's good, then.
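The "build images with Zuul jobs" approach can be expressed as ordinary job configuration; a hedged sketch with invented job names, playbooks, and pipeline wiring (the playbooks might drive a tool such as diskimage-builder or Packer):

```yaml
# Hypothetical image-build jobs: a common base image, then specialized
# images layered on top of it, with upload handled by the job itself.
- job:
    name: build-ci-base-image
    parent: base
    run: playbooks/build-base-image.yaml

- job:
    name: build-aws-specialized-image
    parent: base
    run: playbooks/build-aws-image.yaml    # layers tooling onto the base, uploads the AMI
    vars:
      cloud: aws

- project:
    check:
      jobs:
        - build-ci-base-image
        - build-aws-specialized-image:
            dependencies:
              - build-ci-base-image        # run only after the base image job
```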
Earlier you mentioned, or hinted at, redundancy and outages, and I'm curious: where have you found the operational challenges of running a Zuul? Is it in keeping the service up with high uptime? Is it the upgrades? Is it dealing with the jobs? Where do you find the struggles in operating a Zuul installation?

I think Kubernetes makes it really easy to manage the Zuul infrastructure. Like I discussed, rolling updates is one of the features it provides, so we're good on the redundancy side, and the Helm charts make it really easy to upgrade and manage the Zuul stack. So we're good from that side, I would say.

I think a few of the struggles we've had: because we have two cloud providers, we sometimes get congestion in the network traffic between them, and that has given us corrupted executors, because they just couldn't communicate internally as they should. Then we had to debug and try to recover, and that's a struggle, I would say.

I would describe another scenario that we encounter. Let's say one of the PVs, a persistent volume on AWS, is in one availability zone, and Kubernetes has scheduled the pod into another availability zone. Our pod in that case cannot use the PV, the volume, that is in the other zone. That's an issue we've encountered, and what we have to do is recreate that PV in the other zone, so that the pod and the PV are in the same zone. A solution we've also discussed is to use EFS volumes instead of EBS volumes, both of which are on AWS, because EFS is multi-zone, so that would essentially fix the issue. But again, we're looking to cut our costs, and EFS is about three times more expensive than EBS, from what I've noticed. So it's about balancing cost and performance: whether we want redundancy in this case, or whether we want to lower the cost. It's not too much work for us, but sometimes we have to go in manually and delete the PV so that it's recreated in the same zone as the pod. That's the issue we've noticed quite often now.

And there are other issues we've encountered previously. For example, if the IO of the disk attached to the executors isn't enough, we have to upgrade the disk manually, which in Kubernetes is not a very easy thing, because there are some attributes you cannot just change. You have to make those pods orphans, recreate the executors, and move them manually, one by one. Some things, like the disk size I mentioned, have to be done manually, but other things can be done through the Helm charts; for example, if I want to add affinity or anti-affinity to some of the pods, I can do that through the Helm charts. So that's what we've encountered in the past.
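For the zone-mismatch problem, one mitigation is to constrain scheduling so the pod lands in the volume's zone; a hypothetical pod-spec fragment, with an invented zone name:

```yaml
# Hypothetical fragment: pin the pod to the availability zone where its
# EBS-backed PersistentVolume already lives, so the attach can succeed.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - eu-north-1a    # zone of the existing volume
```

Another common approach is a StorageClass with `volumeBindingMode: WaitForFirstConsumer`, which delays volume creation until the pod is scheduled, so the volume is created in the pod's zone in the first place.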
Well, I think we're about out of time, and that seems like a good place to stop. Johannes, Moise, do you have any last words you'd like to share?

Not really. Thank you very much; it's been really nice talking to you, and thank you for inviting us.

It's been great talking to you too, and hearing about your Zuul setup.

Thank you for having us.

Yes, like James said, we're almost out of time, so I want to thank everybody for coming today; I appreciate y'all joining us. And thank you to the audience for asking some really great questions during the show. A big thank you again to the OpenInfra Foundation members for making the show possible. And join us again for our next OpenInfra Live on September 21st; for that episode, our guests will be Thailand's largest OpenStack public cloud provider, Nipa Cloud. If you have any ideas for a future episode, let us know and submit them at ideas.openinfra.live, and maybe we'll see you on a future show. Thanks again to today's guests for joining us, and we'll see y'all on September 21st for the next OpenInfra Live. Thanks. Bye.