That works. So, hello, everyone. Great to be here, great to be at the summit. Felix, Simon and myself, I'm Thomas, we are from one team; actually, there are more of us here. What we do is supply a CI system for our colleagues, all the colleagues who are writing the software for the cars. There are quite many of them, you know, the software has become more complex, and it is really demanding to build the software for the cars: driving dynamics, autonomous driving, head unit, instrument cluster. There is really a lot, and you have a lot of dependencies, so it's demanding. That is also a reason why our colleagues push us to supply them a good CI system, and they pushed us to scale it to meet those demands.

So, it's not a coincidence that this is a picture of the Berlin marathon, which happens here every year. Why is that? Because for us, to be honest, sometimes it was painful. And that is a bit like training for a marathon: if you want to get better, you easily hit some personal limits, at least I'm used to that. Then you can of course say, well, stop here. But you can also say, hey, what can we do to improve? And that's what we want to show you. With Zuul, we also hit a problem, a limitation, and we changed something in the upstream project, working a lot with Jim and others. We will tell you what that is and what we changed.

But first of all, why do we do this, continuous integration as a kind of service for our colleagues? Well, often continuous integration starts small. There is one person who maybe likes test automation and continuous integration and does it for the team. But I would say this starts more like a hobby, not as a main job on the feature team. And if you look at the bigger organization, many development teams, big projects, then there are more and more of those people. What's the problem here? The problem is that you start to solve the same problems over and over; you create different solutions for the same problems, and that is not efficient. Therefore, we have separated this a bit: we are a team that really supplies this continuous integration as a service, and we also supply common solutions so that not every feature team needs to solve those problems. Our feature teams can focus more on the actual features and, of course, the project-specific aspects of their continuous integration.

Now, a bit from our perspective: why Zuul and not another CI system? We just heard from Jim some good reasons; here is why Zuul works for us. You have seen that we want to supply common solutions, and Zuul has the right structures to do that. For example, the job hierarchy: we can supply common jobs, base jobs, for all projects and all users. In big projects, you can even supply generic jobs for that whole domain, and then you can have very specific jobs for specific problems. And there is more: there are Ansible roles, there are templates and there are tenants. So you really have the right structures. The second thing is big projects: you have a lot of changes, a lot of things to be merged and tested. Therefore you need a true gating system, otherwise your master branch is quickly broken, it's red. And you need the power of speculative merging that Zuul has, so that you can integrate more changes during a day; otherwise it's too slow. We're not going to go into detail on that here, there were other talks about it.
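As an aside on the job hierarchy mentioned above: Zuul jobs are defined in YAML and inherit via a parent job, so a feature team only adds the project-specific piece on top of centrally supplied base jobs. Purely as an illustration of that layering idea, here is a small Python sketch; the job names, fields and the resolve() helper are invented for this example and are not Zuul's actual data model.

```python
# Conceptual sketch of job layering, NOT Zuul's real configuration model.
BASE_JOBS = {
    # supplied centrally by the CI team for every tenant/project
    "base": {"timeout": 1800, "nodeset": "small-vm", "vars": {"upload_logs": True}},
    # domain-wide job that a big project can share
    "embedded-build": {"parent": "base", "timeout": 7200, "nodeset": "large-vm"},
}

def resolve(job_name: str, jobs: dict) -> dict:
    """Flatten the inheritance chain: parents first, child overrides last."""
    job = jobs[job_name]
    parent = job.get("parent")
    resolved = resolve(parent, jobs) if parent else {}
    merged_vars = {**resolved.get("vars", {}), **job.get("vars", {})}
    resolved.update({k: v for k, v in job.items() if k != "parent"})
    resolved["vars"] = merged_vars
    return resolved

# A feature team only defines the project-specific job on top:
jobs = dict(BASE_JOBS, **{
    "my-ecu-unit-test": {"parent": "embedded-build", "vars": {"suite": "unit"}},
})
print(resolve("my-ecu-unit-test", jobs))
# timeout and nodeset come from the shared jobs, the vars are merged
```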
And last but not least, if you operate all of that, you also need to scale Zuul itself. Zuul had, from the beginning, very good components and structures to support that, so we could scale it quite a lot. Felix will now show a bit of the history of the last two or three years.

So, hi from my side. What I want to show you today is how we managed to scale out the CI system over the last years. And what I want to use to show that is this chart of one of our key metrics, which we use to measure the performance and stability of our system: the so-called jobs per hour. So, what is a job? Basically, a job in our system can be anything a user wants to run. It could be something small like a commit message check, it could be a unit test, it could be a static code analysis, but it could also be a very long-running, I/O-hungry job that takes several hours. We have all of that in our system, and this metric gives us a good overview of what's going on, how the system performs and whether there are issues.

One point I forgot: this chart shows roughly the last two years, but our Zuul history goes back much further. We started around six years ago with a prototype on Zuul version 2, and around four years ago we switched over to Zuul version 3. What you can see in the first half year of this metric is that we were at around 3,000 jobs per hour. There were some ups and downs, but it was very stable, and the point is that we were really limited: we didn't manage to get any higher because we only had a single scheduler instance. We already had a large system with a lot of components, I think around 100 Zuul executors at that time, but only this single scheduler instance. And at that time it simply wasn't technically possible to have more than one scheduler. So we thought we have to make that possible, to allow our system to grow further and to scale with the number of projects, tenants and jobs in our system.

Having only a single scheduler comes with a lot of scaling issues. You get instabilities: if you have a memory leak and this single scheduler instance crashes, it took around 20 to 30 minutes until our system was up and running again. During that time you lose all in-flight events, and you lose the state of the scheduler. And whenever some part of the system slowed down, for example the event processing, more events piled up and the event processing took even longer. So we were really at a limit at that time.

We decided that we had to make the scheduler scalable, and we knew that this would not be easy. We knew it would take time to develop, and that during that time we would still have to keep our system up and running and keep up with the increase in load. So as a first step, we decided to optimize the single scheduler instance. We did several things there. For example, we implemented a feature we called asynchronous reporting, which moved a lot of I/O operations involving GitHub or Gerrit out of the scheduler's main loop, so that the main loop was faster and could process more events in a shorter amount of time.
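To illustrate the idea behind that asynchronous reporting: the main loop only enqueues the report, and a background worker does the slow network I/O. This is a minimal sketch of the pattern, not Zuul's actual code; send_report() and handle_event() are hypothetical placeholders.

```python
# Minimal sketch of the asynchronous-reporting pattern: the main loop
# only enqueues reports, a background worker does the slow network I/O
# towards GitHub/Gerrit. Not Zuul's actual implementation.
import queue
import threading

report_queue: "queue.Queue[dict]" = queue.Queue()

def send_report(report: dict) -> None:
    pass  # placeholder for the slow call to the code review system

def report_worker() -> None:
    while True:
        report = report_queue.get()
        try:
            send_report(report)
        finally:
            report_queue.task_done()

threading.Thread(target=report_worker, daemon=True).start()

def handle_event(event: dict) -> None:
    # Fast, in-memory processing stays in the main loop ...
    result = {"change": event.get("change"), "status": "queued"}
    # ... while reporting no longer blocks it.
    report_queue.put(result)
```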
We did a lot of profiling sessions to identify other performance bottlenecks, and we fixed a lot of memory leaks, things like that, to make the scheduler faster and keep it from crashing for various reasons. At that point we said: now is the time, we can start with the bigger feature. And that's basically what we did over the last year. Starting at the beginning of last year, we actively developed this feature, in close collaboration with the upstream community and in particular with Jim Blair. That gave us direct benefits: we developed it on the main branch, so we decided not to split off into another branch, do all the work there and merge back at the end; we continuously developed those features.

I think you can see that quite well in the chart: with the single-scheduler optimizations, we could push our system a bit further, up to around 4,000 jobs per hour. Then, over the year, there were some ups and downs, but we roughly kept that level. That was because some of those changes really gave us an improvement, while others had a negative performance impact, since we were basically running the multi-scheduler feature set in a single-scheduler deployment. So that kind of kept the level. We also took some measures in our own system: apart from this feature development, we increased the number of cloud regions, we increased the number of executors further, and we increased the number of build nodes. All of that allowed us to run even more jobs in parallel and basically keep up with the load, the throughput of our system.

What you can see at the end of the chart is where we really deployed the multi-scheduler feature. At the beginning of this year, we increased the number of schedulers, starting from one scheduler up to, I think, nine schedulers, which we are running now. And you can really see that this gave us a big improvement in throughput. It also gave us better stability, because we don't have these single-instance issues anymore: whenever one of the schedulers wasn't performing well, the other schedulers could take over the work and simply went on with what they were doing. And now, Simon will give you a more technical view on what we actually did.

Thanks. So, let's have a look at some of the technical details. I've been missing that a bit at this conference; there are so many high-level talks, but I really like the technical pieces. This is, let's say, the high-level architecture of Zuul version 4 as it was two years ago. We have Gearman as the central piece here; basically, all the other components used it for communication. I want to quickly run through this example flow. Let's start in the upper left corner: we have an event, for example a pull request is created, and it gets sent as a webhook to Zuul web. Zuul web dispatches the event to the scheduler. The driver does some pre-processing of the event and then adds it to an internal trigger queue. From that point on, it's basically the Zuul scheduler's main event loop taking over, pushing the state machine forward and processing all the events. So, a change gets enqueued.
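To make this "main event loop" a bit more concrete, here is a heavily simplified, hypothetical sketch of a single-threaded run handler; the Tenant and Pipeline stand-ins are invented for illustration and this is not Zuul's real code. It shows why one slow step delays everything behind it: every tenant, pipeline and event goes through the same loop.

```python
# Hypothetical single-threaded "run handler" loop, for illustration only.
import time
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    name: str
    events: list = field(default_factory=list)   # pending trigger/result events

    def process(self, event) -> None:
        pass  # enqueue the change, advance its state machine, report, ...

@dataclass
class Tenant:
    name: str
    pipelines: list = field(default_factory=list)

def run_scheduler(tenants: list) -> None:
    """One loop drives every tenant, pipeline and event, so a single
    slow step delays everything queued behind it."""
    while True:
        start = time.monotonic()
        for tenant in tenants:
            for pipeline in tenant.pipelines:
                while pipeline.events:
                    pipeline.process(pipeline.events.pop(0))
        # The "run handler frequency" metric mentioned later is essentially
        # how many of these full iterations complete per time interval.
        time.sleep(max(0.0, 1.0 - (time.monotonic() - start)))
```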
The scheduler will schedule a merge via Gearman and receive the result, then request nodes. For this, we already had ZooKeeper in our system, for the node requests and the nodes. Then, with the nodes, it would schedule the jobs on the executors, receive the results back via Gearman, and finally report the result back to the change. In this picture we can see the problems that Felix already alluded to. We have just one instance, which means it's a single point of failure, and we cannot do rolling updates of the system. It also took pretty long for us to restart an instance, because we have such a big tenant configuration; it was around 30 minutes, sometimes even longer. And since there is a lot of in-memory state, we have queues, we have the pipeline state and a lot more, all of that is lost when you restart the scheduler instance.

We also have a second problem: there is one main event loop, which means all events need to go through this one loop. You can end up in a situation with resource contention; you can imagine it like a traffic jam, where events start to pile up and you get these ripple effects, the system gets slower and slower to the point where it's not responsive anymore. You can actually see this resource contention in this metric we have, called run handler frequency. It basically counts how often a scheduler goes over the tenants and the pipelines and processes outstanding events, shown per five-minute interval. Here we even have some points where there isn't a single full iteration during that time. What this means for the user is that the system feels totally sluggish and unresponsive, and you get people asking: why is my change not starting, why is it not showing up on the status page?

So, coming now to version 5, where we are able to run multiple schedulers. One thing I noticed before: I forgot the database as a component, also in the previous diagram, but it's not of interest for what I want to show here. I want to go through the example once more, now with the new architecture. Again we have a connection event coming in, for example a pull request. Zuul web now adds it to a queue-like structure in ZooKeeper, which means it's persisted. From there, one of the schedulers picks up this event, does some pre-processing, and adds a trigger event to the trigger queue, which is also in ZooKeeper. Then one of the schedulers picks up this trigger event and enqueues the change as before. All the merge requests and also the job execution are done via job queues and result queues in ZooKeeper, and we also have the pipeline state itself in ZooKeeper. This basically means that when I restart a scheduler, the state is not lost: there is still a bit of in-memory state, but we can reconstruct it from the data we have in ZooKeeper. This now solves the rolling upgrade problem, and we can also run multiple schedulers in parallel. I'll get to the level on which we can parallelize in a minute. But first: from this diagram, it looks like we just moved some things from in-memory state to ZooKeeper. So why did it take up to a year for us to do that?
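As a side note, to make "a queue-like structure in ZooKeeper" concrete: Zuul implements its own event queues on top of ZooKeeper, but purely as an illustration of the idea, here is a sketch using kazoo's LockingQueue recipe. The hostname, path and event format are assumptions. The point is that the producer persists the event and the consumer only removes it after successful handling, so a crash in between does not lose it.

```python
# Illustration only: not Zuul's actual queue implementation.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")   # assumed ZooKeeper endpoint
zk.start()
events = zk.LockingQueue("/ci/trigger-events")

# Producer side (e.g. the web component receiving a webhook): the event
# is persisted in ZooKeeper before anything processes it.
events.put(json.dumps({"type": "pull_request", "action": "opened"}).encode())

# Consumer side (one of the schedulers): the item stays locked to this
# consumer and is only removed once handling succeeded.
raw = events.get(timeout=10)
if raw is not None:
    event = json.loads(raw)
    # ... pre-process, add a trigger event, enqueue the change ...
    events.consume()

zk.stop()
```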
While it's true that, in the end, we mostly had to move things to ZooKeeper, we first had to work on a lot of primitives that we could later leverage to move the data more easily. So yes, we moved some of the queues we had before, but there were also new concepts we had to come up with, especially around config loading and configuration, like how to make the schedulers coordinate with each other. Initially, we planned to move the pipeline state to ZooKeeper first and then continue with the parts around it, but we figured out pretty early on that there were too many code-level dependencies. So what we did instead was to start with the easy structures that had no dependencies, or dependencies that were easy to break, move those to ZooKeeper first, and finally move the pipeline state once we had eliminated all or most of the dependencies and it was easier to move.

Coming back to the question: how do we run things in parallel on multiple schedulers? The basic unit that we can use for parallelizing things is a pipeline. In other words, a pipeline can only be processed by one scheduler at any given point in time. To achieve that, we have several means in place, and I promise this is as technical as it gets. We have two distributed locks. I know there is no such thing as a perfect distributed lock, but we also have other means in place to deal with that. We have a tenant read-write lock, and I'll get to why we have it in a minute, and we have a pipeline lock, which acts as a kind of mutex. When a scheduler processes a pipeline, it first acquires the tenant read lock, and this can be done by multiple schedulers at the same time. Then it picks a pipeline, acquires the pipeline mutex, and then it can process events and process the pipeline.

There are certain things that require exclusive access to a tenant, mainly reconfigurations. Whenever there is a change to the Zuul configuration, the scheduler needs to do something called a reconfiguration: it needs to load the new configuration and create something we call a layout state, which is a new version of the configuration, and then all the other schedulers need to follow and update their local version of it. For this, we have the write lock. The scheduler tries to acquire the write lock, waits for all the other schedulers to leave the critical section and release their read locks, and then it can process the reconfiguration, create the new layout state, and release the lock again. Afterwards, the other schedulers, as mentioned, can go ahead, update their local version and process pipelines again.

This brings us back to the metric I showed you before. This is a screenshot from last week, also showing the run handler frequency, basically how often a scheduler goes over the pipelines and processes events. As Felix mentioned, we are running nine schedulers at the moment, and this is now much more consistent. It means the system feels snappier to the user: you see that changes show up almost immediately on the status page. And this was taken during a time when we had quite a high load, around 6,000 jobs per hour. With that, I will hand over to Thomas, who will wrap things up.
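To illustrate the locking scheme Simon just described, here is a minimal sketch using kazoo's lock recipes. The lock paths, the function names and the use of ReadLock/WriteLock (available in newer kazoo versions) are assumptions for illustration, not Zuul's actual classes.

```python
# Sketch of the tenant read-write lock plus per-pipeline mutex, for
# illustration only.
from kazoo.client import KazooClient
from kazoo.recipe.lock import ReadLock, WriteLock

zk = KazooClient(hosts="zookeeper:2181")   # assumed endpoint
zk.start()

def process_pipeline(tenant: str, pipeline: str) -> None:
    # Shared access: several schedulers may hold the tenant read lock at
    # the same time and work on different pipelines of that tenant.
    with ReadLock(zk, f"/ci/locks/{tenant}"):
        # Exclusive per pipeline: only one scheduler drives this pipeline.
        with zk.Lock(f"/ci/locks/{tenant}/pipeline-{pipeline}"):
            pass  # ... process events, advance the pipeline state ...

def reconfigure(tenant: str) -> None:
    # Exclusive access: the write lock waits until all readers have left
    # the critical section, then the new configuration ("layout state")
    # can be created and published for the other schedulers to pick up.
    with WriteLock(zk, f"/ci/locks/{tenant}"):
        pass  # ... load config, create and publish the new layout state ...
```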
So, looking back a bit at the results and how it went. First of all, we actually did have a problem when we started deploying this. You have seen from Simon how much has been moved to ZooKeeper to have this state stored centrally, and the data size of that turned out to be quite huge. So we first had to optimize that to get it working at our load, and this might be a topic that needs a bit more attention in the future as well.

Having said that: the development overall, I mean, this was a huge change. You have seen how much was changed, and it really required focus. Working on this here with you and the community went very well, so thanks for that. I think we really managed something, together with Jim and others. Since we deployed it, it has enabled us to do a lot of rolling updates. Before that, we really had maintenance breaks; now there are no maintenance breaks anymore for our CI, which is cool. And before that, we were really afraid of any problem in the single scheduler, a memory problem or the like. Now, if one scheduler dies, no problem. That's great as an operator.

And here, again, you see the user impact. This is a job that runs regularly, periodically, and always executes the same software build, and these are its startup times, so this is what the user actually experiences. You can see we had some really horribly long startup times before. Since we deployed the multi-scheduler setup, and did some more optimizations, it's actually quite okay. I wouldn't say it's perfect, there are still things we can do, but it's really good.

I don't know if you know this guy here: it's Eliud Kipchoge. He ran the marathon world record here in Berlin in 2018. And believe it or not, he is still training really hard to improve even further and maybe run below two hours in an official marathon. I think that's the right spirit: we can do even more, we can improve things further together. That's the power of open-source software. Thank you. And I think we still have time for one or two questions, if you are interested. If not, you can also find us later during the breaks; if you are interested in something, come around and talk to us. Okay. Thank you. Thanks.