So, yeah, from outages to SLOs: that's the topic of this talk. Hello and welcome, everybody. The subtitle is "focusing on BOSH performance." Just a quick show of hands: who's here for the outages? That's going to be the fun part, right? And who's here because you're interested in the SLOs, the service-level objectives? A few people? Okay. So the rest of you care about the performance stuff? Cool. Right.

I'm a developer at SAP. I'm currently the BOSH PMC lead, and I'm the PM of the BOSH core team in Europe. We are a team of ten developers at SAP, and besides contributing to BOSH and maintaining the BOSH OpenStack CPI, we are also the team operating BOSH in our largest SAP-internal Cloud Foundry installation. If you're wondering what "largest" means in terms of scale, I'll tell you in a second. Here are a few ways to reach out to me after the talk, and you can of course also ask questions at the end; you know the deal.

Today we're going to talk about three things in particular. The first part is about outages that may or may not have happened exactly like this in our production systems. Then we'll talk about the improvements we made as reactions to those outages. And lastly, we're trying to switch from a reactive to a proactive mode by setting service-level objectives, monitoring them, and figuring out how to react before bad things happen.

Okay. So first, let's take a step back. Why is BOSH performance important, and why does it matter now more than ever? Something seems to have changed; what is it? Previously, BOSH was more or less an operator tool. People used BOSH whenever they wanted to change the state of a system: you needed to scale up your Diego cells, or scale down because there was less load running on the system than before, or roll out a new stemcell, an operating system or software update, whatever. And of course, BOSH is always there to auto-repair your system if something goes wrong.

But then something happened: BOSH became a runtime dependency. How so? On-demand service brokers happened. Now your customers typing cf create-service are talking to BOSH, which means that if they can't do this, they are angry, because they don't care about BOSH or whatever else you might be using to give them their Postgres database. And they can't give you money, which is always bad; we want that money. If you don't know exactly what on-demand service brokers are and you want to learn more, there's a talk by Zoe Vance and Denise Yu tomorrow afternoon. I've heard it has dinosaurs in it. So please attend; it's going to be awesome, I'm pretty sure.

All right. One quick look at the architecture of BOSH. You don't need to understand all of the details here; I'm going to quickly highlight a few of the things that matter for this talk. First, you can see that we have, as I said, an additional consumer of BOSH: the on-demand service brokers, and they talk to the director, shown here as one big box. We're lying a little there, and I'll show you in a second why. We also have other components, such as the Health Monitor, and the VMs have their agents. Those components are decoupled, because they use NATS as a message bus to talk, mostly to the director. So this seems like an architecture without obvious bottlenecks. It should scale. It should work for thousands of deployments, right?
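As a small illustration of that decoupling: the Health Monitor learns about VM health purely by listening on NATS. Here is a minimal sketch of such a listener in Ruby with the nats-pure gem; the heartbeat subject follows the Health Monitor's naming convention as I know it, so treat the details as assumptions rather than the actual bosh-monitor code:

```ruby
require "nats/client"  # nats-pure gem

# Connect to the director's NATS and listen for agent heartbeats.
# Subject layout (hm.agent.heartbeat.<agent_id>) is how the BOSH
# Health Monitor conventionally receives them; illustrative only.
nats = NATS.connect("nats://10.0.0.6:4222")

nats.subscribe("hm.agent.heartbeat.>") do |msg|
  agent_id = msg.subject.split(".").last
  puts "heartbeat from agent #{agent_id}: #{msg.data}"
end

sleep  # keep the process alive so the subscription keeps firing
```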
OK, so to summarize this prologue: we have new users with new requirements, we have new usage patterns, people are using BOSH at runtime and relying on it being there, and we have a new scale of BOSH usage and deployments on our directors. Which brings us more or less directly to this: bad stuff happens. Or, as I prefer to call it, evidence that we can do better. This is not necessarily bad stuff; this is just our system telling us there is something we can improve.

Your mileage may vary here, which means the problems I'm going to talk about might not occur in your current system, because of scale. We are currently running a separate BOSH for our on-demand services, all of them on a single director, and right now that's more than 1,000 deployments and a little more than 6,000 instances on that single director. That's why we started seeing, and keep seeing, certain issues that you may face in the future, or may already have seen at a somewhat smaller scale.

So, outage one. I like to call it "HTTP 502 Bad Gateway." Who has seen an HTTP 502, or knows what it is and when it happens? Okay, a few people. Here's a recap of what happens. As I said, the architecture diagram lied a little by depicting the director as one big box, because it's actually two boxes: we have an NGINX in front, which does all the TLS termination, and behind that NGINX we have the director itself, a Ruby app. A consumer sends a request to NGINX, NGINX terminates TLS and forwards the request to the director, you get a response, you know the deal.

But let's say that for some of the requests, the director takes a while, because of reasons: you have issues in your code, you have a thing that's just slow, whatever. Then what happens when you add more consumers doing more requests? Some of them still get valid responses. But at a certain point, your Ruby app will be busy enough processing other things that it stops accepting incoming requests and stops issuing responses, which means NGINX sends you a 502, meaning: that backend I'm supposed to talk to isn't responding properly.

In our case, there was one specific endpoint that was the easiest candidate for saturation: getting the list of all deployments on the director. As you saw, that's several thousand, and that endpoint sometimes didn't respond well. The catch was that retrying fixed it. So we recognized there was an issue, but people were saying, well, I just did the thing again, and it seems to work. Unfortunately, that meant we didn't prioritize it as highly at the time as we perhaps should have.
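To make that failure mode concrete, here is a toy Rack app, purely illustrative and not the director's code, that shows how a single slow endpoint on a single-threaded server starves everything behind the proxy:

```ruby
# saturation.ru: run with a single-threaded server, e.g. "rackup -s thin",
# then fire a handful of concurrent requests. One slow /deployments call
# blocks all the others, and an NGINX proxying to this backend will start
# answering 502 once its upstream stops responding in time.
run lambda { |env|
  sleep 20 if env["PATH_INFO"] == "/deployments"  # the slow endpoint
  [200, { "content-type" => "application/json" }, ["[]"]]
}
```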
Which brings us to outage number two: the BOSH Health Monitor flood. A few words about the Health Monitor itself. The Health Monitor is stateless. It's there to compare your desired state, the deployments and instances you want to have, with the actual state in the infrastructure, and it alerts and reacts if it finds a difference between the two. Here's how it works. We have three things: the director, the Health Monitor, and all the VMs we have deployed. The Health Monitor requests the list of deployments from the director, and for each of those deployments, it requests the list of instances that are supposed to be in it.

All the VMs regularly send heartbeats that the Health Monitor listens to. Every 60 seconds, the Health Monitor compares the two. If something bad is detected, it posts an event, and potentially it schedules a scan-and-fix to bring up instances that might have failed in the meantime.

So now let's look at what happens if we take this stateless system, the Health Monitor, and feed it outdated information. Any guesses? Here's the process again: get the list of deployments, and for each of those, get the list of instances. But now let's say our deployments endpoint takes 20 seconds to answer. As we iterate through all the deployments and fetch the list of instances for each one, by the time we reach deployment number x, the data we're acting on is up to 20 times x seconds old. With 1,000 deployments, that's up to 20,000 seconds, more than five hours, for the last one. If you have thousands of deployments, that's going to be a pretty large number at some point.

And as I said, the Health Monitor sends alerts. For example, when it hears a heartbeat from an agent that the director doesn't know about, it sends an alert, and it does so every 60 seconds. Now let's say you're running a stemcell upgrade across your 6,000 VMs, which means all of them get new agent IDs. Your Health Monitor will consider more or less all of those 6,000 VMs to be agents that shouldn't be there, because the BOSH database didn't provide it with recent information. Which means this is going to happen: in our case, approximately 1,800 events per second were being posted from the Health Monitor to the BOSH events API. So essentially, the Health Monitor just killed your director. You couldn't use it, you couldn't do anything, you couldn't even fix it.

If that happens to you, there is no direct fix for it as of today. What you can do is raise the threshold for the rogue agent alert, as it's called, to some incredibly high number. That just means that if the Health Monitor finds an agent it doesn't know about yet, it doesn't do anything, which works.

Okay. Now let's take those two outages and look at what we did in response, and try to come up with some improvements. First, what we wanted. We wanted to improve retrieving the list of deployments, because that seemed to be the first major pain point we hit. We wanted easier parallelization of request processing, because, and I'll get to it in a second, the way the web server of the director Ruby app was built and used unfortunately didn't do that very well, or didn't make it very easy. And we wanted to use multiple cores. It turns out our director is a 16- or even 32-core machine, and the director web server was using one. One.

So here's how we did it. To get a faster GET /deployments endpoint, we looked at a few SQL queries and joins in the code and tried to fix them. For easier parallelization, we switched the web server used within the Ruby app to a different one, one that can run multiple web workers and use the multiple cores in your system. On the query side, you'll find code looking like this: an accessor getting all deployments, sorted by name. That looks very innocent, right? It just gets all the deployments, ordered by name. Sounds reasonable.
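Here is a sketch of what such an accessor might look like with Sequel, the ORM the director uses; the model and association names are illustrative, not the actual director code, and the eager-loaded variant is the fix I'll get to in a second:

```ruby
# The innocent-looking version: fetch all deployments, sorted by name.
# Sequel lazy-loads associations, so the releases, stemcells and configs
# of each deployment get fetched later, one query at a time, while the
# response is being serialized.
deployments = Models::Deployment.order(:name).all

# The fix: eager-load the associations up front, so the query time is
# spent where you think it is spent (association names illustrative).
deployments = Models::Deployment
              .order(:name)
              .eager(:releases, :stemcells, :teams)
              .all
```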
However, to understand this, you need to know that Sequel, the object mapper used here, lazy-loads all associations. So while you think you're spending all your query time in that accessor, you actually spend it later, when you try to serialize the result and send it to the client. Which is terrible, because if you're running a single-core, single-process, single-threaded web server, this blocks it for 20 seconds. And the fix looks very innocent and very simple: you eager-load all of that, so you make sure you spend the processing time where you think you're spending it.

As I said, the web server before was Thin, and that's a single-process, single-threaded server that uses eventing to achieve concurrency. With a SQL query like that, you think, well, I'm just going to defer that slow query, and then you can switch to a different request in the meantime. But it turns out you spent the time elsewhere, in a place you forgot to defer. That's what I mean when I say it's hard to get concurrency right with eventing instead of multi-threading. That's why we switched to Puma as the web server. Puma can use multiple cores by spinning up multiple workers, and it's properly multi-threaded.

That means you get results, if you actually compare the performance, that looked like this. On the x-axis, we have throughput in requests per second; on the y-axis, latency measured in seconds. The red dots are what the latency does with Thin, as before, and the blue ones are what Puma gives us. This graph also includes the changes we made to the deployments endpoint. With this combination, we achieved roughly a factor of 10 between the two configurations. So that was a nice improvement.

At some point, though, even that wasn't enough, which meant we needed to change the regular GET /deployments endpoint into something like this: we introduced additional query parameters to exclude a bunch of stuff we don't care about when we ask for the list of deployments. Traditionally, the endpoint also goes to the database and selects all the configs, all the releases, and all the stemcells associated with each deployment. You remember that complicated giant SQL thing? Just get rid of it if you don't need it. So where the JSON you used to get contained your deployment name plus a bunch of things you don't care about, stemcells, releases, configs, and so on, now you pass the exclude parameters and get exactly what you care about: the list of deployments.
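In practice, that looks roughly like this. The exclude parameter names match the director API as I know it, but verify them against your director version; the address and the token handling are assumptions made for the example:

```ruby
require "json"
require "net/http"
require "openssl"
require "uri"

# Trimmed deployments listing: skip the configs, releases and stemcells
# that the full endpoint would otherwise join in and serialize.
uri = URI("https://10.0.0.6:25555/deployments" \
          "?exclude_configs=true&exclude_releases=true&exclude_stemcells=true")

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE  # demo only; verify your director's cert

request = Net::HTTP::Get.new(uri)
request["Authorization"] = "Bearer #{ENV.fetch("BOSH_TOKEN")}"  # assumes a UAA token

puts JSON.parse(http.request(request).body)  # => [{"name" => "..."}, ...]
```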
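And for completeness, the web-server switch mentioned above boils down to a configuration along these lines; a minimal sketch with illustrative numbers, not what the director actually ships:

```ruby
# config/puma.rb: fork several workers so the Ruby app can use several
# cores, and run multiple threads per worker so one slow, blocking
# request no longer stalls every other request.
workers 4          # roughly one worker per core you want to use
threads 1, 32      # min/max threads per worker
preload_app!       # load the app once, then fork the workers
bind "tcp://127.0.0.1:8080"  # NGINX proxies to this backend
```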
Right? Okay. So much for the reactive mode; let's switch to the proactive mode: service-level indicators and objectives. To reiterate, the situation we have is new users with new requirements, new usage patterns, and a scale we haven't seen before. So we have to ask ourselves: what do our new users with their new use cases actually care about, and which of those things can BOSH influence, and how?

Users usually think: I'm typing cf create-service, so you'd better accept my task; I don't want to do this again; just take it and give me my database at some point. I'm also concerned with how long this even takes. Does it take a day? Five minutes? I don't know, you tell me. And it had better be successful, because otherwise I'm opening a ticket saying, well, my service just didn't show up.

So what are the things we need to care about in BOSH? Obviously, accepting the task depends on the API's success rate. I'm submitting a new task; if BOSH is able to store it in the database and at some point pick it up from the queue, that's good, that's what I care about. For that queue, what BOSH can influence is the time a task spends waiting in it. We can't influence much how fast the infrastructure actually provisions things; if provisioning a VM takes five minutes, then it takes five minutes. And in terms of being successful, we can try to include mechanisms that make us more resilient in certain scenarios.

Okay. When we looked at this and tried to figure out how to get to service-level indicators, and objectives for those indicators, one thing you end up finding over and over is the four golden signals. They're in the Google SRE book, and in a bunch of blog posts on the matter. The idea is that for every service you operate, it's worthwhile to look at four things: traffic, latency, errors, and saturation. In BOSH terms, I guess that means: the request rate, our incoming traffic, which is a good thing that we want to keep, not stop; the request response time, that is, how long it takes to respond to a request, and whether we're seeing the slowness from before, with requests taking seconds or more; the waiting time in the task queue; and the success rate of API requests, whether we're rejecting things with HTTP 502 or with similar or different errors.

In terms of resilience, I guess most of the resilience work is actually CPI work, because the CPIs abstract what the infrastructure does. For example, in the OpenStack CPI, one common problem is that network ports tend to still be there although they should be gone, while you're trying to create a VM using that specific IP. What we did is: if the port is there but unattached to a VM, we try to reuse it instead of throwing an error, as it did before. Stuff like this is probably what you want to do in all of your CPIs. What we can also do is improve the workflows in BOSH itself. One example is the create-swap-delete VM strategy, where BOSH creates your new VMs first, before taking down the old ones. That means it can create all of the new VMs in parallel, and you don't have to spend that time waiting inside the serialized processing that happens afterwards.

Okay, so how do we get there? What's the road ahead, the stuff we're actually doing and pursuing right now? Currently, we're taking small steps by enabling telemetry for the director's and the blobstore's NGINX servers, so you can get decent statistics about things like: what does my HTTP connection queue look like, how busy is NGINX, should I scale it up, or am I doing well enough? For the blobstore, it's a similar story. If you introduce something like BOSH DNS, for example, that puts additional stress on the blobstore sitting at the BOSH side, so that's also something you may want to care about. And we're enabling basic monitoring for NATS as well, because more and more traffic is sent directly through NATS instead of taking other channels, such as the registry. So that's also something we want to keep an eye on for now.
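As one concrete example of that kind of telemetry, here is a tiny poller for NGINX's stub_status page. It assumes you've enabled ngx_http_stub_status_module on some internal port; the port and path here are illustrative:

```ruby
require "net/http"
require "uri"

# Scrape NGINX's stub_status output, which looks like:
#   Active connections: 3
#   server accepts handled requests
#    119 119 354
#   Reading: 0 Writing: 1 Waiting: 2
text = Net::HTTP.get(URI("http://127.0.0.1:8090/nginx_status"))

active = text[/Active connections:\s*(\d+)/, 1].to_i
reading, writing, waiting =
  text[/Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)/]
    .scan(/\d+/).map(&:to_i)

# Waiting (keep-alive) versus reading/writing gives a rough sense of how
# busy the proxy is; ship these numbers to whatever monitoring you use.
puts "active=#{active} reading=#{reading} writing=#{writing} waiting=#{waiting}"
```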
We've started to build a few dashboards, currently Kibana-based, and we're evaluating internally how well that works by looking at all of those indicators and their objectives. In the end, what we want to arrive at is some kind of operations guide that lists these indicators along with ways to look at them, monitor them, and potentially alert on them, so that you can more easily operate your own BOSH.

Another takeaway: performance tests, and especially automated ones, are very useful. Remember that graph I showed before, where we improved the performance by a factor of 10? Here is the same thing going backwards. Previously, this was the performance we were looking at, and then a new change got merged: we added a new feature called generic configs, which added a bunch of new configs. And the performance went from the red graph to the blue one. This is why automated performance tests are cool. A single feature gets merged, and nobody really thought it might impose additional load on a thing they maybe didn't care about or didn't really know about at that point, but those tests will tell you immediately, and that's a useful thing. We're trying to get our automated performance tests into the official BOSH pipeline at some point, so this kind of thing is detected automatically. If you know what the routing team has been doing with their automated performance tests, and what their graphs look like, you'll find a strong resemblance to ours. That's because we actually reused many parts of what they did: running automated load tests, creating those graphs in an automated way, and so on. That's been really helpful and great.

All right. To recap: we've seen why BOSH performance now matters more than ever. We've looked at a few outages as evidence that we can do better, and that there may be issues we need to do more about. I've talked about the recent performance improvements we made as a response, and I've tried to sketch how we're getting to a more proactive mode by defining service-level objectives. I think the gist of all of this is: we're actively trying to improve the operability of BOSH by switching from a reactive to a proactive mode, and to treat BOSH as a regular service, like pretty much any other service you run in your platform, which we haven't done previously.

All right, with that, I think that's all I've got. Thanks. I think we have a couple of minutes left, so: any questions?

Question from the audience: The graph that you showed for the latency, how was that generated?

This one? As I said, we took most of the parts from what the routing team does. They use a Golang-based load generator, and you just fire requests at the GET /deployments endpoint with a certain level of concurrency and requests per second and so on. Then you generate the scatter plot using Jupyter notebooks, and there you go. The line in the graph is just some kind of interpolation over the scatter plot.

Thanks. Other questions? If not, thank you very much, Marco. All right, then all that's left is this slide. All right, thanks.