My name is David Byron, I work at Salesforce, and I'm on the Technical Oversight Committee of Spinnaker. I've been at Salesforce about two years, working on Spinnaker the whole time, and I was working on Spinnaker for a few years before that at another company. I'm going to talk a little about what we've been up to with Spinnaker. I'll end up getting into some potentially low-level bits and pieces of Spinnaker architecture, so if I say something that needs explaining, stop me and I'll explain it. And if you have questions along the way, fire away; you don't have to wait until the end.

The beginning for Salesforce was, like I said, around 2018, before I got there, so we've been using it for a while. Initially we were just trying to get things working very quickly; that turned into getting things working at a pretty big scale. Scale for Salesforce means lots of all of these things: lots of accounts, lots of applications, lots of pipelines (hundreds of thousands of them, which will become more of a focus of the talk later), deeply nested pipelines, and around 400,000 executions a day across all of our Spinnaker instances. Along every dimension I can think of, we're deploying to lots of regions, we have lots of images, lots of users, lots of everything. So we run into limits all the time, and knocking those limits down is what we spend our time doing.

Here's a picture of the bits of Spinnaker we're going to talk about today. In there we've got Orca, CloudDriver, Echo, and Front 50. These are the names of Spinnaker microservices, and this is a copy of a slide from the Spinnaker website. We'll get into this in more detail going forward, but one thing to note: in addition to what it says there about Echo, that it sends notifications and receives webhooks, Echo can also receive Pub/Sub messages to trigger pipelines.

I want to talk about some of the things we've done to make Spinnaker work for us. These are some of the not-so-recent fixes we made; the ones marked "upstream" have gone upstream. They're pretty straightforward. The out-of-the-box endpoints for saving pipelines only let you save one at a time, but with as many pipelines as we have, and as often as they change, we needed to save a bunch at once, so we built that.
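(To make the bulk-save idea concrete, here is a minimal client-side sketch in Java. The endpoint path `/pipelines/batchUpdate`, the host name, and the payload shape are assumptions for illustration; check the upstream Front 50 changes for the actual route.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class BulkPipelineSave {
  public static void main(String[] args) throws Exception {
    // Hypothetical Front 50 address; in a real deployment this would be
    // whatever your Front 50 (or Gate) endpoint is.
    String front50 = "http://front50.example:8080";

    // Each entry is one pipeline definition as JSON; in practice these come
    // from wherever you generate or store pipeline configuration.
    List<String> pipelines = List.of(
        "{\"application\":\"app1\",\"name\":\"deploy\"}",
        "{\"application\":\"app1\",\"name\":\"rollback\"}");

    // One request carrying many pipelines, instead of one request each.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(front50 + "/pipelines/batchUpdate")) // assumed path
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(
            "[" + String.join(",", pipelines) + "]"))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
  }
}
```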
Front 50 is, I guess it's fair to call it the data store: the repository of things like pipeline configurations and applications, all those things from the earlier slide that we have lots of. They live in Front 50, so to be smart about not hitting the database all the time, Front 50 has an in-memory cache, and we did some work to make the caching and the SQL queries more efficient. I just got those upstream; they're merged into master. Version 1.31 of Spinnaker doesn't exist yet, but it will real soon now. All of the features you're going to hear me talk about today are behind feature flags that are off by default, so it might not be until 1.32 that they're on by default. We'll see how people do with them out in the world.

Compressing pipelines: I'm not going to talk about that too much today, but pipeline executions are another place where Spinnaker has a big blob of JSON, and our blobs were so big that they didn't fit in the database anymore, so we had to compress them. There was some excitement, or spirited conversation, about whether this was a good idea and about taking the changes upstream. I think we're going to try again to get those changes upstream soonish; we have a big backlog of stuff, so it's hard to get it all up.

We had a bunch of issues with CloudDriver startup time, especially with lots of accounts. Before the new account APIs showed up, basically the only way to do it was a big YAML file, and it took a long time to parse the YAML file and get things going, so we made that faster.

On-demand caching is something I don't think I've talked about much, and I want to dive into it a little more here. People generally know that most of where Spinnaker spends its money, its computing power, is poking around at the accounts and caching information there, to display in the UI or to operate on in pipeline stages. Basically, every time somebody adds a new account to Spinnaker, you're spending money, and even if they never, ever use that account, you're still spending money. We got tired of doing that, and that's what the on-demand work is, so let's go into a little more detail. Again, I'm going to try to get these changes upstream, but they're not there yet.

On startup, it's just like it was before: say you have 1,000 accounts; caching happens for all 1,000 accounts. But a timer starts ticking, and if nobody runs a pipeline that uses an account, eventually that timer expires and we stop caching it. That all by itself is not particularly earth-shattering, and keeping track of when these operations happen so we can restart the clock is all fine. For Kubernetes accounts, that was kind of it. I don't know how much people have traced the evolution of Spinnaker's Kubernetes behavior, but Kubernetes pipeline stages used to depend on the cache to operate. Around version 1.23, "live manifest" mode became the default, and Kubernetes pipeline stages stopped depending on information in the cache. So if an account is idle here and we don't cache it, it doesn't really matter; the pipeline still works. The consequence is that you don't see information about that account in the infrastructure tab.
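(A minimal sketch of that idle-account bookkeeping, with invented names; the real change lives in CloudDriver's agent scheduling, but the shape of the idea is this: record when an account was last used, and stop scheduling its caching agents once it has been idle past a threshold.)

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AccountActivityTracker {
  private final Map<String, Instant> lastUsed = new ConcurrentHashMap<>();
  private final Duration idleTimeout;

  public AccountActivityTracker(Duration idleTimeout) {
    this.idleTimeout = idleTimeout;
  }

  /** Called whenever a pipeline operates on the account; restarts the clock. */
  public void recordUse(String account) {
    lastUsed.put(account, Instant.now());
  }

  /** The caching scheduler consults this before running an account's agents. */
  public boolean isActive(String account) {
    Instant seen = lastUsed.get(account);
    return seen != null
        && Duration.between(seen, Instant.now()).compareTo(idleTimeout) < 0;
  }

  public static void main(String[] args) {
    AccountActivityTracker tracker =
        new AccountActivityTracker(Duration.ofDays(7));
    // On startup, seed every configured account so caching begins just as
    // it did before; idle accounts fall out later when the timer expires.
    List.of("aws-prod", "aws-dev").forEach(tracker::recordUse);
    System.out.println(tracker.isActive("aws-prod")); // true, until idle too long
  }
}
```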
It turns out AWS accounts were more complicated, because of the find image stage (Find Image from Tags, I think it's officially called). All that stage does is interact with the cache, so if the cache is empty and you say "find me an image," you're not going to get an image. If you haven't been caching images for that account, you're toast, so we had to do something to fix that. There's already a notion of "force cache refresh," another gnarly detail of how AWS pipeline stages work, and we looked at it and thought, well, there's already this force cache refresh thing, let's try to use it. It just seemed too hard, and we gave up; it's complicated. One of the goals was to not do extra work because of this feature; we wanted to do only the amount of work we were doing before. So, maybe we invented a bad name for this, but we invented a concept called warming the cache. Now, at the beginning of a Find Image from Tags, there's a warm-cache step. If the account is already active, warm cache doesn't do anything, but for an idle account it reactivates the account, and then it sits there and waits for the normal caching to happen. The account comes back to life, Find Image from Tags waits for caching to finish, and everything continues on just as before: information gets written to all the same tables, there's no extra work, and the delay is typically very small, since it doesn't take long to schedule these caching agents.

One side note: in Find Image from Tags you can specify an account, and the fact that you don't have to is actually one of the complications, because then we have to spin up caching agents for all the accounts. So specify the account; you probably know which account you're looking for the image in. I don't think there's a field in the UI to put it there, but if you hand-edit the JSON or build the JSON through some other means, specify the account, and specify the region too if you know it. It cuts down on the amount of work done.
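(Here is a minimal sketch of that warm-cache step, again with invented names, under the assumption that something exposes whether an account's caching agents have completed a run; it illustrates the flow described above, not the actual implementation.)

```java
import java.time.Instant;

// All names here are invented for illustration.
interface AccountTracker {
  boolean isActive(String account);
  void recordUse(String account); // restarts the idle timer
}

interface CachingStatus {
  // True once the account's caching agents have completed a run after `when`.
  boolean hasCompletedCycleSince(String account, Instant when);
}

class WarmCache {
  private final AccountTracker tracker;
  private final CachingStatus status;

  WarmCache(AccountTracker tracker, CachingStatus status) {
    this.tracker = tracker;
    this.status = status;
  }

  /** Runs at the start of Find Image from Tags, before the cache is read. */
  void execute(String account) throws InterruptedException {
    if (tracker.isActive(account)) {
      return; // already being cached on the normal schedule; nothing to do
    }
    Instant reactivatedAt = Instant.now();
    tracker.recordUse(account); // the account comes back to life
    // Wait for the regular caching cycle rather than forcing extra work;
    // in practice this delay is typically small.
    while (!status.hasCompletedCycleSince(account, reactivatedAt)) {
      Thread.sleep(1_000);
    }
  }
}
```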
So now that I've talked about the old stuff, I'm going to get to the good part. Every day since 2018, it feels like more users are coming on, more teams, more pipelines, more everything, and we've had to work hard to keep up. I want to talk about the teams at Salesforce that do this. There's my team, known as the Spinnaker product team, and we're nicely siloed: we get to focus on the Spinnaker code base, and we don't have direct operational responsibilities; we're something like the third level of people that get escalated to when there are issues. And there's a whole team, the Spinnaker service team, that is super operationally focused and really makes our life a lot easier. One of the things they've done over the past few months, and this sounds like a very easy thing but turns out to be pretty significant, is access logs. We now have NGINX running as a sidecar in front of all the Spinnaker microservices, so when two services like Orca and Front 50 talk to each other, they're actually talking through NGINX, and it's pretty easy for us to see what's happening traffic-wise. I want to give a big shout-out to the service team, specifically Tavis Paquette, who collected a lot of the metrics and data you're going to see me present shortly.

Once we had these access logs, you wake up one day and realize there's a hotspot, and it's a really hot hotspot. Let's see what it was and what we did to fix it. It's all around retrieving pipeline configurations, and the slide lays it out for you: every time a pipeline completed, Orca was querying for all the pipelines, and every 30 seconds, every Echo pod was querying for all the pipelines, all the time. We were just sitting here warming the globe, querying for all these pipelines and mostly dropping the results on the floor. What did we need to be doing this for? There is a feature in Spinnaker where the completion of one pipeline can trigger another pipeline, so Orca does actually need to know some of this information, but not all of it. And Echo doesn't need to know about every single pipeline it might possibly trigger; it only needs to know about the pipelines it's actually going to trigger now. There's a bit of a special case with manual execution, but we'll get to that in a second.

So that's how we got here. The before state, querying all this stuff, is crazy talk. It turns out it's so much crazy talk that 93.5 percent of all the data flowing around in Spinnaker was pipelines. People talk about CloudDriver this and Orca that, but it's pipelines: pipeline configuration JSON. Like I said, we have a ton of pipelines, probably more than other people, but still, this is happening for everybody. Tavis took measurements over a 48-hour period, and that's the number that came out.

So then we had to fix it. These are the PRs we put up, and they've been merged; like I said, they're going into 1.31 behind feature flags that default to off. Let me explain what we did. Orca was querying for all the pipelines and then looking for the ones it actually cared about, basically dropping 99 percent of what it just asked for on the floor. Instead, and this doesn't feel like rocket science, take the filtering and move it into Front 50; then Front 50 only returns the information Orca actually needs, and everybody wins. Similarly, only some pipelines have triggers defined. Echo is not going to be triggering every pipeline, only the ones with triggers, so only ask Front 50 for pipelines that actually have triggers. There's slightly more detail to it: triggers can be enabled or disabled, so you only want the enabled ones. But that was it.

The thing about manual execution is that when you click the play button in the Spinnaker UI, it maybe doesn't feel like a trigger the way a Pub/Sub message or the other more obvious trigger mechanisms do. But the UI ends up calling through Gate, which calls Echo, which triggers the pipeline, so if Echo doesn't know about every pipeline, it can't trigger every pipeline. So there was a bit of a dance. It turns out Echo could already query for a single pipeline, and when you run a manual execution you don't have to query for all pipelines; you know the ID of the one pipeline you're executing, so only query for that one. And then you no longer need to query for them all every 30 seconds, all the time. So we did it; the shape of the filtering is sketched below.
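(An illustrative sketch of the server-side filtering idea. The record types are invented stand-ins for Front 50's pipeline model, and in the real change the filtering can happen in the SQL query itself rather than in memory; the point is just the predicate: return only pipelines with at least one enabled trigger.)

```java
import java.util.List;

// Invented stand-ins for Front 50's pipeline model.
record Trigger(String type, boolean enabled) {}

record Pipeline(String id, String application, List<Trigger> triggers) {}

class PipelineFilter {
  /** Returns only the pipelines Echo could actually trigger right now. */
  static List<Pipeline> withEnabledTriggers(List<Pipeline> all) {
    return all.stream()
        .filter(p -> p.triggers() != null
            && p.triggers().stream().anyMatch(Trigger::enabled))
        .toList();
  }

  public static void main(String[] args) {
    List<Pipeline> pipelines = List.of(
        new Pipeline("a", "app1", List.of(new Trigger("cron", true))),
        new Pipeline("b", "app1", List.of(new Trigger("pubsub", false))),
        new Pipeline("c", "app2", List.of()));
    // Only pipeline "a" survives: it has an enabled trigger.
    System.out.println(withEnabledTriggers(pipelines));
  }
}
```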
And let's see what happened. These graphs are going to be maybe a little small and a little confusing at first glance, because there are two different x-axes; we rolled out the changes in two phases, the Orca changes first and the Front 50 changes second. I'll just skip to the bottom line: we got a 60 percent reduction in our bill. That's a big deal; Salesforce spends a lot of money on Spinnaker. You can see the CPU usage in Orca. This is where the glasses and the screen aren't helping me; let me go over here and walk through it with you. The top one is a purple line that's just about illegible. On the second one you can see Echo go down a little bit later. The bottom two are both Front 50: in the second from the bottom, in purple, you can see the first phase of Front 50 CPU reduction, going from somewhere between 60 and 80 cores down to 10 or so. And on the bottom, after the Orca change went in and then the Echo change went in, it went from 10 down to 5. It's like a whole new Spinnaker. Basically, anybody doing anything operational with Spinnaker, their blood pressure went way down; everybody got very excited. Instance counts went way down; everything went down.

It's still kind of an annoyingly big number: 13 percent of all the traffic in Spinnaker is this one thing, which seems like a lot. But it's not 93.5 percent, it's only 13 percent. I can think of ways to make it smaller, like shipping deltas around and only shipping things that have changed instead of everything, but we stopped caring, because this was pretty cool and now we have some other things to focus on. Like I said, 1.31 has all of this, defaulted to off, but we will get it turned on by default, I'm guessing in 1.32, but we'll see.

So, what's next? Well, just last week we turned off caching for Kubernetes accounts, and if you don't care about the UI for Kubernetes accounts, you get another cliff for CPU usage to fall off of. So we did it. There's a tension there: for a lot of people, Spinnaker is a great UI to look at everything. It's certainly a great UI for AWS resources; it's maybe not as exciting for Kubernetes, but it's better than nothing, and if you don't have another dashboard solution, it's great. We have another dashboard solution, so we get to save a lot of money here.

I want to hammer this home and pause for a minute. One of the big knocks on Spinnaker has always been that it's super expensive to operate. It's still kind of a beast, but it's a lot less of a beast than it was three months ago. It doesn't have to be so bad, and we continue to work on making it better. Maybe we're more motivated than most because our bill is big and we have a lot of big Spinnaker instances, but we are hammering on it to make it better. It does not have to be so expensive.

Let's see if I push a button. I tried to push a button. So, what's next? In a way, our focus hasn't really changed; it's still basic functionality and scale, but we keep moving the goalposts. What we used to struggle with we now take for granted, and it just keeps getting harder while scale keeps getting bigger. So let's look at some of the stuff we're going to work on. Security vulnerabilities are always on the radar, so this is a little peek into that. I think we've talked about this in some other forums, but we're trying to keep moving with Gradle and Groovy and Java 17 and more Spring Boot upgrades, finally, hopefully, getting to a version of Spring Boot that's actively maintained,
so it's not as big of a deal to stay up to date. Then there's the Retrofit 1 to Retrofit 2 change. Retrofit 1 is, I guess, just "Retrofit," but I call it Retrofit 1 because I get confused otherwise. It's subject to vulnerabilities, but it's also sort of orphaned, and all the new hotness is Retrofit 2. It turns out to be kind of a big change: they made a bunch of structural changes, so it's a lot of work to take all the references to the Retrofit 1 classes and make them go away. We're working with OpsMx to do it, and it's going to take a while, but we're going to get there. The flavor of the change is sketched below.
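(A sketch of the structural difference driving that migration; the service interface here is invented, not Spinnaker's actual one. In Retrofit 1, an interface method returns the deserialized body directly and failures surface as RetrofitError; in Retrofit 2, methods return Call<T>, which you execute, or enqueue, explicitly and which exposes an HTTP Response<T>.)

```java
import java.util.List;
import java.util.Map;
import retrofit2.Call;
import retrofit2.Response;
import retrofit2.Retrofit;
import retrofit2.converter.jackson.JacksonConverterFactory;
import retrofit2.http.GET;
import retrofit2.http.Path;

interface PipelineService {
  // Retrofit 1 equivalent, for comparison:
  //   @GET("/pipelines/{application}")
  //   List<Map<String, Object>> pipelines(@Path("application") String app);

  @GET("pipelines/{application}")
  Call<List<Map<String, Object>>> pipelines(@Path("application") String app);
}

class RetrofitTwoExample {
  public static void main(String[] args) throws Exception {
    Retrofit retrofit = new Retrofit.Builder()
        .baseUrl("http://front50.example:8080/") // must end with '/'
        .addConverterFactory(JacksonConverterFactory.create())
        .build();
    PipelineService service = retrofit.create(PipelineService.class);

    // execute() is synchronous; enqueue() is the async alternative.
    Response<List<Map<String, Object>>> response =
        service.pipelines("myapp").execute();
    System.out.println(
        response.isSuccessful() ? response.body() : response.code());
  }
}
```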
Java 11 officially reaches end of life in September, and the Spring Boot version we're on is already unmaintained, so you get to choose which unmaintained thing you want to live with longer. I think I'm going to be pretty excited to move to Java 17 as soon as we can.

As far as scale goes, I've talked a little about pipeline execution JSON and how we're compressing it, but it's still big: the Orca database pays a big price, we pay a big price for Orca databases, because these executions are so gigantic. And scale doesn't only have to be about computer scale; it's also about human scale. This may be something you experience, and I know I experienced it at my last company: basically, any time a Spinnaker pipeline fails, people are really, really quick to point the finger. It's "oh, that must be Spinnaker's fault," or "there's a bug in Spinnaker," or "we hate Spinnaker." There's a lot of hate and blame; we have thick skins, but it turns out to be organizationally inefficient to have it happen like that. Of course, if pipelines could fail less, we would love that. But when they do fail, it helps if the cause of the failure is more obvious. If you put in some invalid Kubernetes YAML, Deploy Manifest is not going to work, even though Spinnaker did the best it could. And even if your YAML is fine, something might be wrong in your code, the readiness probe never trips, the pod never comes ready, and Spinnaker sits there waiting for the deployment to stabilize when it never will. Spinnaker did the best it could do, but the support ticket gets filed and lands on Spinnaker's plate if you're not very careful. So this is what we're working on up ahead. And of course we have this big backlog of changes we'd like to get upstream, and we're desperately trying to get ahead of the performance curve, ahead of our user adoption, so we have time to do that.

So that's what we've been doing. If anybody has questions, there's a microphone to pass around so it makes it into the recording. But if nobody has questions, I won't bother. Let me check. Yes, here.

Q: You talked about the performance improvements and the cost savings, but could you talk about it from the end user's point of view, in terms of response times and reliability? Did you measure any of that?

A: With these particular changes, I'm trying to think how user-visible they would be. We're actually focused right now on making the UI snappier. We have some applications that are so big, with so many pipelines, and the pipelines are so big, that sometimes the UI just doesn't render at all. (If that happens to you, try Safari; it seems to work better than Chrome in these gigantic universes.) So I hope to have a better answer for that soon. Other than just relaxing the system, so that anybody querying Orca for anything would get much better responses than before, it didn't seem like the kind of thing where the UI just magically got better. We didn't measure it specifically; we were more worried about the bill.

Q: The other way to look at it: with the changes you made, you went from 200 CPUs down to how many?

A: We gave back 75 percent of them, so we're using basically a quarter of the computing resources we were using before.

Q: And with that, there's no perceptible difference for end users, nobody complaining that something isn't working?

A: If anything it's better: there are fewer GC pauses and it's way less spiky. We kind of nailed it; it's better all around, and cheaper, and faster. It ended up working out really well.

Q: You also mentioned that you disabled the caching. Was that done by changing how you run things and scaling down the caching pods, or did you have a flag that disabled it?

A: No, this is a flag that's been in Spinnaker for a while; it's already out in the world, not a new thing that we added. I think it was Hermon, from when he was at Armory, who did it: you flip a flag and it disables Kubernetes caching. We didn't scale down. We still have AWS caching happening, and caching for other cloud providers, so we run CloudDriver caching pods, plus RO and RW (read-only and read-write) pods, and the caching pods are still there. We just don't need nearly as many of them anymore; they're not doing as much work.

Q: On the caching thing, you mentioned having another dashboarding solution. Can you talk a bit about what you use as an alternative?

A: I can, though I'll probably get some terminology wrong. There's the basic Kubernetes dashboard; it might even be called the K8s dashboard. There's a team at Salesforce that runs it, so you can go to a web page and see your Kubernetes resources, and it's a cheaper way of doing it than having CloudDriver cache that stuff for you. The Spinnaker UI is nice, it's pretty, it's all right there, but I have to say, in all the times I've deployed stuff to Kubernetes, I've never used the Spinnaker UI to look at anything Kubernetes. I use it a thousand percent of the time for AWS stuff, and I never use it for Kubernetes stuff. It's a choice, but it turns out the code works: you flip the switch and you save the money. It's true that it's a little bit fake, because somebody still has to spend the money, but that Kubernetes dashboard is much, much cheaper, and it only generates traffic when somebody's actually looking at it, whereas Spinnaker caches whether people are looking at the UI or not. I know some other companies, I think Home Depot, have a better mousetrap for this, where they preserve the UI and have a more efficient way to do the caching. I would love to get that out in the world so everybody gets it as part of out-of-the-box Spinnaker, but until then, here we are.

I guess I didn't say this earlier about warming the cache with those on-demand AWS accounts.
I think that may actually pave the way. It's a lot of work to do, and it would be one pipeline stage at a time, but I can see a path where AWS pipeline stages wouldn't depend on caching either; they'd get the information they need inline, right there in the pipeline stage, and you'd be able to turn that caching off too.

Anybody else? All right, thanks everybody.