I want to talk about StormDriver one year later. Michael Graff was originally supposed to present this, but he couldn't make it, so I'm covering for him. My name is Gopinath Rebala, and I'm the CTO of OpsMx. Before we get into what StormDriver is: what is the problem we were trying to solve? We wanted some additional features in CloudDriver. So we'll talk about what those features are, why the current CloudDriver makes them difficult, and why we had to build a new service for this. We'll talk about the StormDriver design, how we can deploy it, and then we'll also see a few other use cases for StormDriver.

So what are the features we wanted? One of the primary ones is account sharding. As we saw in previous talks, when you have a large number of accounts with CloudDriver, the resource utilization is quite high. Even if you scale CloudDriver horizontally for high availability, each of those CloudDriver instances still loads all the accounts, so you can't really reduce the footprint much; you can only improve performance through CPU sharing. So one of the things we wanted to do was address that. The second thing we wanted is self-service onboarding. In CloudDriver, the accounts are essentially in one file. There are no APIs that let individuals upload their own credentials; you have to build a layer on top. Most people address that by creating one more layer, with their own automation on the side of Spinnaker that brings in these accounts. We wanted to address that issue as well. We also saw cases where, because of security, the target environments are not directly reachable by Spinnaker. You want to run the process centrally but still be able to deploy to those environments. The same thing also applies in SaaS environments.
So we wanted to address that as well. All of these features together help you do multi-tenancy on Spinnaker, and we'll see how that works.

Just to recap what the issues are with CloudDriver: CloudDriver is essentially the Spinnaker service that connects to cloud providers, deploys to them, and reads state back from them. It has an account model with RBAC on top: who can deploy to those accounts, and who can view data from those accounts, is specified on top of the accounts. It can scale horizontally, and the services that talk to CloudDriver view it as one service; the number of horizontally scaled CloudDriver instances is hidden behind that service.

So what we did was put a scatter-gather in front of the CloudDrivers. Now each of these CloudDriver instances can have its own accounts. If you have 1,000 accounts, you can put 300, 300, and 400 accounts on three CloudDriver instances. StormDriver sits in front of them and acts as the scatter-gather. Services like Orca or Gate that speak to CloudDriver actually talk to StormDriver, which caches the accounts from each of these instances and responds on their behalf. It's designed to be fairly quick; the memory utilization for thousands of accounts can be under 100 MB. It's written in Go, because we wanted to keep the footprint really small and get that efficiency. Because it's a scatter-gather, when Orca wants to deploy, StormDriver knows which CloudDriver instance to send the deployment to, and if someone asks what CloudDriver accounts exist, it can return the accounts from all three of them together. So it's a fairly simple design, and it works really well.
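To make the scatter-gather idea concrete, here is a minimal Go sketch of the two behaviors just described: merging the per-instance account lists into one view, and remembering which instance owns which account so a deploy can be routed to the right shard. This is an illustration, not the actual StormDriver source; the `account` shape and the function names are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sort"
)

// account is a deliberately tiny stand-in for a CloudDriver credentials
// entry; the real /credentials payload carries many more fields.
type account struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// mergeCredentials flattens per-instance account lists into a single
// sorted view and records which CloudDriver instance owns each account,
// so a later deploy request can be routed to the right shard.
func mergeCredentials(byInstance map[string][]account) ([]account, map[string]string) {
	var all []account
	route := map[string]string{} // account name -> owning instance URL
	for instance, accts := range byInstance {
		for _, a := range accts {
			all = append(all, a)
			route[a.Name] = instance
		}
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Name < all[j].Name })
	return all, route
}

// fetchCredentials collects /credentials from every known instance; an
// instance that fails to answer is simply left out, mirroring how the
// talk describes unresponsive CloudDrivers being dropped from the list.
func fetchCredentials(instances []string) map[string][]account {
	out := map[string][]account{}
	for _, base := range instances {
		resp, err := http.Get(base + "/credentials")
		if err != nil {
			continue
		}
		var accts []account
		if json.NewDecoder(resp.Body).Decode(&accts) == nil {
			out[base] = accts
		}
		resp.Body.Close()
	}
	return out
}

func main() {
	all, route := mergeCredentials(map[string][]account{
		"http://clouddriver-0:7002": {{Name: "prod", Type: "kubernetes"}},
		"http://clouddriver-1:7002": {{Name: "dev", Type: "kubernetes"}, {Name: "staging", Type: "aws"}},
	})
	fmt.Println(len(all), route["prod"]) // 3 http://clouddriver-0:7002
}
```

The real service fans the requests out concurrently and caches the results; the sketch keeps the fetch sequential for clarity.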
But by itself, it doesn't allow you to do remote deployments and things like that. So how do we configure StormDriver? In the Spinnaker YAML, you have the address of the CloudDriver that services reach out to. Instead of CloudDriver, you specify StormDriver, and that's it. All the services will treat StormDriver as the CloudDriver. It connects to the different instances and gets the data from them. If you ask for /credentials, it will get you the accounts from all the CloudDrivers, and the artifact credentials endpoint works the same way: it returns the results from all the CloudDrivers. It also does sanity and availability checks on these CloudDrivers, so if one is not responding, it can identify that and remove it from its list.

The architecture extension is this: if you want to connect to remote systems, and you want self-service there, with the credentials for these accounts managed by individual groups, we needed a way to support that. So we came up with this controller-agent model. You can think of it as simply a tunnel, a gRPC tunnel. The controller runs on the Spinnaker side; the agents run on the remote clusters. The agent deploys into a Kubernetes environment somewhere else and connects back to the controller, which creates the gRPC tunnel. At that point, any service on the Spinnaker side that wants to reach the remote side simply goes through the controller, and through the tunnel it reaches the service on the other side. So you can have other Spinnaker services running remotely, for example a CloudDriver: you can take a CloudDriver instance and run it remotely. Orca and Gate, for example, will connect to that CloudDriver by going through the controller. At the same time, when that CloudDriver wants to deploy something, it needs to check authorization, so it needs to talk to Fiat.
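In configuration terms, the swap described above is just repointing the CloudDriver base URL. A sketch of what that can look like in a Spinnaker config file (the host name and port here are assumptions; the exact file and key layout vary by Spinnaker version and install method):

```yaml
# spinnaker-local.yml -- every service that would normally call
# CloudDriver is pointed at StormDriver instead; StormDriver fans the
# request out to the real CloudDriver instances behind it.
services:
  clouddriver:
    baseUrl: http://stormdriver:7002   # was http://clouddriver:7002
```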
Because it's a tunnel, it can work like a service mesh from a single instance. The CloudDriver addresses Fiat by name, the request goes through the agent to the controller, and on the Spinnaker side it talks to Fiat and gets the same response it would get locally. This agent-controller pair is, again, written in Go, with a very small footprint and very fast startup times as well.

One of the advantages here is the discovery mechanism. You can have multiple agents running in multiple remote clusters. All of them connect to the controller, and on each agent's side you can have a subset of the accounts, whatever that customer is using. Another advantage is that of the CloudDrivers running on the remote side, one instance could be running in AWS and one in GCP, so you can have IAM integration for each of those CloudDriver instances. That way you can deploy to multiple clouds with their integrated identity while still connecting back to Spinnaker.

So these are the use cases we saw. You can deploy it in a SaaS configuration, because the agent connects back into Spinnaker; Spinnaker doesn't need direct access to the remote system. So in a SaaS configuration, when you don't have a direct connection to the target environments, you can still deploy. You can use different IAM roles, because you can run one instance in Google and one in Amazon and use their integrated credentials to deploy into those secure environments. And another advantage is that the groups onboarding their accounts don't need to share their target credentials with the central Spinnaker; they can keep them specified at the local agents. The connection between the agent and the controller is a simple gRPC connection. The agent connects to the controller, we use public/private key encryption for the environment, certificate rotation is supported, and the entire traffic is TLS encrypted.
So this is the full picture of how it works with the agent, the controller, and StormDriver. StormDriver acts as the CloudDriver interface for any of the services on the Spinnaker side; they connect to StormDriver. On the agent side you have CloudDriver accounts: multiple agents, and each CloudDriver can have its own N accounts. Because of the tunnel, StormDriver queries the remote agents for what accounts exist on their CloudDrivers. The controller knows how many agents exist, so discovery becomes fairly straightforward: StormDriver just says, give me all the instances of remote CloudDrivers you have. The controller maintains a list of all the agents that are connected and what services exist at each agent, so StormDriver can discover all the CloudDriver instances and the accounts configured at each of them. Now you have a list of all the accounts along with their RBAC, so anyone deploying through Spinnaker is still subject to RBAC with Fiat.

You can see here the color-coded paths in the picture showing what the communication paths are. And it's not just CloudDriver: you can see Igor trying to talk to Jenkins. If Jenkins is in a remote environment that's not directly accessible, the same mechanism works: Igor connects to the controller, the request goes through the agent to Jenkins, the responses come back, and Spinnaker works as if everything were connected locally. The reverse path is the same: if the CloudDriver wants to talk to Fiat, it goes through the agent to the controller and on to Fiat.

All right, so over the last year we did deployments with this. What did we learn? It's fairly easy to deploy remotely, and because of the agent, it's very efficient; it doesn't take up many resources, and it scales to a large number of CloudDrivers.
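The discovery bookkeeping described above, with the controller knowing which agents are connected and which services each one exposes, can be modeled with a small registry. The names and structure here are illustrative, not the actual controller's API:

```go
package main

import (
	"fmt"
	"sync"
)

// agentRegistry is a toy model of the controller's view of the world:
// which agents are connected, and which named services ("clouddriver",
// "jenkins", ...) each agent exposes through its tunnel.
type agentRegistry struct {
	mu     sync.RWMutex
	agents map[string][]string // agent ID -> exposed service names
}

func newAgentRegistry() *agentRegistry {
	return &agentRegistry{agents: map[string][]string{}}
}

// register records an agent when it dials in and announces its services.
func (r *agentRegistry) register(agentID string, services []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.agents[agentID] = services
}

// lookup answers the "give me all the remote CloudDrivers you have" query:
// every connected agent that exposes the requested service.
func (r *agentRegistry) lookup(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var ids []string
	for id, svcs := range r.agents {
		for _, s := range svcs {
			if s == service {
				ids = append(ids, id)
			}
		}
	}
	return ids
}

func main() {
	reg := newAgentRegistry()
	reg.register("aws-cluster", []string{"clouddriver"})
	reg.register("gcp-cluster", []string{"clouddriver", "jenkins"})
	fmt.Println(len(reg.lookup("clouddriver")), len(reg.lookup("jenkins"))) // 2 1
}
```

With a registry like this, StormDriver's discovery is one lookup for "clouddriver", and Igor's Jenkins path is the same lookup for "jenkins"; per-account RBAC still happens in Fiat, exactly as in a local deployment.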
It's very efficient in terms of memory and CPU. Typically we see CloudDriver memory go up with the number of instances; with sharding, we actually see memory utilization go down by roughly the same factor. If you have 20 CloudDrivers, we see it go down not exactly by a factor of 20, but more than 10 or 15 is what we have seen. One of the problems is that now that we can install CloudDriver remotely, that creates some complexity in configuring it. It's not a single install; you have to generate packages to deploy remotely, which is straightforward, but if there's a problem there, then troubleshooting gets difficult. So we built logging back into the central server, and the ability to monitor centrally, so we know if there are issues with the remote systems. Troubleshooting and support are areas where we can still improve, and the same goes for upgrades of the remote systems; they all need to work seamlessly. Those are some of the additional complexities that come with this model.

We open sourced these systems last year, so they're available for anyone who wants to try them out, and contributions are welcome. One of the advantages here is the true multi-tenancy we can bring. Right now, the only place where you actually see other users' accounts is when you're building a deploy stage and looking at the accounts, or querying for them: it lists all the accounts that exist, marked as access or no access. If a feature flag said, only show the accounts you have access to, then you would have pure, full multi-tenancy. But there are other issues there, because you need a hierarchy of groups that allows an admin to see all of them but non-admins not to see others, that kind of thing.
So it's been a fairly interesting project. It improves performance a lot for the CloudDrivers, and it lets us do secure remote deployments and onboard users with their own systems without having to centrally provision those accounts. If you're interested, take a look at the code here; you can download it and try it out, and contributions are welcome. These are the people who actually worked on the code for these systems and are the primary contacts for it, so if you have any questions about the code or any suggestions, you can reach out to us. That was a very quick overview of what this is and how we've been doing. Any questions?

Way back at the beginning, you highlighted some differences from out-of-the-box Spinnaker. I don't know if you can go all the way back to that slide, but there were two things that I think Spinnaker at least does some of out of the box, so I wanted to try to clarify that. Keep going back. Yeah, keep going back. That one. So account sharding, CloudDriver does some of that out of the box, right? At least the caching is sharded. Is that right?

It does. You have to run CloudDriver in HA mode, and then the caching instances do the sharding, but none of the others do.

Right, okay. So every instance still knows all of the accounts.

Right.

Okay, so that matches my understanding. And then you were talking about how CloudDriver accounts all live in one file.

Yeah, and now we have the database for that.

Right, so it's potentially easier to do the onboarding. And I don't mean to diminish the benefits of this. I think it's a great piece of software, and it carves up some responsibilities in really good ways, but for people who are thinking, "but I thought..." anyway, open source has at least addressed some of this stuff, but certainly not all of it.

Yes, the API for automating account onboarding; the Apple folks are going to talk about that, I think.
But you still need to build that additional layer on top; the API definitely addresses some of the issues. All right, that's all, thanks. If there are no other questions, thanks everyone. We look forward to hearing from you.