All right, we're going to go ahead and get started now. Welcome, everybody, and thanks for joining the session today. My name is Daniel O. I'm a CNCF ambassador and a track chair at KubeCon, and I'm glad to be moderating this session today. We're going to talk a little bit about cloud agnostic design for fun and profit. One quick reminder: if you have any technical questions during the session, please raise your hand at the end, and our great speakers will address them. So now I'm going to introduce our presenters: Alex Meyer, the infrastructure team lead at Corsha Inc., and Anusha Iyer, the CTO and co-founder of Corsha. Please welcome to the stage both Alex and Anusha. Great, thanks, Daniel. Welcome, everyone, and thank you for coming. It's good to be here in person, and thanks to those of you joining online. We wanted to talk today a little bit about our approach to cloud agnostic design, and really, I just want to convey that we're excited to talk about this, because we applied these ideas at our organization, and I don't think I'm exaggerating when I say it's changed the way we do business. It's a new way of looking at things. A little bit more about us: as Daniel said, I'm Alex Meyer, the infrastructure lead at Corsha. I've been with Corsha since 2018, and discovered Kubernetes a bit before that, in 2017 actually. I came here from places where we were all in on a certain cloud provider and got ourselves locked in, so we experienced the pain of that: the cloud provider effectively had pricing power over us because we were locked in. Absolutely. Yeah, I'm Anusha Iyer. I'm the CTO and one of the co-founders of Corsha. I was introduced to Kubernetes through Alex, in fact, and just fell in love with the whole concept of infrastructure as code.
I think it really elevates the idea of infrastructure as a first-class citizen in the whole software engineering, development, and deployment process. So we've embraced it head on, integrating it into code reviews and the whole dev pipeline. It's fantastic, and we're excited to tell you more about it today. Yes. So we're going to structure our talk today as a case study of what we've done. We'll set the stage early about our app, just to give some background and motivate why we even care about cloud-agnostic design. We'll go over what we consider the fundamental building blocks of this approach, which is Helm and Helm charts, and scaling that out. We'll explore some of the economics of the modern cloud computing market. We'll talk about Terraform, which has been a big part of our success here. We'll go briefly over our journey to date, and we'll try to apply these concepts to other organizations of varying sizes. We'll talk about where we hope to take cloud-agnostic design beyond this, and we'll end with some summaries of lessons learned and maybe some of our favorite stories from our journey. Yeah. And so, a little bit about what we do, just to set the stage. We are an early startup, just about eight people; about five of us are on the engineering side, and a few of us are on the BD side. We're based out of the DC area, and a lot of our roots actually come from the DoD and intelligence world, so even the idea of embracing cloud took a bit of getting used to for me personally. In terms of what the app actually is, what we've come up with is a way to do fully automated MFA for APIs. We've heard a lot about API security, I think, even in the keynotes. The idea is that every API request goes with a one-time-use, dynamic credential, just the same way you do with Google Authenticator or an RSA token on the human side.
The way we got it fully automated is that we actually have these notions of authenticators, just like a Google Authenticator, and we push one down to a machine, your API client. That could be a Kubernetes pod, it could be a Docker container, it could be an edge IoT device. These authenticators start establishing dynamic identities against our platform. And a core part of our platform is a distributed ledger network: this is what collects the identities and then verifies the MFA credentials on the way out. So the notion of distributed trust is really fundamental to what we do, and it led the way to a lot of the design philosophy we'll talk about today. It's a security product, so what do we need? We need things like high availability. We need to be able to offer the platform both as a SaaS and on-prem for certain sensitive customers. And because we're talking about API requests and authentication, we need scale: we need to support thousands of API requests per second. So look for those themes as we go through the talk. To give a little more depth as we hone in on what exactly is running in the cloud: the distributed ledger is Hyperledger Fabric, something developed by IBM, and we've taken it kind of to the edge. Portions of it run in the cloud, and portions run on the customer's systems. The customer is on the left here; the cloud portions are on the right. I wanted to draw everyone's attention to these peer and orderer objects that Hyperledger uses. Their exact operation isn't terribly important, but the orderers themselves form a distributed system: they use Raft and elect leaders and that sort of thing. The peers use gossip to establish their own distributed system. So we get this consensus mechanism, and that, as we'll see, has been a really important decision for us.
All of this we have been able to run darn near free since 2018. How do we do this? Well, all the major cloud providers have startup credit programs. They want you to come over and try their system, they want to lure you from their competitors, and they will give you a whole bunch of free money, essentially — credit to use against their system to evaluate it. You can see that we started off with GCP, then went to AWS, and we got to the point where, as one startup program was drying up, so to speak, we had already gotten the wheels turning on the next one. By applying these principles, we were able to gain economic leverage. Absolutely. And startup credits are obviously great for keeping operating costs low. A show of hands here, for those in person and maybe out there virtually: how many of you have worked in a startup or are doing a startup right now? Quite a few, yeah. So I'm sure you're all familiar with that cardinal rule, right? Don't run out of money. Certainly that's been a big driving force behind being able to do this kind of hopping. But I will say that one of the unanticipated benefits of taking this route, at least for us, is that it's a great way to understand the landscape, the whole cloud native ecosystem of what we're trying to protect, and also to meet customers where they are. So now if we've got, say, a sensitive customer that's running in Azure Gov, we can say, no problem, we can set up a managed service for you there. It's really turned us into almost a forest of managed services, as opposed to just one large tree. Now, we of course didn't start out with this outcome in mind. What we did start out with, to Anusha's point, was the requirement to run on a wide range of customer systems. Since we are a security product, we have customers that say: no open internet, right? No access to the internet. So they need to run on-prem.
And so in that case we say, OK, you can use kops, for example — Kubernetes Operations — and deploy a Kubernetes cluster, and we don't have to take on that effort ourselves because the CNCF ecosystem already handles a lot of it. Some of our customers want to run on VMs, right? They run on ESXi and other VMware stuff, and they just want to get off the ground as quickly as possible to prove us out. So for that, we install K3s, a lightweight Kubernetes distribution, onto a VM and deploy it to the customer. They can run it in their legacy infrastructure; they don't even know Kubernetes and Helm are under the hood. And we get in that way. Other customers are fine running on the public cloud, and they just need massive scale, so we support all the major cloud providers. All of this comes from the same set of artifacts that we develop and push. Right now we're using Azure, so we develop in Azure through our Helm chart repo and our Docker registry, and it's that point of indirection that lets us support all of these transparently. Absolutely, and so form follows function. We had a set of requirements in our minds when we started off that took us down this path. One was certainly that we wanted a distributed consensus model where, especially as a security platform, there's no single point of failure and no single point of attack. We really wanted to raise the bar in terms of the attack effort necessary, and so a Byzantine fault-tolerant system like a distributed ledger network made a ton of sense. We also wanted, as Alex mentioned, to run across a wide array of platforms and environments, and Kubernetes is a fantastic indirection layer for that. And then similarly, we needed to support those air-gapped and sensitive scenarios even while offering a SaaS at the same time, so the ability to omit the cloud platform, or omit the requirement to connect out to a SaaS, was an important design constraint for us.
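As a rough illustration of that VM path, standing up K3s on a customer VM and installing the platform from the same Helm artifacts used everywhere else could look like the sketch below. The chart repo URL and chart name are hypothetical placeholders, not Corsha's actual artifacts:

```shell
# Install K3s as a single-node server on the VM; the script starts the service.
curl -sfL https://get.k3s.io | sh -

# K3s writes its kubeconfig here; point kubectl and helm at it.
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# Install the application from the same chart repo used on every provider
# (repo URL and chart name below are illustrative).
helm repo add example-charts https://charts.example.com
helm install platform example-charts/platform \
  --namespace platform --create-namespace
```

From the customer's point of view it's just a VM image; Kubernetes and Helm stay under the hood, as described above.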
And so on those routes, as he mentioned, sometimes we've done K3s, and sometimes you can just run a Kubernetes cluster locally and deploy us there. And then similarly, when putting together a product, you want to get to production-grade services as quickly as possible, and that's where the ecosystem of Helm charts has proven invaluable to us. It's such a force multiplier to go out there knowing you're getting good-quality stuff that you can get up and running quickly. Yeah, so let's take a look at one of these charts. In this case, we're looking at the Bitnami Postgres chart. What I've done here is gone to the site for AWS RDS, their managed Postgres service, picked off their bullet points for why you'd want to use it, and compared them to one of the Postgres charts. It's a pretty typical comparison of a managed cloud provider service against doing it yourself. What we're trying to get across is that a lot of the features RDS gives you — in exchange, of course, for paying them — you can get using already-existing, unmodified Helm charts that you just pull off of Artifact Hub. For example, RDS says: we have fast, predictable storage. Well, with the Bitnami Postgres chart you just set global.storageClass and you can use the exact same type of storage RDS is using; it's a parameter. Backup and recovery? There are tons of Kubernetes backup systems, like Velero, which is what we're planning on using. Read replicas, high availability — again, all features you can get from configuration. Metrics? The Bitnami chart has Prometheus metrics built in. The one thing that a chart you run on your own doesn't have is an SLA, right? RDS has a financial SLA that it will be up.
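To make that concrete, here is a minimal sketch of a values file for the Bitnami PostgreSQL chart that flips the switches just discussed: storage class, replication for HA, and the Prometheus metrics exporter. Key names follow the chart's published values; treat the storage class name as an AWS-specific example:

```yaml
# values.yaml for the Bitnami PostgreSQL chart (sketch)
global:
  storageClass: gp2          # same EBS-backed storage class RDS uses on AWS

architecture: replication    # primary plus read replicas for high availability
readReplicas:
  replicaCount: 2

metrics:
  enabled: true              # built-in Prometheus exporter sidecar
```

Installed with something like `helm install db bitnami/postgresql -f values.yaml`, the same file works on any provider by swapping the storage class name.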
And this gave us pause, I guess, but we've been using this stuff in production for a while, and really, if you flick all the switches, you can get some real stability out of it, and as long as you follow backup best practices, you can recover from failures. Yeah, there was a little bit of config needed up front, but all of it is committed to code, so it's a one-time cost. Another example is a chart that we implemented ourselves this time; we call it our bootstrap chart. When we're onboarding to a new cloud provider, there are some differences, like storage classes and that sort of thing, that are just specific to the cloud provider, and there aren't many opportunities there for a generic design. So we concentrate all of that cloud-provider-specific configuration into this one chart. That way, when it's time to onboard a new provider, there's one place to look to make sure we have a smooth transition. For example, I have part of a Helm chart here: we pass in the provider as a Helm parameter, and the chart just renders differently based on whether you're using AWS's storage provisioner or another one. In this bootstrap chart we have, of course, storage classes; ingress settings, since load balancers are slightly different between cloud providers; our web application firewall config, again with slight differences between providers; and in the future we'll have a base set of CRDs, so those will be sitting in the cluster ready to go. Everything in our Kubernetes cluster is a Helm chart, without exception. So this quickly becomes unwieldy: you have dozens and dozens of these Helm charts, and you could, I guess, script a bunch of helm installs one after another, but it becomes difficult.
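A minimal sketch of that render-by-provider idea, assuming a hypothetical bootstrap chart (the names and template here are illustrative, not Corsha's actual chart): a single StorageClass template switches its provisioner on a provider value passed in at install time, e.g. `--set provider=aws`:

```yaml
# templates/storageclass.yaml (sketch)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
{{- if eq .Values.provider "aws" }}
provisioner: kubernetes.io/aws-ebs      # EBS-backed volumes on AWS
{{- else if eq .Values.provider "gcp" }}
provisioner: kubernetes.io/gce-pd       # persistent disks on GCP
{{- else if eq .Values.provider "azure" }}
provisioner: kubernetes.io/azure-disk   # managed disks on Azure
{{- end }}
```

Everything else in the platform can then refer to the `fast` storage class without caring which cloud it is on.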
So what we rely on is an amazing tool called Helmfile, available on GitHub. You basically list out a manifest of all your different charts; you can see here a little toy example of an ELK logging stack being built chart by chart. You specify the version of each chart you want to install, you point to the chart online, and it supports values files, so you also point to a file that holds the values for that chart. It supports encrypted secrets too, so we encrypt all of our secret data with PGP and it just gets decrypted at deploy time. Right, so you put all of these pieces together, and the emergent property is that we have a cloud-agnostic platform. To give you a sense: we can go from zero to a fully operational dev cluster, staging cluster, or production stack on any of the major cloud providers in a matter of minutes, by just pointing the kubeconfig at a different cluster. It's super powerful stuff. A couple of things we've leveraged from the platform providers have helped us here: one, we stick with managed Kubernetes on the platform provider, which takes away some of the burden of running and managing the cluster yourself; and two, we make sure we don't unintentionally rely on any managed services, so there's nothing we have to bring with us when we're moving over, and we really are this self-contained application platform in a box. Absolutely. So we had this realization that we can operate on any cloud, and again, it sort of came as a surprise, but once we gained this self-awareness, we asked: well, what can we do with it? So we looked at the most straightforward thing, which of course is the cost of a cloud instance. Now, cloud computing has increasingly become a commodity.
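Here is a small sketch of what such a Helmfile manifest can look like for an ELK-style stack: pinned chart versions, per-release values files, and PGP-encrypted secrets handled by the helm-secrets plugin. Repository choices, versions, and file names are illustrative:

```yaml
# helmfile.yaml (sketch)
repositories:
  - name: elastic
    url: https://helm.elastic.co

releases:
  - name: elasticsearch
    namespace: logging
    chart: elastic/elasticsearch
    version: 7.17.3                  # pin the chart version
    values:
      - values/elasticsearch.yaml    # plain configuration

  - name: kibana
    namespace: logging
    chart: elastic/kibana
    version: 7.17.3
    values:
      - values/kibana.yaml
    secrets:
      - secrets/kibana.yaml          # PGP-encrypted, decrypted at apply time
```

A single `helmfile apply` then reconciles the whole stack against whatever cluster the kubeconfig points at.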
We have all the big players, and it's commoditized: one vCPU on AWS is, for all intents and purposes, the same as a vCPU on GKE. Some cloud providers are profitable and are willing to defend market share by pricing low; others are willing to just take a loss — an example of that is GCP, because they are trying to gain market share. This situation is obviously very beneficial for us, the users. So, as an example, I grabbed the prices in late August from all the major cloud providers for the instance type we primarily use, an m5.2xlarge and its equivalents, and compared them; these are all roughly in the same region. For example, if you move from Azure to AWS, that's potentially a 15% cost savings for an equivalent on-demand instance type; if you move from AWS to GCP, it's potentially 1% more expensive. And again, this is very specific: you'd have to build this comparison for your own region, and it's variable, but it's a snapshot showing that there are some pretty significant opportunities for cost savings here. Yeah, and so here's the challenge. It's fantastic that you can save quite a bit by staying nimble, staying agnostic, and switching providers, but we don't want to oversimplify things. There are nuances between the providers, as Alex mentioned: the load balancers, ingress controllers, storage classes, things like that. Even things like: how do you leverage the Kubernetes API for a provider? How do you actually set up your configuration for that? So there are tricks to doing this, and chief among them is the use of Terraform. Our goal is to get a Kubernetes API and backing worker nodes up as quickly as possible, so that we can just hit it with our Helmfiles and get back up to full capacity. Terraform was created by HashiCorp, but it is open source, and it's a pretty remarkable piece of software: it basically lets you codify cloud provider configuration.
Then it uses the cloud provider's APIs — AWS's, GCP's, and so on — to create those resources. It really did let us go from zero to having a Kubernetes cluster ready to receive API commands in under two weeks, because all of the major cloud providers publish lots of examples and well-maintained Terraform modules; again, they want your business, they want you to come to them. But we sort of flipped the script relative to our Corsha bootstrap Helm chart: where that chart concentrates all the configuration that's different, in Terraform we concentrate all the configuration that's the same, because the baseline assumption in Terraform modules is that everything will be different from provider to provider. So in this little example, we have an AWS EKS module that configures and provisions an EKS cluster, and an Azure AKS module that configures an AKS cluster. All of these rely on a common module holding the stuff that transcends cloud providers, like the maximum size of a node pool, or your allowed IPs — those are going to be the same on all the cloud providers. We index into that list of variables using Terraform workspaces, whether it's dev or prod, and then just merge them into the big list of cloud-provider-specific configuration, and there you go: you have only one place to make one change, and it's applicable to all your cloud providers. Before we go on, I just wanted to give a quick nod to the challenge of migrating prod in this paradigm. For us, our architecture bails us out, in that it's a distributed system that's dynamic: we can add and remove nodes slowly. In the example I have at the bottom here, we're doing what we call straddling a cloud provider: we've added a node in GKE, but it's still communicating with the other blockchain nodes in Azure, and then slowly, one by one, those will come over as we move completely onto the new cloud provider.
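A rough HCL sketch of that layout, with hypothetical module paths and variable names (not Corsha's actual code): provider-specific cluster modules each pull shared settings from one common module, indexed by the Terraform workspace:

```hcl
# Shared settings that transcend cloud providers
module "common" {
  source = "./modules/common"   # exposes cluster_name, max_nodes, allowed_cidrs
}

# AWS: configure and provision an EKS cluster
module "eks" {
  source        = "./modules/aws-eks"
  cluster_name  = module.common.cluster_name
  max_nodes     = module.common.max_nodes[terraform.workspace]  # dev vs prod
  allowed_cidrs = module.common.allowed_cidrs
}

# Azure: configure an AKS cluster from the same shared inputs
module "aks" {
  source        = "./modules/azure-aks"
  cluster_name  = module.common.cluster_name
  max_nodes     = module.common.max_nodes[terraform.workspace]
  allowed_cidrs = module.common.allowed_cidrs
}
```

One change in the common module then propagates to every provider's cluster definition.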
There are a lot of other examples of this for different systems: Postgres has Pgpool, which lets you add and remove replicas and fail over quickly. So there are ways to do zero-downtime prod migration. For DR, we're going to gloss over this a bit because everyone's situation is different. Velero, Kubernetes volume snapshotting, and lots of other Kubernetes tools give you plenty of non-cloud-provider-specific options for that. And we would definitely recommend at least a little automation around the creation and restoration of backups, just to make it all repeatable. To put this all together and give everyone a look at our stack, we have here a bunch of the different namespaces we deploy, all in support of, of course, our customer workloads, which are Golang business logic fronted by Node.js for any front-end work, with Postgres for customer data. We bring our own CI/CD namespace: we install our own SonarQube and our own Jenkins. We deploy an ELK stack, with ElastAlert, a Yelp project, for log-based alerting. There are CNCF tools sprinkled all over this: Prometheus is there, we install Grafana, we install the Kubernetes NGINX ingress controller, and then cert-manager, which handles all the renewal of certificates and everything like that when we onboard to a new provider. One thing I do want to point out is that we still use Route 53, no matter which cloud provider we're on. We use a tool called external-dns; we give it a service account so it can communicate with AWS, and wherever we're running from, it just phones home that way. Yeah, so essentially what we've done is put this application platform in a box, along with all of the supporting services we need to manage development, testing, and continuous pipelines. It's a fantastic way for a small team to use the whole ecosystem as a force multiplier.
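As an illustration of that "phone home" DNS setup, a values sketch for a typical external-dns chart might look like the following. The chart layout, secret, and domain names here are assumptions for illustration; the actual keys depend on which external-dns chart you use:

```yaml
# external-dns values (sketch): manage Route 53 records from any cluster
provider: aws
domainFilters:
  - example.com            # only touch records in this hosted zone
policy: upsert-only        # never delete records external-dns didn't create
env:
  # Static AWS credentials for an IAM user scoped to the hosted zone,
  # stored in a pre-created Kubernetes secret (name is illustrative).
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: route53-credentials
        key: access-key-id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: route53-credentials
        key: secret-access-key
```

Because the credentials travel with the chart values, the same release works identically whether the cluster is on AWS, GCP, or Azure.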
We're definitely punching beyond what you would imagine for a team of our size, to be able to maintain all of this and still keep our sanity. Yeah, so we thought it might be helpful to any of you considering this path to walk through what our journey was: how we matured, how we added elements to our ecosystem along the way, and how it all staged out. So we started in 2018, when the company was founded, and this is pre-Alex: we were on AWS, very much in the world of VMs and Docker Compose. Then he joined in Q2 of that year and opened our eyes to the power of Kubernetes, if you will, and Helm. In Q3, I think, we got into the Google for Startups program, with a little bit of credit there, and so we decided it was worthwhile transferring over and learning more about GCP. Then in 2019 our stack was getting more complex, with more elements to it, so we introduced Helmfile at that point for better organization. Q3 is when we got into the AWS startup program; that was a larger chunk of change and worth shifting over for. And that's when I think you and I made the conscious decision to say, all right, let's go all in here, let's take a cloud agnostic approach to this whole thing — and you invested in Terraform at that point. Yeah, absolutely. I would say the switch to AWS was our inflection point in terms of being committed to this as an approach. And you can sort of see that: after Q3 of 2019, when we adopted Terraform, you can almost see us asking ourselves, what cloud provider services are we relying on, and what do we need to do to build our own? Soon after, we realized we were using a lot of their log ingestion features, so we built our own — we built an ELK stack, right? A little later: OK, we're using their CI/CD system, what do we do? We build our own, so we went and got Jenkins deployed.
At that point AWS's credit was running out, so we set our sights on Azure, Terraformed our cluster there, and made the jump. Pretty soon thereafter we just picked our next thing: what are we relying on? Monitoring, right? And of course there is a terrific CNCF project for that, Prometheus, so we went all in on that and got it deployed. More recently, we've been tightening our software supply chain: we do security scanning on all of our containers, and since we use Helm charts for everything, we are utterly dependent on them, so we began to sign our Helm charts — so that no one can intercept that portion of our software supply chain, and so that our customers can verify that we are the producers of it. And our credit with Azure is just about running out now, I believe, so we're setting our sights on our next cloud provider; we may go back to GCP. And I'll say, when we talk about moving or going back, really what we mean is moving our dev cluster and our day-to-day ops there; we simultaneously have deployments in different clouds at the same time. That's right, yeah, based on the customer. Yeah, and so I know we've been talking a lot about our startup journey, but hopefully there are lessons here that go beyond startups. There is a lot of negotiating power to be had if you have a nimble, agnostic design and you're able to jump across providers. As Alex mentioned, we are in discussions with GCP right now, and the providers are very incentivized to have you come. Even beyond the financial benefits, oftentimes you can work with them to learn the best practices for Terraforming onto their platform, or the tips and tricks for optimizing performance, or storage classes, or node types, and all of that. And so we've been working with GCP, just wrapping up a pilot there.
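The chart-signing step mentioned above can be sketched with Helm's built-in provenance support. The key name, chart path, and keyring locations below are hypothetical; note that Helm's signing relies on legacy GnuPG keyring files:

```shell
# Package and sign the chart with a PGP key, producing the chart .tgz plus
# a .prov provenance file alongside it.
helm package --sign \
  --key 'release@example.com' \
  --keyring ~/.gnupg/secring.gpg \
  ./platform-chart

# Customer side: verify integrity and origin against the publisher's
# public key before installing (keyring path is illustrative).
helm verify platform-chart-1.2.3.tgz --keyring ~/.gnupg/pubring.gpg
```

The .prov file ships next to the chart in the repo, so anyone pulling the chart can check both its checksum and who signed it.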
And they've been fantastic to just kind of help us along that maturity scale. Yeah, we're a little bigger, right? So we have a little bit more cloud. And so you kind of take advantage of different programs, not just early stage startup programs, there are plenty. So looking ahead, we've spent most of this time talking about our economic motivation, right? Saving money or that sort of thing. What we are really interested in doing is using these cloud agnostic design concepts to further our business goals, right? So as again, a cybersecurity company, we are like pretty paranoid, right? And so we ask ourselves, what if we went multi-cloud and deployed one portion of our application to each cloud provider, and then they communicate at the application layer? We call this a zero trust deployment, right? Because we are guarding against the compromise or crash of any single cloud provider, right? And in this sort of example, we would deploy a third of our blockchain on AWS with its own set of private entity material, third of our blockchain on GCP with its own private, and then Azure and so forth. And then our clients, of course, just talk to all three and it's completely transparent. To kind of summarize what lessons we've learned on our journey, Anusha and I sat down and talked about if we could go back in time three years, what did we do well or what did we wish we had done better? But one thing that's the top of the list for me anyway is way back we picked a layer of persistence that stores the data that has really good distributed systems properties, right? Which lets us chiefly migrate between providers, right? You get this consensus and you begin to straddle cloud providers with a portion of your consensus on one portion of your consensus on the other. Migrate dev to your new cloud provider first, make the developers put up with any sort of things that fell through the cracks in terms of cloud friction or pain, a lot better than than the customer. 
But be nice to them, and maybe take them out for lunch. And then, as Alex mentioned, we've kept DNS on AWS, just Route 53. It isn't black and white in terms of what you need to keep agnostic: if it's a low-cost service and not worth the effort, that's fine. There are choices to make as you go through this. And then definitely, that whole rich world of free open source software and the ecosystem of Helm charts is such a great force multiplier — but with each element you pull in, really understand what you're getting into, especially in terms of configuration and the dials you can turn. Funny story here: when we introduced a ModSecurity WAF into our ingress controllers, we turned it on in production, and I was going through a couple of weeks of a whole bunch of demos, back to back, in a very stable demo environment that had never had trouble before. All of a sudden, the demo gods are just pouring terror upon me: about every two or three demos, seemingly at random, I would start blocking myself, literally getting flagged by ModSecurity. We didn't realize it at the time, but the demo flow I was running generated malformed authentication requests, and I was triggering the default rules in our ModSecurity config. So with those types of things, you really have to watch the dials. Yeah, and the root cause of that was minor differences in the way different cloud providers implement Kubernetes load balancers. So needless to say, that is a piece of config that now lives in our bootstrap Helm chart and is rendered in a cloud-specific way right away. Yeah, that one was definitely paid for in blood, so to speak. You want to pull as much off the shelf as you possibly can, obviously, especially with Terraform: there's a lot of stuff that's very easy to adapt so you can get out of the gate quickly. And IAM in particular, right?
There are a lot of great resources for that; it's something that's great to manage with Terraform, and it simplifies the ops workload. Yeah, and then resources and limits. I will begrudgingly admit that these are a really good idea for the scheduler. Again, a story: we were going through this performance-tuning phase where we really wanted to keep latency low and transaction rates high, and my vote was always to take the limits off — hashtag no limits, let's see how high this thing can go. And we quickly learned otherwise. If you could look through our Slack, every time I would mention this — hashtag no limits — there's a little eye-roll that Alex would add in there, and he was right: resources and limits are super important for the scheduler and for optimizing what that looks like. So yeah, those are our tips and tricks; hopefully some of our pains will save you some of yours. Absolutely. Well, I think that concludes the main portion of our talk. Thanks for listening. Do we have any time for questions? Hi, great talk, thank you very much. I wanted to ask about the Terraform bit: how are you approaching the cloud agnostic philosophy when working with something like Terraform? Are you building modules on modules, are you abstracting these layers, and if so, is it open source or closed source? Yeah, thank you. Sure — so the question was, how are we utilizing Terraform in a cloud agnostic paradigm? I'll go back a few slides to our Terraform slide. Each provider is actually a module, as we have it currently. So an EKS cluster is a module that we bring in; an AKS cluster is a module. Again, those are invoked, and we make a lot of reasonable choices inside those modules. There's not much to them, because the cloud providers' published modules are so good that ours hold just the essential config.
And we basically take these modules and glue them together with our common module, which holds the code we want in each of the other modules. So you could say it's a composition of modules. We do not currently have those open source, but that would be awesome someday. It's definitely a thought, yeah. We're actually at the top of the hour, so please continue your conversations and questions in the hallway, and our presenters will be more than happy to address them. We also have a bunch of questions from the virtual audience, so please join the Slack channel and ask any questions there, and our presenters will address them as well. Thanks for joining again, and please make sure to submit your feedback in Sched; it will be very helpful in preparing next year's KubeCon. Thanks again, and enjoy the rest of KubeCon. Thank you.