So, hello KubeCon, and thank you very much for coming to our talk. My name is Tom, and I'm a Solutions Engineer at Jetstack. I'm joined here by Oli, so hello Oli.

Hello, yes, I'm Oli. I'm a software engineer at Improbable, in the Defence unit, on the Engineering Velocity team.

Today we're going to tell you the story of how Improbable and Jetstack worked together to build a deployment and release service using Kubernetes technologies such as Argo Workflows and Argo Events. This is a multifaceted story with many complexities and design challenges that we had to overcome in order to achieve the ultimate goal: providing Improbable with a platform that could be deployed anywhere, under any circumstances, whether that be at the edge, on-prem, or in public cloud. So without further ado, we want to kick off by talking about the problem at hand. We needed to start out by determining what the platform is and what the ultimate goals for its delivery are. So Oli, what's the platform that we're building for, and what were the design challenges we needed to overcome to get it deployed and operational?

So at Improbable we've built a synthetic environments platform, and a synthetic environment is essentially just a highly scalable simulation, or grouping of simulations, that can be used to represent various movements across multiple domains. When I say a domain, I mean it in the defence sense of the word: land, sea, air, space, cyber. We can simulate vehicle movements and civilian movements, package all of that together, and enable our customers to build a virtual world containing all of that information. So that's what our platform does: it enables you to build a fully capable synthetic environment at scale. Our platform is essentially made up of multiple components built by multiple people in multiple teams. When we deploy it, we can't really assume that we're going to run on any specific flavour of Kubernetes. It could be running on a customer site, and in a military context that could be a forward operating base with zero internet connectivity. Those are the target deployment environments we need to deploy to, and that's essentially one of the problems we're trying to solve here.

So Oli, you mentioned that the end goal was for the platform to deploy onto any Kubernetes service, completely agnostic in that sense, but also without relying on an external internet connection. If you've got any experience building platforms like this, you can probably guess some of the challenges off the top of your head. But what sort of challenges does that pose?

Well, not being able to pull container images down is a huge one. We have to think about how we deploy to somewhere where you can't just pull containers from a source registry, so we need some other way to deliver it. That might be us getting the YAML together and putting it on a USB stick, which goes through some vetting process and is then plugged into the customer site. But it's all the things you take for granted by just having an internet connection, which these days is pretty much everything; those are things we have to work around, because they're not always available to us.

Yeah, absolutely. So moving on, I can imagine that a platform that allows you to build virtual worlds can be very complex.
And I can imagine there are lots of different moving components being developed by a lot of different people, in a lot of different teams, with lots of different methods. So what sort of problems did you face from that perspective?

Yeah, there were many. We offer the platform as a base, and I guess we'll get onto that in a moment. But when we want to implement it for a customer with a particular use case, our core product needs to be fully extensible: we need to add things onto it to get it to do a certain type of simulation or use a certain type of technology. In order to do that and get it all working together, you may need collaboration from third parties, or even just from other teams within our own business. And once you've got all these components talking to each other and working together, you need to test that, which presents challenges in itself. You may have one use case for one customer and a different use case for another, and you can't use the same approach to testing across both, because you've got different components. So we needed a way to easily do integration testing of components on a per-customer basis.

As for tracking changes and versioning, there wasn't really any of that. It was a bit of a mishmash, especially when you've got third parties involved as well; that's when it really gets complicated. It would often be a case of just getting some automated tests running in a container, standing up a temporary cluster, Minikube or something like that even, just to get some end-to-end tests to run. The problem was, those tests might work for one set of components for one customer, and you'd have to write a whole other set, in a separate pipeline and everything, for another customer. We really needed an easier way to do that, because it was starting to get a bit messy, a bit uncontrolled, and really difficult for auditing and that sort of thing.

I want to ask you about the deployment to the end customer, because we've spoken a bit about how we need to be able to deploy to any customer infrastructure. Do you want to tell us a little more, just to reiterate some of the problems posed there?

Yeah. We essentially have to figure out how to package up our product, which, like I said, is a bunch of different components, in some cases a bunch of different microservices or data sources; package that up, get all of that information and that deployment into the containers, and figure out how we're going to put it essentially anywhere. Bare metal is obviously another popular target, because military customers are not going to be running on AWS, for example.

You can imagine the manual task of having to sift through those manifests to find specific images, or trawl through lots of manifests to find annotations. That could be a real challenge, and I guess the end goal is a platform that requires as little extra combing and refactoring as possible after being built, so it can run in the likes of these air-gapped facilities, or any facility for that matter, no matter where you're going. I guess this is where we got to the point of the project where we had to come up with a solution.
I'll start off by walking you through what we call the orchestrator. We decided we wanted a centralised approach to the development platform, so we built the orchestrator: a single source of truth for tracking component changes on the Improbable Defence platform. The orchestrator is also our single backend data store, relied upon to hold all of the data required for tracking those changes and all the management we want to take place. It's also responsible for orchestrating infrastructure. In an ideal world, we want Improbable developers to be able to request an environment based off a specific version and just get to work from there; they don't want the hassle of deploying and configuring before they can test their changes. We also want the orchestrator to be able to autonomously execute tests; ideally, Improbable engineers shouldn't even need to spin up environments to test their code as they develop. They'd have a CI/CD-like pipeline which just shows their changes being escalated from an alpha stage all the way to general release. And finally, we want an always-available set of demo environments.

Yeah, just on those demo environments: that's useful even just between developers and business people within our company. It's really valuable to be able to say, look what we've created, and run a demo on it. We can quickly spin up a cluster, ping a link to somebody, and show off the work we've done. So that's a really useful capability for both developer and business communication, and it can even be used for product demos and things like that, which we have done.

Absolutely. So we spoke about this orchestrator platform, the central hub of the development cycle for the Improbable Defence platform. In our proposed solution, we wanted an automated system for generating new versions as changes are made in Git. No matter which team makes the changes, they're automatically picked up by the orchestrator cluster, and a new version is generated that can be consumed by those developers. That can then be taken onwards into the QA journey as those engineers integrate their changes further with the other components on the platform. The system designed by Improbable and Jetstack also, very importantly, needed to solve the problem of projects. We want the orchestrator to handle data and object separation between each project, but we also wanted dedicated versioning for those projects, as well as automated testing, and finally dedicated developer environments per project. There are external developers collaborating on these projects to make the synthetic environments come to life, and we need to make sure, from the Improbable Defence standpoint, that their access is limited to exactly what they need and no more.

Yeah, absolutely. And in terms of how you apply that, it can be achieved through RBAC and various other means. As we'll talk about in a minute, the namespacing of various jobs and things like that lends itself nicely to this sort of thing. That's all good. Well, I think we should crack on with the technology choices. Hopefully there's no OpenStack or Mesos to be seen. Not that I've got a problem with OpenStack or Mesos.
Oh yeah, you can deploy Kubernetes on OpenStack now, can't you? That's not something I've found the time to do, no.

Okay, so the grand reveal: what were our technology choices? I'm sure you're not all shocked to find out that our decision for the orchestrator cluster was to use Kubernetes, on Google Kubernetes Engine. As I mentioned, it's our nucleus for all operations, and it is our main backend store. We mentioned that we wanted to keep track of dedicated versioning inside the orchestrator cluster, and we achieved that using Kubernetes CRDs. I think this is an appropriate time to get some opinions from Oli on why we made this decision, and why we didn't go for an option such as a dedicated database for storing this information.

Yeah, there's a very easy answer to this question: the overhead of managing a database to store all of this information was just not an attractive proposition for us. We talked about ways of storing this versioning data for all the components and how we'd manage it, and Kubernetes can do this out of the box. You've got CRDs and you've got etcd. So etcd was a database for free, right? If you're running a Kubernetes cluster, it's already there; all you need to do is write some CRDs to store whatever data you want, and we've done that in a way we can use for versioning our product.
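To make that concrete, here is a minimal sketch of what a version stored as a custom resource might look like. The group, kind, and field names here are purely illustrative, not Improbable's actual schema:

```yaml
# Hypothetical custom resource illustrating versions stored as CRDs.
# The group, kind, and field names are illustrative, not the real schema.
apiVersion: orchestrator.example.com/v1alpha1
kind: PlatformVersion
metadata:
  name: platform-4f2a9c1
  namespace: customer-project-a        # per-project namespace separation
spec:
  hash: 4f2a9c1                        # unique hash derived from all component commits
  components:
    - name: simulation-core
      repo: github.com/example-org/simulation-core
      commit: 9b1d2e3
    - name: terrain-service
      repo: github.com/example-org/terrain-service
      commit: 77aa0c4
status:
  stream: alpha                        # alpha -> nightly -> release
  testsPassed: false
```

Because objects like this live in etcd behind the Kubernetes API, you get listing, watching, and RBAC on versions for free.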
The next thing we were able to achieve with our orchestrator cluster on GKE was the ability to vend and manage GKE development environments. The development environments that Improbable developers test their code against are generated in GKE using Terraform. We also have namespace separation with RBAC for our customer projects. This was a very powerful tool we were able to leverage in Kubernetes, and it gives exactly the sandbox we needed from project to project: somewhere to store confidential Secrets and ConfigMaps of data, but also to drive the generation of Google Cloud objects from inside those sandboxed namespaces for each project. Within that dedicated namespace separation, teams are able to make the relevant developments with partners and internal Improbable engineers to achieve the end goal of deploying the Kubernetes manifests to the customer site. The orchestrator cluster also houses some other cloud-native tooling: Prometheus and Grafana for monitoring and observability; Velero for disaster recovery, which is particularly important as we're using Kubernetes as our main backend store; Pomerium, an identity-aware proxy which gives us fine-grained access control on public-facing web applications; and ExternalDNS, for generating the web addresses we need to provide these tools to anyone we want.

On top of the orchestrator cluster, we also built a dedicated REST API. We felt it was important to abstract away the Kubernetes and infrastructure layer, so that engineers don't have to work out how to deploy specific versions of the platform based off all of these different components that are changing all the time. We wanted that handled by Kubernetes, with the Kubernetes layer abstracted away, and we wanted the automated workflows executing these tasks abstracted as well, so that the orchestrator is something you communicate with but do not need to control. So I guess, Oli, this was a really important design decision when we were trying to give developers a more efficient experience.

That's exactly right. My team's goal is to build tooling and increase developer velocity. We want developers developing fast, failing fast, fixing fast, and succeeding fast. So we don't want them messing around building their own YAML for deploying services and that sort of thing. We don't want them going into the Google Cloud console to spin up a GKE cluster, or writing Terraform or anything like that themselves. We decided to abstract all of this away, and what it means is we put a CLI in front of this REST API, so they've got one place where they can do all the things we've talked about. They can deploy a development environment, query it and get its information, and should they need to log into it, they can get that information very quickly, and apply pretty much anything they want on it. As part of that, what it does automatically for them is deploy our platform and the particular project components they're working on, so it's all there in one place. They can also use this tooling and API to query test statuses, results, and version information, so they know what's in each version. We wanted a one-stop shop. We didn't want people worrying about workflows or the orchestrator or anything like that; that just wants to be a black box sitting in the background, and we deal with it. So developers just have this nice CLI front-end, if you like, and it's very easy to use.

And I think providing it through a really simple CLI tool empowers developers to stay away from the stuff they don't want to be doing. Maybe they do want to be using Kubernetes directly; that might sometimes be the case, and then they're more than welcome to. But this gives them a mechanism to do the work they need to do, rather than spending lots of time on stuff they probably shouldn't. How has the experience been so far with the REST API sitting on the orchestrator cluster, mixed with the CLI tool built by the Improbable team?

I think so far it's had very good uptake from the teams, and we've had lots of positive feedback. One good thing is that we were able to develop fast against it as well, because we've got a lot of clever people working at Improbable, and they've got a lot of ideas about how things like this should look, so it was really good to get their feedback. So far it's been great. It's been a huge improvement over manually having to deal with Minikube, or even deploying the product on some VMs somewhere; that was just unsustainable and no good. We had to give them a way of quickly spinning up ephemeral environments, and so far it's working really well for us.

So at this point we haven't really spoken about one key ingredient: how are we automating all of these jobs? The tooling we used was Argo. We used a combination of Argo Workflows and Argo Events to do all of the automation that goes on inside this orchestrator cluster on GKE. The reason we chose Argo is that it fits right in with our design to make this whole orchestrator platform Kubernetes native. How is that the case? Well, its use of Kubernetes pods makes it truly language agnostic. Again in YAML, applied as custom resources, you're able to define workflow templates, and inside those workflow templates are manifests which describe the steps you want to take place.
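As a rough idea of the shape of such a template, here is a hedged two-step sketch: a git clone running on an Alpine-based image hands its checkout to a Terraform step as an artifact. The names, images, and repo URL are all hypothetical:

```yaml
# Hedged sketch of a two-step Argo WorkflowTemplate. Step one clones a
# repo with an Alpine-based git image; step two runs Terraform against
# that checkout, passed between pods as an artifact (archived to a
# configured store such as GCS). All names and URLs are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: vend-environment
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: clone
            template: git-clone
        - - name: apply
            template: terraform-apply
            arguments:
              artifacts:
                - name: repo
                  from: "{{steps.clone.outputs.artifacts.repo}}"
    - name: git-clone
      container:
        image: alpine/git:latest       # each step picks its own image
        command: [git]
        args: [clone, "https://github.com/example-org/envs.git", /src]
      outputs:
        artifacts:
          - name: repo
            path: /src
    - name: terraform-apply
      inputs:
        artifacts:
          - name: repo
            path: /src
      container:
        image: hashicorp/terraform:1.5
        workingDir: /src
        command: [sh, -c]
        args: ["terraform init && terraform apply -auto-approve"]
```

The key property is that each step gets its own image, so nothing forces the whole workflow into one language or toolchain.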
All of those steps are applied and executed in dedicated Kubernetes pods, so we don't have to stick to any specific language, and we don't have to pick one Docker image to use throughout the workflow; we can slide from one to the other. It also has integrations such as artifacts and parameters: if I do a git clone on an Alpine image in one step, I can take that repo into the next step and start performing Terraform actions on it, by passing it as an artifact which is backed by a data store like GCS.

The final area where I think Argo was a truly great choice was the integration between Argo Workflows and Argo Events. Through the development cycle of this platform there are many different components being developed by many different teams, but luckily what they all have in common is that they're all developed using some Git provider; in our case it's mostly GitHub. Argo Events allows us to have dedicated eventing that is instantly pluggable into Argo Workflows, to automate the tasks that are triggered by a GitHub webhook. It also has other really strong integrations, most importantly the ability to trigger off Kubernetes resource changes. All of our versions are declared in Kubernetes CRDs, and the same goes for the environments we vend into GKE. It means that when we vend an environment with Terraform, we can update the status on that environment resource, so that the developer on the other end, with everything abstracted away, can see its status as it's built up.

Yeah, we sort of spoke about this earlier. I tend to think of Argo Events and Argo Workflows as one and the same thing. The orchestrator has kind of turned into this one-stop shop for everything, and because the two work so seamlessly together, it's not like you're complicating things by adding multiple tools.

I think that's a very important point, and I experienced it as well; it really shows the strength of what Kubernetes can do alongside tools like Argo Events. So, continuing this theme of cyclical automation and autonomy: a component change is made in GitHub by an Improbable engineer. That triggers a webhook sent from GitHub to the event source pod sitting on the orchestrator cluster. The event source is notified and uses the event bus to send a signal to the sensor that is configured to execute a workflow based off that event. From there, the workflow checks the state of each component's GitHub repo, and it knows whether anything has changed, because when we generate a version we give it a unique hash based off the components configured for the platform. It checks out the specific commit of each of those component branches, generates the unique hash, and, provided that hash hasn't been applied to the cluster before, the new version is applied to the cluster. That version can then be consumed by the Improbable developer who made the change in GitHub, and they can start spinning up environments and checking how their change was received inside the wider platform.

Yeah, so that's the developer's story you just talked through there, and it's working for us, which is really nice.
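To illustrate that wiring, here is a hedged sketch of a GitHub EventSource and a Sensor that submits a version-generation workflow in response. Exact fields vary between Argo Events versions, and all names, repos, and endpoints are illustrative; webhook authentication and the externally reachable URL are omitted:

```yaml
# Hedged sketch: GitHub push webhook -> event source -> event bus ->
# sensor -> workflow submission. Secrets, URL registration, and some
# version-specific fields are omitted for brevity.
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: github
spec:
  github:
    component-push:
      repositories:
        - owner: example-org
          names: [simulation-core, terrain-service]
      webhook:
        endpoint: /push
        port: "12000"
        method: POST
      events: ["push"]
---
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: version-on-push
spec:
  dependencies:
    - name: push
      eventSourceName: github
      eventName: component-push
  triggers:
    - template:
        name: generate-version
        argoWorkflow:
          operation: submit
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: generate-version-
              spec:
                workflowTemplateRef:
                  name: generate-version   # checks out commits, computes the hash
```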
I guess another very fascinating part of this is that once a new version is generated, it triggers a whole load of other stuff, including automated testing. So what goes on there in terms of automated testing once the version is created?

Yeah, so we want to run integration tests for each version, and we've introduced the idea of streams. When you create a new version, that's essentially where the QA journey for that version of the product starts, so it gets promoted to the alpha stream. At that point some tests run, and they can be configured per version: you can add as many test repos to a version as you want, each containing a test suite. Then, if those tests pass, a slightly longer set of tests runs; those tend to be a nightly testing situation, so they run each night on that particular version. If they pass, we can promote the version further, and eventually we get to a release, which at the moment is a manual process for us, with a sort of product owner sign-off. And again, that can be done through the API. So all of that is automated except for that last release step, where we do the product owner sign-off.

I think that's truly awesome, because from a developer's perspective they can just push the change they made in Git, and what this orchestrator cluster goes and does is take that change through the whole journey; it doesn't matter if they've gone off for a coffee. It really opens up the doors of what the development cycle of a platform like this can look like.

Yeah. So, one other idea: we've got developers spinning up clusters left, right, and centre to do development work on, and we need to manage cost and things like that. The point of being able to deploy a version is that it's going to be the same every time, so these environments don't need to be up all the time; developers can just spin up a new one every morning. So what we've done is put a time to live on each environment, based on an extra field on the CRDs, and we've got another Argo workflow that looks at the timings for each environment, figures out how long it's been up, and if it's been up for a certain amount of time, it will destroy it. It runs Terraform and, cleanly, which is very important, cleanly does a terraform destroy, bringing everything down and removing all the resources in the cloud. So that's an example of the eventing and triggering we have in place with Argo.

I think that's a really exciting part of our implementation, and it plays further into our choice of Kubernetes as our data store, because that time to live is described in a Kubernetes annotation. It's a very subtle piece of metadata attached to the resource, but what it allows us to do is decide, without doubt and with authority, a date and time at which the environment should be decommissioned. If there's one thing project managers like more than a successful project, it's being able to keep it as low-cost as possible, and I know it's been very well received that we're able to keep these costs down. One other thing as well is persistent disks. We spin our GKE environments up and down with Terraform, but GKE keeps hold of the persistent disks inside the Google Cloud project. We were able to avoid that by adding a simple step to the Argo workflow; it's not something we could easily integrate inside the terraform destroy step itself, so why not just add an extra workflow step?
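As a sketch of that time-to-live idea, here is roughly what the annotation and a scheduled reaper could look like. The resource kind, annotation key, schedule, and images are assumptions for illustration, not the actual implementation:

```yaml
# Hedged sketch of the environment TTL: an expiry annotation on each
# environment object, plus a scheduled CronWorkflow that finds expired
# environments. Kinds, keys, images, and the schedule are assumptions.
apiVersion: orchestrator.example.com/v1alpha1
kind: Environment
metadata:
  name: dev-env-oli
  annotations:
    orchestrator.example.com/expires-at: "2021-05-07T18:00:00Z"
---
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: environment-reaper
spec:
  schedule: "*/30 * * * *"             # sweep every 30 minutes
  workflowSpec:
    entrypoint: reap
    templates:
      - name: reap
        script:
          image: alpine/k8s:1.25.16    # assumed to ship kubectl and jq
          command: [sh]
          source: |
            # ISO-8601 timestamps compare lexicographically, so a plain
            # string comparison against "now" is enough here.
            now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
            kubectl get environments.orchestrator.example.com -A -o json \
              | jq -r --arg now "$now" '
                  .items[]
                  | select((.metadata.annotations["orchestrator.example.com/expires-at"] // "9999") < $now)
                  | .metadata.name'
            # Each name printed above would then be handed to a
            # terraform-destroy workflow (not shown) to tear the
            # environment down cleanly.
```

The important property is that the expiry lives on the resource itself in etcd, so any workflow, or a human with kubectl, can see exactly when an environment is due to be decommissioned.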
So those were the areas where we thought Argo Workflows and Argo Events were a great pairing for our work on the delivery service for the Improbable Defence platform. Through this presentation we have taken you through the story of how we took a selection of problems that Improbable Defence were experiencing in the development of their platform, and how, to combat them, we built a secure, Kubernetes-native delivery service which automates the test and release cycle, giving Improbable the flexibility to expand further into separate iterations while enabling easy collaboration with external entities.

I'd like to thank you all very much for coming. I hope you're having a really good KubeCon, and if you have any questions whatsoever, we'll be in the Q&A afterwards, so feel free to ask them there. If you have any questions about Jetstack, or you're interested in the sort of work we've discussed in this presentation, feel free to get in contact with me; in fact, feel free to get in contact with me about anything, Kubernetes-related or not. So thank you very much for coming to the talk, thank you KubeCon, and thank you CNCF for inviting me on to speak. Thank you also to the entire Improbable Defence team that we've worked with throughout this project; you've all been fantastic to work with, and it's been a joy delivering this service for you. And well done to the Jetstack team on my side, who have helped me along the way with all the challenges I've faced throughout. So, Oli.

Yeah, thank you very much, Tom. It's been fantastic to work with you on this project. And to everyone out there, mirroring what Tom said but for ourselves: if you're interested in what we're doing with synthetic environments, the problems we're trying to solve, simulating thousands or millions of entities at scale on a Kubernetes-based platform with Kubernetes-based delivery, get in touch with us as well. We're trying to solve a lot of interesting things, and we're hiring. So thank you very much for listening.