Of course, I had to change up my shirt to introduce the next speaker, who is Chris Smith from Pulumi. And the folks at Pulumi were cool enough to send me this shirt that has this pretty neat looking platypus on it, and it says Pulumi on the back. I think it's pretty fun; when I wear it out, I get compliments on it. And I'm sure you're going to get some compliments on his talk, which is going to tell us about continuous delivery and infrastructure testing and how they fit perfectly together. I'm excited to check it out. I'll see you in the chat.

Hello, I'm Chris Smith. I've worked at a couple of companies you've heard of, and currently I'm at Pulumi, where I spend a lot of time thinking about CI/CD and cloud infrastructure. And so today, I'd like us all to answer a single question: is it possible to move faster and not break things? It kind of seems obvious, right? Similar to the secret to weight loss being diet and exercise, if you want to move faster and not break things, you're going to have to have more confidence in the changes that you're making to your code and to your infrastructure. However, that's a little easier said than done. So let's talk about what the problem is in modern workloads that we need to address. Why can't we just run more tests for cloud applications? Well, first, let's take a step back and look at kind of an academic view of how software testing has been done in the past. Typically, it'll be stratified across different levels, with your fastest, easiest-to-write tests in the unit testing layer; then, as you introduce mocks or stubs, integration tests for where software components come together; and finally, the largest, most expensive types of tests: systems tests or end-to-end tests that cover all the functionality. But there's a problem here, which is: how does this apply to the cloud?
Right, if you're testing cloud software, there are a lot of components that, say, you don't control. Your application code may cover, say, the business logic or how a login is performed, but if you're relying on a hosted database server or some networking component so you can actually receive those web requests, then it becomes a lot more tricky. And so when you want to test cloud software, there's more than just the code that you've written. It's where that software will run, how that data is stored, and which users have access to it. And all of these are controlled and configured in some cloud provider, whether it's Azure, AWS, GCP, or so on. And so, okay, great, we understand the problem. Let's just test that. Well, if you're familiar with your cloud resource console, it's not something that's very amenable to testing. This is a screenshot from the AWS console. And if you needed to go in here and click around and change settings every time you wanted to verify a fix or run some tests, you wouldn't have a great time. And so the solution that a lot of people have started moving towards is something called infrastructure as code. And this is the silver bullet that will enable us to actually test our infrastructure, because once we start encoding our infrastructure in actual program code, we can then inspect it and validate it. So what is this concept? Well, it's infrastructure, but as code. Rather than needing to manually click around in a resource provider console, you can just define how you want your cloud resources to be configured directly in code or a configuration file. Now, in the past, you may have heard of configuration as code as kind of a precursor to this. But if you start to use actual code, like real programming languages, you can now take advantage of the benefits of real programming languages and the entire ecosystem around them.
So I'll go into some details on infrastructure as code. Some common tools for this are Pulumi, AWS CDK, and the most popular infrastructure as code solution, HashiCorp's Terraform. One of the benefits of using infrastructure as code is that you can express your infrastructure in a much more abstract or expressive way. Rather than having just a flat list of resources and seemingly arbitrary configurations, you can start to encode things: if you need to create a set of resources, use a for loop; if you only need to create a certain resource at certain times, use an if statement. This all lends itself well to making your cloud infrastructure easier to manage, maintain, and set up, especially as your infrastructure requirements grow. So the way it all works is essentially three steps. First, you write some code that creates cloud resources, say by calling a constructor or whatnot. When you run that program, it'll evaluate those resources in memory and create the resource graph, and that will be a goal state, so to speak. If you diff that goal state against the resource state from, say, the last time you ran your update, you can see what changes need to be performed; that's a preview of your deployment. And then if you actually want to apply those changes, you would update your resources. That is what would go and contact your cloud resource provider and actually create new resources or delete existing ones. So with this ability to encode cloud resources within application code, we can now start to test it in ways that allow us to understand and validate not only the application code that we've written, but also the environments that application will run in, and have more confidence that we're not going to break production if we ever, say, touch some networking change.
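That write, preview, update loop can be sketched in plain TypeScript. To be clear, this is a simplified illustration of the idea, not Pulumi's actual engine: we model resources as named bags of properties and compute a "preview" as the diff between the last deployed state and the new goal state. All names and values here are made up for the example.

```typescript
// Hypothetical, simplified model of an infrastructure-as-code engine:
// resources are named bags of properties, and a "preview" is just a
// diff between the last deployed state and the new goal state.
type ResourceState = Record<string, Record<string, string>>;

type Change =
  | { op: "create"; name: string }
  | { op: "delete"; name: string }
  | { op: "update"; name: string };

function preview(current: ResourceState, goal: ResourceState): Change[] {
  const changes: Change[] = [];
  for (const name of Object.keys(goal)) {
    if (!(name in current)) {
      changes.push({ op: "create", name });
    } else if (JSON.stringify(current[name]) !== JSON.stringify(goal[name])) {
      changes.push({ op: "update", name });
    }
  }
  for (const name of Object.keys(current)) {
    if (!(name in goal)) changes.push({ op: "delete", name });
  }
  return changes;
}

// Last deployment had one VM; the new program resizes it and adds a table.
const current: ResourceState = {
  "backend-vm": { instanceType: "t2.nano" },
};
const goal: ResourceState = {
  "backend-vm": { instanceType: "t2.micro" },
  "events-table": { hashKey: "eventId" },
};

console.log(preview(current, goal));
// → an "update" for backend-vm and a "create" for events-table
```

A real engine also tracks dependencies between resources and orders the operations accordingly, but the diff-against-goal-state idea is the core of the preview step.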
So if we're going to put this all together, we need some sort of application that's not too complicated, but not, you know, basic. And so what I'd like to do today, and I'm a little excited, is the worldwide launch of Countdown as a Service. I'm actually thinking about quitting my job and getting some of that sweet, sweet VC money. The app is simple: if there's ever an exciting event coming up that you want a notification about, say a new text message or an email every day, this software as a service package will do that. Or maybe there's an anniversary coming up and you want a heads up a few days before: there you go. It's still a work in progress, but let's talk about how we could actually test this application and verify that everything's on the up and up. So here's its overall architecture. It's straightforward: it runs on a single EC2 VM on AWS, and it reads and writes data using DynamoDB, a NoSQL database. And then separately, a regular CloudWatch event fires, say every day, which then runs an AWS Lambda to read from that database and send out messages as needed, say to all the subscribers for various events. So again, the goal here isn't to be a super fancy application, but just to have enough infrastructure that we can think about how we would actually test this and have more confidence making changes. So let's talk about, well, what's the view of unit testing when we're looking at it through the lens of cloud resources? Well, one of the benefits of using infrastructure as code is that you have the ability to preview changes. That is, build the resource graph in memory, the way things would be created or configured, without actually needing to go and create them. And this allows us to write unit tests for the expected inputs to resources. So in this image, you'll see the DNS record for get.pulumi.com, which is a service that we use at Pulumi for distributing plugins and binaries.
And so the DNS record has several inputs, such as aliases or zone IDs. And what we can do is, in our code, verify that those inputs are what you would expect. So tools that you can use for doing this type of testing are Pulumi and AWS CDK. When you think about testing resource inputs, the thing to keep in mind is that it's well suited for asserting the requirements of your infrastructure, that is, the things that are important enough that they need to be enforced even as the application evolves. And so there are several areas to look at, such as permissions; making sure that dependent services have the right setup and configuration to actually work; or that, say, any SLOs that your team or group is trying to adhere to are actually supported by your infrastructure. And this is actually something that can be surprising if you think about it. For example, if you want to have, say, two and a half nines of reliability, that would mean that whatever systems you have for monitoring or alerting need to have a particular degree of sensitivity in order for you to not only find out about a problem, but then take some action, like doing a failover or sending traffic to a replica or whatnot. So let's look at how this would actually work in practice, just as an example of this type of testing. What I have here is the source code for the super exciting Countdown as a Service. You can see this is the Pulumi source code where we're creating these resources. In this case, however, all of the resource creation is done in separate files, and then we just export these properties. So if we want to run the unit tests for our infrastructure, not our application code, then all we need to run is npm run test. Because the infrastructure is written in code, we can use the same testing tools and frameworks that real programming languages use; in our case, using TypeScript and JavaScript, that's Mocha.
And so for the tests for this application, as you can see, all the tests have passed. Here's how we wrote them. First, we mocked out the part of the Pulumi engine which actually constructs resources. So rather than, say, comparing resource state with some previously known state, or contacting the cloud resource provider to make changes, we just want to keep it all in memory. So this is essentially some boilerplate code that says don't do anything with those resources. And then we can just launch right into writing tests. This is the same sort of setup you'd see for a prototypical JavaScript test, but we're getting the Pulumi cloud resources that would have been created and then checking that their inputs are as expected. So in this particular test, let's say the organization requires that all resources have some particular tags, like a name or an owner, so that those resources can be reclaimed later. We then check that the VM's tags have those properties. Similarly, another type of test we can have is verifying that the instance size of a machine is some known class. So sadly, until I can fully bootstrap my new startup, I'm just gonna have to reside on t2.micro instances. And you can see how, again, this just allows you to encode the requirements of your infrastructure in code. So for example, when it comes to the DynamoDB tables, the application code requires, or assumes, that certain indexes exist on the data store. And so because of that, when we create those cloud resources, we can assert via unit tests that, hey, this is the hash key or sort key for that database. So to quickly recap, infrastructure unit tests provide a very quick way to verify changes. It doesn't require contacting the cloud or using any external APIs; instead, it all just runs on your local machine. The trade-off, however, is that you can only validate resource inputs.
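The demo's real tests use Mocha plus Pulumi's mocking support; since that code isn't reproduced here, below is a dependency-free sketch of the same idea: treat the resource inputs as plain objects and assert that the organization's requirements (required tags, instance class, DynamoDB keys) hold. The resource shapes, names, and values are all illustrative, not the actual Pulumi SDK types.

```typescript
// Dependency-free sketch: the "resources" are just the input objects our
// infrastructure program would hand to the cloud provider.
interface Ec2InstanceInputs {
  instanceType: string;
  tags: Record<string, string>;
}

interface DynamoTableInputs {
  hashKey: string;
  rangeKey?: string;
}

// Illustrative inputs, standing in for what the IaC program produces.
const backendVm: Ec2InstanceInputs = {
  instanceType: "t2.micro",
  tags: { Name: "countdown-backend", Owner: "chris" },
};

const eventsTable: DynamoTableInputs = {
  hashKey: "eventId",
  rangeKey: "timestamp",
};

// "Unit tests": assert the organization's infrastructure requirements.
function checkRequiredTags(tags: Record<string, string>): void {
  for (const required of ["Name", "Owner"]) {
    if (!(required in tags)) {
      throw new Error(`missing required tag: ${required}`);
    }
  }
}

function checkInstanceClass(inputs: Ec2InstanceInputs): void {
  // Until the startup is bootstrapped, we stay on burstable t2 instances.
  if (!inputs.instanceType.startsWith("t2.")) {
    throw new Error(`unexpected instance class: ${inputs.instanceType}`);
  }
}

checkRequiredTags(backendVm.tags);
checkInstanceClass(backendVm);
// The application code assumes this index exists on the data store.
if (eventsTable.hashKey !== "eventId") {
  throw new Error("application code assumes the eventId hash key");
}
console.log("all infrastructure unit tests passed");
```

In the real setup, the input objects come out of the mocked Pulumi engine rather than being declared inline, but the assertions themselves look much like these.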
It doesn't actually look at existing resources on the cloud, which means that you can't look at some complex sorts of setups. Okay, so we've looked at that bottom-most layer of the test hierarchy. Well, what does integration testing look like from the view of cloud applications and looking at live resources? So I'm gonna tell you what my plan is if my new startup doesn't work out. I'm gonna build a time machine, and I'm gonna build a consulting firm. And all my consulting firm is gonna do is tell people which AWS S3 buckets have public read access. And I'm gonna tell companies, you can give me millions and millions of dollars and I will prevent these horrible data breaches that you see all the time. The next approach that we'll be using is something called resource policies: ways that you can essentially take the same types of unit tests that we wrote before, but apply them before a resource is created, for example preventing bad changes from ever being made to the cloud, or inspecting resources after they've been created and confirming that they adhere to various policies. And by doing so, we can not only verify that there's a minimum bar of quality and security across our cloud infrastructure, but we can also make sure that the live resources we have don't have common sorts of problems. Some tools that you can use today to do this sort of thing are Pulumi, HashiCorp Sentinel, or Open Policy Agent (OPA). Similar to infrastructure as code, policy as code is, well, kind of the same thing: it allows you to take the types of validations that we wrote in unit tests, but apply them as a policy that can then be checked across every resource change that happens in an organization. It essentially raises the minimum bar for resources within your company. So let's take a look at how to apply this policy as code in practice.
So here we have our infrastructure code for the Countdown service. Where does policy come into play? Well, this is perhaps a very boring demo, but if we go to GitLab, where the code for this application all resides, I'm just gonna click this link to the Pulumi stack where this exists. And so this is the production instance of that application. We can see all the resources and the update history for the app. And if we click into these details, you can not only see the changes, but the policies already being checked. In fact, we can drill in here and see that, within Pulumi, my organization has been configured to apply the Pulumi AWSGuard policies across every stack update. And so whenever I make changes, a collection of AWS-specific resource policies will be run. And sure enough, we can see a couple of resources in my stack have some violations. So let's drill into the backend VM. This is the EC2 instance that is serving my application. And these policies have found: hey, you aren't using Elastic Block Storage; that's not great, because then if the instance goes away, so does its data. Also, hey, you're not encrypting your data, so if somehow the machine were compromised or whatnot, that wouldn't be good. Similarly, you know, not having detailed monitoring. And so you can see how just applying common policies, especially ones that are open source, with a lot of developers and a lot of mind share working on them and encoding best practices, can be useful in raising the bar for your own cloud applications and infrastructure. So to recap, resource policy tools allow you to validate your resources and inspect the actual state, not just the inputs. And this allows you to not only prevent bad changes from happening, but also allows for more complex interdependent resources.
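The three violations just described (no EBS, no encryption, no detailed monitoring) can be sketched as a policy function. This is a hand-rolled illustration of the policy-as-code idea, not the actual Pulumi CrossGuard or AWSGuard API, and the state field names are assumptions made up for the example.

```typescript
// Hand-rolled sketch of policy as code: inspect a resource's (possibly
// live) state and report violations. Field names here are illustrative.
interface Ec2InstanceState {
  usesEbsVolumes: boolean;
  rootVolumeEncrypted: boolean;
  detailedMonitoring: boolean;
}

type Violation = string;

function checkEc2Policies(state: Ec2InstanceState): Violation[] {
  const violations: Violation[] = [];
  if (!state.usesEbsVolumes) {
    violations.push(
      "not using Elastic Block Storage: if the instance goes away, so does its data"
    );
  }
  if (!state.rootVolumeEncrypted) {
    violations.push("root volume is not encrypted");
  }
  if (!state.detailedMonitoring) {
    violations.push("detailed monitoring is not enabled");
  }
  return violations;
}

// The backend VM from the demo trips all three policies.
const backendVm: Ec2InstanceState = {
  usesEbsVolumes: false,
  rootVolumeEncrypted: false,
  detailedMonitoring: false,
};

console.log(checkEc2Policies(backendVm).length); // 3
```

In a real policy engine, a function like this runs against every resource in every stack update across the organization, and violations can either warn or block the deployment outright.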
For example, if resource A's inputs depend on resource B's outputs, you actually couldn't test that using the unit testing approach that I went over earlier. So now we're onto the hardest part: systems testing. What do you do when you wanna test the full application, run your end-to-end tests, and actually use all the microservices and bells and whistles of your environment? Well, if you're using infrastructure as code, this is actually kind of straightforward: you can just create an ephemeral environment. That is, you create real cloud resources, you really stand them up, and then you can use the real application. In these two images on the right, one is just showing the preview of changes that would be made when making an update. This is from the actual Pulumi service itself, and this particular change was going to add a couple of metric alarms. Now that's kind of nice, it's good information, but you know what would be even better? If, say, you used a tool like Pulumify, which would stand up the Pulumi stack and then just point you to a URL that you could test and interact with, so you could actually see whatever changes or features live on your cloud. And so this is a very powerful tool that gives you a high fidelity view of the results of your changes. You can set up ephemeral environments using Pulumi, Terratest, and sort of any CI/CD system if you configure it correctly. So what would you need in order to create an ephemeral environment? Well, when creating ephemeral environments, there are a lot of trade-offs to make, because every cloud application is a little different. Every team has different needs for the sorts of things that they want to be easier to validate, or what's important to look at.
So for example, if you're standing up a very sophisticated application, it may take a long time to create all those resources. And so maybe you'd only want your ephemeral environment to stand up a subset of it so that it's quicker to run and validate. Another trade-off could be, say, cost: if your environment requires $30 an hour to run because it needs a bunch of high-demand resources, maybe that wouldn't be the greatest thing to stand up. So there are a lot of ways that you can configure or tune your ephemeral environment to get it just right for something that works well for you. Ultimately, ephemeral environments, like policy as code, like unit testing of infrastructure, are all techniques that you can apply towards gaining more confidence in your cloud applications. And the question is ultimately, well, how do you find the best way to apply these tools for your app? So when you're running these ephemeral environments, what do you do to get the most out of them? First, I would suggest always looking at dependent services and components. That's probably where you'll get the most value, because it's usually the thing that will likely break. Similarly, deployment safety is another suggestion that I would highlight. Not only is it beneficial to stand up an ephemeral environment and see what the change is, but there's also an implicit transition that takes place when you deploy those changes. And sometimes, if you're not careful, deploying those changes to your infrastructure could result in some downtime. It could be the case that when everything's all said and done, your app's working great. But will there be a few seconds, or maybe a few minutes, in which your application isn't reachable or will just fail all of its requests?
For example, there are times where this is unavoidable, like when you're switching between two databases or standing up new infrastructure. But by using the ephemeral environment as a way to simulate these changes, you can then run some sort of tool that will simulate load, and expect that it has continual success while the infrastructure transitions to its newer state. So let's look at how we can set up ephemeral environments in GitLab as an example. This was actually straightforward to do. If we go to our GitLab CI configuration, we can just create our ephemeral environments on merge requests and wire through the merge request ID, that unique identifier. And so, to do a little Julia Child, you know, put something in the oven: if we go to our production instance of Countdown as a Service, you'll see this beautiful modal that we have on the homepage. But you can see that in Safari, it does not look all that great. And that is because CSS is very hard, in case you did not know. And so if we go to our GitLab app, we can click some totally random Sass file, and wow, maybe we just need to uncomment this one line of code so that it'll render correctly. You kind of get where I'm going here. Well, to save us the time of creating a merge request and then waiting for the ephemeral environment to stand up and get deployed, I've already created one. You can see here's the pipeline, and in the merge request, we also previewed the changes to production. And so within GitLab, because we're using a dynamic environment here, there is a link to the merge request version of this application, that ephemeral environment. And sure enough, by changing that CSS property, everything has been fixed. Huzzah, right? The day is saved.
So this all happened automatically, by configuring our GitLab pipelines and hooking them up with the right lifecycle triggers so that they would stand up the ephemeral environment, which is then visible in our merge request. But one thing that is kind of unfortunate is that although the pipeline ran and we previewed what changes would happen in our production environment, for example, that it still has these policy warnings, that information isn't displayed in the merge request. Like, I don't know, if only there were a way to have Pulumi, or whatever tool you're using to manage your infrastructure, surface that information on the merge request. Well, there's a session later today by a colleague of mine, Praneet Loke, and he'll be looking specifically at this problem and seeing what you can do to improve the experience within GitLab. But anyways, I'm gonna go ahead and merge this merge request into my repo. And then you can see a new pipeline has been started to build and test the application. But now you'll see that there's this new job for tearing down the ephemeral environment, so that the environment that was created specifically for that merge request will be destroyed, the resources reclaimed, and so on. So, so far we've seen how ephemeral environments can allow you to exercise your full application with all the bells and whistles, so that you can better understand what changes are being made, not just to your application code, but in the context of the changes to the infrastructure that your application runs in as well. And overall, we've seen how we can apply unit testing towards cloud infrastructure, apply resource policies towards cloud infrastructure, and then finally use these full ephemeral environments. And so, to extend a perhaps awkward metaphor, and we'll see how well it fits: some of you may be familiar with the video game Tetris.
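The lifecycle wiring described above might look roughly like this in a .gitlab-ci.yml. This is a sketch under assumptions: the job names, stack naming scheme, and scripts are illustrative, not the talk's actual pipeline, and it assumes a Pulumi stack is created per merge request using GitLab's predefined CI_MERGE_REQUEST_IID variable.

```yaml
# Sketch of per-merge-request ephemeral environments with Pulumi.
# Job names and scripts are illustrative, not the talk's actual pipeline.
deploy-review:
  stage: deploy
  script:
    # One Pulumi stack per merge request, named by the MR's unique id.
    - pulumi stack select review-$CI_MERGE_REQUEST_IID --create
    - pulumi up --yes
  environment:
    name: review/$CI_MERGE_REQUEST_IID
    on_stop: stop-review
  rules:
    - if: $CI_MERGE_REQUEST_IID

stop-review:
  stage: deploy
  script:
    # Tear down the ephemeral environment and reclaim its resources.
    - pulumi stack select review-$CI_MERGE_REQUEST_IID
    - pulumi destroy --yes
    - pulumi stack rm --yes
  environment:
    name: review/$CI_MERGE_REQUEST_IID
    action: stop
  when: manual
  rules:
    - if: $CI_MERGE_REQUEST_IID
```

The environment and on_stop keys are what make GitLab show the "view app" link on the merge request and run the teardown job when the MR is merged or closed.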
And all of these tools and techniques, that's all the goo in the center there, setting you up for success. And then finally, your continuous delivery pipeline is that wonderful four-point vertical slice that you're gonna drop down into it, and it all just works together, and you will be able to move faster and have confidence that you're not breaking things. If you would like to learn more about infrastructure testing, you can follow me on Twitter, and by the time of GitLab Commit, there will be many more resources available on how to configure your GitLab pipelines, and I'll be tweeting about those extensively. Thank you very much, and have a wonderful afternoon.