Appropriate because we're about to talk about refactoring infrastructure code, and like another song from that movie, we are stepping into the danger zone. So let's go ahead and get started. I'm Nell Shamrell-Harrington and I'm a software engineer at Chef. Although I currently live in Seattle like Jill, I also used to live in Southern Illinois, in O'Fallon, Illinois, just across the border from St. Louis. I'm glad we have some Southern Illinois representation here. So feel free to tweet me at @nellshamrell at any time or email me at nellshamrell.io. Now, in my work at Chef I have refactored a good deal of infrastructure code, some good, some bad, and I would like to share those lessons with you today. So let's first talk about why you might want to refactor your infrastructure code. You might find yourself needing to refactor when you need to add a feature and it's impossible with the code in the state it's currently in, when you need to fix a bug, when you need to improve the design of the code to make it easier or even possible to work with, or when you need to optimize resource usage. This can include things like memory or hard disk space, and in the example we're about to go through, we're going to optimize our use of AWS resources. So it would be one thing for me to tell you about how to refactor, but today I would like to show you a real-life example. In this example we're going to refactor a Terraform config. How many people here have used Terraform? All right, we've got a few hands up there. Terraform is a delightful provisioning tool that allows you to provision infrastructure, whether that's VMs, containers, and more. And I like to say that using Terraform makes me feel like the character Elsa from the movie Frozen. I just go "shoo" and all of a sudden I have an AWS cluster built. Now, the problem is that changing your Terraform code without doing it carefully can go just as wrong as the ice magic goes in that movie.
So today the Terraform config we're going to refactor is a real one. It's at github.com/nellshamrell/supermarket-terraform, if you'd like to go ahead and follow along. Now, when I refactor code, whether it's application code or infrastructure code, I like to start out by mapping the major components of the code. And I'm a visual learner, so I prefer doing this in a visual way. So this is the first part of our Terraform config. This is the main file, called supermarket-cluster.tf. In this file, I'm using certain variables needed when I spin up my infrastructure. In this case I'm spinning up AWS instances, so I need to include things like my AWS access key and my secret key in order to spin things up in the correct account. I define these variables in a file called variables.tf. That's the second major file within my Terraform config. And finally, I provide the values for these variables in a config file called terraform.tfvars. So these three files are the core files within this Terraform config. Now, some of you who have used Terraform with AWS might be asking, wait, shouldn't I use an AWS credentials file rather than passing my credentials inline? And the answer to that is you certainly can, and that's even considered the best practice. But for the sake of this example in this presentation today, I'm gonna go ahead and pass them inline. So let's take a closer look at supermarket-cluster.tf. Now this file has way more code than I can show on a slide. There's nearly 300 lines in it. So instead of showing that code, again, let's go through a visual representation of what it does. So when I run terraform apply with this config from my workstation, it's first going to create an AWS security group. Then it adds a rule to allow SSH access into any instance that I attach the security group to. Then it spins up a second security group and adds another rule to it.
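As a rough sketch of how those three core files fit together (the variable names and values here are illustrative, not taken from the actual repo):

```hcl
# supermarket-cluster.tf -- the main config uses the variables
provider "aws" {
  access_key = "${var.aws_access_key}"
  secret_key = "${var.aws_secret_key}"
  region     = "${var.aws_region}"
}

# variables.tf -- declares the variables
variable "aws_access_key" {}
variable "aws_secret_key" {}
variable "aws_region" {
  default = "us-west-2"
}

# terraform.tfvars -- supplies the values (never commit real credentials)
# aws_access_key = "AKIA..."
# aws_secret_key = "..."
```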
This rule will allow traffic to flow from within my AWS instances out into the public internet. Then it spins up an EC2 instance and configures it to be a Chef server. The next component is another EC2 instance, and it configures this to be a Supermarket server. Now, Supermarket is another Chef product that can be used in conjunction with a Chef server. It provides an easy-to-navigate artifact repository for sharing Chef cookbooks and tools. So the next thing it's going to do is configure Supermarket to use that Chef server for authentication and authorization. And then it's going to configure the Chef server to be aware of the Supermarket server. And I can do all of this from one Terraform config. Finally, it's going to configure the workstation of the user who's running that terraform apply command to use the newly spun up Chef server whenever it needs a Chef server, and to use our newly spun up Supermarket. And that's the end of the config. So, spoiler alert: the config we're going to be refactoring is one that I wrote. It was actually the first Terraform config I ever wrote. The whole point of it is to spin up a complete cluster for development and testing for Chef Supermarket. So let's pose a hypothetical. Let's say management says that we are using too many AWS security groups in our AWS VPCs, or virtual private clouds. The default limit for a VPC is 500 security groups. And yes, you can increase that limit. But again, for the sake of the example, let's say our company does not want us to increase it. So to meet this hypothetical, we need to change this config to create only one security group rather than two. So this is the why of why we need to refactor. I always like having a specific reason that I'm going to refactor code. Now let's go over how to refactor.
And there are two common approaches to refactoring, whether it's application code or infrastructure code, according to the book Working Effectively with Legacy Code. The first is what's called edit and pray. This involves making a change, often a major change, deploying it, and just hoping and praying that it works. I've sometimes heard this referred to as PDD, or prayer-driven development. The second approach is called cover and modify. This one is harder, but it's much more effective and much less risky. It involves covering the section of code you want to change with automated tests and then making the changes. So last year, I presented this approach in a talk on refactoring application code. And during Q&A, someone grabbed the mic and said, and I quote, "This is a bad approach. You should just look at the code and change the code and know what you're doing." Now, my reaction to that was: I don't have that much faith in myself to make changes cowboy style and have nothing break. In fact, I don't have that much faith in anyone. And this is because confidence in code without tests is false confidence. I don't trust a human, least of all myself, to know all the ins and outs of how my code currently works, all the unexpected side effects of the code which other parts of the code might be depending on, and that it will keep that functionality after I make changes. And this is because what the code is intended to do is much less important than what it actually does. The only way to know what my code actually does is to actually execute it. That is the ultimate source of truth. And the beauty of automated tests is they allow me to execute my code automatically. So when I present this approach, sometimes people immediately dismiss it as being impractical. If they have a giant legacy code mess, how in the world are they supposed to get the whole thing covered in tests?
They ask me, are they supposed to suspend new work until they can add tests for everything in the code base? And the answer to this is unequivocally no. Even if it were possible, it would be very impractical to try to cover a whole code base in tests, with all its associated templates and other support files, as it is right now. And frankly, good luck selling that to management. So the point of adding tests, the reason we add them when we are refactoring, is to not make things worse, to preserve the current functionality of the code at the very least, and to start making the code better here and now. So the next question is, how can we test Terraform? To do this, I'm going to harness three open source tools. The first is Test Kitchen. How many people here have used Test Kitchen? All right, we've got a number of hands up. So Test Kitchen is a delightful tool that allows us to spin up a VM or container, configure it with our Terraform config, and then run tests on it to verify that our code did what we expected it to do. If you want to find out more about Test Kitchen, head on over to kitchen.ci. Now, Test Kitchen was originally written for use with Chef cookbooks, but it can also be used with Puppet manifests, Ansible playbooks, and more. In order to use it with Terraform, we're going to use a special provisioner called Kitchen Terraform. This just came out in the last month. Kitchen Terraform comes from a company called New Context. We need this provisioner to not only be able to provision new AWS EC2 instances with Terraform, but also to be able to SSH into those instances and run our tests. You can find out more about it at github.com/newcontext/kitchen-terraform, and I'll be tweeting that link out after my presentation. So in order to use both Test Kitchen and Kitchen Terraform, we need to configure them through a .kitchen.yml file in our config's repository. So the first thing we need to do is configure the driver.
This defines where we want our instances to be spun up. In this case, we're saying we want them to be spun up through Terraform, and the Terraform config itself will say we want to spin them up in AWS. Second, we define our provisioner. This is what will actually create the instances. Conveniently, our provisioner is also called Terraform, and we also need to tell the provisioner which file contains the values for the variables our Terraform config uses. In this case, remember, we're defining those values in a file called terraform.tfvars, so we tell our provisioner that's where it should find those values. And in order for Kitchen Terraform to run tests on the instances we spin up in AWS, it needs to be able to SSH into them. In order to SSH into them, it needs access to an SSH key. So in this transport section, we provide a path to the key on our workstation that it will use to SSH into those instances once they are provisioned. The rest of this is some additional Test Kitchen boilerplate, so if you'd like to learn more about that, head on over to kitchen.ci. And finally, the framework we're going to use to write our tests is called InSpec, which is similar to Serverspec. To find out more about InSpec, head on over to chef.io/inspec. So now that we have our tools, we're going to go ahead and start from a clean slate. We need to figure out the bare minimum we need for our tests to run first. So let's open up our config file, and what I like to do is to just comment out the entire thing and rebuild it piece by piece. When I'm refactoring application code, I use a similar approach when I'm adding tests for a method I want to change. At this point, with everything commented out, if I ran terraform apply from my workstation, nothing would happen. So this is a good starting point for us, and let's figure out what the bare minimum is we need for our tests. So first, we need a provider. In this case, it's AWS.
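Pulling together the driver, provisioner, and transport pieces just described, the .kitchen.yml might look something like this (exact key names vary between Kitchen Terraform versions, so treat this as a sketch, and the key path is a placeholder):

```yaml
# .kitchen.yml -- illustrative; key names may differ by version
driver:
  name: terraform

provisioner:
  name: terraform
  variable_files:
    - terraform.tfvars

transport:
  ssh_key: ~/.ssh/my_test_key.pem   # key used to SSH into the instances

platforms:
  - name: ubuntu

suites:
  - name: default
```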
So let's go ahead and uncomment that provider resource in our Terraform config. And along with our AWS provider, we also need actual AWS EC2 instances to run our tests on, starting with that Chef server. So let's take a look at the Chef server resource within our config, and it's pretty standard. We're spinning up an EC2 instance, passing in the AMI, the instance type, and the key name as variables. So let's go ahead and uncomment this so it actually runs. At this point, if I ran terraform apply, I would expect it to spin up an EC2 instance. The next EC2 instance we need to spin up is our Supermarket server. So let's take a look at that resource within our Terraform config, and it looks very similar to the Chef server resource. So let's go ahead and uncomment it. Now, with this uncommented, when I run terraform apply from my workstation, I will expect it to spin up those two EC2 instances. But we also need at least one security group in order to spin up our EC2 instances. Now, we could use the default security group in our AWS account, but I don't like doing that in tests. The reason is that the default group is out of the control of my tests. It would be easy for someone to change it and suddenly have my tests fail in an unexpected way. So I'd much rather define my security group in the code. So let's head on over to the section of the code that defines our security groups, and let's uncomment the section that will spin up our allow SSH security group. At this point, if I ran terraform apply, I would expect it to spin up that security group and my two EC2 instances, and then assign them to that security group. Now there's one more thing we need to do. Remember that Kitchen Terraform needs to be able to SSH into our instances in order to run tests on them. So we need a security group rule, in addition to that security group, to allow SSH access.
So again, we head on over to our security group resource, and let's uncomment that AWS security group rule. That's gonna allow SSH traffic into our instances through port 22. So at this point, if I ran terraform apply, I would expect it to spin up that security group and that allow SSH rule and those two EC2 instances, and then assign them to that security group. So now this is the bare minimum we need to create our test cluster. So let's go ahead and do that. Since we're using Test Kitchen, we're gonna do that through the kitchen converge command. This is a lot like running terraform apply, but we're running it through Test Kitchen, so we have all the benefits of the Test Kitchen framework. Now, the first time I run this, I'm going to get back an error. It initially doesn't look that helpful, but it's gonna tell me to view the log file at .kitchen/logs/kitchen.log. So let's go ahead and take a look at that file. When I open it up, my eye is drawn to this line: unknown resource "aws_security_group.allow_egress" referenced. It looks like we are telling our EC2 instances to use a security group we're not actually creating at this time. And this is because, at this point, with what we have uncommented, we need our EC2 resources to reference only the one security group that we have uncommented at this time. So let's take a look at that Chef server resource again, and looking at it, you can see that we're assigning it to two security groups. So let's change that and only assign it to the allow SSH security group, because that's the only one we have uncommented at this time. Then we head on over to the Supermarket server and we do the same thing. We change it from assigning it to two security groups to assigning it to one security group. So at this point, I'm gonna run kitchen converge again, and this time it's going to complete.
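The bare minimum we've uncommented so far might look roughly like this (the AMI, instance type, and variable names are placeholders, not the repo's actual values):

```hcl
resource "aws_security_group" "allow_ssh" {
  name        = "allow_ssh"
  description = "Security group for the test cluster"
}

# Allow inbound SSH on port 22 so Kitchen Terraform can reach the instances
resource "aws_security_group_rule" "allow_ssh_inbound" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = "${aws_security_group.allow_ssh.id}"
}

resource "aws_instance" "chef_server" {
  ami             = "${var.ami_id}"     # placeholder variable
  instance_type   = "t2.medium"
  key_name        = "${var.key_name}"
  security_groups = ["${aws_security_group.allow_ssh.name}"]
}

resource "aws_instance" "supermarket_server" {
  ami             = "${var.ami_id}"
  instance_type   = "t2.medium"
  key_name        = "${var.key_name}"
  security_groups = ["${aws_security_group.allow_ssh.name}"]
}
```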
And yes, if I were really running this from a terminal, that kitchen converge would take much longer, but for the sake of time in this presentation, I decided to use a little bit of magic. So at this point, remember, we have our security group with our allow SSH security group rule and our two instances when we run terraform apply or kitchen converge. So now we're at the point that we can actually write some tests, and to do this through Kitchen Terraform, we first need to define a test group. So let's open up our .kitchen.yml file, and the first thing we need to do is define the verifier that Test Kitchen will use to run the tests. In this case, it's also called Terraform. Then we define our first group (got a little ahead of myself), which I'm gonna call the default group, and then we define the test files that will run as part of that group. So I'm saying my file is gonna be called security_groups, and I'm gonna go ahead and create that in just a moment. Finally, let's skip over that hostnames attribute for a moment and look at this username attribute. This username is ubuntu, and that's the username that Kitchen Terraform is going to use to SSH into the instances. When I spin up an Ubuntu instance in AWS, I need to use the ubuntu username, at least at first, to SSH into it. And then let's look at this aws_hostnames variable. This is the output variable we will use to capture the hostnames of our two EC2 instances. We need to know their hostnames in order to SSH into them with that ubuntu username. So let's go ahead and create that output variable, and we do this through the outputs.tf file in Terraform.
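The verifier section just described might be sketched like this (attribute names follow early Kitchen Terraform and may differ in later versions):

```yaml
verifier:
  name: terraform
  groups:
    - name: default
      controls:
        - security_groups        # the test file we're about to write
      hostnames: aws_hostnames   # Terraform output holding the instances' public DNS names
      username: ubuntu           # default user on Ubuntu AMIs
```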
So I'm going to define my output, which is aws_hostnames, and then I define the value for that output, which in this case is the public domain name of the Chef server and the public domain name of the Supermarket server, as a string separated by a comma. So it's one thing for me to show you the code; let's look at another visual of how this will work. When we run our Terraform config from our workstation, it's gonna spin up those AWS resources. Then it will capture the public DNS values of the two EC2 instances in that output variable, then it will pass that variable to Kitchen Terraform, and then Kitchen Terraform will use that aws_hostnames variable to SSH into our instances and run tests on them. So before we try this again, let's go ahead and destroy our current test infrastructure to give us a clean slate and to make sure we have a value for that output variable, and then we're gonna run kitchen converge again. When this runs successfully, we now have test infrastructure we can run tests on. So we can start thinking about writing those tests. Now, it's good to start out writing tests just for the portion of the code that you want to change. So remember, our hypothetical was that we need to condense from using two security groups in our code to using only one security group. We already have one defined; that second security group resource is still commented out. And before we uncomment it, let's go ahead and write a test. So the test I'm gonna write using InSpec is: I'm gonna tell InSpec that when it runs the command ping google.com on one of my instances, it should receive the output "1 packets transmitted, 1 received". This means that my instance can send traffic out to the public internet. So after I write this test, I run it by running kitchen verify, and the first time I run it, I'm gonna get a failure.
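The aws_hostnames output defined at the start of this section might look roughly like this (the resource names are illustrative):

```hcl
# outputs.tf -- comma-separated public DNS names for Kitchen Terraform to SSH into
output "aws_hostnames" {
  value = "${aws_instance.chef_server.public_dns},${aws_instance.supermarket_server.public_dns}"
}
```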
This is gonna tell me that the output was not "1 packets transmitted, 1 received"; it was "1 packets transmitted, 0 received". So this tells me our instance, in its current state, cannot send traffic to the public internet. And this is good. It's good to have a failure first, because this means our test will fail when we expect it to. So let's go ahead and make it pass. We do that by heading over to that allow egress security group and uncommenting both the security group and the security group rule. At this point, if I ran terraform apply, I'd expect it to spin up both those security groups and both those EC2 instances. So now we need to reference that security group from our Chef server. Let's head on back into that resource, and we're gonna change it from using just that allow SSH security group to, again, using both security groups. Then we go ahead and do the same thing in our Supermarket server. We change it from assigning it to that one security group to assigning it to the two security groups. So once again, we're gonna run kitchen destroy and kitchen converge to give us fresh test infrastructure to work with, then run our test using kitchen verify. And this time when I run the test, it shows me that the test passes. So now this section of the code, that second security group, is covered by tests, and this means we can go ahead and make a change and be confident that it will work. So let's go ahead and condense these two security groups into one security group. We do that by deleting the second security group, that allow egress security group, and then changing the security group rule that allows traffic out to use the allow SSH security group rather than the security group we just deleted. So at this point, our instances should only use one security group. And that tells us we need to go back into our Chef server resource and change it back from using those two security groups to using that one security group.
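The InSpec ping test described in this section might be sketched like this (the file path and matcher details are my guesses; InSpec's DSL looks like Ruby but runs under the inspec tool, not plain ruby):

```ruby
# test/integration/default/controls/security_groups_spec.rb
describe command('ping -c 1 google.com') do
  # Passes only if the instance can send traffic out to the public internet
  its('stdout') { should match(/1 packets transmitted, 1 (packets )?received/) }
end
```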
When you're refactoring with tests, you're gonna do a lot of changing things back and forth. Then we head on into our Supermarket server and do the same thing. We're only gonna use that one security group now. At this point, if we were to run terraform apply from a workstation, we would expect it to spin up both those EC2 instances and one security group with both security group rules, and assign that security group to those two EC2 instances. So now let's run kitchen destroy and kitchen converge, followed by kitchen verify to run my test again, and my test passes. So we have now changed functionality without breaking our code. But let's not stop here. Let's go ahead and improve the design of this code to make it easier to work with, both now and in the future. And we're gonna do that by moving our security group code into a module. The benefits of moving it into a module are that a Terraform module is a self-contained package. It's a reusable component; we could theoretically use it in other Terraform configs. And it's a way to improve the organization of our code, to make it easier to find what we need when we need it, and to fix it. So let's go ahead and run kitchen destroy. Then we're going to make a directory for our new security group module. I'm gonna call it security_groups. And then I'm gonna create a file within that directory called main.tf, and this is going to be the main config of my security groups module. Once I create this file, I'm gonna copy and paste the security group code into it from supermarket-cluster.tf. So now we need to connect this module from the main config. In order to do that, we first need to know what variables the module needs passed into it. Looking at our security group code, when I see a variable that starts with the var prefix, that means it needs to be passed into the module from outside of it. So we definitely need to pass this one into our module.
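After the move, the module's main config might look roughly like this, with the condensed single security group carrying both rules (resource details and the use of the username variable are illustrative):

```hcl
# security_groups/main.tf
resource "aws_security_group" "allow_ssh" {
  name        = "allow_ssh_${var.username}"   # var.username must be passed in from outside
  description = "Allow inbound SSH and outbound traffic"
}

resource "aws_security_group_rule" "allow_ssh_inbound" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = "${aws_security_group.allow_ssh.id}"
}

# The egress rule now attaches to the same single security group
resource "aws_security_group_rule" "allow_egress" {
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = "${aws_security_group.allow_ssh.id}"
}
```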
Now, when we look further down the config and we see a variable like this, this one refers to a resource within the module's config. It will be determined when that module runs. So since this value is coming from within the module's config file (when I practiced this talk, someone told me I should make a joke here about how the variable is coming from inside the module), it does not need to be passed in from outside of it. So again, looking at this visually, we have our supermarket-cluster.tf, which uses the variables. We define those in variables.tf. We define the values in the terraform.tfvars file. And in our module, we also define variables, in the variables.tf file for that module. So let's go ahead and create that file now. We're going to declare a variable called username. And again, looking at the visual, that main.tf is going to use the variables that are defined in the module's variables.tf. So effectively what we're doing is passing variables from supermarket-cluster.tf to the main.tf within our module. In order to do that from the supermarket-cluster.tf file, we're gonna define a module block, tell it the source of that module, which in this case is a subdirectory within our directory, and tell it that we're gonna pass it the username variable. So when I run kitchen converge, testing these changes, I'm gonna get an error at first. I'm glad I checked this before deploying it out into a production environment. Looking at that default-ubuntu log again, my eye is drawn to this line: unknown resource "aws_security_group.allow_ssh". The problem is that our supermarket-cluster.tf file is trying to access that resource, but that resource now lives in a module; it can't access it directly. So the next thing we need to do is create an outputs.tf file within our module and pass that value back to the config.
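The wiring between the main config and the module might be sketched as follows (file contents and the instance snippet are illustrative):

```hcl
# security_groups/variables.tf -- declares what must be passed in
variable "username" {}

# security_groups/outputs.tf -- passes the group's name back out
output "sg_name" {
  value = "${aws_security_group.allow_ssh.name}"
}

# supermarket-cluster.tf -- calls the module and consumes its output
module "security_groups" {
  source   = "./security_groups"
  username = "${var.username}"
}

resource "aws_instance" "chef_server" {
  # ...
  security_groups = ["${module.security_groups.sg_name}"]
}
```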
So I'm gonna create that file, outputs.tf. I'm gonna name that output value sg_name, for security group name. And the value for that output is aws_security_group.allow_ssh.name. So finally, we need to use this output in our supermarket-cluster config. Again, we're gonna change it from referring to the security group resource itself to referring to the sg_name output from that module. Then we're gonna go ahead and do the same thing for our Supermarket server: change it to use the security group name that will be output by our module. We're gonna run kitchen converge again, and when that succeeds, we're going to run kitchen verify. And when I run my test, my test passes. So at this point, we have improved both our resource usage and the code's organization and design, with minimal risk. We can be confident when we deploy this that it will do what we expect it to do. So: infrastructure code must be maintained and refactored just like application code. Even more so, because infrastructure code involves so many moving pieces, as you saw from all those visuals. When refactoring code, always cover it with tests first, and refactor one small piece at a time. That's what I'm hoping you can take away from this presentation. Thank you.