Hello? Hello? Can you hear me? Is it faint? Okay. I'm not sure, but I think the mic is fine. I'll raise my voice when I start. Can you do me a favor? Sometimes I speak loud when I start, so I need someone to lower the volume from that side. You know what? I'll just move the mic away. Yeah. Michael will chase you around. Welcome. Welcome everyone to SCaLE. Today our speaker's name is Rami Alganmi. He's worked in DevOps for multiple years at companies such as Symantec, and currently he's working at Workday. Today he's going to be presenting on Terraform in 50 minutes, so please give it up for Rami Alganmi. Thank you. Thank you, and thank you Hibba for the nice introduction. So good morning everyone. I'll be talking about Terraform in 50 minutes. Now, it's a tall order. Number one, I'm talking about 30 minutes after who? After the CTO of HashiCorp, right? I do spot some people who I saw in the HashiCorp booth in the room, so that's extra pressure on me today. So that's all good and dandy, and hopefully we'll get something well done today. My name is Rami. I'm a software engineer at Workday. I do DevOps day in and day out, work with technologies like Terraform, Kubernetes, and some other HashiCorp products, and we try to make our deployments secure, consistent, and basically everything that every DevOps engineer aspires to. You can find me on LinkedIn. You can tweet about me or at me on Twitter, although I don't tweet a lot. The last time I tweeted was celebrating when Curiosity landed, so you can see how far back that was. So one of the things that you're going to notice today is that Terraform is kind of a big thing. There are a lot of intricacies to it. So everyone I told that I was going to try to cover Terraform in 50 minutes said: what? So you may see me rush a little bit. My goal here is just to give you as much as I can, because this is intended to be an introductory talk. So one thing that we need to cover: Terraform is an orchestration product that mainly focuses on moving your infrastructure into code. One of the big movements that we're in now, since everybody's going to the cloud, is moving everything into code: orchestration, infrastructure, including policies. If you wait until the next talk, that's about Cloud Custodian. That's about cloud governance using code and putting your policies in code. So once you have it in code, we're software engineers, we know what to do with code, right? We version control it. We can put CI/CD on it, et cetera. So that's our effort to automate everything that we want to go in. Now, here are the goals stated for this presentation. I want to make sure we have the expectations set for all of us. And yes, I did nickname my presentation TF50. So first of all, we want to understand the basics of Terraform. Basically, if you go into the Terraform documentation and you start reading, you can skip the first intro docs, jump directly in, and understand everything that's being said in the documentation. Number two, you can read Terraform code. Some of us actually inherit code as opposed to starting Terraform from scratch, so this is a good place for you to start. You'll also know how to learn from the documentation. And as you exit here — these slides will be posted and they're on GitHub also — you can deploy your first lab in 60 minutes.
Basically, you'll have your small lab already set up as soon as you're done. And I say 60 minutes because we're keeping 10 minutes for questions, hopefully. So, some expectations about this talk. It's going to be a little like a haircut, okay? A very special haircut. We're going to have discussion up front and we're going to have code in the back. So if it feels a little bit dry up front, it's because there are a lot of concepts we need to go through. Like a mullet, we're going to start dry and then we're going to have all the code and the party at the end, okay? And feel free to stop me or ask as things come up. So first of all, Terraform, like I said, is an orchestration tool. It's an open source tool, licensed under the Mozilla Public License. It's created by HashiCorp and written in Go. It uses the HashiCorp Configuration Language, which is kind of controversial for some people, but we've kind of gotten used to it by now. And if you use other HashiCorp products, you're pretty much familiar with it. The current released version is 0.11. 0.12 is in beta and has been in beta for a while. The reason the 0.12 beta is taking a while is that they're making changes to HCL, the HashiCorp Configuration Language, that will make things more expressive and easier to deal with, and that requires a lot of changes. Now, how many here have used Terraform before? Okay, how many of you have rewritten your Terraform code just to go between versions? Okay, I'm the only one. So I've been using Terraform since 0.2 and I've needed to rewrite my code base three times. But hopefully things will have stabilized after 0.12. That's not a bad thing, trust me, it's a really good thing. And it shows you the power of Terraform that you can actually import your current infrastructure back into it. Now, at a high level — can you guys see the pointer when I'm pointing at the screen? Okay, great. So at a high level, after Terraform 0.10, one of the cool things HashiCorp did is they separated Terraform from the plugins. Now, one basic way to think about Terraform: it's a lifecycle tool. Anything that has a lifecycle, Terraform can manage. If that lifecycle has an API, Terraform can do it. If you have your own internal in-house product that can create, update, report changes, and get destroyed, you can use Terraform to manage that resource irrespective of what it is. That's why there's a big plugin community around Terraform where people add things like, for example, Wavefront. It's an enterprise monitoring tool. It doesn't have an official plugin, but a company created their own integration with it through Terraform, and you can use that with Terraform. The idea is that the Terraform core code does not need to be updated a lot. However, the code that interacts with AWS — let's say AWS releases a new product and you want to control that product through Terraform — you don't need to update the full version of Terraform like in the old days. All you need to update now is what we call the provider, which is the plugin for the cloud or hosting service, right? And that plugin, whether it's a provider or a provisioner, will interact through a client library with that provider's API. So Terraform is kind of an abstraction layer, okay? And it does a good job at that. So now we're going to get into some concepts. First session of the day, I hope you're all caffeinated. It's not going to be too long, but we want to make sure we're on top of this. So the first concept is variables.
Everybody knows those: typed input, as you usually have in your programming languages. Generally, there are three kinds: there are strings, there are maps, and there are lists, okay? Then you have outputs. Outputs are basically tidbits of information you share out of your infrastructure after you run Terraform — like, for example, IP addresses of boxes, et cetera, okay? Then you have provider. Now, a provider is the configuration of a resource provider. Things like AWS, Azure, DigitalOcean, GCP, even GitHub, GitLab. Let's say your team is responsible for provisioning repositories on your internal GitHub, okay? And you need to create a repository, you need to create specific hooks, you need to create alerts, whatever. There's a Terraform provider for that, right? So you can use the GitHub Terraform plugin to manage that lifecycle for you and track any changes that happen in that lifecycle. And other tooling exists for things like PagerDuty, Datadog, et cetera. So there's a big, big library of plugins and providers. Then there are modules. Modules, just like the functions and modules and methods we have in programming, are basically shareable Terraform components. What happens is a module takes a set of variables and releases a set of outputs, okay? You basically feed in the variables and read out the outputs, okay? And recently Terraform released the Terraform Module Registry. So if any of you have used Ansible before and are familiar with Ansible Galaxy — Puppet had their own too, I forgot what they called it — you can basically go and find modules there and use them. A lot of people contribute those modules, including official modules from AWS, et cetera, okay? The other aspect is an important one, and that is state and the state file. Some orchestration tools rely on reading the current state of your cloud in order to do anything you want to do with your resources. So, for example, something like Ansible will go ahead, check the current state, and then apply the changes that you have in Ansible to that state. Terraform is a little bit different. Terraform assumes that you're managing your lifecycle through Terraform, okay? So if you create a resource using Terraform, Terraform stores the state of that resource in what we call a state file, okay? That state file describes the resource and all the attributes associated with that resource as documented in the Terraform configuration, and it's pretty comprehensive, okay? Now what happens is, once you have your Terraform code and you have your Terraform state, when you run Terraform, the first thing it does is check your current code against the state, okay? If there is any deviation, it will report that deviation to you, and you will instruct it to either exit or make sure that whatever is in the code is actually reflected in the infrastructure. Understanding state is an important part of Terraform, and that's one of the things we're going to spend some time on right now to make sure that we understand it, because everything after that is something that, as a developer, you can easily pick up, okay? And that's why we always try to keep the state in version control, right? Because state is literally: at this moment, what does my infra look like? That is an important piece of information, because at some point, at a deployment later, it's like: what happened here? What was this like two days ago? You can actually look at your state file from two days ago and figure it out. Another thing is that Terraform can work in a distributed fashion.
Multiple people can work with Terraform at the same time. So one of the things that we try to do is locking, and this was an enterprise feature that was actually pulled into open source. So now you can actually do locking so that no two people write state at the same time. Okay, some concepts. First of all, these are the operations that you run on Terraform, in order of importance. The first one is plan. Plan basically says: create a plan using the current state of my infrastructure and what my code is. So you wrote code that says provision two EC2 instances, okay? Then you have your state, and then you have what was actually running in the cloud. Now let's say someone went ahead and deleted one of those instances. Okay, so Terraform will look in the state file. It will see: oh, these are the resources that I'm tracking. It goes and checks the state of those resources. Oh, one resource does not exist. And then it will generate a plan for you on how to remediate the current situation and make it match your code. Okay, so let's say, for example, in a scenario you had two EC2 instances that are not part of an auto scaling group. An engineer killed one of the instances and spun up another one in its place. Okay, now what will Terraform do? Terraform tracks what it has in its state file, and in its state file it has the one it already had before; it doesn't have the other one. Okay, so as far as Terraform is concerned, you're one EC2 instance down, and it will try to fix that. Okay, if that's not the behavior that you want, then you need to actually import that new instance into Terraform. Okay, and then Terraform will start tracking that as the other instance. Any questions so far, by the way? Guys, feel free to ask. Okay, the next one is terraform apply. So once you have a plan from Terraform — okay, Terraform tells you: I'm going to create a VPC, a couple of EC2 instances, a Lambda function, et cetera, I'm going to create all that for you — it creates a plan. A plan is basically a JSON blob, okay? And you can take that JSON blob and actually store it on local disk if you want. One of the patterns people use to deploy Terraform is to actually create a plan, have somebody peer review the plan, and then have a job apply the plan. Okay, so apply basically takes in a plan and makes your provider deploy that plan. Okay, what happens nowadays is that the apply command, if you did not feed it a pre-existing plan, will run a plan for you. So you just need to be mindful that there are actually two steps happening as one. The first is the plan, the second is the apply. They just made that shortcut to make the barrier to entry a little bit smaller. Okay, the other one is destroy. This is important because, like anything cloud, you can shoot yourself in the foot, right? So destroy will, literally — because unlike if you use CloudFormation, and especially if you're using AWS by hand — just a show of hands: how many people have left things in AWS that they didn't know they left? Okay, how many figured it out six months after the fact? One year after the fact? Two years after the fact? One of you is lying. Okay, there are things that stay for three years after the fact. Okay, we all know that. I see some of you raising your hands. So thank you for being honest, my friend.
So one of the cool things is, if you do this via Terraform, Terraform literally keeps track of every nook and cranny. If you say terraform destroy, literally everything will be destroyed. Nothing spared. Okay, so there will be no chargeable item left that you created and managed via Terraform. The other thing is taint. So one of the things DevOps people used to talk about a lot, until everybody wised up to it, is that instances or VMs are cattle, not pets, right? So sometimes something acts up, a VM acts up. Is it worth your time to go in and figure out what's going on with it? Or do you just want to shoot it in the head and let it go? That was bad. Sorry. You just want to delete the instance and let it go, right? So in this case, taint does that for you. You see this resource here? Okay. Please taint it in the next Terraform run. Terraform will delete that instance. But not only that: if that instance, for example, had a security group that it belonged to, or it was part of an ELB or anything, it will trigger that change throughout your whole infrastructure just because you tainted that instance, create one in its place, and won't touch anything that doesn't need touching. Okay. Great. Then we have terraform import. We talked about it: basically, let's say you already have infra and you want to write Terraform for it. You write the Terraform code, you import it, and Terraform will start tracking it for you. Terraform fmt: if you ever review Terraform code, you'll have a meme ready in Slack for people who don't use this command, right? So please use Terraform formatting as much as you can. Okay. So we're done with the beginning. We're getting closer and closer to the code part, so brace yourselves. Terraform is a Go application, hence it's deployed as a binary. You can download it directly from the Terraform website. It has nice shell integration, so it works with Zsh and Bash as well, so we can enjoy that. Some basic steps to begin with. So sometimes I go a little bit extreme with my DevOps thing. One of the things I do is that every three months I wipe my Mac, okay? Just to prove that I'm a good DevOps engineer. I drive myself crazy too. So when you do that, you really don't want to keep track of your infrastructure on your laptop, right? And sometimes keeping it in GitHub is hard, because someone could have checked out the state, okay, done some changes, but forgotten to check it back in. So an easy place to put it is somewhere where everyone can access it, okay? Terraform has support for all kinds of storage provided by the cloud providers, and also things like Artifactory, for example, which some people use. In this example, since Terraform manages infrastructure, I needed to show you infrastructure to manage. So I did a quick survey with people a while back, and almost everyone is at least familiar with AWS. So throughout this presentation we'll go with AWS, but you'll find parallels with things like Azure, et cetera. So now what we're doing is basically setting up our command line, because we have a chicken-and-egg problem: before we start with Terraform, we need to create the lock table and we need to create the S3 bucket where we're going to store the state, okay? So starting fresh — let's say you have a fresh AWS account — you create a user with enough privileges, you set up the region, and then you create the DynamoDB lock table.
There is one attribute you need, a single string attribute called LockID. Since this is a demo account, we don't really need a lot of read capacity or write capacity, and the goal of the lock table is basically to lock people out, so you won't have more than one person writing at the same time, generally speaking. The reason I'm keeping track of the account ID is that I've run this demo by some people, and when I was practicing today, I noticed that someone had created an S3 bucket with the same name as the one I use. S3 bucket names are globally unique across everyone in the world. So, in order for nothing to break when I do something now, I appended the AWS account ID to my Terraform state bucket name. Great. Also enable versioning — remember, we said state is important — and also enable encryption. And better yet, you should actually do all this using Terraform. Okay, but I'm assuming you're starting completely from scratch, okay? Any questions so far? No? Awesome. Okay, so now, just so you believe that I did this — where is it? The resolution is off. Basically this is the same set of operations that I was describing. Clear enough? Okay, so you can see the caller identity, you can see the account up here, and this is the bucket name — the names that we'll be using going forward. So I'm creating the Terraform lock table, and yes, I can type that fast. You can see the details of the table as it's created. You need to keep track of the table name. Also, you need to keep track of the region that you deployed things in. Okay, and we created the S3 bucket. You need the S3 bucket name, we just enabled versioning, and now we're enabling encryption on the S3 bucket, and that's it. Okay, okay, so now we're going to start talking about code, okay? To begin with, the first thing that you need to set up is a provider. In this case we're going to be using AWS. You're encouraged to provide a provider version, because Terraform will maintain compatibility within a major version of a provider, right? So if you upgrade from 1.2 to 1.60 — I think today it's 1.60 — you won't have any problems, but let's say they change things and you go to version 2: at least you will have locked yourself here. We're going to be using us-west-2. You can have multiple providers or multiple regions defined within the same code base; Terraform allows you to do that, okay? And the profile — I'm just using an AWS profile to authenticate, okay, which is available in my credentials file. I have a link explaining that a little bit more if you're not familiar with that concept. And then we also need to set up what we call the state backend. In this case I'm using an S3 backend, and I'm defining the bucket name, which we saw. Then I'm defining the key. The key is basically the path within the S3 bucket that I'll be using. So in this case I'm calling it terraform-lab, and this section is called basic. It's in us-west-2, with a profile. Now, this is an interesting part for those who deal with security in their accounts: you can give somebody a different profile for storing and editing the Terraform state than for the infrastructure itself. It doesn't need to be in the same account; it just needs to be in the same region. As long as you can give them a separate profile to alter the state, that gives you more control for auditing and things like that. And this is the DynamoDB table that is used for locking. Now, demo time. Cool.
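A minimal sketch of that bootstrap done in Terraform itself, as he suggests, assuming illustrative names (the real table, bucket, and profile names from the talk may differ):

    resource "aws_dynamodb_table" "terraform_lock" {
      name           = "terraform-lock"      # illustrative table name
      hash_key       = "LockID"              # the S3 backend expects this single string attribute
      read_capacity  = 1                     # a demo account needs very little capacity
      write_capacity = 1

      attribute {
        name = "LockID"
        type = "S"
      }
    }

    resource "aws_s3_bucket" "terraform_state" {
      bucket = "terraform-lab-123456789012"  # account ID appended because bucket names are global

      versioning {
        enabled = true                       # state history matters, keep versions
      }

      server_side_encryption_configuration {
        rule {
          apply_server_side_encryption_by_default {
            sse_algorithm = "AES256"
          }
        }
      }
    }

And a sketch of the provider and S3 backend blocks being described, in 0.11-style syntax, again with placeholder names:

    provider "aws" {
      version = "~> 1.60"       # pin to a major provider version
      region  = "us-west-2"
      profile = "scale-demo"    # the profile name is an assumption
    }

    terraform {
      backend "s3" {
        bucket         = "terraform-lab-123456789012"
        key            = "terraform-lab/basic"   # path inside the bucket for this section
        region         = "us-west-2"
        profile        = "scale-demo"            # can be a separate, state-only profile
        dynamodb_table = "terraform-lock"        # used for locking
        encrypt        = true
      }
    }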
So the GitHub repository that comes with this talk has a lab, and this is the lab. You see here, step one, we have the provider. The way you start any Terraform interaction for the first time ever is you do terraform init. And when you do terraform init, because we have the S3 backend, it needs to register with it. But it also needs to do a couple of other things. Here, it will create local files just to keep track of things. But also, more importantly, you'll notice here that it downloaded the AWS provider. So one of the main things here is that when you ran init, it scoured your code and looked for providers: oh, you're using the AWS provider. So it checks for where the AWS provider is. There's a local cache on your disk where Terraform stores the provider, so it doesn't download it every time you call it. It downloads it locally, but looks for new versions as they come in, and it keeps it here in a local directory. Word to the wise: although Terraform will keep track of your providers, you need to keep track of the version of Terraform you're using. There's a whole class of issues that can happen if different people are using different versions of Terraform. It's out of scope; we can talk about it after the discussion if you're interested. So now, this is basically how you start with the basic setup. Now, what we want to do is get more advanced. One of the things that I like about Terraform is that it gives you what they call data sources. So — some things, yes: if you initialize without a backend, it'll just be an initialization on local disk, which is almost equal to nothing; it'll just create the local directories. The main reason you would need initialization is if you have external providers you need to reach out to. Because it downloads a provider like the AWS provider, it goes to S3, makes sure it has access to the bucket, and it reads whether S3 has any information — because let's say this code was checked out on another laptop: it'll download the state file to know what happened before this directory even existed. Good question. So yes, it will download it to local disk. There's a variable you need to set in your shell — I'll show it to you later — where it will cache it locally for you. It will check for a new version every time, but it will cache it locally for you. So here, data sources. One of the things I like about Terraform is that sometimes there are things that you don't own, or you don't track, but you definitely need to know. This is what we call a data source. And in this case, I want to know what the account ID is, my AWS account ID. Maybe you have something that assumes a role. But I also want to know the account alias — some companies name their accounts, so it's easy for you to know which account you're working on. And not only that, I want to output that information so I can read it. The way the syntax works is: you have the type, which is data — we'll see resource later — then what the construct is, and then the name you want to give it. And the output is the same thing: data, then the type, then the name that you provided, and the attribute. So let's see that in action right now. If we go here, and we go to... again, we'll do terraform init. And the reason it's taking time here is that it's reaching out to S3 to make sure that the bucket exists. And now if we run this, you'll see the result.
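A hedged sketch of what that data-source-plus-output code presumably looks like (0.11-style syntax; the data source and output names are my guesses):

    # Who am I? Read-only lookups against the account we're running in.
    data "aws_caller_identity" "current" {}

    data "aws_iam_account_alias" "current" {}

    # Publish the values so they show up after terraform apply
    output "account_id" {
      value = "${data.aws_caller_identity.current.account_id}"
    }

    output "account_alias" {
      value = "${data.aws_iam_account_alias.current.account_alias}"
    }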
And if I now do terraform apply — and the reason I'm doing an apply without a saved plan is that there's really not much to plan when we're just reading data; we'll have other examples down the line where we actually have a real apply. Yes? Sorry, say that again? Yeah, the environment variable — I just read it because I needed it to create the S3 bucket. So that has nothing to do with this. That was the egg, and now we're having the chicken, okay? That's how we did it on the command line; this is how you do it in Terraform. Make sense now? Yeah. Okay. But thank you for pointing that out. Now, you notice that unlike the run we had before, where no information came out, in this run you can actually see — let me bring it up — you can actually see the account ID and the account name displayed here. Okay? Now let's go see some fun stuff on the AWS console. And everyone promise me this is the last time you will use the AWS console. Everything else from now on goes through Terraform. It should be in read-only mode only, right? Thank you. We're going to have a group session after that to discuss that. So anyway... So, now I'm in US... Can you hear me? No? Hello? Somebody doesn't want me to finish in 15 minutes. Hello? No? Okay, awesome. So, you can see here — now, don't be afraid if you haven't used AWS before. This is just a really bad interface showing us stuff we don't understand, okay? And I'm sorry for the resolution limitations of the projector. Now, this is the Terraform table that we created, the DynamoDB table, okay? And you can see here that we have a state entry. Now, there are two things that get put in this DynamoDB table. The first one is a lock, when two people are accessing the same resource at the same time — that is, accessing the same Terraform code base at the same time. The second one is a checksum of your state file, okay? So Terraform will know if someone was naughty and edited the state file by hand, or someone else ran it and somehow overrode the lock, okay? Terraform will know that. So usually, when you keep running things, you'll see the checksum being updated here, okay? The other thing that you will see, here in the S3 bucket, is the path that we created. So we created terraform-lab and we asked for account-info, okay? So there is a file that's called account-info. You can see from here that we have versioning enabled, right? And now I'm going to... oops, wrong button... I'm going to download the file, and we actually want to read that file a little bit. We want to... Hello? Working? Not working? Hello, world? Okay. So, this is the state file as it comes. You'll notice that a bunch of things are tracked here. This is the state file I downloaded from S3. First of all, there's a Terraform version. Terraform will not allow you to use a Terraform version older than the one stated here. So if some engineer did a brew update — and you know who you are, by the way — if someone did a brew update and updated their version of Terraform, they'll basically force everyone else to update. And there's tooling out there to make sure everyone is on the same version, okay? Lineage is basically an ID to keep track of your state. And then you can see these are the outputs that we saw, okay? So outputs are kept like that, just the outputs, in full. Okay. But there's also extra information.
So, these are the resources that we called. And there's a lot more data here than we had in our output, but that's basically the whole state, right? Once we have EC2 and things like that, this will balloon up quite a bit, okay? But it's all pretty-printed and in colors. And this is awesome, because sometimes you just want to debug something and read the full state: just pop into S3 and go read that file, okay? And you have all the information that you need. Dependencies — Terraform is pretty good about figuring out dependencies. I have something toward the end that will make things even nicer. But if you run into a dependency problem, you really, really need to go ahead and re-examine the code and the structure you create your code in. Because dependency issues may not always cause problems when you create stuff, but they will definitely cause problems when you destroy stuff, okay? So, cool. If there are no questions, I would like to pop back into the presentation. Yes? So, let's say an EC2 instance failed — failed to create, or failed... So, what Terraform reads — it doesn't access the instance itself. It accesses the AWS APIs. So if the instance failed, Terraform will know because AWS will know, okay? But let's say the instance died, okay, and you went and manually killed it, and you terminated the instance. Terraform will notice that, okay? Well, if it failed... It's not meant to be a monitoring tool, in a way. Yeah. I think we're missing each other. What do you mean by failed here? Is it like a software failure? Okay. So, if that machine goes down, then it's reported in AWS — like, if it's in a terminating state, then Terraform will see that. But if it totally evaporated, Terraform will say: this machine was supposed to be here, it's not here anymore, let me spin up another one for you. It'll come as part of your plan. Okay. Yeah. But thank you for the question. So, I'm dropping in documentation links every now and then, so when you follow the slides, you can figure out where you can go from there. Now, we saw the state file. We saw the checksum for the state file in the DynamoDB table. We didn't see what the lock looks like. So, the lock only exists for a short period of time, so I just captured one lock for you and plugged it in. This basically has the name of the state file that you're accessing, it has an ID, and some other information, including the person who did it. Okay. And one of my favorite things — at least in my previous job, one of my favorite things — is I go in in the morning and I do terraform plan. If I see anything in yellow, I just go and check who did it, and then we're going to have fun that day. Okay. So, now we need to get real. Now we need to build real infra, right? Who just wants to get information from AWS? You want to start provisioning EC2 instances, VPCs, subnets — and there's not enough Chuck Norris nowadays for some reason. Okay. So — oops, that was a miss — what we're going to do is create a VPC. Now, a VPC in AWS... a lot of things in the AWS console take painfully longer than they need to, okay? To create an EC2 instance via the UI takes me at least four minutes, right? To create a VPC, you need at least 13 clicks just to get a VPC. Now, in this case, it's all these lines that you see in front of you here. And what I'm doing is I'm telling Terraform:
And I'm going to parse this for you, for the first segment of things that we're creating, because, remember, one of the goals is to be able to read Terraform code, okay? So, I'm going to create a resource. The resource type is aws_vpc, okay? I'm going to give it a name: this VPC from now on shall be known as main, okay? And this is the CIDR block that I will give to that VPC. And like a good boy, I always tag my resources, right? Who has tagging problems? So, always tag. If you don't have them, you will have them soon, okay? So always tag your resources. So you will see me, although it's a demo, you will see me always inserting tags here. You teach for a while and the habits don't go away — you must enforce good habits, okay? Then, of course, now that we've created a VPC, you need to create an internet gateway so we can reach out to the internet, right? And we need to give it a route so we can reach out to the internet through that gateway. So, basically, I'm creating an internet gateway. Now, here's the interesting part. The internet gateway needs a VPC. What we're doing is saying: go to the aws_vpc that I called main and get me the ID of that VPC, okay? That's done. You run it once, it will run here. You change the region, you run the same code — you didn't need to change anything except the region — and it will still run, and everything will look the same, okay? Except the IDs, of course. The same thing goes for the aws_route. This is basically a necessity to get internet access from our resources out, right? So, as you can see — this is the VPC, but in this case it wanted the route table ID, and it needed the internet gateway ID, okay? Now, this is a spot where I will pause and actually go to the Terraform documentation, okay? So, remember, we had two kinds of things in Terraform — we saw three things: we saw data, we saw resource, and we saw outputs, okay? Now, when you're starting out, there's a common mistake you run into when you Google just "Terraform VPC". There are two kinds of Terraform VPC pages: one is data and one is a resource, okay? Data means: go read information from that VPC — I want data about that VPC. Resource is something you want to create and alter, okay? So keep that in mind while you're doing this. And generally, the Terraform documentation is pretty well written. You'll always find a basic use case, a complex use case, and then you'll find the argument list. So, in this case, this is the required argument that we used. There are a lot of optional arguments that are listed. Links — sometimes they provide links to AWS pages if something is complicated, but other than that, everything is basically simple; you just read here what it means. If anything is complicated, just jump to the AWS documentation — generally the name is the same, okay? — and you can go from there. Attribute references, that's the fun part. That is what information you can get out of the resource. So, we used the ID. You can also read the CIDR block back, and we also read the route table just now in our code, okay? And if you want to know what information you can extract from a resource after you create it, all you need to do is visit this page. Sometimes there are undocumented things that you can find in the state file — use at your own risk; Route 53 is full of them, FYI. So, great. Now we created the gateway. Next we need a subnet, right? Because we need a subnet to put the EC2 instance on; to create the subnet, we need the VPC, and for internet access it needs the gateway and route we just set up — that's a quick way of putting it. A rough sketch of the networking code so far is below.
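A hedged sketch of those first networking resources in 0.11-style syntax (names, CIDR, and tag values are illustrative; the route here uses the VPC's main route table, which matches the route-table attribute he mentions reading):

    resource "aws_vpc" "main" {
      cidr_block = "10.0.0.0/16"        # illustrative CIDR

      tags {
        Name = "scale-vpc"              # always tag your resources
      }
    }

    resource "aws_internet_gateway" "main" {
      vpc_id = "${aws_vpc.main.id}"     # reference the VPC we just declared

      tags {
        Name = "scale-igw"
      }
    }

    # Default route sending traffic to the internet through the gateway
    resource "aws_route" "internet_access" {
      route_table_id         = "${aws_vpc.main.main_route_table_id}"
      destination_cidr_block = "0.0.0.0/0"
      gateway_id             = "${aws_internet_gateway.main.id}"
    }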
Now, same thing: create an aws_subnet resource that I'm calling main. I want to give it this CIDR block — notice the VPC was a /16, now this is a /24. And the VPC that this subnet is associated with is the main VPC, so I'm grabbing its ID. And make sure you assign public IP addresses so we can SSH into things. This is the availability zone it goes in. Remember, all these things can be parameterized, right? And give it a nice tag here — the Name tag just gives it a name, right? So, let's go with a harder demo right now. Let's do what we always do: terraform init, so we know where we are. This is the Terraform file that I have. It has the VPC — so I'm going to point at the screen — the VPC, the gateway, the route, the subnet, that's it, right? So now I want to demo terraform plan. Terraform plan is just that: it will create a plan based on reading the state file, what you actually have out there, and the code you have. In this case, it has really nice indicators, okay? One of the first indicators you see is the create indicator here, and it shows you: I'm going to create this resource, and it will tell you what information it knows, or what information it has, and what information will be computed later, okay? And this is the route for internet access. This is the subnet, okay? And you see some of the things that we hard-coded being shown here, like the CIDR, et cetera, okay? Now, as we said, the plan does nothing, and Terraform will give you a hint: you need to do terraform apply to get things applied. So that's what we will do. Since we did not give terraform apply a plan file as input, it will inherently run a plan on its own, okay? You can see the same output that we saw before running here, and you need to spell out yes — y-e-s, okay? It's just like SSH that way, or GPG. You need to spell out yes, because these things cost money, and then it will start creating the resources for you, okay? And you will start seeing things happening as you go along. So, see, we have a VPC — it spits out the ID. This is the gateway with its ID. There's the subnet — boom, we're done, okay? You write this code, you can give it to any of your engineers, they run it, and they'll have the same exact thing, okay? So now, before we go further, let's go to variables. We noticed that we had specific names in there — main, et cetera. We want to be more advanced with those things, okay? So we want to create variables. The cool part is, once you create a variable, you feed in a variable file, and you can say: oh, prod, dev, for example, okay? There are other ways of segregating deployments using Terraform, mainly workspaces. Not everybody uses them because there are intricacies there, but this is one of the simpler ways of doing it, just by using variables, especially if you have a simple deployment. If you want to talk about more complex setups, feel free to reach out; we can have a small discussion after the talk. So, I'm saying: I want to give every resource I create a prefix, okay? But also, this is the VPC name I want you to create, and this is the VPC CIDR, okay? And what will happen is — this is the VPC that we created, this is what we just did right now, I just copied the slide over, okay? Now, this is what it looks like with variables, okay? So I'm going to go back and forth just so you see positionally where things are.
So, you can see: before, we saw how to use data, okay? But now we have var, which is how you call a variable from the variable file. And in this case, I have variables for the resource prefix, VPC name, CIDR, et cetera, okay? So now, let's hop in here. What I want to do is a trick, okay? And hopefully — sorry — you won't freak out a little bit, but what I'm doing is symlinking the Terraform configuration directory into this directory. The reason I am doing that is that I don't want to do terraform init again, right? In the previous Terraform code, I had the values hard-coded, okay? In this Terraform code, in this directory, I have them as variables, right? I want to show you that if I use variables, what will change in my Terraform? So you can see here — where is it? — for the names, I'm using variables here, okay? So if I just change my code from having hard-coded values to having variables, will anything change? Will Terraform complain? A plan will tell me. So I ran the same plan on the same code, and it tells me I have nothing to do, okay? So I just abstracted everything into variables, and Terraform was oblivious to that, okay? Everything works the same, as long as I didn't add any extra spaces — nothing else changes in that regard, okay? Now, what I like to do sometimes is actually mess with Terraform, okay? So one of the ways to mess with Terraform is: let's go and edit the VPC, how about that, okay? And let's see what Terraform does. So you notice here, the second VPC is the default one that AWS creates for you. This is the VPC we created. Because we have a Name tag — always add a Name tag, please — you will see that it's clearly displayed here. What I want to do now is change the name. I don't want to call it scale VPC; I want to call it main VPC, okay? Now let's see what Terraform does. So now we'll just do terraform, and I will do an apply — it will do a plan while it's at it, okay? And now we see a new action. This is a change action, and it tells you: there's a tag called Name on this VPC that is called this, and I'm going to change it to this, because this is what you have in your code, okay? And if I say yes, it will start modifying the resource. And now, if you refresh, you will see it's back to what it was, okay? So that's, in the most basic sense, what Terraform does, okay? So, these are variables here. We also have more complex variables. Those were straight-up just values; here we can create a map, and here we can create a list. You will see that when you define a variable, you always give it a default value. This allows you to define default values that you can override with a variable file. So even if you did not provide a variable file, your code is not going to exit with an error, okay? And this is how I use the complex variables: in this case, because I had a map, I pull values out of the map using this syntax — it's like an array syntax, similar to what we have elsewhere. Now, this is an interesting one that I want to pay attention to. I now want to create two subnets, not one. The question is: do I need to have two subnet blocks? Not necessarily, right? What you can do here is add a count, okay? So you can say: count — create two subnets for me. But I don't even want to specify the number two. I want to say: in that variable file that I have, I have a variable called subnets — that's a list, okay? A sketch of these variable definitions is below.
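A hedged sketch of what that variables file might look like (0.11-style; names, CIDRs, and availability zones are placeholders, not the talk's actual values):

    variable "resource_prefix" {
      default = "scale"
    }

    variable "vpc_name" {
      default = "scale-vpc"
    }

    variable "vpc_cidr" {
      default = "10.0.0.0/16"
    }

    # A map variable; read values with array-style syntax like var.tags["owner"]
    variable "tags" {
      type = "map"
      default = {
        owner = "devops"
        env   = "demo"
      }
    }

    # A list variable: one entry per subnet, each carrying a name, CIDR, and AZ
    variable "subnets" {
      type = "list"
      default = [
        {
          name = "a"
          cidr = "10.0.1.0/24"
          az   = "us-west-2a"
        },
        {
          name = "b"
          cidr = "10.0.2.0/24"
          az   = "us-west-2b"
        },
      ]
    }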
Then grab the length of that list — this is what we call a Terraform interpolation — and that's how many of these I want, okay? And then, everything from here on is per item in that count. So per subnet: for the first subnet, grab its CIDR; the same VPC for all of them, okay? And here, pick up the AZ that that subnet had. Just to make sure we're on the same page: see, the subnets had a name, a CIDR, and an availability zone — name, CIDR, availability zone, right? And I'm pulling out the CIDR and the AZ, and I'm pulling the name out into the tags, okay? One of the things that Terraform 0.12 is going to have is some changes to make this better and easier to handle, okay? But for now, this is what mainstream usage looks like. Any questions about this so far? No? Yes — yes, I'll show you the variables file right now, okay? Yeah, no worries. So now we're going to go to outputs and shared state. Before we go there, we need to understand how terraform destroy works. So this infra — we tested it, we played with it, and now I actually want to destroy it, okay? It's as simple, or as hard, as terraform destroy, okay? And then you run it. Make sure you have VP sign-off, boss sign-off, somebody sitting next to you — make sure everything that you did is covered. So when you type yes here — better yet, have your boss type yes and hit enter. I did that once, by the way. So it tells you: okay, now I'm going to destroy, and it lists the items that it's going to destroy, okay? And you just type yes, it'll destroy them, and then you're done, okay? And Terraform is not asynchronous here — it is synchronous. It won't just send the destroy request and hope for the best; it's going to wait until things are destroyed, okay, and then report back to you. Okay. So now let's go to this one. What I want to talk about now is sharing information. In your organization, you may have a networking team that is separate from, let's say, the developers. Developers can provision EC2 instances, but only the networking team can create VPCs, because there are routing tables, they may have Direct Connect to the corporate network, a lot of things like that. So sometimes you want to separate them. But at the same time, developers need information about those VPCs to provision EC2 instances. This is what we call shared state, okay? And the way you manage shared state is: inside your Terraform code, you just declare an output and provide that information, okay? And when you provide that information, it'll be available for the developers to use, and we'll see that in the next step, okay? So what I'm going to do here, because we're short on time — sorry, this is what happens when you don't initialize Terraform: it gets mad, it doesn't know what it's doing, okay? — we're going to do terraform apply, I'm going to say yes, please, and what will happen is it will create the resources, but it will also store the state of those resources here. So then — let me see. So what happened here is that it created the resources, but you notice that I didn't have any output. What I wanted to show you now is here: here I have the variables that we talked about, and I also have a file that I created that is called shared state. And in the shared state file, okay, I'm sharing the VPC ID and the subnet IDs. A rough sketch of the count-based subnets and those shared outputs follows.
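A hedged sketch of the count-based subnets plus the shared-state outputs being described (0.11-style; the lookup pattern on a list of maps is one common way to write this, not necessarily the talk's exact code):

    resource "aws_subnet" "main" {
      count                   = "${length(var.subnets)}"                      # one subnet per list entry
      vpc_id                  = "${aws_vpc.main.id}"
      cidr_block              = "${lookup(var.subnets[count.index], "cidr")}"
      availability_zone       = "${lookup(var.subnets[count.index], "az")}"
      map_public_ip_on_launch = true

      tags {
        Name = "${var.resource_prefix}-${lookup(var.subnets[count.index], "name")}"
      }
    }

    # Shared-state outputs: another team's Terraform can read these via remote state
    output "vpc_id" {
      value = "${aws_vpc.main.id}"
    }

    output "subnet_ids" {
      value = ["${aws_subnet.main.*.id}"]
    }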
Why? Because to create a security group, you need the VPC ID, and to create an EC2 instance, you need to know the subnet ID, right? So I'm sharing those here, for the people who are creating instances. And now I'm going to publish that output, so it'll be part of my shared state. So when somebody comes and creates EC2 instances, they will have that information available to them, okay? In the interest of time, I'll jump to the slides and we'll come back when this succeeds. So now, what we want to do is actually create EC2 instances. Now, one big issue that always comes up with EC2 is AMIs, okay? Which AMI do I pick? Let's say you want to use Ubuntu. Ubuntu tells you: pick one of these AMIs. And you're like, which one should I use? It has everything from 14.04 and 16.04 to 18.04, different platforms — and it's per region: each region has its own ID. So of course you memorize those by heart, right? No. So what we're doing in this case, here, is actually using the data construct, saying: go to AWS, please, and grab an AMI. That AMI has a name that looks like this. Ubuntu says all my AMIs for 18.04 look like this, and then they append the version to it. CentOS has their own scheme. But at the end there is a wildcard that you can put, right? And then what you say is: I want it to be HVM, which is most of the virtualization we use on AWS. And don't grab just any image that happens to look like that — make sure you grab it from the Canonical account. The Canonical account ID was on the page that we visited, right? You go there, you grab that. This is a very important thing, because sometimes you have your own internal images, okay? Make sure you always specify the owner. You can specify a list of owners — it doesn't need to be one owner account — and it will do the search with that filter applied, okay? And what happens is that you can just immediately generate the EC2 instance out of that, okay? So now, other than this — so we know where to get the AMI from — the question is, where do we get the state for the network, okay? And that we get from something called remote state, okay? So you call a remote state. What is the remote state? Well, this is the bucket — we store everything there — and you say: go grab the remote state that is called networking, which we created just before we came to this slide, okay? And this is the region and the profile you need to access it with. So let's say the networking team is not giving you access to their account, but they give you a read-only profile, which is the only thing you need for their state. Then you just put the read-only profile here. You don't need a full-access profile. That is different from the profile you use on your own account, because this can be cross-account, okay? Wonderful. So now let's go to the finale, I guess — here, where we're going to create the instances. What's the first thing I need to do? You're slow, guys. Need coffee? Lunchtime? Okay. Sorry, I won't keep you too long. Okay, so I did terraform init, then I will do terraform apply, and I'll show you the code while this is happening, okay? So, for instances — sorry, I want to make the window wider, but I can't — what I'm doing is, number one, asking for the AMI. In this case, I'm asking for the most recent AMI based on the filter, like we discussed, like we showed on that slide. A rough sketch of the instance code is below.
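A hedged sketch of the instance code being walked through: the AMI lookup, the remote-state read, and the instances themselves (0.11-style syntax; the bucket, key, profile, and instance type are placeholders, the owner ID is Canonical's published account ID, and the security group he mentions is omitted to keep the sketch short):

    # Most recent Ubuntu 18.04 HVM AMI, restricted to Canonical's account
    data "aws_ami" "ubuntu" {
      most_recent = true
      owners      = ["099720109477"]   # Canonical's account ID, from the Ubuntu AMI page

      filter {
        name   = "name"
        values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
      }

      filter {
        name   = "virtualization-type"
        values = ["hvm"]
      }
    }

    # Read the networking team's outputs from their state in the shared S3 bucket
    data "terraform_remote_state" "networking" {
      backend = "s3"
      config {
        bucket  = "terraform-lab-123456789012"
        key     = "terraform-lab/networking"
        region  = "us-west-2"
        profile = "state-read-only"    # a read-only profile is enough here
      }
    }

    resource "aws_key_pair" "deployer" {
      key_name   = "scale-demo"
      public_key = "${file("~/.ssh/id_rsa.pub")}"   # the public half, nothing to hide
    }

    resource "aws_instance" "web" {
      count         = 2
      ami           = "${data.aws_ami.ubuntu.id}"
      instance_type = "t2.micro"
      subnet_id     = "${element(data.terraform_remote_state.networking.subnet_ids, count.index)}"
      key_name      = "${aws_key_pair.deployer.key_name}"

      root_block_device {
        volume_type = "gp2"
        volume_size = 8
      }

      tags {
        Name = "scale-web-${count.index}"
      }

      lifecycle {
        create_before_destroy = true            # spin up the replacement before killing the old one
        ignore_changes        = ["ami", "tags"] # don't rebuild just because a new AMI or tag appears
      }
    }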
Then I actually uploaded my SSH key so I can SSH into the box. This is the public key, so there's nothing to hide. So I created a key pair. I also created a security group that allows SSH access in. I allowed it from the whole internet, but you can of course tighten the security group, okay? And I also had the egress open to all, because I didn't want to create two security groups for this demo. And then this is the EC2 instance. You notice that in my variables I'm asking for two instances, okay? So it's going to create two instances in this case, with eight gigabytes of gp2 storage. These are the tags associated with them. And this is the lifecycle. Lifecycle is an interesting thing in Terraform, where you can say: if I ask Terraform to kill this instance, make sure the replacement is running before you kill this instance, okay? So that's one thing you can put in the lifecycle: create before destroy. The other thing is to ignore changes. This is important because, let's say you have an AMI name that you're using for filtering, okay? Ubuntu releases a new AMI, okay? The data source up top is going to pick up the new AMI ID and feed it here. You don't want to recreate your resource just because Canonical released a new AMI, right? You want to keep it; you'll just see it noted in your plan. So what you say is: ignore AMI updates. Also, let's say someone adds a tag — you don't want to be bothered by altering that tag, okay? So you just say ignore tags too, if you want, okay? There's a lot of documentation on this. And now if you go — of course, I didn't type yes yet. What will happen now is it will go ahead and create those resources, and that way you'll have the IP addresses so you can SSH directly into them, okay? So, in the interest of time, I'm going to hop back to the slides quickly and then we'll come back here. There's one killer feature that Terraform has that for me was mind-boggling, because sometimes you have complicated infrastructure — nobody talks about it, but it's awesome, okay? Especially if you like graph theory: dependency graphs, okay? This is generated directly from your code, from how Terraform parses your code, okay? It tells you what depends on what, what variables you're using, what you're grabbing, et cetera. This is for the networking part. This is for the EC2 part. It tells you what all the resources are — everything depends on the AWS provider, et cetera. This is great, and it can also highlight cycles. So if you have cycles, it will highlight them for you. This is great for debugging complex infrastructure, okay? Also, make sure when you're done to always clean up after yourself, because these things cost money. So: terraform destroy, delete the table, delete the bucket that we created, okay? And last thing, I want to make a pitch for my company, Workday. We're actively hiring quite a bit. I'm not going to brag about being the fourth best place to work, or Forbes' best place for women to work, whatever — all that stuff, I'm not going to brag about it. What I am going to say is one thing about the team that I work in, okay? First of all, we have a killer logo. Number two, you run into what I call 1% problems, okay? Our clusters, our Kubernetes clusters, are so big and so complex that there are problems that not a lot of people run into, okay? And you actually need to cherry-pick things from runc and containerd to make sure that the version you're running works.
If that excites you, if that sounds interesting to you, come talk to me after this discussion, okay? And revisiting here: our EC2 instances were created, and we have the IPs. Thank you very much. You've been very patient, and I hope you use Terraform after this. Thank you. Thank you all for coming to the after-lunch track, where we're all hazy and talking about the best thing to do while sleepy. That's right: managing public clouds. Today we have Kapil Thangavelu talking about Cloud Custodian, some awesome software that he writes and everyone either uses or should be using. Thanks for coming. Thank you for that awesome intro. So I wanted to talk to you guys about Cloud Custodian. Cloud Custodian is an open source rules engine. It is used by thousands of organizations to help them manage their public cloud footprints, primarily around security, compliance, and cost optimization, and it's really designed to let developers have native cloud experiences but with the guardrails on. So: ensure the organization's safety, but also ensure that developers can actually do things and try out new tools. Also, because it's a nice DSL for expressing a lot of different things, it ends up allowing organizations to consolidate a lot of their sort of ad hoc scripts. Like, you know, you first go into the cloud and you're like: I've got an API, I need to do something, I'll write a program. And then all of a sudden you have 500 of these things. What Custodian allows people to do is express all of those in a YAML DSL so they can consolidate them, get a single well-tested tool that can do all those things and also do them in real time, get consistent outputs, consistently deployed, and so do compliance as code in a much more readable fashion than having to read through lots of code. It's so much easier to read through some YAML configuration files. So what does it actually look like? Well, it's the cloud, it's DevOps-y, so of course it's YAML. And the YAML for Custodian is: there's a config file, it's got a list of policies, and the policies are all structured the same way at a high level. This is sort of based on looking at a lot of those random scripts that were doing the ad hoc stuff and trying to understand what they were actually trying to do. They're all basically querying a resource, doing some set of arbitrary filtering on it, and then taking a set of actions on the things that they found. And so the notion with Custodian was: well, if we can make all those filters and actions really small and tightly scoped to a particular thing, then we have a vocabulary for constructing lots of different things with them. The additional capability here is the integration with serverless, which I'll talk about in a moment. But let's talk about the install. So the product started off around 2016 when I was at Capital One, and we open sourced it. Enterprise software just gets a bad reputation, I think — well-deserved sometimes, because it can be difficult to install, it can be tightly coupled to some internal systems. We tried to make Custodian really easy for people to get started with. It should be usable by everyone from a mom-and-pop two-person start-up to a large enterprise. And part of that is being stateless. Stateless means there's no database to install, no web server to install.
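A minimal, illustrative skeleton of the YAML structure he's describing — a config file with a list of policies, each with a resource to query, filters, and actions (the resource, filter, and action shown here are placeholders, not from the talk):

    policies:
      - name: example-policy          # each policy gets a name
        resource: ec2                 # the resource type to query
        filters:                      # narrow the result set
          - "tag:owner": absent
        actions:                      # run against whatever the filters matched
          - stop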
It's a one-line install on the command line: you can pip install it, you can use Docker, whatever floats your boat. And we also tried to make things really nice from an authoring experience. So all the docs are actually built into the CLI — with the schema command you can basically interactively drill down. If I drilled down to, say, something like ec2.filters.offhour, it would actually show me a policy example as well as the JSON schema for it. So everything sort of comes batteries included. And then that's the actual command line. So after the install, I'm basically going to run a policy — oops, I forgot to drop something in there — but I'm going to output metrics about that policy's execution to CloudWatch metrics, and then I was going to stick stuff in a log group — I forgot the log group name — and then send the blob output to a local directory. So there's really rich filtering built into it. There's this language on the AWS CLI that people probably don't know about: if you run any command on the AWS CLI, you can pass --query, and there's this whole expression language. It's actually in the Azure CLI as well. A JMESPath expression is basically sort of like XPath for JSON. It lets you slice and dice and navigate nested data structures to find the values you want. On top of that, around filters, you can do things like nested ors and ands. We have a default value filter, which covers off on like 90% of attribute use cases: you can do age, expiration, value type conversion, membership in sets, et cetera, regexes. And so that ends up being what most people already use just to cover off on the basics. And we have tons of others, which I'll go into in a minute. The other thing I wanted to highlight is the richness of the outputs. All these outputs are built in; they're available for any policy. This is a picture of Amazon X-Ray, so we're actually visualizing and tracing through a policy execution and all the API calls the policy is doing here. You get CloudWatch metrics by default for any of the policies that you're running, so you get to see: how many resources did I find for this policy? How many API calls did this policy make to do that work? So you get a lot out of the box. You get to ship your logs to CloudWatch Logs. And this applies, of course, across clouds. With Custodian, we spent a lot of time in 2018 as a project adding in GCP support, adding in Azure support, and so all these things apply to them as well. And I'll cover a little bit later, on a slide, the full mesh of the multi-cloud support. So we've got these policies. Okay — a lot of people are coming to Custodian incrementally. They've already got an existing set of stuff and they're looking to clean stuff up. So you can express rich workflows with Custodian by chaining these policies together. On the top left, I've got the human-readable statement of what it is we're trying to do: we're trying to find everything that's poorly tagged, we want to stop it in a day and then terminate it three days later — maybe we'll send an email out. The policies in the middle are basically doing mark, sweep, and garbage collect, right? So the first policy in the middle is the thing that's going to mark things. A rough sketch of that mark-and-sweep pair follows.
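A hedged YAML sketch of that mark-and-sweep pair, using filter, action, and mode names as I recall them from the Cloud Custodian docs — treat the exact keys, tag names, and periods as illustrative rather than the talk's actual policies:

    policies:
      # Mark: anything missing a required tag gets tagged to be stopped in 1 day
      - name: ec2-tag-compliance-mark
        resource: ec2
        mode:
          type: config-rule           # deployed as a Lambda behind an AWS Config rule
          # (a Lambda execution role would also be required for Lambda modes)
        filters:
          - or:
              - "tag:app": absent
              - "tag:env": absent
              - "tag:owner": absent
        actions:
          - type: mark-for-op
            op: stop
            days: 1

      # Sweep: runs daily, re-checks the condition, stops marked instances,
      # and marks them for termination three days out
      - name: ec2-tag-compliance-stop
        resource: ec2
        mode:
          type: periodic
          schedule: "rate(1 day)"     # CloudWatch Events schedule
        filters:
          - type: marked-for-op
            op: stop
          - or:
              - "tag:app": absent
              - "tag:env": absent
              - "tag:owner": absent
        actions:
          - stop
          - type: mark-for-op
            op: terminate
            days: 3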
So we've got these policies. A lot of people come to Custodian incrementally: they've already got an existing set of stuff and they're looking to clean things up, and you can express rich workflows with Custodian by chaining policies together. On the top left I've got the human-readable statement of what we're trying to do: find everything that's poorly tagged, stop it in a day, terminate it three days later, maybe send an email out. The policies here are essentially mark, sweep, and garbage collect. The first policy, in the middle, is the one that marks things: it checks whether one of the required tags (app, env, or owner) is absent, and if so it tags the instance to be stopped in one day. On the right-hand side we've got our sweep policy: it runs every day, looks for any instances that have been tagged to be stopped, verifies that the initial conditions still hold, meaning they're still missing one of their tags, and then goes ahead and stops them and marks them to be terminated. So you can chain these things together as a workflow to achieve your ends.

Go for it. [Question from the audience.] Yeah, so that maid_status tag is where the operation gets put. It basically says: resource not compliant with policy (you can customize the message), the action, stop at, and then the date. So it's right there, visible to the user in a tag on the resource, and you can of course change which tag is used; this is just the default.

Both of these policies are actually deployed as Lambdas. When I run the command line, it provisions both of them as Lambdas. The one on the left is provisioned as a Config rule, so we can also track it in the AWS Config dashboard. The one on the right is deployed as a Lambda hooked up to CloudWatch Events so it executes every day. So as a tool it has a serverless provisioning framework built into it, and that's cross-cloud. It's not a generic serverless provisioning framework; it's very purpose-built for its own use, and I don't generally recommend building one. It's just that we typically deploy thousands of these things, so it made sense for us, and at the time we wrote it there was nothing else out there.

Go for it. Yep, and it'll keep them up to date: if you change the policy and run it again, it'll update them. We also distribute a garbage collector. Say you have a set of version-controlled policies and you delete some of them: you re-run the mugc command, give it all of your current policy files, and it'll go look in the environment, find all the things out there that are not referenced from those files, and clean them up: the event sources, the Lambdas, et cetera.

The answer is that because we've built it out as a set of Lego bricks, people often do things we don't expect, because it meets their needs. You can quite literally assemble millions of different types of policies. I've left some examples on the left of things I've seen people do. We've covered off on AWS, Azure, and GCP, and we're starting to work on Kubernetes right now. So it's really rich as far as what you can express, and the second half of this presentation is going to be a lot of YAML on the screen as we talk through lots of example policies.
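Here is a hedged reconstruction of that mark-and-sweep pair. The policy names are mine, and maid_status is Custodian's default marker tag; treat the details as illustrative rather than the exact slide content:

```yaml
policies:
  # Mark: tag anything missing a required tag to be stopped in one day.
  - name: ec2-tag-compliance-mark
    resource: ec2
    mode:
      type: config-rule              # deployed as a Lambda-backed AWS Config rule
    filters:
      - "tag:maid_status": absent    # don't re-mark things already marked
      - or:
          - "tag:app": absent
          - "tag:env": absent
          - "tag:owner": absent
    actions:
      - type: mark-for-op
        op: stop
        days: 1

  # Sweep: run daily, verify the instance is still non-compliant,
  # stop it, and mark it for termination three days out.
  - name: ec2-tag-compliance-stop
    resource: ec2
    mode:
      type: periodic                 # Lambda on a CloudWatch Events schedule
      schedule: "rate(1 day)"
    filters:
      - type: marked-for-op
        op: stop
      - or:
          - "tag:app": absent
          - "tag:env": absent
          - "tag:owner": absent
    actions:
      - stop
      - type: mark-for-op
        op: terminate
        days: 3
```

The first policy records the pending stop in the tag; the second only acts if the original non-compliance still holds, which gives users a grace period to fix their tags.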
So I said it's stateless and it's a CLI; what is it actually? This is your architectural diagram: you've got a CLI, you hand it some policies, and you execute. We can direct where it executes and how it gets its data. We can query by just doing describes against the API, or, in AWS, we can query against AWS Config to grab historical state, which is really cool because now I can write a policy today and ask, was I compliant with this last month, and actually evaluate against historical state. And then, of course, we can abstract out the provisioning and provision the engine itself into one of many different event-based streams. I've covered the rough set of AWS events (CloudWatch Events gives us a lot of different capabilities), and we'll go through some of these one by one with examples.

We've all talked about infrastructure as code, and it has a lot of benefits: it lets humans understand the changes happening in an environment. What Custodian does is try to bring that same mentality to the compliance and operations side, ensuring guardrails around the environment itself. It's interesting, though, because when we talk about infrastructure as code, the benefits are all to the humans. I also wanted something that would be visible and readable to the machines. So we have this tool called policystream, and what it does is take the git history of your policies and stream them out as data. It diffs the policies and tells you what changed: this policy got added in this commit, this policy got renamed, this policy got modified, and you can stream those out into Kinesis. We use it primarily for a different capability, which is diffing arbitrary revisions: I've got a pull request for a policy, so I diff it to understand exactly which policies changed, and then do a dry run and validate on just those.
A critical part of this is making sure you're running valid policies, so there's a built-in JSON schema for all of Custodian and you can validate any policy. Then of course you can dry-run policies: dry run basically means do all the filtering, show me what I'm going to touch, but don't touch anything and don't run any actions. You stick that in Jenkins; we have people using Kubernetes, CodeBuild, GitLab CI. We're context-free on how people want to deploy us.

Tools that operate en masse are always interesting and potentially dangerous, and the reality is the cloud makes things really easy. I've got a script here that no one should ever run: it's three lines of boto in Python that will nuke all of your EC2 instances. When you're dealing with tools operating at scale across an entire fleet, it's really important to have some sort of safety guard, and Custodian has those built in. They're currently off by default (we've got a proposal about turning them on), but there's this notion of max-resources and max-resources-percent. The latter basically says: if I'm going to touch more than, say, 5% of my resources, stop, don't take any actions. It's a seat belt, a circuit breaker, and it's really helpful. There's a proposal to turn these on by default in certain circumstances, because if we hand a gun to a toddler we shouldn't be surprised if something bad happens when we didn't put the safety on.

So the rest of this will cover some of the serverless integrations across the different environments, and then we'll go through some security and cost optimization policies. This is a policy that will provision itself, hook itself up as a Lambda, and create a CloudWatch Events target. Basically, any time a database is being created (and the name here is bad), it verifies that the database is encrypted and is not publicly available on the internet, and if either of those checks fails it goes ahead and deletes it, skipping the snapshot, because the database is brand new: it's still in the process of being created and there's no data in it. It takes something like 20 minutes for an RDS database to actually get created, and this thing fires within 30 seconds or so to verify the resource is compliant with the policy.

CloudWatch Events is incredibly powerful. It's been around for a few years, and it means that on any API call going into CloudTrail you can look at it, inspect it, see what's happening, and decide whether that's something you want to happen. You're effectively able to impose policy on top of the API. Now, for a lot of these things, if you can express it in IAM, you should do it in IAM; but there are plenty of things you can't express in an access-management policy language, and for those, these sorts of policies are really helpful.
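A hedged sketch of that real-time RDS policy combined with the circuit-breaker setting; the event-id path and the 5% threshold follow the commonly documented pattern rather than the exact slide content:

```yaml
policies:
  - name: rds-realtime-compliance     # illustrative name
    resource: rds
    max-resources-percent: 5          # safety belt: bail out if we'd touch >5% of the fleet
    mode:
      type: cloudtrail                # provisioned as a Lambda behind a CloudWatch Events rule
      events:
        - source: rds.amazonaws.com
          event: CreateDBInstance
          ids: requestParameters.dBInstanceIdentifier
    filters:
      - or:
          - StorageEncrypted: false   # not encrypted at rest
          - PubliclyAccessible: true  # or reachable from the internet
    actions:
      - type: delete
        skip-snapshot: true           # it's seconds old, there's no data to snapshot
```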
We also support AWS Config; Config is one of AWS's management tools. Generally, when the cloud providers come out with new management services, Custodian tries to integrate with them and be the easiest way to use them. This is the easiest way to write a Config rule: a Config rule that's checking whether a Lambda is missing a tag or is accessible across accounts (I'll cover that in greater detail later). Another example of Custodian integrating directly with a brand-new AWS feature is Security Hub, where AWS now offers a dashboard that aggregates all of your vendor findings, your Palo Altos and your Dome9s, and Custodian was a launch partner for that. So we try to fast-follow all of the cloud features and be the easiest way for users to use them.

Amazon GuardDuty is amazing. If you're in AWS and you haven't turned it on, I highly recommend it. It's effectively machine learning applied to your VPC flow logs, your Route 53 logs, and your CloudTrail, looking for malicious activity. When it finds malicious activity it sends out an event, and with Custodian you can hook up to that event and go determine what's happening. In this case we're looking at the event and asking: is it a severe finding? If so, let's take some actions. And here again you see the ability to compose the existing set of actions: we're going to yank any instance profile off the instance, stop it, and take a forensic snapshot. You can also use Custodian to audit that your account has been configured against the right master account; that's the policy on the right, which checks that the account is enabled with GuardDuty and pointed at the right master.

With GuardDuty, and with Custodian in general, you can set everything up in a single centralized account from a management perspective and manage across a fleet. On the top left we've got this thing called member role: the policy on the left is deployed centrally, its Lambda lives centrally, it's hooked up to GuardDuty, and any time there's an event it does a role assume back into the target account the event originated from to take the remediation action.
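Here is a hedged sketch of that kind of GuardDuty-triggered response. The severity threshold and finding path are assumptions, and the instance-profile removal he mentions is left out to keep the sketch to actions I'm sure of:

```yaml
policies:
  - name: ec2-guard-duty-response    # illustrative name
    resource: ec2
    mode:
      type: guard-duty               # provisioned as a Lambda subscribed to GuardDuty findings
    filters:
      - type: event                  # inspect the triggering finding itself
        key: detail.severity
        op: gte
        value: 7                     # only react to high-severity findings (assumed threshold)
    actions:
      - snapshot                     # forensic snapshot of the instance's volumes
      - stop                         # then stop the instance
```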
Okay, so Google. Google has a fairly decent cloud functions serverless offering. It supports Python and it's a perennial Google beta. It's interesting in the sense that it's one of the few I've seen that actually does server-side dependency building: you can just give it a requirements file and it'll figure out and install all the dependencies for you on the back end, which is pretty nice. From a capability perspective, we wanted that same ability we get with CloudWatch Events and CloudTrail, imposing real-time policies on the API calls being made, in the other cloud environments as well.

So in this example we've got a policy that, when we run it, will provision a cloud function: any time you start an instance, if that instance has a quarantined tag, it'll stop it. As for how Custodian does that under the hood, when you run it, it hooks up the audit log with a log sink to a Pub/Sub topic, to a cloud function, et cetera, and the stuff on the right is what cloud functions have natively. From a capabilities perspective we consider all of that our baseline feature set (we'll get to Azure in a second): the ability to do serverless API subscribers is the key feature, and then of course there's all the standard stuff around logging and metrics. So from that perspective Custodian is fully capable across all these different environments.

On Azure, that same notion of auditing a resource modification or creation looks like this: we're looking at any key vaults being created, and if they don't have a tag on them, go ahead and auto-tag them. When that's in effect (and this pattern is portable across the different providers), it's basically looking at the event that's happening, figuring out the user in that event, and applying that user's name to the resource as a tag.

Custodian typically works within an account boundary, or whatever the provider's notion of tenancy is: an account in AWS, a project in GCP, a subscription in Azure. That's Custodian's notion of the tenant it executes within. To execute across lots of tenants or lots of accounts we have a tool called c7n-org, which does parallel multi-region execution across hundreds or thousands of accounts. Part of this is that it hooks into the cloud organization's API, the resource management API, or the subscriptions API, depending on the provider, and can generate the config file for you, which you can then customize. The example on the left is actually running an arbitrary script, which is an additional bonus feature of the tool, but we're basically saying: execute this against all my dev accounts in the U.S. And I probably should have mentioned this a second ago: Custodian itself is a core CLI tool plus a set of ancillary tools that we distribute in the tools directory, some of which are also distributed independently. c7n-org is one of those; that's our multi-account runner.
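A hedged sketch of what a generated-then-customized c7n-org accounts file might look like; the account names, IDs, role names, and tags here are all made up:

```yaml
# accounts.yml (illustrative)
accounts:
  - name: dev-us-east
    account_id: "111111111111"
    role: arn:aws:iam::111111111111:role/CustodianExecution
    regions:
      - us-east-1
    tags:
      - env:dev
      - geo:us
  - name: dev-us-west
    account_id: "222222222222"
    role: arn:aws:iam::222222222222:role/CustodianExecution
    regions:
      - us-west-2
    tags:
      - env:dev
      - geo:us
```

Running c7n-org with a tag filter such as env:dev is then what scopes a policy run to just those accounts.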
Mailer is how we get real-time notification back to users. Typically, whenever we're modifying a resource a user created, we'll also take the action of sending them an email or a Slack message so they know in real time: hey, something happened to my stuff, where'd it go? They're not left in limbo; they're notified in real time and told what they should have done to do things properly. It integrates with Slack, it'll hook up to corporate LDAP and Active Directory systems to do email resolution, and you can template and customize the messages as you wish.

A lot of people, as I mentioned, come to Custodian incrementally: they've figured out they've got existing sprawl and they need to do something about it, so what do they do? I find the first thing you should do is figure out who created what. If I have a database sitting out there and I don't know who created it, I can't figure out who to tell that they need to shut it down, or that they're oversized for what they're actually using. This notion of auto-tagging is really helpful in that regard, and you can do it across most of the resources that are taggable. The capability is a little bit racy in some sense: some automation tools, like Terraform and CloudFormation, tag resources after they create them, though most of the APIs, at least in AWS, now support tagging at creation more universally. So I generally recommend that people auto-tag to a tag separate from whatever their common owner tag is, and use it as a fallback for when the owner tag isn't there.

On cost savings, you could spend a lot of time here; there's no shortage of opportunities. The policy on the right is just using CloudWatch metrics, looking for databases that are older than two weeks and have had zero connections in the last two weeks. That's pretty much the definition of an unused database, something no one is connecting to: so go ahead, mark it to be cleaned up, send an email out. You can do that across lots of different resources: an EC2 instance with no CPU or network utilization, CloudWatch log groups that aren't being written to, load balancers with no instances or no requests.

Go ahead. [Question from the audience.] Yes, the coverage spans something like 150 different AWS resources. On AMIs, for example, you can say: do I have an AMI that is unused? By unused I mean it's not in a launch configuration, not in an EC2 launch template, and not currently being used by any EC2 instance, and take that as a signal to clean it up. Then you can apply further filters, say only doing that if it's 120 days old, so you can glom these filters together into very rich expressions. It'll handle snapshots too; people use those for backup. Yes, that one actually just went in recently: previously we would sweep the AMI and then go back and sweep the snapshots, but we now support deleting the snapshots as you deregister the AMI. And there are log groups and retention periods for all kinds of stuff.

Go for it. [Question from the audience.] Yes, the default operator for the top level of filters is and. You can of course nest ors; one of my earlier examples had database encrypted or publicly available. So there are block operators for ors and ands, but the default top-level operator is and.
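Hedged sketches of two of the policies just described, the fallback creator tag applied in real time and the unused-database cleanup; tag names, thresholds, and the follow-up action are my choices:

```yaml
policies:
  # Fall back to a separate CreatorName tag when the common owner tag isn't there.
  - name: ec2-auto-tag-creator
    resource: ec2
    mode:
      type: cloudtrail
      events:
        - RunInstances               # shorthand for the EC2 run-instances CloudTrail event
    filters:
      - "tag:CreatorName": absent
    actions:
      - type: auto-tag-user
        tag: CreatorName             # records the IAM principal from the event

  # Databases older than two weeks with zero connections over the last two weeks.
  - name: rds-unused
    resource: rds
    filters:
      - type: value
        key: InstanceCreateTime
        value_type: age
        op: greater-than
        value: 14
      - type: metrics
        name: DatabaseConnections
        days: 14
        value: 0
        op: equal
    actions:
      - type: mark-for-op
        op: delete
        days: 7
```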
If you're turning stuff off at night, you can save a significant amount of cost just by not running it when you're not using it, and you can deploy these against EC2 instances, ASGs, and all your databases. What this policy is doing, assuming you had, say, a single account for dev and QA, is finding all the EC2 instances in a particular environment that have not opted out and turning them off at night. Then you'd have a separate policy for turning things on in the morning, an on-hour start. Within the tag on those instances, if you're a globally distributed organization, you can actually put language that says which time your particular team or location wants for their off hours: you can put time zones in there and set custom schedules. It's all fairly involved, but there are lots of docs on it.

[Question from the audience.] I don't remember offhand. I think for on-hours I would set weekends to false, or it might be a typo in the example; the idea is that for on-hours you don't want things starting up on the weekend by default. Some people get creative here: turning off an ASG actually involves a lot of steps, so they'll craft another policy for when someone says, I want to come in and do some work, such that any time you add a particular tag it turns things back on for you. The actual process of suspending an ASG is suspending all the internal processes and stopping all the instances, and then doing that in reverse on the way back up. So you can write ad hoc management policies to help facilitate lots of things.

This will run wherever you want; again, we're context-free on deployment. If I added a mode here of type periodic and put in whatever schedule I want; typically the notion is that you run these every hour, just to give them a chance to execute. They're not going to match until the time actually corresponds to what's on the resource.
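A hedged sketch of an off-hours/on-hours pair along those lines; the hours, timezone, and opt-out tag name are illustrative (maid_offhours being the conventional default), and the weekend behavior discussed in the Q&A is left at its defaults:

```yaml
policies:
  - name: ec2-offhours-stop
    resource: ec2
    mode:
      type: periodic
      schedule: "rate(1 hour)"       # run hourly so the off-hour window gets a chance to match
    filters:
      - type: offhour
        default_tz: pt               # default timezone; instances can override via the tag
        offhour: 19                  # 7pm local
        tag: maid_offhours           # opt-out / custom-schedule tag
    actions:
      - stop

  - name: ec2-onhours-start
    resource: ec2
    mode:
      type: periodic
      schedule: "rate(1 hour)"
    filters:
      - type: onhour
        default_tz: pt
        onhour: 7                    # 7am local
        tag: maid_offhours
    actions:
      - start
```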
So, on to security. It's a big topic, lots of stuff, and this slide is actually a couple of years old, so it's even bigger now. I won't get through more than a handful, so let's at least cover some basics. CloudTrail is, I think, now enabled by default, but it can be suspended, so make sure it's on, and do the same for AWS Config. Config is useful just as a historical database of all your resources, and it pairs well with Custodian. One artifact of Custodian growing up in the enterprise is that enterprises have lots of dashboards, so Custodian doesn't ship a default dashboard: we push metrics to CloudWatch metrics, we have people visualizing things in Datadog, some people use Config as a dashboard, some use Security Hub, and some ingest everything into Influx and put Grafana on top. Again, we're pretty context-free about that. Make sure flow logs are on, and if not, send an email out; flow logs are useful, but they typically need additional enrichment to be actionable. And of course do your basics on IAM: make sure everyone has MFA, and if not send them an email, and make sure your account password policy meets basic minimum requirements.

This is actually a cool one on the left that just got added. We're using the IAM policy simulator underneath the hood, collecting the roles for all the EC2 instances, and finding any instances that have an IAM role that lets them create an IAM user. We can do this on Lambdas too, so you can ask: which Lambdas are overly permissioned? A lot of security work on the IAM side is finding all the misconfigured things, plus dealing with proper rotation and management of credentials and secrets (there's a separate topic around dealing with malicious activity, but that's a bigger one). The policy on the right looks for any IAM credentials that are 120 days old and posts a finding to AWS Security Hub that you've got a key-rotation issue against that user. And of course you can yank the key, send an email, et cetera. More IAM stuff: if you detect a root login, send an email, and if you find a root account without hardware MFA enabled, send an email as well.

For people coming from data centers, something that's sometimes forgotten, though in the last year it's become really obvious, is that a lot of these resources are part of your network perimeter. You look at security groups and your egress points, but these resources are URL-accessible, so they're part of your perimeter too. There's been enough noise and news about S3 leaks that people are familiar with that by now, but S3 is not the only resource with embedded IAM policies. Custodian actually has an IAM policy parser built into it that lets you do fine-grained analysis of a policy, making sure it's accessible only from within your accounts, or whitelisting particular actions or conditions if you want. In this case we're looking for an ECR repository that grants access outside our family of accounts, and if it does, send an email and yank the statements that were granting that access.

Of course, make sure you're getting logs from all your egress points and ALBs. If you ever have to deal with DDoS, there's AWS Shield: you can set up Custodian to enable Shield at the account level (careful, there's a one-time $3,000 enablement fee), and once you've dealt with that you can start enabling individual resources. As resources pop in and out, say with blue/green deployments, Custodian can go back through and make sure they're enabled with Shield so you know that protection is always there. Same thing with AWS WAF: you set up your WAF rule sets and then make sure that as resources come and go they stay associated with the WAF.
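A hedged sketch of the stale-access-key policy mentioned above; the key path follows the IAM credential report, and the Security Hub finding type is an assumption:

```yaml
policies:
  - name: iam-stale-access-keys       # illustrative
    resource: iam-user
    filters:
      - type: credential              # driven by the IAM credential report
        key: access_keys.last_rotated
        value_type: age
        op: greater-than
        value: 120                    # keys not rotated in 120 days
    actions:
      - type: post-finding            # raise a finding in AWS Security Hub
        types:
          - "Software and Configuration Checks/AWS Security Best Practices"
```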
Custodian spent a lot of time up front on tag management and encryption at rest, ad nauseam. We actually have a tool in our tools directory called Salactus that will scan every key in every bucket; it's designed for populations of tens of thousands of buckets and many billions of keys. Beyond that, we cover encryption on pretty much every resource that has an option to not be encrypted, so you can verify that encryption is enabled. Granted, encryption at rest protects against a very specific threat model, someone yanking hard drives, but for certain industries that's just paramount.

A newer capability: we've been talking about managing cloud resources from the outside, the SQS queues and all the things you can hit with the API, but with Amazon's SSM agent we're also able to manage the inside of a server. Say we get a GuardDuty event about a particular server: go ahead and install osquery on it, take a memory capture, and drop it to an S3 bucket. Or maybe I think we're running a bad Docker image somewhere: go install osquery and query out all the Docker images. Installing SSM of course also gives you inventory capability, so this lets you send arbitrary commands to your servers. We also maintain a tool called OmniSSM, written in Go, which is a serverless control plane for the Amazon SSM agent so you can use it in other environments, in Google, in Azure, to manage your fleet: it handles the secure handshaking and bootstraps your instances as SSM-managed machines as though they were in data-center mode, just for portability.

Okay, roadmap. We've got 170-plus contributors, lots of pull requests, 1,500 unit tests, and we do matrix builds across lots of versions of Python. There's a really helpful chat room as well. As far as what we're looking to do: we're going to have a week-long sprint right after PyCon (there's a culture of development sprints there), and this year we're focusing on documentation and our Kubernetes provider. We're also looking at stronger AWS IAM capabilities: that check-permissions thing on EC2 is brand new, and the next step is enabling folks to find users that have permissions they're not using, or that aren't using the API at all. And we're looking at policy authoring and experience improvements: editor integrations, letting you write one policy that applies to multiple resources, things along those lines. If you have suggestions, we're always listening. We've got our homepage, we're on Gitter (that's where the chat is), and we just set up a subreddit this week; one of our community users suggested it, which is great because Gitter chat is just not very searchable. There's also a mailing list, although it tends to be super low traffic; most people just hang out in the chat. All right, thank you. Any questions?

Yes. So "policies" gets ambiguous in this context: you need a set of Custodian policies, and you need a set of IAM permissions associated with whatever credentials you're executing with. Internally, every filter and action is decorated with the permissions it needs, so for a given policy we can dynamically construct the permissions required. You can go least-privilege with Custodian, but every time you modify a policy you may need to update those permissions. They can be generated; there's actually a pull request that does the IAM policy generation.
I think that's going to fold into the AWS IAM management improvements: giving people a tool that both checks what permissions they have and whether that matches what they need to run their policies, and generates the IAM policy for them. It depends on what actions you're taking: typically we need describe permissions for whatever resource you're looking at, if you're deploying to Lambda you need Lambda permissions, and if you're going to stop an instance you need the stop permission, so it maps logically to whatever actions the policy takes. The permissions you need are based exactly on what your policies do, and something we're looking at doing this quarter is adding the notion of generating those policies. You can actually create a separate role for each of the policies executing in Lambda, though most people just use a default one and take whatever they've got; a lot of people don't have the ability to modify IAM on the fly, so it's a trade-off on the operational overhead they want to deal with.

[Question from the audience.] Yeah, there's a pull request right now that will do it; I think it's currently a sub-command. Again, everything is decorated, so it's relatively straightforward. Part of it is that we've grown into the notion of multi-cloud, and we don't want to add top-level commands that are very cloud-specific in their outputs, so I think we'll probably distribute it as a separate tool in our tools directory. Any other questions?

So you mean as far as deleting them or updating them? Updating is automatic. Deleting, because it's destructive, we don't want to do automatically: we distribute a separate tool called mugc, you give it your current policy files, and it'll look in the environment, find anything out there from Custodian that isn't in those policy files, and wipe it out. It has a dry run as well, so you can see what it's going to do beforehand. Okay, anything else? If not, then thank you.

Okay, so it's 3 p.m.
So I'll just get started. Hi, I'm Yuki, I'm a software engineer at edmunds.com, and I'm here to talk to you today about Shadow Reader and how you can do serverless load testing by replaying production traffic. I've been at Edmunds for about three years as part of the cloud infrastructure team, so on a day-to-day basis I work heavily with AWS and Python, lots of AWS Lambda and S3, and we leverage a lot of the services AWS offers. If you haven't heard of us before, edmunds.com is an automotive website: we've got car reviews, and you can look up used car prices. As far as our tech stack, we're fully on cloud; we used to be on-prem, but we're fully on Amazon Web Services now. Our infrastructure runs on Docker: our web apps are Dockerized and the containers run on the Elastic Container Service, which is Amazon's managed container orchestration service. As far as languages, we used to be primarily a Java shop, but recently we've been moving toward a Node.js back end with a React front end, and that's been a big win for us in terms of maintainability and speed.

In summary, today I'd like to talk about the challenges of load testing, what's hard about it, the problems people have, and how we tried to solve them: we developed a serverless solution called Shadow Reader that can replay traffic. Then I'm going to talk about a production incident we had at Edmunds, which was really the first time Shadow Reader was used to solve a problem, and at the end I'll go over the architecture and design and do a deeper dive into the details.

First of all, with load testing, what's hard about it? You need realistic request rates. With the traffic at edmunds.com, you can see it follows a general pattern where request rates are low at night and go up in the daytime, but if you zoom in to a one-hour window you can see the request rates jump up and down pretty erratically, almost chaotically. And if you factor in something like bot traffic, which can hit your site at 50,000 requests per minute and then just disappear the next minute, you really want your website to be able to handle those production conditions. If you're running a synthetic load test using a tool like JMeter, Gatling, or Locust (all open source load testing tools), and you're just starting out without a lot of tooling built on top, you're going to end up with a load test that looks more like this: the request rates rise in a linear, predictable fashion, stay at the targeted throughput for a couple of hours, and then trail off in an equally linear, predictable fashion. That's fine for 99% of use cases, but if you want to go the extra mile and have your website handle that random, chaotic production traffic, you're going to want something more realistic.

The next thing you really need is a realistic set of test URLs. Say I was testing the edmunds.com website; I might use these four URLs, something like the used cars page, the used Honda page, an SUV page. These are all valid URLs that real users hit on our website, and there's nothing wrong with them. But if you go through your production access logs, this is really more of what you're going to find.
This is just one URL: it takes up multiple lines, with a whole lot of GET parameters. Here's another one, a GraphQL query being made to the back end. It's hard to generate these with a script and impossible to generate them by hand. So if you want to test your QA website and new versions of your application in a manner that's faithful to production traffic, you're going to want something more realistic.

Some of the other hard parts of load testing: running the load test on your laptop is fine, but if you want something higher throughput you have to run it in a distributed way, with multiple hosts running your load testing tool, and with that come a lot of challenges. The first is maintenance. You have to maintain a set of load test configurations, a set of test URLs plus the request rates you'll send them at, which is what I just talked about, and if you're testing multiple applications and web pages you need a config for each one. Each time you update your application with a new version, you have to update those configs, and that can really eat up one or two developers' time. Additionally, you need boot-up scripts: if you have 10 or 20 servers running a load testing tool, you need a script that installs the tool, maybe the JDK, a bunch of other dependencies and plugins, and sets those servers up reliably. Again, that can take up one or two engineers' time just maintaining config files. Lastly, you need to allocate an appropriate amount of resources to each server: CPU, memory, and network I/O. If your load test boxes start running out of resources during a load test, it affects your performance results: it looks like your website latencies are very high, but what's really happening is your load testing tool is slowing down because it doesn't have enough resources.

If you have all these servers running load testing tools, the server costs also start to add up. At Edmunds we're on AWS and we ran our load testing tool, JMeter, on EC2 servers; at one point we had 24 EC2 servers across two regions, and it was costing us over $10,000 a month. That's pretty expensive, especially considering it's not serving real traffic to users, so that's something you might want to avoid. And lastly, if you've ever used a load testing tool like JMeter, Gatling, or Locust (they're all really great tools), you'll notice that when you click start there's a 5-to-10-minute period where you wait for the tool to generate the test plan and distribute the test URL data to each load test box. If you then want to update the config, you have to stop the load test, start it again, and wait another 5 to 10 minutes, which can really eat into your testing time.

So the solution we came up with at Edmunds is called Shadow Reader.
It was born out of a hackathon in November 2017; it was one of the few DevOps-oriented projects and it managed to win third place. What was really great is that the hackathon was in November and the first time we used it to solve a production incident was in January, so it went from idea to actual production usage in a couple of months.

At a high level, it can replay the URLs as well as the request rates that appear on the production website. Let's say from 1:10 to 1:11 p.m. on edmunds.com a thousand requests came in: Shadow Reader will replay those thousand requests, those thousand URLs as seen, right into your QA environment. The next big thing is that it's fully serverless: it leverages AWS Lambda and S3. Lambdas do the parsing of the production access logs, and another set of Lambda functions does the actual load testing, the sending out of the requests. If you're not familiar with Lambda, it's Amazon's function-as-a-service platform: you write your code (Python, JavaScript, Java, Go, whatever), upload it to the cloud provider, and run it there. You don't have to maintain servers, so operationally it's a lot fewer headaches.

Here's a chart of request rates. In blue is real traffic going to the website, and in orange is Shadow Reader replaying those requests to the QA environment. You can see that the blue line, the real traffic, is pretty erratic and hard to predict, but Shadow Reader replays it pretty faithfully into QA, whereas a synthetic load test would give you request rates resembling a flat line. That's really where Shadow Reader is powerful: replaying production traffic conditions.

At Edmunds, the primary way we use this is to simulate production traffic conditions in QA. That's useful because when we release new versions of our application into QA, since we're replaying production traffic, we can check that the new version of the artifact is returning the expected status codes and the expected response bodies. We also leverage canary deployments pretty heavily. If you're not familiar with canary deploys, that's when you release a new version of your application to a small subset of users and ramp its traffic up over time as you validate it in production, until the new version receives 100% of the traffic. Since our QA environment is under production traffic conditions, we run a canary deploy in QA first, and once it passes there we have pretty good confidence the canary in prod will succeed, because the QA canary was already under production traffic conditions; so we have a two-step validation system.

As far as the load test itself, it replays the peak traffic hour for 24 hours. Say the peak traffic hour yesterday was 1 to 2 p.m.: Shadow Reader replays that 1-to-2 p.m. window for 24 hours, and then at midnight it updates the replay window to that day's peak hour. Sure, yeah, go ahead. [Question from the audience.]
So what's happening is there's a parser Lambda running every minute: every minute it looks at the production access logs, parses them, and pulls out the URLs, so it's basically sampling every minute of the day. And at midnight, like I was saying, the peak traffic hour gets updated; maybe today's peak hour was 9 to 10 a.m., so at midnight it starts replaying that one-hour window for that day.

Okay, so I want to talk about the first time Shadow Reader was really used at Edmunds. There was a memory leak on our production website causing high error rates and high latencies and affecting a lot of our users, and when we started to investigate in the QA environment we couldn't reproduce it: we didn't see the memory leak in QA. That's when we used Shadow Reader to start replaying production traffic, and that's when we were finally able to recreate the memory leak and hone in on the root cause.

This incident actually happened on December 24th, 2017, Christmas Eve. The poor on-call person was paged pretty much immediately, and she was able to diagnose what was wrong after looking at the website health metrics: she saw the higher error rates and latency, and after checking the memory metrics she saw memory growing pretty drastically every minute. What was really good about this incident is that the Elastic Container Service, our Docker orchestration engine, was able to cycle out the bad containers with the memory leak and provision fresh containers without it, so the incident itself was resolved fairly quickly.

Let me see if I can get this laser pointer working. Here at the top left you have the memory usage of the application, and you can see it grows from around 10% to 60%; that's the memory leak in effect. Here's the CPU usage of the application: when the memory leak gets bad, CPU jumps up to 100%, 120%. Here's the website latency, which also goes pretty crazy at the end, and the website availability starts to drop as the memory leak gets worse and worse.

Let me take a quick sidestep and give some context on our infrastructure. We're fully on Docker running on the Elastic Container Service, we do canary releases in prod as well as QA, and we also have autoscaling: if there's a sudden burst of traffic, we provision new containers to handle the increase. These are all really great modern features, but the problem was they were actually masking our memory leak. Every time we did a canary release or autoscaled, we provisioned new containers, so the old, bad containers with the memory leak were being shut down and replaced with new ones. Then, when we collectively decided to stop releasing new versions of the application around the holiday break, those old containers with the memory leak were no longer being replaced, and that's why the leak finally manifested itself on Christmas Eve.

What we found once we started to investigate was that the QA environment didn't have the memory leak. We looked at the containers in QA, the memory and CPU in QA, and they all looked fine; we were perplexed as to why it wasn't happening there. Let me just go over what we saw.
At the bottom you can see the memory in production: it grows from about 100 megabytes to 400 megabytes, which is the memory leak, and at the top is the CPU usage, which jumps erratically from 0 to 80, up to 100% again, so that's the memory leak causing CPU spikes. And this is what we saw in QA: at the bottom is memory, pretty flat, staying at a constant 200 or 300 megabytes with no sign of problems, and at the top is CPU, which again looks pretty tame, jumping to around 40% at most.

The hypothesis we had was that maybe the load test we were running in QA couldn't recreate the memory leak because it wasn't testing the QA application in a way that's faithful to production. In QA we were running Apache JMeter, the open source load testing tool; I think it's the most popular one, it's been around a very long time, it's battle-tested, it's great, it has a lot of features, but for this incident it couldn't recreate the problem. What JMeter was doing in QA was using URLs generated by scripts or by hand, and sending out request rates in a pretty static manner: a chart of the JMeter request rates was basically a flat, linear line.

So what we did was apply the Shadow Reader load to our QA website. Here's the memory usage under the JMeter load: a flat 20% under the synthetic load test. Then we applied the replay load test to the QA environment, and the memory usage immediately started to jump up. We got results pretty quickly and were able to reproduce the error in QA. Now that we knew what the problem was, we pointed Shadow Reader at our local environment: we ran the web server on a laptop, started load testing it, and took a heap dump to see what was going on in memory. We found a 400-megabyte object of pure metadata persisting in the server. It was a cache: every time a user came to our website, the browser would cache some metadata, and the server was also caching the same metadata, for every single user that came to the website, so maybe a thousand or ten thousand users' worth of metadata was being cached. The really bad part was that this cache wasn't being used by the server at all; that 400-megabyte object in the heap dump was completely useless.

Now that we knew the root cause, we tried to figure out why the synthetic load tests hadn't reproduced the memory leak. What we figured out was that the synthetic test wasn't hitting QA with enough URLs, or with a burst rate faithful to production: it wasn't simulating enough users. Only by replaying traffic did we generate enough unique users in the QA environment for the memory leak to really take off, spike the website latency, and reduce availability. The fix was pretty simple: we disabled the server-side cache, since it was never really being used, and memory usage settled at a nice flat 14%. So that was great, and we solved the problem with Shadow Reader.
Okay, so now I'd like to go over the architecture and design: the tooling and languages we use, the AWS services this project leverages, the features (replaying traffic and being serverless), and at the end a deeper dive into how it works in detail.

First of all, we use the Serverless Framework. If you've never heard of it, it's a cloud-agnostic way to package your code: you write a 30-to-40-line YAML file, run one CLI command, and you can upload your code to AWS, Google Cloud, or even Azure. It's very simple and easy to use, and that's how we handle all the Shadow Reader deployments. As far as language, it's Python 3; it's got a lot of new features, type annotations, async support, and Python 2 is being deprecated in less than a year, so if you're still on that, get on Python 3. Some of the AWS services we use: obviously Lambda, and then S3. If you're not familiar with S3, you can upload multiple files to it and download them through the web console pretty easily; it's very easy to read and write from, you don't have to fiddle with read/write capacity units, you just provision a bucket and upload data to it, delete from it, read from it. That's where all the test data, the URLs, are stored.

With the open source version in particular, what Shadow Reader does is parse and replay Elastic Load Balancer logs. ELBs are Amazon's managed load balancing service; if you've ever tried to run your own cluster of NGINX nodes with all your traffic going through it, it can be quite a problem if those NGINX hosts go down, because you've got a big outage, so ELBs are a nice way to reduce that operational overhead. Lastly, we also use CloudWatch Events, which are kind of like cloud cron jobs: we have CloudWatch Events triggering the Shadow Reader Lambdas every minute, and that's what starts the Lambdas replaying requests and parsing the production access logs.
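A hedged sketch of what that kind of Serverless Framework config looks like for Lambdas on a one-minute schedule; the service name, handlers, and runtime here are illustrative, not Shadow Reader's actual serverless.yml:

```yaml
# serverless.yml (illustrative)
service: shadowreader-demo

provider:
  name: aws
  runtime: python3.7
  region: us-east-1

functions:
  parser:
    handler: parser.lambda_handler        # parses new ELB access logs each minute
    events:
      - schedule: rate(1 minute)          # CloudWatch Events as the "cloud cron"
  orchestrator:
    handler: orchestrator.lambda_handler  # kicks off the load-testing Lambdas
    events:
      - schedule: rate(1 minute)
```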
So, on to the features. The big one is that it parses production access logs and replays them. Every minute, as requests come in and get written to the access logs, a parser Lambda pulls out the URLs, the headers, and the request rates and deposits them in an S3 bucket, where a different set of Lambdas takes over and does the actual load testing at the request rates and URLs seen in the production access logs. It also has live replay and past replay: live replay replays requests as they come into the production website, and past replay lets you replay a past time window, say traffic from January 27, 2019, 1 to 2 p.m.; Shadow Reader just got that feature. You can also replay certain headers, like the user agent or the client IP. Why is that useful? Say your web app responds differently for a mobile user agent, or responds with a page in a different language when the IP address is from Europe: if you replay those headers into the QA environment, you can check that QA replies with the expected status code and the expected response body.

The other big thing is that it's fully serverless, which makes it very easy to scale: you provision as many AWS Lambda functions as you need, on demand. You can have one Lambda running one minute and a thousand running the next, so you can go from 10 requests to 10,000 requests pretty easily. At Edmunds we've scaled Shadow Reader up to around 50,000 requests a minute, but it should be able to handle 100,000 requests a minute or more. Another great thing about the serverless model is that it's cheap: it's a pay-only-for-what-you-use model, so you only provision Lambda compute as you need it. At Edmunds we were spending thousands of dollars a month on load testing boxes, and we were able to replicate a lot of the features and the throughput for only about $100 a month in Lambda plus S3 costs, which is a pretty big win if you're looking to reduce those AWS bills.

How Shadow Reader achieves the high throughput and low cost is by running a high number of Lambdas in parallel, each with a very low workload. There's a concept of a worker Lambda, and each worker Lambda is provisioned with 256 megabytes of memory. With Lambda, if you haven't used it before, you can provision from 128 megabytes up to 3 gigabytes of memory per function; the more memory you give it, the more it costs, but you also get more CPU and more network I/O. Since each Lambda handles up to 100 requests, it's easy to keep the memory footprint low. Say there are 1,000 requests that need to go out in a certain minute: 10 worker Lambdas get provisioned and each handles 100 requests. Compared to traditional load testing, there's no maintenance required: no EC2 servers to provision, no boot-up scripts to maintain. Shadow Reader also has a pretty fast startup time: you click start and within about 60 seconds the load test is running; you click stop and it shuts down. It manages that because each load test Lambda downloads only the data it needs; if Shadow Reader is replaying traffic from 1:10 to 1:11 p.m., it only downloads that minute's worth of data, whereas with traditional load testing you'd have to distribute the entire test data set to every single box.

Here's some sample test data for Shadow Reader. All the data is represented as a series of JSON files, and days' worth of data gets partitioned into minute intervals: one minute of traffic is a list of the URLs from that minute, so if a thousand requests came to your website from 1:10 to 1:11 p.m., that gets translated into a thousand URLs. Looking at one of these records, it's got some key-value pairs: the URI, the original request method (a GET), the timestamp, the user agent for that request (Firefox), and the IP address (just 1.2.3.4 here, a sample IP). So in one hour you've got 60 JSON files, and for an entire day you'd have 1,440, because there are 1,440 minutes in a day.
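A hedged sketch of one of those per-minute files, shown as YAML for readability; the actual files are JSON, and the field names are reconstructed from the description rather than taken from the project:

```yaml
# One minute of parsed traffic, e.g. 1:10-1:11 p.m. (illustrative fields)
- uri: /used-honda-suv/?page=2
  req_method: GET
  timestamp: "2019-01-27T13:10:05Z"
  user_agent: Firefox
  client_ip: 1.2.3.4
- uri: /honda/accord/2018/review/
  req_method: GET
  timestamp: "2019-01-27T13:10:06Z"
  user_agent: Chrome
  client_ip: 5.6.7.8
```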
The open source version in particular has these features. First, there's a plugin system, so you can choose between live replay and past replay, which I mentioned earlier: live replay replays requests as they come into your website, and past replay replays a time window from the past. The open source version supports replaying Application Load Balancer and Elastic Load Balancer logs, the AWS managed load balancing services, so Shadow Reader can replay any kind of ALB or ELB logs. Lastly, it has the ability to ramp traffic by a percentage value, so you can tell it to replay production traffic but at 10% of that load and have it function more like a traditional load testing tool.

Now on to the architecture and design. The whole system is composed of four different Lambdas: the parser, orchestrator, master, and worker, and of those four we divide it into two systems. The first system is the parser Lambda: it parses the access logs and deposits the results into an S3 bucket, at which point the other side of the system, the orchestrator, master, and worker, takes over; those are what really do the load testing.

This is a flow map of how a request moves through. A user visits the edmunds.com Honda page; that's load balanced by our NGINX hosts; those requests get logged, and the access logs are shipped to an S3 bucket in real time. The parser Lambda comes in, parses those logs every minute, and deposits the results into this S3 bucket up here, so you've got an S3 bucket full of those JSON files I talked about earlier. There's a CloudWatch Event triggering the parser and orchestrator Lambdas every minute, and the orchestrator, master, and worker Lambdas are the ones pulling out those parsed URLs and doing the load testing.

Let me zoom in on this diagram, starting right here. The user visits the /honda page. At Edmunds we use NGINX as our global load balancer, so every request that comes to the Edmunds page goes through our NGINX hosts, and those hosts push the access logs in real time to an S3 bucket. That's where the parser Lambda comes in: it's looking at those logs every minute of the day, and if it finds new access logs it pulls out the URLs, the headers, and the request rate information and deposits it up here into this S3 bucket, which is essentially just full of JSONs. Next, the orchestrator Lambda takes over. Yes sir, go ahead. [Question from the audience.] The open source version only supports ELBs right now, but at Edmunds we have one public ELB that does all the load balancing, and it load balances the multiple NGINX hosts, and the NGINX hosts are the ones logging to the S3 bucket that's being replayed, right.

So the orchestrator Lambda is like the brain of the system: it has all the load test configuration and knows which time slice to replay, say 1:10 to 1:11 p.m. What it does is invoke master Lambdas: if you're load testing app A and app B, it invokes two master Lambdas, one for each application. Let me go to this slide, which gives a clearer picture. When the master Lambda for app A is invoked, it downloads URL data from that bucket of parsed URLs, and if there are, say, 500 requests (500 URLs) that need to go out, it invokes five worker Lambdas, and each worker Lambda handles 100 requests.
Once the worker lambdas are passed all that URL data, they do the actual sending of the requests — the load testing against your QA environment. Here's the flow map again if you want a better look; I'll also post the slides. Sorry, go ahead. So the parser lambda is running every minute of the day, which means at the end of the day you have parsed data for the whole day. If you want to do live replay you just replay those parsed URLs as they land in that S3 bucket, and if you want to do past replay you can look in that bucket of URLs and say, I just want to replay 1 to 2 p.m. — so you have that option of what kind of load test you want to do. Yeah, that's actually a good question: for live replay there's about a 6 minute delay, because the logs need to make it to that S3 bucket and the parser needs to do some parsing, so if the current time is 1:10, ShadowReader is replaying traffic from around 1:04 p.m. Could you repeat the first part of your question? Yes, it's exactly the same QPS. It's actually a pretty simple algorithm: say 2,000 requests came into your website from 1:10 to 1:11 p.m. — that's just 2,000 URLs — so to replay those 2,000 URLs I invoke 20 worker lambdas and each worker lambda is passed 100 URLs; it's just a division. ShadowReader is invoked each minute, so the run that gets invoked at, say, 1:10 p.m. replays traffic from 1:06 p.m., the run at 1:11 p.m. replays traffic from 1:07 p.m., and so on, minute by minute, so those lambdas have a very short lifespan. Any other questions I can answer? Okay, go ahead. It only supports GET requests — at Edmunds roughly 99% of traffic is GETs. I've thought about adding POST support, but then you need to capture the POST payload and all that, and it adds a bit of complexity. Did you have a question as well? Yes — the memory leak in production happened over a course of about 2 days; when we replayed it into QA it manifested itself pretty quickly. We replayed about 1 hour of traffic over about 3 hours and saw the memory leak manifest quickly, from about 10% to 40%. That could be because in QA we had fewer nodes, fewer containers, so the leak showed up a lot quicker than in production. Any other questions? Okay, go ahead. The question was, if there's a DDoS attack, are you going to end up replaying it? Yes — it will actually replay the DDoS attack if it comes in and appears in your production access logs. It might be good in the future to implement some kind of circuit breaker so the load test doesn't go over, say, 100,000 requests a minute; I might put that up as a GitHub issue. Anyway, this is the GitHub repo on github.com; you'll find a readme with two or three guides, and there's a live replay demo with batteries included — a CloudFormation stack you can deploy that provisions all the necessary resources in your AWS account so you can try out live replay pretty seamlessly. We're welcoming contributions.
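To make that fan-out arithmetic concrete, here is a minimal sketch of what an orchestrator or master lambda might do with one minute of parsed URLs. This is not ShadowReader's actual code: the function name, the batch-size constant, and the payload shape are illustrative assumptions.

```python
import json
import math
import boto3

lambda_client = boto3.client("lambda")

URLS_PER_WORKER = 100  # each worker lambda handles this many requests (per the talk)

def fan_out(parsed_urls, worker_function_name="shadowreader-worker"):
    """Split one minute of parsed URLs across worker lambdas.

    For example, 2,000 URLs become 20 workers with 100 URLs each,
    invoked asynchronously so they all fire within the same minute.
    """
    num_workers = math.ceil(len(parsed_urls) / URLS_PER_WORKER)
    for i in range(num_workers):
        batch = parsed_urls[i * URLS_PER_WORKER:(i + 1) * URLS_PER_WORKER]
        lambda_client.invoke(
            FunctionName=worker_function_name,
            InvocationType="Event",          # async, fire-and-forget
            Payload=json.dumps({"urls": batch}),
        )
    return num_workers
```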
Some of the things I'm working on right now are the ability to replay HAProxy and nginx logs, but most of all I'm welcoming feedback and suggestions — if you try it out and you don't like something, or you want some new feature, let me know; those suggestions are greatly welcomed. Go ahead. Yes — at Edmunds we're building up a system where users can log in and save cars they like, but for the most part the site is unauthenticated, so it's pure GET requests, which makes the model a lot simpler. Now I want to attempt a demo here, just showing how to use the Serverless Framework to deploy into your AWS account. Let's do mirror mode — this is my terminal — and I'm going to run a serverless deploy command, which deploys all my ShadowReader lambda functions into my AWS account. What it's doing is packaging the Python dependencies into a zip file, uploading it to an S3 bucket, and generating a CloudFormation stack so it deploys smoothly as infrastructure as code. You can see it's relatively quick — about 30 seconds and it'll finish. Okay, we're done, and the lambdas that were provisioned are the orchestrator, parser, worker and master I talked about earlier. This is just an AWS account with two load balancers, two ELBs: real traffic is going to one of them, and ShadowReader is parsing those requests and replaying them to the other. Looking at the request count metrics for the last three hours: orange is the actual website traffic and blue is ShadowReader replaying it to the other load balancer. If I zoom out to six hours, you can see it follows the ups and downs of production traffic conditions pretty faithfully. As for the Serverless Framework, it's pretty simple to install — just an npm install serverless command — and deploying your lambda functions is another simple sls deploy command. So that's it, thank you, and special thanks to everyone that helped me get this presentation in order. If anybody has any questions feel free to ask; I also have free ShadowReader stickers up here if you want to come grab one.
Okay, we'll get started right at 4:30, so just a couple of minutes. Who has been to SCALE before? How many people live in Los Angeles? Who traveled the furthest to get here? I've always wondered whether people fly in for this conference or if it's just SoCal. Pennsylvania — wow, that's pretty far; for this room, he's the winner. All right, it's 4:30, so let's get started. Is this still too loud, by chance? I have a tendency to speak pretty loudly. Good. So the name of my talk is Going Serverless. My name is Joshua, and a few things about me: I like dogs, I like Star Trek, Larry David is my spirit animal, and I work on OpenFit. At OpenFit we help people get and stay in shape. Here's a screenshot of our iPhone app: it contains instructional workout videos and some fitness tracking functionality and things like that. Right now we have the following programs: the T30 program, whose trainer has won the Tough Mudder obstacle race something like five or six times, a barre class, a yoga class, and the 600 Seconds course, whose trainer specializes in workouts about 10 minutes long. Platform-wise we're on iOS, Android and Roku, and we have a native web player; we also make and distribute supplements, just FYI. We launched about eight weeks ago, maybe a little less. These numbers are actually out of date: we've got closer to 50,000 users, our subscribers are closer to 30,000, and we've had over 100,000 workouts completed in the last 30 days, so we're growing pretty rapidly. An interesting thing about OpenFit is that 100% of our back end is serverless. Is anyone else doing this — does anyone have a back end that's completely serverless? That's cool, we should chat. How about running serverless, or Lambdas, in production — how many people are doing that? Cool. A few more things about OpenFit: the number of Lambda invocations we've had in the past 30 days is about 2.2 million, and the total cost for compute is less than $70. We want to share what we've learned about serverless, because we have some in-house methodologies and processes, and I want to share how we approach problems. Here's the agenda: the intro, which is what I'm doing right now; what is serverless; tools for developing serverless; things to know before developing serverless; writing code for serverless; testing; debugging; API design; some miscellaneous topics; and then we'll close. So, what is serverless?
Back in the day you may have dealt with on-prem servers, and some people still do. With on-prem servers you have to pay for the hardware; there are safety and compliance issues, like concerns about how far off the ground they have to sit because of flooding and fire; you have to have an IT team who knows how to manage and update them; and you have to have operations to make sure they have uptime and all of that. Over time there's been this movement to the cloud, which has many benefits: you don't have to pay for hardware anymore, you get more compute power with a click, and you don't need an IT team on staff servicing physical servers. There are some problems with this, though. You're charged for keeping servers up even when there aren't any requests — that's true of either an on-prem server or a virtual one. You're responsible for the maintenance of the server and all of its resources, you're responsible for applying any security updates, and as usage scales you're responsible for managing that as well. Over time there's also been the advent of managed services, through the major cloud providers or someone else, and that alleviates some of the issues: you focus on your application and they worry about the scale. I probably don't need to tell people at a conference like SCALE that things like databases, memory and storage are available as managed services. But now we have this thing called serverless — so what is it? I like to use the monorail analogy: just like mono means one and rail means rail, server means server and less means without. It allows you to run production systems without a server — which isn't strictly true, right? The server is someplace; you're just not in control of it. In the clearest layman's terms, serverless computing is an execution model where the cloud provider is responsible for executing a piece of code by dynamically allocating the resources. Your code is in the form of functions, and you're charged for the execution time of those functions. This has led to the advent of what is described as functions as a service, where the actual cost and maintenance of running your application is just another managed service, like memory or storage. Some of the benefits of serverless: pricing is a big one — as one of those first slides showed, it's incredibly cheap to run a production system when done right. Scaling: you get to leverage someone like AWS, and they're pretty good at scaling, so you put the brunt of those issues on them rather than dealing with them yourself. And there's no server management: you don't have to deal with provisioning servers and things like that, which lets you manage your code and your application rather than your infrastructure. One of the key things I really want to drill into for anyone new to serverless: what you're doing at the end of the day is writing code and then shipping it somewhere else to run, and there are implications of that which I'm going to talk about. So, some tools for developing serverless — I'm only going to touch on this; there are a bunch of different tools out there, and I'll talk about what we use and why. There's SAM, Zappa, Claudia, Chalice, and these are tools for building and deploying.
You can build and write code locally on your own box and then ship it up to whatever cloud provider you have. There isn't much in that list for running on Azure, but apparently Visual Studio is a holistic solution for that, and there are a bunch of other tools. What we use is the Serverless Framework, which is just an npm package that lets you build and deploy your code and your infrastructure. One interesting thing about the Serverless Framework is its YAML syntax: you have a serverless.yml file where you describe your lambda functions for whatever cloud provider you want, and it maps each function to a specific URL. If you're using AWS, under the hood this is actually just CloudFormation — the YAML you write builds your stack in AWS. Another thing about the Serverless Framework is that it aims to be provider agnostic: it works on AWS, Google and Azure, though I've only had experience with AWS because that's what we use. There are also plugins for the framework: a lot of automated tasks and common functionality have been extracted into additional npm packages you can make use of, and I'll mention some as we go along to solve common problems. The languages supported for serverless development are contingent on the provider. We use Python for the stack I'm on, and we also use Node.js; off the top of my head I think AWS supports the most languages, and they just had an API release that I think is going to open the floodgates to having almost anything available on AWS, but you'll have to check with the provider for the particulars. Some other things to be aware of, if you're not already: LocalStack, which gives you an AWS-like environment on your local machine, so you can have something like Kinesis and SQS locally and talk to those — it's an aid in development as well. Now, some things to know about serverless before developing for it. Here I'm going to discuss some of the drawbacks — although what I've found over time is that the drawbacks of serverless aren't really drawbacks once you know about them; you just need to know what they are so you can program around them. The first thing to understand is that serverless functions are typically event triggered. In the AWS ecosystem, API Gateway lets you map essentially an HTTP request to trigger a serverless function, called a Lambda — that's just one example. An event can be basically anything: an HTTP request, an event coming off a queue, a manual trigger (you can sit in the console and click a button to trigger it), an event on a timer like CloudWatch, cloud storage triggers (drop a file into S3 and your Lambda triggers after the fact), DynamoDB Streams, Kinesis — these are all different events that just trigger your functions. You can even have multiple events trigger the same Lambda: an SQS queue as well as an API request mapped to the same function. So your stack can have one-to-one mappings, where one event triggers one Lambda, or some mixture.
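Since one function can be wired to several event sources, a handler often needs to tell them apart. A rough sketch follows — not OpenFit's code; the branches rely on the standard shapes of API Gateway proxy events and SQS records, and the return values are just illustrative.

```python
import json

def handler(event, context):
    """One Lambda, multiple event sources (API Gateway, SQS) -- a sketch."""
    # SQS events arrive as a batch under "Records" with eventSource "aws:sqs"
    if "Records" in event and event["Records"][0].get("eventSource") == "aws:sqs":
        bodies = [json.loads(r["body"]) for r in event["Records"]]
        return {"processed": len(bodies)}

    # API Gateway (proxy integration) events carry an httpMethod and a path
    if "httpMethod" in event:
        return {"statusCode": 200,
                "body": json.dumps({"path": event.get("path")})}

    # Manual / console test invocations fall through to here
    return {"statusCode": 400, "body": "unrecognized event"}
```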
Another thing to understand is that serverless functions are stateless, which means you cannot guarantee the persistence of data across invocations. Things like cache, variables and disk — you have access to them while a function is invoked, but you cannot guarantee they'll exist the next time the function invokes, and I'll explain why as we go on. Because of this, your function's I/O needs to be consistent; you need to be very deterministic about how you handle things, and you solve a lot of these problems by simply externalizing them. Instead of writing to something like local disk, you write to S3 — that's your file system now — or you use ElastiCache instead of using cache on the machine. The reason serverless functions are stateless is that the code being executed actually runs inside an ephemeral container, and by ephemeral I mean that it exists for a short time and then it might go away — you're really at the mercy of the service provider for that, and it's abstracted away from you. So imagine this is Lambda or a different serverless environment, on Google or Microsoft: when an event is triggered and your code begins to execute, the platform looks for a container to run your code. If it has one, it simply finds that container and executes your code there; in the event that it doesn't, it goes ahead and builds one. That process is in some cases described as thawing — the function is frozen, then it thaws and becomes hot or warm, and then your code goes ahead and executes. There's a latency difference between these two scenarios, which are described as your function or your Lambda being hot or warm versus cold: a difference of about three to five seconds for the first invocation to process, and your client will be hanging during that time. You can get around this problem by pinging your Lambda — a very common pattern where you have a CloudWatch scheduled event hit your Lambda every five to fifteen minutes to keep it warm, so the next request doesn't pay the startup cost of thawing a container. Because of this ephemeral nature, you're still able to take advantage of the global space in your container — the container has actual memory and disk on it. The first time a request comes into one of our Lambdas we instantiate something like a logger, so the next invocation doesn't pay the execution time of creating that logger again; you can save things in cache and so on. But you have to prepare and program around the scenario where the Lambda has gone cold and that data or that earlier invocation may not have taken place — there's no way for you to guarantee it's going to be there. And a cold start, that process of thawing, will happen for every concurrent invocation of your function: if two requests come in in parallel for the same function and it doesn't have any containers, it's going to thaw two versions of that container and execute the code inside each of them. One thing to note is that your functions execute in complete isolation from one another — there's no cross-process sharing.
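A minimal sketch of the warm-container pattern described here, assuming nothing beyond the standard Lambda handler signature: module-level state is reused opportunistically, and there's an early exit for keep-warm pings. The keep_warm flag is an invented convention, not an AWS feature.

```python
import logging
import time

# Module-level ("global") state survives across invocations only while the
# container stays warm -- treat it as an optimization, never as a guarantee.
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
_cold_start = True

def handler(event, context):
    global _cold_start
    if _cold_start:
        # Pay one-time setup costs (clients, config, connections) here.
        logger.info("cold start: initializing")
        _cold_start = False

    # A CloudWatch scheduled event can ping the function every 5-15 minutes
    # just to keep a container warm; bail out early for those pings.
    if isinstance(event, dict) and event.get("keep_warm"):
        return {"warmed": True, "at": time.time()}

    # ... real work goes here ...
    return {"statusCode": 200}
```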
If you save or store something in memory in one container, you have no guarantee that the next container will have access to that memory, and this again highlights why your I/O needs to be consistent: things run in isolation. This is true across providers — I'll have a series of links at the very end with my sources, but it's on each provider's blog and in their documentation; it's true across the runtimes of all these cloud providers. There are processing time limitations: traditionally a serverless function could run for about five minutes; in AWS I think as of October it's 15. You need to be aware of that, because if your code is executing inside a Lambda and it maxes out the processing time, AWS simply pulls the plug on that process. If you're talking over HTTP through API Gateway to that Lambda, that's an uncontrolled failure — it just goes belly up in the logs. I actually find this is kind of a good thing, because it forces you to think ahead of time about precisely how long something should run, for every code path you have, but you need to be aware of it and program around it. There's also vendor lock-in. That's an issue in a bunch of different cases, but I think it's really highlighted here because you're taking your code and shipping it to someone else to run; you're very invested in their ecosystem, and it can be very expensive and costly to move somewhere else, so be aware of that. So, writing code for serverless — I'm going to talk about how we do things and what works best for us. Serverless code, like I said, is just in the form of functions; that's what they are. You're not writing an entire program — well, in some ways you are, but the entry point is a function and there's no worry about starting it up cleanly or anything like that. Here's an example in Python: basically a shell of a function, def get(event, context). The arguments are injected by the cloud provider; the event might contain information that came in through an HTTP request, or data that was in the queue that triggered this function. In here you can have whatever logic you want — perform some sort of calculation, talk to a database, it doesn't matter. And here's an example, because I mentioned we use the Serverless Framework, of how the serverless.yml maps that function, under the handler attribute for the get handler, to a specific path; behind the scenes when this deploys, because as I said it's CloudFormation, it maps an API Gateway path directly to that function.
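Roughly what that shell looks like, assuming an API Gateway proxy event; the path, the resource name, and the serverless.yml fragment quoted in the docstring are hypothetical, not OpenFit's actual configuration.

```python
import json

def get(event, context):
    """Bare Lambda handler, as on the slide: arguments are injected by the provider.

    `event` carries the HTTP request (via API Gateway) or the queue message;
    `context` carries runtime metadata (request id, remaining time, etc.).
    In serverless.yml this function might be referenced as something like
    `handler: users.get` with an `http: GET /users/{id}` event, which the
    framework turns into API Gateway + Lambda via CloudFormation.
    """
    user_id = (event.get("pathParameters") or {}).get("id")
    # ... look up the record in DynamoDB, run a calculation, etc. ...
    return {"statusCode": 200, "body": json.dumps({"id": user_id})}
```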
Next, organization, mindset and culture — one of the things I want to talk about here is developer responsibility. Maybe you've worked at an organization where operations and development are two separate departments, where development simply writes code, ships it to operations, and operations tries to run it. You really can't do that when you're programming for serverless; you really do need a culture of DevOps. DevOps means a lot of different things, but to me the phrase that really stood out in describing it is "the practice of operations and development participating together in the entire service life cycle." Your developers need to know and understand that they can't expect memory they wrote to be there across invocations, they have to be aware of timeouts, and they have to be aware that their code will execute inside an ephemeral container. I don't know what that means in your organization — whether you have to compile training material, run sessions, write tests for them, show them talks like this or bring them to conferences — but they need to be aware of it, because a traditional mindset from programming on servers can't be mapped directly over to serverless with any expectation of success. Okay, monitoring — I really just wanted to use this GIF — but like I said earlier, you're taking your code and shipping it to someone else, so the only transparency you have into what's going on is the transparency you write and add yourself. Logging here is just as important as in any other system, if not more: you're not able to SSH into a box and poke around to see what's running; you don't have that luxury anymore. I think it also makes a huge amount of sense to profile your applications and your programs, because of serverless's pricing model: you want to be able to identify where the bottlenecks are, because any time you can squeeze more performance out of your code you're saving money, and it's very immediate. So monitoring and profiling are very important here. All right, testing. Like I had just said, you're taking your code and shipping it to someone else to run, so you don't want uncontrolled failures or unexpected things to happen in production, and the only way I know to get around that is a good culture and practice of testing — knowing precisely what's going to happen and how your code is going to behave. Unit testing here is just as important as it's ever been, if not more. The next thing is integration testing: integration tests go beyond the unit level, they're holistic tests, and in this case they'll just test the functions I showed — basically the entry point where your lambda is going to trigger. These are simple to do locally: you just invoke the function, then you can change the parameters and make sure the I/O is consistent. But you really do want to make sure your integration tests run both locally and against a production setting. I do not recommend using Postman for something like this — you want to automate as much as you can — so we came up with a solution to do this; it shouldn't be surprising to anyone who's spent a lot of time testing, but I'm going to show you what we did. Here's an example of an integration test for us: we have an event, then a line of code that simply invokes the function associated with that event, and then assertions at the bottom. That line of code actually represents a package we wrote. Behind the scenes, it parses the serverless.yml file, so we know deterministically exactly what the URL is going to be and which function it's supposed to point at, and at the flip of an environment variable it will either test a production-like setting or run on your local box. This really helps in figuring out any discrepancy between what happens in production and on your local box.
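Here is a sketch of the idea, not their internal package: the invoke helper, the environment variable names, and the module layout below are all made up, but they show the local-versus-deployed switch being described.

```python
import os
import json
import unittest
import urllib.request

# Hypothetical helper mirroring the package described in the talk: it knows
# both the local entry point and the deployed URL (which the real package
# derives from serverless.yml), and an environment variable picks the target.
def invoke(module_name, event):
    if os.environ.get("TEST_TARGET", "local") == "local":
        module = __import__(module_name)            # e.g. users.py -> users.get
        return module.get(event, context=None)
    url = os.environ["STACK_BASE_URL"] + "/users/" + event["pathParameters"]["id"]
    with urllib.request.urlopen(url) as resp:
        return {"statusCode": resp.status, "body": resp.read().decode()}

class TestGetUser(unittest.TestCase):
    def test_get_user(self):
        event = {"pathParameters": {"id": "42"}}
        result = invoke("users", event)
        self.assertEqual(result["statusCode"], 200)
        self.assertEqual(json.loads(result["body"])["id"], "42")
```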
One of the things we've found — as I'm sure a lot of you are aware — is that AWS is really good at building the stacks that make up your application, and it does it quite quickly. So what we've done is allow all of our developers to have their own ephemeral stacks that come and go as we please, where for each application each developer has their own stack. If I'm working on a feature or something like that, instead of deploying to a staging server I simply deploy it to my own production-like stack, run my own tests against it, and by the time we merge we know it's production ready. That piece of code, like I said, will at the flip of an environment variable run either in AWS or on my local machine, and this gives us fantastic test coverage that works both locally and in an AWS-like environment. That leads me into talking about debugging. At this current time there's no debugger inside your serverless console — your cloud provider doesn't, at least not yet, let you run a Lambda, set a breakpoint and step through it. Again, this speaks to the importance of testing, but since we have this ability to test both locally and in our own environment, it's allowed us to iterate very quickly when we have some sort of production issue, and I'll show you our process right now. Let's say we have an end-to-end test that makes a couple of invocations of different functions: at the very beginning it simply posts a value, then puts to update it, and then gets it again to check that the put was successful. Someone begins debugging and finds they have an issue — let's say it's throwing a 502 or something like that — so something's going wrong in production. What we know is that we're able to deploy a stack that mirrors production essentially 100%, and we also have the same version of that code running on the local box. Because of that, our process is actually pretty straightforward: all we have to do is switch environment variables to recreate that same issue locally, continue debugging, fix the error, then write a new test to ensure it's been fixed, deploy to a new ephemeral environment, switch the environment variable back to match that ephemeral environment, run the tests against it, and just create a PR when the problem's fixed. We're only able to iterate this fast because we have everything I talked about in place: a good culture of testing, the testing package I showed you, ephemeral stacks — and really, I think at the end of the day this is just 12-factor app methodology: you keep your prod-like and development environments as close to each other as possible. That leads us into our pipeline, where we do the same thing any time we have a PR. Our stacks are ephemeral, cheap, and quick to spin up — they take about 3 or 4 minutes — so any time a PR is put out there, we use Travis and it goes ahead and builds all the code and runs all the tests locally; if they pass, it will then deploy that code to its own ephemeral stack, switch environment variables, and run against that production setting. So any time you put a PR out, we know it's going to work in a production setting — but that only really works if you have really good test coverage across your application. So, API design: who here has used a bad API? Who here has written one? I just want to see how honest you guys are.
Bad APIs suck, so follow a pattern like REST. One of the things we learned is that the size and shape of the API matters, and I'll get into why and show some illustrations to highlight the problem. Say you have a RESTful model of an application, and it's actually quite nice: very easy, cleanly laid out, where each of these things is a resource and each one has CRUD actions in the form of endpoints. But let's say you want to split that up into microservices, so you take these two different chunks and create their own services out of them. You go ahead and do that, and now you can scale independently: if microservice 1 is getting more traffic than microservice 2, you can simply deploy more instances of microservice 1. The problem is that this creates more mental overhead, especially in terms of monitoring — each one requires its own logs and resources, and its own code base. When you program for serverless, what you need to understand is that each of your functions runs in its own environment, separated from all the others, with its own monitoring and its own logs. So really what I've noticed is that we keep getting more and more fine-grained: from monolith to microservices and now to serverless, to the point where, especially in a case like this, you're actually scaling across verbs — each verb scales independently. If create is getting more traffic, the cloud provider simply provisions more versions of that container to serve your requests. If you don't follow REST, something as clean as this will end up looking like this very quickly, so don't build one-off endpoints. I find things like REST to be guidelines rather than strict rules, so use your best discretion, but you don't want to do something like this. One of the things we did at OpenFit: we had a lot of different pieces of functionality that just required basic CRUD. This is what it would look like if we followed a traditional monolith, where each of these resources talks to its own database table. We actually don't use a SQL database; we use DynamoDB — for those of you not familiar with it, it's a NoSQL database you can think of roughly as an object store. So what we did was try to consolidate functionality, and we ended up with just four lambdas that handle all the basic CRUD in our application, and it works quite well for us: there are only four separate functions to monitor, essentially, and it creates a very clean mental model to work with, to ship, and to think about. We don't do this for things like users — only for objects that require basic CRUD actions — because if we had independent CRUD services for each and every one of our resources, we would end up with something that would probably look like this, and if it looks confusing, it's because I intentionally made it confusing. You don't want to manage something like 16 lambdas for just four resources. So what we've learned is to consolidate as much as possible: really, in short, the size and shape of your API matters, and consolidate the functionality where you can.
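A toy version of that consolidation — four generic verb handlers instead of one handler per resource per verb — assuming DynamoDB tables named after the resources. The resource names, path parameters, and table layout here are invented for illustration.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")

# One generic handler per verb instead of one handler per resource per verb.
# The resource name comes in on the path, so /favorites/{id} and /programs/{id}
# are served by the same four functions (this sketch shows only the GET one).
ALLOWED_RESOURCES = {"favorites", "programs", "settings"}

def get(event, context):
    resource = event["pathParameters"]["resource"]
    item_id = event["pathParameters"]["id"]
    if resource not in ALLOWED_RESOURCES:
        return {"statusCode": 404, "body": "unknown resource"}
    table = dynamodb.Table(resource)            # table named after the resource
    item = table.get_item(Key={"id": item_id}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": "not found"}
    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```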
All right, some miscellaneous topics — some thoughts on how to introduce serverless into an existing infrastructure. If you already have something like an SQS queue, you can have a Lambda trigger off of that as an event; if you need some post-processing on something you drop into S3 or a queue, you can hang a Lambda right off of it, alongside the current code you're running. Another case is when you need glue between two services: service A talks to service B, but service B expects a different contract than A is sending, so you can have a Lambda fire in between to broker the request and do the data transformation. Another thing to talk about is business justification: I would show people that slide where we're running our stack for only $70. When to use serverless: it's a really good fit for us at OpenFit because we're an exercise company and our traffic changes a great deal throughout the day — people have a tendency to work out in the morning, so we get a big spike, it drops off during the day, and at night it comes back. We really benefit from that because we scale on demand; if nobody's using it, we're not being charged. If you have an application or a problem with a very consistent load, serverless might not be the solution for you, and you can do a back-of-the-envelope calculation for that.
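That back-of-the-envelope calculation might look like the sketch below. The two price constants are approximate published Lambda rates and should be treated as assumptions here — check your provider's current pricing, and note the free tier is ignored.

```python
# Back-of-the-envelope Lambda cost estimate (prices approximate; free tier ignored).
PRICE_PER_MILLION_REQUESTS = 0.20          # USD, approximate
PRICE_PER_GB_SECOND = 0.0000166667         # USD, approximate

def monthly_cost(invocations, avg_duration_ms, memory_mb):
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND

# e.g. 2.2M invocations a month, 300 ms average duration, 512 MB of memory:
print(round(monthly_cost(2_200_000, 300, 512), 2))
```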
On the learning curve: I had a background working with traditional servers, where we wrote code, it ran out of a box, and that was kind of it; I wasn't dealing with managed services or cloud providers for a long time. What really didn't occur to me at first was that things like API Gateway, which maps to your Lambda, run in their own ecosystem — so if I was getting a 403 on the client, I would immediately go to the Lambda to see what happened, but really those logs were associated with API Gateway. That sort of decoupling in their ecosystem was something I wasn't used to, so just be aware of stuff like that. So yeah, in closing: I think it's very early days for serverless, and I think as time goes on there's going to be a big shift toward running code this way, because it's very cheap, very cost effective, and you can iterate quite quickly. It's actually now my go-to solution — I think about problems in terms of serverless rather than thinking of them running on a traditional box. Here are some sources, and that's it; we've got time for questions, and I'll also be around. Yes — so we actually do serve our current production system this way. The question was whether we serve a traditional web page, and how web resources get information sent to them — would we use nginx, or stick with Lambda? We currently use exactly the stuff you've seen, through API Gateway; we have no partiality between our clients — we have a web player, a web app you can go and use, it's all React Native, served the same way as any other traffic. Oh I see, yeah — all assets are actually stored in S3. The question was about the initial hit, where the assets are stored: our assets are stored in S3, that's where they're pulled from, and then the app makes the API requests and hydrates itself. Yes sir, good question. The question was how you keep your traffic, or your containers, warm when you have a huge amount of traffic, since everything runs in isolation. There are patterns around that: if you're using the Serverless Framework there are plugins that handle this automatically — I believe you can turn the knobs and make sure containers are available. Another way is to solve it yourself: you can have a timed CloudWatch event that pings a lambda every 5 to 15 minutes, and what that lambda does is ping other lambdas, so you essentially multiplex. Anything else? Well, now the limit is 15 minutes — that's a good question; I haven't dealt with that a whole lot. It is a serious limitation of Lambda that you have that processing time cap, so if you're doing something highly throughput-intensive in memory it might not make sense, and if you're doing something that runs the full 15 minutes you might have to ask yourself whether you're benefiting from serverless at all, because it's cranking non-stop. I see — yes, we have a whole pipeline that gets data into those warehouses, and that's controlled through Lambda as well, but we actually don't do that batch reporting in Lambda. In our testing environment, you mean? I'd have to look at the command — we have a container with all our dependencies that runs in Travis, and then essentially it switches over and hits the cloud provider rather than running the tests on the local box. Yes sir — the question was, if you started off with serverless and then wanted to move back to a traditional architecture on a server, what would the migration path be like? I don't see that being that big of an issue, because you have functions that essentially map to HTTP requests already, and if they're already unit tested and you're using Python, I don't see a big problem taking them and moving them directly to Flask or something like that. Going backwards isn't a big deal; it's going forward — starting with a VPS or a server and then going serverless — that's harder. How do you do local testing if you have a large database? How large is large? I don't know. We do blue-green deploys: we deploy a new version and then we're essentially able to hit the switch and roll traffic over to the new set of lambdas. On the issue of contracts breaking: we've been very strict about them, so we haven't run into that, but essentially we deploy a new version and flip the switch — if it's green we switch to blue, if it's blue we switch to green, simple as that. On security, it depends what you're talking about: things like SQL injection, if you're talking to a SQL database, are still something you have to worry about, but if you're worried about some Linux package vulnerability, that's what the cloud provider handles — that's basically what you pay for. Is there any level of encryption between them? I believe it's supposed to be a secure transaction between the two. The question was how we get around the problem of whether that has to run in a private VPC versus over the public internet — we haven't had that problem, so we haven't dealt with it; I know there's stuff out there and it's a fairly googleable problem, but that's all I know about it. Is that it?
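The "one lambda pings other lambdas" multiplexing answer above could be sketched roughly like this; the target function names and the keep_warm payload are invented conventions (matching the earlier cold-start sketch), not a real plugin's interface.

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# Functions to keep warm; in practice this list might come from an environment
# variable or from a warm-up plugin's configuration.
TARGETS = ["users-get", "workouts-post", "programs-get"]

def warmer(event, context):
    """Triggered by a CloudWatch schedule every 5-15 minutes; fans the ping out."""
    for name in TARGETS:
        lambda_client.invoke(
            FunctionName=name,
            InvocationType="Event",
            Payload=json.dumps({"keep_warm": True}),
        )
    return {"pinged": len(TARGETS)}
```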
That's our VP of Product, by the way. Yes — everything is behind OAuth, and we use Cognito for that; again, we're just leveraging AWS there. The clients are all hooked up to Cognito, you get a session and a user token, and that's how they're secured. Yeah, it's just completely different — that's a whole talk in and of itself. Okay, I'll be around if you guys have any more questions.

Okay, it's officially six o'clock and I'm between you guys and dinner, most likely. I have actually posted these slides online, and they're linked from the talk. So, who am I? I've personally met all three of you, but anyway: I've been working on Hadoop since 2006. I was working at Yahoo on the web search team when they said, hey, do you want to go work on this open source thing we're going to develop, and I said sure, because I'd wanted to do open source for a long time. The team I was the first member of became the Hadoop team at Yahoo, the one that really took Hadoop from being a project that ran on ten nodes to being able to run on four thousand. As the first committer added, I was the tech lead for MapReduce, security, Hive, ORC — I've done a lot of different things, and I've also worked on a lot of the different file formats over time. Then we spun Hortonworks out of Yahoo in 2011, and that's been another crazy ride: taking a company from 25 people up to 1,400, and now merging in with a 1,600-person company. It's been a wonderful success story — it actually took us three and a half years to go from getting created to IPO; Cloudera IPO'd a couple of years after us, and now we've merged together. Hortonworks still wasn't quite profitable, but we'd gotten really, really close to cash-flow neutral, and Cloudera was losing more money than we were — although I'm the tech guy, not the financial guy, so I shouldn't even be talking about the numbers; I'm sure our financial people would cringe. Okay, so welcome. What's the problem we're trying to solve? A lot of data is very, very sensitive. You've got personally identifiable information that people are very concerned about; credit card data comes with a lot of restrictions on how you have to store it and how you have to process it; medical information is even crazier. And these days companies run on data — companies of all sizes and varieties,
everything from big chains like Walmart and Gap — everyone has lots and lots of data, and they're getting a lot of value out of that data, so they need to control it. And of course companies weren't controlling it well enough, so Europe decided to force their hand a little with this thing called GDPR. Every company I talk to has put a lot of effort in over the last couple of years to deal with GDPR, which basically means that each customer has the ability to get their data, to download their data, to correct their data, and to delete their data — so it's a huge deal. Now, all of this doesn't actually go very well with the Hadoop ecosystem, because with Hadoop you traditionally wrote data once and then it stayed there, and furthermore, if you had access to any of the data in a file, you had access to all of the data in the file. That's a problem, because if you have social security numbers, you want some people to be able to get them and you'd rather the rest of the people not see them, so you want to be able to control who exactly can see what. Furthermore, you want to be able to audit this stuff — to see who looked at which parts of the data — and in particular, if there's a break-in, you want to be able to see what the intruders scanned. For example, when there was a break-in at LinkedIn, one of the huge advantages was that they were able to look at the audit logs and see what the intruders had looked at; fortunately for the Hadoop system they didn't look at Hadoop, because they didn't know what they could have looked at, but the audit log lets you verify that kind of thing. Same thing when you have an employee who makes a mistake or does something they shouldn't do: you can look to see what they actually looked at, and that's important. And finally, there are a lot of places where you need to control exactly what gets written to disk. This one I've got a lot less sympathy for: if your company is doing its data retention right, the only way a hard drive should be leaving your data center is as fine metal dust. You don't want drives leaving your data center with anything on them, and even better is ground up into little dust. So encrypting on disk isn't nearly as important as most people think it is — but it actually is also part of the problem, because there are some industries where you really do have to encrypt on disk. The biggest problem with encrypting on disk is exactly the fact that if the whole disk is encrypted, a lot of people have to have access to the keys: if you've encrypted the whole system disk, the system clearly has to have access to those keys every time it boots up, and those keys are what's protecting your data. Now, obviously HDFS and the blob stores in the cloud world give you granularity at either the file or bucket level, but that isn't enough, because you've got a lot of cases where you want to control things at a finer grain than that, and there's no way you can teach the file systems about columns. At one point someone said, oh, we can just tell HDFS to set permissions on these ranges and those ranges of bytes, and I said, there's no way you want an interface like that — the users will never understand it well enough to make it work, and you'll just end up with lots and lots of security holes. By the way, please feel free to ask questions — there's only a handful of you, so let's make this as interactive as we can. The point is to give you information, so if I'm saying anything that's confusing, please ask as we go.
Okay, so what are the requirements? First, the readers and the writers should be able to do this transparently — we don't want readers and writers to have to make any changes to their programs. If the user has access to the key, then they can read the data; otherwise they should get nothing. And it has to be decrypted locally. One of the things about Hadoop is that it's almost a perfect denial-of-service tool — one of the Yahoo sysadmins used to say, people do stupid stuff all the time; Hadoop lets people do stupid stuff at scale. I've seen Hadoop take down pretty much any other service: it's taken out NFS filers, it's taken out web servers. We once had half a cluster go down because a QA person referenced their home directory from a job, and the cluster at that point was set to auto-mount home directories when they were referenced — the result, of course, was that we took off the auto-mounts. Fundamentally you have to be really, really careful with what you do with Hadoop. That matters here because when you first talk to the security guys, they say, oh, we can make it so that every record gets decrypted by the key management server. No — that just won't work at Hadoop scale. You can't be sending all of your data to some server to decrypt it for you; you need to factor it differently, and I'll talk about what exactly we're doing instead. That said, you don't want to pass the master keys out to all the jobs, because the more people who have access to those keys, the more chance of those keys leaking. And finally, you also need support for key rolling: you use a key for a given amount of time and then you roll to a new version; you need to keep the old version so that you can decrypt the old data, but you keep rolling through and creating new keys as you go. Okay, so what solutions are people currently using? The first is HDFS encryption, which was added a few years ago. It's transparent, and it defines these HDFS encryption zones, which are basically recursive HDFS directory trees where everything under that subtree is encrypted with a master key. That does in fact mean it's encrypted on disk, and it gives you some capability to keep your HDFS admin from reading the data, so your backups can be secure — they can do backups without having the key. The client talks to the NameNode, gets the encrypted key for that file, asks the key manager to decrypt it, and then the client reads the data directly. As part of this, the KeyProvider API was defined, which we'll talk about a little later. So what's the big problem? The biggest one is that it's very coarse protection: you can only protect whole directory subtrees, so if you're laying your tables or your data out with the normal patterns, that means you can protect partitions but you can't protect particular files — you're going to be protecting everything or nothing. And at that same point, that means lots and lots of people have access to the keys, because anyone who needs to read those files at all has access to the keys and can decrypt the data. The other issue comes when you're writing data with Hadoop or Hive — Hive is actually the worst, because every time you write data with a Hive query it moves the data about three times. Now, we fixed Hive in the latest versions so that it pays attention to encryption zones,
but if you aren't careful you end up moving data between encryption zones, which means it has to be rewritten, and that's really painful. Another thing people do is push everything through HiveServer2. This means you can only process the data with Hive: you submit all your queries into the system, it supports LLAP, and Ranger gives you access control over rows and columns and lets you dynamically mask the data. Dynamic masking is kind of a cool thing: you can actually have Ranger say, okay, this user can read the data between 9 and 5, but if they read it after hours they can't see the sensitive columns or the sensitive rows. You can also set up rules that say Owen can read the data for the US but Olaf can only read the data for Europe, and those kinds of rules get applied automatically by Ranger. If you want to look at a picture: the user submits to the client, it runs on HiveServer2, all the security checks are run on HiveServer2 talking to Ranger — I guess I should put Ranger into this picture — and then it talks to the metastore and HDFS behind the scenes. So all the data is owned by Hive and you can't access it directly, which is actually exactly one of the biggest problems with this approach: you've all of a sudden taken one of Hadoop's strengths and cut it off. One of Hadoop's strengths has always been that you can process the data however you want — the data is open, you can access it through Spark, through Hive, through Pig, you can write Java if you want; any way you want to read the data that you have permission for works. This forces everyone into Hive, so they have to write SQL to access the data. Traditionally that also meant you were limited to getting the data back through one machine at a time, but there's new work that lets Spark connect through an LLAP connector, so if you have LLAP and Spark you can read the data distributed — you can actually read large amounts of data out of HiveServer2 through Spark or other tools. Another solution I've seen a large customer use was to pull the PII data out completely into separate tables. They could control who had permission to access those tables, so someone without permission for the PII data could only read the public table and not the private one. That works, but it creates a huge, huge operational overhead, because now you need to keep the two sets of data in sync and you need to make sure the permissions stay right. You can pair HDFS encryption with this to make sure the data is encrypted on disk, but it also slows down your reads: if someone is reading both the private data and the public data, now they've got to do a large join, which is never fun if you've got a lot of data. Finally, Hive has some encryption UDFs — AES encrypt and decrypt. This is really not recommended. It does in fact let you encrypt — it encrypts each value as you go — but the key management is problematic: you end up putting your key into the query so that people can see it. The encryption isn't seeded, so the same input will always produce the same output, and the encrypted value is the same size as the original value, so you can easily tell the difference between a null value, which is going to be really short, and a full string, because you can see the length of the string.
Even the length of, say, an email address gives you information — you can start breaking people into categories by the length of their email address — so you need to be really careful with that stuff. Okay, so what did we actually do? Fortunately, I've been working on ORC for a long time, and it writes data in columns. It does that precisely because it lets you highly optimize the compression: each column has similar values, so it compresses very, very well. For example, if you're compressing zip codes, it's only zip codes you're compressing, and even zlib does really well — much better than if zip code, address and name are all munged together. The other thing it lets you do is read just the bytes you need: people often have 100 columns, some tables even have thousands, and if you only need to read two of them you don't want to read the entire file — you want to read just the bytes for the columns you need for that particular query, which makes things much, much faster. And in this model encryption works really well, because the columns are already separated, so you can encrypt just the bytes you want. The file format looks like that: it's broken up into stripes, each stripe is broken into columns, and so we can encrypt each column individually. Oh, you want to see — okay. So what does this actually look like? Fortunately it's really, really easy for the user: you just set some properties on your table and say, okay, I want to encrypt the social security number and the email using the PII key, and the card info using the credit card key, and the writer will just take it from there; the reader will automatically decrypt it on the other side. You also need to define where to get the encryption keys. Fortunately we plugged into the Hadoop API, so you can use the KeyProvider from Hadoop or the Ranger KMS to get your keys. Now, we did a little bit more than that: the Hadoop KeyProvider API wasn't really set up with the cloud providers in mind, so we actually extended the Hadoop API a bit to make it more compatible with the cloud KMSs, because for people who are running in Amazon we want to be able to make a plugin that talks to the Amazon KMS, not force you to run a Ranger KMS. So you can use the Hadoop or Ranger KMS, and the advantage, of course, is that the master keys stay on the KMS. You create a master key for each kind of use — you'll create a PII key, a PCI key, a HIPAA key, whatever you need — and then you decide which users have access to each of those keys; you can use Ranger to say who gets the PII key, and that someone can only get the PCI key between 9 and 5. For each column we actually create a random local key — local keys are just random bytes that are unique to that file — and the key manager encrypts those local keys and we save them. It's important that the user never gets access to the master keys: we don't want to give out the master keys, so the KMS encrypts or decrypts the local keys instead. We provide for key versioning, and we allow third-party plugins. So this is what it actually looks like in flow: your user submits the task onto the cluster, and the task has the ORC file it's trying to read; the SSN column is protected, and its local key is stored encrypted. The task first reads the encrypted key up on top, sends it down to the key management server, gets the decrypted key back, and then it can read and decrypt the encrypted SSN data.
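Here is a toy illustration of that key flow — not ORC's actual code or file format — using Fernet from the Python cryptography package as a stand-in for ORC's AES machinery. The point it shows is just the envelope: the master key stays in the "KMS", and only the small wrapped local key travels.

```python
from cryptography.fernet import Fernet

class ToyKMS:
    """Stands in for Ranger KMS / a cloud KMS: holds master keys, wraps local keys."""
    def __init__(self):
        self._masters = {"pii": Fernet(Fernet.generate_key())}

    def wrap(self, master_name, local_key):      # encrypt (wrap) the local key
        return self._masters[master_name].encrypt(local_key)

    def unwrap(self, master_name, wrapped_key):  # only callers allowed the master key get here
        return self._masters[master_name].decrypt(wrapped_key)

kms = ToyKMS()

# Writer side: random local key per column per file; column bytes encrypted locally.
local_key = Fernet.generate_key()
wrapped_key = kms.wrap("pii", local_key)          # stored alongside the file, not the data
encrypted_ssn_column = Fernet(local_key).encrypt(b"123-45-6789,987-65-4321")

# Reader side: send only the small wrapped key to the KMS, decrypt the data locally.
recovered_key = kms.unwrap("pii", wrapped_key)
print(Fernet(recovered_key).decrypt(encrypted_ssn_column))
```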
The encrypted key, of course, is small, 16 or 32 bytes, and the encrypted SSN is the same size as the SSN data would be without the encryption. But this is how you get around sending every row to the KMS while still keeping the master key protected.

Okay, so with those local keys: those local keys mean that when you're reading, you actually need to talk to the KMS to decrypt them, and so we can track who's read exactly which file. This also limits the vulnerability in terms of what happens if someone remembers the local keys, because of course one of the challenges here is that someone who gets access to a file could remember those local keys, save them, and keep them for later so they can decrypt the data. But those keys are only for that column of that file, so yes, they could keep those keys, but a key doesn't tell them very much; it just lets them read that column of that file. Of course, they could just copy the data out anyway, so saving the local key is only a slightly smaller exposure, but we decided that was good enough for the trade-off you get from being able to control who gets to see the data.

Now, one piece that is very important with this kind of encryption is that you need an initialization vector for the encryption, and it has to be unique, so we generate it in a way that guarantees it's unique for each key.

Now, what happens if you don't have the key? Our first pass was, oh, the user should just get nulls. That's better than what would happen if you didn't do anything, which is that you'd just get exceptions. But the security guys we talked to said, hey, that's not as useful as we'd like; sure, sometimes we'd like null, but we'd actually like to have control over what people see if they don't have access to the keys. So the default is still nullify, where all values become null, but we also wanted to copy the masking that Ranger can do, so we put in a redact mask where you replace letters with x's and numbers with 9's. It's actually a pretty flexible format: you can also define ranges of characters to leave unmasked, so for example you could mask the social security number but leave the last four digits unmasked, or leave the last four digits of the credit card unmasked. I'm sure you've all seen that on credit card receipts, where they mask everything but the last four; this is providing that same kind of capability. Another option is to replace the value with a SHA-256 hash, or you can define a custom mask.
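To give a feel for what the redact mask produces, here is a small, purely illustrative Java sketch (not ORC's actual implementation): letters become x, digits become 9, and a configurable number of trailing characters is left unmasked.

```java
public class RedactMaskExample {
  // Illustrative only: letters become 'x', digits become '9',
  // and the last `unmaskedSuffix` characters are left as-is.
  static String redact(String value, int unmaskedSuffix) {
    StringBuilder out = new StringBuilder(value.length());
    int cutoff = Math.max(0, value.length() - unmaskedSuffix);
    for (int i = 0; i < value.length(); i++) {
      char ch = value.charAt(i);
      if (i >= cutoff) {
        out.append(ch);                     // leave the suffix unmasked
      } else if (Character.isLetter(ch)) {
        out.append('x');
      } else if (Character.isDigit(ch)) {
        out.append('9');
      } else {
        out.append(ch);                     // punctuation such as '-' is kept
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(redact("123-45-6789", 4));          // 999-99-6789
    System.out.println(redact("4111 1111 1111 1234", 4));  // 9999 9999 9999 1234
  }
}
```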
Now, one of the challenges with masking: please, please be careful with this stuff. Anonymization is very, very hard, and there have been an impressive number of screw-ups in this space already. The first one I became aware of was the AOL search logs. AOL made a bunch of data available when they released their search logs, and they were very careful and scrubbed all the IP addresses out. But they wanted to let people see when users were searching for different things, so they left a user ID in there that was not supposed to be reversible. Unfortunately, it turns out that given a search log with enough people in it, people will do vanity searches. I don't know about you, but I've at least occasionally searched for my own name to see if there's anything out there I should be aware of, and I'm willing to bet that most of you have too. And so that becomes a way of de-anonymizing it: how many people are going to search for some random person who isn't famous? Not very many. So you can take a good guess that a vanity search identifies that user, and if they then turn around and search for some illness, you can say, okay, that person probably has that illness. Just being able to join on this stuff lets you do a lot of de-anonymization, which was bad.

The Netflix Prize database was another interesting case. They released a bunch of data because they wanted people to build machine-learned models for recommendations, which seems harmless. They scrubbed all the user IDs but gave you data saying this user gave these reviews, and asked you to figure out which movies they'd like. It turned out there were enough public databases where people also posted movie reviews that you could match the data up between IMDb and the Netflix data, and now you could start de-anonymizing, not everyone, but some of the people. You could go through and say this person is probably that person by matching them up, which let you de-anonymize people pretty efficiently.

The other one is a great data set: New York City taxis. Every time someone takes a taxi, they record the longitude and latitude where they got picked up and dropped off, the times, and what the fares were. It's a great data set for people who like open data; I use it for benchmarks all the time, because I hate synthetic data, it has a bunch of properties you don't really like, whereas real data is amazing, and because it's public everyone can access it. Well, in general, reporting the longitude and latitude within New York City doesn't give you any user-identifiable information: if you get picked up at the corner of 5th and Broadway, that could be anybody. But it turns out that some of the drop-offs or pickups were out in the suburbs, and even more importantly out in places where the houses are really far apart. Those people you could pick out: you could say this person was almost certainly going to that house, because there aren't any houses nearby. Although, as always, you get noisy data; part of the amusing thing is that if you graph the data you see people getting picked up from the lakes in Central Park, probably not, so you have to be a little careful about trusting the data too much, but you can definitely de-anonymize some pieces. The other piece is that in the early data set they hashed the taxi medallion. You don't want to leak which taxi it was, but you want people to be able to join on which taxi it was, so they used a hash. Unfortunately, taxi medallions are assigned sequentially, so all it takes is about a millisecond to run through the 100,000 or however many medallions, compute the MD5s, and now you've completely de-anonymized that data set. So please, please be careful with this stuff: nullify is your friend, and you definitely want to analyze the security trade-offs.

Now, one of the use cases that's interesting comes up especially for the search engines. They need to keep the data around for 90 days and then make sure it's de-anonymized. One of the ways they currently do that is the separate-cluster approach, but with column encryption they could do this more elegantly: you roll the key every day, write the data with today's key, and then after 90 days delete the old version of the key, so that no one is able to read the old data after the 90 days. Now, obviously you can go back and rewrite the data if you want, but this would let you do way less work and not have the duplicate data and the multiple sets of operational overhead that would imply. Does that make sense?
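As a concrete illustration of the key-rolling idea, here is a toy sketch in plain JDK crypto. It is not the ORC or Hadoop KMS API, just a stand-in showing how wrapping per-file local keys with a versioned master key means that deleting an old master-key version makes everything written under it unreadable.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a KMS that holds versioned master keys and wraps/unwraps
// the per-file local keys. Illustrative only.
class ToyKms {
  private final Map<Integer, SecretKey> masterVersions = new HashMap<>();
  private int current = 0;

  // Called once a day in the 90-day scenario described above.
  int rollMasterKey() throws Exception {
    masterVersions.put(++current, KeyGenerator.getInstance("AES").generateKey());
    return current;
  }

  // After 90 days, dropping the version makes its files unreadable.
  void deleteVersion(int version) {
    masterVersions.remove(version);
  }

  byte[] wrapLocalKey(SecretKey localKey, int version) throws Exception {
    Cipher c = Cipher.getInstance("AESWrap");
    c.init(Cipher.WRAP_MODE, masterVersions.get(version));
    return c.wrap(localKey);
  }

  SecretKey unwrapLocalKey(byte[] wrapped, int version) throws Exception {
    SecretKey master = masterVersions.get(version);
    if (master == null) {
      throw new IllegalStateException("master key version deleted: data is unreadable");
    }
    Cipher c = Cipher.getInstance("AESWrap");
    c.init(Cipher.UNWRAP_MODE, master);
    return (SecretKey) c.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
  }
}
```

A file written on day one stores its local key wrapped with version 1; once deleteVersion(1) runs on day 91, unwrapLocalKey fails and the column data encrypted with that local key can no longer be recovered, without ever rewriting the file.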
Okay, so how did we actually do this? Because the format was already factored to write the separate columns separately from each other, we end up writing two variants of the streams: one for the masked, unencrypted data and one for the unmasked, encrypted data. Of course, we have to encrypt both the data and the statistics, because ORC, as part of its file format, encodes a lot of additional metadata about the streams. For example, for each column we record the minimum and maximum value and the sum, and those are really useful. One of the clients I work with was doing some benchmarking, and they did a count star to get a rough benchmark of how fast they could read things; well, with ORC it just read the footer and came straight back with the answer, so the metadata lets you answer a lot of those kinds of questions. It also lets you do predicate pushdown: for example, if the table is sorted on time, you can say, okay, I need to start reading the file here and only read to there, and read just the set of rows in between. There was a benchmark at Yahoo where they compared Spark versus Hive, using a table that was set up that way, with a sorted key. The predicate pushdown made it so that Hive, even with the overhead of launching tasks and running in parallel, was running faster than Spark, because Spark was still looking through the 100 terabytes (it was all in memory, it wasn't even reading from HDFS), while Hive, because it had the predicate pushdown, was able to read just the set of a thousand rows it needed and nothing else.

So anyway, we have to encrypt both the data and the statistics. One of the other pieces we wanted to do is maintain compatibility for the old readers, so we write the unencrypted data where the current, old readers would look for it, and we write new metadata that tells a new reader where to get the encrypted data. So if you have an older reader, it'll get the unencrypted, masked values; if you have a new reader, what you get depends on whether you have the key or not. And we needed to preserve the ability to seek in the file, precisely because of the predicate pushdown.

Now, streams go through a pipeline: there's run-length encoding, then compression with zlib, and then encryption, so really we're just adding one more step onto the pipeline. Okay, why did we do the encryption last? Exactly, exactly right: if you did the encryption first, the data is no longer compressible, and you'd end up with a huge, huge file, so you need to make sure you do the encryption last. The encryption is AES CTR. It'd be easy to extend to other ciphers, although CTR has a lot of nice properties: it allows seeking and doesn't add any padding to the data.
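The ordering argument is easy to check outside of ORC. The following self-contained Java snippet (plain JDK zlib and AES/CTR, nothing ORC-specific) compresses a highly repetitive buffer, similar to a sorted column, both ways: compress-then-encrypt stays tiny, while encrypt-then-compress stays essentially the full size, because ciphertext looks random to the compressor.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.zip.Deflater;

public class CompressThenEncrypt {
  static byte[] deflate(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    byte[] buffer = new byte[input.length + 64];
    int len = deflater.deflate(buffer);
    deflater.end();
    return Arrays.copyOf(buffer, len);
  }

  static byte[] encrypt(SecretKey key, byte[] input) throws Exception {
    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);     // unique IV per encryption, as the talk stresses
    Cipher aesCtr = Cipher.getInstance("AES/CTR/NoPadding");
    aesCtr.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    return aesCtr.doFinal(input);
  }

  public static void main(String[] args) throws Exception {
    byte[] data = new byte[64 * 1024];    // very repetitive, like a sorted column
    Arrays.fill(data, (byte) '9');
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();

    byte[] compressFirst = encrypt(key, deflate(data));
    byte[] encryptFirst  = deflate(encrypt(key, data));

    System.out.println("original size:         " + data.length);
    System.out.println("compress then encrypt: " + compressFirst.length);
    System.out.println("encrypt then compress: " + encryptFirst.length);
  }
}
```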
Okay, so what have we got? The column encryption provides transparent encryption, it gives you column-level security, and it gives you fine-grained audit logging: you get to find out who read which columns out of which files, at least for the ones you care about. It also gives you static masking. And one of the features of ORC is that you can merge files, so if Hive writes a bunch of little files and then wants to merge them together, there's a function the ORC writer supports that lets you do that quickly, and we made sure that encryption doesn't mess that up.

Now, there are lots of other pieces in the Hadoop ecosystem. We've talked a little bit about Apache Ranger: it gives you a single management interface that lets you control the security of your whole system, and it gives you attribute-based access control, so you can say, okay, for this data Owen can only read it during the day when he's in the U.S., not when he's in Europe, or vice versa. So Ranger will end up managing the encryption policies and controlling access to the decryption keys. Apache Atlas is one of the relatively new ones; it provides metadata-driven governance for enterprises (say that ten times fast), but basically Atlas is a data catalog that lets you find out what data is where. Where Atlas comes in is that Atlas already provides tags, so you can say there's a PII tag on this column of this table, and if someone copies the data and makes a derived table based on it, if they use the PII column then the new table and column will end up with that PII tag. So we're going to make it so that Atlas tags end up controlling the encryption policy through Ranger. Fortunately, Hive and Spark don't need to change; that's one of the nice pieces.

The one downside is LLAP. Okay, do you guys know what LLAP is? Its full name is Live Long and Process, which of course is a joke on Star Trek's "live long and prosper", and our marketing team hated that. With Apache projects it's always the engineers who end up naming things, and the marketing team kind of has to put up with whatever the engineers came up with and thought was a good idea; this was another one of those cases where they were like, oh, what have you done? But it actually works really well. LLAP is a cache and a set of standing servers that execute your Hive queries very, very quickly. They cache the hot data, both the hot columns and the hot rows, so for example in most companies 90% of the accesses are for the last month's data, but sometimes your queries go all the way back. LLAP will figure out, hey, I need to keep this stuff in cache, so those queries run very, very fast; the rest of the data is cold, so we can read it out of HDFS when we need it. The other piece it really helps with is that Java is really slow to start up: for the first second after your JVM starts, it runs really slowly, because it's reading the class files and doing all the just-in-time compilation, and that takes a lot of time. When we started working on LLAP, it was precisely because we were trying to get Hive queries down into the sub-second range, and one of the ways we knew we were successful was when we were able to run queries on tables with a billion rows, actually 6 billion rows, in under a second using Hive. So that's done amazingly well for pushing the interactiveness of Hive down. Unfortunately, LLAP reaches under the hood of ORC, so it's going to need some more changes: in particular, it's going to have to cache both the encrypted and unencrypted variants and remember the difference, and also update the audit logs so that we can tell who's accessed what.

And not everything is perfect. We need the encryption policies at write time; currently Atlas and Ranger tags lag the data, so you put new data in and the tags show up later, and that's not going to work very well, because you'd have to go back and rewrite the data. There is also auto-discovery with Atlas, and that is also going to need to be able to run before you start writing the data into your table.
Ranger currently has dynamic masking policies; with static masking, if you decide to change the masking policy, you're going to need to rewrite the data. We're also going to start with a relatively small set of masks, but allow people to extend it with more. And finally, as we talked about, someone with access can save the decrypted local keys.

Okay, so that's my presentation. Any questions?

Yes. Okay, so the question is how user authorization works in this model. If I go back to my slide, let's see, this guy. Okay, so in this picture, it's really when the task is talking to the key management server that the authorization has to happen. The Ranger KMS is looking at the Ranger policies and deciding, but if you use the Hadoop KMS then you have to do the authorization check there. So that's basically where the authorization happens: you control it through Ranger and decide whether Owen has access to the PII keys or not.

Okay, so the question is about the public cloud, in particular Amazon; I'm repeating it for the video: if they're using a public cloud, how is that going to work? The answer is that Hortonworks, well, now Cloudera, is working on a project called ID Broker, which gets the keys for the Amazon services based on the end user's ID. So you're right: just like the job needs S3 tokens that reflect the permissions the user should have, the same thing holds true for access to the KMS, so you need to get tokens for the KMS that reflect the end user's capabilities rather than a generic user. Hive is already integrating with ID Broker, so they'll just extend into that. Exactly, so there's still work to be done, it's not done yet, but that's the direction it'll go. Absolutely, and Hortonworks and Cloudera are very, very focused on making the cloud a first-class delivery target, so this has got to get done.

Yes. So the question is how you deal with key revocation. Fortunately, because we don't hand out the master key, we can delete the master key, and then no one will be able to decrypt the local keys anymore; basically, the key management server loses the ability to decrypt that version of the key. So before you do that, unless you're deliberately trying to delete the data, yes, you would want to rewrite the data using a new version of the key. Now, some of the pieces will make that easier: if you've got Hive ACID turned on, you'll automatically pick up the new version of the key, because, I guess I should back up, Hive ACID ends up writing delta files and then at some point rewriting the base files with the deltas rolled in, so if you're running the Hive compactor, you'll automatically get the keys rolled to the new version instead of the old one. Without that, yes, you'd need to rewrite the data with the new version before you delete the key on the KMS. But at least deleting the key out of the KMS will in fact kill all the data that was encrypted with that version of the key. Does that make sense?

Yes, double-encrypt things? I probably wouldn't; again, I'd probably just turn on the column encryption for the columns I cared about, unless I had regulatory reasons to encrypt absolutely everything. But you're absolutely right that you could combine the two and end up
double-encrypting. It would still work: yes, the two kinds of encryption would work together, but you would end up double-encrypting, and you would take a performance hit.

Okay, any other questions? The target is to get it in within the next month or so, and then it'll percolate through the other projects as they pick it up. Yeah, it's LLAP that I'm really not looking forward to integrating with; like, damn it, why did those guys go under the hood? Of course, they did it for performance, so we'll deal with it. But that's the goal: it means an ORC upgrade, and then Hive and Spark upgrades. All right, well, thank you very much, and thank you for coming.