All right, time to get started. I'm Tom, the community engineer for CloudZero, and I'm going to be talking about iterative security: secrets management when you're not yet ready for Vault. There are two things I really want to get across today, and they're actually not about secrets management. First, how do I decompose security problems and solve them systematically? Second, how do I solve systems problems given the constraints of time, complexity, and risk, without letting perfect be the enemy of the good? I originally built this presentation from a blog post I had written. What I found when discussing it online was that people were very happy to see a simpler alternative to much harder systems like Vault or, say, Conjur. Both are very good, but those people weren't at that level of sophistication yet. They weren't ready for it, and they were happy just to move a little bit forward. So what I realized was that the value of that blog post, and of this presentation, wasn't actually secrets management. It was my approach to solving the kind of problem so many of us face. So who is this guy up here? You've already heard the dramatic reading of my bio. Again, I'm Tom McLaughlin, community engineer at CloudZero. That's one of my cats, Nutmeg. While we were at the speaker dinner yesterday, I got a text from my girlfriend letting me know that, for some reason, the cat saw the oven open, wanted to go exploring, and nearly jumped in. That is also the one of our two cats we considered the smarter one. CloudZero is an early-stage startup focused on cloud reliability optimization, so our work centers on creating systems that fail less and ultimately recover faster when they do fail. We're also using AWS serverless architecture to build our product, so if you're interested in either of those topics, come talk to me afterwards. I'm a former full-time ops engineer.
I spent most of my time working on automation, and I have a lot of pointed views about Puppet and how to do automation correctly. Today, as a community engineer, my job is to engage with the audience and ultimately learn more from all of you. I've started adding these few slides to my presentations because if you understand how I view problems and what's important to me, you can understand why I've reached the conclusions I have. I don't believe most technical arguments are actually objective, or even technical at all. They're really disagreements over the problem to be solved and over the subjective weighing of the pros and cons of the different proposed solutions. So, yep, there we go. I like startups. I think they're frustrating and a total clown show, but I also think they're really freaking fun. They operate very differently from your average organization, and I've found that works pretty well for me. It also means my problems tend to be a lot smaller in scale than most people's. I don't have Google-, Facebook-, or Netflix-sized problems, and I don't want to be solving those. I'm usually dealing with the first iteration of a solution to a problem at my company. So my engineering is typically geared more toward company survival than absolute correctness. I don't always have time to be perfect, and we're going to come back later down the road and fill in the gaps in everything we're doing now. Again, I just want to make continuous forward progress on anything I'm working on. Engineering is just a title to me. I really just show up to solve problems and contribute some sort of value; it just happens that I solve most problems through engineering. If I need to use different skills or think a little more creatively, I'm going to do that. I love this cartoon up there.
Because if Thanos can't use Mjolnir to hit Thor, he's just going to use Thor to hit Mjolnir repeatedly. So, OK, let's get started here. I have lots of work to do. Who here has lots of work to do? OK, some of you. Wow. I have lots of work to do and limited time and resources to do all of it. What do I need to be doing? I need to keep the site up. I need to make sure it's performing. I need to make sure developers are delivering features. Security is just one of the many things in my wheelhouse, and secrets management systems are complicated to understand. For many of us, this ends up being the reality: if you take a close look at that slide, you'll see a database username and password living in code. You can laugh, but the friend sitting right next to you may well be dealing with this right now. I see a few heads nodding. I have totally been there. I've been in situations where I was looking at code and realized, oh wow, we have passwords here. We have API tokens here. They're literally just sitting in Git, in our code. So we have all this technology, but why does this problem still exist? Why do people still put passwords, tokens, and so on in code when we know we're not supposed to? I think much of the reason is how we present security today. There's no in-between. You're secure or you're not secure, and there's a ton of assumed knowledge between point A and point B. All of this leads to security paralysis. We don't have a clue what to do or how to get from A to B. Given eight other priorities, we shift our focus toward the things we can actually accomplish. We're rewarded at work for accomplishing things, not for trying, struggling, and totally failing.
And so long as disaster doesn't strike, no one's the wiser, so we can sometimes just pretend everything's fine. It's all good. Consider some of the things you're responsible for as ops folks. Access controls: how much access is too much? Password policies: how often should I rotate passwords? Oh yeah, great, NIST actually changed that; their newer recommendation is not to force periodic rotation. Awesome. Patching: how often should I patch, and how fast? Do I just patch every single night and ship it to production, or should I have some sort of patching and testing cadence? With all those questions in mind, this is basically what you're left with: Good luck. We're all counting on you. Keep everything secure, and we're going to blame you if something fails. We've put this in your hands; now it's up to you to figure it out. Awesome. One thing I want to be clear about up here is that I don't speak as a security expert. I'm an ops person who has been tasked with securing my environment over time, and I had to learn how to better secure the environments I've worked in. So what we're going to discuss is what I call iterative security. The approach falls in line very well with how I approach startup engineering. We're going to improve our environment gradually over time. We'll start with the lower-hanging fruit and a simpler solution, and as we develop maturity, we can work toward more sophisticated goals. So, thinking back to that owl, why is security so hard for us? One reason is that we get very distracted by shiny objects in the security world. What are some of them? Zero days. Nobody expects them. You roll into work one day, all set to do something, and the next thing you know, some zero day comes out. Great, there goes my day. This is what I have to focus on now.
Hash collisions and other assorted cryptographic weaknesses: SHAttered, where two different files produce identical SHA-1 hashes. I saw people wondering whether the Linux Git repo was going to be poisoned via this method. The government: the government can hack things, the government can do all sorts of stuff. They stockpile exploits. They treat cyber as its own battlefield. Logos: vulnerabilities that come with their own logo. These are all the cool things we get excited about. We're sitting on Twitter, on Slack, on IRC, and these are the things people are discussing. Awesome, cool. And here are the things we don't really get excited about. Patching. This is my obligatory Equifax slide, because it's topical right now. I'm not looking to name and shame any companies or employees. I just want to point out how such a mundane thing can lead to such a massive failure. And I don't like calling patching simple. It's not. Not every system can be patched, and not every organization makes patching easy. Not leaving MongoDB exposed to the internet with weak credentials. There was a ransomware attack earlier this year, an automated worm. Yep, I see you putting your head down. Publicly exposed MongoDB systems with weak credentials, just sitting out on the internet... Thank you: no credentials. No credentials at all, on the internet. A worm went around taking them over and ransoming them for Bitcoin. From there, we went to not leaving Elasticsearch exposed on the internet with weak credentials. A few weeks later, the same damn thing happened again; the attackers just changed their target. And then, finally, "actually learning from our mistakes" after Elasticsearch: MySQL started being targeted. I decided to proactively warn the Postgres community that they were going to be next. No one was paying attention. All of this happened earlier this year, over a span of three months.
Every time people finally got around to fixing one problem, the attackers just moved on to the next one. So many of us focus on the wrong things. We get too caught up in whether we can protect ourselves from the NSA, and then we look at that little red exclamation point at the top telling us we have updates and go, yeah, we'll get to that eventually. To be clear, some of you do have legitimate reasons to worry about zero days and the logo of the month. But many of us aren't taking care of the basics. We should be spending less time worrying about advanced persistent threats and more time actually securing our databases, our open S3 buckets, and so forth. So, back to the owl. Draw the damn owl. We don't do a very good job of explaining how you get from point A to point B; we leave it up to you. What we should be doing is teaching how to systematically assess risk and improve incrementally. Where do you start? How do you progress? What should your security posture ultimately look like at the end? Are you just an owl head? A side-on owl, or a forward-facing one? What should your posture be? You can find information at the "works on my machine" level, and you can find information from companies with ops teams many times larger than yours who decided to half-reimplement Kerberos and open source it. There's not a lot of in-between information. Most of us are in the middle, and particularly toward the left end, just past "works on my machine", there isn't much information out there. We focus on much more sophisticated solutions, but again, many of us aren't yet capable of that sophistication. We know not to put passwords, API keys, tokens, and so on into code, and it still happens. So we're going to solve this problem like it's an actual problem at work: we're going to develop a threat model for a hypothetical system.
If you pay attention to nothing else in this presentation, pay attention to the next few slides. This is the meat; this is where you'll actually learn something you can apply after you leave. We're going to start by identifying and assessing security threats within our environment. First, be realistic when you create your threat model. Who you are affects the threats you face and who is after you. If you're a large financial institution, you're probably worried about very sophisticated attackers, nation states, all that sort of stuff. If you're a dating app, you should probably be less worried about China and Russia and more worried about employees, current or former, automated attacks, or even what your API is exposing about its users to other people. Be realistic about damage potential. If every single issue means your company is going to go out of business, you're not going to get anything done, because everything is priority one. Maybe you prioritize risks that carry financial penalties over things that are just kind of embarrassing. So ask yourself: how well can you defend yourself against the person dropping USB sticks in your parking lot versus the man in the ceiling? Then assess and rank damage appropriately. Companies survive being breached. This is a fact of life. It's not a license to do nothing, but it means you can work at continuously improving; you don't have to start at completely secure. Originally I was going to put up logos of companies that have been breached, but I'm really against naming and shaming, so I took them all off. I'll leave it as a thought exercise: name a bunch of companies you know have had a data breach, then ask yourself whether you still shop there or still use their services. I'll leave that up to you.
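To make "assess and rank" concrete, here is a minimal Python sketch using the simple model this talk applies when rating threats: risk is probability times damage potential. The threat names echo examples from this section, but every probability and damage number is a made-up placeholder, and the 1-to-5 damage scale (embarrassing at the low end, financial penalties at the high end) is an assumption; plug in your own estimates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Threat:
    name: str
    probability: float  # placeholder estimate, 0.0 to 1.0
    damage: int         # assumed scale: 1 = embarrassing ... 5 = financial penalties

    @property
    def risk(self) -> float:
        # Simple model: risk = probability x damage potential
        return self.probability * self.damage


def rank(threats: List[Threat]) -> List[Threat]:
    """Return threats ordered from highest to lowest risk."""
    return sorted(threats, key=lambda t: t.risk, reverse=True)


# Hypothetical numbers, for illustration only.
EXAMPLE_THREATS = [
    Threat("USB sticks dropped in the parking lot", 0.3, 3),
    Threat("Nation-state operative in the ceiling", 0.01, 5),
    Threat("Former employee reusing old credentials", 0.2, 5),
    Threat("Defaced marketing page", 0.1, 1),
]

if __name__ == "__main__":
    for threat in rank(EXAMPLE_THREATS):
        print(f"{threat.risk:.2f}  {threat.name}")
```

With these placeholder numbers, the former employee lands at the top and the man in the ceiling at the bottom, which is the kind of realism this section argues for.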
So our first step is identifying what we actually have to protect in our organization. We have intellectual property. We have customer data, the data about our customers. And in some cases we have our customers' data, the data we collect from our customers. That's valuable to an attacker, even if all they want is to attack that customer. Next, figure out what you have to work with. Do you actually know what your environment looks like? (OK, good, the animation is playing.) Do you have a way of finding everything in your environment? Do you? All it takes is that one lone system that has faded out of institutional memory to cause a really bad day for you. I also really like this graphic, because if you watch the balls move around, you'll see all the complicated maneuvers they go through, and at the end the ball just explodes or disintegrates. It went through all that work to produce literally no value at all. It's a metaphor for some organizations. Once you have an idea of what you're working with, it's time to start decomposing your system: map out system boundaries and data flows. What I have here is a nice hypothetical system. (I don't want to fall off the stage here.) We have customers; they run agents, and those agents send data. It goes through an ELB to our ingestion service. The ingestion service takes these events and drops them off in RabbitMQ. Consumers pick them up and write them to RDS. Coming all the way over to this side, you have your users. Again, they go through an ELB, this time to our front end host. They want to know about the data we've been collecting from them, so the front end queries a back end service, which in turn queries RDS, the same database we've been writing to. Next, we're going to look at the perimeters of this system. First we have our agents, sitting at customer sites.
They connect to our system. We generate an API key when these hosts are registered, so that's their form of authentication. This link right here is encrypted. There's no direct access to our ingestion host; we've protected it through our network configuration with VPCs and security groups. Going over to this side, it's a similar situation. Users log in with a username and password over an encrypted connection. We go through a load balancer, and again there's no direct access to our front end app. And hey, we're even doing some user input sanitization to prevent things like SQL injection attacks. We're making sure that when someone tries to access their data, they're not pulling a Little Bobby Tables or something like that. Now we come to another point. We ingest data from clients, send it to RabbitMQ, and a consumer picks it up and puts it into RDS. RabbitMQ requires some form of authentication. How are we managing those passwords? Not a clue. Maybe they're living in code. Here's a problem we've actually identified in our system: getting access to RabbitMQ lets an attacker look at our customer data in real time. Moving further down the pipeline, we have RDS, where we write our event data, and a back end service that fetches that event data. RDS passwords: again, we're not sure how we're managing them. And this is even more valuable. This is the crown jewels: our customers' data over time. Get access to this, and you can look through everything our customers have been doing. So we've identified a few threats, some of which we feel are addressed and some of which aren't. At the network layer, exposed network ports: we actually feel pretty good that our VPC and security group configuration handles that.
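The perimeter walk above can be written down as data, which makes the unanswered question ("how are we managing these passwords?") easy to spot. A minimal Python sketch: the `Flow` record and its `credentials_addressed` flag are hypothetical names, the hops mirror the hypothetical system in this talk, and the flag simply encodes whether the talk considers that hop's authentication handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flow:
    """One hop in the data-flow diagram (hypothetical model)."""
    source: str
    dest: str
    auth: str                    # how the caller authenticates to dest
    credentials_addressed: bool  # does the threat model say this hop is handled?


# Hops from the hypothetical system described in this talk.
FLOWS: List[Flow] = [
    Flow("customer agent", "ingestion service (via ELB)", "API key issued at host registration", True),
    Flow("user", "front end (via ELB)", "username and password", True),
    Flow("ingestion service", "RabbitMQ", "RabbitMQ password", False),
    Flow("consumer", "RabbitMQ", "RabbitMQ password", False),
    Flow("consumer", "RDS", "database password", False),
    Flow("back end service", "RDS", "database password", False),
]


def unmanaged_secrets(flows: List[Flow]) -> List[Flow]:
    """Return the hops whose credentials have no known management story."""
    return [flow for flow in flows if not flow.credentials_addressed]


if __name__ == "__main__":
    for flow in unmanaged_secrets(FLOWS):
        print(f"{flow.source} -> {flow.dest}: {flow.auth}")
```

Every flagged hop ends at either RabbitMQ or RDS, the same two points the threat write-up for this system calls out.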
Patching EC2 instances, the host layer: we're not actually sure how we're taking care of that, but that's a presentation for a different day. We've identified weak secrets management as an issue. And user-submitted data: we think we're actually handling that pretty well. Now it's time to document our threat. What's our problem? We have weak password management within our environment. At two points in our infrastructure, we're not managing passwords. Both points hold highly valuable assets, and a breach would be pretty bad for us. We'd suffer reputation loss, which leads to customer loss, and we hold data that could be leveraged against our own customers. Now it's time to rate this threat. Risk is probability times damage potential; that's the simplest way to think about the risk of a threat. Again, be realistic here. We may think sophisticated attackers are after us, but for most of us it's probably current or former employees, automated attacks, some kid in Eastern Europe, and so on. To make rating easier and more precise, and to stop these subjective arguments, there are rating frameworks. This particular one is called DREAD, and it makes it easier for us to agree on what the real risks in our environment are. DREAD stands for damage potential, reproducibility, exploitability, affected users, and discoverability. We can rank each of these high, medium, or low, generate a score from those rankings, and then stack rank the scores against each other to figure out what needs to be taken care of first. Here's how I rated our secrets management risk. Damage potential: this is pretty damaging, definitely high. If somebody gets access to this, they can use it against us: we lose customers, and they can use it against our customers.
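Scoring the secrets management rating described in this part of the talk can be done mechanically. A minimal Python sketch, assuming the common convention of mapping high/medium/low to 3/2/1 and summing the five factors (the talk only says each factor is ranked high, medium, or low). Damage potential and affected users are high and exploitability and discoverability are medium, as stated in the talk; the talk doesn't name a level for reproducibility, so the sketch assumes high, since stolen credentials let an attacker come and go as they please.

```python
from enum import IntEnum
from typing import Dict, List, Mapping, Tuple


class Level(IntEnum):
    # Assumed numeric mapping; the talk only says "high, medium, and low".
    LOW = 1
    MEDIUM = 2
    HIGH = 3


FACTORS = ("damage", "reproducibility", "exploitability", "affected_users", "discoverability")


def dread_score(ratings: Mapping[str, Level]) -> int:
    """Sum the five DREAD factors: 5 (all low) up to 15 (all high)."""
    missing = set(FACTORS) - set(ratings)
    if missing:
        raise ValueError(f"missing DREAD factors: {sorted(missing)}")
    return sum(int(ratings[factor]) for factor in FACTORS)


def stack_rank(threats: Dict[str, Mapping[str, Level]]) -> List[Tuple[str, int]]:
    """Order named threats from highest to lowest DREAD score."""
    scored = [(name, dread_score(ratings)) for name, ratings in threats.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


SECRETS_MANAGEMENT = {
    "damage": Level.HIGH,            # customer loss; data usable against customers
    "reproducibility": Level.HIGH,   # assumption: stolen credentials work every time
    "exploitability": Level.MEDIUM,  # needs existing access and lateral movement
    "affected_users": Level.HIGH,    # the event database covers all users
    "discoverability": Level.MEDIUM, # a few hops away from an outside user
}

if __name__ == "__main__":
    print(stack_rank({"weak secrets management": SECRETS_MANAGEMENT}))
```

Under these assumptions the threat scores 13 out of a possible 15; adding your other threats to the dictionary passed to `stack_rank` gives the ordering the talk describes.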
Reproducibility: if you have our usernames and passwords, you can come and go as you please. You're here today, you're here again tomorrow, you're here again next week. Exploitability: it's somewhat easy to exploit. If you can get access to the code, you can figure out what to do. But it requires some existing access; you're probably already in the network somewhere and are now just moving laterally. So I ranked that medium, figuring it's not an outside attacker: it requires existing access, so it's not immediately easy to exploit. Affected users: again, this is our database of all our events, so it affects all our users. High. Discoverability: again, it's a few hops away from an outside user, so it's not exactly easy to find. For that reason, I ranked that medium. So finally, we're going to put this into action and clear up our secrets management issue. First, constraints. We're going to impose some constraints on the solutions we deliver. Time: how long will it take us to implement the solution? Complexity: how hard is it to do, and how well do we understand what we're implementing? Operational risk: what happens when it fails? These are normal constraints we should be weighing with every technical decision we make. A few of these slides come from discussions I had around the original blog post. Those discussions are what made me realize how important it is to cover what to do when you can't be perfect, and ultimately led me to create this presentation. Time is something we all wish we had more of. Realistically, most of our work has to be time-boxed; we don't have forever to work on a solution. In that conversation, someone was asked: why don't you take your time and just implement the best solution?
Why create technical debt? And the person was just straight-up honest: I have nine other priorities. This is literally not the most important thing for me to do right now. I just want to take a small step toward making things suck less. Recognizing that we have multiple priorities is something we have to do with every decision we make. Complexity: the devil is in the details. Someone said, why not use Vault? It looks simple; I just deploy it. OK, well, how are you going to handle the master key shards? Where are they stored? Who has access to them? What are your storage backends going to be? How will you do authentication? These are questions you actually need to answer. You can't just say, I deployed it, awesome. And then there's risk of failure. This was my favorite: "It's all code. We monitor it using Nagios." Boom, done, nailed it. I'll just let that comment speak for itself. I won't say I haven't done that before, but it's not something you should be doing if you can avoid it. So here are the constraints we've established. Time: I want to get something done in a few days to a few weeks. The faster we can get this done, the more likely we are to actually finish the work. Complexity: we're going to go with what we know, so there are fewer surprises, less to learn, and less to get wrong. Risk: we're going to take only as much risk as we're ready for. We're moving fast, and personally I want to limit the blast radius when something goes wrong. And so, finally: secrets management approaches to get you started when you're not ready for Vault. First, git-crypt. Really super simple: encrypt secrets directly in your repository. Go through your code, audit it, find your secrets, rotate them, and store them encrypted. Pro: you've actually audited your code, which already puts you a step ahead of many people.
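The "go through your code, audit it, find your secrets" step from the git-crypt approach can start very small. A minimal Python sketch of a regex-based scan of a working tree; the three patterns are illustrative assumptions rather than a complete rule set, and the scan only looks at the current files, not Git history, which is one more reason to rotate anything it finds.

```python
import re
import sys
from pathlib import Path
from typing import List, Tuple

# Illustrative patterns only (assumptions); a real audit needs many more rules.
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hard-coded password": re.compile(r"""(?i)(?:password|passwd|pwd)\w*\s*[=:]\s*['"][^'"]+['"]"""),
    "private key block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}

Finding = Tuple[str, int, str]  # (file, line number, pattern label)


def scan_text(text: str, source: str = "<string>") -> List[Finding]:
    """Report every line of `text` that matches one of PATTERNS."""
    findings: List[Finding] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((source, lineno, label))
    return findings


def scan_tree(root: str) -> List[Finding]:
    """Scan every readable text file under `root`, skipping .git directories."""
    findings: List[Finding] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        findings.extend(scan_text(text, str(path)))
    return findings


if __name__ == "__main__":
    for source, lineno, label in scan_tree(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(f"{source}:{lineno}: {label}")
```

Note that a line reading a password from the environment (`password = os.environ["DB_PASSWORD"]`) does not match the password pattern, while `DB_PASSWORD = "hunter2"` does. Anything the scan turns up should be treated as already leaked and rotated before it is stored encrypted.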
Cons: it's symmetric encryption, so you have to worry about the master key, master key proliferation, and so on. Your to-do is pretty much to prevent key proliferation, and eventually to throw this out. Again, the idea is to build something that satisfies your immediate needs and throw it away later. You were at the worst point possible; at least now you've started improving. Next, configuration management systems. Let's start with Puppet. You have Hiera eYAML: encrypt values directly in your Hiera hierarchy. You can use public key encryption, and there are multiple backends. I've personally used it, and I loved it. Pros: it's centralized. You've at least coalesced everything into one repo, and the public key encryption support is useful. Cons: it may require manual intervention when rolling Puppet masters; I ran into that issue. You may also need to clean up your Puppet code if you haven't already moved to Hiera. And now you're thinking, what? Puppet 5 is already out, and they've been telling us to use Hiera for years. But there are probably several people in here still using Puppet 2.7, with a Puppet installation from about six years ago that you know is still in existence and in use. Your to-do: one item I had written down the last time I did this was figuring out a master re-keying strategy in case we had to roll everything quickly. Ansible Vault encrypts entire var files in your playbooks. This ends up being a lot like git-crypt. You've done the exercise of at least auditing your code, but you still have to deal with symmetric encryption: everyone needs the shared password, so you have key proliferation issues. Your to-do is really preventing the proliferation of the vault key and figuring out what to do about re-keying and rolling secrets. Then there are S3 buckets. I found this when I was researching: Sneaker encrypts, stores, and retrieves secrets from S3. Awesome. Pros: secrets are no longer living in our repos, at least. We've gotten them out of GitHub.
We've reduced our secret proliferation. Cons: how are you managing your S3 buckets? Before you really adopted CloudFormation and Terraform, was S3 just kind of the Wild West, with people creating buckets all over the place? Yep, you're nodding your head. It's true. So you're probably going to have to adopt some sort of management strategy for your S3 buckets. You need to know who actually has access to these buckets and make sure only the right entities do. So this might be something that sounds great, but you actually have to handle configuration management of your AWS resources before you can even do it. OK, so what should we have gotten out of all this, me up here ranting and waving my arms around? What? Logos, yes. Logos and zero days; just forget everything else. Less focus on shiny objects: that's a really sweet car, and I'm sure it's fast, but it's also on fire. And more of this: we can manage this. Let's systematically put out the garbage fires in our environment. All right, nothing I've talked about here is particularly earth-shattering, and none of the technology is. What's important is this: develop a plan and make incremental forward progress. And finally, thank you. If you liked this, I have a feedback form at strayc.at/feedback, that's S-T-R-A-Y-C dot A-T slash feedback. Let me know what you thought of the presentation and of me, and give me suggestions. Awesome. Thank you.