Welcome, CNCF community, and thank you for the opportunity to share how to verify your cloud security configuration with Paladin Cloud. My name is Steve Hall. I'm the co-founder and CTO of Paladin Cloud. You can reach me at the LinkedIn address down here in the lower left-hand corner. You can check out our GitHub repo. We are a 100% open source solution, Paladin Cloud CE. If you check it out and you like what you see, give us a star. We do have a Slack community, paladincloudcommunity.slack.com. Visit us there. And of course you can check out our website at paladincloud.io. So before we get into how we actually verify cloud security configuration, let's talk about some of the factors that drive cloud security challenges. I've put these in three buckets. The first one is complexity. It's interesting to use the word complexity, at least in my mind, because there's a dichotomy here. The cloud is so easy to use. You can spin up resources. You can do all kinds of great things in the cloud. But it's also highly distributed, and it doesn't function in a model that is classically understood by all developers, so we do have a learning curve to get over. Something we built even five to ten years ago as one monolithic application might now be 15 services across 100 or 150 instances of those services to get something done. Each one of those things by itself becomes a security challenge, because every single one of them can be launched very quickly and very easily on all three big cloud providers. But at the same time, you may want to enforce certain rules, like HTTPS over HTTP, or certain things not being open to the public, even though in some cases it's really easy to do it that way by default. So there's this ease of doing things that creates the complexity.
But it's also the distributed nature and the sheer volume of things you have to keep track of that become part of the challenge for your average DevOps team. Another thing is just visibility. I ran a rather large group at T-Mobile, and when you start having tens of thousands or hundreds of thousands of assets (and we've even talked to people in the past six months who have well over a million assets in the cloud), how do you keep an eye on all of those when they're spread across maybe a hundred accounts, or even more depending on the nature of the customer? So just seeing what's happening in the cloud becomes a challenge unto itself. And then there's the age-old balance of velocity. What's more important: getting the latest business value out the door, or getting your security right, or, let's back off from security, getting your NFRs, your non-functional requirements, correct? And this unto itself creates a problem, because as we move faster in the cloud, as the promise of automation starts delivering on the time-to-value equation, guess what? Our business leaders want even more from us as technologists. And when they want more, they want more, right? So things start getting compressed and you start getting dates that are somewhat difficult to achieve. And the first thing that goes out the door is your NFRs, and probably one of the first things in the NFRs to go out the door is security, and a close second, or a tie for first, would be your HADR stuff, for instance. So there are some challenges for these developers, for sure. So how do we help the developers? And I want to emphasize who "we" is here. We is all the developers, all the people in the technology world, the technology leaders, because we do have this fundamental problem, as I talked about, of pure velocity. We all want to build stuff faster, and it's really cool when we achieve that. But we want to do it redundantly. We want to do it securely. We want to do it safely.
So how do I believe we can help? As an engineering leader, what I try to do is, number one, set priorities. And I make it very, very clear. I have the 80-20 rule, where I always start every delivery team at 80% on business value and 20% on NFRs. And then you adjust from there as you start trying to figure out: do we have a weakness in security? Is our HADR strategy wrong? Are we actually getting the right visibility, purely from an operational standpoint? Even logging, which depending on who you talk to is or isn't an NFR: are we logging properly? All these things need to be done. So maybe it's 70-30, maybe it's 60-40, maybe it's 90-10, as you mature with your development organization. The other thing is the learning curve. I think the cloud is 15 or 16 years old now. AWS came out about 15 years ago, I think. And it's got fantastic value. Me personally, I started using it immediately because I saw the value of, hey, AWS has all these Lego bricks. I don't have to go build these things. I can just start using them as quickly as possible. But there's a learning curve there, and in getting massive, massive developer communities moving toward those things. And I call them communities because even in a large enterprise, you may have thousands of developers, and maybe even tens of thousands if you include developers, SREs, DevOps, the testers, and the security partners. You're going to have some very, very large organizations there. And they're going to vary in skill, from the new folks, who you can't expect to know everything about the cloud on day one, to the people who've been building and operating in the cloud for years, who know a lot more. But the teeter-totter is definitely leaning more toward the new people than the people who actually do this well and are seasoned in the cloud. At least that's my opinion. The next thing is really collaboration.
We all know that in the last decade or so there's been this massive shift toward automation. There's been a massive shift toward agile and away from waterfall. And there's been a progression from, you know, here's the developers, here's the ops, let's create DevOps, let's bring them together. And that's reasonably good. I've seen teams that literally just have a little shadow dev team or ops team sitting over there. And then even when you get into SRE, they're somewhat operational, but they're way more than that. So don't get mad at me, SRE folks out there. The value is huge there. But what hasn't happened is this dream of DevSecOps coming together. And I drew the Venn diagram on this slide the way I did because I believe we're just starting that. I don't talk to a lot of CIOs or CTOs or even technology leaders who truly believe security and DevOps have come together in a meaningful way where they're solving problems together. They still exist as two different organizations with a little bit of overlap, but still a lot of silos. So there's a problem there. It's a cultural problem, and it's going to take a while to work out. So my conclusion when I look at this is, hey, we need better tools. The cloud is changing everything. In the old days, and let's forget about writing software, when you wanted to do something in the data center, you would request a server or a rack or something. You would plan your capacity, and then you'd request it, and an organization would get together and they'd either go procure all the equipment you needed or go configure it for you. Then they'd write tickets to go, maybe, to the network group, which told you, you know, is it internal? Is it in this zone or that zone? And where is it out there in the world? Right.
And that would take sometimes weeks, sometimes months. Today you just go on and you launch an EC2 instance, put it in a subnet in a VPC, and expose it to whatever you want. And you have full control of it. But that takes down some of the natural barriers of the slower on-prem process, the safeguards that were put in place from a security standpoint. And our security tooling hasn't quite caught up with all that yet. But there are fantastic tools out there and they're evolving. Almost every year, it's just amazing to watch all the new security tools that are coming out the door. The other thing we see here is broad network access. As I was mentioning, somebody else set up your network in the old days. Even though I don't consider myself a really good network guy, I can go in and set up any kind of network. It may work or it may not work, but I can go configure that all myself in AWS, Azure or GCP, even though I'm not an expert in it, and even though, to be honest with you, I'm not entirely sure what some of this stuff is. And that's a gap in my learning. So even leaders have gaps in what some of this stuff is. And then finally, getting back to my data center example of planning your capacity and everything: we don't do that anymore. What we do is we go in there and we launch a few EC2 instances, or we decide we're going to use Lambda and we have these really good short-lived functions, or we put in a K8s stack and we don't worry about compute at all. We just throw everything into containers and go. But the rapid elasticity of the cloud is not something that regular tools can handle. And when I say regular tools, I mean tools from years ago. It's more dynamic now, and so tools have to be able to handle that dynamic nature. So I was thinking about this, and I'm sorry, I had to put the fear slide in. But I think we have a problem. And the numbers are staggering.
I mean, 95% of cloud security breaches through 2022 will come from customer errors fueled by misconfigurations, myths, and misunderstandings. That's a Gartner statement. If you look at ThreatStack, they believe that 73% of companies have at least one critical misconfiguration right now, something like an SSH port open to the internet, or an RDP port open to the internet, or maybe sensitive data in an S3 bucket open to the internet. And by critical, they mean something that is imminently breachable, which gets to the second statement in here, which is that an attacker can typically detect a cloud configuration vulnerability within 10 minutes of deployment. Think about that. Within 10 minutes they can figure out you have done something wrong, right? And you couple that with the fact that Gartner thinks 95% of security breaches will come from misconfiguration in 2022, and ThreatStack thinks 73% of companies have that one critical misconfiguration right now. And let's be clear what these security incidents look like. All the attackers are looking for is a crack in the armor so that they can get inside and start moving laterally. With the biggest and worst breaches, when you read about them, at least from the public information, it isn't always that you break straight into the server. What happens is you break in somewhere, and then you move laterally for a while and you poke around and you're undetected, and then all of a sudden you find the crown jewels and you go after them. And that's kind of where you start getting caught, because I think some of these hackers get greedy and try to do it too fast. But the forensics of this stuff are very interesting. And I'm going to pick on the security domain here a little bit. A lot of breaches are privileged and they're quiet and we don't talk about them, and we don't have a feedback loop to help us all learn from them. And there are a lot of reasons for that.
Obviously, companies could get lawsuits, and you certainly don't want to advertise how somebody got into your servers. But the ones that are reportable are reportable because they exposed sensitive information. And then you start getting a little more information about how it happened. And that's where I think a lot of these stats have come from. So Paladin Cloud: we are in the security posture management space, and the project that we have in open source right now is based on T-Mobile's original PacBot project. I was the executive sponsor of that at T-Mobile. And one of the major goals that we wanted to achieve from the very beginning was what we called actionable intelligence. We've all heard that developers have alert fatigue: there's too much information being thrown at them from too many different tools, that sort of thing. We didn't want to be yet another tool that just threw stuff at them. We at least wanted to attempt to get the violations prioritized, to get the users focused on exactly the most critical ones first, then work on the highs and go down from there. So the key to me in a lot of these tools as we move forward is to get into what I call actionable intelligence. Okay, so I am now going to switch screens and jump into a product demo and start walking through some of the capabilities of Paladin Cloud. I want to remind everybody this is an open source project. The first slide that I presented has the URL for the project on GitHub; just look us up, Paladin Cloud CE. And if it's interesting to you, please get involved. We're building the community and we're just getting this new company off the ground. But I truly believe we will build a better long-term tool with the help of the CNCF community and the open source community at large. So let's jump into our demo right here.
So I'm already logged into Paladin Cloud and I'm going to walk through just some major features here just to give everybody a feel for what's happening here. Reach out to me on LinkedIn anytime if you have further questions. This is an asynchronous thing. This will be broadcast, I think in the mid-October timeframe. So you can see up here in the right, upper right we have an admin role. I'm logged in as an admin right now. And we have two basic roles in the product. We have the user role and the admin role. And the admin role, I'm not going to spend too much time on this, but it does give you access to some admin functions that allow you to configure the software and change the policy and rules, the criticality of the policy and rules. You're also allowed to grant and revoke exceptions within this administration role. When you're in the user role, you're just a read-only consumer of all this information and you can request. You can request your exceptions, but you cannot grant yourself an exception. So as you can see here, the overall UI, we've got our menu along the left here. We have this thing up here called an asset group. And an asset group is interesting. And what we do here is we use this to slice and dice our cloud. And so you can see we connect to AWS and this is my Dev environment right here. So you can see there. We use it to slice and dice the cloud. And we also use this, we're doing in the Dev environment test. Sorry, I forgot to mention that. So we're on just 100% test data is what we're on. So some of these numbers will look interesting. You'll say, wow, that's really high for such a small installation. But remember, we're testing both the positive and negative cases of our policy 24-7. So we have 1,000 assets in AWS. We're over here looking at Azure. We've got 2,800 in Azure. And we have about 40 in GCP. An asset group is also, these three come right out of the box with the product. 
You, as a user of this thing, can then create your own asset groups. Santos has created an all cloud group for us. Basically what he's doing is saying, show me everything from AWS, Azure and GCP in one view so I don't have to click between them to see what's going on. Now that is a very broad, very large asset group. You can also go the other direction, depending on what your mandatory tags are. We have somebody using this thing in open source where they have business units, and I think there are seven of them. So they do an asset group where the business unit equals unit one, unit two, unit three, and so on up to unit seven, so they can see everything associated with that business unit. And these things are multi-cloud, so each one does cross AWS, Azure and GCP. So they get this one aggregate view of what's happening in that business unit, which is just a fantastic way of doing asset groups. But we also collect all the metadata associated with a resource, and any of that metadata can become a filtering criterion for creating these asset groups. So I'm going to go here into all cloud, just because it's bigger and it shows everything and I don't have to switch between accounts. You can see up here, like I said, actionable intelligence. What we're trying to show here is your critical violations. You should be working on these first, okay? Your highs next, mediums next, then lows. What's interesting about this is we have almost a hundred percent coverage of the CIS benchmark standards for Azure, AWS and GCP. So they're all in here, and they give guidance on what is high and critical. That's type one, and medium and low I think is type two, and I might have that backwards, so don't call me on that. But they have two buckets. We break the two buckets into four buckets. That's basically what we're doing here for severity by category.
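To make the asset group idea concrete, here is a minimal sketch in Python of filtering a multi-cloud inventory by cloud or tag. It is not Paladin Cloud's implementation: the `Asset` type, the `asset_group` function, the `business_unit` tag name, and the sample resource IDs are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical inventory record; not Paladin Cloud's data model."""
    resource_id: str
    cloud: str                                  # "aws", "azure", or "gcp"
    tags: dict = field(default_factory=dict)


def asset_group(assets, **criteria):
    """Return the assets whose cloud or tags match every given criterion."""
    def matches(asset):
        for key, wanted in criteria.items():
            # "cloud" is a first-class field; everything else is looked up in tags.
            actual = asset.cloud if key == "cloud" else asset.tags.get(key)
            if actual != wanted:
                return False
        return True

    return [a for a in assets if matches(a)]


# Made-up sample inventory spanning three clouds.
inventory = [
    Asset("i-0abc", "aws", {"business_unit": "unit1"}),
    Asset("vm-01", "azure", {"business_unit": "unit1"}),
    Asset("disk-9", "gcp", {"business_unit": "unit2"}),
]

# One aggregate, cross-cloud view of a single business unit.
unit1 = asset_group(inventory, business_unit="unit1")
print([a.resource_id for a in unit1])  # ['i-0abc', 'vm-01']
```

Calling `asset_group(inventory)` with no criteria returns everything, which corresponds to the broad "all cloud" view, while adding criteria narrows the group.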
If you go in here, again, how do I immediately take action? I can dig into my policy violations, and you can see we've got some SSH ports open to the public and we have some RDP ports open to the public. We have some allow list problems in here. As we go through this list, there's 134 of them. We have some ICMP open to the public here. And again, the team is going through testing right now, so we have a lot of ports going on here. We even have rules related to encryption. Is your EBS volume encrypted, yes or no? That's not what this policy is. This policy is checking whether you are using your customer managed key or not. So one could argue whether this is critical or not, because, number one, it obviously passed the encryption test, and number two, how bad is it really that AWS is managing your keys? So you may choose that this is not a critical and it's a high, and you can go configure that over here in our rules administrator. You can see what each one of these policies is. We give a brief description and some added metadata here. We also fully document them on our open source wiki. You can see over here that for every single policy and rule that fails, we create an issue ID, and the issue ID is either open or closed at the end of the day. So if we go look at this issue ID, you can see that we've got an Oracle port open. You can see a description of that. If I clicked on this, it would simply go to the view I just showed you. I know this is on Azure just because of the resource ID. They have very long strings that give you everything from the subscription to the endpoint that you're looking for. The issue was found on 7/20. They've had this one open a while, and it was last inspected today. And this is where the roles I was talking about come in. I'm an admin. I can grant an exception here.
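As a rough illustration of what a "port open to the public" policy checks, here is a minimal sketch in Python. It is an assumption-laden example, not Paladin Cloud's rule engine: the `IngressRule` type, the `public_port_violations` function, and the port table (22 for SSH, 3389 for RDP, 1521 as the usual Oracle listener port) are chosen for illustration only.

```python
from dataclasses import dataclass

# Assumed table of ports that should never face the internet.
SENSITIVE_PORTS = {22: "SSH", 3389: "RDP", 1521: "Oracle"}
# CIDR blocks that mean "anyone", for IPv4 and IPv6.
PUBLIC_CIDRS = {"0.0.0.0/0", "::/0"}


@dataclass
class IngressRule:
    """Hypothetical, simplified firewall rule; real cloud rules carry more fields."""
    cidr: str
    from_port: int
    to_port: int


def public_port_violations(rules):
    """Return the sorted names of sensitive services reachable from the internet."""
    found = set()
    for rule in rules:
        if rule.cidr not in PUBLIC_CIDRS:
            continue  # restricted source range, not a public exposure
        for port, name in SENSITIVE_PORTS.items():
            # A wide port range can expose a sensitive port without naming it.
            if rule.from_port <= port <= rule.to_port:
                found.add(name)
    return sorted(found)


rules = [
    IngressRule("0.0.0.0/0", 22, 22),        # SSH open to the world
    IngressRule("10.0.0.0/8", 3389, 3389),   # RDP, internal range only
    IngressRule("0.0.0.0/0", 1000, 2000),    # wide range that covers 1521
]
print(public_port_violations(rules))  # ['Oracle', 'SSH']
```

The range check matters because a rule that opens ports 1000 through 2000 exposes the Oracle listener just as much as a rule that names 1521 directly.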
A user would have the ability to request an exception at this point. And of course, you can email this issue around to your friends, because you may look at this and say, hey, we've got to get this fixed, but you want somebody else to go fix it. So you would email it to them, or just simply give them this URL and say, go fix it. We also use the resource IDs directly from the cloud provider, so you can look that up at any time. Here's the resource ID up here. This also gives you a great view: there are 12 policies that apply to this resource, and you can see that a majority of them are failing, which is clearly not good. So let's go to a better case, just to show you some of the other capabilities. Let's go to an EC2. EC2 itself has 20 policies. You can see that it's in a running state. You can see that four things are failing and the other 16 are passing. You can see its IP address and whether it's public or private or both. You can see its subnet, its AMI, what type of machine it is. So again, this is the metadata that I talked about that we're collecting for every single resource. And of course, that changes per resource. You can see a little policy violation donut graph there to see what's going on. And then of course you can see related assets. Here's your security group, if you care about it. Here's your EBS volume, in case you care about it. So there's lots of good information there to make sure you can get to the root cause of what you want to remediate right away. And we should talk a little bit about remediation. There are effectively three ways to remediate the findings in here. One is you look at this, and of course you see an SSH port is open, and you immediately go find the security group and you close that security group up. Get rid of that configuration. That's one way to do it. And then of course, if you really believe in automation and shift left, you should do some triage investigation of how that got open.
Was it opened during runtime, or was it actually configured that way when it was pushed out as part of your pipeline? You'll want to do that investigation. So that first way is really just a manual effort of remediating these things. Another way to do it is what we call a one-click fix. The one-click fix basically exposes this thing to you and gives you the option to click a button, and we'll go fix it for you. Now of course that has ramifications, and it requires elevated privileges, because you're changing a configuration, and by default Paladin Cloud only has read-only access to metadata about the resources. The third one, and I'll cover it and then we'll talk about how we do it, is what we call an auto fix, in which case you trust the software to go close that port for you. And you can put a workflow around it if you like. For instance, we typically do a 72-hour email notification in the open source world, and we say, hey, you've got this problem, you fix it. If you don't fix it, we're fixing it in 72 hours. Then we say, hey, if you don't fix it, we're fixing it in 48 hours, then 24 hours, and then we just go fix it. So let me see if we actually have any of those in our health allocation. We do. We've got to work on this name a little bit, but this is a security group auto fix, and the purpose of it is to delete unused security groups. And so the idea here is, again, here's your email notification one, two, and three. And then finally, we just went and fixed it, and clearly our developers are testing this. But there's an interesting dynamic here with auto fixes and even one-click fixes. It boils down to trust, and it boils down to forcing a conversation between security and operational uptime. So for a security group, because it's not being used, the side effect of deleting it is not all that bad.
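The 72, 48, and 24 hour notification workflow described above can be sketched as a small decision function. This is a minimal Python sketch under stated assumptions, not Paladin Cloud's actual auto fix code: the `next_action` function, its return strings, and the idea of tracking a `reminders_sent` counter are all hypothetical.

```python
from datetime import datetime, timedelta

# Assumed grace period before the fix is applied automatically.
GRACE = timedelta(hours=72)
# Send a reminder when this much time (or less) remains.
REMINDER_THRESHOLDS = [timedelta(hours=72), timedelta(hours=48), timedelta(hours=24)]


def next_action(detected_at, now, reminders_sent):
    """Decide what an auto fix workflow should do for one open violation."""
    remaining = GRACE - (now - detected_at)
    if remaining <= timedelta(0):
        return "autofix"  # grace period is over, apply the fix
    # Count how many reminder thresholds have been reached so far.
    due = sum(1 for threshold in REMINDER_THRESHOLDS if remaining <= threshold)
    if due > reminders_sent:
        return f"send_reminder_{due}"
    return "wait"


detected = datetime(2024, 1, 1, 0, 0)  # made-up detection time
print(next_action(detected, detected, 0))                        # send_reminder_1
print(next_action(detected, detected + timedelta(hours=30), 1))  # send_reminder_2
print(next_action(detected, detected + timedelta(hours=72), 3))  # autofix
```

Keeping the decision separate from the side effects (sending email, changing a security group) makes the schedule easy to test, and it gives an obvious place to swap the final "autofix" step for a one-click fix while a team is still building trust.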
You might delete a security group and that might offend a developer who was going to use that security group at some point in the near future or something like that. But what you're typically not going to do is cause an outage. But let's say you invoke the auto fix for S3 bucket open to the public. You are most likely going to cause an outage by making that S3 bucket private. And it's an organizational question on whether you want to suffer that or not. And that's how really the one-click fix came about was we did the auto fix and people were getting surprised that the S3 buckets were getting closed and they didn't have notification. So we created the one-click fix and of course the developer looks at it and says, okay, I got eyes on glass. I know what's happening. I'm going to close that thing. Or they go in there and do some tweaks and they go fix it themselves. On several of them, they hit the one-click fix and once they had confidence that it didn't create any operational issues, they went into the auto fix. But that's still, again, culturally, how we work and with all the automation that we have, you still got to ask your question, why aren't we fixing that in the pipeline? Why aren't we preventing it from happening in the first place and you got to have that triage event to find out what's happening there. So let's go back up to the dashboard. We categorize all of our policy and if I go into high here just for completeness, it's just going to filter this list and we're in the violations menu now. It's going to filter the list on your highs and if I go to mediums and low, it's going to do that, each one of those severity. So as we go into category compliance, we have four categories. Security, that's where we're spending most of our time on getting those security policies in place. But we also have some cost policies and we're not even attempting to compete with the very mature cost management solutions out there. 
When we say cost here, we're just being opportunistic. What you see after you operate in the cloud for a little while is you get an orphaned ELB. You get an orphaned EC2 that didn't go away. An orphaned, what's another one? An EBS volume. Those happen quite a bit, actually, and all of those are costing you money. You also have utilization of an EC2 instance. Maybe you have an XXL out there running at 8%, and you look at that trend line over time and it never gets above 15%. You'd probably save quite a bit of money by dropping that down two or three steps, maybe down to a medium, maybe down to a large, who knows. So that's what we're looking for in those sorts of policies. And then for operations, we're looking for those purely operational things. For example, every one of our accounts at T-Mobile had standard regions, right? And if you deployed an asset outside of the standard regions, then we had a problem, right? It's not a huge problem, but the question is, why are you doing that? So we have policies related to your regions, and are you configuring autoscaling correctly, and that sort of thing. So we have a small number of operational policies, and we have a small number of cost policies too, to be clear. Tagging: we made tagging a first class citizen because if you get into the scale of even a medium enterprise, you're going to be at 100,000 or more assets before you know it, and when I say assets, I mean resources. You get to 100,000 resources or more very quickly, and if you don't have a mature tagging model, you can't keep track of that stuff. You simply don't know what's going on. So having a mature tagging policy and model, and making sure you're at least tagging every single asset, is huge. So we keep that as a first class citizen from purely a compliance standpoint. We show trends, and this is messy just because of our dev and test environments.
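For a sense of what tagging compliance means in practice, here is a minimal sketch in Python that reports which mandatory tags a resource is missing and what share of an inventory is fully tagged. The mandatory tag names, the function names, and the sample resources are assumptions for illustration; they are not taken from Paladin Cloud's policies.

```python
# Assumed mandatory tag set; every organization defines its own.
MANDATORY_TAGS = {"application", "environment", "owner"}


def missing_tags(tags):
    """Return the mandatory tags absent from a resource (keys compared case-insensitively)."""
    present = {key.lower() for key in tags}
    return sorted(MANDATORY_TAGS - present)


def tagging_compliance(assets):
    """Percentage of assets carrying every mandatory tag, rounded to one decimal."""
    if not assets:
        return 100.0  # nothing to tag means nothing is out of compliance
    compliant = sum(1 for tags in assets.values() if not missing_tags(tags))
    return round(100.0 * compliant / len(assets), 1)


# Made-up inventory: resource ID -> tags.
assets = {
    "i-1": {"Application": "billing", "Environment": "dev", "Owner": "team-a"},
    "i-2": {"owner": "team-b"},
    "vol-3": {},
    "sg-4": {"application": "web", "environment": "prod", "owner": "team-c"},
}

print(missing_tags(assets["i-2"]))   # ['application', 'environment']
print(tagging_compliance(assets))    # 50.0
```

Tracking that percentage over time gives the kind of trend line a compliance dashboard shows, with the goal of approaching 100%.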
You can see we had a bunch of stuff tagged, then took it all away; we added them, we took them away, that sort of thing. But no matter where you start in your compliance journey, and however long this is (this will be the lifetime of running Paladin Cloud), you want to finish over here approaching 100%. And depending on your company policy, you may ignore lows, for instance. We have one customer, I shouldn't call them a customer, one person operating in open source that we've talked to, where they have 24 hours to get their criticals done. They have two weeks to close their highs, and their mediums and lows are best effort. So they're never really going to approach 100% on mediums or lows unless they're just so clean that they start elevating those into highs or something like that. The inverse is true also: no matter where you start, you want to drive your violations down to zero. We have a view down here where, for every single policy that we've implemented, we give you the overall compliance rating. Okay, so we're going to have a lot of zeros here because of the environments that we're in, but you can see we're not doing mandatory tagging on a lot of things. We're not ensuring databases aren't a managed tier. We're not managing the key vault correctly, that sort of thing. So there's all these rules. And then of course, in the real world, you'll get into something that looks more like this, or really more like this. And the ideal state is clearly to get yourself to 100%. And then finally on this dashboard, we're giving you a view of all your assets and what your count is. So policy definitions in Azure: if you've ever worked in Azure, there's a lot of these things. So that's the major thing that we're working on and testing right now. So we do have a bunch of them, but you can see even in our small environment, and some of it's contrived, security groups at 187, and you've got ENIs, DHCP options.
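The remediation windows mentioned above (24 hours for criticals, two weeks for highs, no fixed deadline for mediums and lows) can be expressed as a simple SLA check. This is a minimal Python sketch under assumptions: the `SLA` table, the `overdue_issues` function, the dictionary layout of a violation, and the sample dates are all hypothetical.

```python
from datetime import datetime, timedelta

# Assumed remediation windows; severities not listed here have no deadline.
SLA = {"critical": timedelta(hours=24), "high": timedelta(weeks=2)}


def overdue_issues(violations, now):
    """Return the issue IDs that have been open longer than their severity's SLA."""
    overdue = []
    for violation in violations:
        limit = SLA.get(violation["severity"])
        if limit is not None and now - violation["opened"] > limit:
            overdue.append(violation["issue_id"])
    return overdue


now = datetime(2024, 1, 15, 12, 0)  # made-up "current" time
open_violations = [
    {"issue_id": "A", "severity": "critical", "opened": datetime(2024, 1, 14, 9, 0)},  # open 27 hours
    {"issue_id": "B", "severity": "high", "opened": datetime(2024, 1, 10, 9, 0)},      # open about 5 days
    {"issue_id": "C", "severity": "low", "opened": datetime(2023, 6, 1)},              # no deadline
    {"issue_id": "D", "severity": "high", "opened": datetime(2023, 12, 20)},           # open about 26 days
]

print(overdue_issues(open_violations, now))  # ['A', 'D']
```

A list like this is one way a technology leader could watch an SLA from exported violation data, since it only needs the severity and the date each issue was opened.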
It doesn't take long for these to start growing very, very quickly. And in our world, these shift a lot depending on what we're testing and deploying for the tests. So that's the main dashboard. And what's interesting about this dashboard is the feedback we're getting: yeah, this is good for the developers, it's actionable, but as the numbers get bigger, is there further refinement or prioritization needed in any one of these categories? This is a question we get a lot from technology leaders. They like the summaries and the trend graphs. And they also like the idea that, since just about every mature organization has an SLA for taking care of critical and high security problems, they can manage that SLA based on what these numbers are. In several places here, you can see we can export this stuff to XLS, Excel, that sort of thing. We also have an API available, so you can put this into your own Grafana dashboards or whatever operational dashboards you're using. I'm going to jump into violations here really quick. Like I said, we already went through this, and we talked about and clicked through every one of these columns. There's a variety of ways to filter it, nothing special there. You can export this also to get more information. Assets are worth a little bit of a deeper dive. You can go to every single asset, whether it's Azure, AWS, or GCP; that's the beauty of this aggregated view. And you can see the trend view of these assets over time, which is definitely valuable. And then of course, you can start getting into the details of each one of these things and what's going on. Our policy knowledge base: right now in the open source world, we have 337 policies. Like I said, we're almost complete with all the CIS benchmark policies for these three clouds, GCP, Azure, and AWS. 244 of our policies are related directly to security, which I think is a good thing.
And like I said, a handful more: 45 for operations, five for cost, and 43 related to tagging things properly. And of course, you can filter and search this, have at it. The tagging, like I said, we made a first class citizen because it is worth looking at. And we're in a state where we're not doing tagging very well in our dev environment. If we look at this, you can see 152 are not tagged, and exactly what they are. And you can see the asset list has some filters, and we can clear that and do a different filter. So I already went through the fix central to show an example of the autofix. We do have about two dozen autofix rules across all three clouds for various things. As I mentioned earlier, some of them cause operational issues. Some do not. So to me it's always use with caution, and make sure you have a great conversation among the DevOps operators and SREs about whether you're willing to take that risk, or whether you want to go through an intermediate step like a one-click fix until you get comfortable. But I think my message with autofix is, even though this is a cool feature and we can do it, you don't want to rely on it. Because remember, we're verifying exactly what's happening in the resources that you're exposing. It doesn't matter what environment it is, it doesn't matter what cloud it is, we're inspecting that configuration. So somehow that thing got into that state, and it's probably due to something in the pipeline. But it may not be. We've seen that people will open up an SSH port or an RDP port, even though we demand that the cloud should be immutable. People still do it. People are people. People do make mistakes. And like I said, one of the fundamental problems is education in the cloud. So until we get years down the road from now and people are really, really comfortable with what security looks like in the cloud, we're going to see these human errors.
And just for kicks, I'm going to close with this little statistics view, just because it's an easy summary of what's happening. You can see 340 policies have been enforced. Nearly 1,300 evaluations have happened in the last day. 77 auto fixes were applied. We're running in nine accounts, and this is not just AWS, this is all three clouds. You can see events processed; we haven't had any today. And then close to 4,000 assets scanned. And then, of course, you can see all of your violations there. So I'm going to end our demo here. For all those who stuck with this and are still listening, I really appreciate it. Thank you for paying attention and thank you for being involved in this webinar. I guess from here, if this looks interesting, go back to that original screen. We have our Git repo there. And maybe I should make this easier for you. We'll end right back on this beginning slide: join our community, get to the repo, check it out. If you like it, please star us. And then finally, you can always check out our company website at paladincloud.io. I am on LinkedIn. I love being on LinkedIn. I love talking to developers and I love talking to technology leaders about the problems in the cloud and what problems actually need to be solved in the cloud. So by all means, reach out to me. So with that, thank you. Everybody have a good day. Take care.