Welcome to our talk about running certified PCI DSS workloads in the public cloud on Kubernetes. A short agenda of what we'll be covering today: we're gonna talk a bit about us, myself and Marcel, and why we're here, a very short what and why of PCI DSS, how we're able to meet PCI DSS requirements using CNCF and open source projects, and a short wrap-up and recap at the end. So to start, my name is Stephen and this is Marcel. We work at Schuberg Philis as mission critical engineers. My background is running backend services and providing services to customers, and Marcel's background is more in supporting and enabling the developer teams that we support. We started this journey around three years ago, provisioning our first cluster in Amazon to help a customer migrate from on-prem to the cloud. We will cover them in a bit, but basically they were already PCI DSS certified, and part of the task for us was to help them maintain that while moving to the cloud. We also hope to use this talk to demystify some of the aura around PCI DSS. When chatting to colleagues at our company, outside our team or even outside our company, there's this aura of specialness or complexity around PCI DSS, and hopefully what we're gonna show you is that it's not all that bad. So, a short bit about Schuberg Philis and the term mission critical engineer. We use this term to describe the sort of work that we do. I guess it's closest to an SRE sort of role, but it really matches the environments we look after. We tend to look after mission critical environments: the bits of IT infrastructure that are very critical to a company's success. If this part is not running or functional, it really impacts that customer's ability to operate, and that's what we tend to help them with. Yeah, and Marcel and I and some others are in a team that is servicing one of these critical customers with this transition from on-prem to the cloud, and we're now going to talk a bit about that. This customer is called CCV. They're mostly known in the Netherlands and Belgium, the Benelux area, but they are a transaction processor, one of two transaction processors in the Netherlands and Belgium. To put it very briefly: if you are making an online payment or you are tapping your phone or card on one of these machines, there's a very good chance that your payment is running through their systems. We can't divulge the exact volume of transactions they're processing, but what I can do is quote their website, which says: we make millions of transactions possible daily. Now a very short bit about PCI DSS, as I'm sure most of you are itching to see how we actually do this with the tools. The PCI DSS standard was launched in 2006 and was basically the coming together of five of the big card providers to set requirements that ensure any company processing or storing your transactions is doing it in a safe way. If you look at the why, the main thing here is that previously transactions and these records were kept in a physical manner. So if you had a theft or a data breach, it normally meant someone breaking in or having some sort of physical access to these records and stealing them. But the world is changing, or has changed, and it's all digital. So these requirements are basically there to make sure you're doing things in the right way and have the right safeguards in place.
And by doing it, you can also say to your customers and your providers that you're doing your best to protect their data. So what do these goals and requirements look like? We have 12 key requirements, 78 base requirements and around 400 test procedures. On the right-hand side you see six blocks, and these 12 requirements are grouped under those six goals. These are basically the goals of PCI DSS that you need to meet or comply with, and we're gonna talk about how we use CNCF projects to make this happen. So, an often asked question: why cloud, why Kubernetes? When you're doing your PCI DSS certification, it's all about the scope, right? If you run the infrastructure, if you run the network, these are the things you need to validate, have audited, and make sure are compliant. And this would give you an AOC or a ROC, which stands for Attestation of Compliance or Report on Compliance, horrible things to say, so I'm just not going to say them again. These basically cover the certified components of your environment, right? What some of the major cloud providers have done is their own PCI DSS assessments; they have their own AOCs, their own ROCs, and when it comes to certified components, we can reuse or reference their AOC in our own AOC, which suddenly means that the scope of what we have to maintain and audit gets much smaller. And if you ever talk to an auditor, it's always about putting things out of scope; they're always talking about how to reduce the scope of the environment. So I guess the message here is that the cloud is not really a detractor, but more of an enabler when thinking about PCI DSS, and maybe some other certifications. And then, why Kubernetes? From our history and just from the past, we've seen how software and applications get deployed. You start maybe with an installer, or a VM, or a container, and these were your portable things, right? But with Kubernetes, we can actually package up the whole environment, and that's our deployable artifact. Maybe a topic for another talk one day, but we actually run an exit scenario for the customer where we take the whole Kubernetes environment, go to Azure, spin up an AKS cluster and deploy the whole thing. Because they are overseen by the Dutch national bank, having this exit scenario really helps. All right, so a disclaimer before we go on. This is quite a big enterprise customer. We show how we use CNCF and open source projects, but to be transparent, there's also some enterprise stuff happening that we're not going to talk about. So, full transparency, I guess. We are not Qualified Security Assessors, or QSAs; that's a special certification you need to have to certify a PCI DSS workload. But we have been working with auditors and successfully made this environment pass the test for the last two years, so we have a pretty good idea of what's needed and what's not. But, you know, don't take our word over your auditor's, basically. And lastly, this talk will not get you certified; as you saw, there are around 400 test procedures and 78 requirements. What we're trying to do here is show how you can use automation and tooling to reduce the amount of manual effort you have to put in to get environments into such a state. Okay, with that, I'm going to pass over to Marcel. I do want to mention we've got a lot of content, so we're not going to go very deep into things.
It's really about how you could do this, because when we looked into this ourselves, what we missed was that step in the right direction. We're assuming some knowledge of the ecosystem and the tooling, and, well, it should be quite the ride. So the first goal we're going to cover is to build and maintain a secure network and systems. As you can see, this is covered by two key requirements: install and maintain a firewall configuration to protect cardholder data, and do not use vendor-supplied defaults for system passwords and other security parameters, which is probably a good idea to start with even if you're not PCI DSS compliant. Like Stephen mentioned, the world of transactions is moving into the digital world, so we needed a way to protect ourselves. As you can see, there's a huge, almost exponential increase in the number of digital transactions happening. So firewalling was one of the first things we actually looked into. Our choice of CNI is Cilium, so it's one of the first things we install in our cluster. It ticked a lot of checkboxes and helped us on multiple fronts; we're gonna mention a couple of them. Installing Cilium with a default deny policy ensures that we don't allow internet access to the cardholder data environment, or the other way around. So just by installing Cilium, we already covered one of the requirements. Cilium also helps us control traffic: we can allow only the protocols needed for the applications, between namespaces, between pods, using selectors and all the bells and whistles you know from the Kubernetes environment. So this helps us stay in control. Enabling the default deny means filtering all traffic, and filtering all traffic really means all traffic: Prometheus, external load balancer health checks, everything. So while it is super secure, it's also very difficult, because installing something and making sure it works takes a lot more effort than it normally would. Fortunately, we also install Hubble in our cluster. This is also from Isovalent, the people behind Cilium. It gives us a lot more observability. It lets us monitor Cilium itself using the standard metrics, but Hubble also gives us metrics to monitor the behavior of the network flows in the cluster: DNS queries, drops, flow verdicts, stuff like this. Basically, what Hubble does is conduct automatic discovery of all the service endpoints and draw a dependency graph, resulting in a more user-friendly visualization and filtering of data flows. I grabbed that from the website. What it basically means is that we get a very nice overview of what traffic is dropped and what traffic is allowed. It made it a lot easier for us to actually get all the software and all the network rules in place. An additional benefit for us is that we also give the developers access to the Hubble UI. At first it was a bit bumpy, but now, when they are developing new applications, they use it to immediately help us with creating the Cilium network policies. So at the beginning of the project, there we were, thinking everything was going very smoothly and we were on the right track, building the best and most amazing platform. But then, yeah, we hit a bump. They told us that the applications need access to on-premise services, and the only solution we had at the time was allowing the entire subnet of the on-prem environment, which is a problem, because as you can imagine, you need to be very specific about which traffic you allow to the cardholder data environment.
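To make that a bit more concrete before we continue: here is a minimal sketch of what such a Cilium network policy can look like. This is an illustration, not a policy from the actual environment; the namespaces, labels and port are made up.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-checkout-to-payments   # illustrative name
  namespace: payments                # illustrative namespace
spec:
  endpointSelector:
    matchLabels:
      app: payments-api              # the pods this policy protects
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: frontend   # only from this namespace
            app: checkout                               # and only from these pods
      toPorts:
        - ports:
            - port: "8443"
              protocol: TCP          # only the protocol and port the application needs
```

With the default deny in place, a policy like this is the only way traffic reaches those pods; everything else shows up as a drop in Hubble.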
This also counts for the on-prem environment, which is itself PCI DSS compliant. So what we did was, well, turn to Cilium again. Cilium has an egress gateway, which basically allows us to route specific traffic based on labels. For example, for PCI DSS applications we set a specific label like pci-in-scope, and then we can route that specific traffic. The egress gateway gives us a much more limited, controlled way of filtering and allowing the traffic to on-prem. Because, well, the environment is air-gapped, as you might understand, internet connectivity is prohibited. That is challenging when you're running Kubernetes: pulling containers becomes quite difficult. So we use our own registry solution. This can also be accomplished with Harbor or any of the cloud vendors; at the start of the project we chose ECR because that was what we were most familiar with. By doing that, we're also relying on, I'm gonna try this word once as well, the Attestation of Compliance from the cloud vendors, which is basically the document saying that their services are PCI compliant. It is not a hard requirement to run your own registry, but it is one of the recommendations; there's a whole range of cloud and Kubernetes recommendations for your environment, and we did implement some of them. For example, we have two registries, one for development and one for production. This makes it less likely that you pull a developer container into your production workload, one that maybe has more tools, vulnerabilities and things in it. It also means we can have different policies on our production registry compared to the non-production one, for example for modifying and replacing images, which is something we turned off on production. The other key requirement we mentioned before is to not use vendor-supplied defaults for system parameters and passwords, which, like I said, is probably a good idea for everyone. We rely heavily on Terraform and workspaces, and we use AWS Secrets Manager as the place where we put all our secrets. So we needed a secure way to get the secrets from Secrets Manager into our cluster as Kubernetes secrets. What we eventually chose is External Secrets Operator to sync the secrets to the cluster, which it can do from various backends, AWS being one of them; you can also use Azure Key Vault or HashiCorp Vault. One of the other benefits we get from syncing the secrets this way, not really a requirement: as you can imagine, not all the applications you run can do automatic password rotation, and password rotation is a requirement. So for the applications that don't support it, we use another open source project, which is Reloader. The moment we update a password, the secret gets synced to Kubernetes, and we can then say: when the secret changes, reload the application. We wanted to make it as hands-off an experience as possible. For applications you develop yourself, there are other ways of doing this, with current and previous secrets, but then the application needs to be aware of those and you need to be able to rotate them in the cluster. For the ones that don't support that, Reloader works really well for us. Back to you.
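As a sketch of how those two pieces can fit together, here is what an ExternalSecret synced from AWS Secrets Manager plus the Reloader annotation on the consuming Deployment could look like. The store name, secret paths, image and resource names are placeholders for the example, not our real configuration.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db-credentials          # illustrative name
  namespace: payments
spec:
  refreshInterval: 1h                    # how often to re-check Secrets Manager
  secretStoreRef:
    name: aws-secrets-manager            # a SecretStore pointing at AWS Secrets Manager
    kind: SecretStore
  target:
    name: payments-db-credentials        # the Kubernetes Secret that gets created and kept in sync
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments/db            # placeholder path in Secrets Manager
        property: password
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
  annotations:
    reloader.stakater.com/auto: "true"   # Reloader rolls the pods when a referenced Secret changes
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.0.0   # placeholder image
          envFrom:
            - secretRef:
                name: payments-db-credentials              # the synced Secret from above
```

So when the password is rotated in Secrets Manager, the operator updates the Kubernetes Secret and Reloader restarts the pods, which gives the application the new credentials without it having to support rotation itself.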
Okay, so talking about protecting cardholder data, there are two parts to this: the encryption of the data at rest, and the transmission of data in the environment. Naturally, the storing of the data itself is handled by the application with encryption keys, but I can talk about how we manage access to the files. Tetragon is a nice project that does runtime enforcement. It monitors process execution, you can write policies that allow or disallow access, and there is also real-time enforcement. Also a disclaimer: we're currently investigating this, it's not actually running right now, but we have quite a nice use case. For example, we process our transactions in batches. We write a batch to disk inside the pod, and then we want to send it off to its destination inside the customer network through the egress path that was mentioned. If a bad actor was able to exec into a pod, there is a small window of opportunity between writing the file to disk, uploading it, and then deleting it, where in theory they could maybe exfiltrate data or just read the contents of the file. What Tetragon could do in this case is enforce a policy that says: if you are not this process or not this user and you're trying to access this file, then just terminate that system call. We've seen some really nice demos of this where, from the outside, you do an ls, that's all good, you do a cat, and the process just hangs, because under the hood Tetragon has ripped out that system call and basically sent it to /dev/null. Again, our friend Cilium pops up, this time with transparent encryption. Where this helps us is node-to-node and pod-to-pod traffic, both intra-cluster and inter-cluster. This also helps us avoid the overhead of running a service mesh and having to fiddle with mTLS; there's a whole can of worms there. It means that anything inside the cluster is just encrypted by default with WireGuard, so it's really fast and efficient. In addition, any egress traffic is configured for TLS, and that's more of a process thing, checking that it's actually configured that way. As a bonus, any traffic that happens inside the cluster is, using Hubble, written to standard out in the containers, which is then scraped by our logging solution and pushed into a SIEM. So if we need to do any retrospective investigations, we have all the information. The next goal is to maintain a vulnerability management program, which is basically two key requirements: protect systems against malware and regularly update antivirus software, and develop and maintain secure systems and applications. The development side we're not going to mention here; that's a different team and something they need to take care of themselves. But yeah, the first one is the antivirus. As you can imagine, running containerized apps on Kubernetes, we didn't think: let's install antivirus on all the containers and on the nodes. So we looked at what the main goal or aim of this requirement actually is. What it tries to protect you against is malicious code running on the system, or any software tampering with the system. We tackle this by using the Kubernetes security context. As you can see, we run all our containers as non-root, we set an immutable, read-only root filesystem, and we disable the insecure capabilities.
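As a rough sketch of those security context settings on a pod (the names and image are placeholders, not our actual workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example                      # illustrative name
spec:
  securityContext:
    runAsNonRoot: true                        # refuse to start a container that wants to run as root
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      securityContext:
        readOnlyRootFilesystem: true          # immutable root filesystem
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]                       # this sketch drops everything, not just the insecure ones
```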
So by implementing all of these, we're actually covering the underlying intent of the requirement, which we thought was a much better idea than installing antivirus in containers. The other thing you need is a process to spot vulnerabilities. This slide could have been a bit smaller, that's one for the retro, but we need a way to spot vulnerabilities. You also need to assign them a ranking, like critical, high and low, and you need a process to act accordingly based on that ranking. That's the process part, which we cannot automate, but we do need a way to see the vulnerabilities. At the start of the project we were running ECR scanning, which at the beginning was fine: it was a more static environment, we were still installing all the systems, so whatever was in ECR was a good indication of what we needed to patch. As the container platform got used more and more, and developers started to use it more and more, there was such an imbalance between what was in ECR and what was in the cluster that it became very hard to track whether a container was actually being used in the system. That's where we moved to Trivy. Trivy runs as an operator in our cluster, which scans what's actually running and reports it via a CRD. We also run Postee, so based on the vulnerability we have this automated: well, yeah, we are engineers, so a ticket gets created for the process and we act on it accordingly. Trivy scans the image itself and the filesystem, and it can help us find a lot of things: vulnerabilities, licenses, maybe even passwords or other stuff stored in containers. Then we use Renovate, because we also need to keep our systems up to date. We're all aware of Patch Tuesday, which is not something you can do on containers, so we need an automated way of keeping our containers up to date. We use Renovate to detect when upstream containers, or containers in our registry, get updated. Based on the merge request it creates, we can merge it, the container gets built, and we pull it through the system via continuous delivery. We'll talk about that later; we use Argo CD to actually deploy this. The containers are then promoted from staging all the way up to production, because we want it to be as hands-off as possible; no one likes constantly checking CVEs and seeing if there are fixes available. Back to you. Okay, so then we're talking about implementing strong access control measures, and what we're really talking about here is limiting access to the environments and limiting how changes happen, especially to the production environment. How do we do this? Well, we also got on the GitOps train and we chose Argo CD, but Flux and other operators will do just fine. Together with strong RBAC, we cover this requirement. For the developer role we basically say: you cannot exec into a pod, you can't do port-forwards, we really lock you down. We made sure that about all you can do is delete a pod if you're really stuck, or push through new deployments or syncs using Argo CD. By using a Git-based workflow, we also get some things for free. We're enforcing peer approval, so one or two approvers per change depending on the criticality of the application, and we get audit logs and versioning included.
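A minimal sketch of what such a locked-down developer role can look like in RBAC; the role name and namespace are illustrative, and the real rules are of course more extensive:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer                       # illustrative role name
  namespace: payments                   # illustrative namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]     # read-only view of pods and their logs
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete"]                   # allowed to delete a pod if they're really stuck
  # note: no rule for pods/exec or pods/portforward, so exec and port-forward are denied
```

Everything else, like rolling out a new version, goes through a merge request and an Argo CD sync rather than direct cluster access.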
And if you've ever worked in a highly regulated or compliance environment, the change process is really painful. You spend a lot of time writing rollout plans, rollback plans, risk assessments and all that. When you have a versioned artifact, you can imagine that gets a lot easier. Your rollout plan basically describes what you did in staging or non-production and how you promote the change through. The rollback is nine times out of ten just reverting the commit. And for risk assessments, generally we're in the low to medium category. If we go above medium, it normally means the security team should join the assessment, and that will slow down the change. So, you know, that's how we do it. This image is lifted straight from the PCI DSS documentation. I think it's pretty traditional: you have your employee workstation, a DMZ where you have your admin box or bastion host, and from there you can get into your production environment. Everyone knows why we do this and why we should do this, but obviously we want to be as cloud native and containerized as possible and not run any EC2. So a project that I've been watching for a bit is ContainerSSH. ContainerSSH is a sandbox project that basically launches a new container for every SSH session you establish, and once the user session ends, the container is dropped. Disclaimer: we're not using this yet, it's something very close on the horizon for us, because in the most recent release they just added OIDC support, which means it will fit into our ecosystem with our identity provider and authentication, all that sort of stuff. Of course, because it's a container, we can also apply Cilium policies to it and further restrict what it can do. So even if you get into the container, you're still limited in where you can go and what you can do. Then the next goal is about regularly monitoring and testing networks. We need to track and monitor access to network resources and cardholder data, and we need to regularly test systems and processes. I don't have to explain this one, I guess everybody in the room knows it. Basically every application we just mentioned exposes metrics, so we put those metrics in the system, and with that we can check on the security and integrity of the cardholder data environment. The key requirement here is basically being in control, knowing what happens. We have a baseline idea of what the environment should look like and how all the metrics should look, and this gives us insight. Thanks to the Cilium and Hubble metrics, one of the other nice tricks we can do is trigger an alert when there is an increase in drops or denies on a specific Cilium policy or flow. So if we see a huge increase, we know something is coming. It's not easy in Prometheus, maybe VictoriaMetrics has better tools for this, but that's another discussion; what we basically need is anomaly detection, and yes, it is possible with Prometheus, but you end up writing very difficult queries. The other thing that is required actually works out quite nicely. We need to document all our network policies for audit purposes, and we need to review them twice a year. Kubernetes has actually made this really easy for us. What we're doing is grabbing all the CiliumNetworkPolicies out of the cluster and dumping them in a Git repo.
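Before we get to what we do with that dump: for the alerting on drops we just mentioned, here is a rough sketch of what such a rule could look like, assuming the Prometheus Operator's PrometheusRule resource and Hubble's drop metric. The threshold, labels and exact reason value are illustrative and may differ per Hubble version.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hubble-policy-drops
  namespace: monitoring
spec:
  groups:
    - name: hubble
      rules:
        - alert: CiliumPolicyDropSpike
          # fires when policy-denied flows exceed a deliberately simple static threshold;
          # the reason label value may differ depending on the Hubble version
          expr: sum(rate(hubble_drop_total{reason="POLICY_DENIED"}[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Increase in Cilium policy drops
            description: Hubble is reporting more policy-denied flows than the usual baseline.
```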
By dumping the policies regularly, the merge request for that dump basically shows our changes in the network. So we have a very good visual representation of which policies were changed, whether policies were deleted, and what happened. This makes it very easy for us to prove what happened on a network level. And having all the policies in Git gives us a documented way of going through them, because we still need to make sure that all the policies are valid, that no one accidentally put an allow-any somewhere in a list. So we still need to go through them twice a year. You could of course do this with older appliances and network equipment and some fancy Bash scripting, but this is a lot easier. One thing I do want to mention is that we are looking at Testkube. I came into contact with it because of a request from the developers. Testkube is a framework for automated testing of your applications, but as a platform engineer you can also use it to check for deprecated APIs, and we can use it to automate some of our own tests. One of the requirements is doing a regular pen test to prove the network isolation between the cardholder data environment and other environments. That is still a manual process, but we also know the human is a vulnerability here, so we want to automate this process and make it as reliable and repeatable as possible. Testkube can also store the results somewhere, and we can then hand those over as proof. It integrates quite well with Argo or Flux and GitOps in general, because you can store your tests in YAML, deploy them to the cluster, and trigger them based on events or cron jobs or anything like that. Yeah, this one is also mine: maintaining an information security policy. We need to document all the security policies we actually enforce in the system. We cannot automate everything; a lot of it comes down to processes in a PCI environment, a lot of agreements, and we need to document a lot in Confluence or whatever tool you use. We're not going to talk about all of that today. What we can cover are the policies we apply on our Kubernetes cluster. We chose Gatekeeper, but Kyverno can do this as well. I must say Kyverno would have made it a lot easier to put a policy on screen, because I don't know if you've ever tried to put a Gatekeeper policy on a slide, but it probably won't fit. We document all our policies. For example, the External Secrets setup I talked about before: if you want to use that multi-tenant in a secure way, you can restrict it by checking the prefixes of the secrets being requested, and we enforce that with Gatekeeper. Allowed container registries, with a mutating webhook, we also do with Gatekeeper, and the default security policies we can enforce this way too. We documented all our policies, and each violation message is prefixed with an ID, so if the message is not clear enough for the developers or the end user, they can look up what they actually need to do to satisfy the policy. One of the key requirements here is making your end users aware of what they need to do from a security perspective in the cluster. And by storing all of this in Git, we also have proof and documentation of every policy; every policy has a README, and that covers that requirement. Okay, yes, almost there, almost there.
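To make the Gatekeeper part slightly more tangible, and to show why the constraint itself does fit on a slide: a small sketch of a constraint restricting allowed container registries. It assumes the K8sAllowedRepos ConstraintTemplate from the open source gatekeeper-library; the registry URL and namespaces are placeholders.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-container-registries
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces: ["kube-system"]                    # illustrative exception
  parameters:
    repos:
      - "123456789012.dkr.ecr.eu-west-1.amazonaws.com/"    # placeholder private registry prefix
```

The Rego that does the actual checking lives in the ConstraintTemplate, which is the part that tends not to fit on a slide.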
So, takeaways: three key things we want you to take away. The first is, when you're going off on such a journey or starting any new project, do take a look at the CNCF project list. The projects mentioned there are up to a very good standard, considered stable and ready to use in production environments. Also take a look at the CNCF sandbox projects, because these are projects that are in a good position or a good state, at least good enough to be recognized by the CNCF. Like the ContainerSSH example: it wasn't quite ready for us, but it's getting closer and closer. Second, don't shy away from the cloud or Kubernetes, because reusing these services and components means you don't need to build and certify them yourself; you can just focus on what you actually need to do for your customer or company. And third, building a PCI DSS environment in the cloud using Kubernetes doesn't have to be scary. If you look at what we've shown today, this is stuff that probably everyone should be doing in a production environment, not just for PCI DSS reasons. The main difference really is that at the end of it you have some sort of checklist or audit to review, to basically back up that you're doing what you say. So these are all the things we mentioned today; if anyone was trying to keep track, we thought we'd put them on one slide. Check out the projects, there's some really good stuff out there. And thank you for coming to our talk. We have three minutes for questions if anyone has any. There's a lot of practical experience here. One thing that I guess is missing is ingress. How are you exposing your cluster? How are you protecting it, and which ingress controller are you using? So this is one of the things I didn't mention because it's not CNCF, but since people are interested: basically, we have an allow list, and traffic from the internet comes through an egress VPC and through a network firewall on AWS. There we terminate the traffic, we do IPS and IDS on the network firewall, re-encrypt the traffic and send it on into the cluster. And over there we obviously have the intelligence to do alerting on things. Thank you very much for sharing all the tools you've been using to fulfill this compliance, which is quite heavy. As you mentioned, the framework was created in 2006, so it's quite an old framework, and sometimes the way the requirements are framed is kind of difficult for cloud users and cloud-heavy companies to adapt to. So I assume there has been a lot of effort to summarize and rationalize the complexity of your ecosystem in order to be audited successfully. Are there any tips you can share on how you managed to simplify this picture for your auditors? Thank you. So I think Marcel touched on it a bit, but basically you really need to work with the auditor. If you're moving to a more modern environment, using containers, using cloud, it's not gonna work if you just hand over your assessment, because for things like the antivirus it's really about what you are solving, what the underlying intent is. And I think in v4, which was released last year, a lot of that is being addressed; there's now a companion document to help you with container and cloud workloads. Yes. Thanks for the presentation, quick question: how did you manage to put your HSM in the cloud, if you did? Excellent question. So right now the HSM, long story short, is running on-prem, because we have some custom cryptography things going on there.
And we're waiting for the HSM-as-a-service offering in Amazon to support the sort of cryptography that we are doing. So we have a Direct Connect and we're in the same region; we try to keep it as close as possible. Okay. Thank you. Sure. Are we done? Yeah. Awesome. We're still around afterwards if not. Cool. Thanks, everyone.