All right, thank you so much for coming, everyone. I think we'll go ahead and get started. Our talk today, as you can see on the screen, is Streamlining FedRAMP Compliance with CNCF Technologies. We wanted to start out by doing a quick poll of the audience. How many of you are currently going through a FedRAMP accreditation process? OK, a lot, actually. That's great. How many people are maybe familiar with FedRAMP, thinking about getting into the process? OK, also a lot. And is there anyone who has heard of FedRAMP but doesn't really know anything beyond that? OK, also a couple. Great. Well, we have some introductory material to explain what FedRAMP is and what the process looks like, and then we're going to talk about some of the challenges we faced and how we solved them. So hopefully there's something relevant to everyone.

Before we dive into the content, we just wanted to do brief intros. Apologies to those of you who came to our talk on Tuesday, as you've already heard these, but we'll keep it brief. My name is Ali Manfri. I'm a senior architect of our federal government business at Palantir. Essentially what that means is I lead our cloud architecture and our cloud-hosted deployments across our federal government portfolio, and I also lead business development for our Apollo product and the FedStart program that we'll talk a little bit about later today. Four years ago, I led Palantir's initial efforts to become FedRAMP and IL-5 accredited. That's how I got really familiar with this space, and I then built up our whole technical federal compliance practice that oversees all of our accreditations and all of our ATOs in the government today.

Hello, folks. My name is Vlad. I'm a lead in the production infrastructure group. I lead all the teams that manage and deploy Kubernetes in all the places Palantir deploys its software: commercial cloud, high-side cloud, on-premise, and edge as well.

All right. A brief overview of our agenda. We wanted to do a little bit of an overview of Palantir as a company for those of you who aren't familiar with us, and talk about our journey getting into FedRAMP and some of the things we faced as we initially started going through it. Then we're going to dive into some very specific technical challenges. As many of you know, there are many challenges associated with pursuing FedRAMP accreditation, but we wanted to pull out a few that tend to be the trickiest, and were certainly the trickiest for us. Then Vlad's going to talk about the solutions and how we solved them. And hopefully we have plenty of time at the end for Q&A as well, for any of you who have questions after we're done.

OK. So introducing Palantir very briefly. Palantir was founded in 2003, after the events of 9/11, with the mission to help the federal government make better use and sense of its data while protecting privacy and civil liberties. All of our first products were about data integration, data analysis, and operational use cases involving data, and we first began working in the Intel community. But as our products and our business have evolved over the course of the past 20 years, we've grown tremendously and expanded into many other parts of the federal government.
Some of our clients you can see on the slide here: we expanded into DoD, and also into federal health and the civilian government space as well. So you can imagine that FedRAMP and Impact Level accreditation, as our business started to grow, was top of mind for many years before we decided to really go for it.

For those of you who aren't super familiar with it, FedRAMP and Impact Level accreditations are specifically required if you want to sell your software as a cloud-hosted SaaS to the federal government, as opposed to on-prem or a self-hosted, self-licensed model where the government manages everything. For us, as our business continued to grow and scale, delivering our software as a cloud-hosted SaaS was super important, because we don't have that many engineers and needed to support a lot of environments. So that's what led us down the journey of thinking, all right, it's probably time to do FedRAMP and Impact Level accreditation. After a couple of back-and-forths and a couple of false starts, we made the decision to really go for it in the fall of 2018, and we spent all of 2019 achieving accreditation; at the tail end of 2019 it formally came through.

As we're going to talk a lot about today, our original FedRAMP and IL-5 accreditation did not include any Kubernetes or any cloud-native technologies. We got through the accreditation process, but we ran into a lot of difficulties with efficiently managing and maintaining the system once we achieved accreditation, and that's ultimately what led us down the journey of moving towards a Kubernetes-based architecture. Once we made that switch, a lot of aspects of the accreditation process became easier, and we've actually been able to reduce our dedicated headcount by 80%, just by standardizing our environment using Kubernetes and meeting a lot of the controls at the infrastructure level as opposed to the application level. So we're super excited to be talking to you all about that today, what we've learned and what we've done, and hopefully some of this content is useful for those of you who are going through the process as well.

Quick overview of the FedRAMP process at a high level, for those who are maybe a little newer to this space. I'll say upfront: I mentioned FedRAMP, and I also mentioned this thing called Impact Level, as being the accreditations required for cloud-hosted SaaS applications. FedRAMP is the accreditation framework for the civilian government space; if you want to work in the civilian space, FedRAMP is the one they use. The Impact Level accreditation process is the one the Department of Defense, or DoD, uses. The Impact Level process can actually inherit from the FedRAMP one, though. For example, if you pursue FedRAMP High and then want to pursue IL-5, the DoD will accept your FedRAMP High accreditation and then only assess the Impact Level 5 controls that are layered on top, above and beyond what you've already done for FedRAMP. I bring that up to say we're talking specifically about FedRAMP today, the title of the talk is FedRAMP, but all the controls we're talking about are also relevant for the Impact Level process. So for those of you who are specifically looking into DoD work, all of this content is going to be relevant for you as well. To become FedRAMP accredited, there are a lot of steps you have to go through.
As you can see on the slide (the slide is from fedramp.gov, so feel free to take a look at their website, where you can get all of this too), at a very high level: you need to identify a sponsor or be accepted by the JAB, the Joint Authorization Board, which is essentially your entry point into the process. Then you need to document and implement hundreds of security controls. You have to have those controls validated by a third-party assessment organization, or 3PAO. And then you need to do a final review with your sponsor and with the FedRAMP program management office. Our talk today is not going to focus on the entirety of this process, but rather on some of the security controls that are particularly challenging to meet, and, again, how we used Kubernetes and CNCF tools to meet them more easily.

So we'll head into some of the challenges we faced. With FedRAMP and Impact Level accreditation, there are many different scanning requirements you have to meet. You have to both run these scans and then appropriately remediate any issues they flag. You can see them here on the slide. Vulnerability scans are a big one, and virus scanning. You also have to do something called STIG scans: STIGs, or Security Technical Implementation Guides, are essentially compliance scans that check for appropriate configuration of specific pieces of your infrastructure. For example, there are STIGs for operating systems and databases, and you need to scan those and make sure they're configured to government standards and secure. And then you also need to do web application scans. The vast majority of these need to be done weekly or monthly, and you need to be consistently remediating and patching your infrastructure accordingly.

In a pre-Kubernetes world, where not all of our infrastructure was uniform and using immutable AMIs and container images, this meant we were scanning every single piece of live infrastructure we were using, which for Palantir was actually thousands of hosts. We have a microservice architecture, with hundreds of microservices for every implementation or deployment of our software, and we also had multiple different stacks across all of our government agencies. So the scale of this was pretty tremendous. A lot of these scans affect the performance of your system, they're pretty CPU-intensive, and it was very difficult to keep track of the results of all these scans, aggregate them appropriately, and efficiently manage them from a vulnerability management standpoint.

On a related note, patching, for a very similar reason, was also a nightmare. Rolling out all of the changes meant patching and rebooting every single one of these hosts, of which, again, we had thousands. In addition to that being a time-consuming and onerous process, a lot of our microservices have service interdependencies, and there are a lot of uptime requirements associated with FedRAMP and Impact Level environments as well. So orchestrating this whole patch-and-reboot cycle in a way that wouldn't cause any downtime was also a very complex engineering problem. We were spending, again, tons of dedicated engineering time and effort and humans figuring this out.
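To make those STIG scans concrete: on a RHEL host, a scan with the open-source OpenSCAP tooling looks roughly like the sketch below. This is a hedged illustration (the talk doesn't name the specific scanner Palantir used), and it assumes the scap-security-guide package is installed, which ships the profile and datastream names shown.

```sh
# Evaluate a RHEL 8 host against the STIG profile and write out a report.
# Profile and datastream names vary by OS and package version.
sudo oscap xccdf eval \
  --profile xccdf_org.ssgproject.content_profile_stig \
  --results results.xml --report report.html \
  /usr/share/xml/scap/ssg/content/ssg-rhel8-ds.xml
```

Multiply a run like that by thousands of hosts, weekly, and the operational burden described above becomes clear.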
The third challenge I want to talk about is FIPS encryption. For those of you who are familiar with this process, you know how painful FIPS encryption is, and Palantir is certainly no exception. It's really difficult. For those who are unfamiliar with FIPS, this is a government standard for encryption: all of your data, both in transit and at rest, must use only FIPS-validated cipher suites and crypto libraries. That list is really quite small, and it frequently takes a long time for new cipher suites to become FIPS validated, even if they're FIPS compliant, just because the formal accreditation process with the government can take a long time. Encryption at rest has become a bit easier in recent years with things like KMS, but doing FIPS encryption in transit is still very difficult. And again, when you have hundreds of services and you're also dependent on a lot of open-source tech like Palantir is, maintaining FIPS-validated traffic from service to service, between all of these services, and enforcing it in every single one of the libraries is a pretty insurmountable problem. Vlad is going to talk a lot more about what we've done to make that a little easier for ourselves.

Finally, we wanted to touch on a couple of additional things that became challenges when we started thinking about moving to Kubernetes. They weren't necessarily challenges before Kubernetes, but with that architecture there were a couple of additional considerations to take into account that we wanted to share so you're aware as well. The first one is ingress and egress: how you're managing your front door and your proxy. We were previously using Nginx in a pre-Kubernetes world, but FIPS encryption for Nginx is only available with their paid Nginx Plus product, which is additionally not designed for container-first environments. As we first started experimenting with Kubernetes, we ran into some performance challenges with that as well; Vlad's going to talk more about how we solved for it. The last one is monitoring and incident response, which of course is a hugely important set of controls you have to meet to make sure you're monitoring the environment and that it's secure. We were using osquery prior, again in a pre-Kubernetes world, which worked very well when all of our software ran as unique processes in the host Linux namespaces. But with Kubernetes, when all the process names are the same, osquery was not enough for us to adequately distinguish between good actors and malicious actors. So this is something we needed to solve for as well, and Vlad is going to go into a lot more detail about that.

So those are all the challenges, and I'll now hand it over to Vlad for the solutions.

First I want to talk about vulnerability and compliance scanning, starting with the operating system. As Ali mentioned, you need to STIG all the software you have in your FedRAMP package, and in the case of the operating system, major vendors have STIGs published for their OSes: Canonical has one for Ubuntu, Red Hat has one for RHEL. What you should expect here is lag before STIGs for new releases actually get published. For example, Ubuntu 22.04 still doesn't have a STIG, and I think RHEL 9 got a STIG as of last month, even though it was released quite some time ago. So you're going to face challenges when you want to upgrade to a new OS, and in most cases you actually won't be able to.

The next thing we did moving to Kubernetes is we started treating all the nodes the same, and we decided to run an immutable machine image. This allowed us to apply all the compliance requirements in CI and actually scan for them there, so our developers get feedback faster if they do something that invalidates the compliance of the machine. Another big change we made in our Kubernetes-based systems is that every machine lives for at most 72 hours: we nuke it when it hits the three-day mark. We did this mostly for security reasons, but it had a very nice side effect when it comes to patching.
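The talk doesn't describe the exact mechanism behind the 72-hour policy, so what follows is a purely illustrative sketch of one way to enforce it with stock kubectl; in practice, cloud primitives such as an autoscaling group's maximum instance lifetime can achieve the same end with less machinery.

```sh
# Hypothetical sketch: cordon and drain any node older than 72 hours so the
# node group replaces it with a fresh instance built from the latest image.
CUTOFF=$(date -u -d '72 hours ago' +%s)
for node in $(kubectl get nodes -o name); do
  created=$(kubectl get "$node" -o jsonpath='{.metadata.creationTimestamp}')
  if [ "$(date -u -d "$created" +%s)" -lt "$CUTOFF" ]; then
    kubectl cordon "$node"
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    # ...then terminate the underlying instance so it is rebuilt from the
    # current machine image with all patches applied.
  fi
done
```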
Patching for us now means just bumping versions of software in our machine image and rolling it out to production, and within three days we have certainty that the vulnerability was actually patched. This helped us roll things out very fast. For example, when the runc CVE happened a couple of years ago, we were able to deploy the fix across the fleet in about three days, and we didn't really suffer from it.

Moving on to container images and vulnerability scanning: we have an internal golden image that all the products use. This golden image gets updated daily with the latest patches from upstream, and then all the downstream products that use this container image get rebuilt automatically. So we have a waterfall model: first you update the golden image, and then automation triggers all the downstream builds. On top of this, we decided to embed Trivy in our software development lifecycle, so during CI we actually scan all the container images that go to production to catch any CVEs that appeared after the golden image got updated.

Next I'm going to talk about encryption and network security in the federal context. One of the items in the STIG checklist is running a FIPS-validated kernel and crypto libraries. Here you have options like Ubuntu Pro, RHEL and others. What you should expect, again, is long processing times for NIST to validate new kernels. We wanted to use some newer eBPF features from a 5.1-or-later kernel and we couldn't, because the newer Ubuntu Pro kernel was still awaiting NIST validation.

Now moving up the stack to service-to-service communication. We run Cilium as the CNI of choice in all our environments, and we had to encrypt traffic between services using FIPS-validated cipher suites. For this we decided to turn on IPsec encryption in Cilium. You can do it with a value in the Helm chart (there's a sketch of the relevant values after this section), and it works in the various routing modes of Cilium that we use: for example, in Amazon we use the ENI direct routing mode, and in Azure we use the overlay routing mode. On top of this, the FedRAMP checklist has a bunch of rules about how you secure traffic. Here we also leaned on Cilium, because it has very powerful network policy primitives, and we run it in a deny-by-default mode.
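As a hedged sketch of what turning those knobs can look like with the upstream Cilium Helm chart (value names should be verified against the chart version you run; IPsec additionally requires a pre-created cilium-ipsec-keys secret holding the key material):

```sh
# Enable transparent IPsec encryption between endpoints, and require an
# explicit network policy for any traffic to flow (deny by default).
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set encryption.enabled=true \
  --set encryption.type=ipsec \
  --set policyEnforcementMode=always
```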
The FIPS theme continues with ingress and egress traffic, which also needs to be encrypted with FIPS-validated cipher suites. As Ali mentioned, in the previous cloud architecture we were using Nginx Plus. We had to pay to use it here, and we also ran into a bunch of performance problems with ephemeral ingresses that get mounted on the front door. So we decided to look for an alternative, and that's when we decided to use Envoy. It was actually designed for running in a container-first world, and Envoy uses BoringSSL as its TLS provider. The nice thing with BoringSSL is that NIST has validated it, and you can configure Envoy at build time, using a build flag, to turn the FIPS-validated variant on (in Envoy's case, a Bazel build option selects the FIPS-compliant BoringSSL variant). Currently we run Envoy as both a forward and a reverse proxy.

Another big thing in the FedRAMP checklist is host intrusion detection systems. We run osquery on all of our machines; osquery is basically an endpoint visibility tool that exposes various information about the host through a SQL-like interface. For example, you can run a SQL query and ask which kernel modules my system has loaded, and so on. The downside is that it doesn't have any Kubernetes integration. When you have multiple pods on the same node, those pods often run the same container image, and the container image has the same entrypoint, so the binary that gets spawned is the same. osquery just sees a bunch of processes that basically look identical: same process name, same arguments, just a different user ID.

To solve for this, we decided to deploy Isovalent's Tetragon, which is a tool that uses eBPF to collect host process information. It also integrates very well with Kubernetes and Cilium, and enriches events with information from those sources. This actually allows us to answer questions like: a process is accessing a malicious endpoint to exfiltrate some data, so which service account deployed that pod, what labels does it have, what container image is it running, what network traffic is it doing, and so on. This basically gave our InfoSec team something like superpowers to reason about what's happening in the environment.
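For a sense of what that looks like in practice, here's a hedged sketch of deploying Tetragon with its Helm chart and tailing the enriched event stream, per the upstream quickstart (verify chart and CLI flags against the version you run):

```sh
# Install the Tetragon DaemonSet cluster-wide.
helm repo add cilium https://helm.cilium.io
helm install tetragon cilium/tetragon -n kube-system

# Stream process events; each exec/connect event carries the pod, namespace,
# labels, and container image: context that plain osquery cannot see.
kubectl exec -n kube-system ds/tetragon -c tetragon -- tetra getevents -o compact
```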
Next, Ali is going to talk about some other ongoing challenges that we're still facing.

Yes. So as Vlad mentioned (we'll switch to the other mic), CNCF tech has really done a lot for us, as he talked about, and we've been able to meet a lot of these controls in a more effective way. But there are still a lot of other challenges with FedRAMP and IL controls that can't fully be solved by that out of the box. So we also wanted to talk a little bit about other things we've done and built on top of Kubernetes that have really helped us meet a number of the other controls as well. We built a program called FedStart, powered by our Apollo product, that essentially solves the rest of these challenges for Palantir, and also for any other companies with containerized, Kubernetes-native applications that want to become FedRAMP or IL accredited for federal government workflows. We host containerized apps in the product, and they're able to seamlessly meet the majority of the rest of the controls. We have a few examples of that here.

The first is change management. A lot of the FedRAMP controls, as many of you know if you've dug into it, revolve around how you manage changes and updates to the system, and how you validate that those are safe prior to rolling them out to production. For example, changes must be tested in a representative environment before rolling out to a FedRAMP-accredited environment, and security-relevant changes typically have to be approved by an authorized US person. So figuring out how to fully automate your change management, especially with the rise of things like GitOps and other automation tools, while still maintaining all these compliance checks, is something that can be really challenging. Apollo and Kubernetes have helped solve this for us with essentially policy-based rollouts. We're able to do CI/CD while still enforcing the right approvals, those US-person checks where we need them, and also automating that changes roll out to staging or any of our non-regulated production environments before they go to our FedRAMP-accredited environments.

The second one: there are also a lot of process-oriented controls. For those of you who are not familiar, a lot of FedRAMP controls are actually not technical. They're oriented around policies and procedures, and around building out processes that impose best security practices on your system: for example, your SDLC, your incident monitoring policy and incident response processes, and your contingency plan and all your disaster recovery steps. Our standardized infrastructure based on Kubernetes has made it easier to templatize a lot of these policies and procedures, such that a large part is actually consistent across all of the applications we run in the environment. For example, we have a standard way of storing backups for all applications running in the environment, in the underlying cloud storage, which makes it easier to enforce that all applications are able to restore from backups, rather than needing to deal with a bespoke backup-and-restore process for every single application we run.

And the third one: vulnerability management. We've been talking a lot about this one. Vlad talked about how we use Trivy for scans, but the next immediate question is, okay, how are you going to deal with all the vulnerabilities Trivy has now surfaced? Vlad mentioned this a little, but we essentially manage this by utilizing our own minimized images to limit our CVE exposure, for our tech and for a lot of the open-source tech we rely on. All the applications we host in the cluster are able to make use of those as well. So by doing that, we're able to really minimize the number of CVEs we have to deal with, which otherwise can be a really unruly problem.

Overall, all these things we've talked about have taken the accreditation process for any new application we run in the environment from something that previously could take years to something that's essentially weeks to months, whenever we want to make updates or add new applications into the environment. So it's been a really great accelerant for us, and hopefully this was helpful for you all too. That is all the content we have, so thank you for listening to our talk. We're super excited to open it up to any questions you all have.

Yeah, do you want to use the mic that's right there? Okay, I'll repeat the question then, that's fine. The question was: what was the role of vendor support and vendor liaisons in helping us manage and get through this process, given you typically have a lot of questions for them about FIPS encryption and other things. Do you want to take that one?

Yeah. In our FedRAMP package, we actually inherit a bunch of controls from the underlying cloud provider. For example, encryption at rest we do via KMS, with EBS, S3 and so on. So it's basically going through each control and seeing at what level you solve for it, and whether you can inherit it from the provider you're running on top of. We don't really deploy custom vendor software that handles the data in our environments. Yeah, I think that's right. So for example, we don't use a vendor Elasticsearch or something like that to handle our data, where we'd then need to go ask them what they do for all the things you just mentioned, like FIPS encryption and so on. We have that in-house and it's done by our platform. And I think a lot of the time, for hyperscalers, they now have public listings of which products are FedRAMP accredited, which are Moderate, which are High, so you can look there and at least get a view before you have to go talk to a person, which is helpful. For example, Amazon has a list of all the services, including the FIPS endpoints you need to use: talking to KMS, you need to use a special endpoint, S3 the same, and so on.
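For reference, steering clients at those FIPS endpoints is usually a small configuration change. A hedged sketch for AWS (newer SDKs and the CLI honor the environment variable; older clients may need the endpoint set explicitly, and the region in the URL is just an example):

```sh
# Ask the AWS SDK/CLI to resolve FIPS endpoints automatically.
export AWS_USE_FIPS_ENDPOINT=true

# Or pin the FIPS endpoint explicitly for a single call.
aws kms list-keys --endpoint-url https://kms-fips.us-east-1.amazonaws.com
```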
Yeah, two pretty basic questions. The first one is: we make dual-use software, and we sell self-hosted, Helm-chart-type things. So in cases where you're not the one operating the platform, but maybe you're selling to the government and the program is operating the platform, who gets the FedRAMP? Does everything still have to go through it, or is the process different? Or is it just, hey, you don't have to worry about it, it's the program that needs it?

Yeah, that's a really good question. All the FedRAMP and Impact Level stuff is pretty much only applicable if you're doing the traditional cloud-hosted SaaS, where you are the one managing and operating the platform. That's because a lot of the controls are around how you do your encryption management, but also how you're storing secrets and tokens and all of those types of things, which you can't actually do if you don't have control over the underlying infrastructure. We do have a lot of deployments where it's not a SaaS: we've had to deploy into a government-managed cloud, or on-prem, or to an edge device in a lot of cases. Typically in those cases you just go through an individual, what's called an RMF ATO, which you may or may not be familiar with. Once you get these central accreditations, each government agency still has to grant you an ATO, or authority to operate, for that product being used at that agency. So if it's an installation into a government cloud or something that's not a traditional SaaS, you typically just go through a separate ATO for your software at that agency in that environment, which is more work for the vendor, because you can't take advantage of the inheritance of a cloud accreditation you already have. So it's possible to do that, but it's not usually FedRAMP, and more of the controls are shared between you and the government when that happens, too. I don't know if that answered your question.

Yeah, no, that's a big help. The second question is: do you know if there's any effort, or maybe there's already a way, to get a vanilla STIG'd Kubernetes setup, for just smoke-testing your work? Because it's kind of intimidating, obviously.

Yeah. I did not find anything you can pick off the shelf. The closest answer I can give you is OpenShift, the OpenShift Container Platform. They have a compliance operator that's able to apply, and automatically remediate based on, the DISA STIG that they have published, but you still need to do a bunch of manual checks with it. It doesn't have support to fill in all the gaps. But I think that's the closest one I've seen. Thank you. Thanks.
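A related off-the-shelf option, with the caveat that it checks the CIS Kubernetes Benchmark rather than a STIG: the open-source kube-bench tool can smoke-test a cluster's hardening as a one-off Job. A hedged sketch, with the manifest URL as published in the project's docs (verify against the current repo):

```sh
# Run kube-bench as a Job and read its CIS benchmark findings.
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench
```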
Hi, great talk. I had a question about your 72-hour rotation policy for your hosts. How do you reconcile that with databases, or things that have huge amounts of data? How do you deal with the fact that you're rotating these every three days and transferring terabytes of data between them?

I'm not part of the database team, but all the products running in our environment need to comply with this policy. I can get more info for you if you come down to our booth, or you can actually ask Greg, over there in the back, afterwards, and we can answer this. Okay, got it. Thank you.

Yeah, at a high level, maybe the only two things I'd add: we also make use of some of the managed database services, so AWS RDS, right? There, they're managing the compliance, so we don't have to figure that part out ourselves. But for the databases we do run in containers, figuring out all of that HA... yeah, I'll defer to Greg. Thank you.

Hi, is there anything you'll be doing differently with the upcoming Rev5 versus what you've presented today, in terms of the things you're looking at?

Love that question. Yes, we recently switched all of our documentation over to Rev5, so everything is following that. I would say, for the most part, it's actually been pretty consistent. Most of the controls got renumbered, and maybe the language got tweaked slightly, but most of it is the same. So nothing has really changed about the way we run our infrastructure. The primary new control family that was introduced, as maybe you know, is all of the supply chain risk management controls. That's been a big one where, internally, we've been rolling out new processes to assess: what are all the libraries we're taking advantage of? How can we actually trace those? How are we generating SBOMs? How are we better managing our supply chain security overall? But that, again, is baked into our overall SDLC and how we roll out the infrastructure and the applications, so we can standardize it everywhere and maintain the same setup we have today. So that's really the only major thing. Not sure if you had anything particular top of mind, but for the most part we haven't needed to tweak too much for Rev5.

I mean, our main concern is around the package signing requirements, for things like the libraries we're using, or third-party stuff; everything else seems pretty straightforward, but we haven't cracked that one yet. Yeah, we can talk more about that if there are specific things, and loop in other folks if that would be helpful too. Yeah, after, yeah.
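On the SBOM point: both the CI image scanning mentioned earlier and SBOM generation are stock capabilities of Trivy, the scanner named in the talk. A hedged sketch with a hypothetical image name (flags per upstream docs; verify against your Trivy version):

```sh
# CI gate: fail the build if the image has high or critical CVEs.
trivy image --exit-code 1 --severity HIGH,CRITICAL registry.example.com/app:1.2.3

# Emit a CycloneDX SBOM for the same image.
trivy image --format cyclonedx --output sbom.cdx.json registry.example.com/app:1.2.3
```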
Hi, great talk. I want to make sure I understand what you said about Cilium; I'm pretty new to these things. Let's say you have a Kubernetes cluster and a couple of applications: Java, Ruby, .NET, whatever. Is my understanding correct that you just disable encryption in those applications, so you don't have to worry about FIPS compliance there, and you make sure Cilium does the encryption? Is that the case?

Yeah, so the thing we actually do is: we don't change the cipher suites in the applications themselves. We turn on encryption via Cilium, and the interesting part is that we do not allow any application to be deployed on the host network. Everything is part of the pod network. That's how we ensure that all the traffic going from service to service, pod to pod, or pod to whatever destination it needs to reach, goes through Cilium itself. All right, great, thanks. Yeah.

So, you mentioned policy rollouts. Can you share some more details on that? And around the deployment and the CI process, did you have to make any changes there? Sorry, what was the second part? From a CI and deployment point of view, were there any changes you had to make?

Yeah, so policy-based rollouts. There are basically a couple of things here; I touched on a few. We have waves, or what we call release channels, for how things are rolled out across our entire fleet. Because the government environments have these more stringent requirements, needing to test things in representative environments first, one of the things we do is phase all of the rollouts. But then the other policy checks that are applied are things like: has this change been approved by the appropriate US person with the appropriate role, et cetera, in order to be rolled out, depending on what it is? Has this change existed in a representative environment without causing any downtime? Is it actually a stable release before it gets pushed out to these environments? It basically enables us to define logic for how the changes are rolled out in a way that's automated, so it continues to be compliant but doesn't need as many engineers babysitting the whole thing. In terms of your question about changing CI/CD: I do think going for FedRAMP and Impact Level, you have to change some of your processes, because you have these additional requirements. You can't just do the same thing where a developer pushes a change and it rolls out to the environment. You do have to meet additional controls, and that's where we built the automation that lets us automate it while still meeting those controls, I would say.

One more question: you mentioned the images for the hosts; is there anything you had to do for Helm charts, or any of the Kubernetes manifests? For the manifests themselves, no, but all the images we use, we basically republish internally based on our golden image. We have more questions, but we're out of time; we can take them in the back. Can I just quickly follow on to that? Are you sourcing the base images from a third party, or are you managing them in-house? We are extracting the binaries from the images themselves and then repackaging those internally. Okay, thank you. Thank you. Thank you.