Thanks so much for joining us today. We're super excited to be talking to you on this lovely Tuesday morning of KubeCon. Before we get started, we actually just had a quick question for the audience: can I see a show of hands, who is currently deploying software to classified networks today? OK, and then who is not and is kind of brand new? OK, all right, we have a pretty good mix, so hopefully our content is useful. We have a couple of kind of introductory pieces: what does it mean, what are the first set of challenges that you'll encounter when you try to deploy to these networks, and then, a little bit more in-depth, Vlad's going to talk a lot about very specific technical problems that we faced and some of our solutions. So hopefully the content is relevant to you all. The last thing I wanted to say before we get started is that this talk is also part one of two. We are speaking at the same time on Thursday as well, about deploying Kubernetes and other CNCF technologies to unclassified federal government environments for FedRAMP and IL-5. So particularly for those of you who are a little bit newer to the government space, feel free to check out that talk as well. All right, we wanted to start with just some brief introductions and talk a little bit about ourselves before we dive into the content. My name is Ali Monfrey. I'm a senior architect in our federal government business at Palantir. What that means is essentially I oversee all of our cloud architecture and cloud deployments across our entire federal government portfolio, and I also lead business development for our Apollo product and FedStart program, which we're going to talk about more in this presentation. Hey, folks. My name is Vlad. I'm an engineering lead in the production infrastructure group. Since 2017, I've been leading the teams that deploy Kubernetes in all the places that Palantir is running software: classified cloud, commercial cloud, on-premise, and edge as well.
All right, a brief overview of the agenda today. We are going to do just a very brief Palantir overview and introduction for those of you who are not familiar with Palantir as a company: talk a little bit about what we do, how we got into this space, and our experience and expertise doing it. Then we're going to talk about some of the challenges. You can see the subcategories here. There are a lot of challenges to deploying in classified environments; we're just going to cover a few, but hopefully we'll leave plenty of time for Q&A at the end as well. So we'll cover challenges, solutions, and then ask for questions from you all. So, a very brief overview of Palantir. Palantir was founded in 2003, actually after the events of 9/11, specifically with the mission to help the federal government make better use and better sense of their data while still protecting privacy and civil liberties. Our first products were pretty heavily oriented toward data integration, data analysis, and making operational use of data, and we began in the Intel community. But as our products and work have evolved over the course of the past 20 years, we've grown tremendously and expanded into a lot of other parts of the government. You can see just a couple of our clients here on this slide, but we have a lot of work across the DOD, the federal health space, and the civilian government space as well. We bring this up just to say that we have been working in the federal government space for the past two decades. It's been a really important part of our company, and even today our government work accounts for half of Palantir's overall business and revenue. Specifically, we've been deploying in classified networks and environments from the very beginning. All of Palantir's products are deployed across multiple high-side networks. We'll get a little bit more into detail there, but there's not just one.
As many of you who are already in the space know, there are many, so we deploy across a number of those networks and environments, and our software is operational in all of them today. In the past several years, we decided to move to a primarily Kubernetes-based architecture, which introduced a lot of novel challenges on top of the ones we were already facing. We're here to talk about a lot of those today, and Vlad will be going into more detail there. We also expanded our product offerings. As we grew our scale and our footprint operating in these environments, we had many different installations across many different classification levels. So we expanded our product offerings into our Apollo product, which we will also talk more about today, and which helps us actually manage all of this software at scale. And our FedStart program also helps other companies essentially take advantage of that work and deploy their Kubernetes-based technology using Apollo in a federal government context as well. So we're very excited to be using this for ourselves, but also helping the entire defense industrial base by enabling others to leverage our 20 years of experience operating in this space. And we're super excited to be talking to you all today about some of our lessons learned too. So, starting with the challenges. I am going to do just a brief overview, as I mentioned at the very beginning, for those of you who are particularly new to this space. I wanted to talk at a very high level about what's involved in operating on these networks and deploying to these networks, just to ground a lot of the details that Vlad is going to go into. Vlad will talk a lot more about all of these things, but again, I wanted to give an overview to ground the conversation. Before anyone panics, I'm not going to talk about all of these acronyms.
They're good acronyms to know if you wanna work in the government space, so take note of those and look them up later. But we're just gonna talk about some of the ones here on the left. First thing to note, which may or may not be obvious: classified networks are completely isolated, air-gapped networks. You can't just access them from your laptop over the internet like you can with other environments, particularly in a private sector and commercial context. A quick note on terminology: high side is basically shorthand for classified networks. If you hear either of us say low side, we're talking about an unclassified environment; if we say high side, we're talking about a classified environment. So if you want to deploy high side, you first need access to a secure space, which are specific government-approved, accredited facilities that have the ability to access these networks. The other thing is there's no easy way to actually get your software there. Once you have access to the network, there's no automated mechanism; we'll talk about a lot of our solutions for this a little bit later, but there are not really a lot of automated mechanisms to actually get your software high side. So that is a significant challenge you'll face upfront as well: once you have access to the network, how do you actually get your software there? That's problem number two. And then once you have access and your software is able to be put on the network, you will also need what is called an ATO, or authority to operate. An authority to operate is essentially a government stamp of approval that says your software has met all of their security and compliance requirements, and it is able to hold government data and be used in a production federal government context.
So we bring this up because getting an ATO essentially involves satisfying hundreds of what are called NIST controls, and also STIGs, Security Technical Implementation Guides, which are specifically applied to operating systems, databases, and other specific pieces of your infrastructure. These controls span the gamut. A lot of them are very technical; FIPS encryption is a big one that I'm sure those of you who are familiar with the space know intimately. Things like SDLC, change management, audit logs. It really covers all the things the government wants to see from your software before they will sign off on it being used. So a lot of the challenges with operating in these environments are also related to meeting all of these requirements, which Vlad is gonna talk about in more detail. The final one we wanted to talk about is DISA, the Defense Information Systems Agency. We're going to talk a little bit about them as well. DISA is responsible for all of the DOD-wide networks and information technology, so they publish all of the DOD's authoritative IT security requirements, and they're also responsible for authorizing cloud service providers. If you run in high-side cloud, if you wanna do something like an IL-6 accreditation, DISA is the entity you'll be working with to do those things. So, all this to say, there's a lot, but hopefully that was a very quick overview to ground some of the conversation. And I'm gonna hand it off to Vlad to go into the technical specifics. Okay, we're gonna start with the compliance piece. As Ali mentioned, your software and system need to be STIGed. The way we tend to think about this is in layers: you need to STIG your operating system, you need to STIG your Kubernetes distro, and then you need to STIG whatever's running inside the Kubernetes cluster, for example, databases like Postgres or Oracle over there.
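To make the layering idea concrete, here's a toy sketch in Python of what checks at two of those layers might look like. The control names and config shapes are invented for illustration; the real checklist items come from the published DISA STIGs (for example, logging failed logins at the OS layer, and disabling API server anonymous auth and the kubelet read-only port at the Kubernetes layer).

```python
# Hypothetical sketch of "layered" STIG-style checks. The config dict
# shapes and keys here are illustrative, not the actual DISA checklists.

def check_os(os_config: dict) -> list[str]:
    """OS layer: e.g. failed login attempts must be logged."""
    findings = []
    if not os_config.get("log_failed_logins", False):
        findings.append("OS: failed login attempts are not being logged")
    return findings

def check_kubernetes(k8s_config: dict) -> list[str]:
    """Kubernetes layer: anonymous auth off, kubelet read-only port off."""
    findings = []
    if k8s_config.get("apiserver", {}).get("anonymous_auth", True):
        findings.append("K8s: API server anonymous auth is enabled")
    if k8s_config.get("kubelet", {}).get("read_only_port", 10255) != 0:
        findings.append("K8s: kubelet read-only port is not disabled")
    return findings

def scan(os_config: dict, k8s_config: dict) -> list[str]:
    # A continuous-scanning job would run checks like these on a cadence
    # and hand the findings report to the sponsoring agency.
    return check_os(os_config) + check_kubernetes(k8s_config)
```

A compliant configuration produces an empty findings list; anything else becomes a line item in the scan report you deliver on a cadence.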
The thing you need to do also is to satisfy the STIG at accreditation time, but then you also need to continuously scan that your system is meeting the STIG controls, and give those scans to your sponsoring agency on a cadence. For example, for the operating system, one generic STIG checklist item is logging failed login attempts. The major vendors all have their STIG checklists published: Canonical has one for Ubuntu, Red Hat has one for RHEL, and so on. One gotcha here is that there's a lag in when DISA publishes new STIGs for new operating systems, so this will limit the version of the OS that you can run. It's not that easy to say, a new version of the OS just landed, I will use it high side now. Then, moving up the stack to Kubernetes, I just listed two generic STIG checklist items: for the API server, you need to disable anonymous auth, and for the kubelet, you should disable the read-only port. The same as with the operating system STIGs, different Kubernetes distros have official DISA STIGs published with more nuanced checks here. And then, moving on to the cloud provider itself. As an overview, the hyperscalers have classified regions available. They're physically located inside the USA, and they exist for both secret and top secret classifications. We observed that these regions are much newer and less built out; services and features tend to lag before landing high side. We observed lag of, for example, a year, two years, and in some cases three years as well. One recent example is support for Amazon GP3 volumes. I think it landed in commercial in late 2020, like December, and it just landed high side a few months ago. Empirically, we also observed that these regions have less capacity. You're more constrained in what instance types you can use over there, and they also don't have that many available for you to use. And this is even harder with specialized instance types.
For example, accelerated computing for AI or ML workloads. We also observed that VM lifecycle operations, like launching an instance or deprovisioning an instance, take longer as well. Nowadays, I think most of us depend on SaaS services to build and run our production fleets. What we observed very fast over there: there are no Git offerings, no GitLab, no GitHub provider. Also, there's no centralized container registry; docker.io is not a thing over there, quay.io does not exist. This means you don't have access to your container images, but also you don't have access to your Helm charts over there. The theme continues with no observability provider, there's nobody in that space over there, and it continues with identity providers too; there's nothing you can use. So in the end, you're limited to what your cloud provider has to offer as managed services, or you basically just do it yourself: you bring these types of offerings over there and run them for your software. The last challenge I wanna mention here is around PKI management, certificate management. As Ali mentioned, the regions are air-gapped. They're on their own network, not connected to the public internet, and there's no public DNS registrar over there, so you can't just go and buy a domain to use. All the top-level domains are owned by different government agencies. Because of this, they also control the CA, the registrar that signs all the certificates for the domains. And because of this, the certificates are not included in the default trust bundle that your operating system has; you can't just run update-ca-certificates and have the CA. This is very tricky, because all the services that need to talk to cloud provider APIs need to establish TLS connections.
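One small, concrete piece of that trust problem is just assembling a single PEM file that contains the agency-maintained CA material alongside whatever else you trust. Here's a minimal sketch, assuming you already have the agency bundle as PEM text; file paths, and how you obtain the bundle, are entirely deployment-specific.

```python
# Sketch: merging an agency-maintained CA bundle into one trust file that
# can be distributed to hosts or mounted into containers. The PEM inputs
# here are placeholders; real bundles come from the owning agency.
import re

PEM_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.S
)

def merge_bundles(*pem_texts: str) -> str:
    """Concatenate PEM bundles, dropping duplicate certificate blocks."""
    seen, out = set(), []
    for text in pem_texts:
        for cert in PEM_RE.findall(text):
            if cert not in seen:
                seen.add(cert)
                out.append(cert)
    return "\n".join(out) + "\n"
```

Clients would then be pointed at the merged file, for example via `ssl.create_default_context(cafile=...)` in Python, or the equivalent TLS configuration in whatever runtime the service uses.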
The next one I wanna talk about is the software development lifecycle. The typical procedure to run software high side is, first, you need to pass a vulnerability scan and a virus scan. And once you pass those, how do you actually get your bytes, your software, high side? The procedure usually involves a human, called a data transfer officer, burning the bytes onto a DVD or a Blu-ray disc, putting it in a high-side computer, virus scanning it over there, and then you can run it. This actually limits how much content you can ship in one go. Now the data is high side, but you need to figure out how to deploy the service, configure it, and also upgrade it. Most of us are on call for these services, but how are you actually on call for something that you cannot access? You need to do it with people that have a security clearance. And this is very challenging, because I imagine most of us have dev teams all around the globe, and then you have people with security clearances only in the U.S. For example, I don't have a security clearance, but I'm still on call for services that run over there. The same goes for incident response. If you have an outage, your developers don't have any telemetry. They don't have access to any graphs, logs, or stack traces to debug their software. There's also no automated system to get this information from high side to low side. The procedure is actually a cleared human operator transcribing data from one computer to another. Maybe this works for small pieces of text, like a few log lines, but when you get to graphs, how do you actually transcribe a graph? You don't have a good way. And this is very hard, because you don't have a SaaS offering for an observability workflow over there; you need to bring your own for this. And now, moving to the solution space.
One second, I need to drink water. So, for the compliance part I was mentioning, the STIG controls: there are many Kubernetes distros that have published DISA STIGs. We use these two Kubernetes distros for our multi-node clusters. We use OpenShift from Red Hat, and OpenShift runs Red Hat CoreOS, which is an attractive operating system because every change you make on the OS itself is driven via a Kubernetes custom resource. You drive everything from Kubernetes; for example, if you want to land a file on disk, you just create a Kubernetes custom resource and the system does it for you. And this plays very well with the Compliance Operator tool that OpenShift has. You can think of the Compliance Operator as a tool that is able to apply a STIG profile, scan, and then remediate findings that violate the profile itself. For single-node or edge clusters, we chose RKE2. It also has an official DISA STIG, and we run it on top of RHEL 8. Applying the STIG profile is a bit more manual in this case, but after you do it, you can continuously scan with tools like OpenSCAP, which is a NIST-validated scanner. Now, moving to the infrastructure part, the PKI management. Before we were running Kubernetes, we were just loading the CA bundle that the government agencies maintain into the operating system trust bundle, using Puppet or other mechanisms. But when we moved to Kubernetes, that didn't matter anymore, because container images can have their own file system or no file system at all; you can run distroless and not have any CA in your container image. So back in 2017, we created an API that allows pods to request CA material on the fly. The way we do it is we mutate the pod during admission with an init container and an emptyDir volume, and the init container downloads this CA material from the cluster and puts it in the emptyDir itself.
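The shape of that mutation can be sketched in a few lines. This is not Palantir's actual webhook, just an illustrative Python sketch of the patch a mutating admission webhook might apply to a pod spec; the volume name, mount path, and fetcher image are all hypothetical.

```python
# Sketch of the admission-time mutation described above: given a pod spec,
# add an emptyDir volume plus an init container that fetches CA material
# into it, then mount the bundle into every app container.
import copy

def inject_ca_init_container(pod_spec: dict) -> dict:
    patched = copy.deepcopy(pod_spec)
    volume = {"name": "ca-bundle", "emptyDir": {}}
    mount = {"name": "ca-bundle", "mountPath": "/etc/pki/injected"}
    init = {
        "name": "fetch-ca-bundle",
        "image": "internal/ca-fetcher:latest",  # hypothetical image
        "volumeMounts": [mount],
    }
    patched.setdefault("volumes", []).append(volume)
    # Run the fetcher before any app container starts.
    patched.setdefault("initContainers", []).insert(0, init)
    # Every app container gets the bundle read-only at a known path.
    for c in patched.get("containers", []):
        c.setdefault("volumeMounts", []).append(dict(mount, readOnly=True))
    return patched
```

In a real webhook this would be emitted as a JSON patch in the admission response rather than a rewritten spec, but the effect on the pod is the same.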
We're looking at swapping from this mechanism to newer options in the open source space, like cert-manager and so on. Now, about the reliance on SaaS services. We realized there's no CI platform high side, so we shouldn't go down the path of building software only high side, mostly because you don't have enough people that are cleared to do this. So we strongly believe in not developing high side. All the decisions we made here were around the idea that you need to be very good at building low side, and then just deploy and operate high side. For example, for our IL-6 SaaS offering high side, we deploy Prometheus for observability and Keycloak as an IDP. Thinking about it, you deploy a lot of open source software, and you deploy your own as well, and you need to minimize the components you actually need high side. One example of how we try to minimize this is with OpenShift, because over there you just need a client to install it, and then you need the machine image to get the cluster running. Then, moving to SDLC. Here we needed to build software that continuously passes security and compliance requirements, so a well-put-together vulnerability management story was very critical. We manage internally a golden container image that all our products use. All the patches get into that container image, it's kept up to date, and when it's ready, we trigger downstream builds for the rest of the products to pick it up. Now that everything is patched and up to date, we needed a solution to transfer things high side. But we don't transfer only software; we need to transfer assets as well. So we needed something very versatile to transfer all the things we manage. Here we started looking at using OCI artifacts, and this is very attractive because it makes everything look like a container image.
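The reason "everything looks like a container image" is so convenient is that OCI content is addressed by the digest of its bytes, so any payload, a Helm chart, a config tarball, model weights, gets the same uniform descriptor shape. A minimal sketch (the mediaType string here is a made-up example, not a registered type):

```python
# Sketch: computing an OCI-style content descriptor for an arbitrary blob.
# Registries and transfer tooling address content by this digest, which is
# what lets heterogeneous assets flow through one image-shaped pipeline.
import hashlib

def oci_descriptor(blob: bytes, media_type: str) -> dict:
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }

desc = oci_descriptor(b"hello-chart", "application/vnd.example.chart.v1+gzip")
```

Because the digest is recomputable on the far side, the high-side receiver can verify byte-for-byte that what arrived is exactly what was scanned and approved low side.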
And there are just a lot of tools for operating with container images these days. So, as I mentioned, it's very taxing to transfer things high side, and it's a very manual process. Hyperscalers offer something they call cross-domain solutions, which are designed to automatically transfer data between security boundaries. Amazon has two offerings, Diode and Diode software artifacts, that can help here. We actually included it in the various ATOs that we have, and with additional controls on top, we made it fully secure and compliant with the mission requirements. Overall, this reduced our time to get bytes high side from, in some cases, days, to single-digit minutes. And if you think about it, the end-to-end process of managing vulnerabilities, getting them patched, transferring software high side, and then orchestrating the lifecycle of the software is very, very complicated. So to help us navigate the process, we decided to build our Apollo platform, which Ali's gonna give some details about. All right, so yeah, last slide here. As Vlad mentioned, Apollo is really the connective tissue between all of the different components he just talked about. I mentioned it a little bit at the beginning: it is the tool we use to deploy across all of the classified networks that we deploy to, and also across cloud, on-prem, edge, all different types of form factors, but also to manage those installations on an ongoing basis.
So, bringing all these pieces together: we're able to configure our software against representative environments low side, automatically do all of the scanning and vulnerability management that Vlad talked about, transfer the software high side using what we call our binary transfer service, which, as Vlad mentioned, is based on AWS Diode, then apply the appropriate change management controls with the cleared operators that we have, and then automate the rollout of software on the high side, knowing that the configurations have already been tested and vetted, and that the software is secure and has gone through all the requisite scanning and change management as well. For example, since this is a Kubernetes-based talk: we do all of our minor and patch updates high side using Apollo, and that's what automatically facilitates all of those changes, just as one example on top of our own software. As for other high-side-specific features built into Apollo: if new CVEs are surfaced in the vulnerability scans we do low side, Apollo will automatically recall those products due to that vulnerability across the entire fleet, including high side. So even though these are air-gapped networks, it gives us the ability to have at least some metadata and information that we're able to use to take the appropriate actions high side, even though a lot of the relevant information is low side. Apollo also enables high-side-specific overrides for any configuration or information that cannot live in unclassified environments. For example, and a lot of you who are familiar with this know these challenges already, things like domain names are classified. So when you're talking about the DNS records for your front door and how you're configuring your cluster, that's just one example of something that you can't actually set up low side.
So we have a lot of override mechanisms in place as well where you can configure something but then on the high side, someone can go in and make sure that it's fully correct and those things are automatically applied as new versions of the software rollout. And finally, Apollo enabled us to also reduce the operational burden that Vlad spoke about by essentially standardizing our deployments and standardizing our operations and monitoring as a result, which enables our cleared operators to more effectively debug and reduces kind of the overall burden on our developers. So across our entire fleet, we have actually 90,000 updates across all of our microservices and environments every single week and we reduced our DevOps costs by 50% just by kind of relying on this tool and automation. So just small shout out to Vlad and his team. I believe they have 12 people and we have thousands of installations across the entire classified environment. So that's how we're able to do that. So that brings us to kind of the end of our planned content. So happy to answer any specific questions that you all have about your own journeys, deploying Kubernetes in classified contexts. If you have other questions about Apollo, happy to answer any questions now or come find us at our booth anytime this week as well. So thanks. I'll break the glass on this one because I was very excited for this talk. One question that I've had is I know that your product exists in other places besides the Secret Cloud. Obviously you're gonna be presenting about that soon. Did any of the kind of technical decisions you made either in Apollo specifically or in your actual products, did some of those choices that you had to make for Secret Cloud actually end up going back into your other non-secret products? Yeah, we actually deploy in a lot of places that are air-gapped for commercial entities like for example banks and other financial institutions. So I think we started actually the other way around. 
First, we deployed high side and realized, don't build things high side, and that translated and helped us with the rest of the air-gapped spots as well. Does that answer what you were looking for? Yeah, I think the only other thing I would add is that having to operate from the outset in environments where you don't have a lot of the centralized infrastructure helped here too, to Vlad's point, when we're operating in other air-gapped environments. We know how to run things in a more self-contained way, because we're not used to having all of these managed services that we can use for our software. I think for us that was actually really beneficial, and it proved to be a useful exercise that's relevant across a number of different contexts: how do you run things in a self-contained way, without a ton of dependencies and other SaaS technologies, when you know you won't have them in all the places you deploy? For example, in some of the places we don't even have a DNS server; we need to bring our own solution for this. Thanks for the great presentation. I have a vulnerability management question. You mentioned how the OS and Kubernetes have STIGs, so they're configured according to those STIGs. You mentioned how you have a golden image for your apps and they use that; that's how vulnerability management happens. But how about any of your open source software? Suppose you're using the latest version from upstream. What do you do when it has vulnerabilities that exceed the target remediation time? Yeah, for some of them we actually rebuild internally, and we patch things faster than the open source industry. But for some of those, we just need to wait for patches to land in the upstream solutions. For example, etcd, I think, ran some old Go version, and there was a Go CVE. You need to make the choice: am I gonna recompile etcd and take that hit?
Is that accepted typically? Yeah, that's actually what I was gonna say: a lot of this is also an ongoing conversation with the government partner you're working with. There are kind of three strategies for vulnerability management. One is, if you can fix it yourself by rebuilding your own version of it internally, do that. Another is to wait until they patch it; then you have a little bit of feature lag, but you can push it through. For things that are just straight-up vendor dependencies, but you need the new feature or the product, you can have a conversation with your government sponsor, who sponsors your ATO and is basically working with you on all the controls, and say, hey, this is a vendor dependency, here are my risk mitigations in place, I think this is a moderate or a low-level vulnerability and maybe something we can live with until there's a patch. So it's kind of case by case, and you need to work through things with your government partner, but that's pretty normal. There's no software out there that has no CVEs, so particularly when you're talking about moderates and lows, that's usually a conversation you can have to get to something that's workable, to enable you to meet the mission and also be secure. So I know that's not a perfect answer, but the reality is it's a little bit fuzzy. Hey, thank you so much for the very clear presentation. Working in this space, I know there are a lot of ways to muck this up, and I feel like you did a very clear presentation on that. I'm curious if you've worked with DOD Platform One in utilizing their Big Bang, I guess, Helm chart of Helm charts, when you're running your Kubernetes clusters and looking at the ATO. Yeah, that's something that we have interacted with a bit.
Yeah, Platform One, for those who don't know, is one of the software factories that the DOD has, and there are a number of others as well, that basically have container images that are approved to be used in these environments. So there, too, it's dependent on the specific deployment that we have, whether or not we're interoperating and using some of those container images, versus a separate SaaS environment that a different agency has spun up that doesn't necessarily use Platform One. So yeah, it's definitely something that you can interoperate with, and I think getting your stuff posted to Platform One makes it a little bit more readily available for use in the space, but it's also not something that's applicable across the entire DOD, because there are a number of other software factories. So we have places where we're integrating with that, but we also have places where we're doing more of a true SaaS model, where we're not interoperating with one of those, because that was the better thing for the specific contract. So anyway, happy to talk about that in more detail; we could talk about it for a long time, but the short answer is yes, very familiar, and the longer answer is that how we're interacting with it depends on the specific project. Okay, thank you. Hi, thanks for the presentation. One point that you mentioned was that the hyperscalers offer cross-domain solutions as a service. Does that mean that the transfers between low side and high side are facilitated by the hyperscalers? You're asking about Diode specifically, right, and the software artifacts? I was referring to one bullet point that you had in the slide about that facilitation. If that's Diode, I don't know. Yeah, it's about Diode. Yeah, so, I'm not sure how much detail I can go into about Diode, but yes, the Amazon services do the transfer for us over there.
Yes, so basically they have an approved service that enables more automated transfer. The prior manual way is you usually burn something to a CD, move it to the classified machine, and upload it there; Diode is an alternative. But you still have to get that approved by your government partner, and so we've built a lot of things on top; we call it our binary transfer service. It does automated scanning, puts together what's called a Diode manifest, and basically helps facilitate the transfer in a way that is secure and government approved. So you can't just walk up and use it for any use case; you do need it approved by your individual government agency. Thank you. Hi, thanks for the talk. During one of your slides, you mentioned you focus on developing low side, discouraging developing high side. And I think what we're seeing a lot in the community is the request from high-side customers to be able to develop high side. I'm wondering how you take that into account when you have situations like, say, models that depend on high-side data to be trained at the last-mile level, or just high-side development in general that can't be done low side. Yeah, I have a short answer to this, and I'm curious if you have other thoughts too. I think we're seeing the same thing, and ultimately our philosophy is, to the degree possible, if you can do it low side, do it low side; only do what you need to do high side. I do think training is a really good example of something where you typically do need the live data and to be running in that environment to make it as useful as possible. So what we've tried to do is do the vast majority of our development in places where we do have GitHub and all these tools that we rely on to do that.
And then, on the high side, do very scoped workflows, if that makes sense, where we can actually build the product around it, and you don't necessarily need an entire development environment to do that. You can live with a more limited set of tools, because you know you're doing a more limited set of operations. Configuration overrides are one example; I think training is a really good other one. I'm curious if you have other thoughts that come to mind. Yeah, what I've seen is we are able to train some of these models with unclassified data that somewhat represents the data high side. I forget the actual acronym, I think it's CUI data. Yeah, CUI data, on higher classification stacks that are still in an unclassified environment. So we get more people to collaborate over there, and then we go high side and do the final thing with a limited set of folks. Thank you. It's kind of a bit of a 101 question, but we have a commercial SaaS platform that the government, DOD, has expressed interest in running on Kubernetes as a self-hosted option, and commercial aircraft companies as well. In your experience, do you find there are cases where the government is operating their own cluster and you're just installing software on that, versus, no, you're always building a new cluster, because they don't get Kubernetes and they want everything super isolated, so everything is its own cluster to begin with? It's a mix, is the answer to that question, and a lot of these answers have probably been unsatisfying because it's more or less "it depends," but certainly there are places where individual government agencies have Kubernetes clusters already and they want you to deploy to those environments. The Platform One question earlier, that's another one where they have a lot of Kubernetes clusters that you can deploy to. There are also places where they want you to bring it yourself.
I think for us that's actually even a little bit easier, because it means we can control a lot of the settings around it that optimize our product, versus deploying into a cluster where we don't necessarily know all of the limitations. But you can always advocate for your preference when you're going into scoping an engagement and say, okay, I prefer to do it this way, is that an option for you? Some government clients will have very strong opinions and others will have less strong opinions, so it tends to be a conversation, but it's definitely a mix. Thank you. We're out of time, but we can take questions over here on the side if folks want. Okay, thank you. Thank you.