Hi, my name is Eric. I work at GitHub in the platform organization, and for the past 18 months my team and I have been working on making SPIFFE and SPIRE available to engineers internal to GitHub. We've been running SPIRE in production for the past year. GitHub's mission is to be the home for all developers; we are the world's largest Subversion repository hosting company, and we also host Git. That's also something we do. The goal of this talk is to provide something of a practitioner story: a team trying to make this available internally at a company that has around ten or eleven years of infrastructure opinions, is not fully running on a public cloud, but runs on multiple public clouds, some of which may start with the letter A. I really want to talk about two implementation details of how we operate SPIRE today. The first is how we operate our agents. The second is how we generate custom node selectors to support registration entries for vending SVIDs to workloads. And I'll try to wrap up with some takeaways, learnings, and outcomes that we've achieved on the team.

As a full disclaimer: this is how we do it, not how to do it. I'd like to thank Ben Barry from my team, who reminded me to give this disclaimer, because we don't want to present our work as what you should do. It is a contextual solution for our setup.

So, to start things off, let's talk about motivations. As I said a moment ago, GitHub runs its own data centers, and we run on multiple clouds. In the past two years we've been ramping up the product offerings and generating more traffic worldwide. We've actually taken measurements internally of TCP flows, and we've seen roughly linear growth in internal traffic. There are more things talking to each other inside the DMZ than before, and there are more data centers than before. On top of that, there's been a lot of hiring, and net new services and products: Packages is GA. I'm trying to remember what I can and can't talk about, but go look at the changelog, it's very well written. There are a lot of things coming out, and there's a lot of software behind what's coming out. In addition, the past two years have brought acquisitions: GitHub itself was acquired, and npm and Semmle were acquired. These acquisitions bring their own infrastructure, their own opinions, their own systems. So we took great pains to be sympathetic to how our colleagues are coming into the organization and how they want to work with us. How do we plan for all this variation in what runs at GitHub and where it runs?

We were initially interested in SPIFFE because it's extensible and open. For example, I think in Evan and Andrew's presentation they talked about the upstream CA plugins, of which SPIRE is itself one. We run Vault internally, and we don't necessarily want to build a parallel PKI infrastructure just to support SPIRE. The goal should be to reuse as much as we possibly can, and to leverage everything we have already done that works well. And then three, which I think didn't actually make the slide: we have workloads that sit behind L4 load balancers. Some groups use just JWT-SVIDs, some groups use X.509-SVIDs; we use both, so we can support your use case whether you're mediated by a load balancer or not.

Talking a little bit about the approach: good tools have gradations of power, and we're trying to make our platform offering as modular as possible.
So, visually, you could think of it as a pyramid where, as you go up, the area shrinks but the curation increases. At the very bottom we have the interfaces themselves, X.509-SVID and JWT-SVID; if teams were to conform to these on their own, they could potentially just be in spec, because this is an open standard. This is more of a utility than a strategic component of how GitHub uses technology. In the middle, we want to be the team that operates the centralized SPIRE infrastructure: runs the servers, manages the datastore, manages the infrastructure automation for agents, and provides a Workload API out of the box for teams in whatever execution environment they're running in. And for teams that don't necessarily want to deal with raw infrastructure, at the very top, where we hope to land almost everybody, are development tools: shared libraries and packages. We've also, like I think everybody in industry, developed a sidecar at one point or another; it's kind of like making your own web framework in 2020, everyone just does it. We've developed an external-authorization-speaking sidecar, external authorization as in Envoy, so we can use Envoy to inject and validate JWTs coming in and out of your service. This use case is particularly applicable for dynamic languages, where we may not want to go too deep into the app; we don't want to do too much surgery on the workload, or be intimately involved in the internals of something, when we're really just trying to mediate authentication using SPIFFE.

So I'm going to talk about how we initially approached the SPIRE setup with the agents. Take one: we started by experimenting with SPIRE running in Kubernetes. We run kube internally, and we wanted to leverage that team's good work and all of their gains in reliability and operability, so we wouldn't be managing VMs or metal ourselves. The reference architectures we've seen in the community are a Kubernetes Service fronting the SPIRE servers and agents running as a DaemonSet, one per node. We observed some issues after kicking this around for the first month or so, in particular with agents. DaemonSets can't be made highly available; they're unbounded in downtime between deploys, and you're relying on the kube scheduler to place the pod that replaces the DaemonSet pod. That was a challenge. Workloads also can't rely on SPIRE being available at startup because of this non-determinism in the scheduler, so all workloads, or whatever curation we provide to users, would have to implement some sort of retry or blocking mechanism to poll or wait for the Workload API. Not the end of the world, but another small piece of complexity, rather than relying on the invariant of a Workload API being there, ready and waiting for your workload on startup. And something that's a bit of a subtlety: the dual of that race condition is draining a node. We may actually drain the DaemonSet pod before we drain the workload, so if we pull the DaemonSet pod and SPIRE disappears, the workload may be waiting on or accessing the Workload API with no agent listening. Also, not everything we run is in kube, for very obvious reasons. We would have to synchronize infrastructure automation to cover agents both inside and outside of kube, and some kube nodes would probably need both, because there are workloads resident on kube nodes that run outside of kube, which would imply running two SPIRE agents. That was the challenge.

So we resolved these issues by, in our case, just running the SPIRE agent as a regular daemon. We avoid the problem of pod scheduling order by avoiding the Kubernetes scheduler entirely, and we make SPIRE part of the second-party software we lay down on a kube node before the kubelet starts to take work from the API server. This mitigates some of the race conditions; it's probably still good, resilient practice to poll or wait for the Workload API, but the problem is largely mitigated by making sure the Workload API is resident before the pod is started. And the dual maintenance goes away, because everything is just one set of infrastructure automation. That is actually the systemd logo; I went into Google image search. I think it's a green light being pointed to, or maybe it's an "OK"; I was debating this with somebody yesterday on a Zoom, but I've never seen it before. Record and play backwards. Maybe it's the missing VCR button, I don't know. One last thing: systemd allows an ordering of units, so for workload attestation we can say we'd like to start after the kubelet starts, so there's no false signal about errors being unable to contact the kubelet, things like that. Be kind and rewind.
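For a sense of what that unit ordering looks like, here is a minimal sketch of an agent unit; the unit names, paths, and flags are illustrative rather than our exact production configuration:

```ini
# /etc/systemd/system/spire-agent.service -- illustrative sketch only
[Unit]
Description=SPIRE Agent
# Start after the kubelet so the Kubernetes workload attestor doesn't
# log spurious "cannot reach the kubelet" errors at boot.
After=network-online.target kubelet.service
Wants=network-online.target

[Service]
ExecStart=/usr/bin/spire-agent run -config /etc/spire/agent/agent.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```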
So, if we're actually running this as a systemd unit, how do we expose it to pods? A redacted, modified version of the wall of YAML for the deployment is on the left. We take the underlying Unix domain socket, put it into a volume, and simply mount that into the container within the pod. It's a bit of a matryoshka doll, but, you know, pods live in templates inside of deployments; that's what this illustrates. The punchline is that the view from within the pod is identical to a workload running on metal or in a VM: the socket is in a well-known location and can be relied on to be there. We also don't use mutating webhooks or any pre-deploy machinery to place these in. Our experience has been that just instructing teams to add these few lines for the volume, and guaranteeing that whatever cluster they're running on provides this domain socket, gives us a lot of mileage and avoids a lot of magic. Folks have told us that they appreciate the transparency of how things work and what's actually going on. The other consequence is that the domain socket is available to things outside of kube on kube nodes as well. So there's that.
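Since the manifest on the slide is redacted, here is a minimal sketch of the same idea; the socket path, names, and image are illustrative, not our exact manifests:

```yaml
# Illustrative only: expose the node's SPIRE agent socket to a pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: example-service:latest
          volumeMounts:
            # From inside the pod, the Workload API looks the same
            # as it does to a workload on metal or in a VM.
            - name: spire-agent-socket
              mountPath: /run/spire/sockets
              readOnly: true
      volumes:
        # The agent runs as a systemd unit on the node and owns this path.
        - name: spire-agent-socket
          hostPath:
            path: /run/spire/sockets
            type: Directory
```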
The second thing I wanted to talk about today is generating custom node selectors. As I said earlier in the talk, GitHub runs in multiple clouds. We use multiple container orchestrators, and containers outside of orchestrators, which is also a lot of fun; we run Docker bare for some workloads. The consequence is that we can build a service once and run it N ways in M places. If we think about how to vend identity to all of these workloads, there's one dimension that's the same: the selectors we can gather about the workload itself, using whatever workload attestation mechanism we're using, maybe the Unix workload attestor or the Kubernetes workload attestor. But there's a second piece, the "where it runs," that's slightly more difficult, because we run our own sites and have our own internal APIs. There's nothing out of the box that knows about internal GitHub APIs, understandably, because they're private, not public, and not based on a public standard. So things like the node attestors for Amazon Web Services, GCP, and Azure, which are public products, don't apply to us. We had to do a little bit of extra work to propagate notions similar to what you get from those out-of-the-box node attestors into our selector library.

One thing I can share about how we run our sites is that machines have their own per-machine certs. So we can leverage the x509pop node attestor and use some of that cert key material to establish and verify the identity of something trying to phone home to the SPIRE server. The challenge with the x509pop attestor for us initially was that it produces a fingerprint, I think a SHA-1 fingerprint of the machine cert, which doesn't, at a glance, give you any semantic meaning about what the agent is or where it is; it's just an opaque checksum. So, using the agent_path_template in the x509pop attestor, with a little bit of Go templating we pull out the common name from the per-machine cert, which does contain the fully qualified domain name of the machine we're trying to bootstrap an agent on. So we go from an agent ID ending in a SHA-1 hash to one containing an actual fully qualified domain name when you run spire-server agent list.

Let's keep going. The consequence of having only this one verifiable piece of information, because the server doesn't necessarily trust the claims of the agent during node attestation, is that we have to key off of this one datum, and from the server side consult some other trusted API to gather more information about the agent and where it runs. Is it a Kubernetes node? Is it a file server? Is it a bastion machine? Things like that are not something we can take at face value from the node attestor. So we wrote a custom node resolver that pairs with the x509pop node attestor. We bundled it as an OS package, because the interface is just protobuf and gRPC. We provision the server with knowledge of an allow list of what metadata to pull back: every node has a set of metadata, and for the purposes of registration entries we only care about a subset of it. The real result is that we get extra selectors, specific to GitHub and how we run infrastructure, for use in registration entries. As a reminder, registration entries can be written with workload selectors or node selectors; that's the "what the application is" and the "where it runs" pieces of how you vend identity. This is a snippet of configuration to illustrate what my mouth noises actually mean in terms of what it would look like on a SPIRE server. What I've tried to highlight is that this is also x509pop, but it's not a node attestor, it's a node resolver. We pass both a plugin command and a plugin checksum, so we're not arbitrarily exec-ing random things on the system path.
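For the flavor of it, a rough sketch of the relevant server plugin stanzas might look something like this. This is not our production config: the plugin name, paths, allow-list keys, and especially the agent_path_template fields are illustrative and should be checked against the x509pop plugin documentation rather than copied from here.

```hcl
plugins {
  # Node attestation: prove possession of the per-machine cert's key.
  NodeAttestor "x509pop" {
    plugin_data {
      ca_bundle_path = "/etc/ssl/internal/machine-ca.pem"
      # Put the cert's CN (the machine FQDN) into the agent's SPIFFE ID
      # instead of the default opaque fingerprint. Template field names
      # here are from memory; verify against the plugin docs.
      agent_path_template = "/x509pop/{{ .Subject.CommonName }}"
    }
  }

  # Node resolution: a custom external plugin, shipped as an OS package,
  # that looks the node up in an internal inventory API and emits an
  # allow-listed set of extra node selectors.
  NodeResolver "github_inventory" {
    plugin_cmd      = "/usr/local/bin/spire-noderesolver-github"
    plugin_checksum = "<sha256-of-plugin-binary>"
    plugin_data {
      # Only these attributes become selectors; everything else
      # (e.g. top-of-rack switch) is dropped as superfluous.
      allowed_attributes = ["site", "role", "environment"]
    }
  }
}
```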
The way we distribute this plugin command is just as a base OS package, using our internal packaging machinery. The plugin data is interpreted by our own code in the node resolver to unpack the node attributes we want, as an allow list, and turn them into selectors. If we were to pull everything in, there would be no actual utility in knowing, say, what top-of-rack switch a machine is connected to; things like that are superfluous. The result is that we get extra selectors for SPIRE to use in registration entries. As an example: the verifiable claim, the common name out of the machine cert, is used as the key into our internal registry API to pull out these other selectors, which I've highlighted in white with a kind of canary-yellow box; it looked fine when I was making it, but the point is the same, it's all highlighted there. You can see that these are prefixed according to how we structure the node resolver, so it's our internal prefix, something like gh or github, rather than x509pop's subject or CA selectors. And these are now available to us for writing registration entries, in addition to workload selectors.
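To make that concrete, here is a hedged sketch of what writing entries against those selectors could look like with the spire-server CLI; the trust domain, SPIFFE IDs, and github-prefixed selector names are made up for illustration.

```sh
# Illustrative only. First, a node-aliasing entry: "any agent whose
# inventory-derived node selectors say it's a production kube worker".
spire-server entry create \
    -node \
    -spiffeID spiffe://example.internal/node-group/kube-workers-prod \
    -selector github:role:kube-worker \
    -selector github:environment:production

# Then a workload entry parented to that alias, keyed on workload
# selectors gathered by the agent's workload attestors.
spire-server entry create \
    -parentID spiffe://example.internal/node-group/kube-workers-prod \
    -spiffeID spiffe://example.internal/service/example-api \
    -selector k8s:ns:example \
    -selector k8s:sa:example-api
```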
So, trying to wrap it up with some takeaways. The benefits: we replaced custom, non-interoperable, roll-your-own authentication approaches with a single portable standard in SPIFFE and SPIRE. We didn't have to redo any existing infrastructure concepts; we didn't have to somehow shuttle information in from the inventory APIs, we just use them as any other client would, and SPIRE has a lot of points of extension to pull in more data and to verify more claims about agents. As a result, rather than having teams manually manage certs and onboard new clients, we've reduced the cost of experimentation for them: when you collaborate with other people in the engineering department, your new downstreams become just registration entries and identity documents you can verify from the caller. We've shifted a lot of work that could be thought of as strategy work into utility work. The future vision is to make this just a given for all teams at GitHub to use.

Some other observations we've made, on universalizing authorization: I think this is something we talk a lot about in the community, that SPIFFE is for authentication and authentication only, and has no opinion about authorization. That has actually been a point of leverage for us, because systems and teams have either invented their own, may have an interest in standardizing, or may not, and fine-grained ACLs are not something we want to try to provide parity with, in our opinion, at the top of the pyramid from one of the earlier slides. We really just want to give people documents that they can verify are good or not good. There's no goal to build policy languages, or to enforce policy languages to replace what we already have.

I think I'm bumping up against time, but one other observation we've made is that being forced to write registration entries to identify the shapes of workloads is a forcing function for discussions about blast radius and security perimeters. If you can't differentiate between two workloads cohabiting a machine with a registration entry, that means they have the same level of privilege, inappropriately. So, rather than looking at it as "oh, we can't isolate this thing from that other thing because they're running as the same user or they're in the same group," we inverted that and saw these exercises in discriminating between what gets which SVID as opportunities to improve our security posture. That has invited some interesting conversations about everything from how we handle certs to how we run certain things in kube. So, there's that. And I'm going to go to the last slide, which is just the silhouette, and stop sharing here.

You make a great point about recognizing workloads as granularly as possible. How granular is too granular? Is it enough just to distinguish one workload from another, or do you try to get as prescriptive as you can about every single property and aspect that identifies a given workload?

We start with teams, and it's situational. There are workloads that literally run in dual modes, so it actually necessitates them getting both documents even though it's the same executable; we have code paths that are both libraries and have main functions. So it depends, team to team. The perspective is, it's sort of like when you enter a codebase with fifteen million unit tests, and you change one character and half of them break. Avoiding that kind of fragility, with wide-enough registration entries, is probably the preferred approach, because if something is so brittle that you need a control loop to endlessly reconcile it by Docker image ID, or you need to keep track of node appearance and death, then that's probably the wrong shape. But every organization is different; mandates are different, priorities are different. We largely just partner with the teams, try to have the discussion, and facilitate what SPIRE can do for them.

That's great guidance. Thank you, certainly food for thought for the attendees. Thank you very much. And to echo the last comment in chat: Eric, you're an awesome presenter. Nice work.

Thanks. Thank you all for taking the time.