So, my talk today is about the way we think about services, especially as we stack them on back ends, have different services access each other, and about the different methods of compromise these sorts of services face. To start off, we'll look at how we define security; the types of attacks we've seen in the wild against these infrastructures, especially as things have moved toward stacked services and microservices; the solutions people have today; other approaches to security that exist in both practice and academic work; and how we can achieve a more modern approach to least-privilege architecture for how services communicate with each other, one that builds on all the progress and work that's occurred so far.

I'm David Strauss. I'm one of the co-founders of Pantheon, which uses systemd extensively to run a container grid as well as a bunch of other services. I've also done work with the systemd project, and I work with the Drupal project on the security team there as well.

The way I like to define security is the CIA triad. There are a lot of different ways to define it, but the first two parts are pretty uncontroversial. One part is confidentiality: the idea that someone can't see information they shouldn't be privy to. Then there's integrity, which is the question of whether someone can actually alter data, tamper with systems, and damage the configuration or operations of the system. Pretty much everyone includes these two parts in security, but I think the important thing to also add is the concept of availability: we deploy systems because we want them to actually do work. We deploy them out on the web because we want them to be accessible to the public. This means we can't do the trivially most secure thing, which is to stuff a server in a closet disconnected from the Internet. That's a great way to keep it from getting hacked, but it's also a really bad way to get anything done. Some of these models also focus on how they can be deployed in a distributed fashion that doesn't have single points of failure and doesn't require ongoing engineering, implementation, and maintenance effort to keep using these systems.

A lot of this comes down to attack versus defense. The way attackers confront systems now is based much less on directly attacking the system they want to compromise and much more on achieving some sort of foothold in the target infrastructure, often for an extended period of time. Or they might exploit one system and use it as a foot in the door to exploit a different system. Some of the most damaging compromises of the past decade have occurred through methods like this. What this means is that we can't think about defending systems purely on the basis of a system defending its own resources. We also have to think about what sort of foothold a system provides for compromising other systems.

So, here are a few breaches over the past five to ten years that have involved this sort of foothold approach. In the case of Sony Pictures, for example, the attackers used phishing to convince a bunch of users to give up their credentials.
They sat on the system for approximately a year before doing the real payload damage, extracting data and threatening the company in various ways. So it's often these initial steps that allow the compromise to fester. The Panama Papers breach involved an instance of Drupal that had been unpatched for three years; the attackers used it as a stepping stone to compromise all the documents behind that infrastructure. And one of the most recent was Equifax, where the attackers used an Apache Struts vulnerability to get remote-execution capability on the web servers and then exfiltrate all the data from the database. There's one thing all three of these have in common: the thing that got attacked is not actually where the payload was. It was just a step toward it.

I have a fundamental problem with the way a lot of these breaches have been treated in the media and by security professionals, because in many cases what comes up is: they should have kept up with patches, they should have been managing their systems better. And that's true; I'm not saying it's not. But what I think is more important when we think about security for these sorts of systems is that one vulnerability shouldn't allow such a large breach. We should be building systems that, even if they remain unpatched, are resilient against an attack becoming severe, because patching can only go so far. When you have something like a zero-day attack, where a vulnerability gets discovered or announced well before any fixes are available, you have to rely on things like defense in depth and sophisticated methods of containing attacks that don't rely on every system being perfect. So yes, they should have kept up with the patches, but they also should have designed these systems so that one miss wouldn't be so catastrophic.

We're also facing new threats against each of the traditional controls. In the traditional case, we've had things like the trusted LAN and VPN: someone has an internal office network, they whitelist that office for, say, IP access to a system, and they require people to VPN in to reach those systems. This is where phishing and malware attacks come in: by compromising machines on the trusted network, attackers get a foothold from which to access the systems that were supposed to be protected behind it. The same thing occurs on data center networks. Network segmentation is a really common way to break up the networks between different systems, but if you can compromise something on the target segment, within or behind that trusted firewall, then you have your foothold, and the external barrier is no longer sufficient to protect the systems inside it. And then there are a lot of issues with the way systems like Equifax's manage things like encryption and access credentials. Encryption is almost unbreakable if it's implemented properly; the problem with most encryption is that people can actually get the keys and decrypt the documents. In many cases, encryption is weakest in the way its keys are managed.
In the Equifax case, for example, some of the data was encrypted, but the web application basically held the keys and everything else necessary to get at some of that data. So these are the traditional defenses. They're still useful, but as attackers have gotten better and better at achieving a foothold behind the defenses and then working from there, they're no longer the strong barriers they used to be.

And I think microservices, and the way we're handling container environments, stacked services, and cloud deployments, are actually raising the stakes for these sorts of systems. I say this out of love, because I want to save the security of microservices and service-oriented architecture, but I see a lot of problems with it as well. One way I often see people deploying these services is with an edge that acts as a sort of firewall, something like a virtual private cloud or some other network partition, with all the services thrown in there. What happens is we have something like an edge proxy performing whatever filtering, validation, et cetera there is, and forwarding requests on to these services. But this has an enormous amount of attack surface, much more attack surface in my mind than an equivalent monolithic application, because each of these services is built on a different framework and a different programming language, and they may have different vulnerabilities. For example, if one of the services behind the edge has a deserialization vulnerability that gives something like remote code execution, and I can get the edge to hand my request to that service, then I only need that one service behind the firewall to be vulnerable to have a launching-off point for talking to everything else. In this sort of scenario, it's often the case that there's not a lot of authentication once you're within that trusted boundary, which means one leaping-off point can quickly become a complete compromise of the system. So I don't think this approach is really where it's at, even though it's the easiest way to convert a monolithic service into these sorts of service-oriented architectures and microservices.

The next approach is more like what we're using at Pantheon, and it's becoming increasingly popular in container orchestration systems. The name it seems to have picked up is micro-segmentation. The idea is that rather than trusting one big boundary around everything, you have a lot of flexible boundaries and tunnels between systems. This can also be enforced by the more advanced virtual private cloud infrastructures, as well as by network-mesh underlay layers that create a virtualized, dynamic network allowing services to talk to each other only when they're allowed to. In our case at Pantheon, we deploy certificates into every one of our containers and then whitelist certain attributes of those certificates, so we get rid of the concept of a matrix where any service can talk to any service.
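To make that concrete, here's a minimal sketch of the kind of enforcement micro-segmentation implies once mutual TLS has verified the peer: an explicit allowlist of caller-to-callee edges keyed on certificate identity. The service names and the certificate layout (Python's `getpeercert()` shape) are illustrative assumptions, not Pantheon's actual implementation.

```python
# A minimal sketch of micro-segmentation enforcement: after mutual TLS,
# look up the peer's certificate identity in an explicit allowlist of
# permitted caller -> callee edges. Service names are hypothetical.

# Which client identities may call which services; anything absent is denied.
ALLOWED_EDGES = {
    ("edge-proxy", "profile-service"),
    ("profile-service", "marketing-service"),
}

def peer_identity(tls_cert: dict) -> str:
    """Pull a service identity out of a verified client certificate,
    here from the subject common name (ssl.getpeercert() layout)."""
    subject = dict(rdn[0] for rdn in tls_cert["subject"])
    return subject["commonName"]

def authorize(tls_cert: dict, local_service: str) -> bool:
    """Allow the request only if this (caller, callee) edge is whitelisted."""
    return (peer_identity(tls_cert), local_service) in ALLOWED_EDGES

# Example: a verified cert whose CN is "edge-proxy" may reach the profile
# service, but it could not reach the marketing service directly.
cert = {"subject": ((("commonName", "edge-proxy"),),)}
assert authorize(cert, "profile-service")
assert not authorize(cert, "marketing-service")
```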
The result is that we've somewhat reduced the amount of compromise that can occur, because if you compromise, say, that rightmost service, which only receives requests from these other things and isn't permitted to make requests to anything else, then at least not every single service is a launching-off point for a complete compromise of the whole grid.

There's an almost orthogonal approach that I've seen; this is my understanding of the approach taken by things like GraphQL. The idea is that the edge itself is untrusted: you have a fairly permeable boundary, you're not really relying on the edge to do much filtering, and each individual service basically defends itself, manages its own authorization, and validates the requests coming in. The edge just forwards requests to each of those services, and the edge is basically unprivileged. These two approaches have almost opposite benefits, and one of my goals here is to try to bring them together and find solutions that provide both types of segmentation without compromising the usability of the system.

(And the empty box in there, I'll fix that before I export the deck, but it's meant to be a neutral face.) I tried to visualize what happens with each of these segmentation architectures when something gets compromised. In the trusted-edge case, getting access to any system, whether the edge or an internal system, provides a great leaping-off point for an almost generalized compromise. Half the stuff I've seen on WikiLeaks usually comes from this sort of approach, whether through microservices or just the act of putting something like an email server on the same network as the website. Once they get one foothold, it's the end of the story.

The micro-segmentation approach we have for the Pantheon-style, very container-oriented setups basically trusts the edge to defend a lot of the internal infrastructure, and then de-privileges a lot of the deeper services. This means that if you compromise a deeper service, you can't use it as a launching-off point to attack everything else. But if you can attack the edge, you get an enormous amount of access, because the edge is fundamentally trusted to have validated and approved requests before they go back to the deeper services. The edge uses its own authority and reputation as the way things get done by the deeper services, and the deeper services will do anything the edge asks for.

And then going back to the opposite approach, with forwarded credentials, where the edge is de-privileged but the deeper services perform all the heavy lifting: on one hand, compromising the edge or the proxy at the boundary of the infrastructure is less of a problem. It still has a little sweat mark there, because no compromise is completely without consequence; compromising the edge would let you access the credentials of whatever users are active at the time, but not necessarily compromise the infrastructure in general.
But the downside of this model, where you forward things like user session data to the back-end services, is that if an attacker gains control of any of those back-end services, you're forwarding them user session data that provides full control of a user account, or full control of whatever that API token can do, for every one of those back-end services. So if I can get code execution on any back-end service that receives forwarded credentials, I can start harvesting those credentials and making use of them.

One would think you could combine these two approaches directly, but they actually have completely opposite philosophies, which is really why I started examining this problem. If you use micro-segmentation, you're leaning on the edge so that you can de-privilege the deep services in the system. And if you forward your credentials, you're relying on your deep services so that you can weaken the trust in the edge. So there's no trivial way to just combine the two models: you end up with some combination of the worst of both, because each relies on trusting one part of the infrastructure in order to distrust the other part.

So how can we combine this stuff? I started looking for answers in some new and some actually pretty old places. The first place I looked was one of the oldest sources of design around things like capabilities. How many people in this room are familiar with capabilities, not in the Linux kernel sense? Okay, I see a few hands. Kerberos has a really interesting model that took the better part of 10 or 15 years to iron out. This is just the diagram off Wikipedia, if you want to see it in more detail with more description. But the heart of it is that it decouples reliance on the services from the actual user credentials and user session data. What you have here is basically three phases. In the red arrows on top, the user is authenticating, proving they are who they say they are. Then, in the yellow area, the user is saying: given that you know who I am, give me a permit to access this specific service. In traditional Kerberos deployments this could be anything like a file server, a printer, an email server, even some web applications. And finally, that ticket granting permission to access the service is used to talk directly to the service; but the ticket is only useful for talking to that specific service. So in a traditional case like mounting a file system: you'd go through the first phase to authenticate yourself, you'd get a ticket for mounting a specific file system, and then you'd talk to the file server and say, here's my proof that I'm allowed to mount this file system. And that is what a capability is, in the non-Linux-kernel sense: something whose possession alone proves that you have permission to access something or do something. The file server doesn't need to know anything about the user, because the ticket alone proves that you're allowed to access that file system or have specific permissions on it.
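As a rough illustration of those three phases, here's a toy model in Python. HMACs stand in for Kerberos's actual encryption, and the key and service names are made up; it's only meant to show how possession of each sealed ticket carries the proof forward.

```python
# A toy model of the three Kerberos phases described above, with HMACs
# standing in for Kerberos's actual encryption. Names are illustrative.
import hmac, hashlib, json

def seal(key: bytes, claims: dict) -> dict:
    body = json.dumps(claims, sort_keys=True)
    return {"claims": claims,
            "mac": hmac.new(key, body.encode(), hashlib.sha256).hexdigest()}

def check(key: bytes, ticket: dict) -> dict:
    body = json.dumps(ticket["claims"], sort_keys=True)
    assert hmac.compare_digest(
        ticket["mac"],
        hmac.new(key, body.encode(), hashlib.sha256).hexdigest())
    return ticket["claims"]

TGS_KEY = b"ticket-granting-service-key"   # shared by AS and the TGS
FILESERVER_KEY = b"file-server-key"        # shared by TGS and the file server

# Phase 1 (red arrows): the user authenticates and gets a ticket-granting ticket.
tgt = seal(TGS_KEY, {"user": "alice"})

# Phase 2 (yellow): "given that you know who I am, give me a permit for this
# specific service" -- the TGS checks the TGT and issues a service ticket.
user = check(TGS_KEY, tgt)["user"]
service_ticket = seal(FILESERVER_KEY, {"user": user, "service": "file-server"})

# Phase 3: the file server validates the ticket entirely on its own.
# Possession of the ticket alone proves the permission -- the capability idea.
print(check(FILESERVER_KEY, service_ticket))
```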
But serious problems appear when you start stacking services. Since tickets are service-specific, you end up in a situation where a service has to be privileged and trusted for all the deeper services it might access. Say that file server is part of a federated system that talks to other file servers: it could forward the user's ticket, but that would mean the file service has basically the same access as the user on all those deeper systems. There's no great solution here; in fact, a lot of Kerberos was designed with the idea that you'd access a single monolithic service at a time, not that one service would front a bunch of others.

There was some work done on this in 2017, so it's actually pretty recent that progress has been made. There's a neat paper called "CapNet: Security and Least Authority in a Capability-Enabled Cloud," and it describes a neat way of stacking services using something called sealed capabilities: you can hand someone two tickets and say, this ticket is good for mounting this file system, and you can also hand in this other ticket, which grants permission to perform things with the deeper services. This is quite secure in my opinion, but it's also a huge break in the abstraction of the design, because with a nested service like this, the client now has to be aware of both things it's connecting to: it has to get a ticket for this service and a ticket for the service behind it. That couples the client more closely to the nested set of service implementations, it means the system granting the tickets has to be aware of all these different services and how they're nested, and it makes the whole system rather onerous to maintain, because every part of the system has to be aware of every granular part of the capability infrastructure.

Another thing I found inspirational is name constraints for X.509. I don't recommend using this, because the implementations are wildly inconsistent across libraries and often not enforced, so unless you really audit your libraries, I wouldn't touch it. But it's a neat concept. The idea is that you can create a limited certificate authority that's only able to issue certificates within a certain scope. In this example, you can have a CA that issues a certificate allowing the issuance of further certificates for any of the subdomains of all-systems-go.io; that intermediate certificate can then issue and sign things beneath it, and validation basically walks up the chain and asks whether there's a coherent structure of trust with the permitted subtrees.
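The chain walk can be sketched roughly like this; it's a hand-rolled illustration of the permitted-subtrees idea, not a substitute for a real X.509 library, and the domain names are just examples.

```python
# A sketch of the chain-walk check that name constraints imply: every name an
# intermediate signs must fall inside the permitted subtree of every CA
# above it in the chain. Domain names are illustrative.

def in_subtree(name: str, subtree: str) -> bool:
    """True if `name` is `subtree` itself or a subdomain of it."""
    return name == subtree or name.endswith("." + subtree)

def chain_is_coherent(chain: list[tuple[str, str | None]]) -> bool:
    """`chain` is root-first: (issued name, permitted subtree or None).
    Each certificate's name must satisfy every ancestor's constraint."""
    constraints: list[str] = []
    for name, permitted in chain:
        if not all(in_subtree(name, c) for c in constraints):
            return False
        if permitted is not None:
            constraints.append(permitted)   # this CA constrains everything below
    return True

# A root whose intermediate may only sign under all-systems-go.io:
ok = [("root-ca", None),
      ("ca.all-systems-go.io", "all-systems-go.io"),
      ("talks.all-systems-go.io", None)]
bad = [("root-ca", None),
       ("ca.all-systems-go.io", "all-systems-go.io"),
       ("evil.example.com", None)]
assert chain_is_coherent(ok) and not chain_is_coherent(bad)
```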
So I started putting the pieces of these inspirations together, and I came up with something that combines the ticket-granting tickets, as they're called in Kerberos, that is, the ticket used to request access to other services, plus the sealing, plus the name constraints. The strategy is that a user authorizes itself with an authorization service, very much like Kerberos, but instead of getting back something specific to a single service, it gets back something scoped. In this case, it gets back a token designed to provide access to any part of the user profile of user A. But the token is also addressed, and this is where the capability-sealing concept comes forward: the capability is only usable by whichever service is the destination. So the user can talk to a service here and say, I want to pull my user profile, and send this token along with the request; and service P, which is the profile service, can say, yes, they should be able to pull profile A.

But now let's say there's something like a marketing database holding marketing preferences, and pulling the profile also needs to pull data from that service. What this starts to support, much like the nested X.509 delegation, is that service P can create a new token that it signs and delivers alongside the first capability token. We get a nested structure where privileges drop further and further as requests go deeper into the infrastructure. Service M here only gets a token that permits access to the marketing services, so if you compromise service M, you can't capture any credentials useful for compromising other services; and if you compromise service P, you only get the material needed to abuse the tokens being passed to service P at that point in time. That represents a strict subset of what's possible to compromise compared to something like the GraphQL-style forwarded-credentials perspective or the micro-segmentation perspective. It allows combining both of those benefits by taking these proven security-model designs and not just stacking them next to each other but actually integrating them, and that reduces the attack surface quite a bit. And with that, I will open it up to questions. We actually have a question mic, if anyone has any. Actually, I don't know whether the question mic is the back one or not; test it. I think this one works. Anyone?

We need to send all these tickets among different services, and often we have a lot of services talking with each other. How does performance fare with this kind of system, with lots of tickets going back and forth? How does that compare with, for instance, just getting a grant for one ticket at the edge?

In terms of performance, one of the beautiful things about these sorts of systems is that they require very little external lookup. These tickets are usually implicitly validatable: they're digitally signed, they usually have an expiration, and the public key gets distributed through the infrastructure, so each service is equipped to validate the ticket that rides along with the request and make a full authorization decision without consulting any other services.
The possible exception is revocation, but often the solution for revocation in this sort of infrastructure is to use very short token lifetimes. With a five-minute token, if you can tolerate a five-minute revocation window, you don't actually need any revocation infrastructure at all, just synchronized clocks. I think there was a question from the back.

Could you talk a little bit about how the sub-request thing works? What gets signed? How does the cryptography work?

So, this does assume that each service, and the user, has some sort of public/private key pair identifying that entity; that's why there's a little script below each token on the slide saying "signed by user A" or "signed by service P." Would it be sufficient to talk about what happens in service P when it's trying to communicate with service M? Service P receives the user's request to download their profile, and that request arrives with the capability token A-sub-P, on the left, which has a scope covering the entire profile contents of user A. (I don't know if this has a laser pointer on it... oh, there we go.) You can see the scope here is pretty broad. Because the token is addressed to service P as its destination, service P can use it to sub-sign other tokens with its own identity. Service P has its own certificate or key, which isn't really shown here; it takes token A-sub-P, which is addressed to service P, and uses its identity as service P to sign token A-sub-M, which has only the scope of the marketing part of the profile, with a source of service P and a destination of service M. That's a delegated handoff to the deeper service, but the new token would not be allowed to have a scope exceeding the original's: you can only narrow the scope at each stage. And because each token is restricted to a certain destination, the broader-scope tokens aren't usable by the deeper services either. Even though service M will be in possession of token A-sub-P, it won't actually be able to use it, because token A-sub-P is not addressed to, not sealed for, service M. By using this chain, we can validate that service P had this authority and used its identity to delegate this token to M, tightening the scope, so the delegation never gives service M any authority beyond the marketing scope. Does that answer the question? Okay.
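To make that delegation step concrete, here's a sketch of the chain logic. The field names, the path-like scope syntax, and the plain "signed_by" markers are my own inventions for illustration; a real version would carry actual signatures from each issuer's private key, for example as JWTs.

```python
# A sketch of the delegation step: service P uses the capability sealed to it
# (token A_P) to mint a narrower token A_M for service M. Signatures are
# abbreviated to "signed_by" fields so the scope/sealing logic stays visible.
import time

def mint(issuer, source, destination, scope, ttl=300):
    # Short lifetimes double as the revocation story, per the talk.
    return {"signed_by": issuer, "src": source, "dst": destination,
            "scope": scope, "exp": time.time() + ttl}

def narrower(child_scope, parent_scope):
    # Scopes here are path-like strings: "profile/A/marketing" is inside "profile/A".
    return child_scope == parent_scope or child_scope.startswith(parent_scope + "/")

def validate_chain(chain, local_service):
    """`chain` runs from the user's token to the one we received. Checks:
    each token was delegated by the service its parent was sealed to,
    scopes only ever narrow, nothing is expired, and the last token is
    sealed to us."""
    now = time.time()
    for parent, child in zip(chain, chain[1:]):
        assert child["signed_by"] == parent["dst"]        # only the sealed-to service may delegate
        assert narrower(child["scope"], parent["scope"])  # privileges may only drop
    assert all(t["exp"] > now for t in chain)
    assert chain[-1]["dst"] == local_service              # sealed to this service
    return chain[-1]["scope"]

# User A's token for service P, and P's attenuated token for service M:
a_p = mint(issuer="user-A", source="user-A", destination="service-P",
           scope="profile/A")
a_m = mint(issuer="service-P", source="service-P", destination="service-M",
           scope="profile/A/marketing")

# Service M can honor the request. It also holds A_P, but A_P is useless to
# it, because A_P is sealed to service P, not to M.
print(validate_chain([a_p, a_m], "service-M"))
```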
Let me give you a mic. Oh, I think one of these mics has an issue. Okay, I think it's recording, but it's pretty bad.

What if this gets hacked? The saving grace of the authorization service is usually that you can break out any identity services from it, in the sense that whatever the user mostly interacts with, how they provide their credentials, how those credentials get checked, can be separate from the thing that actually creates the signed tokens. And really, the other saving grace is that it can be extremely small. This design relies largely on the fact that while this is a fairly central point of trust, it's a very small point of trust: there's not much application logic in there, just whatever lookup exists between user identity and authorization. Oh, one second. Did you know that you can defeat the timeout of the login on Fedora by pressing escape twice when you get the wrong password? It's something I use when I mistype, which is all the time. So, are there any further questions before we're out of time? Oh, there's one in the back.

You generally have to have something like this regardless of the infrastructure you have: either you embed it in all the services, or you break it out and have something central make authorization decisions.

Yeah, I think my question is about that, actually. In practice, it sounds like the drawback of this approach is that you have to duplicate your token-verification logic across all your services, which may be written in different languages and so on. How do you deal with that in practice?

The nice thing about a lot of these tokens and token verification is that you don't usually have to mix much business logic into it. Since the tokens implicitly contain the authorized scope, you basically just have to validate that the token is legitimate and that the specified scope matches whatever resource is being accessed on that service. In practice, that's a lot simpler in my mind than having services call out to some sort of RBAC system to figure out whether a particular user is authorized to do X or Y or Z on a service. And to get even more practical: if you start looking at libraries, things like JWT for example, there are JWT libraries for almost every major language and framework. Choosing a standard like that lets you embed validation across the board in almost all the frameworks you might use for microservices, and it builds in all the checks: verifying against the public key, checking expiration, checking certain scope constraints. You can actually put the scope constraints into the validation performed by most JWT libraries. You can do similar things with most other public-key infrastructure setups, but I'm quite fond of the JWT stuff because it has so much consistent validation across languages and frameworks that you can just include.
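For example, here's a minimal sketch of that per-service check using the PyJWT library. The claim layout is an assumption for illustration: "aud" stands in for the sealed-to destination, a "scope" claim carries the capability, and the public key is whatever gets distributed through the infrastructure.

```python
# A minimal sketch of per-service token validation with PyJWT, one of the
# JWT implementations available in most languages. Claim names assumed.
import jwt  # pip install pyjwt[crypto]

def verify_capability(token: str, public_key: str, local_service: str) -> dict:
    # Checks the signature against the distributed public key, the "exp"
    # expiration, and that the token is sealed to this service via "aud" --
    # all locally, without calling out to a central RBAC service.
    return jwt.decode(token, public_key, algorithms=["RS256"],
                      audience=local_service)

def authorize(claims: dict, requested_resource: str) -> bool:
    # The token carries its authorized scope, so the service only has to
    # match it against the resource actually being accessed.
    scope = claims["scope"]
    return requested_resource == scope or requested_resource.startswith(scope + "/")
```

A service in another language would do the equivalent with that language's JWT library; the check stays local because only the public key and a synchronized clock are needed.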
Hi. You've already spoken about the size of the services and about not having to call out to a centralized RBAC. I don't know if you've explored some of the claims that Istio makes to do similar sorts of things; can you give a comparison?

I'm not actually familiar with the claims of Istio; would you mind summarizing them?

So, in this case, delegated authentication for in-service requests, though the structure is slightly different in that it runs its own PKI. It would end up being a comment if I carried on further.

Maybe a comment is the right answer to this, but the real question for me is: do the credentials it forwards contain sufficient material for the back-end services to get the same privileges as the client that made the request?

It can do, but the architecture has more central points of failure, I guess, than this does.

The only thing I'd like to avoid with forwarded credentials is every one of those back-end services becoming privy to whatever API or session token the request came in with at the edge. At least, that's what I've seen with the GraphQL stuff; I haven't looked at Istio for it. I think we're probably out of time, yeah. Okay, thanks folks.