Hello everyone, and welcome to this talk, "Bringing SPIFFE to Linkerd for Mesh Expansion". In it we are going to take a look at how Linkerd, the service mesh, augmented its identity system to allow for meshing external workloads, that is, making workloads that are not part of Kubernetes part of the mesh. My name is Zahari Dichev and I am a software engineer at Buoyant, the creators of Linkerd. If you ever want to connect or chat, you can reach me on any of these channels; I am always happy to chat.

In this talk we are going to, first of all, make the case for mesh expansion: what is it, and why might organizations actually need that functionality? Then we are going to look at why workload identity matters both within and outside of the cluster, and what guarantees you get from the encryption and identity implemented in modern service meshes. We are going to take a look at how that specifically works in Linkerd with mTLS, and we are going to discuss some of the challenges that appear once you cross the cluster boundary and end up on a bare-metal machine that is not part of Kubernetes. Then we are going to look at how we leverage SPIFFE and SPIRE to solve these problems, and hopefully at the end we will have some time for a Q&A if you want to ask questions about any of that.

So first of all, let's take a look at what mesh expansion is and why organizations might need it. If I have to put it succinctly, mesh expansion is the process of making non-Kubernetes workloads part of the service mesh. The main motivation for doing that is to enable hybrid cloud infrastructure scenarios, where you have part of your infrastructure on, let's say, bare-metal machines or on-prem boxes, but you want to connect this infrastructure to Kubernetes and have the workloads in Kubernetes communicate with it as if it consisted of native Kubernetes workloads. The point is to be generic over the specifics of your infrastructure: whether a workload is hosted in a local data center or in the cloud should not really matter; you should still get all the nice things you get when using a service mesh, namely security guarantees and reliability features such as circuit breaking, load balancing, traffic policy, and so on.

So what are some of the specific scenarios that motivate a hybrid cloud infrastructure, and thereby mesh expansion? There are a few distinct cases, and I am sure there are more, but this is what we are going to look at. First, there is the integration of edge devices: nowadays a lot of organizations manage fleets of devices that just cannot run on Kubernetes, for example sensors out in the field running on specialized hardware. Another reason is cost optimization and the use of specialized hardware: there are organizations that need hardware that is either not available from cloud vendors or prohibitively expensive to obtain. And then there is the case of incremental infrastructure modernization, where organizations are trying to shift to the cloud gradually and want to migrate just part of their stack to Kubernetes.

Let's look at how each of these might look from an architecture perspective. Say you are working for an organization that has a Kubernetes cluster with a bunch of services whose purpose is to communicate with edge devices, and your team works with a dashboard that uses some of these services to observe those devices, looking at health data sent by the devices and all sorts of monitoring data around them. And then on the
other end you have devices such as watches that might be using some services in your cluster; you have mobile phones that also need a secure and reliable connection to the cluster; and you might even have software that runs on cars and also needs to talk to the cluster. Naturally, these things cannot live on the cluster, but you still want to expose the cluster to them, and vice versa.

Then you also might be in a position where you need specialized hardware. There are organizations, namely financial institutions, that use data from providers such as Bloomberg, Reuters, and the like, and they are in the business of gathering large volumes of financial data, say prices, which might land in your Kubernetes cluster, where they are persisted in a database. You have an analytics service that is supposed to work with that data, and a bunch of clients in your organization that use this service to compute statistical models on it; we are talking about things like Monte Carlo simulations or heavy AI workloads. A lot of these workloads are particularly well suited to running on specialized hardware, either specialized GPU chips or, as they call them, AI chips. You could be in a position where you want to host some of that hardware in your local on-prem infrastructure rather than in a public cloud, because it is either very expensive in the cloud, or the cloud vendors simply do not have these pieces of hardware, or you want to tweak something specifically and manage it yourself. But you still want the analytics service, in this case, to be able to talk to this hardware box and get the advantages of the fact that you are using a service mesh and Kubernetes.

There is also the case of cloud migration. A number of organizations are still running quite legacy software on-prem, things like COBOL services and mainframe APIs. I know most of us here probably live in a world where we are all running stuff on someone else's computer in the cloud and it is all virtualized, but this other world exists: there are organizations that are legacy in nature and in their tech stack, and they really want to move to the cloud using a Kubernetes cluster. Oftentimes they cannot just run this migration from day zero to day one, because that is a big bang and it is very risky. So they would shift some of their services onto the cluster, route their traffic through an API gateway or something similar, set up the grounds for this more modern infrastructure, and then gradually migrate the rest to Kubernetes. But while they are doing that, they need a connection to their legacy infrastructure in order to support operations. This way they can be very strategic about when, how, and at what pace they do this migration, and that gives them optionality, which is important.

With all of that said, there is one theme that keeps recurring, and that is the need for workload identity, security, and encryption on the traffic that is flying outside of your Kubernetes cluster. The way modern service meshes implement this is mostly through mutual TLS, and this is nice because it gives you a number of properties without which you can get yourself into quite hairy situations.

The first thing you get from workload identity and encryption with a service mesh is authenticity. This is the idea that the parties on each side of the connection can prove that they are who they say they are. It is important because without it you are open to attacks: imagine a service in a Kubernetes cluster tries to talk to some smartwatch. Well, if
you don't have a principled way of identifying that particular piece of hardware, you are exposed to attacks where another actor can pretend to be it. You also get confidentiality, which is the guarantee that no one can simply read the data you are sending. Without it, somebody could listen in on your data, and for some industries this is absolutely prohibitive; heavily regulated industries like healthcare and defense cannot allow themselves to be in that situation. You also get integrity: cryptographic integrity helps you verify that the data that is received is the data that was sent. Without it you are open to man-in-the-middle attacks that allow a third party to intercept and tamper with your data, which, as you might imagine, can lead to quite a lot of bad outcomes. On top of that, when you have mTLS you also get authorization, which means you can attach granular access policies to certain clients. For example, if you can identify a particular instance of a piece of hardware, you can configure the system to allow traffic to one service originating from that peer and deny traffic to another service.

So let's talk about how this is currently implemented in Linkerd: how do we achieve identity in Linkerd, and how does it work with Kubernetes specifically? Identity in Linkerd is effectively mTLS, mutual TLS with both client and server identities. Linkerd uses short-lived X.509 TLS certificates that each proxy is issued and that are rotated periodically, I think every 24 hours or so. The certificate is used both for proving the identity of the workload to its peer and for encryption of the data.

The identity attestation framework that Linkerd uses is based on service account tokens. When I say identity attestation, I mean that the service account token is the principal way for a workload to prove its identity and show that it deserves to be given a certificate. And because both client and server are authenticated, since this is an mTLS connection, you can apply traffic policies on both ends.

How does that work in practice in Kubernetes at the moment? Whenever a pod starts up, it ends up with a service account, either the default one or one you have configured. A service account token is issued based on that service account and mounted onto the file system of the pod. When the proxy fires up, one of the first things it does, since it obviously needs a TLS certificate to talk to other peers, is generate a private key that never leaves the pod, along with a certificate signing request (CSR), essentially telling the upstream authority that it needs a certificate with a particular identity. This identity is based on the service account: it is the service account name, dot, the namespace, followed by `serviceaccount.identity.linkerd` and the cluster trust domain. The proxy then packages both the CSR and the token and sends them to the identity service in the control plane, which acts as the CA. Now, this identity service needs to ensure that the workload actually is the workload it claims to be, because anyone can send it a certificate signing request with some demands. To do that, it checks the service account token that is part of the request against the Kubernetes API, essentially looking for the answer to the question: does
this service account token belong to the service account that I am being asked to issue an identity for? If that is the case and the request is greenlit, it replies with an X.509 certificate that contains the identity in DNS-like form, in the DNS subject alternative name metadata of the certificate.

The certificate is used as follows. When a proxy initiates a connection to another proxy, the server first presents its certificate, and the client verifies it to ensure that the peer it is talking to is indeed the peer it intended to talk to. When that is done, the same happens the other way around, as in a normal mTLS handshake: the client presents its certificate, and the server verifies it cryptographically against the trust roots and parses it to extract the identity of the client. Now the server knows which particular client is talking to it, so it can consult the control plane for traffic policies and determine whether this particular traffic should be allowed. For example, there could be a policy that allows traffic only on port 8080, from this set of TLS identities, on this particular HTTP path; these policies are layer-7 aware, so they are protocol-aware policies.

This is all great in terms of security: we get identity, we can attach policies to it, we get encryption, and it has been working quite nicely in Kubernetes for a while. But what about workloads that live outside of Kubernetes? The moment you cross that bridge, you end up with a couple of problems. First of all, there is no real way for an external bare-metal machine or VM to attest its identity to the control plane, because it does not possess a service account token, which is the main mechanism for attesting an identity and getting a certificate.
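To make the identity and policy story concrete, here is a small Python sketch of the ideas above. The function names and the policy structure are mine, made up for illustration, and this is not Linkerd's actual code; only the DNS-like identity format reflects what was just described.

```python
# Illustrative sketch only: how a Linkerd-style identity string is derived
# from a Kubernetes service account, and what a server-side policy check
# against a verified client identity might look like. Names and the policy
# layout are invented for this example; this is not Linkerd's implementation.

def linkerd_identity(service_account: str, namespace: str,
                     trust_domain: str = "cluster.local") -> str:
    """Build the DNS-like identity that ends up in the certificate's SAN."""
    return (f"{service_account}.{namespace}"
            f".serviceaccount.identity.linkerd.{trust_domain}")

# A toy policy: for each port, the set of client identities allowed in.
POLICY = {
    8080: {linkerd_identity("dashboard", "monitoring")},
}

def is_allowed(client_identity: str, port: int) -> bool:
    """Decide whether traffic is allowed, after the client's certificate
    has been cryptographically verified and its identity extracted."""
    return client_identity in POLICY.get(port, set())

dashboard = linkerd_identity("dashboard", "monitoring")
print(dashboard)
# dashboard.monitoring.serviceaccount.identity.linkerd.cluster.local
print(is_allowed(dashboard, 8080))                                # True
print(is_allowed(linkerd_identity("intruder", "default"), 8080))  # False
```

The point of the sketch is simply that once the server trusts the identity in the certificate, authorization reduces to a lookup keyed by that identity.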
That makes it hard to obtain a certificate, which makes everything else hard, because you need TLS to actually be part of the mesh and get all the benefits. So what are some potential solutions to that problem?

For starters, you could use some mechanism for manually provisioning TLS certificates to these boxes. This is not ideal, because part of the appeal of using a service mesh is that you get certificate management for free: you do not really need to care about rotating the certificates, how long they last, what the chain looks like, or where the CA lives. You want to keep that, because this is error-prone stuff, especially at scale.

You could manually distribute service account tokens: use the Kubernetes API to get yourself a service account token for a particular service account, then put it on the box you want to mesh. But then, how do you actually do that securely? You need to come up with a process for it, and that is hard and error-prone, again especially at scale. If it is just one machine, sure, you can use a flash drive, but if it is hundreds or thousands of machines, that gets harder and harder, and it also carries a lot of management cost around keeping track of the rotation and expiration of these tokens.

You could use other mechanisms such as client grants, access tokens, JWTs and the like, but those are just different pieces of technology with different attributes.

Or you could use something that we have leveraged, which is SPIFFE, a framework for securely authenticating software systems across heterogeneous environments. That is a bit of a mouthful, but what it really gives you is a standard way to prove and obtain workload identities, which in essence is a replacement for the attestation bit
that we have in Kubernetes but are missing outside of Kubernetes. SPIRE is the reference implementation of this standard, and it is what we are now using in Linkerd. It has integrations with the major cloud providers and upstream CAs, it is very extensible and customizable, and it can serve identities in either JWT or X.509 certificate form, the latter being what we actually care about.

Some of the core concepts: the SPIFFE ID is a string that uniquely identifies a workload or a group of workloads; it is similar in purpose to the DNS-like identities we use in Kubernetes. An SVID is the document you use to prove your identity, either an X.509 certificate or a JWT token, and it contains the SPIFFE ID. And there is the Workload API, which runs locally on a node and which your workloads can use to obtain identity.

As for the core components, the SPIRE server is responsible for managing and issuing identities based on registration entries. It uses node attestation to identify agents that connect to it, essentially determining the identity of the particular node that the agent and the workload are running on, and based on the registration entries it can give the correct identity to the agent that connects. It can be extended with upstream plugins, so you can get it to talk to your upstream CA, such as Vault and the like. The SPIRE agent is what actually runs on the node; it exposes the SPIFFE Workload API to the workloads running on that particular node and serves identities to them. It also performs workload attestation, which is really about ensuring that the workload asking for an identity is indeed the workload it says it is, or rather about giving it the identity that is prescribed for that workload.

So how does this attestation work in practice, if you do not have a service account token to prove your identity? On the node level, you can use selectors that tie in with the infrastructure you are running on. For example, if you are running on AWS, the agent can introspect the node it is running on and determine whether the node has a particular instance ID pattern or carries a particular set of AWS instance tags, and based on that you can have a policy to issue a particular identity to this agent. The same works for the workload, but in a more granular fashion: when a workload connects to the agent on the node, over the gRPC API that the agent exposes on a Unix domain socket, the agent can introspect the workload. It can consult the kernel to determine a lot of things about the workload, such as which Unix user ID it is running under, which container ID it is running in, or which image that container is using. In that sense, you can define very rich and granular policies for distributing these identities.

These identities come in the form of X.509 SVIDs, and those are the actual TLS certificates: they are non-CA, leaf certificates, and they must contain exactly one URI subject alternative name, which is the SPIFFE identity. If you read the standard, there is a bunch of detail around how these identities need to be structured and how they differ from URIs; you cannot have query parameters or port numbers and so on, because those things do not make sense in this scenario. A SPIFFE ID really takes the form of the spiffe:// prefix, then the trust domain, and then the actual location of the workload, so a full ID could be, for example, spiffe://example.org/datacenter1/databases/instance-a.
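Those structural rules can be sketched in a few lines of Python using only the standard library. This is my simplified reading of the SPIFFE specification, not a complete validator, and the example IDs are made up.

```python
# Illustrative sketch: checking the structural rules for a SPIFFE ID
# mentioned above. The spiffe:// scheme and a trust domain are required,
# while URL features such as ports, query strings, fragments, and user
# info are disallowed. This is a simplified reading of the SPIFFE spec,
# not a complete validator.

from urllib.parse import urlparse

def is_valid_spiffe_id(candidate: str) -> bool:
    parsed = urlparse(candidate)
    if parsed.scheme != "spiffe":
        return False
    if not parsed.hostname:          # the trust domain is required
        return False
    if parsed.port is not None or parsed.query or parsed.fragment:
        return False
    if parsed.username or parsed.password:
        return False
    return True

print(is_valid_spiffe_id("spiffe://example.org/datacenter1/databases/instance-a"))  # True
print(is_valid_spiffe_id("spiffe://example.org:8080/web"))  # False: port not allowed
print(is_valid_spiffe_id("https://example.org/web"))        # False: wrong scheme
```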
So this is good: now you have a framework that allows you to obtain an identity in a secure fashion, on devices that are not part of the cluster, and you do not need service account tokens. How is all of that integrated with Linkerd at the moment to provide this feature called mesh expansion?

First of all, you have your Kubernetes cluster, and you want to mesh an external workload, to make it part of the mesh. Oftentimes, if you are running in production, you will have an upstream CA, such as Vault or AWS Secrets Manager or the like, that contains and stores your trust root, that is, the root certificate with its key. If you are running Linkerd, this will probably be integrated with something like cert-manager, which is responsible for periodically issuing intermediate certificates that the control plane uses to sign certificates for the proxies. Then you have your pods running in the cluster with their proxies, and they get these certificates periodically.

But now you also want to mesh an external workload. Say you have a VM with a bunch of applications running on it. There is a proxy on the VM, and traffic is going to get routed through this proxy, but it needs an identity. Essentially what needs to happen is that you host a SPIRE server somewhere, also integrated with your upstream CA, so it can issue certificates that are rooted in the same trust root as the certificates the in-cluster proxies get. The SPIRE agent then runs on the particular node where these applications and the proxy live, and the proxy contacts the SPIRE agent over the Unix domain socket using the gRPC API. The SPIRE agent knows that this proxy runs on a particular AWS or Azure instance, or under a particular user ID, and it hands it the corresponding identity.
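To give a feel for what the agent does there, here is a toy Python sketch of workload attestation as selector matching. The selector strings and the entry layout are simplified stand-ins of my own; real SPIRE registration entries also carry things like parent IDs, TTLs, and plugin-specific selectors.

```python
# Illustrative sketch: SPIRE-style workload attestation as selector matching.
# A registration entry maps a set of selectors to a SPIFFE ID; the agent
# discovers the workload's selectors (Unix user, container image, etc.) and
# hands out the ID of the first entry whose selectors are all satisfied.
# Simplified stand-in, not real SPIRE data structures.

from typing import Optional

REGISTRATION_ENTRIES = [
    {
        "spiffe_id": "spiffe://example.org/datacenter1/databases/instance-a",
        "selectors": {"unix:uid:1000", "docker:image_id:sha256:abc123"},
    },
    {
        "spiffe_id": "spiffe://example.org/datacenter1/analytics",
        "selectors": {"unix:uid:2000"},
    },
]

def attest(workload_selectors: set) -> Optional[str]:
    """Return the SPIFFE ID of the first entry whose selectors are a
    subset of what was discovered about the workload, or None."""
    for entry in REGISTRATION_ENTRIES:
        if entry["selectors"] <= workload_selectors:
            return entry["spiffe_id"]
    return None

discovered = {"unix:uid:1000", "unix:gid:1000", "docker:image_id:sha256:abc123"}
print(attest(discovered))
# spiffe://example.org/datacenter1/databases/instance-a
print(attest({"unix:uid:3000"}))  # None
```

The design point is that the workload never presents a secret; the agent derives the selectors itself from the kernel or the platform, which is why this works without service account tokens.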
Because the proxy now gets an identity that is rooted in the same trust domain, it can mesh with the proxies on the cluster, and TLS can be terminated. You get identity, you get encryption, you get all of the security features, and you can apply traffic policies, because you know exactly the identity of the proxy living over there on the VM.

This is pretty cool, but there is still quite a bit of work to do, and we have some future plans for extending all of that. Some of the things we have considered are unifying the identity models between Linkerd and SPIFFE/SPIRE: perhaps having a mode where SPIFFE IDs are used not only for external workloads but for proxies inside Kubernetes as well, or allowing Kubernetes proxies to obtain their certificates from SPIRE and not rely on the control plane at all. These things are perfectly possible; it is just a matter of changes that we need to get through, and they are not particularly challenging. We also need to work on private network support, for when there is no connectivity between two networks, which is on the networking rather than the identity side. And we are also thinking about DNS proxying, which will open quite a lot of doors for optimizations.

In summary, I would say that mesh expansion has applications across a wide variety of industries, and it is something we are going to see more and more need for, especially with the ongoing migration to the cloud, the integration of existing workloads, and the like. However, securing workloads is actually quite challenging, especially when you have completely different constructs compared to what you have in Kubernetes, and that is why we use SPIRE and SPIFFE: to leverage all of that and make the same things that we achieve in the cluster
work outside of the cluster as well.

And now there is some time for Q&A, I think. Thank you. I think there are microphones that you can use if you want to ask questions, or I can just repeat the question.

Q: What I am a bit unsure about is this request, where they say "hey, give me a certificate that is signed". How does it know that this is not a random request from an untrusted source? I think I got a bit lost.

A: Are you talking about the world of SPIRE and SPIFFE, or about Kubernetes?

Q: In the picture you had before, where the external workload's agent connects to the SPIRE server. I did not really get why the SPIRE server trusts this agent and gives it a certificate.

A: The SPIRE server has a mechanism to introspect that; that is kind of an implementation detail of SPIRE and SPIFFE. Essentially it has its own attestation implementations that integrate with the particular cloud provider, so it would use the APIs of the specific vendors to verify that.

Q: Oh, so it is an external mechanism, a trusted party that provides the info?

A: Yeah, essentially they use the underlying infrastructure as a source of trust.

Q: Okay, thank you. Hi, so for most of the talk, when you talk about mesh expansion, you are talking about expanding the mesh between the cluster and, say, a VM. But if we have, say, VM A that talks a bit to the cluster and also to VM B, will mesh expansion also work between the two VMs, completely outside of the cluster?

A: Yeah, absolutely. That is a matter of discovery rather than identity, but there is no reason for it not to work. For example, you could have two VMs represented as services in Kubernetes, because that is what ends up happening: each VM is a service in Kubernetes that can be addressed from services outside of Kubernetes, and it can also address services in Kubernetes. So if you have two VMs with corresponding services, VM A can just target the service that corresponds to VM B on the cluster and essentially use that as a discovery mechanism, and the traffic will flow between the two VMs without ever actually going through the cluster. So yes, that is essentially what you are asking about, and it is possible; we have verified that this works.

Q: Nice, thanks. Hey, thanks for the talk. So the last example used an AWS VM, and it used the AWS VM ID to allow it to be issued a certificate. How would that look for, I think you used the example of a smartwatch before, how would that work there? Would you have to preload the smartwatch with a certain certificate, or how would you allow it to be issued SPIRE certificates?

A: I think you would need to preload it. Again, I am a developer, not an operator; I do not really provision smartwatches with software. But as I said, this was just an example of a certain way to use the metadata attributed to the workload to attest your identity. If you look at the SPIRE implementation, there are a number of plugins that you could use, so I would not be surprised if there are plugins that tie into things like, and again, do not quote me on that, but I could imagine plugins being written where your identity depends on, say, the serial number of a particular device, with the agent being able to read that data via plugins. Then you end up with a different kind of metadata: each device will have some metadata somewhere, and if that is obtained securely by a trusted party like SPIRE, then that is fine to use. This metadata could range from instance IDs all the way to serial numbers on physical devices.

Q: Makes sense, thanks a lot.

All right, well, thank you a lot for listening in on this talk, and have a good couple of days.