Yeah, my name is Levani, and here with me is Jonathan. We work at TransferWise, and today we're going to talk about how we have secured our Kafka infrastructure using SPIFFE. First, I will briefly describe what Kafka is. Kafka is essentially a stream processing platform. We use it as our backbone for event processing, and most of the async cross-service communication happens through Kafka, so it's central infrastructure that is very important for TransferWise.

On the Kafka side we have different components: on the client side we have producers and consumers, and on the management and server side we have the Kafka broker. The Kafka broker is responsible for managing the clients as well as storing the data on disk. Usually in production you have many Kafka brokers. On the broker side we also have the notion of a topic, and a topic is a logical abstraction on top of the data that is stored on the Kafka brokers. Topics are split into multiple partitions, and those partitions can be on a single broker or spread across multiple brokers, and one broker is the leader for each partition. Clients talk to the leader in order to write data to a topic. Producers produce the data, which is appended to the log files that Kafka manages, and consumers can consume the data from those partitions in parallel.

I won't go into all the details around the configuration of the brokers, but I will cover some bits and pieces that are important for this talk. Before describing the technical details of how we have done this, we need to understand how the client-broker connection works. On the broker side we have a configuration that is called listeners, and this is where the Kafka broker process creates its server sockets. So, for example, in this slide.
In this slide we see that we create a new listener called PLAINTEXT_INTERNAL, which binds to port 9092, and we also have another listener called SSL_INTERNAL, which binds to port 9093. On the other hand, we have another configuration called advertised.listeners, and this is the configuration that is used by the clients in order to communicate with the brokers. This must be unique for each broker across the cluster. Here we are using a DNS record, for example kafka-0.internal, in order to uniquely reach the broker, and this is what the clients will use in order to communicate with that single broker. Since clients need to talk to the partition leaders, they have to be able to reach a particular broker. We also have listener.security.protocol.map, where we define the security protocol for each listener. Here we are saying that PLAINTEXT_INTERNAL is plain text, so no security, and on SSL_INTERNAL we expect SSL.

On the client side things are a bit more boring: we just have bootstrap.servers, and the client uses these bootstrap servers in order to get the cluster state. So, how does it all work together? On one side we have a broker, and then we have a client. First, the client will use the bootstrap.servers configuration and ask the broker: can you please give me the cluster state? The broker will return the cluster state, using the advertised.listeners configuration of each broker, and pass it to the client. The client will then use this configuration to identify the leaders of the partitions and how to reach them. So this is the flow of how broker-client communication works. Now that we have covered the basics of the connection, let's see how mutual TLS can work in a traditional manner for this kind of setup.
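Put together in `server.properties`, the broker settings described above might look roughly like this (listener names, hostnames, and the conventional Kafka ports here are illustrative, following the example in the talk):

```properties
# Two server sockets: one plain text, one TLS
listeners=PLAINTEXT_INTERNAL://:9092,SSL_INTERNAL://:9093

# Unique per broker -- this is what clients use to reach this particular broker
advertised.listeners=PLAINTEXT_INTERNAL://kafka-0.internal:9092,SSL_INTERNAL://kafka-0.internal:9093

# Security protocol for each listener name
listener.security.protocol.map=PLAINTEXT_INTERNAL:PLAINTEXT,SSL_INTERNAL:SSL
```

On the client side, the corresponding configuration would be just `bootstrap.servers=kafka-0.internal:9092,kafka-1.internal:9092`.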
So, for example, on one hand we have a client with a client keystore that contains the client certificate, and on the other hand a broker with a broker keystore that contains the broker certificate, and in this example they are signed by the same CA. How would mutual TLS be established? First, the client makes a connection to the broker. The broker presents the broker certificate to the client, and since we are talking about mutual TLS, the client will try to verify the broker certificate against the CA. Since they are signed by the same CA, everything passes here, so we are all set. Then the client presents the client certificate to the broker, and the broker does the same thing and verifies the client certificate against the CA. When everything is done, as a final step, the connection is established. So this is the classical flow of how mutual TLS is established.

Now, this setup obviously has problems. The first one is that long-lived certificates are very hard to manage. At TransferWise we have more than 300 microservices, and certificate management for each one of them would be a nightmare for any platform team. We also have a diverse set of clients; not every client library in every language supports mutual TLS well, so it might not even be possible to implement something like this in some languages. And if we are talking about migrating to this new setup, it would be quite hard to do with that many microservices in our infrastructure, because it would require code changes on the client side. So this is where we started looking into SPIFFE with SPIRE, to see how we can utilize this technology in order to automate some of these processes. So let me now describe how it works.
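For reference, the traditional keystore-based mutual TLS setup sketched above corresponds to Kafka SSL configuration along these lines (paths and passwords are placeholders; the client side mirrors the broker side with its own keystore):

```properties
# Broker side: present the broker certificate, require and verify client certs
security.inter.broker.protocol=SSL
ssl.keystore.location=/etc/kafka/broker.keystore.jks
ssl.keystore.password=changeit
ssl.truststore.location=/etc/kafka/ca.truststore.jks
ssl.truststore.password=changeit
ssl.client.auth=required
```

Every one of those keystores and truststores has to be issued, distributed, and rotated per service, which is exactly the management burden described above.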
So, as I mentioned, on one side we have a microservice that runs inside Kubernetes. This microservice talks to Envoy, and Envoy runs as a sidecar. Envoy talks to the SPIRE agent over SDS, so this is the classical setup. On the other hand we have a broker, and we need to make the connection between the microservice and the broker work over the Envoy proxy.

In order to do that, on the broker side we have to add some additional configuration. For the listeners, we can define a new listener named ENVOY, which binds to port 9094, and we can define the advertised listener for ENVOY, which will be localhost:9101. Now, this is an important bit here, because before we used a DNS record, or maybe an IP, to uniquely identify the broker. But now we just have localhost, since Envoy runs locally as a sidecar next to the microservice, so we need to use a different port to uniquely identify each broker. That's why this port needs to be unique for each of the brokers. We also have listener.security.protocol.map, where we are saying that the ENVOY listener needs to be over SSL.

Now, on the Envoy side, we can have a static Envoy config, as simple as a mapping that says localhost:9101 needs to be proxied to kafka-0.internal:9094 and localhost:9102 needs to be proxied to kafka-1.internal:9094. With that we have the mapping between the local ports and the actual Kafka brokers that are running inside the infrastructure.
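The additional broker-side configuration described above might look like this (listener names and ports are illustrative; note that the advertised port must differ per broker, e.g. broker 0 advertises 9101 and broker 1 advertises 9102):

```properties
listeners=PLAINTEXT_INTERNAL://:9092,ENVOY://:9094

# Broker 0 advertises the local Envoy port; broker 1 would advertise localhost:9102
advertised.listeners=PLAINTEXT_INTERNAL://kafka-0.internal:9092,ENVOY://localhost:9101

# Traffic arriving on the ENVOY listener is expected to be TLS
listener.security.protocol.map=PLAINTEXT_INTERNAL:PLAINTEXT,ENVOY:SSL
```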
With this setup we have the connection figured out, so the connection between the microservice and the Kafka broker will work over Envoy. Now we need to teach the Kafka brokers to understand SPIFFE IDs. For this we are using the java-spiffe library, which makes a connection to the SPIRE server through the Workload API socket that the SPIRE agent creates on that same host. With that, the integration between the broker and the SPIRE server pretty much works out of the box; we have to change some configurations, but nothing that serious. By integrating the java-spiffe library into the Kafka broker, we automatically get SPIFFE trust manager and key manager implementations, so we can tell the Kafka process to use SPIFFE as the key manager algorithm and as the trust manager algorithm. We also tell Kafka to make client authentication required, so we expect an identity from each of the clients.

How does it work in practice now that we have this integration ready? On the microservice side, the client will have localhost:9101 in its bootstrap servers and will use it to make the connection. It makes a plain text connection to the local Envoy, so it doesn't need to know, and doesn't even expect, that it is talking to the broker over TLS. So there is a plain text connection to Envoy on this port, and Envoy makes the connection to the Kafka broker and upgrades it to mTLS, because we have SPIRE agents running in both places. And the SPIRE agents on the microservice side and on the broker side both talk to the SPIRE server, and they get certificate rotation from it.
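Assuming the java-spiffe provider is on the broker's classpath, the configuration change described above boils down to a few properties (the `Spiffe` algorithm name is how the open-source java-spiffe library registers its key and trust managers; treat the exact values as illustrative and verify against the library version you use):

```properties
# SPIFFE-aware key/trust managers supplied by the java-spiffe provider;
# SVIDs and trust bundles come from the local SPIRE agent's Workload API
ssl.keymanager.algorithm=Spiffe
ssl.trustmanager.algorithm=Spiffe

# Require a client certificate (an X.509-SVID) on every connection
ssl.client.auth=required
```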
The SPIRE server manages all the rotations, so we get this nice automatic setup out of the box. Now I will hand over to Jonathan, who will talk in more detail about how we run Envoy at TransferWise. Thank you.

So, Levani mentioned Envoy quite a lot there. Why Envoy, and how are we using it at TransferWise? Well, we already have Envoy in place: we have a full service mesh setup already. It is deployed across our entire estate, so across our Kubernetes clusters, where it's present as a sidecar in all the service pods. It's deployed on EC2 instances where there are services running on those instances, and it's deployed in various data centers around the world where we're running services. This is used for all service-to-service calls, and those calls are all secured using X.509 SVIDs provided by the SPIRE agent via the Envoy SDS protocol. This is driven by a homegrown control plane. As this is present everywhere across our infrastructure already, the components were all in place to just add Kafka support. Next slide, please.

So how did we go about adding Kafka support? Today the Kafka brokers are static, so we didn't even touch the control plane. Our Envoy configs are templatized, with Helm for Kubernetes and Jinja2 for everything else, and so by having a static list of Kafka brokers for each of the clusters, we can just feed that into the templating and produce bootstrap configs which already know about all the brokers. Static configuration is not ideal, but it was an easy way to get started and get this into production, and it avoided touching the control plane as well. Next slide, please.

So what does the config look like? Well, we have a listener configured in Envoy for each broker that we need to talk to, and each listener has a cluster associated with it.
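The templating idea is simple enough to sketch; this is an illustrative Jinja2 fragment, not TransferWise's actual template, generating one local listener address per broker from a static list:

```jinja
{# brokers = [{"name": "kafka-0.internal", "local_port": 9101},
              {"name": "kafka-1.internal", "local_port": 9102}] #}
{% for broker in brokers %}
- name: kafka_broker_{{ loop.index0 }}
  address:
    socket_address: { address: 127.0.0.1, port_value: {{ broker.local_port }} }
{% endfor %}
```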
Each cluster has a single endpoint, which is the Kafka broker that the listener is intended for. The listener is configured with the Envoy TCP proxy filter, and the cluster is configured with a TLS transport socket, which pulls the certificate information and the trust bundle from the SPIRE agent via the Envoy SDS API that the agent supports. We've also got the listeners configured with the Kafka filter. This is completely optional; it's not essential at all, and it doesn't do anything for routing traffic. All it does is parse the messages as they go by and produce some very basic metrics about them. It's nice to have the additional metrics, so that's why we have it turned on.

With this configuration you now have a listener per broker, and so your bootstrap servers on the client side look something like localhost:9101, localhost:9102, and so on in this case. This is still not perfect, because it means that the client still needs to be configured with this list of ports, and if you want to renumber brokers, or remove brokers, you'd have to go around changing client configurations. So we can make this slightly better and make bootstrapping even simpler. Next slide, please.

We do that by adding another listener, with a cluster that has an endpoint for every single broker. So in this example you have a listener on 9100; if you connect to that, it will connect you to any one of the brokers that it knows about. And that's enough: for bootstrapping, that broker will return the cluster metadata, which contains the advertised listeners for the rest of the cluster as well as for itself. That gives the client all of the individual ports it needs to connect to the individual brokers. This means the configuration is as simple as having a single port in the bootstrap servers for each client, and the client is now completely oblivious to how many brokers are really behind that bootstrap port. Next slide. So what problems are there with this?
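A stripped-down sketch of what one such listener/cluster pair could look like in a static Envoy config (cluster names, the SDS secret name, and the `spire_agent` cluster are illustrative; a real config also needs the SPIRE agent gRPC cluster and a validation context defined):

```yaml
static_resources:
  listeners:
  - name: kafka_broker_0
    address:
      socket_address: { address: 127.0.0.1, port_value: 9101 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: kafka_broker_0
          cluster: kafka_broker_0
  clusters:
  - name: kafka_broker_0
    type: STRICT_DNS
    load_assignment:
      cluster_name: kafka_broker_0
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: kafka-0.internal, port_value: 9094 }
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          tls_certificate_sds_secret_configs:
          - name: "default"
            sds_config:
              resource_api_version: V3
              api_config_source:
                api_type: GRPC
                transport_api_version: V3
                grpc_services:
                - envoy_grpc: { cluster_name: spire_agent }
```

The extra bootstrap listener on 9100 would look the same, except its cluster's `load_assignment` lists every broker's endpoint rather than a single one.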
Well, the static broker configuration is far from ideal. People want to deploy Kafka into Kubernetes clusters, for example, and other dynamic environments. Even adding new brokers at the moment is a pain: you have to regenerate a large number of Envoy configs and redeploy them. It's not as bad as having to change the client configuration for every single service, but it's still not ideal. We'll simply move the configuration into the control plane at some point in the future, when this becomes a significant problem for us. It's not a showstopper; we just wanted the simplest path to getting this into production that we could find, and avoiding touching the control plane was the best way to do that.

In terms of overhead, there were worries that introducing Envoy into the mix would add a lot of overhead. In reality, we found that offloading the TLS work to Envoy has better performance than having the Java client do TLS directly. Anyone who's dealt with the Java TLS implementation knows that it's not known for its performance, so with hindsight this probably shouldn't have been a surprise. We actually get lower latency and higher throughput running via Envoy. We're primarily a Java shop, so this was our main use case, and that's what we benchmarked most heavily. For non-Java implementations, we'd expect their own TLS implementations to be significantly better than the Java one, but in benchmarks the overhead of mTLS via Envoy compared to plain text straight to the broker wasn't actually that high at all, and we found it insignificant enough for our use case that we don't believe there's a problem for other languages either. Next slide.

So what did we gain from doing all of this? Well, unified service identity across our infrastructure, obviously a big win: SPIFFE IDs for everything.
So service calls are authenticated with the same SVIDs as we use for connecting to our Kafka clusters, which is brilliant. We didn't have to introduce java-spiffe, for example, or equivalents in other languages, to our clients. We didn't have to add TLS support to the Kafka libraries that don't have it, for a couple of languages. So diverse clients are not a problem either, because all of the client libraries support plain text out of the box: you can just speak plain text to Envoy and it all works. And of course the big win: no long-lived certificates to manage anymore. Everything is SVIDs managed by SPIRE, which is brilliant.

I've also included a couple of resources. There's a template for generating an example Envoy config in that GitHub repo. Another important component is the kafka-spiffe-principal library. We didn't write this, but it reads the SPIFFE ID of an X.509 SVID in Kafka and converts it to a Kafka principal, so you can then use it in Kafka ACLs. That is obviously an important bit of all this. And that's basically how we've secured access to our Kafka brokers using SPIFFE and SPIRE.

There are a few questions in the chat. I don't know if you'd rather read those or take them offline. Maybe we can take them offline; that's probably easier, to spare some time. Great. Well, this is fascinating. SPIFFE with Kafka is a very sought-after and desired integration. Thanks for assembling the pieces and lighting the path for others to be able to refer to the configuration, how you've gone about it, and also the insights you've gained throughout the process.
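Hooking a SPIFFE principal builder like the one mentioned into a broker uses Kafka's standard extension point and is a one-line change (the class name below is taken from the open-source kafka-spiffe-principal project; verify it against the version you deploy), after which ACLs can target the client's SPIFFE ID directly:

```properties
# Map each client's X.509-SVID SPIFFE ID to a Kafka principal for use in ACLs
principal.builder.class=io.okro.kafka.SpiffePrincipalBuilder
```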