My name is Anand, and I'm currently working as a staff engineer at Zscaler. Today I'm going to present production-grade Kafka on Kubernetes. We'll see how Kafka can be built in as a first-class citizen in Kubernetes, so that people who are familiar with Kubernetes can deploy Kafka with ease. Let's look at the agenda first. We're going to cover an introduction to Kafka, its capabilities, the typical traditional deployment, and how the Strimzi project helps in this overall design. We'll look at the overall deployment architecture and the Kubernetes operator design for Kafka, and we'll look at a demo as well. So, Apache Kafka. For folks who are new to Kafka, I'll take five minutes to quickly brush through the concepts. Kafka is an event streaming technology with the capability to handle trillions of records. It is essentially a commit log as its basic data structure. Since being open-sourced by LinkedIn in 2011, it has grown into a full-fledged streaming platform. A Kafka cluster is a set of brokers that handle the delivery of messages, and the brokers use Apache ZooKeeper for storing configuration and for cluster coordination. The typical leader election mechanism is also taken care of by ZooKeeper. As for capabilities: microservices commonly use Kafka for sharing data. It is highly useful when your data requires high throughput and low latency, and it guarantees message ordering. It provides a rewind-and-replay mechanism so that you can reconstruct your complete application state. Message compaction is provided, you can horizontally scale your cluster, replicate data to control your fault-tolerance modes, and retain high volumes of data for immediate access; all of this you get with Kafka. As for use cases, it's of course very popular in event-driven architectures.
It's also used in event sourcing to capture changes, message queuing, activity tracking, operational monitoring through metrics, log collection and aggregation, commit logs for distributed systems, and stream processing so that applications can respond to data as it arrives. In the majority of pipelines, if you're building pipelines, Kafka will be the central nervous system. There are certain concepts and terminologies; let's go through them quickly so that we share a vocabulary. A broker is a server or node that orchestrates the storage and passing of messages. A topic is a more logical concept: it provides a destination for the storage of data, and each topic is split into partitions. A cluster is a group of broker instances. Partitioning takes a single topic log and breaks it into multiple logs, each of which can live on a separate node in the Kafka cluster. This way, the work of storing messages, writing new messages, and processing existing messages can be split among many nodes in the cluster. The partition is a key concept in achieving high availability in Kafka. Each partition has a leader, which handles all producer requests, and followers, which replicate the data. In total, a Kafka cluster comprises multiple brokers; brokers contain topics that store and serve data; topics are split into partitions where the data is written; and partitions are replicated across brokers for fault tolerance. Let's look at the typical component interaction. Here you can see the Kafka cluster, which is the set of brokers, and then you have ZooKeeper. These internal communications are secured with TLS. If you want to build metrics on top of it, you will have a Kafka Exporter. If you want your clients to talk over HTTP, you can have a Kafka Bridge.
If you have use cases where you want to integrate an external system directly with Kafka, or Kafka needs to send data directly to an external system, that's where Kafka Connect comes into the picture. We can use source connectors and sink connectors to do this integration. Then Kafka MirrorMaker allows you to replicate clusters; it is mostly used for data replication scenarios. And that's pretty much the typical set of Kafka components you would see. Let's look at a traditional deployment. Just imagine there were no Strimzi project: what would a Kafka deployment in a Kubernetes environment look like? Basically, you would create StatefulSets, because you need persistent volumes; you need a StatefulSet in the end to make sure the commit logs you are storing are quickly accessible. So you create StatefulSets for ZooKeeper and the brokers, deploy the replicas, manage endpoints for external access, and manage the versions of all the resources. Remember that for a given broker version, you need the right ZooKeeper version as well. Then you have to build your own observability stack, perform upgrades and rollbacks, manage the scalability challenges, and also build a lot of tooling just to maintain this complete stack. So it's complex, and just imagine a production-grade scenario with more than 100 brokers and more than 20,000 partitions: how would that setup look? It's going to be very complex, given that you have to manage all of these resources yourself. And that's where Strimzi comes to the rescue. Strimzi provides a way to run a complete Apache Kafka cluster on Kubernetes, and it provides lots of deployment configurations.
For development, it's as easy as running it on kind, and for production it gives you many capabilities such as rack awareness, deploying across different availability zones, and applying taints and tolerations to make sure Kafka runs on dedicated nodes; all of that is possible. It also allows us to expose Kafka to end clients in a more secure way, providing access via NodePort, LoadBalancer, Ingress, and OpenShift routes. For security, it provides mTLS, SCRAM-SHA, and a layer of authentication plus authorization as well. The Kube-native management of Kafka is not limited to the brokers: Strimzi allows you to manage even the topics, users, MirrorMaker, and Connect, everything using custom resources. So it's a one-stop shop for deploying everything related to Kafka. This means we can use the Kubernetes processes and tools we're already familiar with to manage a complete Kafka application. The whole idea is to make Kafka a first-class citizen in the Kubernetes world. This benefits all of our SREs, because they are used to looking at Kubernetes resources, and now even our one-off critical application behaves like just another Kubernetes resource. That's where the Strimzi project comes in handy. Let's look at some of the features. It allows you to deploy and run Kafka clusters with a seamless installation, deployment, and upgrade process. You can manage all the Kafka components and their dependencies as well: whenever you deploy a particular version of a broker, it will make sure it spins up the right version of ZooKeeper. It gives you very configurable, secure access to Kafka. Upgrading Kafka is easy. Apart from the deployment process, you can also use it for creating and managing topics and users. So, all in all, anything related to Kafka is managed.
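To make the "everything as a custom resource" idea concrete, here is a hedged sketch of a minimal Kafka CR. It is not the exact manifest from the demo; the cluster name `my-cluster` and all sizing values are illustrative assumptions:

```yaml
# Sketch of a minimal Strimzi Kafka custom resource; names and sizes are illustrative.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true                    # TLS-encrypted internal listener
    config:
      offsets.topic.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: jbod                     # multiple volumes per broker
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
  entityOperator:                    # enables the Topic and User Operators
    topicOperator: {}
    userOperator: {}
```

Applying a manifest like this with `kubectl apply -f` hands the whole lifecycle to the Cluster Operator, which reconciles it into the StatefulSets, services, and certificates discussed later in the talk.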
So you don't have to look at any external tool to manage these things. Let's look at the design. In Strimzi, the majority of things are governed by operators, and different operators have their own responsibilities. The Cluster Operator is responsible for deploying and managing your complete Kafka cluster: it upgrades your brokers, upgrades your ZooKeeper nodes, and makes sure you have the right deployments and the right set of replicas running. All of that is governed by the Cluster Operator. The Topic Operator manages pretty much anything related to Kafka topics, so you can create Kafka topics on Kubernetes using Kubernetes CRs, and the same goes for the User Operator. In general, you are giving the operators the power to manage your Kafka clusters. At the same time, it also offers isolation of roles: if you are not interested in user and topic management, you can deploy only the Cluster Operator, use it just for cluster management, and use any other Kafka tooling to create the topics and users. So it supports these different deployment options as well. Let's look at the complete deployment architecture and how it would look in a particular scenario. I want to present a 10,000-foot view of how a Strimzi-based project gets deployed on your Kubernetes cluster. The left-hand side shows the Kubernetes cluster where all of your client applications are running; sorry, the right-hand side is all your services, where your clients connect, and the left side is your Kafka deployment. This is where you deploy your Kafka, and it runs on dedicated nodes; here you can see it's running on a dedicated Kubernetes cluster.
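To show what the Topic and User Operators work with, here is a hedged sketch of a KafkaTopic and a KafkaUser CR. The names `orders` and `order-service`, and the label tying them to a cluster called `my-cluster`, are illustrative assumptions:

```yaml
# KafkaTopic managed by the Topic Operator; names are illustrative.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  labels:
    strimzi.io/cluster: my-cluster   # which Kafka cluster owns this topic
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000          # keep data for 7 days
---
# KafkaUser managed by the User Operator.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: order-service
  labels:
    strimzi.io/cluster: my-cluster
spec:
  authentication:
    type: tls                        # operator issues a client certificate for this user
```

The `strimzi.io/cluster` label is how the operators know which Kafka cluster a topic or user belongs to.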
It is under a Kafka namespace, and you can see we have also divided them across different availability zones: the brokers as well as the ZooKeeper nodes sit in different availability zones, and the cluster is exposed as a LoadBalancer service. Any services trying to communicate with Kafka will use this load balancer. Of course, there are different ways to expose this. The ideal production practice, typically in an AWS scenario, is to have both VPCs peered and to expose the load balancer as an internal NLB; the internal NLB makes sure no outside access is granted, and only your applications can communicate with these load balancers. Similarly, you also have a path where your SREs can operate more efficiently: here you can see they talk directly to the operator. So let's look at how your Kafka works in this setup. You can see an operator which goes and deploys and manages your complete deployment. The operator talks to the API server and constantly reconciles to make sure you have the right number of replicas of your brokers and your ZooKeeper nodes running. SREs can also apply CRs and build more tooling on top of it, for example creating a Kafka Connect cluster or a Kafka MirrorMaker, and all of these are honored accordingly after your Kafka is deployed. Most of the time, the ZooKeeper load balancers are never exposed outside, unless there are certain use cases where you want to do debugging; otherwise, access is restricted pretty much to the SRE or ops teams only. So what does it mean, holistically, when you say a complete Kafka ecosystem is needed? What Strimzi offers is the complete ecosystem needed for production-grade systems.
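The internal-NLB pattern just described can be sketched as a loadbalancer-type listener carrying cloud-specific service annotations. The exact annotations depend on your cloud controller; the AWS-style ones shown here are assumptions for illustration, not part of the demo:

```yaml
# Fragment of spec.kafka: external listener exposed as an internal NLB (AWS-style annotations assumed).
listeners:
  - name: external
    port: 9094
    type: loadbalancer
    tls: true
    configuration:
      bootstrap:
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-internal: "true"
          service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
```

With an internal NLB, only workloads inside the VPC (or in peered VPCs) can reach the bootstrap address, which is the access restriction described above.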
We just discussed the Kafka components like the brokers and ZooKeeper. We also talked about the Kafka Cluster Operator, which allows you to upgrade, manage, and maintain your clusters. On the other hand, you also have the Kafka resource operators, which are mostly for creating topics and users. You have an observability stack which allows you to generate metrics for your resources, track those resources, and create alerts on top of them. It has a very nice set of configurations; a lot of open-source configurations are available which you can tune. Strimzi ships with a lot of sample Grafana dashboards which you can use to look at all of these metrics. Then we also have a Cruise Control capability, with which you can make sure all of your brokers are evenly balanced. It does anomaly detection to make sure that none of the brokers exceed their thresholds. It also helps you avoid throttling based on CPU, requests, memory, or the rate of incoming events; you have many options based on the number of topics or the resources being used. You can also make sure that no given broker is overloaded. All of these capabilities are taken care of by Cruise Control, another interesting project from LinkedIn, which makes sure that your cluster stays evenly balanced at scale. Then you have the connectors ecosystem, one of the ways you can optimize how you communicate with external systems. A very classic use case: if you are generating events onto Kafka and you want to send this data to some external system, you don't need a microservice; you can directly use a Kafka connector.
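The metrics side of the observability stack mentioned above is usually wired up by pointing the Kafka CR at a ConfigMap holding JMX Prometheus Exporter rules. This fragment is a sketch; the ConfigMap name `kafka-metrics` and key are assumptions matching the style of Strimzi's published examples:

```yaml
# Fragment of the Kafka CR: expose broker JMX metrics in Prometheus format.
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics            # ConfigMap holding the exporter rules (assumed name)
          key: kafka-metrics-config.yml
```

Prometheus then scrapes the brokers, and the sample Grafana dashboards visualize what it collects.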
For example, here we have done this for Snowflake and Neo4j, where you can move the data right from Kafka events into Snowflake tables, all of it happening directly with the help of the connector. You can also specify how many tasks can run on that particular Kafka Connect cluster; all of that is configurable. Then there is the Kafka Bridge, for when you want an HTTP connection model rather than the usual TCP connection; that is also supported. Kafka MirrorMaker gives you a disaster recovery solution for your complete Kafka system. Currently the more prevalent version is MirrorMaker 2, which uses the Kafka Connect design pattern to do this disaster recovery. It supports both active-active and active-passive topologies. Plus, you can build more and more tools on top of it, something like a simple Kafka UI such as Kowl; you can just embed it. The great thing about this is that the complete project is very extensible: Strimzi allows you to plug and play all of these components very easily and, at the same time, gives you a way to manage all of these resources on a central plane. So let's look at the demo and see how things work. Okay, I have an operator running. I'll show you my project structure just to confirm that I have nothing deployed. Currently, my cluster looks like this: I only have an operator running, and you can see the operator logs here. Let's go ahead and do a deployment. I'm going to deploy a CR; we'll talk about the CR in a minute. Let's look at the operator. Okay, some action has happened, and you can see it has created resources. Let's look at the activities happening here. It has started to create ZooKeeper first, and you will see that with the CR, Strimzi makes sure your deployments are streamlined and sequenced.
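A sink connector of the kind described can itself be declared as a CR, assuming the owning KafkaConnect cluster (here called `my-connect`, an illustrative name) was created with connector resources enabled via the `strimzi.io/use-connector-resources: "true"` annotation. This is a hedged sketch; the connector-specific settings are deliberately left out:

```yaml
# Sketch of a sink connector declared as a Strimzi-managed CR.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: snowflake-sink
  labels:
    strimzi.io/cluster: my-connect   # the KafkaConnect cluster that runs this connector
spec:
  class: com.snowflake.kafka.connector.SnowflakeSinkConnector
  tasksMax: 2                        # upper bound on parallel tasks, as mentioned above
  config:
    topics: orders
    # connector-specific settings (credentials, target database, ...) go here
```

The `tasksMax` field is how you cap the number of tasks running on the Connect cluster, which is the knob referred to in the talk.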
ZooKeeper is deployed first, and only then the brokers, and all of this sequencing is taken care of by Strimzi itself. You can see the ZooKeeper pod is running now, and it will schedule the broker immediately; we can also look at... yeah, you can see the broker has also started. Another way to look at this is by running kubectl get kafka: you are now able to track Kafka resources using the Kafka CR. This is a one-stop shop for me to look at everything related to Kafka. It says that I want one desired Kafka broker replica and one desired ZooKeeper replica, and that's it. It gives me a Ready condition and warnings: until it's ready, I will not open the connection to my end clients, and if there are warnings, I will make sure that until I fix those warnings, my connections stay closed. Here it makes sure that ZooKeeper and Kafka are running, and we are also running an Entity Operator. The Entity Operator is basically a pod with multiple containers, the Topic Operator, the User Operator, and supporting sidecars, which combine together as the Entity Operator. Until all of these are running, this will never show up as Ready. If I do a describe on this, you can see the complete CR here. We'll go through the CR in detail, but the status will always make sure that until the conditions are met, we do not open up the connection. Here the deployment is still in progress... yeah, I can see that this is also done. Now if I get the Kafka resource, I can see that it is Ready; everything looks good. Let's look at all the components that are available. We can see it creates a set of pods; these are the three pods: one is the broker, another is ZooKeeper, and this is the Entity Operator.
It has created certain load balancers: one load balancer is for the external bootstrap, and the other one is for the Kafka broker. Then you can see the Entity Operator, which is a Deployment, and the pod that belongs to it. And these are the two StatefulSets: Kafka and ZooKeeper are both StatefulSets. So basically, just with the CR, we have deployed a complete Kafka cluster, and you can see all of these resources. You can keep increasing the number of brokers and the number of ZooKeeper nodes, all of this will get added, and you only ever manage the CR. Let's look at the CR for a minute. This is the CR we have used, and here you can see what all we have. We have an external bootstrap; this is the endpoint which we expose out. We have support for an NLB, and support for keeping it internally exposed rather than opening it to the outside. Similarly, we have affinity strategies: pod affinity and anti-affinity. And just remember that this is of kind Kafka; this is the CR we're using, coming from the Strimzi API version, and then we assign some replicas. You can just update these replicas and redeploy, and new brokers will come up. You have different options for exposing the endpoints. We never recommend something like tls: false, an unsecured load balancer, for external usage; we always recommend something mTLS-based, and it's as simple as setting tls to true. All the certificate management will be handled by Strimzi. Similarly, you can have a SCRAM-SHA-512 based mechanism as well, or a combination of TLS with authorization. The interesting thing is that you can use JBOD storage and have multiple volumes attached, so as more and more load comes in, you can just add volumes, and these volumes will get attached to your existing brokers.
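The listener security options just mentioned, mTLS versus SCRAM-SHA-512 plus authorization, map onto the CR roughly like this. Treat it as a sketch of the relevant fields rather than the full manifest from the demo:

```yaml
# Fragment of spec.kafka: secured external listener plus ACL-based authorization.
listeners:
  - name: external
    port: 9094
    type: loadbalancer
    tls: true                        # never expose tls: false externally
    authentication:
      type: scram-sha-512            # or `type: tls` for mTLS client authentication
authorization:
  type: simple                       # simple ACL authorization for the whole cluster
```

Flipping `tls` to true is all it takes; Strimzi then generates and rotates the certificates on its own.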
Any Kafka configuration the broker supports can be specified here as well, and Strimzi will inject it as part of your broker and ZooKeeper deployments. Any time you make a change here, a rolling update happens for all of these pods. It also supports deployment on dedicated node groups; you can have taints and tolerations, all of that is also supported. The same applies to ZooKeeper, and we saw the Entity Operator, which contains our Topic Operator and User Operator. You can also use Cruise Control to define capacities and make sure your brokers never go beyond those capacities; if they do, it gives you hard goals, suggestions, and optimization goals, based on which you can trigger optimization proposals. So this is a single blueprint for me to deploy a complete Kafka rather than juggling different resources. I play around with these values and my Kafka deployment updates based on the change, because for every change event there is a reconciliation by the Cluster Operator. That's what I wanted to present in the demo. These are the references: you can look at the website and the GitHub project, and there's the Strimzi Slack channel on the CNCF workspace. If you have any interesting use case, we will help you get onboarded with Kafka. That's what I had to cover. Thank you so much. Please ask questions if you have any; I'll be happy to answer them.
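The Cruise Control workflow described above, capacities plus a triggered optimization proposal, can be sketched roughly as follows. The capacity numbers, cluster name, and the choice to rely on default goals are illustrative assumptions:

```yaml
# Fragment of the Kafka CR enabling Cruise Control with broker capacity limits.
spec:
  cruiseControl:
    brokerCapacity:
      inboundNetwork: 10000KiB/s     # illustrative capacity values
      outboundNetwork: 10000KiB/s
---
# A KafkaRebalance CR asks Cruise Control for an optimization proposal.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
spec: {}                             # empty spec falls back to the configured default goals
```

Once Cruise Control computes a proposal, it is reviewed in the KafkaRebalance status and approved by annotating the resource, so the rebalance itself stays an explicit, operator-reviewed step.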