Hello everybody, we're going to get started. We have Kaufman Ng here from Confluent, who will talk to us about deploying Kafka on DC/OS.

Hi, I'm Kaufman. What we're going to talk about is basically the gotchas: how to effectively deploy Kafka on DC/OS. As everyone knows, DC/OS comes with a whole universe of packages, but there are some nuances you want to watch out for. The target audience is mostly ops people and administrators, whoever is administering your DC/OS clusters.

So here's the agenda. About myself: I'm a solutions architect at Confluent. I've been with the company for over a year now; previously I was at Cloudera in a similar role, and I've contributed patches and fixes here and there, mostly to Kafka and Parquet.

About the company, just one slide, pretty short. Confluent is exactly three years old, and it was founded by the creators of Kafka. The three founders, Jay, Neha, and Jun, are the original inventors of Kafka, and they came out of LinkedIn engineering. We currently have about a hundred-something people, with a good portion of them engineers, and we think we are the largest contributor to the Apache Kafka project. Down below on the slide you can see the VC firms backing us.

So what's the relationship of the Confluent Platform, which is our product, to Apache Kafka? Basically, we are building an enterprise offering on top of the open source project. The core Kafka logo here is the open source component, and the Confluent Platform on top of it is offered in two versions. One is the community, or open source, version.
That's what you see in the blue circle there. It comes with a bunch of add-ons: connectors for Kafka Connect, the Schema Registry for managing schemas, and non-Java clients for Kafka. On top of that we have the enterprise version, which is our commercial offering; with that we provide Control Center, our management tool, and some operational tools as well.

Is anyone here not familiar with Apache Kafka? Okay, so everyone knows it, and I can fly through this quickly. Kafka is basically a distributed message bus with a pub/sub paradigm. A lot of people think about it like a message queue, JMS, MQ, and the like. The main distinguishing features of Kafka are that it is highly fault-tolerant and highly performant, and it also allows stream processing; I'm going to talk about what that means in a bit.

To interact with Kafka you basically need two types of clients: the producer and the consumer. The names are obvious: the producer writes messages out to Kafka, which is in the middle here, and to read messages you need consumers. As simple as that, much like a lot of message queue systems.

So why was Kafka invented? At LinkedIn a couple of years ago, what they experienced was that they had a number of systems that needed to talk to each other. In order for system one to talk to system two, you have to build a custom integration point, and over time, as you can see in the diagram here, it gets messy.
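As an aside, the producer/consumer decoupling just described can be sketched in a few lines of Python. This is not the real Kafka client API, just a toy in-memory log standing in for a topic:

```python
from collections import defaultdict

# A toy "topic": an append-only log that many consumers read independently.
class Topic:
    def __init__(self):
        self.log = []                    # messages are only appended, never removed
        self.offsets = defaultdict(int)  # each consumer tracks its own read position

    def produce(self, message):
        self.log.append(message)         # producers only write

    def consume(self, consumer_id):
        offset = self.offsets[consumer_id]
        if offset >= len(self.log):
            return None                  # nothing new for this consumer
        self.offsets[consumer_id] += 1
        return self.log[offset]

topic = Topic()
topic.produce("page_view:/home")
topic.produce("page_view:/pricing")

# Two independent consumers each see the full stream.
first = topic.consume("analytics")
second = topic.consume("analytics")
other = topic.consume("audit")
```

Unlike a traditional queue, consuming does not remove the message; each consumer keeps its own offset, which is the property that lets the same stream fan out to many downstream systems.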
That's why Kafka was invented: as a centralized message bus, acting as a hub for all the systems. Because it's centralized, you don't get duplication of messages, and you get a consistent interface to one central bus.

On the infrastructure side, a typical Kafka cluster consists of a couple of things. The broker is the main server component of Kafka: it stores your messages and serves them out to the clients. ZooKeeper is used as the coordination mechanism. A lot of distributed systems, like Hadoop and Kafka, rely on ZooKeeper for functionality such as leader election, finding out who the active node is. There's a special role in a Kafka cluster called the controller, and that kind of state management is handled through ZooKeeper, so ZooKeeper is a requirement.

The next two things are more optional. Kafka Connect is basically a mechanism for importing data into and exporting data out of Kafka. Kafka Streams is the stream processing framework: when events come into your message bus and you want to do something with them, process them, aggregate them, run your business logic on top of them, you use Kafka Streams to process your data while it is in Kafka.

The first four things here are part of the Apache Kafka project. The next couple of things are what the Confluent Platform offers. The Schema Registry provides a registry of data schemas: if you enable it for your message formats, it helps you make sure your data stays consistent, kind of like DDL does for relational database tables.
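As a concrete illustration of that DDL analogy, schemas in the Schema Registry are typically Avro definitions, which are themselves written as JSON. A minimal example (the record and field names here are made up):

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "example.events",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
```

Producers register a schema like this, and the registry can then reject incompatible changes, the same way a database rejects rows that don't match the table definition.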
Okay, sorry, I lost the screen there for a second. The REST Proxy is basically a simple, lightweight RESTful server: if you don't want to deal with the Kafka API, you can use the REST route to talk to Kafka instead. It provides you with a REST API to interact with Kafka, and you can do similar things as the normal Kafka clients: you can consume, you can produce. There are other tools in the Confluent Platform too, mostly enterprise features in the enterprise edition of the platform.

So why Kafka on DC/OS? As I mentioned, a Kafka cluster consists of multiple components. In order to provision a whole, fully working cluster you probably need a couple of machines; a typical footprint would be at least six machines or so for the brokers, for ZooKeeper, and for the other things. So you need a management layer. If you deploy on-prem, maybe that's not a big problem; a lot of people do it with DevOps provisioning tools like Chef, Puppet, or Ansible, and that can be done as well. But what if I want to run some of these things as lightweight containers? I want to deploy the Schema Registry, or ZooKeeper, as a Docker container, for example. So you need a tool for actually managing all of that. Some of the components around Kafka are stateful and some are stateless, and how do you manage them? We're going to talk about the nuances of dealing with stateful services in general. And once you're provisioning Docker containers, you need service discovery, routing, addressing, all of these things. DC/OS naturally provides all of this for you.

If you look at the diagram here, these are the typical Kafka cluster components, and where you see the disk icon I've marked the stateful services.
They usually require local storage. You can have multiple instances of each of these: here you can see we have three instances of Kafka brokers, three ZooKeepers, three instances of Kafka Streams applications, and so on. The other things down here are stateless, and they tend to be lightweight. Sorry, the connection is getting flaky.

If you look at what state they actually store: at the top layer it says Kafka Streams, which is what you build your applications in, and it requires a local state store; I'm going to talk about what that means later on. Kafka brokers store data, your messages, so they also need their own storage for the events coming in. And we all know that ZooKeeper has its own local storage as well, for transaction logs and snapshots and so on. So looking at these, there are at least three types of components here that are stateful. How do you manage all of these?

Also, alongside a Kafka cluster you would typically put other systems that talk to Kafka: your custom applications, maybe Connect, maybe a relational database that's piping data into Kafka. So there's a lot more going on than just Kafka itself. And then you have to address service discovery, where those containers are located, addressing, and load balancing as well. So where do we place these guys?

Brokers are stateful, as I mentioned, and they have a relatively big footprint, so we don't recommend co-locating them.
Meaning: you should never place two brokers on the same host, the same agent on DC/OS. Brokers should also have dedicated disks. What that means is that when you write data to Kafka, eventually that data actually gets flushed to disk, and when brokers start up they need to read the data back from disk as well, so the disks should be dedicated for performance reasons. Should ZooKeeper nodes be co-located with brokers? Probably not, because even though ZooKeeper is a more lightweight mechanism, brokers are heavyweight when it comes to memory, IO, and network, and ZooKeeper is an IO-sensitive, or rather network-sensitive, system. If you put them together, you're likely to get contention issues.

Then you have to think about which containerizer to use for each of these services. Typically the most common option would be Docker, and Mesos also comes with its own containerizer. DC/OS comes with a lot of these niceties built in: you can easily configure your service, its footprint, its resource requirements, and so on, and you can also place what they call constraints, meaning you can place certain services on certain nodes, or pin them down to specific nodes if needed. And DC/OS can handle stateful services: once they are pinned down, the services stick to that node regardless of how you restart them.

So what's a typical Kafka cluster size, how many brokers should I use? Typically we recommend three brokers. Three is a good start, but it doesn't actually have to be three; it could be two.
It doesn't have to be an odd number either, but a lot of people use three because they like the replication factor to be three. The replication factor basically means that every single message put into a topic gets replicated that number of times, and three copies is usually good for fault tolerance.

Kafka brokers themselves don't need a lot of memory. They are Java processes, but unlike J2EE applications they don't need a lot of heap; a lot of the memory usage is actually off-heap. The JVM process itself is relatively small, one to two gigs; four or five gigs is probably on the high end, but people have done it before. It's not a CPU-heavy system either, so a few cores is okay. It's multi-threaded, but not very highly multi-threaded, so a relatively low number of cores is fine.

On the DC/OS side, in the Marathon configuration you would typically put something like this for your placement constraints. You see that line here: what it means is that I'm placing at most one instance per host. It's actually pretty simple. Alternatively, in the DC/OS service configuration there's also something called a placement strategy; in that case you're pinning the placement down to particular nodes. And you can combine these two together, by the way.

This is how the broker service config.json looks in DC/OS; this is just a sample of it. The first thing is what we just talked about, max one per node. The next thing is the deploy strategy. This is a DC/OS thing: how you deploy the individual instances, or tasks, of the service. Remember, we talked about a Kafka broker service typically having three nodes or more.
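Putting those pieces together, a sketch of what such a broker service definition can look like. All field names here are illustrative, not the package's exact schema; the real config.json exposes placement, deployment strategy, and disk options under its own keys, which vary by package version:

```json
{
  "service": {
    "name": "confluent-kafka",
    "placement_constraint": "hostname:MAX_PER:1",
    "deploy_strategy": "serial"
  },
  "brokers": {
    "count": 3,
    "disk": 10000,
    "disk_type": "MOUNT"
  }
}
```

The Marathon-native way to say "at most one broker per host" is the constraint `["hostname", "MAX_PER", "1"]`, or the older `["hostname", "UNIQUE"]`, in the app definition's `constraints` array.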
So this controls how you deploy the individual tasks: one broker after another, serially. You have to make sure that's in place.

We also talked about brokers having dedicated disks. What that means is that, if you can, you should explicitly put in a dedicated mount volume for the broker data. On the DC/OS side, in the Marathon config, you do something like this: the disk type equals MOUNT.

As I talked about before, Kafka messages are flushed; they're not immediately written. The reason is that Kafka brokers actually cache the messages before they get flushed; that's how Kafka brokers achieve high performance. On the storage side, because of those delayed flushes, we generally recommend hard drives over SSDs. SSDs are good for random writes, but for big data systems like this, standard hard drives are usually better. I'm talking about the broker nodes here. That doesn't mean you can't have SSDs in combination, though. A typical server node could have multiple disks; you could have SSDs for your boot volume, for example. And, maybe not very commonly done in the DC/OS world but standard for on-prem, server-grade hardware, a lot of people put multiple disks on the server chassis; you can easily mount 12 or 24 disks. In that case, for a lot of big data systems,
you can set it up as a JBOD setup, meaning those 12 disks appear as 12 individual disks. Alternatively you can put RAID on it, enable RAID 5 or RAID 10, and then with a RAID volume you get one gigantic virtual volume. There are pros and cons to each, but usually RAID is better for Kafka. There are a couple of reasons for that; one of the main ones is that Kafka brokers handle a single volume better than a JBOD. So if you have multiple disks, you should probably use RAID.

We saw MOUNT in the config before. Basically, versus the ROOT volume, MOUNT in DC/OS means you are writing data to a dedicated volume, separate from the boot volume.

Now let's say that in my DC/OS cluster I want dedicated disks, but I can't afford to have dedicated disks on all the agents. Maybe I have three powerful agent nodes that have more capacity. How do I pin the brokers onto those nodes? You can do explicit addressing: you put a placement constraint with a pipe-delimited list, like this. This is the alternative to what we saw a few minutes ago: earlier the placement constraint was a maximum of one instance per host, and this one is more explicit, more narrow.

Now, DC/OS, as we know, already comes with ZooKeeper. DC/OS ships the Exhibitor tool, and the master nodes use that ZooKeeper.
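The pipe-delimited list mentioned above is Marathon's LIKE operator, which matches the hostname against a regex; an alternation of addresses pins the service to exactly those nodes. The addresses below are made up:

```json
{
  "constraints": [
    ["hostname", "LIKE", "10.0.1.10|10.0.1.11|10.0.1.12"]
  ]
}
```

Combined with a `MAX_PER`/`UNIQUE` constraint, this gives you one broker on each of those three specific machines.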
Should we use that one by default? This is somewhat debatable, and it depends on your use case. If you have a small cluster, maybe; if you have a large cluster, probably not. And keep in mind that if you have other services that also use ZooKeeper, you should think about having a dedicated ZooKeeper quorum. As far as I know, the Mesos masters already write a lot of data to ZooKeeper, so if you have multiple user-land applications, as they're called, that also use ZooKeeper, you should consider setting up your own dedicated ZooKeeper. But for dev environments, the built-in one is probably okay.

Next we're going to talk about some of the caveats, the gotchas. You can restart services, as we all know, with the DC/OS web UI; that's probably the easiest. Sorry, this is flickering. Alternatively you can also use the DC/OS CLI. What the CLI offers you today is finer control over what to restart. With brokers it's a little trickier, because the brokers appear as one service, but underneath they are deployed as multiple instances, multiple tasks. With a distributed system like Kafka you should be cautious about restarting everything at once, one right after the other. What we recommend people do is a rolling restart, and also giving some time between the rolling restarts of the individual instances. So right now we recommend doing restarts the CLI way: with CLI restarts you can actually specify the broker ID to restart, and with the web UI you can't. We'll jump into some details about Kafka Streams and other failures shortly. So, as I just said: no rolling restarts in the web UI; it's better to do a CLI restart, like this.
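A rolling restart like the one described can be scripted on top of the CLI. This is only a sketch: the `dcos confluent-kafka broker restart` subcommand shape is taken from the demo later in the talk, the wait time is arbitrary, and a real script should watch the broker logs for replication catch-up rather than just sleeping:

```python
import subprocess
import time

def rolling_restart(broker_ids, wait_seconds=300, dry_run=False):
    """Restart brokers one at a time, pausing between each so the
    restarted broker can catch up on replication before the next one goes."""
    commands = []
    for broker_id in broker_ids:
        cmd = ["dcos", "confluent-kafka", "broker", "restart", str(broker_id)]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)   # serial, never in parallel
            time.sleep(wait_seconds)          # crude stand-in for watching the logs
    return commands

# Broker IDs start from zero.
planned = rolling_restart([0, 1, 2], dry_run=True)
```

The key property is seriality: never kick off the next restart until the previous broker is back in sync, otherwise you can lose all in-sync replicas for some partitions at once.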
This is the example. Broker IDs are pretty simple: they start from zero, one, two, and so on. And then you should also check the logs, of course. The reason you want to check the logs is that even after a broker has restarted successfully, it takes some time to catch up with the other running instances; in a distributed system, brokers have to replicate the data from the running instances, and that can take a while, depending on how much data you have.

Next we're going to talk about the gotchas with Kafka Streams. At a high level, Kafka Streams is basically a custom-built application using a library to talk to Kafka and do event-based processing; that's all it is. From a deployment standpoint, you write your own applications, down here in the boxes. Because with Kafka Streams you're running your own business logic, calculations, applications, and so on, the API itself comes with state. The way the Kafka Streams API keeps state is through something called state stores, which are embedded databases; you see them as the orange things there.

So let's say I have an application with three instances, each with its own state. If one of them fails, what happens is that Kafka Streams will actually rebalance the workload: you can see the red embedded store got replicated over. This is built into Kafka Streams: if one of the instances dies, the workload gets rebalanced onto the remaining instances.
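That recovery behavior can be sketched conceptually: Kafka Streams backs each state store with a changelog topic in Kafka, so a freshly placed instance rebuilds its local store by replaying that changelog from the beginning. A toy illustration, not the real API:

```python
# Toy model: a state store is a dict, backed by an append-only changelog.
changelog = []  # in real Kafka Streams this is a compacted Kafka topic

def put(store, key, value):
    store[key] = value
    changelog.append((key, value))  # every local write is also logged to Kafka

def restore():
    """A restarted instance begins with an EMPTY local store and must
    replay the whole changelog before it can resume processing."""
    store = {}
    for key, value in changelog:
        store[key] = value           # later writes overwrite earlier ones
    return store

store = {}
put(store, "user-42", 3)
put(store, "user-7", 1)
put(store, "user-42", 4)   # the latest update wins on replay

recovered = restore()      # what a failed-over instance has to do
```

The replay step is exactly why failover isn't instant: the more state you've accumulated, the longer a new instance takes to come back up to speed.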
With that said, this is a stateful application, and in order to recover the state from instance three onto the other instances, they need to catch up on that state, the red thingy there. You can naturally imagine that might take some time: the state has to be replicated from the Kafka brokers, and then the new tasks get started, and so on. As I mentioned, every Kafka Streams application has its own state stores, and those state stores have to be synchronized and assigned to their instances. That can take time, depending on how much data you have, how many aggregations you have, and how long you've been running as well. When instances get restarted, they may be spawned on different nodes, and they start from an empty state store, meaning they have nothing to begin with and need to repopulate the state.

There are some other types of gotchas to watch out for, more on the failure side: in case of failures, what's the proper thing to do in DC/OS? Normally Marathon will restart the tasks, possibly on a new node, if they fail or if the health check fails.
So oftentimes you may see services being killed and restarted somewhere else. For stateful applications in general, this is something you want to watch out for. If it happens, maybe it's okay, but you need to check whether those things were restarted properly and are working properly. With Kafka Streams, as I mentioned, you need to allow more time, and an administrator should look into it. We also recommend people set up alerting, by which I mean tools like PagerDuty, Nagios, and the like.

Now, some information about the Kafka services themselves; let's drill down on one of these. Let me give you a little background here: Confluent and Mesosphere are partners, so what we did was co-develop some of the DC/OS packages, with input from both companies, and that gets deployed into the DC/OS Universe. The sources of those packages are right here; they're all open source. If you click on one of them, you can see a typical DC/OS package definition. Sorry, I need to switch screens here; you can't see mine. Can you see it now? Yeah. So this is no different from most other DC/OS services: you can see the same bunch of config, package, and resource JSON files.

Here is the documentation for the Confluent Kafka package. The version names may be a little confusing recently: right now the latest Confluent Platform version is 3.3.0, which you can see in the later part of the numbering scheme; the first number is actually the DC/OS framework version. There are also a couple of other tools
we recommend people install. From Mesosphere there's the Kafka client Docker image, which is basically the standard Apache Kafka command-line tools with no services running. There's also the Confluent Platform CLI; the Confluent Platform is basically Confluent's packaged version of the Apache Kafka binaries, so it comes with the standard Apache Kafka command line as well, and you can use any combination of these tools, if you like.

Let me show you what it looks like. Can you see my screen? Okay. Here I'm entering `dcos confluent-kafka`. It comes with a bunch of subcommands; we're not going to go through what all of those are, but some of them are administrative commands and some are for looking things up. You can restart things if you want, and you can look at the broker list. I think I have this configured to point to a running DC/OS cluster. Let's see. Yeah, this one, for example, gives you the broker IDs. Another simple command I can run here is topic list; here you can see the predefined Kafka topics on my DC/OS cluster. It depends on the toolset: the DC/OS CLI gives you this output in a JSON-like format.

Alternatively, you can use the standard Apache Kafka command line as well. For example, now I'm on the master node, where I previously pulled down the Kafka client Docker image, so I just run it, and inside the container you get a bunch of commands like this. I'm not going to run them all, but this list of commands is the standard set from the Apache Kafka project.

Done here; any questions? Great. Right, it will get used; you get leverage out of it.
So basically, in the container, as I talked about earlier, the JVM process itself doesn't occupy a lot of memory; I mentioned about one to two gigs. But you may want to allocate more free memory to the container, in this case for the page cache. This is how Kafka actually caches your data before it gets flushed to disk: before they get flushed, freshly produced messages sit in the page cache, and when consumers read them, the Kafka brokers serve them directly from the page cache instead of copying them over into the JVM and then out to the client. This is what they call the zero-copy mechanism. So the more page cache you have, the better, in general; it depends on your data volume, but it's very common for people to plan for at least 32 gigs.

Questions? Maybe give a chance to someone else.

[On whether the package gets updated:] Sorry, yes, it does. In general, as I mentioned, Confluent and Mesosphere have a partnership, so we work together to make sure that whatever our latest version of the Confluent Platform is gets updated in the DC/OS Universe. The 3.3 version is what you see as the latest.

[On graceful restarts:] Graceful is a somewhat loosely defined term here, because, as I mentioned, how long catch-up actually takes varies depending on how much data you have and how far behind the broker is. Let's say my broker was down for a day; then I have one day's worth of data to catch up on, and that could be a lot. I don't know; it depends on your use case. That's why we don't have a built-in "wait for one minute" or "wait for one hour"; someone should actually monitor it. For now, restarting Kafka brokers is still somewhat manual; it involves human supervision.

[On the Mesos Kafka package:] I think you're talking about the ones just without the Confluent prefix, right?
Yes, those, as far as I know, are based on the standard Apache Kafka project, so the difference is the Confluent Platform versus the Apache Kafka package. They are pretty compatible in general; the Confluent Platform has more command-line tools built in, and some enterprise features built in, as I mentioned. That's about it.

One more question, on broker replacement? Can I elaborate: yes. In general we recommend you do a manual replace, because if you use a MOUNT volume, as I recommended before, chances are that in a cloud environment you may have to detach the EBS volume, or something like it, and then reattach the volume to the new replacement. Right now, as far as I know, you need to build that on your own; even DC/OS doesn't have the capability to do that for you. Remounting the disk is something more on the AWS side of things, so administrators have to orchestrate that to make it happen. It's still somewhat manual.

All right, thanks guys.