Okay, so this is a talk about Kafka and Storm. My name is Bhasker Kode, and that's my Twitter handle too. I work at Helpshift, on platform and products. This talk is basically about distributed systems and stream processing, and my experience actually building a Kafka client in-house at Helpshift; I just want to share those experiences with you.

But before that, let's see why you are here. This is a big data analytics conference, and you have come here, I could probably guess, to learn, to share, to teach from your experiences. So here's an outline of how this talk is going to go. You could be making a mobile app, or a website, or it could be a hardware startup, or some internal API. Regardless of that, you're pushing some data, most probably over HTTP or MQTT or some such protocol, to some backend. And voila, you get numbers, you get insights. And if you have a large enough team and time on your hands, you might even be able to predict stuff. But, you know, I am a simple man, and what this talk is really about is sleep: you only sleep peacefully once this pipeline works. Because let's face it, we're all working towards getting the data right and the analytics right; if you get the insights and predictions right, you can get a raise and, eventually, good sleep. So I'm here to give you good sleep.

Okay, so the first step, let me just spell it out: you start using Kafka. Simple. When you leave this room you should have at least one takeaway, so that when somebody asks, "you went to Fifth Elephant, what was this Kafka and Storm talk about?", you have a ten-second tidbit you can tell your friends. Honestly, for us it's this: we have a bunch of different teams, and Kafka is the glue between them. It's like those Mario or Sonic games that had a little checkpoint flag you need to cross, so that the next time you die, you can start off from there. That's how it's been for us: Kafka is that flag, and once your data crosses it you can breathe a sigh of relief, especially when you're working with different teams. So it really is a power combo, and hopefully by the time you leave this room I'll have explained how.

Here's a little philosophy; I like to add these to my talks. Think of the two growth phases the developed world and the developing world went through over the past 50 years or so. The developed world went through wired communication, broadband over copper wire and so on, and then moved to wireless, satellite dishes and so on. In the developing world, like India, a lot of us jumped straight to handsets before ever getting a landline. I think the same thing is happening with data. Here's a little snapshot of how long queries over the same data used to take as the systems evolved. And who knows, eventually Spark might be the old thing at the top of that list, with ten other engines below it, and everything might be micro-batch or streaming. Then at Fifth Elephant 2025 I'll be saying, that's where I was. The same leapfrogging is happening with streaming and batch data.
So once you have the data ingestion part, there are a lot of cases where the data goes straight to stream processing. Okay, so I hope you got that example; I thought it was a little clever. The next time you meet your friends, tell them this whole telecom-and-stream-processing analogy.

A little history about Kafka. There were these data scientists at LinkedIn who wanted to massage all the data they had and pull learnings and insights out of it, but there was no infrastructure to actually do that. So what the engineers there, led by Jay Kreps, eventually ended up doing was build that infrastructure: let anybody push data into it and let anybody else take it out. That was pretty much it, and that team from LinkedIn has now gone on to start a company called Confluent, which does pretty much just that. Storm came from a company called BackType that Twitter acquired. It was basically built to give analytics about each and every tweet: who saw it, when it was seen, who might see it, and so on. Both went into the Apache Incubator first and eventually became full-fledged projects.

So let's get back to a diagram. Your typical setup, again whether you're a mobile app or a web app or something else: you probably have a load balancer, some app servers, some APIs, and a set of databases, caches, search engines and so on. How many of you have a system like this running? Wow. Okay. So here's the first thing you have to do: cut it off right after the load balancer. All you keep at that point is the auth logic. Have one team work only on authentication, making sure the request is really coming from your mobile app or SDK. When a request comes in, it's HTTP, it's stateless, it's flaky. Normally you'd take the incoming request, do some form validation or whatever, and put it into a DB. Instead, here's what you do: do the authentication and push it into Kafka. Then split the rest of you into two teams: one team just pushes into Kafka, and the other writes a Kafka consumer that reads from it and puts the data into whatever store you want; insert whatever buzzword you like there. And this is pretty much what we do as well, just at a scale I'll talk about.

So how does Kafka help me sleep better? We get 300 million requests every day to pretty much the same architecture you saw there, all from mobile phones. We have a mobile SDK integrated into around one billion devices. If you use WordPress or Clash of Clans, Microsoft Outlook and a lot of other Microsoft apps, or games from most of the European gaming companies, we are most probably inside your phone, embedded through the SDK. It's worked out pretty well for us, because over a year we have done pretty much nothing apart from going from three servers to eleven. That's 22 cores handling those 300 million requests a day, and on a monthly basis we're pushing around seven to eight billion events into Kafka. So, a little note to the folks who have 100,000 users.
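To make the "auth, then push to Kafka" split concrete, here is a minimal sketch of such an edge endpoint in Python, using Flask and the kafka-python client. This is only an illustration, not the stack described in the talk (which is HAProxy in front of an Erlang endpoint); the verify_signature helper, the /events route and the "events" topic are made-up placeholders.

```python
# Sketch of the "authenticate, then push to Kafka" edge service.
# Illustrative only; the talk's real stack is HAProxy + kafboy/ekaf (Erlang).
from flask import Flask, request, abort
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def verify_signature(req):
    # Placeholder for the auth team's logic: check that the request really
    # came from your mobile SDK (signature headers, API keys, and so on).
    return req.headers.get("X-Signature") is not None

@app.route("/events", methods=["POST"])
def ingest():
    if not verify_signature(request):
        abort(401)
    # Fire-and-forget: hand the raw JSON payload to Kafka and return at once;
    # the consumer teams decide later where it ends up (DB, cache, search...).
    producer.send("events", request.get_data())
    return "", 202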
I actually lost respect for the word "billion", because I was one of you, with almost no users, and then suddenly we just scaled the servers and went to a few billion events. So don't be too infatuated by the billion word.

Okay. Now a little note about how we have made it modular. We have released two open source projects. One is called ekaf, and that is the actual Kafka client. The other is called kafboy, and that's the HTTP wrapper which calls ekaf. Both are open source; you can check them out at github.com/helpshift. And then team number one, which again is me, worked on the auth logic: make sure the signature verification happens and the devices are who they say they are.

Now let's get into one specific component: a request comes in, it's some event, some JSON, and we push it to Kafka. Let's see what happens under the hood. There's one particular secret sauce that I want all of you to go home with. Let's take a few guesses. Yes, it is open source, and there's a great community behind it, but that's not the secret sauce. Let's work through a few examples to figure out what it is.

In Kafka, you designate a few servers called brokers; they're the actual servers running Kafka. Producers push to brokers, consumers pull from brokers. That's pretty much it. So what do you do on day one, after tonight? Create a topic for your events, with one partition, or three partitions. What happens when you do that? A partition is basically the abstraction Kafka exposes over the implementation detail of how it's going to store these events across a distributed system. So say you have four brokers and you create one topic with three partitions: Kafka will designate three of those servers to handle those three partitions, and each partition gets a leader, basically so that if the leader goes down, one of the others becomes the leader. And you always push messages to the leader. Always, until that leader goes down.

Okay, so what else could be the secret sauce? Is it the fault tolerance? That's great, but it's not the secret sauce I'm talking about. And what if you ask, "okay, broker one, where should I push this to?" A lot of libraries make that a black box, but not Kafka. It says: hit us with a metadata request and we will tell you exactly what we just did. It tells you what brokers there are, and that for this topic you should open a TCP connection and send data to this server, on this port. That's it, and that's pretty much all you need to do to make a Kafka producer at the simplest level.

Okay, so this is great. Let's push a few messages now. What happens? Kafka, on each of these servers, creates a file per topic and partition, and it's pretty much an append-only file where every message goes at the end. That's the reason they say Kafka is inspired by, and pretty much is, a commit log: a commit log for your infrastructure, for your products and so on. And you also get ordering within that partition.
So if you want strict ordering, say you have 10 servers and you want all of them to push messages to Kafka and you want everything to be ordered, then you have to make one topic with one partition. It will all go to one broker, and pretty much into that one file. On the other hand, if you don't care about ordering and just want to push, you can create as many partitions as you want: say 10 servers, 10 partitions, and it'll spread across all of them. That's how you scale it out.

Similarly for consumers: they also ask for metadata, and then they take that same file and say, "I want bytes from this offset onwards". That's pretty much the only thing the consumer fetch API does. The brokers don't even know how many consumers there are; they don't maintain any state about consumers. A consumer opens a socket to the broker and says, in that file that you have, I want messages from, say, offset zero, or offset 100, or whatever it is. Kafka consumers usually save that state in ZooKeeper. And the brokers do smart things when they send the data from that file: they use the OS-level sendfile call, so the bytes don't even have to be copied into user space. There is an actual OS API where you say, take this many bytes of this file and send them to this socket, and you should consider using it too.

So that's cool, but let's take another example. Say for topic one you want three consumers each reading every message: one feeds your staging environment, one writes to your DB, and one warms a cache. You can do that. You could also have another topic where you push everything in but want three consumers splitting the messages equally between them. The way you express that is a concept called consumer groups. If three consumers are each reading all the data independently, you don't have to worry about the same data being read multiple times, because that's exactly what you want there: if one consumer is working today, tomorrow you can have another consumer also read the same data, and you can compare two different implementations of something, say one saving to Postgres and one to MySQL, and then do nice comparisons. Splitting the data between consumers, on the other hand, is what a consumer group gives you: if you register yourself with a consumer group ID, the data gets split across the consumers in that group. So you have broadcast, and you have splitting between consumers.

But here is the secret sauce: Kafka has a very good, well-documented binary protocol. It's one of the biggest things that attracted me to Kafka, because my background is actually in reverse engineering IM protocols and packets and staring at Wireshark. These guys say: all the data goes over TCP sockets, here's the header, this is how you send a message, and so on.
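As a rough illustration of the broadcast-versus-split behaviour just described, the sketch below uses the kafka-python consumer; the talk's own consumers are written in Clojure, and the topic and group names here are invented.

```python
# Sketch of the two consumption patterns: broadcast vs consumer-group split.
from kafka import KafkaConsumer

# Broadcast: consumers in *different* groups each see every message,
# e.g. one writing to the DB, one warming a cache, one feeding staging.
db_feed = KafkaConsumer("events", group_id="db-writer",
                        bootstrap_servers="localhost:9092")
cache_feed = KafkaConsumer("events", group_id="cache-warmer",
                           bootstrap_servers="localhost:9092")

# Split: consumers that share a group_id divide the partitions between them,
# so each message is processed by only one member of the group.
worker_1 = KafkaConsumer("events", group_id="aggregators",
                         bootstrap_servers="localhost:9092")
worker_2 = KafkaConsumer("events", group_id="aggregators",
                         bootstrap_servers="localhost:9092")

for message in worker_1:          # each worker only sees its share
    print(message.offset, message.value)
```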
And because of that, the ecosystem has really developed, because anybody, any school student who can open a TCP socket and read the spec, can send a Kafka message. It really lowers the barrier. They spent time working on the spec, it's still very well documented today, and they take inputs on it.

This is actually a pet peeve of mine with the rest of the Hadoop ecosystem: you don't really know what is going over the wire. Even for, say, a MapReduce job, I'd love to know what is actually sent at the TCP socket level (if you know more about this, do tell me), because then I can inspect it, I can verify it, and I don't even need the Hadoop client library. If I have a socket and a document that says "this is how you submit a job to the JobTracker", I can write my own programs to do that. So where are the binary protocol docs for the rest of the Hadoop ecosystem? I'd like to know. Because protocols win over APIs and drivers and classes and so on; it's the purest level. Socket communication really does it for me: it's the lowest level at which you actually send and receive data. And that's the reason Kafka has had more and more integrations; all kinds of crazy producers and consumers have come up, because it's a well-documented binary protocol and they stuck to the basics: encode something, send it over TCP.

So let's try that. If you have any Kafka 0.8 broker anywhere in the world and you send it these 29 bytes, which essentially mean "send me metadata for the topic 'events'", it will answer. That's the democratic part: whoever you are, if you send this message, it returns the metadata for that topic. Let's look at what I actually sent in those 29 bytes. By looking at the spec, you can write your own implementation today that opens a socket and sends this. The first two bytes are the kind of request you're making, in this case a metadata request; then the API version, which is currently zero; then four bytes which are an integer ID, a correlation ID, so the reference comes back to you in the response. The client ID is again optional, so you can just say "client1". Then you can ask for metadata for multiple topics; in this case I'm asking for just one, it has six characters, six bytes, and it is "events". If you changed this to, say, "abc", you would send a length of three and "abc" here. This is pretty much what the clients do.

And the response, again, is very well documented. It tells you, for this topic, these are the host names, probably three different brokers, this one is the leader, and all your messages have to be sent to the leader. It gives the host name, it gives the port. And what do you do after that? You just open a socket to that host and port and send data. It also carries a lot of other metadata, obviously.
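Here is a minimal sketch of what such a hand-rolled metadata request can look like over a raw socket, following the Kafka 0.8 protocol layout described above (API key, API version, correlation ID, client ID, topic array). The broker address, client ID and correlation ID are arbitrary examples; with a 7-byte client ID the request body happens to come out to 29 bytes, plus the 4-byte size prefix.

```python
# Hand-rolled Kafka 0.8 MetadataRequest over a plain TCP socket.
import socket
import struct

def metadata_request(topic, client_id=b"client1", correlation_id=1):
    # ApiKey=3 (int16), ApiVersion=0 (int16), CorrelationId (int32),
    # ClientId (int16-length-prefixed string),
    # then the topic array: int32 count, each an int16-prefixed string.
    body = struct.pack(">hhih", 3, 0, correlation_id, len(client_id)) + client_id
    body += struct.pack(">i", 1)                       # asking for one topic
    body += struct.pack(">h", len(topic)) + topic      # 6 bytes for b"events"
    # Every request goes on the wire as a 4-byte size prefix + body.
    return struct.pack(">i", len(body)) + body

sock = socket.create_connection(("localhost", 9092))   # any 0.8 broker
sock.sendall(metadata_request(b"events"))

size = struct.unpack(">i", sock.recv(4))[0]            # responses are size-prefixed too
reply = sock.recv(size)                                 # ignoring short reads for brevity
print(len(reply), "bytes of metadata")
```

Parsing the reply is the same exercise in reverse: the spec tells you where the broker list, host names, ports and partition leaders sit inside those bytes.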
You can make a synchronous producer just by changing a field in that packet: set it one way and you're saying "I want a synchronous produce", and the broker will actually respond back with the offset. Set it the other way and it's an asynchronous producer, and there won't be any response at all. And there are a lot of other requests like this.

Similarly for messages, there's a concept called a message set, so you can implement your own batching on top of it. A message set is just n messages, each prefixed with its size, so you can send three at a time, or 300 at a time, or whatever you want, and Kafka will just accept it because you are sticking to the protocol. There's also an attributes byte that is useful for telling Kafka whether you're compressing this with Snappy or gzip or whatever: you compress the data you put in, and you set that byte saying, yes, this is compressed, please note it.

So what can you actually send as the payload? You could send plain JSON strings. You could send MQTT-encoded messages, Thrift or Avro packets. A lot of people, especially at LinkedIn, send Avro-encoded messages, but you can send pretty much anything you want. The onus is on the clients, that is you, anybody who opens the TCP socket, to decide what to send and receive. The broker just accepts it on that specific leader, notes whether you've turned on the compression bit and the other flags, and appends it to a file; if any consumer asks for it, the consumers read from that same file. It's also up to your client to detect downtime, to detect if a broker goes down, and so on.

So here are a few steps of what a client does; this is basically what our client does. In the first week of working on my own Kafka client, I got that high of sending those 29 bytes and getting some metadata back, and that gave me the confidence to do more. What your client eventually has to do, and ours took shape over three months, is the basic stuff: it has to work with sockets; it has to have functions to encode whatever JSON or key-value data you have into these Kafka packets and to decode the replies; it sends over TCP, and when replies come in it needs to know which socket or which process they have to go to; and it has to handle failures, a broker going down, and all the state machines. State machines are another thing I really enjoy working on, especially because distributed systems should be built as state machines: you never know when something is going to go down or when there's a network partition. So if you have any code that looks like "reply = send(some URL)", that's pretty much broken unless the client is doing some smart stuff underneath. If you have something that just sends and then blocks waiting for the response, it's not well designed; even if there's some sort of timeout in the background, it's doing nothing useful until the reply comes back, and that's why it's bad.
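To show the framing being described, here is a sketch of encoding a Kafka 0.8 message set by hand. The JSON payloads are made up; the attributes byte is where a real producer would record gzip or Snappy compression, and the sync-versus-async choice mentioned above lives in a separate field (the acks setting) of the surrounding produce request.

```python
# Sketch of Kafka 0.8 Message / MessageSet framing.
import struct
import zlib

GZIP, SNAPPY = 1, 2   # compression codecs carried in the attributes byte

def encode_message(value, key=None, codec=0):
    # Message => Crc:int32  MagicByte:int8  Attributes:int8  Key:bytes  Value:bytes
    # The low bits of Attributes tell the broker which codec the payload uses.
    key_field = struct.pack(">i", -1) if key is None \
        else struct.pack(">i", len(key)) + key
    body = struct.pack(">bb", 0, codec) + key_field \
        + struct.pack(">i", len(value)) + value
    return struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF) + body

def encode_message_set(messages):
    # MessageSet => repeated (Offset:int64  MessageSize:int32  Message).
    # Producers send offset 0; the broker assigns the real offsets.
    return b"".join(struct.pack(">qi", 0, len(m)) + m for m in messages)

batch = encode_message_set([
    encode_message(b'{"event":"app_open"}'),
    encode_message(b'{"event":"issue_created"}'),
])
```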
So a better alternative: first of all, as soon as you open a socket, you should be able to watch for state changes on that socket, and you should also listen for responses. Once you send, you immediately come out of that process and let others have a chance to do something else, and you should expect a receive at any time; a timeout as well, of course, after which you close the socket, and so on.

Some of the options we built here: you can set concurrency options, like how many workers you want, and how many sockets. And ideally you point it at one central address from which it can reach all your brokers. I also wanted to avoid boilerplate, like ten lines of code just to instantiate the client. So in my API there's only one command, which is publish(topic, message). Internally, if it has just started up, it goes and finds the metadata, comes back, opens the socket to the leader broker, queues up all the messages in the meantime, and then sends them. By abstracting all that, the complexity lives in the client but the API stays simple for users. In our web endpoint there's no boilerplate or anything: it just says send, and even if it's the very first message it's sending, it works, and three months later it still works. So the flow, like I said, is request metadata, then worker creation, then the socket connection, and then it moves to the ready state, where it knows which broker to hit. If it takes a while to get that metadata, messages are all queued up. And there's a batching option too: you can say that only once it hits 100 messages should it send them to Kafka, or that it should flush after five seconds of inactivity, or every five seconds.

You can check it out on GitHub: helpshift/ekaf. Decent activity, something like 45 stars, 10 forks and so on. It also has a lot of tests; in the unit tests I actually simulate a broker going down and a new broker being added, and then check whether it's able to figure out the new metadata. It detects that downtime has happened, that a broker is unreachable, and how much time it took to connect again, and all of these can go to something like Grafana.

That's the advantage of having our own client with these state machines and direct access to the TCP sockets: we could write these callbacks which wrote to Grafana. There are different callbacks. Say you've set it to batch every 100 messages: there's a callback whenever those 100 messages go into Kafka. There's a callback whenever a worker goes up or down, so if a broker restarts or there's a network problem and the workers get disconnected, you can see in the stats when they come back up. If it's down for a longer period, you can see how long it was down and how many messages were saved up while the broker was down. And if it goes past, say, a million messages, you can set a threshold saying don't queue up beyond that.

These are some examples from our live dashboard, where we have incoming requests. An interesting thing to note: last August we had around 6,000 requests per minute; right now we have 6,000 requests per second. And this is just one events route.
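The batching options described here (flush at a message count, or after a quiet period) can be sketched roughly like this. send_to_broker is a stand-in for the real socket write, and this is not the actual ekaf implementation, which is Erlang.

```python
# Minimal sketch: flush to the broker at a count threshold or after inactivity.
import threading

class BatchingProducer:
    def __init__(self, send_to_broker, max_batch=100, max_idle_seconds=5.0):
        self.send_to_broker = send_to_broker
        self.max_batch = max_batch
        self.max_idle_seconds = max_idle_seconds
        self.buffer = []
        self.lock = threading.Lock()
        self.timer = None

    def publish(self, message):
        with self.lock:
            self.buffer.append(message)
            if len(self.buffer) >= self.max_batch:
                self._flush_locked()          # hit the count threshold
            else:
                self._restart_timer()         # flush later if things go quiet

    def _restart_timer(self):
        if self.timer:
            self.timer.cancel()
        self.timer = threading.Timer(self.max_idle_seconds, self.flush)
        self.timer.daemon = True
        self.timer.start()

    def flush(self):
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        if self.buffer:
            self.send_to_broker(self.buffer)
            self.buffer = []
```

Usage is just producer = BatchingProducer(my_send_function) followed by producer.publish(msg) from the web endpoint; callbacks for "batch sent" or "worker down" would hang off the same flush and timer points.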
We also plot downtime connecting to the broker, and the queue size; it automatically figures out how many messages to batch and send. So we get all these nice metrics. When I started development I also did a simple benchmark, which is actually quite tough to get right. Say you're building an HTTP stack and you want to profile it: I just wrote a hello-world root endpoint and measured how many requests it could do and what the response time was. In this case the response time was 25 milliseconds, and that became the benchmark. It does nothing else, so every other piece of code now has something to compare against: doing nothing costs 25 milliseconds, so how does your real work, the path from the HTTP endpoint to the Kafka producer, compare? We try to keep that path as minimal as possible.

It's also used at a couple of other places. There's a group of ex-Apple engineers in Silicon Valley at a company called Layer, and they recently gave a conference talk on powering a messaging system where they've used ekaf; you can check out the link to see the section where they talk about it. I was like, woohoo. There's also a Chinese social network using ekaf, and they've just submitted a patch for an issue, which is good to see. And that graph there shows the clones and unique clones of the repository, and the visitors: around 10 per day, not bad.

Okay, so let's come back to some examples for all the architects in the room. This is what it looks like; it's similar to the diagram we saw earlier, with HAProxy in front. We have auth logic that says, okay, this is a valid user and SDK, let's push it into Kafka. We have a Clojure consumer that does things like a cardinality check, and those aggregations are then saved to Postgres and S3 so we can look at them later. We have internal EMR jobs for when we want to do some extra computation retrospectively.

We also run our email delivery this way. When somebody signs up, we don't directly say "send mail"; all you have to do is make a JSON object and push it to Kafka, and straight away you've split your teams into two people. The second person only has to read that JSON from Kafka and call SendGrid or whatever the mail API is. We also have Elasticsearch indexing; somebody from my team, on the platform side, is working on it. Basically, the documents are again not indexed directly into Elasticsearch; they're pushed into Kafka, and there's a bulk index API in Elasticsearch, so you can take things as a group and index them together.

And we have one more thing. Some of you may know Swaroop CH; he started a project to find out what has changed, because when you get a little enterprisey you have to figure out who did what. So on the admin dashboard we have something called audit trails. If somebody changes something, we compute the diff, figure out what audit logs that diff has to emit, push that into Kafka, and somebody else writes it to Postgres. It was pretty interesting to come up with one table that could handle any kind of object.
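As a rough sketch of that "index via Kafka" pattern, the snippet below reads documents off a topic and writes them to Elasticsearch in groups through the bulk helper; the topic, group and index names are placeholders, and this is not the Helpshift indexer itself.

```python
# Sketch: consume documents from Kafka and write them with the _bulk API,
# so 500 documents become one Elasticsearch call instead of 500.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])
consumer = KafkaConsumer("docs-to-index", group_id="es-indexer",
                         bootstrap_servers="localhost:9092")

batch = []
for record in consumer:
    doc = json.loads(record.value)
    batch.append({"_index": "docs", "_source": doc})
    if len(batch) >= 500:
        helpers.bulk(es, batch)   # one bulk request for the whole group
        batch = []
```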
So we had a field called namespace. You could say, for instance, that addresses changed, the person changed the address from this to that, or that certificates changed. That's all pretty relational as far as Postgres is concerned.

Now, Storm. I'm not going to go into much detail about it, since I think the next talk is also about Storm. I'll just show you a nice interface we built internally. We have this stream of data coming in, and we wanted to be able to slice and dice it; so for all the UI folks in the room, this is your aha moment. We take the same data, we push it into Kafka, it's read by Storm, where we do the tokenization and also figure out the sentiment, and it all eventually ends up in Postgres. You can see how this could easily be one of your internal dashboards too, and I think we might go for more projects like this.

Just to recap the pipeline: we have a crawler; the actual crawlers are in Go, and they use the Kafka producer from Shopify (Sarama). We read it from Kafka in Clojure. That's the other good thing: you can have a very heterogeneous developer ecosystem, where one team works in one language and another team in another language and so on. Then there are the things you'd expect from a crawler, like deduplication, stemming and tokenization, topic extraction, and eventually the interface reads that data from Postgres.

And this is an example of a Storm topology. It's a little outdated, but you get the idea. So there's an incoming review, and this is the Storm topology.