So, I'm going to tell you about what we really wanted to build, and before I play the video that explains the product requirement, I'm going to ask you a few questions so that you all feel connected to me as well. How many of you here watch cricket, and watch it on Hotstar? I can see some hands. How many of you really watch an entire match at full go? I see fewer hands. And that is what we realized: cricket, when you watch an entire match, has a lot of dull moments in between. You get disconnected, you go check WhatsApp or social media or do something else, and there are a lot of ad breaks in between as well. During an ad break you can't really stay glued to the screen, so your focus drifts away. So we asked: why can't we tackle those dull moments and make them interesting, so that the user keeps their focus solely on the platform? That really matters for Hotstar, because if a user frequently leaves the platform, that is not good for the platform; but if there is enough for the user to focus on beyond the match itself, that is really good for us, and that is what we wanted to target.

I also want to ask you another question. I personally really like watching matches, in any sport, with my friends, at a sports bar, or in a stadium. There is really something different between watching a match on Hotstar and watching it in a stadium or a sports bar, and that difference is what we wanted to tackle: why can't we bring a stadium-like or sports-bar-like experience to Hotstar? Is the mic off? Is it on? Okay, cool. So what we wanted to build was a Hotstar sports bar, and that is how we came up with this. What we thought about was: can I talk to my friends on Hotstar, can I post a selfie on Hotstar, can I shout out for the team I really like, can I say Dhoni is the best, the coolest captain, can I say Kohli is going to make a century? We thought: why can't we do all of these things on Hotstar, and make the experience wholesome, so that you can watch the match and also enjoy it among your friends, even if you're sitting at home and your friend is somewhere else? This is what we came up with; this is the experience we wanted to power.

For us to power this experience, we decided we needed one core thing: the ability to transmit messages in real time to our client applications. Let me give you some context. If I type "Dhoni is the coolest captain" as a Hotstar user, that comment goes to Hotstar. Hotstar wants to read that comment and check whether it is profane or not, and if the algorithm says it is not profane, we want to deliver that message to, say, 20 million users in real time.
That is really the problem at scale. I'm not going to talk about how the other systems work; what I am going to talk about is: if I have a message that needs to be transmitted to clients, how can I transmit it in real time with a performant system? That was the problem statement we wanted to solve. If we could get that communication layer set up, building the business logic on top of it would not be that difficult. So that is what we really wanted to tackle, and that is what we're going to talk about today.

We wanted to build a messaging infra with real-time delivery; we cannot afford latency in this system. We wanted to support 50 million concurrent connections at one go. During the last World Cup, India played a semi-final match and we saw peak concurrency reach 25.3 million concurrent users watching the match on Hotstar. So if I want to send a message, it should at least reach 25 million users in real time, concurrently, and we wanted headroom for the future, which is 50 million. It obviously had to be push-based, because a client polling for something is not as efficient and not real-time; polling adds latency. We wanted to be production-ready before IPL, and Hotstar has really stringent standards for production readiness, so we decided that if IPL was going to start in March, we needed to be production-ready by January and testing in production; that was one of the important goals. We obviously wanted a system with really good delivery metrics; 100% is what we targeted. And with mobile phones in the picture, we wanted a system with a low footprint on battery and data. That is the problem statement we wanted to power. If you look at the diagram on the right, the blue system is what we wanted to build: we already had an HTTP communication channel with our client apps, but we wanted a push-based channel.

Cool. Before I go into depth, I'd like to define the jargon I'll use in this talk so that we're all on the same page. PubSub, as the name suggests, is just a messaging pattern where one party subscribes to a topic and another publishes to it: if I am listening to a topic, then anybody in the world who publishes to that topic, I will receive the message. That is all PubSub really means. A connection is self-explanatory. A topic is basically a UTF-8 string that you declare you are listening to, and a subscription is the act of subscribing to a topic; that is all there is to it. Connect latency is the time it takes for a client to connect to the broker successfully. Pub-to-sub latency: if I'm a publisher, it is the time from the moment I publish a message until that message is successfully delivered to all the clients; the average of that is what pub-to-sub latency denotes here. EMQ is the application that implements the protocol, and it is the application we used to power PubSub.
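To make that jargon concrete, here is a minimal sketch of connect, subscribe, and publish using the Eclipse Paho Go MQTT client. The broker address, client ID, and topic are illustrative, not Hotstar's:

```go
package main

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Connect: the client opens a persistent TCP connection to the broker.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // illustrative address
		SetClientID("viewer-123")                   // illustrative client ID
	client := mqtt.NewClient(opts)
	if tok := client.Connect(); tok.Wait() && tok.Error() != nil {
		panic(tok.Error())
	}

	// Subscribe: express interest in a topic (just a UTF-8 string).
	topic := "match/feed" // illustrative topic
	client.Subscribe(topic, 0, func(_ mqtt.Client, msg mqtt.Message) {
		fmt.Printf("received on %s: %s\n", msg.Topic(), msg.Payload())
	}).Wait()

	// Publish: anyone who publishes to this topic reaches every subscriber,
	// including this client itself.
	client.Publish(topic, 0, false, "Dhoni is the coolest captain").Wait()

	time.Sleep(time.Second) // give the round trip a moment to complete
	client.Disconnect(250)
}
```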
Cluster and bridge I will talk about later, in context. AWS load balancers: I'm not sure everyone here knows all the load balancing services AWS offers, but AWS ELB, or classic ELB, is one of them, and ELB has two metrics that matter here: surge queue length and spillover count. I'll explain them now so that the later context is easy to follow. The surge queue is a queue of 1,024 requests. If your load balancer is getting, say, 10,000 requests per second, it will try to serve all of them as soon as they come; but if the backend is not responding instantaneously, or there is a delay in the backend response, the load balancer starts queuing these requests, up to 1,024. Beyond that limit, any request it cannot process goes to the spillover count: it drops or rejects that request. Those are the two metrics. AWS NLB is a newer service, the network load balancer, and it is catered towards load balancing TCP connections in general, not HTTP connections. So in a way it is built for something like PubSub, which uses persistent TCP connections rather than HTTP. If anybody has a doubt about any of this, please ask now so that we don't get stuck later; otherwise I'll move on.

Cool. So this is how it starts. We started in September and wanted to deliver in March, so we had a long runway, and we had to decide: how do we know whether the system we're going to build is successful or not? What does success mean for us? That is what we defined here. MQTT is a protocol that defines the PubSub messaging pattern, and EMQ, which I mentioned, is the application that implements the MQTT protocol. You can read about MQTT if you want more depth; I'm not sure the internals of MQTT are relevant for this talk. We chose MQTT as the protocol; there are other protocols for PubSub as well, for example Redis has its own PubSub and Kafka has its own, so a lot of solutions are available out there. The language of the application was Erlang, and that choice came from one of my peers on the team, because Erlang is a language built for distributed, fault-tolerant systems; it is really a language built for scale. If you have a service that is going to scale in production to huge peaks, something like 15 or 25 million, Erlang is a really good language to use. So EMQ was the combination: MQTT as the protocol, implemented in Erlang as the language of choice. Those were the two decisions we took.

Then we defined success. I should have a system with no connection drops: if a client connects, the broker should not drop the connection. The network could drop it, but the broker should not. The system should have low connect latency: the client should not wait, say, 10 seconds for the connection to succeed, because if we wait 10 seconds, we have lost that window. So we wanted low connect latency.
We wanted low pub-to-sub latency; that is the heart of the system. If I have to deliver a message and it takes, say, five seconds, then by the time the message is delivered the ball has been bowled and the context is gone, so it does not really make sense. So we aimed for a system that is sub-second at least, if not faster. We also wanted no subscription drops, which is similar to what connection drops meant. And obviously we wanted a system that is at least 99% successful on message delivery, if not 100%. Keep in mind the data throughput of this system: if I have to deliver one message in real time every second, that means 25 million messages per second out of the system for 25 million clients, and 50 million per second for 50 million clients. So the throughput was really high, and we wanted message delivery to hold up. Cool.

I would also like to state the takeaway before I start. By the time you get up, what I want to have communicated is that building an application at scale is so much more than just code. If you could just write an application and that were it, people like us wouldn't really exist. It takes a long way to go from "the code is ready, my service is feature-complete" to "it works in production, successfully". So I'm going to talk about that journey, because the application we took was an open source application: we had ready code that worked, but it did not work at scale, and making it work at scale is what we needed to do. That is the core takeaway from this talk.

Now to the meat of the talk: getting to IPL. From September to March we had a long runway, and we decided to break the entire journey down into checkpoints. That was really important, because we had committed to delivering these features in IPL; marketing communication had already gone out, so we couldn't back off later. And this is a system where you won't find off-the-shelf solutions. For simple HTTP services, or CRM and things like that, there are solutions out there, but for PubSub at this scale there aren't. So we had to make it work either way; we had no choice. So we thought: can we build in contingencies, can we set up checkpoints, so that we make the delivery possible in any case? That is why we set up milestones and checkpoints, four of which were critical, and those four are what I'm going to cover today. The first one: we had an open source application whose developers said it works for 100,000 connections. That is the limit they called out. Our scale started from millions, so 100,000 was nowhere in the picture. So we asked: can we take this open source application, which says it supports 100,000 connections, and scale it to 50 million or not? We broke it down into smaller steps. Let me not think about 50 million; let me think about one instance. Can I exhaust that one instance with this application or not?
So we put 250k as the number and said: let's just go and test it and see how it goes. That was the first milestone we targeted, and the goal was very simple: maximize the number of connections on that single node, and that is it. You can see the infra here; I'm not sure the slide is readable, but we had a c5.4xlarge instance on AWS as the PubSub infrastructure, and a worker infrastructure of somewhere around ten m4.xlarge nodes. There are certain kernel tunings that are standard for exhausting the network and the resources of a system, and all of the nodes were tuned accordingly. You can find those details on the net, and we have a blog as well with the system tuning details, but assume all of these were tuned to leverage the best of the resources. And this is how the infra looked: we would generate traffic from the worker nodes to the PubSub infra, and you can assume the traffic is exactly what my app would do. I would open my app, make a connection, and subscribe to a topic; the worker infra did the same. If I were to comment, that is the same as sending a message; if I were looking at the feed, I'm subscribed to a topic, the broker is sending me messages, and I'm receiving them. That is what the worker infra simulated.

Okay. Before we get to the results, let me describe what our scenario looked like. If you look at the left graph, the scenario was very simple: a simple ramp-up of connections from zero to, say, 250,000; hold it for a while so the system stabilizes; and then, somewhere around here, start publishing messages. There is one backend service responsible for sending messages, and each message it sent would go to all the subscribed clients. We would stop the message delivery somewhere around here and then ramp down. A simple scenario; we wanted to test the happy case first. The rate at which we were publishing was 36 rps. We had calculated that the number of messages it makes sense for a user to see and read would not go beyond that, so that is where we put the limit. It is in fact quite low, but we still wanted to exhaust the system, so 36 rps is what we tested at. And this is how the test went: we would ramp up, wait for a while, start publishing at 36 rps, and watch what the latency looked like; once we had published messages for a period of time, we would stop publishing and then ramp down. This is how the curve looked. It's difficult for you to see, but the connection curve climbs and holds flat at the target, which shows that our system was able to scale to 250,000 connections and hold them there. In terms of connections there were no connection drops, which was the first success right there. The second curve, over on the right, shows you the pub-to-sub latency.
The numbers on the left y-axis are in microseconds, so for the 200k or 1 million you see there: 1 million is 1 second, and 200k is 200 milliseconds. What we really went for was a system with constant pub-to-sub latency, because if the latency is constant, the system can be performant in production. If the latency were increasing linearly, there would come a time when the system would eventually crash, or the message delivery would go bad. So we really went for constant latency, and this graph shows min, average, 95th percentile, and max all constant, which is exactly what we were aiming for. That told us our test was successful, and 250k passed for us. So this is an application whose developers say it is capable of handling 100,000 connections, and we scaled it to 250,000 connections on one node. That is where we were, and that is what told us the first milestone was successful.

Cool, on to the next milestone. So that we're on the same page: we now have an application running on one node with 250,000 connections. If we had to deliver 50 million connections, the worst case we had thought of was: okay, we will have as many nodes as needed to power 50 million connections, and we will make sure clients are partitioned such that certain clients connect only to one node, certain clients only to the second one, and so on. We could have done that partitioning of users to brokers easily. So this was the first time we realized: okay, we can deliver this in production for IPL; we really had a solution we could build upon. That is what the first milestone meant. The second one was: can we now go from 250,000 to 2 million? Because if I could do 2 million, I could have ten such infras and do 20 million, and so on. We wanted to build a horizontally scalable system; that is what the second milestone was about.

Cool. For the second milestone, we used a concept EMQ has called a cluster. A cluster means you can group a number of nodes together, and those nodes take care of the intercommunication among themselves. All you need to do is tell one node that a message needs to be published, and that node takes care of communicating the message to all the other nodes. That is the basic rule of how a cluster works, and that is what the infra was: we chose five nodes to start with in a cluster, and the worker infra multiplied in a similar way to 15 nodes. This didn't work for us. What we saw when we ran the same scenario was that the latency kept increasing as we increased the number of nodes in the cluster: with 2 nodes in the cluster versus 5 nodes, the latency was very different. And we realized this can't work, because if I increase the cluster size to, say, 20, it won't scale well. So we needed to identify where the issue was.

Let me tell you an anecdote. Hotstar livestreams Game of Thrones, telecast at the same time here as in the US, so I really had the advantage of watching it first-hand; we would watch it at 6 a.m.
when it aired. And I am an evil person, sorry for that, but I went and shared spoilers with my friends. In a WhatsApp group I shared a spoiler, and now everybody knew what was going to happen in the next episode, which they hadn't watched. So they decided to do the same to their friends, and they sent the spoiler into their respective WhatsApp groups. And that is exactly the issue we saw with the cluster; let me explain how. Just imagine a comment instead of the spoiler, a PubSub node or cluster instead of that first WhatsApp group, and the friends' respective WhatsApp groups as the connected clients. That is the analogy. What we realized is that when we made a cluster, the number of hops increased. If I send a message to one node, that node is responsible for sending it to the four other nodes, and then those five nodes send it on to their respective clients, which is an additional hop. And if I increase the number of nodes from five to, say, ten, there are five more sends that one node needs to make. That was really the reason for the linearly increasing latency.

So we thought about how to solve it. Think about this node here: it is responsible for doing two things. One, taking a message and transmitting it across the cluster; two, transmitting the same message to its own 250,000 connected clients. It is doing two jobs. We thought: can we separate those jobs out? Can we bring in a node whose only responsibility is to talk to the cluster, and let the other nodes talk to their clients, so that there is a constant latency? And this is what really solved it for us. We brought in a common node that is only responsible for publishing; you can see it in this architecture. The backend service publishes only to that one node, and that one node publishes to all the other subscribe nodes. I'll grant that by introducing a new node, the delay becomes constant across all nodes rather than better than before, but that is what ensures the system does not have linearly increasing latency and will not eventually crash. That was the idea. If you have any doubts, please feel free to ask me.

What we wanted was for all of the clients to get the message at more or less the same time, if not faster. And this separation does one more thing: if you increase the depth it won't have as much impact, and if you grow it horizontally it also won't have as much impact, because the publish node can send a message to eight nodes in parallel. We had a c5.4xlarge node with 16 cores, so it can do 16 sends in parallel. The publish node does a near-instant transmission to all the subscribe nodes, and the subscribe nodes transmit to their clients respectively, as the sketch below illustrates.
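Here is a rough sketch of the publish node's job, with goroutines standing in for the parallel sends; the node addresses and the send function are hypothetical, for illustration only:

```go
package main

import (
	"fmt"
	"sync"
)

// fanOut models the publish node's only job: forward one incoming message
// to every subscribe node in parallel, so that adding subscribe nodes adds
// parallel sends rather than sequential hops.
func fanOut(msg string, subscribeNodes []string) {
	var wg sync.WaitGroup
	for _, node := range subscribeNodes {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			sendToNode(addr, msg) // hypothetical bridge send, one per node
		}(node)
	}
	wg.Wait() // all subscribe nodes get the message at roughly the same time
}

// sendToNode stands in for the hop to one subscribe node.
func sendToNode(addr, msg string) {
	fmt.Printf("-> %s: %s\n", addr, msg)
}

func main() {
	nodes := []string{"sub-1:1883", "sub-2:1883", "sub-3:1883"} // illustrative
	fanOut("Kohli hits a century!", nodes)
}
```

In the real system the sends are EMQ hops rather than function calls, but the shape of the work is the same.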
This fixed the linearly increasing latency for us, and the 2 million test passed with that. So if you look at that checkpoint: this is when we realized we can set up a cluster that powers 2 million concurrent connections, and I can send messages to that cluster at 36k rps. If Hotstar was going to see 10 million scale, at least I could support 2 million of it. That was the goal we had, and worst case we could partition clients across ten such clusters and handle 20 million scale. That is where that checkpoint was.

Cool, third milestone. We were greedy; we said: let's multiply it further. Can we go to 4, 6, 8 million and beyond? And not with parallel infrastructures, but with one infrastructure that can support 8 million together. Okay. Load balancing is something that has given me a lot of sleepless nights; this is an issue that took a lot of time for us to solve. I'm sure all of you understand why we really needed a load balancer; there were two reasons. One, the client should always connect to a single URL: a client cannot say "I will connect to this node's IP or that node's IP"; discovery is very difficult for clients. So clients wanted a single endpoint, and a load balancer as a construct is a single point of entry. Two, a load balancer, in its essence, is used to balance load. Imagine the scenario: when the match gets interesting, we send a notification to 150 million users at a time. Even if the conversion rate is low, it brings in a surge of users, and every system behind it has to handle that surge, and it's a huge surge that comes in. We did not want one node to take the hit; we wanted the load balancer to balance that load among the nodes. That is why we really needed a load balancer.

Cool. So let me tell you how the load test performed. We ran the same 2 million test we had performed last time, this time with a load balancer in its default configuration; we did not do anything to it. It failed miserably. It failed so miserably that the 2 million that had passed before, we now could not get beyond 500k: a quarter of the performance we had seen. And this was at the connection phase, forget message delivery: it could not even hold that many connections. It was very difficult for us to understand, and this is where the surge queue length and spillover count come in. What we realized is that the connection drops coincide with the surge queue length: the blue peaks go up to 1,024, and the spillover count goes up to millions. All of the connections we were trying to make were not being served by the load balancer; it said "I cannot queue more than 1,024", and beyond that it would just drop them. That is the reason our connections got dropped.

So we talked to a bunch of people. I was not a load balancer expert, nor a seasoned backend expert; I was a backend developer in the making. We talked to the AWS experts in our company, we talked to AWS outside, and everybody used the same words: have you got your ELB pre-warmed or not? So, pre-warming.
I'm not sure everyone here knows about it. AWS says their load balancers can auto-scale, but they cannot handle surges in traffic, because that scaling takes time: it will take, say, a minute or five minutes for the load balancer to scale. So if you get a million users in one second, it cannot handle it. For it to handle that, you tell AWS through a support ticket: this is the scale I expect, these are the peaks I expect, can you pre-warm the ELB? They then keep it ready to handle that scale. It's a manual request, a support ticket: a form that you fill and submit to AWS, and they say, okay, we will pre-warm it for you in advance, for the window you want it. Only then can your load balancer perform. That was the first issue we saw.

Cool. After getting the ELB pre-warmed, we were able to replicate that 2 million result behind a load balancer. The graphs show the same thing: connections were able to scale up, and if you look at the right side, the bottom three lines show min, average, and 95th percentile. The max is what had peaks, but we were not really worried about that, because if the 95th or 99th percentile is constant for us, it's okay if a few messages got delivered late. We partly attributed that max curve to the dashboard, the load testing software itself, taking time to compute some of the messages; that is plausible, because the data throughput and the peak connection count were really high. Cool. So the 2 million connection test passed with a load balancer as well, and this is what the comparison looked like: the only real difference was the max latency, which was not much of a worry for us.

Okay, yes, please. So, was there a difference? There was a bit of a difference. I can't give you the exact number, but we definitely saw that 100 to 200 milliseconds were added by the load balancer. The load balancer would not directly be the only reason, because if you imagine a system sending messages at 36k rps, with 50 nodes on one side and five on the other, the load testing tool itself needs time to compute all of those metrics in real time, so some delay came from that too. It was difficult to attribute exactly where the delay came from, but there was obviously a delta in latency. I've written the average latency here: without the load balancer, average latency was 500 milliseconds; with the load balancer, it went up to one second. But again, we were not worried, because we had constant latency. Even if one of our success criteria, sub-second latency, did not hold, we would at least be able to meet the others.

Cool. Okay. So this is the second issue we saw with the load balancer, and it was even more painful. We said: we are able to do 2 million behind a load balancer; can we do, say, 4 million, can we do 6 million? We ran that test and it failed miserably. Even after getting the ELB pre-warmed for 8 million, it could still not perform at 4 million. So we were really scratching our heads: what do we do now?
There was nothing more AWS could offer us, and nothing more we could do on the load balancer. We could not spin up our own load balancer, because that's risky at Hotstar's production scale. So we tried alternatives. I told you about AWS's NLB: the NLB is really catered to power exactly this kind of system; they say it is built for TCP connections at scale. Even that could not perform. This is the 6 million test with one ELB: it was pathetic; the moment we started sending out messages, the system would eventually just crash. And this is the second curve, tested with a network load balancer: forget message delivery, it could not even hold 2 million connections. It was pathetically bad. It was so bad that, and I don't want to put AWS in a bad light, but it is a real issue open with them, and it has taken them a year to be able to solve for this particular use case. I know the use case is stringent, with high data and network throughput, but the network load balancer could not handle even our scenario. We also tried our own hosted options: Nginx, Nginx Plus, HAProxy. Nothing worked. In fact, it was so bad that one of my broker nodes could handle more connections than the load balancer itself, so I would have needed more load balancers than broker nodes, which is out of the picture anyway. There is also a technique called direct server return (DSR) load balancing; you can read about it if you want, but in short: the connection setup goes through the load balancer, and the moment data transmission starts, it bypasses the load balancer, so the stress on the load balancer reduces. We did not find a hosted solution for that technique, so we could not go with it.

In the end, we scratched our heads and said: okay, one load balancer works; why don't we just multiply the number of load balancers? So we went with sharded classic load balancing as the solution. Imagine the infrastructure, which I'm going to show on the next slide as well: our broker nodes, with sharded load balancers in front, something like six of them, all pre-warmed to handle 2 to 4 million connections each. If one could handle 4 million, six could handle 24 million, and that is what we did. It was not the best solution, we didn't really like it, but it was all that worked, so we went with it.

Yes. So let me go to the next slide and explain how that worked. This is the test result, and 8 million did pass. The only problem was that the 99th percentile latency was now increasing. But we said that is the best we could do; that is the best that is out there. The 95th percentile was a constant curve, so we were not as worried: if we could not hold the 99th percentile, we could at least hold the 95th, and that is what we went with. And the latency definitely increased. Latency is not just a function of the load balancer or the hops; it is also a function of how much stress you are putting the system under, and that is what we saw here. I know that number seems high, but again we were not overly worried, because we could not really trust that number; the load testing infra itself could be contributing that much time.
So what we wanted first was a solution with constant latency; then we would worry about the absolute latency. The 8 million test did pass with load balancers, using a sharded load balancing setup, all pre-warmed. I'm going to show you the architecture on the next slide, and then we will talk about how we solved the remaining problem.

So, the fourth milestone. Just so everyone is on the same page: we have now built a system that can handle 8 million concurrent connections, and we can send messages at the rate we want, 36k rps. Hotstar, prior to us building this, had seen 10 million as peak concurrency, so we were not far off; if we had tested, say, 10 million, we could have said we would handle Hotstar's scale. And we also realized this system is, in some ways, horizontally scalable: if I just keep multiplying the number of clusters and the number of load balancers, I can get there. The only problem would then become AWS saying "we don't have resources, we don't have nodes for you to run on", and we figured that would not be a problem. That is why we said the 8 million system could handle 10 million and even more; why not 50 million? We had a horizontally scalable solution. And that is where the fourth milestone comes in: even though we had a system, the problem was that we could not dynamically scale it, and if you cannot dynamically scale it, it amounts to a lot of cost. The step size of scaling would be very, very high, and once scaled up, it can't be scaled down till everybody goes home. That was really the question.

And yeah, this is what the architecture now looked like. We had a publish node, we had subscribe nodes, and we had the sharded load balancing setup. The way this works with clients is that we added Route 53 in front: there is a domain, and we use Route 53's DNS load balancing to balance across the load balancers. It is not said to be perfect, but what we've seen in our production is that it works; it distributes the load roughly equally among the load balancers, so we did not have to put yet another layer in front. So clients still have a single URL, we have sharded load balancers behind it, and that is how the architecture could handle 50 million connections at the same time.

Cool. I've highlighted this EMQ bridge here; I did not talk about the EMQ bridge before. It sits between the publish and the subscribe nodes, and the scenario looks like this: I send a message to the publish node, and the publish node is supposed to send that message to the subscribe nodes. The EMQ bridge is exactly what it sounds like, a simple bridge: every message that gets received at the publish node flows directly through to a subscribe node. So if I set up, say, four bridges and send a message to the publish node, it automatically flows over to all the subscribe nodes. That is how our system worked. And what we realized was that this is the thing that was not letting us dynamically auto-scale, and I'll explain why. An EMQ bridge has to be set up from the originator, the source node, and it needs to know the destination node's IP in advance.
And that was the problem. If I wanted to scale up the setup, what I would currently do is: scale up all the subscribe nodes, get a list of IPs of all the subscribe nodes, send that list to the publish node, and have the publish node set up bridges to all the subscribe nodes. That is how my infra would scale up. The problem with this: say I want to handle 20 million connections; I would need somewhere around 100 nodes, and I could not keep 100 nodes scaled up before the match starts. We see traffic start rising slowly, from 100,000 to 200,000 to 2 million to 5 million, and then there comes a peak, say at 15 million. And that peak does not hold for an hour; it holds for 10 or 20 minutes, when the match is really interesting. So it is a mountain-shaped curve, and we could not have scaled for the tip of the mountain before the match starts; it was not cost-efficient. And what if a subscribe node goes down? I cannot even bring it back up, because the bridges are set up from the publish node. If you think about it, this is a classic service discovery problem, and there are a lot of solutions available out there, like Consul and ZooKeeper, which do service discovery. All I need to tell the publish node is: "hey, I am a new subscribe node, and this is my IP." That is all there is to it. So a service discovery mechanism like Consul or ZooKeeper could have solved it, but we wanted a solution that was simpler. Consul or ZooKeeper would be yet another system, we would need to set up watchers for everything, and it's another dependency we would have.

And that is how we came up with a rather ingenious solution, which I'll talk about on this slide. If you look at the arrow, the connection for a bridge goes from the publish node to a subscribe node, and that is where the problem arises. We thought: can we reverse it? If you just reverse this connection, then, since the publish node is a single node, I can tell everyone its IP at once, or even give it an elastic IP, or a domain name for that matter. And if the bridge could be reversed, the system can dynamically auto-scale. I could start with, say, only one instance; the moment a second instance comes up, it knows the publish node's IP or domain, it sets up a reverse bridge to that publish node, and that is it: the moment that bridge is up, it starts receiving the inflow of messages.

So that is what we came up with. We built a new service called the reverse bridge, and what it does is leverage MQTT itself: the same protocol the entire infra uses for clients, we use to power the infra internally. The reverse bridge is nothing but a simple Go service that runs on the same node as the subscribe node. It makes an MQTT connection to the publish node and creates a subscription on the publish node for all topics; it listens to everything. So if a message is sent to the publish node, this MQTT connection and subscription make sure the subscribe node receives it too.
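Since the talk describes the reverse bridge as a small Go service that dials out to the publish node, subscribes to everything, and relays into the local broker, here is a minimal sketch of that idea using the Eclipse Paho MQTT client. The addresses, client IDs, and the use of the "#" wildcard are my assumptions for illustration, not the actual Hotstar code:

```go
package main

import (
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// mustConnect opens an MQTT connection and panics on failure.
func mustConnect(broker, id string) mqtt.Client {
	opts := mqtt.NewClientOptions().AddBroker(broker).SetClientID(id)
	c := mqtt.NewClient(opts)
	if tok := c.Connect(); tok.Wait() && tok.Error() != nil {
		panic(tok.Error())
	}
	return c
}

func main() {
	// The publish node's address is known up front (elastic IP or domain),
	// so each new subscribe node dials OUT to it: the reversal that makes
	// dynamic scale-up possible.
	publishNode := mustConnect("tcp://publish.pubsub.internal:1883", "bridge-out") // illustrative
	localBroker := mustConnect("tcp://127.0.0.1:1883", "bridge-in")                // EMQ on this node

	// Subscribe to everything on the publish node ("#" is MQTT's multi-level
	// wildcard) and relay each message into the local broker, which then
	// fans it out to the clients connected to this subscribe node.
	publishNode.Subscribe("#", 0, func(_ mqtt.Client, msg mqtt.Message) {
		localBroker.Publish(msg.Topic(), 0, false, msg.Payload())
	})

	select {} // keep relaying forever
}
```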
And that is where the beauty came in. It was a very simple service; I think it spans two or three files, and it took us a day or two to get it up. And we could now auto-scale the entire infra: we could start with one node and go up to a thousand nodes, for that matter. The image of the subscribe node, the AMI or the Docker image, had it built in that it now runs two services: one is the EMQ application, and the other is the reverse bridge application.

Cool. So guess what: 24 million passed. We tested 24 million concurrent connections on this infrastructure in one of our game day scenario tests, and this was in January; we were to go live in March. Hotstar had seen 10 million peak connections before, and in a scenario that replicates and simulates production traffic, we were able to do 24 million on this infrastructure. I had not seen that many zeros before in my life, so that was really good for us.

Cool. And you can imagine the impact by the number of rockets on the slide. These are some numbers for you to crunch: we ended up sending more than 250 billion messages through this service over IPL, a 45-day tournament. From the moment it went live in January till now, we have sent more than 700 billion messages, and by the time you read this, that number will have increased. We've powered some really good use cases, and I would say we have not even exhausted 25% of the entire infrastructure yet. So it's an open call for the entirety of Hotstar: whoever wants a real-time message delivery use case, please come and use PubSub. It is set up with simple APIs that you can easily integrate, over both HTTP and MQTT, and it really works like a SaaS solution you can use to send messages to clients in real time. We've thought about sending personalized recommendations in real time; we've thought about doing configuration updates in real time. There are a lot of use cases we've thought about, we are already covering many of them, and we are in the process of covering more. And I get to attend conferences while my infrastructure does the job for me: it was active yesterday and saw 7 million peak connections, and it will be active in tomorrow's match. If any of you want to experience it, just open the Hotstar app and watch a cricket match; you will see half the screen showing the experience I showed in the video. That is really the impact we've had.

That is it for me; thank you so much for attending. These are the ways you can reach me. We've written an in-depth blog about the entire thing, which covers the technicalities: it shares the tuning details, the applications, and how we really built this infrastructure, so you can go and read it. We've also written about the product and the design, so you can read about that too. The metrics I've added here are also in a report that Hotstar recently published, so you can refer to that as well. And how I like to end the slide is with this: the four milestones are the four milestones, but they do not really tell the whole picture.
This is how we started in September, and our excitement kind of just grew, and it has been growing ever since. That is really the beauty of the system we built. And that is it from me. If you have any questions, please, please, please.

So, in your system you had a single publish node, right? Wasn't that a single point of failure for the system? True, that is correct; it is a single point of failure. It has never failed till now, though, and I know that does not really answer your question. We could always have gone and added more publish nodes. I would say it never gave us enough worry to go and do it, but we could do it anytime. All I would need is for my backend services to understand which publish node to publish to, and each publish node would then be bridged to each subscribe node, so the number of bridges would double. We could have done it; we just didn't, and I don't really have a good reason for you.

And what would have happened if, say, the publish node went down? Oh, the entire infra would be down. Yeah, I know that; I don't deny it. While we've solved for so many things, I won't say we've solved for everything: there are still loopholes in the system, and again, we could go and fix them. We do have auto-recovery via ASG and other means, so if it goes down, it will come back up, and that node is already known via its elastic IP. So all I lose is the window between it going down and coming back up; that would be a downtime. There may be a backlog of messages in the system, but it is not like WhatsApp right now: if you lose messages, it's okay for the product. The moment we want to solve for, say, history or persistence, we could add that through MQTT, because MQTT by construct supports persistence. We didn't really have that use case, so we didn't worry about it as much.
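The persistence that answer alludes to is built into MQTT itself. A minimal sketch of the two relevant primitives, again with the Eclipse Paho Go client; the broker address and topics are illustrative, and this is my sketch, not anything Hotstar shipped:

```go
package main

import (
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Persistent session: with CleanSession=false, the broker remembers this
	// client's subscriptions and queues QoS >= 1 messages while it is offline.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // illustrative
		SetClientID("viewer-123").
		SetCleanSession(false)
	client := mqtt.NewClient(opts)
	if tok := client.Connect(); tok.Wait() && tok.Error() != nil {
		panic(tok.Error())
	}

	// QoS 1 means at-least-once, acknowledged delivery.
	client.Subscribe("match/feed", 1, nil).Wait()

	// Retained message: the broker stores the last retained payload per topic
	// and hands it to every new subscriber immediately on subscribe.
	client.Publish("match/score", 1, true, "IND 245/3 (42.0)").Wait()

	client.Disconnect(250)
}
```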
Next question: how do we scale the subscribers? A client's MQTT connection comes into the ELB as a TCP connection, and the connection onward to the backend subscriber node is again a persistent connection, so the load balancer holds double the connections the system supports. But if you add one new subscriber node, nothing ensures that new connections go to that new, empty server at that moment. So how do you manage adding new machines into the system?

Right. It would have been perfect if every new incoming connection went to the node with the fewest connections, and there is a strategy for that: a load balancer can send traffic to the node with the least traffic, although not for plain TCP. But the problem there was: say I get a surge, and one node I just scaled up has zero connections. I can't send all 250,000 connections to that one node, because if one connection takes, say, 100 milliseconds to go through, some would have to wait a very long time. What we wanted was not to open the floodgates on one node, but to send traffic to all the nodes gradually, and that is why we used the round-robin load balancing algorithm: round robin makes sure the load is first distributed among all the nodes. And that is where the second problem comes in; I'm sorry I didn't get to talk about it earlier, but we did solve for it.

The second thing we solved for: one node cannot handle more than 250,000 connections well. That does not mean it cannot accept more connections; it will accept them, it will just degrade the performance of the entire infra, and that is a problem for us, because incoming connections beyond the limit could clog up the 250,000 connections that are already established and working. So, first, the EMQ application has a configuration that says "this is the limit of connections I can handle", and we set it. Second, we limited the number of file descriptors and TCP sockets the node can use, which also caps the connections. So the node cannot go beyond what it can actually scale to: the load balancer tries to send it a connection, the node says "I cannot handle more" and rejects it. That was the first phase we got to. Then there is another strategy we used: we had a separate health ELB and a separate traffic ELB. We made sure that when a node fills up to the brim, it is removed from the load balancer; and when I say removed from the load balancer, the existing connections still persist, but the node is no longer in the pool to receive more connections. Did I explain that? I can repeat it if you want.

But does ELB support that? ELB itself does not support it, but when we talked to the AWS team, we came up with, you can call it a hack: a traffic ELB and a health ELB, two ELBs. My clients only know the traffic ELB; the health ELB is the one responsible for managing the instances that are spun up and their health. The moment a node fills up to the brim, I remove it from the traffic ELB, but it's still healthy, the connections stay up, and the messages go through. The moment it comes down from the brim, I attach it back to the traffic ELB. It's kind of a hack, but the AWS team kind of prescribed it to us; we didn't really have any other solution.
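A sketch of how that traffic-ELB/health-ELB hack could be driven, using the classic ELB API from aws-sdk-go; the ELB name, instance ID, and the idea of a monitoring-driven control loop are assumptions for illustration, not Hotstar's actual tooling:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elb"
)

const maxConnsPerNode = 250000 // the per-node limit discussed above

// setNodeInTrafficPool adds or removes a broker node from the traffic ELB.
// The health ELB is never touched, so existing connections stay alive; the
// node simply stops (or resumes) receiving NEW connections.
func setNodeInTrafficPool(svc *elb.ELB, instanceID string, inPool bool) error {
	instances := []*elb.Instance{{InstanceId: aws.String(instanceID)}}
	trafficELB := aws.String("pubsub-traffic-elb") // illustrative name
	var err error
	if inPool {
		_, err = svc.RegisterInstancesWithLoadBalancer(&elb.RegisterInstancesWithLoadBalancerInput{
			LoadBalancerName: trafficELB, Instances: instances,
		})
	} else {
		_, err = svc.DeregisterInstancesFromLoadBalancer(&elb.DeregisterInstancesFromLoadBalancerInput{
			LoadBalancerName: trafficELB, Instances: instances,
		})
	}
	return err
}

func main() {
	svc := elb.New(session.Must(session.NewSession()))
	// Hypothetical control step: pull the node out when it is at the brim,
	// put it back once connections drain below the limit.
	currentConns := 251000 // in reality this would come from monitoring
	if err := setNodeInTrafficPool(svc, "i-0123456789abcdef0", currentConns < maxConnsPerNode); err != nil {
		log.Fatal(err)
	}
}
```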
Thank you. Any other question? Yep. You have explained the subscriber side of things, but doesn't the publisher need to be able to handle the two million connections as well? No, at least not in our use case; I'll explain why. What we really wanted to solve was broadcast as a use case. There is one backend service which knows "this comment is really good, and I want to send it over to all the clients". So there is only one publisher; that one publisher publishes to the publish node, which again is a single node, and the publish node is responsible for transmitting that message to, say, 100 running subscribe nodes: 100 sends in parallel. Then the subscribe nodes send it on to all the clients. So I don't see how the publish node becomes the bottleneck, or why it would need to handle connections at scale; it does not, and it is not actually connected to the clients. PubSub as a construct says anyone can transmit to anyone, but the broadcast use case does not need that. When you send a comment, it does not go directly to me: it has to go through moderation. For now it is one-to-many and one-to-one; we solved broadcast and unicast, not something like WhatsApp, where a group can talk among itself. And obviously you understand why: we need to read the message before we send it to the clients.

Okay, how did we come up with that 250k number? It was not the first number we came up with. We started with something like 5,000, 10,000, and eventually got to a point where one node could hold 500,000 connections; but when we started sending messages, the message delivery was not good and the latency was bad. So we came down from 500k, and 250k was the number that was performant. That is how we set it. And yes, the first milestone is not just the first checkpoint we decided on; we arrived at that number iteratively, not up front.

How did you train the comment moderation models, and what latency did they add, especially with the Indian lingo and context? We should give another talk on that; it's a mammoth topic on its own. But to quote rough numbers, and don't hold me to them, it added somewhere around two to five seconds of latency. We have ML algorithms with a trained dataset that tell you whether a comment is profane or not; that is the first bar we wanted. Then we have a layer of manual moderation: manual moderators are sent highlighted comments, comments with higher confidence of profanity or higher confidence of relevancy. We had a huge list of sports terminology, team names, cricketer names, and things like that, and we would rank those messages higher; a manual moderator would then decide whether a message was good enough to go out or not.

Was it completely English-based? No; I can't really stop the user from typing in another language. There are keyboards that let you type it even if your app's keyboard doesn't allow it. We could have shown an error, but that is what we really did not want to do. What we did was identify Unicode characters to tell whether a comment is in English or not, and until we had solved for the other languages, we were not allowing Hindi. If I'm not wrong, I think we now have a fair amount of confidence on Hindi as well, and with manual moderation we allow some of it to pass. But it's still a challenge; it's not easy. In fact, Hindi is the easy one: solving for local languages is very difficult, because you don't have a vocabulary or a training dataset easily available.

Next: I don't know your application architecture, but couldn't the MQTT topic tree hierarchy be used to isolate the clusters a bit more? That would really mean going in and changing the application. It's obviously possible, I won't say it's not, but I'm not sure what advantage that would give us; have you thought of one? You're depending completely on the ELB and its quirks. Oh, God. But you could use the topic tree hierarchy to isolate things at that level.
But I would still need a URL to connect to, right? Let's say the nodes could decide among themselves that this bunch of nodes will only handle these topics. I think MQTT brokers do a trie-like routing based on the topic; I'm not trying to dismiss it, and maybe that could have been a solution. But personally, my bet is on the network load balancer, since the NLB is the one that is supposed to handle TCP connections, and it auto-scales as well, with no pre-warming request needed at all. It is a service that AWS builds, it is production-ready, and it has support, so I would much rather use a hosted solution than really build our own. But if, say, the ELB had not worked, we would have gone and changed the EMQ application. We did go and look at the Erlang code, by the way; one of my peers really did go in and solve an issue there, and we probably could have solved this as well, but we never really needed to. Anything else? Thank you. Cool. Thank you so much.