So, I am Gaurav, I am a cloud architect at Hotstar, and today we are going to talk about how we scale our application and the engineering effort that went into handling 25.3 million concurrent users during the recent World Cup. Before we start, a quick show of hands: how many of you watched the India–New Zealand semifinal on Hotstar? Wow, amazing.

We will start with this graph, because most of you will be able to relate to what it shows. This is the actual concurrency pattern we had on our platform, and this point here is the toss. When the toss happens, a notification goes out and a lot of users come in. The matches usually started at 3 pm, so at 2:30 you get a small spike of 3 or 4 million, and then people come back when the first ball is bowled.

This particular match was spread across two days because it was rain-affected. During the New Zealand batting, the peak was around 13.9 million. The dips you see in between, if you are a cricket follower, are the drinks breaks (in the IPL those are the strategic timeouts), which usually happen around the 16th and the 32nd over. For most of the New Zealand innings the concurrency was about 10 million, so this is not an event where you hit one peak and come straight back down: the graph shows the platform is resilient and stable enough to handle 10-plus million load for a stretch of multiple hours.

New Zealand were doing well and hit a peak of 13.9 million before the rain impacted the match. A lot of viewers went away, but the match wasn't called off until late in the night, around 10, and you can see there were still 4 to 5 million people on Hotstar waiting for the match to resume and for India to win. That sentiment was there, and this was a first for us: there was no cricket being played, just commentators and highlights being shown, and still 4 to 5 million people stayed glued to their screens waiting for the match to start.

When the match was finally called off, it got postponed to the next day. New Zealand still had 4 or 5 overs left, so the first spike you see is the sudden surge of traffic for those overs, and the dip after it is the innings break while the Indian team went in to pad up. You can relate the entire graph to the journey of a cricket innings.

The main thing I want to highlight is how quickly the traffic goes from 1 or 1.5 million to almost 15 million. That surge is what matters: you are talking about adding a million users every minute to the platform, and your backend, your EC2 fleet, whatever infrastructure you use, finds it very difficult to cope with that kind of traffic. Then a few wickets fell at regular intervals, but Dhoni stood up, and people almost thought we would win this match; there was a lot of positive hope, and that is where an interesting thing started to happen. Around mid-innings, around the 30th or 35th over, when Dhoni was batting really well, we started seeing almost a million users being added to the platform every minute.
Suddenly, from 13 or 14 million we went to almost 25.3 million, and that is the number at which Dhoni got out. It is very unfortunate, because that day Hotstar set a global record, but at the same time India was losing, so you have mixed feelings: you cannot really celebrate when your team is losing. We had to go through those factors as well.

The other thing you see is the sudden drop from 25 million. When people stop watching the match, they either press the back button or land on the home page. You have to understand this from an engineering perspective: if some of you are Android or iOS developers, you know that coming to the home page does not mean the traffic has really left the platform; now you are making more home page calls. On the Hotstar home page you have the masthead, you have the trays, you have personalized content like "continue watching" and "because you watched this, you should watch that". The recommendation and personalization engines really take a beating when something like this happens: you have 25 million people watching one piece of content, and suddenly your backend systems take a hit with all those calls. While you are watching a video, you are just requesting the playback file; not much happens at the API level. But when traffic shifts from live content to your home page at this scale, it is a huge hit.

Prakhar already spoke about this, so I am going to skip this slide, but I do want to talk about point three: 100 million in one day. The 25.3 million we saw in the graph was the peak, and that was also the first time we had 100 million unique users on the Hotstar platform in a single day. Out of those 100 million unique users, 25.3 million were active on the platform at a given instant. That is the scale, and it is a 2.5x increase over what we had before: this year during the IPL we did 18.6 million, and before that the biggest concurrency we had was 10.3 million during IPL 2018. So from last year to this year there has been a 2.5x increase in the concurrency we can handle.

Again, Prakhar covered most of this slide, but I want to stress the 10 Tbps number. This is the bandwidth we consume during live matches: we were doing upwards of 10 terabytes per second while streaming to 25 million users. To give more insight into that number, it is almost 70 to 75 percent of India's internet capacity, so it is really huge. You can scale your infrastructure and your application, but there is a limit to how much you can scale network delivery, because these are fiber optic cables, hard lines, last-mile links we are talking about. There is a limited amount of internet infrastructure available in India today, and we used almost 70 percent of it.
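To put those two numbers together, here is a quick back-of-envelope calculation; the per-viewer figure is my own arithmetic from the numbers in the talk, not an official Hotstar statistic:

```python
# Rough average video bitrate per viewer at peak, from the talk's figures.
peak_bandwidth_tbps = 10        # total egress while streaming the match
concurrent_viewers = 25.3e6     # peak concurrency

bits_per_second = peak_bandwidth_tbps * 1e12
avg_kbps = bits_per_second / concurrent_viewers / 1e3
print(f"~{avg_kbps:.0f} kbps per viewer on average")  # ~395 kbps
```

An average of roughly 400 kbps per viewer shows why compression and adaptive bitrates matter so much at this scale: every extra 100 kbps per user is another 2.5 Tbps of national bandwidth.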
So why is 25.3 million a big number? This should give some perspective. Before we set the record of 10.3 million last year, the global concurrency record was held by YouTube: the Red Bull supersonic space jump, which I think around 8 million people watched live. That happened in 2012, and from 2012 to 2018 nobody broke it. There were plenty of significant events in between, the Super Bowl, the Royal Wedding, the Donald Trump inauguration, which was again around 4 million, but no one could break that record until IPL 2018 happened. From IPL 2018 onward the concurrency has just kept going up, and if you compare us to the nearest competitor, we are at almost 3x of what they usually do.

So how do we prepare the platform to take this much heat, and how are the services able to handle that much load during a live event? We have something called Project Hulk. This is our in-house load generation, or load testing, project. A few numbers: we use c5.9xlarge machines, each with 36 vCPUs and around 72 GB of RAM, and we use upwards of 3,000 such machines just to generate the load. That is over 100K CPUs and 216 terabytes of RAM; that is the hardware that goes into running those load tests.

Earlier, when we did load testing for Hotstar, we used to impact other customers, because in a public cloud the network is shared among everyone, and your CDN partner's edge locations likewise serve all of their customers. We noticed that load testing at this scale affected other users and other customers, so we geo-distributed the load. Today our load generation is spread across eight regions: APAC, the Middle East, US regions, and even Ireland and London. It is spread across the globe so that no single region is overwhelmed and there is no impact on other customers or on the public cloud network. During a load test we push out 200 Gbps of network traffic.

This project is capable of mimicking the entire user journey. When you launch the app, a bunch of API calls happen: they check whether you are logged in, whether you have a valid payment method, whether you have a subscription, whether you are entitled to watch a particular piece of content. The order in which those API calls fire as the user interacts with the app is scripted, and we test that entire journey with variable inputs. We have scripts and tools that can mimic the whole app. If I have to replay the traffic pattern from the first slide, we have that capability: I can say, this was the India–New Zealand semifinal, this was the traffic pattern, this was the spike, and it will load test all the relevant APIs. You have to think about it this way: there are 25 million people on the platform, but are all 25 million making a call to a given API? No, so there is variation in those counts.

We also do a lot of chaos engineering to make sure the platform is resilient and can absorb certain failures without end users noticing, or with graceful degradation, which I will talk more about at the end. At a basic level, load generation looks like this: the c5.9xlarge machines are launched in multiple regions, the traffic goes over the internet to the CDN, and eventually hits the application: your ELB or ALB with the application sitting behind it. All of this happens through the eight regions we discussed. The main motive is to find whether a particular application has any bottlenecks or limits that have gone undiscovered.
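Project Hulk itself is internal, but the core idea, replaying a concurrency curve as scripted, ordered user journeys, can be sketched in a few lines. Everything below is illustrative: the endpoints, the curve, and the single-process design are my assumptions (the real setup spreads this across thousands of machines in eight regions):

```python
import asyncio
import aiohttp

BASE = "https://api.example.com"          # hypothetical API host
# Hypothetical (minute, concurrent users) curve, shaped like a match ramp-up.
TRAFFIC_CURVE = [(0, 100), (1, 500), (2, 2_000), (3, 5_000)]

async def user_journey(session: aiohttp.ClientSession) -> None:
    # The ordered calls an app launch makes: login check, subscription,
    # entitlement, then the playback token. All paths are made up.
    for path in ("/v1/login/status", "/v1/subscription",
                 "/v1/entitlement", "/v1/playback/token"):
        async with session.get(BASE + path) as resp:
            await resp.read()             # drain the body like a real client
    await asyncio.sleep(1)                # think time before the next action

async def drive(target_users: int) -> None:
    async with aiohttp.ClientSession() as session:
        # One coroutine per simulated user at this rung of the curve.
        await asyncio.gather(*(user_journey(session)
                               for _ in range(target_users)))

async def main() -> None:
    for minute, target in TRAFFIC_CURVE:
        print(f"t+{minute}m: driving ~{target} concurrent journeys")
        await drive(target)

asyncio.run(main())
```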
At a lower scale, if you just run a normal load test, most issues never surface; there are certain issues that only crop up once you cross the 10 or 15 million boundary.

So what does the scaling look like? In the EC2 world, we see a growth rate of 1 million users per minute, and our reaction time has to be under 90 seconds. You have a live match going on and you are scaled up to, say, a 10 million ladder, and suddenly a surge comes: five more million users in the next five minutes. Scaling up takes time: EC2 boot time, application health checks passing, the instance registering as healthy behind the load balancer. You cannot do all of that in under 30 seconds, so to account for it we keep a buffer and scale up in advance. Also, when there are interesting moments in the match, say a bowler takes a hat-trick, or Dhoni or Virat Kohli hits three sixes, the marketing team sends a lot of push notifications to engage users and bring more of them to the platform. Those notifications go out to a very large user base, 150 to 200 million users, and even if only 1 or 2 percent of them convert, that is 4 to 6 million people arriving within a fraction of a minute.

So why don't we use traditional autoscaling? Prakhar talked about the automation we built that scales the platform based on requests or on the ladder we want; these are a few of the reasons we moved away from the autoscaling that the cloud provider's native ASG and ELB offer. First, you get a lot of insufficient-capacity errors when you operate at scale. Anyone here ever requested an instance type and had it fail because the capacity wasn't there? At our scale we cannot wait out that error: a live match is going on, and if I requested 10 servers, I need them; if I don't get them, my customers are impacted. Second, and this was before EC2 Fleet came into the picture, a single Auto Scaling group supports only one instance type. What if that instance type is out of capacity in one AZ or one region? You cannot, in real time, create a new launch configuration, create a new Auto Scaling group, change the load balancer configuration, and shift traffic; that is simply not possible while a live match is going on. The third challenge is the Auto Scaling group's step size: when you raise the desired capacity from 10 to 100, the group launches in steps, roughly 10 servers every 30 seconds. That is far too slow for us; we are asking AWS for 400 to 500 servers at every ladder, and if 10 servers take 30 seconds, imagine scaling to 500. Fourth, even if you raise those limits, you put a lot of load on the APIs working in the background: when you request 100 or 200 servers, there are a lot of Get and Describe calls, DescribeTags, DescribeInstances, happening behind the scenes. If one of them fails, and your server reads values from EC2 at boot, the application never becomes healthy and just gets stuck in a loop.
The last reason is interesting. All of us are Game of Thrones fans, so this is a reference to that: the Game of Availability Zones. Consider three AZs, 1a, 1b, and 1c, and say one of them has no capacity. Your Auto Scaling group will try to launch servers in the remaining two, but this slows down your scaling, because the algorithm behind the ASG keeps trying to launch into the AZ that has no capacity. It tries 1a, success; 1b, success; 1c, failure, and it keeps retrying 1c in a loop, falling back to 1a and 1b only when that fails. Your infrastructure ends up skewed, with more instances in 1a and 1b than in 1c, and on top of that it adds exponential backoff to the retries: after every failure the retry interval grows from 30 seconds to 1 minute to 5 minutes to 10 minutes. We have seen launch times grow to 25 minutes because a single AZ lacked capacity. We cannot have that during a live match.

Those were a few of the reasons we don't rely on traditional autoscaling. Instead, we built our own autoscaling system, which scales the infrastructure at fixed ladders based on the traffic each application is currently seeing, whether that is request-based or the concurrency on the platform. This is what we do before the match when scaling for high concurrency: we pre-warm the infrastructure and do proactive scale-ups, and again, that is completely automated.

We use Spot Instances, and Spot Fleet specifically, because it gives you the option to choose multiple instance types, so you are not limited to a single type. In a Spot Fleet you can specify 50 different instance types, play around with the types of your choice, and spread them across multiple AZs, so you get diversification of the compute capacity you require. This, again, was before EC2 Fleet came into the picture. And since one ASG cannot have more than one instance type, we always kept a backup ASG in case we could not scale: say the primary runs c4.xlarge; we keep another ASG with c4.2xlarge, and if the first one fails to scale up, an SNS notification is triggered and a Lambda function automatically scales the second one. Those are the kinds of automations we had built in.
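As a rough sketch of the ladder idea, here is what the scaling decision plus a Spot Fleet capacity update could look like with boto3. The ladder values, buffer, and fleet ID are all placeholders; Hotstar's actual ladders and tooling are internal:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical ladder: platform concurrency -> fleet capacity (instances).
LADDER = [(1_000_000, 200), (5_000_000, 800),
          (10_000_000, 1_600), (15_000_000, 2_400)]
BUFFER = 1.25                             # scale ahead of the surge
FLEET_ID = "sfr-00000000-0000-0000-0000-000000000000"  # placeholder

def capacity_for(concurrency: int) -> int:
    """Pick the first ladder rung at or above the current concurrency."""
    for rung, capacity in LADDER:
        if concurrency <= rung:
            return int(capacity * BUFFER)
    return int(LADDER[-1][1] * BUFFER)    # beyond the top rung: max out

def scale_to(concurrency: int) -> None:
    # A single Spot Fleet request diversifies across many instance types and
    # AZs, so one capacity update replaces juggling per-type Auto Scaling
    # groups and per-AZ retries.
    ec2.modify_spot_fleet_request(
        SpotFleetRequestId=FLEET_ID,
        TargetCapacity=capacity_for(concurrency),
    )

scale_to(7_200_000)   # e.g. jump to the 10M rung, with a 25% buffer
```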
Now, talking about chaos engineering: what are the things that can cause actual chaos during a live match? These are a few of them. Push notifications I already spoke about. Increased latency has a lot of downstream effects: if one API's response time increases by 20 or 30 milliseconds, there is a big impact on downstream services. Say my content platform has an API that my recommendation or personalization engine uses; a 30 ms increase there can become a 200 ms increase downstream, because that service relies on the API to serve the end user. And the interesting part about the personalization engine is that these APIs cannot be cached, because every user has their own taste and their own personalized content: your watch history and recommendations are different from mine. These calls go all the way to the backend, so any increase in latency is simply not acceptable.

Delayed scale-up we also spoke about, with reference to EC2: when the scale-up needs to happen, it has to happen. You cannot say there is no capacity, or that due to issue X you will scale up after five minutes; the match is not going to wait for you. Tsunami traffic is what we saw in the first graph: the sudden growth and the sudden drop. That is something that can kill the platform or take the entire backend down, because you can choke at any level, the database, the cache, anything.

Another thing I missed mentioning: this year, while preparing through these game days, we ran an internal project so that we now know the rated capacity each application can support and the breaking point of each system. With chaos engineering and load testing of this kind, the visibility we get for each application is how much RPS it can support and where it breaks. When an application gets close to its breaking point, we apply some gimmicks to give it breathing room, so that the services are not impacted.

Bandwidth constraints relate to that 10 terabytes-per-second number I was talking about earlier. You can scale your EC2, you can grow your compute nodes from, say, 500 to 5,000, but bandwidth is not in your hands, because here we are talking about India's physical internet delivery infrastructure. There are a few big players in the industry, and whatever infrastructure they have set up, for example when a telecom provider rolls out 5G, a platform like Hotstar, or anyone in the industry, is going to utilize those pipes. If our requirement is bigger than the pipe, you simply cannot handle more users at that rate; you then need to look at compression and other things that reduce the video bandwidth each user consumes, so you can accommodate more users in the same pipe and network.

CDN failures are again very scary. If one of your CDN partner's edge or POP locations has issues, and you are serving, say, 10 million users, the users who were being served from that location will retry, and all those requests will come to origin. This is really bad: when you scale up gradually, cached content doesn't come to origin unless there is a refresh, but if an edge or POP location fails, all the users it was serving fall back to origin at once. At a 10 million scale, even if just 5 percent of those calls reach us, APIs that were rated and designed to take load in a gradual pattern suddenly get a flood of requests all at once. That is also something we have to take into consideration.

What we discover from these exercises and game days is essentially where the choke points are. Like I said, you cannot discover every issue by running a normal load test with 10 or 100 users; some patterns, like the cascading effect of a single latency increase in one application, are only discovered at scale.
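The rated-capacity idea can be sketched simply: every service carries the RPS limits learned from game days, and we check headroom against the breaking point before deciding to shed optional work. The service names, numbers, and the 20% threshold below are all illustrative, not Hotstar's actual values:

```python
# RPS limits per service as learned from game days (illustrative numbers).
RATED = {
    "recommendation": {"rated_rps": 40_000, "breaking_rps": 55_000},
    "watchlist":      {"rated_rps": 25_000, "breaking_rps": 30_000},
}

def headroom(service: str, current_rps: int) -> float:
    """Fraction of capacity left before the known breaking point."""
    return 1.0 - current_rps / RATED[service]["breaking_rps"]

def maybe_give_breathing_room(service: str, current_rps: int) -> None:
    if headroom(service, current_rps) < 0.20:   # within 20% of breaking
        # e.g. serve a cached or default response, raise client cache TTLs,
        # or switch to a cheaper code path until the spike passes.
        print(f"{service}: {current_rps} RPS, shedding optional work")

maybe_give_breathing_room("recommendation", 48_000)  # triggers shedding
```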
So while we achieved a global record of 25.3 million, the load testing we did to make our platform resilient was for 50 million. Even if the traffic had gone to 45 or 48 million, we would still have been able to handle that kind of load. It was unfortunate that Dhoni got out and we lost that day; otherwise we were prepared to handle even more load without users seeing any issues.

Breaking point of each system: that is the project I talked about earlier, where you have the information about what each application is capable of handling and what its breaking points are. The death wave is that sudden spike followed by the sudden drop; that is why I showed the graph at the start of the talk, so you could relate what it means to a live cricket match.

To overcome the things that cannot be fine-tuned or optimized with application tuning, we have panic mode. If we know that API X cannot handle more than 10 million users, we go into a graceful degradation mode. Panic mode basically means keeping all your critical services up and shutting down everything else, thereby making room and bandwidth for the key services. The key services for us are video delivery, ads, and subscription. A feature like the social chat or the emojis you send during the match is not that important next to video delivery. When you have 25.3 million users on the platform, you know everyone is watching the match, so non-match services like recommendation and personalization can be turned off, because 99 percent of your user base is just watching the match. By turning off these services, I reduce the load on my backend and on the network pipe, so there is more room for video and for the API calls that matter to users on the application.

We degrade gracefully in the same way elsewhere. Let me give you an example: if our login service has issues during a live match, it automatically goes into panic mode, and users are not even shown the login screen. It won't ask for a username and password, because we know a genuine user has come and our system has a problem, so we just let them bypass login and watch the match or the content they want. If, say, you have RDS or something in the background having issues, troubleshooting and fixing that problem can take time, and you don't want to impact the user in that moment of the game. Graceful degradation and panic mode are things we do a lot: we have a runbook where, at every 5 million users of concurrency, we take off certain services that don't add value to the customers watching the game, thereby creating more room for the stuff that is really important.
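A minimal sketch of that 5-million-interval runbook as data plus one function; the service names and thresholds are invented for illustration, and in practice this would drive feature flags rather than a print statement:

```python
# As concurrency climbs, progressively switch off services that don't matter
# to someone watching the match (illustrative names and thresholds).
DEGRADATION_RUNBOOK = [
    (10_000_000, {"social_chat", "emoji_reactions"}),
    (15_000_000, {"recommendations", "continue_watching"}),
    (20_000_000, {"personalized_trays", "search_suggestions"}),
]
CRITICAL = {"video_delivery", "ads", "subscription"}   # never switched off

def services_to_disable(concurrency: int) -> set[str]:
    """Everything at or below the current concurrency rung gets turned off."""
    disabled: set[str] = set()
    for threshold, services in DEGRADATION_RUNBOOK:
        if concurrency >= threshold:
            disabled |= services
    assert not disabled & CRITICAL        # critical services stay up
    return disabled

print(services_to_disable(16_000_000))
# -> social_chat, emoji_reactions, recommendations, continue_watching
```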
Here are the key takeaways. Prepare for failure: know what your systems are capable of and what their breaking limits are, so you can make decisions like turning services off or optimizing them. Understand the user journey: like I said, we know the entire journey, the moment you open the app, which API calls you will make; if you tap the watch page, which actions happen, how many API calls hit the backend and how many are served from cache. Understanding that pattern really helps you design your system and make your application and platform more resilient. And it is okay to degrade gracefully, because you don't want to show an error, or let the end user know there is a burning issue on your end. Take the same example: if my login system is down because of a DB CPU issue or a full disk, it is better to let the user watch the content and bypass login than to show an error screen saying the login system is temporarily down, which is a bad user experience. Cool, that's all I had for today. Thanks a lot. Do we have any questions?

Q: Hi Gaurav, thanks for the session, it's pretty cool; Nina here. I have two questions, both related to the scaling issues you start seeing at the scale Hotstar operates at. You've talked about scaling your application and compute capacity, but have you ever run into issues with the load balancers themselves, whether ELBs or Network Load Balancers, and how did you work around that? The second is that all these things you keep monitoring and doing, like warming up your instances, all consume AWS API calls, and there are per-account, per-service limits.

A: Addressing the first question about load balancers: though they are called elastic load balancers, they have their own pre-warm limits that they can scale up to, and our applications breach those daily. So all our load balancers are sharded: once the maximum level a single load balancer can scale or be pre-warmed to has been reached, we shard into multiples of three or five. We use five or more load balancers, each at its peak capacity, and the traffic is distributed through weighted routing. That is how you scale, because "elastic" is not unlimited in reality.

About the second question, API calls: there are a number of limits, even on the Get and Describe API calls across services, and the other issue we run into is CloudWatch API throttling, which Prakhar also mentioned; API throttling is a big issue. We work with AWS to see what can be done, and they increase our limits to the best level the service teams agree to, but in the end, if there is still throttling, we have to build our own solutions around it. In reality it doesn't always work out; I will say we still have open problems around API throttling. That is why, even for the security tools we run, we always test in our environment first, where we can log how many API calls they will consume. Even a simple security scanner that checks which users don't have MFA consumes an API call per check, and we don't want a utility or any other system eating even a small portion of the API call limit we have, because that limit is for production, not for tooling. But that pushes you toward separate account IDs, and then you are dealing with how to scale across accounts, which becomes more complex.
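The load balancer sharding with weighted routing described above might look something like this with weighted DNS records. This is one plausible way to do it: the hosted zone, record name, and shard DNS names are placeholders, and I am assuming Route 53-style weighted records rather than whatever Hotstar actually uses:

```python
import boto3

r53 = boto3.client("route53")

ZONE_ID = "Z0000000000000000000"                 # placeholder hosted zone
LB_SHARDS = {                                    # one pre-warmed ELB per shard
    "shard-1": "shard1-111.ap-south-1.elb.amazonaws.com",
    "shard-2": "shard2-222.ap-south-1.elb.amazonaws.com",
    "shard-3": "shard3-333.ap-south-1.elb.amazonaws.com",
}

# One weighted CNAME per shard: equal weights give each load balancer an
# equal slice of traffic, so no single ELB has to scale past its limits.
changes = [
    {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "CNAME",
            "SetIdentifier": shard_id,   # distinguishes the weighted records
            "Weight": 100,
            "TTL": 60,
            "ResourceRecords": [{"Value": dns_name}],
        },
    }
    for shard_id, dns_name in LB_SHARDS.items()
]

r53.change_resource_record_sets(HostedZoneId=ZONE_ID,
                                ChangeBatch={"Changes": changes})
```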
Q: Hi Gaurav. When you have such a critical window, and as you correctly point out you have dependencies on other folks in the stack, like the CDN and AWS, have you figured out a process or a game plan around that? Have you run through all the scenarios with them, with their support?

A: Yes, that is a year-round activity. We work with our partners to ensure that whatever requirements we have are in place, but sometimes it is simply not possible for them, because no one else in the world is doing video streaming at this scale, so for them too it is sometimes difficult. We work together to bridge that gap wherever possible, but yes, there are still issues.

Q: One question related to the bandwidth you talked about. You said you were utilizing 70 percent of India's bandwidth at 25 million users, and that you could scale your microservices for 50 million users. I don't think that would have held at 50 million; the bandwidth in India would have choked at that point, right?

A: Yes, we are also limited by that and by the partners we work with. I talked about compression and all of that, but it only goes to a limited level; video content is not compressible beyond a certain extent. In that case, what we do to accommodate more users is reduce quality: if people are watching content in HD, you lower that quality so you can fit more users into the same pool of resources. Graceful degradation again.

Q: Second question: microservices you can scale, but what about databases?

A: They are already provisioned for the peak, because those are not things we can scale in real time. Our game days give us a number: for so many users, this is the type of instance you should have, and that is provisioned throughout the tournament. So we have two sets of infra, one for the sports season and one for the non-sports season that we fall back to.

Q: I had a related question: you mentioned most of the infra is AWS-specific. Is there any other player in the country, say for something like the IPL, that you could partner with who can handle this kind of load?

A: I am not authorized to comment on that, but off the record, today's reality is that even if two or three partners came together and worked together, it would still be a tough thing, and I am not sure anyone else among the top three clouds has as big a presence here as AWS does.

Q: And the second related question, which is about consuming 70 percent of India's bandwidth: is it just off the shelf, you have an account, or do you do special planning when you are targeting 70 percent of a country's bandwidth?

A: This is on the video bandwidth side, but yes, like I said, we work with our partners around the year. It is not that you go to them two weeks before the IPL and say, I need this much capacity. IPL 2020 is still far away, but we have already given them our plans and what we are looking for. It is a year-round activity. Thanks.