Good morning, everyone. As introduced earlier, I will walk you through Hotstar's journey from EC2 to Kubernetes, specifically from IPL '18 to IPL '19. I am Prakhar Joshi, a software development engineer at Hotstar, where I do most of the infrastructure work.

A bit of context about Hotstar: we are the number one OTT platform in India. We have content in more than 15 languages, live and on-demand content, sports, retail content, and a lot more; I guess everyone has been to the website.

Let us dive into the numbers we face and the scale at which we operate. During the World Cup semifinal between India and New Zealand, we saw 25 million plus concurrency on our platform. We saw around 1.1 million RPS, not across all services but on specific services. We saw more than 10 million clickstream messages, which we provide to the data team, during that period. And we encode and transcode more than 100 hours of video content, and we do this every day. That is the scale at which we operate day to day.

To set context, we will go through Hotstar's journey: how it started at IPL '18, how we operated, what challenges we faced, what actions we took because of those challenges, and the state of Hotstar during the World Cup and currently.

So what was Hotstar at IPL '18? I will focus on the infrastructure platform Hotstar was running on. We were on an EC2 stack. We had ELBs acting as ingress for our applications, attached to auto scaling groups that scaled the applications as per requirements. We used pre-baked AMIs, created with Packer, to deploy the applications onto the EC2 machines. We used Jenkins for CI/CD pipelines, and Terraform as infrastructure-as-code for infrastructure deployments. We did nothing manually; we never asked anyone to make manual changes, because the next automation run or deployment would overwrite them. We believe you should always make changes in code, so you can trace them; it is difficult to debug something breaking in production because of a manual change made in one place that affects you somewhere else.

That was the state of Hotstar at IPL '18, and we saw around 10.3 million concurrency at that time. We might have seen more, but there were challenges and issues that we faced. I have listed those issues here, and we will discuss what they were and why we were not able to handle more than that, or why, even when we did handle it, the user experience suffered.

The first issue was handling surge. The peak of the traffic on our platform comes during cricket matches like the IPL or the World Cup.
When something interesting happens within a match, you see a surge of requests and a surge of new users coming to your platform, and you have to handle them seamlessly. That surge handling is quite difficult on EC2: we were not able to scale as much as we wanted during a surge, because EC2 machines take time to boot. If your application runs on EC2 and you have to scale from, say, 20 machines to 30, spinning up 10 machines at a time takes a while, and that is just one service. We had more than 40 services running, and you have to scale everything; you can't just scale one service and be done.

That first issue led to the second one. Because a surge can come at any moment during a live event, we used to keep a buffer of resources: if this much surge comes, we have the resources ready. Keeping an extra buffer is not good at the scale we run at. If you are already serving 21 million concurrency and still keeping a buffer in case it goes higher, it gets difficult, and you can run into hardware problems too: we have seen capacity issues at AWS, where a specific instance type is simply not available in a region.

Another thing we saw: when you scale up that many EC2 machines, EC2 internally calls AWS APIs, and we saw those calls being throttled. That meant we could not scale all at once; we had to do step scaling. Say we decide we are running a 20-million-concurrency platform and want to go to 25 million, and that needs 50 instances at a time for one service; across the platform that can be 200-300 instances at a time. There are account-level rate limits and other internal AWS controls, and we ran into those issues as well.

Those were the challenges we kept trying to overcome during live events. At some point we decided that whatever patches or logic we were putting behind it, we were missing something: if we want to break more records, double the concurrency, or make our platform faster, we have to think of something else. So we started exploring what other platforms we could use for Hotstar: containers, and how to orchestrate, run, and manage them.

We found a couple of plus points. Containers you can run anywhere: if your application is containerized, you don't need a full local setup, and you don't have to depend on one cloud provider. Say, just as an example, we are using Docker.
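To make that portability point concrete, here is a minimal, hypothetical Docker Compose sketch; the service names and images are made up for illustration and are not Hotstar's actual services:

```yaml
# docker-compose.yml -- hypothetical two-service stack for local development.
# One `docker compose up` brings the whole thing up on any machine with Docker.
version: "3.8"
services:
  playback-api:            # hypothetical application service
    build: ./playback-api  # built from its own Dockerfile
    ports:
      - "8080:8080"
    depends_on:
      - redis
  redis:                   # a backing dependency, pulled as a stock image
    image: redis:6-alpine
```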
So, as in the sketch above: if you have ten applications with ten Dockerfiles, you create one Docker Compose file and you can spin up the whole infrastructure anywhere. Local deployment becomes fast, and you don't depend on one cloud provider; all of these were positives. Booting up containers is also quite easy compared to EC2: say you are running 10 containers and need 15, that is trivial compared to spinning up new EC2 machines.

Resources also get optimized. Earlier we dedicated a whole EC2 machine to one application. Say my JVM requires 2.5 cores and some amount of memory, but the smallest instance available has 3 or 4 cores; I have to take the 4-core machine, I can't get a 2.5-core one. At scale this bites you: you can't just throw away 1.5 or 2 cores per instance because the exact size isn't available. With containers you can specify exactly the CPU and memory your application wants (see the manifest sketch after this part). And since booting a container is easy, scaling becomes almost seamless. There are challenges, it's not like you throw a number at it and it scales to that, but compared to EC2 it is seamless for us.

Coming to production: we created base container images, and applications are deployed on top of them, so the whole deployment procedure got standardized. Earlier, applications were deployed in their own ways, and teams bootstrapped instances differently; those mismatches across the organization have been reduced. Now, as an infrastructure guy, I know what the platform looks like for any application. If I have to debug a production issue for any application, I no longer have to switch context and first learn how that team's production setup works. All of that has improved.

So we decided to go with Docker containers, but how do we orchestrate them? We found Kubernetes to be the best fit for handling our infrastructure. Because of Kubernetes we have standardized deployments across Hotstar. The way an application deploys now is: you just give me your application name, some context about the application, the resources you want, and your Docker image, and I will deploy everything on the cloud for you. It is standardized; nothing is manual, so problems like required tags missing or parameters absent have gone away. Since deployment is standardized, we now have a single flow for deploying applications: we use ksonnet to define Kubernetes objects, render them to YAML, and apply them on the cluster.
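That resource point looks roughly like this in a Kubernetes manifest, the kind of object our ksonnet flow ultimately renders. A minimal fragment with hypothetical names and numbers:

```yaml
# Fragment of a Deployment pod spec -- hypothetical service and sizes.
# Instead of renting a whole 4-core EC2 box for a 2.5-core JVM,
# the container asks the scheduler for exactly what it needs.
containers:
  - name: playback-api                        # hypothetical name
    image: registry.example.com/playback-api:1.4.2
    resources:
      requests:            # what the scheduler reserves on a node
        cpu: "2500m"       # 2.5 cores
        memory: "2.5Gi"
      limits:              # hard ceiling for the container
        cpu: "3"
        memory: "3Gi"
```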
These things helped us. Developers also don't have to think much about how their alerts get set up or where their logs will go; all of that is provided out of the box, because our base images take care of it. Logging and alerting have improved too. Earlier, some applications logged debug, some logged only errors, some logged something else entirely. Now it is standardized: every application declares what type of logging it wants, and as an infrastructure guy I should know why somebody is logging debug logs if it's not required, because at the end of the day I have to manage those logs. From a single place, I now know what is going on across the organization. For CI/CD we use GoCD, which is itself deployed on the Kubernetes cluster; that is the best fit we have found for us.

But apart from all of this, the best thing we got is request-based scaling for our applications. In the EC2 world, we would see the platform running at some x concurrency and then have to scale for 1.5x or 2x. There was manual work; we had written scripts and everything, but you never know when you will have to go to 1.5x or 2x, so somebody had to manually trigger the scripts, perform the scaling, and check everything. Now our applications are intelligent enough to scale themselves based on the requests they receive: they spin up containers on their own through the horizontal pod autoscaling that Kubernetes provides, fed by the Prometheus metrics for the requests each application is getting. That combination of request-based scaling made life much easier for us. During a live event there used to be a person responsible for scaling Hotstar; now it scales on its own, and you don't have to worry that something interesting is happening and you must scale the platform or latencies will rise and errors will follow. All of that toil has been reduced.
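To make the request-based scaling concrete, below is a minimal sketch of a Kubernetes HorizontalPodAutoscaler driven by a per-pod requests-per-second metric. The metric name, target number, and service name are hypothetical, and exposing Prometheus metrics to the HPA requires a metrics adapter (for example the prometheus-adapter project); the talk does not specify which component Hotstar used for that bridge.

```yaml
# Hypothetical HPA scaling on requests per second per pod.
# Scaling math, from the documented Kubernetes HPA algorithm:
#   desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
# e.g. 20 pods averaging 15,000 RPS against a 10,000 RPS target
#      -> ceil(20 * 15000 / 10000) = 30 pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: playback-api            # hypothetical service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: playback-api
  minReplicas: 20
  maxReplicas: 500
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # served via a metrics adapter
        target:
          type: AverageValue
          averageValue: "10000"
```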
So far so good, but at the level of Hotstar, where you have more than 50 services, just thinking about the migration, let alone doing it, is a different story. Introducing a new thing for one team is okay, but introducing it at the organization level brings a lot of challenges. First, we had to dockerize the applications, and there are a couple of ways to do that. One is to tell developers: here is a sample Dockerfile, use it to create your own. The other is to give them base Docker images that they build their applications on top of. Then, which base image should we provide? At an organization with 40-50 teams you have to cover all the edge cases, but you don't want a single base image so huge that at scale-up time a node spends its time pulling gigabytes of image before the container can even start. These are issues we faced and have since improved.

Another issue: earlier, applications ran on dedicated EC2 machines and got whatever CPU and memory the machine had. We reduced that, because we saw we were giving applications more than they needed. Say a JVM was getting 4 cores and 4 GB; we realized it can work on 2.5-3 cores and 2.5-3 GB. Why waste a core and an extra GB per container when you are running a thousand containers for an application? You are just wasting hardware. But when we reduced resources, we had to make sure everything still worked and performance held up, so we had to re-check application performance from scratch, for every application; it was as if each application was going to production for the first time all over again.

The third thing was cost optimization. It's not that we cut a crazy amount of cost, but the features we get for the cost we pay have really improved, and the way we scale our applications and do performance tuning has become much better.

The fourth thing, as I said, was the autopilot mode, the request-based scaling: you have to tell the Kubernetes HPA object the number at which each application scales, say at 10,000 requests, or at 5,000. Finding that number is difficult. It looks easy now, scale at 10,000, but why 10,000 and not some other number? Why can't it handle more requests? Why was it running at 5,000 earlier? Answering those questions for every service was a challenge we had to work through.

The other thing was cluster management: we manage our own clusters. When you scale your applications, you have to make sure your infra components are scaling with them, Prometheus, Filebeat, the DNS running in the cluster, all the extra infra components you run have to keep working properly.

Once we started migrating, we focused on making the platform ready for the live events, the big days when we see crazy traffic. We load-tested the applications; we have a load generator and related tooling, and I think Gaurav will be talking about how we test our applications and prepare them for live events. During game day preparation we realized that some of the scaling parameters we had set were not tuned, and that we could tune them better so that at peak load we are not just throwing resources at services.

The last thing was capacity allocation: even though you are moving away from EC2 as a deployment target, your nodes are still hardware, and that hardware you still have to scale.
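For intuition, here is a rough, purely hypothetical bin-packing calculation (none of these numbers are Hotstar's): if an application runs 1,000 pods, each requesting 2.5 cores and 2.5 GiB, that is about 2,500 cores and 2,500 GiB in total. On nodes with 36 cores and 72 GiB, reserving roughly 2 cores and 4 GiB per node for the system and per-node agents, each node offers about 34 cores and 68 GiB, so you need on the order of 2500 / 34 ≈ 74 nodes on CPU alone, plus headroom for surge; memory works out to 2500 / 68 ≈ 37 nodes, so CPU is the binding constraint in this example.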
If you are running 200 or 500 nodes, you need 500 EC2 instances, hardware underneath that provides the CPU and memory and carries your containers. We worked on that allocation as well.

Again, as I mentioned, there is the horizontal scaling of infra resources. If one application is not scaling, it affects one part of the application, one part of Hotstar, one part of the platform. But if your infra components are not scaling, it can hurt you at the cluster or organization level. Say your Prometheus is not scaling: you will not get any metrics, and if you are not getting metrics, your applications cannot scale, because they do not know how many requests they are currently getting. We kept these things in mind and focused on these elements as we moved towards the IPL. (A sketch of one common per-node-agent pattern follows this part.)

That was the state of Hotstar at IPL '19. We use GoCD for CI/CD pipelines. We use Vault for secret management: at runtime we inject the secrets into the Docker containers, for the applications inside, when they spin up. We used Terraform earlier and we love Terraform, so we still use Terraform for infra deployments. And we do in-house cluster management: all the core infra components, the networking inside the cluster, Prometheus, logging, alerting, monitoring, everything we run ourselves.

The major thing we achieved is the autopilot mode for scaling our applications, and that was a huge win for us. As an infrastructure guy or a DevOps person, you know what it is like to be on duty during events, responsible for scaling your applications. Once is okay, twice is okay, but IPL runs for 60 days, and then the World Cup comes and runs for another 45 days. It is not humanly possible; at some point you get frustrated and things slip. That is the big thing we achieved in the migration.

Here is a bit of the comparison we drew in migrating from EC2 to K8s. One more thing: it is not that EC2 was bad or that K8s is some next-level thing; there are issues in Kubernetes and there were issues in EC2. But for the requirements and use cases we had, EC2 was not a fit, and it was easier for us to move and run our applications on Kubernetes.

One main difference: earlier, patching infra components was difficult. Say we were using some component for logging and a security issue was found in it. Every application used the same component, but it was installed on each EC2 machine separately, so you had to go and change every Packer setup. That was the old pain. Now everything lives in our base Docker images: whenever we have to patch anything, we patch the base image, update the tag, and applications pick up the new tag the next time they deploy. Devs don't have to care much about why we are changing it; we create tickets for them, update to this version, because these are security issues.
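On the infra-scaling point referenced above: one common pattern, and an assumption on my part rather than a confirmed detail of Hotstar's setup, is running a log shipper like Filebeat as a Kubernetes DaemonSet, so exactly one agent runs per node and shipping capacity grows automatically as nodes are added. A minimal sketch:

```yaml
# Hypothetical DaemonSet for a per-node log shipper.
# Kubernetes schedules one pod per node, so the agent fleet
# scales with the cluster with no extra autoscaling logic.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: logging
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.17.0
          resources:
            requests:
              cpu: "100m"
              memory: "200Mi"
          volumeMounts:
            - name: varlog
              mountPath: /var/log   # read container/system logs from the host
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```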
These kinds of chores have reduced a lot, and that really speeds up deployment and day-to-day functioning at an organization level. Another thing, and I keep coming back to the autopilot mode because we know the pain we faced at IPL '18 and during the Asia Cup: how difficult it is to be on duty, because it is a manual thing, right? If you miss by two minutes, one minute, even seconds, if you miss giving a scale-up call or running the scale-up script, your user experience degrades, and that can have impact at a global level. Those are critical components, yet we were doing them by hand. Those were the major differences we saw. And for the traffic patterns we receive, scaling containers is quite easy compared to EC2, given the challenges we have already discussed.

To summarize what we have done in the last one and a half years: we scale applications on a request-based model, so each application scales on its own based on the requests it is getting, and all the back-of-envelope questions, how many backends do you need for this concurrency or that concurrency, have gone away. Scaling is better on K8s than on EC2, where we had to keep a buffer of resources and wasted a lot of them: there was the cost issue, the hardware availability problem, and having to talk to AWS in advance. We still talk to cloud providers ahead of big events, but far less than before; earlier, if a resource was not available, the application could not scale, or we had to compromise on node type. We also standardized deployments. It sounds like an obvious thing that should just be there, but believe me, when you have 50-60 teams growing together, it is difficult to keep track of them all, so standardizing from where we were was a huge success. And once deployment is standardized, bringing any application to production is easy: it now takes less than a day to set up an application from scratch once the team has it ready and we can create the Docker image.

Yeah, thank you. That is all from my side. You can follow me or catch up with me on Twitter, this is my Twitter handle, or connect over the official email address. And do visit tech.hotstar.com and read the blogs we are writing.

Thanks, Prakhar. Questions?

Yeah, I have a question. You spoke about moving from EC2 to K8s, that is, Kubernetes, open-source containers. Was this about moving away from Amazon? Amazon also has ECS, its own container service, and it has really good integration with Terraform, which I think is much simpler than using K8s with Terraform.

Yeah, so first of all, it was not about any cloud provider; it was just the pain we were facing given our requirements. We tried ECS too, but the use cases and the cluster-level customization we need were not available on ECS at the time we started migrating.
You don't have that much control at the cluster level: the way we want to tune our applications, the way we want to tune our master nodes to behave in certain ways, those things were not available on ECS; you can't control that much yourself. So we decided to move away from a service that manages the masters and the control plane for us, and do it in-house. That was the major reason we went to in-house cluster management rather than the managed offerings. We are still on AWS; it is not that we moved away, our nodes are still there. It is just that the management of the nodes, the management of the master nodes, and the DNS of the cluster, these things we handle ourselves instead of getting them from a third party. Yeah, thank you.

Hello. Yeah. I think he partly covered what I wanted to ask, whether you really moved from AWS to something else. Your talk was framed as EC2 versus Kubernetes, and comparing them is kind of confusing, because you are ultimately using EC2 underneath; you are just managing EC2 by using Kubernetes.

No, no, let me make this clear: I am not comparing two products here. I was describing how we faced issues on one platform and how the other platform helped us fight those issues and made life easier, for me as an infrastructure guy and at an organization level. I am not saying EC2 is bad, or that something else is good, or that you should migrate; it is not like that. For the use cases we had, we were facing issues, and in live events we saw we could not operate to the expectations we had. That is why we moved. In Kubernetes we also use nodes, right? The nodes run on EC2 itself. But to give an instance: say you have 10 applications. Earlier, each process of each application ran on its own EC2 machine; that was the architecture we had, so each application consumed EC2 instances directly, we wasted a lot of resources, and scaling became difficult. I am not saying Kubernetes is one click and overnight you can scale to anything; even at 25 million we had our own challenges and issues, and we had to scale nodes too. But those problems have reduced so much that we can now focus on our core infra components instead of "we are not able to scale our platform": making our Prometheus better, making our logging better, making our other products better. We get time now to work on those things. I think that is what I wanted to convey.

Quick question, we have half a minute. Is it working here in the meantime? Okay. Sorry. My question is: you mentioned Prometheus, but what do you use for centralized logging, alerting, monitoring, and so on? What tools? And you haven't even mentioned CDNs, so how much do they help you in your...

Sorry to cut you off; we'll start with the next session, and we can take it offline. Thank you very much. Thanks, Prakhar.