So that was a very good, very informative session that we had on Helm. At Carousell we don't really use Helm, but we have been exploring it for a while now. We want to move some operations over to Helm and see how we can use it, but as of now we don't use it; we have plans for the future. So let me quickly get started on telling you how we do Kubernetes at Carousell.

A quick intro of me: I am a senior engineer in the systems engineering team. I previously worked at a few companies, and I mostly do infrastructure, application development and scaling.

Let me give you a bit of background so that you understand what scale Carousell is at. Currently we have close to 600 servers. Most of these are 8-core or 16-core machines, so you can get a basic idea of the core counts. On our external load balancers we see roughly 5,000 requests per second, not per minute, per second, on average. We have crazy troughs and crests, but that is the average. And there is an internal amplification of these requests because we have multiple services underneath. The amplification can be anywhere from 8 to 12, so take roughly 10: that means we are doing about 50,000 queries per second across our entire infrastructure, which is quite a lot.

Next, we use a lot of self-managed deployments: Elasticsearch, Postgres, Cassandra, Kafka, Redis, RabbitMQ. If you can think of a proper storage system or application server, we probably use it. One thing which is special is that we auto-scale our Elasticsearch, which I think is quite unusual because Elasticsearch is primarily used for search: when you query in the app, it goes through search. But as you know, at night the traffic is very low, so we don't want to over-provision our Elasticsearch cluster. So we auto-scale it, and it grows and shrinks according to traffic. It is very interesting, a bit difficult, a bit tricky, and we have alerting on it, but so far it has been working for more than six months now, which is really great.

We have an uptime of 99.95%, and we are able to handle multiple availability-zone failures. We have held 99.95% consistently for about a year. I think only one service in the past has missed this; the majority of services have met or exceeded 99.95% availability.

I'll just quickly go over the architecture. I think you would have seen this diagram; I've been using it for quite a while now, over a year, and I don't think much has changed. Basically, all our clients, Android, iOS and web (yes, we treat web as a different client), use the same set of APIs and go through an external load balancer. These are different layers, like GCP or Cloudflare or whatever. Inside that we have our routing, security and load-balancing layer, which is what we manage: primarily HAProxy or OpenResty servers. Whatever we want to do, modify a request, do some rate limiting, whatever, we do it at this layer. Inside that we run our applications, so behind this, this whole cluster is where all our applications are. Then we have something called internal load balancers, which I'll go over in a while, and then we have the storage engines.
There are two ways that services interact, either with storage or among themselves: they either call through internal load balancers or they call the storage engines directly. Some of the storage engines are behind internal load balancers, the primary ones being caches, because even if you have a cache miss it's fine, but you don't want that call to take too long.

Then we have two vertical layers. One is monitoring: we use a mix of tools here, from external tools like Pingdom and VictorOps to internal log management with ELK. We use Prometheus very aggressively for monitoring all of this, and we do some alerting from Prometheus as well. We have Zipkin and Hystrix, primarily for all our new microservices. And on top we have our configuration service.

We rely on the configuration service very heavily to do all sorts of discovery. This configuration service is built on top of Consul. If you are not aware of Consul, it is similar to etcd; etcd is what gives you the consistent (CP) persistence in Kubernetes. We went with Consul because we wanted a few more features that Consul offers. Both are pretty similar; they both run Raft underneath, which is a consensus algorithm. But Consul gives us a bit more of the features we wanted, so we built the configuration service on top of it.

The main advantage of the configuration service is that we don't have anything hard-coded in our APIs. There is nothing hard-coded: if a service has to talk to another service, it just discovers it and calls it. This gives us the flexibility to run those services anywhere. We can run a service inside a Kubernetes cluster, in an instance group, or across multiple clusters; it doesn't matter, because discovery happens by name. It is also not DNS, because we have seen DNS become a bit flaky: in Kubernetes you have kube-dns, you have multiple instances of it, sometimes the updates don't go through, and sometimes you'll have a few client libraries that just keep caching a DNS entry despite it having a proper TTL. So we just use the configuration service and change the IPs there. We have everything as IPs, but the services aren't even aware of it; they tell the config service they need a set of IPs for a particular service and they get them. This way we can use any IP and any port and no service is affected. It also gives us a very interesting way of doing upgrades: we can route the entire traffic from one cluster to another without any other service knowing about it, because their traffic just routes through.

When it comes to Kubernetes, we use rolling updates across a cluster. Whenever we deploy something, we set a new version of it, and these are all kubectl commands run through a deployment pipeline. We just go there and say this new image is available, go ahead and deploy it, and the cluster automatically grows. We normally do a max surge of 20%. The way it works is, think of it like a blue-green sort of deployment: you have one version running, let's say 100, which is your current version, and you want to release 101. So version 100 is running.
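As a rough sketch of what that strategy looks like in a Deployment spec (the service name, image tag, replica count and the maxUnavailable value are illustrative assumptions, not our actual manifests):

```yaml
# Illustrative only: a Deployment using the ~20% max-surge rolling update
# described above. Names, image and replica count are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 100
  selector:
    matchLabels:
      app: example-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: "20%"      # create up to 20% extra pods running the new version
      maxUnavailable: 0    # assumption: never dip below the old version's capacity
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: registry.internal/example-service:101   # the new version, "101"
```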
We create 20% more pods for version 101; when they are up and stable, the next set of pods is created for 101 and the older ones for 100 are deleted. In this way we go step by step and roll over. The advantage of this approach is that if something goes wrong, say there was a particular case that was not tested, or there was a bug, a panic, an exception, something causing our containers to crash, we will never grow beyond that step, because the health check, the liveness probe, will keep failing. We will be in a sort of degraded state, but we will never have a complete outage.

Next, we use custom commands. We use the same container, and we can schedule it in a development, staging or production environment. Based on these commands we just specify which environment the particular code is running in, and everything else follows, because service discovery and configuration happen through the configuration service. Based on the command we set the environment: you are in staging, fetch staging configurations; you are in production, fetch production configurations; you are in dev, fetch dev configurations. So the same code base that runs on your dev machine is guaranteed to run in production, as long as the configurations are the same.

Next, we use environment variables to override certain parameters from the configuration service, and this works transparently. We have our own in-built framework, which we call Orion. Orion is written in Golang; it was initially inspired by Go kit, but we did a complete rewrite of it recently. It is based on proto, so it uses gRPC-style service definitions. You just run a create and it generates the entire service for you, initialises everything, including Docker configurations, pipelines, everything, and you get Hystrix, Zipkin, error logging and alerting built in right out of the box. The environment variables we have can override certain parameters, because some services always need a way out of your standard approach. Whenever you do standardisation it will never fit all your services; you'll always have that 80-20 rule, where 80% of your services run on the standard and 20% do not. They will come to you and say they can't use your framework because they need certain overrides, or a particular case doesn't work for them. In that case we can disable or enable specific features inside the framework just by using environment variables, and it works seamlessly with the commands.

Next, as I said, liveness and readiness checks are really important. The reason you should have liveness and readiness checks all the time is that most people just ignore this part and let their container run. Then, when your server crashes, or there is a deadlock and your server is not accepting any connections, nothing happens. When your service is down you run kubectl to see what is happening with your pods, and they look completely fine. You exec into them and see nothing obviously wrong; they are just not serving any traffic, and you don't know why. These sorts of deadlock scenarios or crashes are very difficult to identify and fix. But the simplest fix is that if you have liveness and readiness checks, the container will be killed and a new container will be created. Think of it like the way a health check works on a load balancer: the load balancer keeps calling a health-check API, and whenever that fails, it removes the instance so no traffic goes to it. In the same way Kubernetes identifies that this guy is not serving. You have parameters you can tweak: three failures, kick this guy; five failures; whatever you configure. Once you have that configured, if your containers are crashing you clearly know something is wrong and the health check is failing, so either your health check is taking too long or your application has a bug. But your service never goes down, which is another very good advantage.
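As a minimal sketch of the commands, environment overrides and probes just described, here is a container fragment that would slot into the pod template of the Deployment sketched earlier; the binary path, health endpoint, port and override variable are all hypothetical:

```yaml
# Illustrative container spec fragment: environment selection plus liveness
# and readiness checks. Paths, ports and variable names are assumptions.
containers:
  - name: example-service
    image: registry.internal/example-service:101
    command: ["/app/example-service"]
    args: ["--env=$(APP_ENV)"]         # same image runs in dev, staging and prod
    env:
      - name: APP_ENV
        value: production              # tells the service which configs to fetch
      - name: ORION_DISABLE_HYSTRIX    # hypothetical framework feature override
        value: "true"
    readinessProbe:                    # rollout only proceeds once new pods are ready
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
    livenessProbe:                     # kill and recreate the container if this keeps failing
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 3              # "three failures, kick this guy"
      periodSeconds: 10
```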
Next, we use auto scaling. This is auto scaling for the Kubernetes cluster itself; these are Kubernetes nodes, not just pods. Here I have a graph: the green, stepped part indicates the number of nodes running in our Kubernetes cluster, and the orange line on top is the QPS, the queries per second, how many requests we are getting per second. As the queries go up, you can see that based on the time of day, early morning around four o'clock, we have hardly any requests, but around eight o'clock, and after ten, we hit a peak. We also see certain peaks when the MRT goes down; I guess people are like, I'm bored, I'm going to buy something.

The advantage of this approach is that we only pay for the area under the curve. If you had to put a number on cost, the cost is the area under this green curve; that is the amount we pay. Whereas if you were doing a fixed amount of provisioning for your cluster, you would have to provision at least more than your maximum, say up here, and pay for all that capacity we are not really using. Since we are in the cloud this really helps us, and we save a lot of cost because of it.

The other thing you can see are these particular arrows; these are deployments. As I said, we do a 20% max surge, so 20% new containers are created. Say we were running 100 containers of some service and we do a deployment; now there are 120 containers running. To support that much container growth we add more nodes dynamically, and once the deployment is done, the cluster shrinks back. These peaks represent deployments of our major component, the Django monolith, which we call Carousell Django internally.

As I've already said, we do not use Kubernetes Ingress or Kubernetes Services, because we rely completely on the configuration and service discovery that we have built. The configuration service discovery engine runs as a DaemonSet on all the nodes. Every node has this configuration-service agent as well as an agent for Consul, because Consul does gossip, which makes propagation easier, and we can specify some limits so it doesn't choke the network too much. All the containers running on a particular node then get registered in the config service using the node port.
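A minimal sketch of what such a per-node agent could look like as a DaemonSet, assuming a plain Consul agent; the image, join address, node-IP wiring and resource limits are illustrative, not the actual manifests:

```yaml
# Illustrative only: a Consul agent running on every node as a DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: consul-agent
spec:
  selector:
    matchLabels:
      app: consul-agent
  template:
    metadata:
      labels:
        app: consul-agent
    spec:
      hostNetwork: true                  # agent is reachable on the node's own IP
      containers:
        - name: consul
          image: consul:1.4.2            # hypothetical version
          env:
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          args:
            - agent
            - -bind=$(NODE_IP)           # gossip on the node's address
            - -retry-join=consul-server.internal   # hypothetical server address
          resources:
            limits:                      # modest footprint on every node
              cpu: 100m
              memory: 128Mi
```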
The other reason we use the node port here is that when we started with Kubernetes, which was, I think, 1.3, the internal routing layer had some latency issues, and some of our services are very latency-sensitive. We do not want them adding latency, because of our amplification: imagine you have 50,000 requests per second coming in and you delay every request by 1 millisecond; you will have a huge pile-up. You would have to maintain more clusters and more nodes, and our first layer of load balancers would need to be even bigger, because all those connections need to be established and kept flowing, and a lot of other stuff. So everything for us is very latency-sensitive.

The advantage of this is that we don't need to change anything in the existing architecture. Tomorrow, if we want, we can move off Kubernetes to, let's say, Marathon; we would just deploy there and put everything in it, and it doesn't matter, nobody needs to care where your containers or your services are running. Everything just keeps running. The other advantage is that service discovery from the internal load balancers stays streamlined. We don't have to worry that we are running this in Kubernetes, that Kubernetes works in such and such a way, so we need to give it a service IP, or whitelist something, or do something special. We don't have to worry about any of that.

So, as I said, the config service allows us to have a hybrid model. Services can be deployed across multiple clusters, which we do, and in instance groups as well as in Kubernetes. When we were deciding on migrating to Kubernetes at Carousell, we said, okay, we are not so sure this is going to work. This was done quite a while back, October of 2016, so more than one and a half years ago. So we ran them coexisting, half of our deployments in Kubernetes and half in instance groups, for close to a month. We monitored, we measured a lot of metrics: is the scheduling proper, are we paying more because we are running Kubernetes, how is it working? And then we decided that it actually makes sense for us. The ratios we get, the cost savings, and the packing and resource utilisation are pretty good. So we moved ahead with Kubernetes.

The other thing we do: because we keep moving clusters around, and people may not always know which cluster a particular thing is running on, we run one instance outside, sort of like a canary. You have a canary deployment which is actually a VM, a very lightweight VM, like two cores or four cores depending on your service, and then you can go ahead and do whatever you want on that VM. Sometimes you will have a production issue you want to debug, say your containers are crashing, and if it is in Kubernetes, the liveness probe keeps restarting the crashing container, which makes it a bit difficult for people to actually follow. So we use this canary: there is a dedicated VM per service, only one VM per service, and you use it whichever way you want. You want to run some backfill of data? Go ahead and do it on that node. It is basically very free; we don't restrict who has access to it, and every team can do whatever they want, because they own their service in the end.
Also, the reason we keep the instance groups is that we want a recovery mechanism in case we mess something up. If somebody does something and a Kubernetes cluster goes down, we do not want a service outage, because we want to maintain that 99.95% availability. In that case we just resize our instance groups, and because service discovery is built such that it doesn't care where things are running, the instance group starts serving production traffic and we can recover. We use the same mechanism for transitioning between clusters. We do not do automatic upgrades on our clusters, because we have had some bad experiences with those in the past. Whenever we have to do a major version upgrade of the cluster, we just create a new cluster, schedule all the pods on it, and then slowly route the traffic over. Cluster creation is very easy, so we just create a new Kubernetes cluster and terminate the old one. That is how we have migrated so far. Currently we are on 1.8, and we will probably do the same for 1.9: when we feel we want to move to Kubernetes 1.9, we will create a new 1.9 cluster, schedule all the pods, terminate the older one, and boom, we are done.

For internal communication we use Envoy. As I said, we are very dependent on latencies. We do not run these Envoys inside the Kubernetes clusters, because we keep spinning clusters up and deleting them; we run them outside, on dedicated Envoy hosts. Envoy was originally written by Lyft, and it is very good for gRPC; most of our services communicate with each other over gRPC internally, so it works really nicely for us.

Then we use Prometheus for monitoring. The way it works is that we deploy a Prometheus server inside each cluster. In each of the Kubernetes clusters we are running, which is, I think, more than five, we run one Prometheus instance, and then we have one main Prometheus instance which just scrapes those, sort of like federation. We use that federation, and we don't worry about the data sitting in each cluster's Prometheus, because most of the metrics get copied back. Then we use Turbine for Hystrix stream aggregation.

As I told you, we run a lot of queries. This cluster is doing ten and a half thousand QPS, and these are all pods running inside a Kubernetes cluster. This dashboard is for our authentication service: when you go into Carousell, we verify whether you are a user, whether you are logged in, what your state is. Whenever a request comes in, we validate it. Now, if you look at this graph, for these fetch and auth-check-valid calls the 99th percentile is about 1 ms, and the lower percentiles are down in the microseconds. That is extremely fast, and the moment we add anything to it, it will cause delays all over, because multiple services call this authentication service to validate. That is the reason we again went with NodePort: latency is very important for us. Here is another cluster: if you look at it, the 90th percentile is 29 microseconds and the 99th degrades really badly, but here the 90th shows as 0 ms, which is just insane. So within a few microseconds we can validate or invalidate auth and say this is valid, this is its particular configuration, whatever, and we do massive volumes, 10,500 queries per second, which is insane.
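To make that federation setup a bit more concrete, here is a minimal sketch of the scrape configuration the main Prometheus might use to pull from the per-cluster instances; the job name, match expression and targets are illustrative:

```yaml
# Illustrative only: the "main" Prometheus federating per-cluster servers.
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep the labels set by the cluster Prometheus
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'         # pull every series the cluster instance holds
    static_configs:
      - targets:
          - prometheus.cluster-a.internal:9090   # hypothetical addresses
          - prometheus.cluster-b.internal:9090
```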
The other thing that we use is Zipkin, so we have tracing across all the microservices. If you notice, and I'm not sure if it is visible from far away, we have multiple layers of services. The first layer is the gateway: all requests go to a gateway node, which directs them to some other service. So we have traces across services: the gateway service called a particular thing, and here you can see the getUserID function, which took 2 microseconds, then it called a third service, which called one more internal service, and if there are any failures inside any of these services, people just see it in that particular trace. This is very helpful. But again, the volume is huge: if we are doing 10,000 requests per second at the top level and creating, say, 10 spans each, that comes to around 100,000 spans per second, which is insane. So we do use sampling in some of our services, because we don't need all that Zipkin data. We still write the spans to an external Kafka; there are dedicated consumers that consume them and write them to a Cassandra cluster, and our UI runs on top of that Cassandra cluster.

Okay, deployments. The way we do deployments is using Jenkins pipelines. It is very easy and everybody is familiar with it; I think everybody has used Jenkins, or Hudson if you remember that from previous jobs. All the pipelines we have trigger existing jobs. As you can see, this is our pipeline, the Django build-deploy pipeline; this was for our monolith. This pipeline invokes these jobs, and you know the steps: first build it, deploy it to canary, promote it to production, deploy it to Django prod, deploy it to Kubernetes prod, and then deploy the workers.

The reason we have these as separate steps is that in case something goes wrong, let's say our workers start crashing, we can quickly go in, run one particular job, and fix that issue. Think about the case where our workers are running in a Kubernetes cluster and the cluster goes down, or we have to roll back because there is an issue. We do not want to run the whole pipeline all over again, because the pipeline has a lot of other steps; we want a short circuit there. So in that sort of scenario, say you want to roll back only the workers but not your applications, you can just run that one job and it will roll them back. All our pipelines are built to trigger jobs like that. If you go back and look, these are our steps: build the Docker image, deploy to canary, promote to latest, deploy to prod, deploy to Kubernetes, and deploy the workers.

Now, you will see that step 5 says deploy Django prod and the next one is deploy Django Kubernetes. Deploy Django prod actually deploys to our instance group, which is normally size 0. So in case something goes wrong, our deployment pipeline does not need to change; we just call the same pipeline, and because there are no instances, that step finishes right away, while the Kubernetes one takes a while, which you can see here. Deploy to prod takes a few seconds, maybe, because it fetches the list of all the instance groups, does certain actions and realises there are no instances running currently, and then our actual deployment to Kubernetes takes a while, depending on when we deploy.
It takes around six or seven minutes; on average it takes around eight minutes. Next, we have approval steps for each deployment, which is what we call the three-click deployment. You initiate the deployment, you approve it, saying yes, I have tested it and it is working fine, or the automated tests have passed, or whatever, and then you deploy it to prod. Whenever somebody approves a deployment we record their name: confirm that you have tested on canary, yes, and then approved by whoever it is. This goes through our internal login, so we know who did it. The reason we track who did it is to figure out what went wrong; we don't really care that you did something wrong unless it is repeated. I mean, I hope you are not doing it on purpose. Mistakes do happen, and we don't go into too much depth on those.

The next thing is that we have jobs to pause, resume or revert a deployment. The ones marked IMP are important jobs, so you are only supposed to run them if you really need them. You can do a custom deployment, throwing away this pipeline and deploying whatever you want, all custom. Or you can pause an existing deployment that is running, which is for when you realise something might be wrong and you want to debug but you are not so sure; basically we run kubectl rollout pause. Or you can resume that deployment, or totally revert it. The reason we have a revert job is that in the moment you don't know which image was deployed last. Sometimes we will build a certain image, go to canary, and realise there is a bug; so we fix it, and fix it again, and there might be a few iterations, and then when somebody has to revert in a high-stress scenario, you don't know which was the last image deployed. We don't want people to have to check; you just run the revert-deployment job and it reverts to the previous known state automatically.

And everything is tracked in Slack channels. We have a dedicated channel which we call backend releases; any microservice or any service that is going to do a deployment just announces it in this channel. We don't allow anybody to discuss anything in that channel, because we want to use it to figure out if there are issues. The biggest problem with microservice-based deployments is that it is always a murder mystery when you are trying to figure out what went wrong. It is like, I'm Sherlock Holmes, I have to figure out that this service did this, which means this service did that, and so on; it becomes really difficult. But if you have a log of all the deployments, and you know this person deployed and the issue started happening after that particular deployment, then you know they are probably related. More often than not, that is what we have seen. All these deployment jobs feed into this channel with a link to the deployment pipeline, so you know at what time which job was started, and you have a clear history of what the service did and at what step which deployment happened.

I think that is all. I don't want to take too long; I had twice the number of slides, but I cut it down because I think we are already running out of time.
So if you guys have any questions... Yeah, oh, good question. What we do is we have a simple philosophy that all your container names, all your service names, everything will be the same as your git repo. Once you create a git repo for your project, we follow the same name everywhere. Per repo, yes; you can create multiple repos and then you can do whatever you want, you can have multiple services, but everything carries whatever your git name is. Most of the time what happens, and we have seen this in the past, is that people internally call their application by one name, some other team refers to it by a different name, and then it becomes a mess: is this service that one, is that service this one? It becomes difficult for you to debug.

Sure. So previously we had a lot of services named after characters from anime, manga and all of that; we had Den Den Mushi, we had Dante, and so on. But we have stopped that practice now. We want our services to have the name of the function they perform, so the gateway layer is just called gateway, search is just called search, the auth service is just called auth service, because it is just easier for anybody who is joining new, or when you are in a meeting, to figure out what the service does.

Yeah, so ideally we would want to avoid that sort of scenario, but yes, that scenario does happen. We don't really change it in that case, because we would lose our history: if you change the git repo, or the version, or anything, then you don't have the history of it. So we sort of avoid that. It is a very tough thing to solve for; again, we go by 80-20, solve for 80% of them, and 20% you will hit no matter what.

Anyone else? How many of you are still awake? Okay guys, thank you, I think I'll give you