Okay, welcome everyone. This is the tech talk that we are doing. I guess you would have already noticed, if you've been using Carousell for a while, that the app's performance has gone up and the downtime has gone down. You would have noticed some changes, right? So here we are going to share what we did and how we went about it. So, let me start. Who are we? This is Harshit. He's the principal software engineer here; he leads the infrastructure team. We have worked together in the past for quite a lot of years. I'm Ankur Srivastava, one of the senior software engineers here. I report to Harshit, and basically we just work together. Yeah, that is what we are doing. Thankfully.

Okay, so let me give you a bit of history about Carousell. Carousell started in 2012 at a hackathon. This is a really old image — I had to dig through quite a bit to find it, but these guys look awesome in the pic, I guess. Currently we are in seven countries and 19 cities. We have 57 million listings live and we have sold more than 23 million items. And that's our principle: we want to make buying and selling simple so that you can fill your life with more meaningful things.

Let me tell you about things from the infrastructure perspective. Currently we have more than 400 servers. We have multiple services underneath, and most of them receive more than 2,000 requests per second. And these 2,000 requests per second are not just reads; they are write calls as well. 2,000 RPS is not that big a number when you think of reads, but when you mix in write calls, it becomes a bit problematic. We have a large number of self-managed deployments: self-managed Postgres, Elasticsearch, Cassandra, RabbitMQ, Kafka, Redis, Memcache, and much, much more. And we maintain an uptime of 99.95%, measured on a monthly basis — not quarterly or half-yearly like most companies do, but monthly. And we have the ability to handle availability zone failures; we'll go into more detail on that.

So this talk — why did we want to do this talk? We wanted to share how we got here and what plans we have for the future. So Harshit, do you want to take over? Thank you. So let's see what our current infrastructure looks like. Let's start from there and then we can go into how we came to this position. Okay, let's start with what infrastructure means to us. Infrastructure means architecture, systems, and operations. So apart from operations, we also do architecture and systems. What we have realized is that having a strong architecture and systems background leads to faster resolution of problems whenever you face them, and to proper system designs that are more fault tolerant. So having deep, thorough knowledge of systems is essential to having a strong infrastructure. We have also realized — and I think most of you would agree — that stateful components are the most important piece of an infrastructure. They define how well your infrastructure can scale, because stateless components can be scaled by just adding more of them, but stateful components can make or break your infrastructure depending on how well you can scale them for your traffic patterns. So identifying your traffic patterns and scaling your stateful components is the key to scale — and to how agile you can be in scaling. In how agile a manner you can scale, not the other agile.
For that reason, we self-manage all our user-facing data stores. That gives us flexibility. We have a choice: we can choose whatever data store we want based on the trade-offs between consistency, availability and performance, among other things. It gives us the possibility of workarounds whenever there is an outage — we can quickly move the data somewhere else and bring it back up on a different store, or something like that. And we get very fine control over node configuration, choosing what size of nodes we want for our stateful stores.

This is how our current infrastructure looks. On the left-hand side we have our end-user front-end apps — the iOS and Android apps — and the website also connects. The speciality of the infrastructure is that functionality is implemented only once: the website invokes the back-end layer and reuses the functionality implemented there. The web is just a rendering layer, so all the functionality resides in the back-end layer. There is a front-end load balancer and SSL termination layer. We have a proxy layer, which does routing, security and load balancing, among other things; it is all stateless and auto-scaled. The back-end functionality lies in the next layer, where there is the main back-end API, where all the functionality resides. Then there is a chat API cluster — we have a separate cluster for handling chats — and we have another cluster where miscellaneous small functionality is handled and workers are run: workers to process asynchronous tasks, sending out mails and messages, push notifications and all such things.

Moving to the data store layer, we have the Postgres layer. We have three Elasticsearch clusters, the main one being for listing search. Then we store our chats in two stores: the primary store for chats is Cassandra, but we also store them in Elasticsearch so that we get the lookup flexibility that Elasticsearch offers. And we have a Memcache and Redis layer. Memcache is used for caching very frequently accessed components of the listings and profile pages and other things, and the Redis instances are for low-latency counter-type applications — we maintain counters there, and most of the counters you see in the app come from the Redis instances. And we have RabbitMQ and Kafka as queueing layers. RabbitMQ is used for asynchronous tasks — for example, whenever a user creates a listing, sending out a mail is enqueued in the RabbitMQ cluster; things like that, RabbitMQ handles. And Kafka is used for analytics: any action happening in the backend is recorded in the Kafka cluster, which is pulled for analytics.

Then we have really important pieces: the config service and the monitoring, alerting and logging layer. The config service is based on Consul, which Ankur will go into details of later on. And we have monitoring, alerting and logging. Our main monitoring is based on Prometheus, where we collect almost every system metric that we can capture from every node, every server that we host, and that is really important for us. So Prometheus acts as the time-series metric capture layer for us. We have ELK, where we ingest all our logs, derive various metrics out of them and basically troubleshoot whenever there is an issue. And we have Pingdom and New Relic for third-party monitoring.
So Pingdom ensures, from an external source, that we are up — so that even if our own monitoring goes down we have an external source covering us — and New Relic is for the basic APM that it offers.

Going a bit more into the data stores we have: the source of truth for us is a Postgres deployment. We have one master and two slaves in each of three availability zones, so a total of six slaves, plus one additional slave for analytics, where the query timeouts and buffer sizes are configured really high so that you can run long-running queries; that slave doesn't handle user-facing traffic. On top of this there is a PgBouncer layer and then an HAProxy layer. The HAProxy layer handles load balancing across the Postgres slaves, and the PgBouncer layer ensures proper connection handling — per-transaction connection pooling and so on.

What we learned is: always have dedicated disks for databases, because there are tons of per-disk optimizations the operating system can do, and things like detaching a disk, attaching it somewhere else, or resizing it are much easier when you have dedicated disks rather than having the data on your root disk. Always use SSDs — the era of rotational disks for databases is gone, unless you really know what you're doing. Master disk snapshots: we really can't afford to lose any data, so apart from the slaves we take a full backup of the database every three hours. We take the backup of the master with fsync enabled, so we have proper point-in-time snapshots of the master file system.

Another point that we learned a bit the hard way: never turn off autovacuum, even if you have plenty of disk space, because of transaction ID wraparound. Your transaction IDs will eventually wrap around because they are 32-bit integers. If you turn off autovacuum they will not be reclaimed, and you will reach a dead end that is hard to get back from. Even if you turn autovacuum back on at that point, it will take a while and it might be risky. Ankur, do you want to explain more?

So, we were facing this issue quite a while back: we were seeing that the autovacuum job was running all the time. I'm not sure if you all monitor that, but if you don't, you should — if you have a Postgres database, please keep an eye on autovacuum, otherwise it will backfire. The situation for us was that we wanted to do a migration: we wanted to change the schema and add a particular column to a table. Then we noticed that autovacuum was holding a lock and would just not let go. It would start, get killed off, start again, get killed off. Later, we realized that our transaction IDs were hitting the limit. If you know Postgres internals, the transaction ID wraparound limit is close to 2 billion; it exists to support MVCC. I don't want to go into details — you can look it up — but that's how Postgres implements MVCC. What we realized was that there was a broken relation that was causing autovacuum to die. The problem is that autovacuum has multiple steps: it would do the first step, come to the next step, and die.
It would do the first step, come to the next step, and die, and we were approaching the limit very quickly. If we had not needed to do that schema change, we would actually have ended up in real trouble; because of that schema change, we realized we had a bigger issue. So we killed it off, dropped the broken relation, ran a manual vacuum, and it was all fine. And we took a downtime — like two minutes of downtime — to re-enable autovacuum and run it very aggressively. So that is something which I guess everyone should take care of.

Moving on to the next data store that is important to us: Elasticsearch. Elasticsearch drives the main listing search that you might be using. We have three clusters, the largest being 75 nodes. We have enabled shard allocation awareness to ensure that even if one availability zone goes down entirely, we are still able to handle the traffic without any impact to the user; it basically ensures that each availability zone has a full copy of the entire dataset and that your queries are routed properly. Plugins — kopf, head and cerebro — are really important for us; without these, we really couldn't have managed Elasticsearch properly or efficiently. Master nodes need to be kept in different availability zones: with three master-eligible nodes you can survive an availability zone loss and still have two masters. But ensure that the minimum master count is not set to one, otherwise you'll get split brain — two clusters running separately if the availability zones are cut off from each other while both are still alive. And we have HAProxy on top of it, with L7 health checks — HTTP health checks. The important part is L7, because L4 cannot do application-layer timeouts: L4 can detect a node going down, but only if the node goes down entirely or the process dies out entirely. L7 health checks with a proper timeout enable HAProxy to quickly detect that a node is going bad and just remove it. We do incremental backups to S3. And the key here was to set the shard count correctly, erring on the higher side, because you can't really increase it afterwards except by re-indexing, which is an expensive operation. Rely on the page cache: since Elasticsearch runs on the Java heap, using the Linux page cache to cache your data is a better bet than relying on Elasticsearch's caches. That is what we do as well — we set the heap size to a modest value and let the operating system manage the cache.

Okay, talking a bit about history: that is what a bad day used to look like earlier. We were on a different provider. We used to hit upper limits on network and disk on quite a few critical nodes. Moving out was difficult because we were already choking the network, so moving the data out was not really feasible. We were facing issues related to noisy neighbours, which translated into disk operation latency suddenly going higher, the network intermittently going flaky, and CPU suddenly not being available to the guest VM — you suddenly see a spike in load average but CPU is flat and there is no I/O activity happening, then everything comes back to normal. That is the classic case of your VM being starved of CPU. The types of instances were also limited — there were only a few types of nodes we could spawn — and cloud features like load balancing, auto scaling and security features were not there.
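Before we get to the migration story, a quick aside on the transaction ID wraparound point above: it is easy to watch for proactively. Below is a minimal sketch of such a check, assuming a reachable Postgres instance and the psycopg2 driver; the connection string and alert threshold are illustrative, not Carousell's actual monitoring.

```python
# Rough sketch: check how close each database is to transaction ID wraparound.
# Assumes a reachable Postgres instance and the psycopg2 driver; the DSN and
# alert threshold below are illustrative, not Carousell's actual setup.
import psycopg2

WRAPAROUND_LIMIT = 2**31          # ~2 billion transaction IDs
ALERT_THRESHOLD = 1_500_000_000   # start worrying well before the limit

conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=monitor")
with conn.cursor() as cur:
    # age(datfrozenxid) tells you how many transaction IDs have been consumed
    # since the database was last fully vacuumed/frozen.
    cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC")
    for datname, xid_age in cur.fetchall():
        pct = 100.0 * xid_age / WRAPAROUND_LIMIT
        flag = "ALERT" if xid_age > ALERT_THRESHOLD else "ok"
        print(f"{flag:5} {datname}: xid age {xid_age} ({pct:.1f}% of wraparound limit)")
conn.close()
```

Alerting on this number (or on the equivalent Prometheus metric) is what lets you schedule an aggressive vacuum on your own terms instead of during an emergency.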
So, with all those problems on the old provider, we decided to migrate out of it. How did we execute this migration? Let's go through each stage we had to go through, because obviously it was not such a simple task to move everything from one provider to another. Around June 2016 is when we decided to migrate. At that time we were on roughly 250 nodes. The task was to identify each and every node, what it does and its functionality, identify all the traffic patterns — how requests flow through each of the nodes — and gather all the information that was scattered everywhere in the company into a central place and document it. After this was done, the architecture was frozen: no more changes to the architecture, full stop. This is how it would look in the new provider. We would make feature changes but no architecture changes. Then we started doing comparative benchmarks between the new provider and the old provider. Obviously there was a lot of performance difference between the old and the new, so we had to measure it and factor it into how we redefined the node and cluster configuration in the new one. Then we took up the task of redesigning our entire deployment for the new provider — and yes, the new provider was Google Cloud.

Then we did a completely isolated deployment in GCP, that is Google Cloud, connected our app to it, and basically checked whether the latency was acceptable. We did thorough checking of whether it was acceptable to end users from various countries. We did a usability test on the new provider with that isolated deployment — we copied some test data and tried it out, and it worked fine. So everything looked fine up to this point. Then we did dry runs for copying the data over from this provider to that. We tried it multiple times to see what the fastest way to copy was, to minimize our maintenance window. We tried out multiple ways to copy data from one provider to the other — we'll go into a bit more detail on that — and estimated how long our maintenance window should be, how much downtime we could take, and what a conservative value for that would be.

Preparation for this: July 2016 was spent entirely preparing for the migration. We started by setting up a heavy-duty VPN with aggregated bandwidth across multiple nodes, so that we had gigabits per second of encrypted throughput between the two providers. And we replicated everything that could be replicated ahead of time. That included the Postgres clusters: we set up slaves in the new provider in multiple regions, so that even if one slave got disconnected — because it's a WAN link — the probability of the other slave also going down was low. We banked on that, and it worked out. So we created multiple slaves in multiple regions. And we kept all the stateless nodes ready in the new provider, because we could do that. We also decided to make our DNS changes early, which we did three or four days in advance of the migration — and as we learned, that was not enough. The principle that we followed was to script as much of the migration as we could, so there was less manual intervention and less chance of error. So we scripted node creation and cluster creation: given the size of a cluster, the script would just create that cluster in the new provider. During the migration, we shouldn't be creating nodes by hand.
We should just be able to run a script and have it bring up that cluster. We also scripted parts of the data movement, so that after beginning the downtime we just executed those scripts to do the data copy. The key here was to aim for only data movement during the migration window — if anything else was left for the maintenance window, you were gone. The pressure of the maintenance window is there, you can't really troubleshoot inside it, and you don't have time to do anything else. And practice: we practiced these steps multiple times to ensure we were thorough with them, comfortable with them and confident about them.

The migration day: 29 July 2016. At 3 a.m. we took the downtime. We first let the queues drain — we let the data in RabbitMQ, Kafka and everything else drain off and complete what it was intended to do, sending out the remaining mails and so on. Then we started with replication. We already had a slave in the new provider, so we had to promote it to a master and then replicate it out into multiple slaves. We did that across multiple availability zones and it was quite smooth; thankfully, we didn't face any problems there. For Elasticsearch and Cassandra we did a snapshot restore. What we had figured out was that S3 was a common point to which both providers had a fast link, so we copied the snapshots to S3 first and then restored from S3 into Google Cloud. That was figured out during the trials. The GCP network is quite fast, and that helped us restore something like 700 GB of Elasticsearch data and expand it into the new cluster really quickly — within 10 or 15 minutes it was restored.

Redis is where the challenge lay. It felt simple — just take the RDB and restore it in the new provider — but as we learned, it was not that simple. Isolated, single-node instances were fine: just restore the RDB and they came up. The problem was the cluster. Between the time we did the trial and the actual migration, the cluster configuration had changed — the slot allocation between the masters had switched. So what ended up happening was that we copied the wrong RDB to the wrong node and tried to restore it, which obviously failed completely; the cluster didn't come up. And we had to debug that during the maintenance window — that was our catch. Fortunately, within 20 minutes or half an hour we figured out what was happening. We had to mirror the existing cluster entirely into the new one, migrate the slots in the new cluster, and then position the RDBs correctly, and then it came up, thankfully. That half an hour, 35 minutes, was a tense moment for everyone.

Post-migration: the maintenance window was around five to six hours. We came back up, and we came back to a latency of one fourth of the original. That was our original latency, that was our downtime for the migration, and that is where we ended up, with throughput reaching normal levels. But there were a few issues that were still unsolved at that point, the main one being that DNS hadn't fully propagated. There were a few ISPs in some countries which had cached the DNS entries way beyond the TTL — our TTL was much lower than how long they kept it cached.
Those users were still resolving to our old IPs. So we deployed HAProxy on nodes listening on those old IP addresses and created L7 tunnels over the VPN we had already provisioned for the migration — we had the VPN, so we created tunnels through it. Because of this, the app came back up for the users whose DNS was still cached; it was slow, but it worked. That's what we had to do immediately after the migration. Another key thing we ensured was that monitoring after the migration was taken over by a different set of people, because the people doing the migration are completely exhausted after it.

Okay, so the key takeaways here: practice is really important — practicing the various data flows and keeping them in muscle memory, rather than having to think about them, is the key. Keep stateless nodes ready: do whatever you can beforehand, stateless nodes being the main part of that, as well as the configuration. You can keep the configuration ready with the IP addresses you provision for the database nodes, the data stores. Everything in the stateless layer should be ready; the stateful nodes should just need to be plugged into the cluster, and that should be the end of it. And there will be issues: however you prepare, there will be one or two unknowns — for us it was the Redis cluster slot switch and the DNS caching — there will be some things you miss. Most important, keep calm, because that is when you can actually solve the problems you will face. And I'll hand it over to Ankur.

So, after the migration, we realized that just switching providers was not going to be enough for us. Sure, we got to one fourth of the latency we had previously, but we still had other issues. We still wanted to make sure failures didn't cause us problems; we wanted to stay up and maintain a certain uptime. So, how many of you are aware of what "from pets to cattle" means? Just raise your hands. Oh, wow. Okay, so this is a very, very sad way of describing something true. What is a pet? A pet is something that has a name, lives in your house, and you care for it. If the pet dies, you feel very sad — come on, you will feel sad, right? But what about cattle? Cattle don't live in your house. They don't have names, they have numbers. And you don't care about them unless they die in large numbers: if you lose 50% of the cattle, then you have an issue; if you lose one, it's fine, because you were going to do that anyway — that's the sad part of it. So basically, your infrastructure should not be made of pets. You should have as few pets as you can, and your infrastructure should be more like cattle — one node going down should not matter. This is what we wanted to achieve. The thing is, static infrastructure is a myth; there is no such thing. Gone are the days when you could have a cluster running for, like, 10 years. We have all heard of that Unix machine being up for 15 years and running continuously — those days are gone. Manual updates can be really faulty, right? How many of you know of the Miss Universe fiasco?
How many of you have done this? Oh, I'm just going to delete this node, it's not used. Oh, wait — it was, right? I have done it. I'm not proud of it, but yeah, it happens. Nodes can fail quickly, one after the other. You create a node — oh, this is my new engineering server, or this is my new proxy server — and boom, it's gone. That has also happened. It happens; if it hasn't happened to you so far, it will, so take it as a warning, I guess. Then, configuration can quickly become stale. In these sorts of technology companies you move so fast that things will always keep breaking and changing, and the moment something changes, something else will always be left out. And misconfiguration of nodes — that always happens; come on, that has happened to you guys. No? Everybody's sleeping, right? We shouldn't have served the food first, I guess. Okay, if you are following me, just respond a bit, right? So, misconfiguration of nodes happens. Here at Carousell we use Salt for most of our deployments, and Salt propagation can have issues — some updates can be missed. The biggest problem with misconfiguration is that it's very painful to detect and fix. The most likely way it shows up is: this user is trying to do something, but some of the time something else happens, and you're like, okay, I don't know why that's happening. It's very painful to detect and fix, and more often than not it causes production impact.

So, what did we need? If you want to scale infrastructure, you want centralized configuration as much as you can. By centralized, I don't mean that you have just one config for the entire company: you can split it up by clusters, you can split it up by teams, but at least have a central place where you know what is happening. You want dynamic discovery — you don't want to have to remember names again. We don't want pets, we want cattle, so you don't want a particular IP to be associated with a particular service; you want it to be discovered. You want automatic recovery from failures. This is very easy to say and very hard to achieve, because there can be multiple forms of failure — sure, you can't automatically recover from all of them, but there are many you can recover from. You want auto-scaling. You need auto-scaling because it enforces the rest of what we discussed: once you have auto-scaling nodes, you will have centralized configuration, you will do dynamic discovery because nodes are coming up and going down, and you will recover automatically because if a node is gone, it doesn't matter — a new node is going to spawn. Auto-scaling helps for stateless nodes; for stateful nodes you ideally want to create scripts, because there will be times when they go down too. Take Postgres, for example: if it goes down, you don't want to be running apt-get install postgres by hand. You want scripts — startup scripts: create a node, clone this node, whatever — it should just create a Postgres database for you. And you want very aggressive monitoring and alerting.
The reason I'm stressing "aggressive" is that the simplest form of monitoring and alerting is somebody coming up and saying, oh, this thing is down. You don't want that. When I say aggressive, I mean proactive: you should identify issues before they happen, not just react to them — you should know that this might happen. And for all of this, you want streamlined deployments, because a deployment is the biggest change you ever make to your infrastructure, and that is when most things break.

So this is the story of how we went about figuring out what we needed, because we wanted to follow these principles; we wanted to be able to do all of this. For configuration, we wanted centralized configuration storage. We wanted something consistent. We wanted something that keeps an audit of changes, because an audit enables us to revert quickly: if somebody changed some configuration, we should know what the previous value was. Most of the time, a team member updates something, nobody else knows what the previous value was, and you're just stuck doing trial and error. So ideally you want some sort of audit or versioning so that you can quickly revert. And whatever the service is, you don't want extra overhead on top — you don't want to have to manage yet another config service when you're already managing a lot — so it should be easy to deploy and manage. For service discovery, you don't want your applications to have to be aware of the service; you want it to be as transparent as possible. You should be able to write your application in whichever language or framework you prefer — Java, Python, Golang, Ruby, whatever — it should all be pluggable. You want health checks, because if you're doing some sort of load balancing, you probably already have health checks — and if you don't, please put them in, because they really help you. Likewise, it should be easy to scale out, because your cluster size, your configuration size, your node count is always going to go up — or so you hope. And it should again be easy to deploy and manage; that's the same for everything.

So at Carousell, we decided to build this thing we call the config service on top of Consul. How many of you are aware of what Consul is? Okay, so Consul is a very good configuration management, service discovery and KV store built by HashiCorp. It uses Raft underneath. You can go and check it out; I don't want to go too much into the details. What I'll tell you is that we built this thing we call the config service — we haven't given it a fancy name; it explains what it does — on top of Consul. What the config service allows us to do is that all the configuration we want on a node is done through consul-template and envconsul. If you're using 12-factor apps, or apps which use environment variables, envconsul works out; otherwise, if you're using plain old configuration files, consul-template is good enough for you. Then, how do we install this config service — this set of scripts, this set of code — on all the nodes? We built Debian packages, because Debian packages still work. Sure, Docker is the way to go, but you're not going to deploy your stateful services on Docker any time soon, right? So Debian packages or RPM packages, whichever you're using — please don't use Ubuntu — that still works. And the config service package takes care of everything.
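To give a rough feel for the config-service pattern just described — pulling configuration for one service and environment out of Consul's KV store and exposing it as environment variables, roughly what envconsul automates — here is a simplified sketch. The key prefix and layout are illustrative assumptions, not the actual Carousell setup.

```python
# Simplified sketch of what a config-service client does: pull keys for one
# service/environment from Consul's KV HTTP API and expose them as environment
# variables (roughly what envconsul automates). Paths and key names here are
# illustrative assumptions, not the real Carousell layout.
import base64
import os
import requests

CONSUL = "http://127.0.0.1:8500"        # local Consul agent
PREFIX = "config/prod/main-app/"        # hypothetical environment/service prefix

resp = requests.get(f"{CONSUL}/v1/kv/{PREFIX}", params={"recurse": "true"}, timeout=5)
resp.raise_for_status()

for entry in resp.json():
    key = entry["Key"][len(PREFIX):]
    if not key:                          # skip the folder entry itself
        continue
    # Consul returns KV values base64-encoded.
    value = base64.b64decode(entry["Value"] or b"").decode()
    os.environ[key] = value
    print(f"exported {key}={value}")
```

A real deployment would use consul-template or envconsul to do this continuously and restart the process on change; the point is only that the application itself never needs to know where the values live.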
The reason we ship it as a package is that if we want to update it, we know how to install the previous package and how to clean it up — everything is handled automatically. So this is how our configuration management looks. This is a screenshot of the UI we have; it could be any Git UI. We use Git repositories to manage configurations. These are the multiple services you can see — these are all folders. Look at main app, or Lumos, or Infra: these are all clusters. And if you look here, it says prod. What we do is we have environments: /config/prod is the production environment, /config/stage is the stage environment. You can have dev environments, whatever environments you want — they're just simple folders. You track it in Git; Git gives you versioning and Git gives you an audit trail. It just works. It's your single source of truth: if you have to look at any configuration, you already have the repo on your hard drive — just look it up, pull it, whatever. And it allows you to do easy reverts. This is exactly what I was saying about needing audit and versioning. This was me, yes: I did a manual config update and messed up, but I was able to revert it. No impact, nobody noticed — I guess until now; now people have noticed. The secret is out. It's okay, I still have my job. So basically the idea is easy reverts and versioning, so that you don't have to worry about it. And the bonus advantage is that everyone knows how to use Git, right? It's not like you have to teach people how to use it — everyone should ideally know. Oh, you have had to teach people? Okay, sorry to hear that.

Okay, coming to the next part, service discovery. As we said, we just want discovery by name: oh, I want to connect to the app nodes — they are here; I want to connect to the chat nodes — they are here. Anything you see with an HA suffix is one of our HAProxies; we use suffixes to identify them. You have groups: you want listings, there is a listings HA and so on, and you can just discover that. So if you want to discover a configuration, you just ask for, say, es-listings-ha, and the IPs and ports will just come to you in the config. It could change, it could be anywhere — you don't know and you don't worry about it.

What this enables us to do is auto failover. If you look at this, this was when we were doing auto scaling on the app nodes and backend servers: one of the servers failed and all the other HAProxies updated their config, so the total number of servers went down. Yes, we use Slack a lot — I think if Slack goes down, Carousell internally will have issues. But if you look at it, it says the total number of servers is now 71. The total number of servers goes up and down based on auto scaling and node failures. The way we identify node failure is health checks: if the health check fails, the node is gone — boom, remove it. It even lets you do auto scaling: again, Slack notifications for nodes being created. If you look at this, we created these hosts, these hosts got destroyed, these got created, again destroyed. So based on the requests coming in, your QPS, you just scale up and down. The second thing it allows you to do is node maintenance. Otherwise, if you have a large number of configurations, the problem is that if you have to remove a node, you have to go to all the other places and remove it there too.
Here we can just fail that health check and the node gets removed from everywhere. This is again provided by the config service. We created a command called enable maintenance — again, keeping it simple. You type enable maintenance and the node is gone. And it records who did it, so you can always track it: okay, Ankur moved such-and-such node to maintenance mode with this particular reason, whatever the reason is.

So, in the infrastructure Harshit described, this is where the config service comes in. The config service has Consul as the backend and a Git repo for the configuration. We poll health checks across the entire infrastructure. The config service even checks on monitoring — the point is, who polices the police? Somebody needs to check up on them. So the config service checks monitoring, and monitoring checks the config service, so we know if either of them goes down. The other thing is that we maintain configurations and health checks across the board; this is done by our monitoring layer as well as our config service layer. If a server goes down or something, the config service will update the configuration and monitoring will alert on it. So I can just go and delete a node and be assured that it is not going to affect my infrastructure.

Then the next part is that we have these internal HAs — internal load balancers that we use to manage configurations. The problem here is that most of the client libraries we use are not actually aware of multiple hosts; they just take one host as an input. So the best thing we can do is put an HAProxy layer in between — L4 or L7 depending on the use case — manage the configuration there, and let HAProxy handle the reverse proxying for us. It's very helpful; HAProxy is, I guess, the best thing next to having a hardware load balancer.

As I was saying, auto scaling is very important for us. If you see here, there's this orange curve and then a green curve that's just following it. The orange curve is the throughput curve: at night it goes down, in the morning it comes up, in the evening it peaks — the same cycle every day — and the green is the node count, trailing it very closely. What this allows you to do is pay only for what you use. Otherwise you'd have to over-provision. If you think about it, the cost you pay is the area under the curve; if you have a flat node count you're paying a lot more, and if you have something like this you end up paying less. But not just that — as I said, auto scaling enforces a few things: you manage dynamic configuration, you manage availability, you handle fault tolerance when a server goes down, and even availability zone failures. Let's say one availability zone goes off completely. What will happen? Your CPU load is going to go up, and you will spawn more nodes in the other availability zones, because we do CPU targeting or request-rate targeting for auto scaling. So you end up scaling out more nodes, and even if an availability zone goes down, your app nodes are not affected. And this could happen for two availability zones, not just one — that is something you should keep in mind.

Then there is the classic case of what we saw — what we call the accidental flash sale. This happened sometime around midnight. I'm not going to reveal too much; I still want my job.
The basic thing that happened was that everyone was asleep — well, I guess we were still working somewhere, but at midnight normally everyone is asleep — and there was a bug, some issue, with one of the notification systems. It sent notifications to a large portion of our users at midnight. Nobody was aware of it; it just happened. A large number of users got a notification, and the majority of them just opened it — at 12 midnight. I guess now you know the best time to send notifications: if you're wondering, it's midnight. Everyone opened it, our request count went up — it went up very high, and look at the curve, it's almost a straight line. Now imagine what would have happened if we hadn't had auto scaling. Everyone would have been alerted and we wouldn't have slept. But we only noticed this later; we didn't even realize it had happened. Maybe our latency went a bit high for the initial duration while the new nodes were being created, but as soon as the nodes were there — boom, it worked, we were saved. This is one thing you should always remember, especially at midnight: you don't want alerts then. If you're in the office it's still okay, but you don't want that to happen at midnight.

So the key takeaway is: assume things will break. Always assume things will break, because they will. No matter what you do, things will go wrong. In case you're wondering, this is Singapore F1 2012 — whenever you think things can go wrong, they will go wrong. Set conventions; you don't want to assume things. The problem with assumptions is that everyone has different ones: I will have certain assumptions, Harshit will have certain assumptions, other members of our team will have different assumptions. That is where the problem arises — assumptions become your problem. You say, oh, everybody should already know this. Don't assume that. Set the convention, say this is what we want and this is how it should be, even for simple, silly things.

Then, script everything. This is one of my personal favourites. If you go home and have a look at the slide, I put in a high-resolution image of this script so you can zoom in. This script lets you create an Elasticsearch cluster for ELK with auto-scaling: you just put it in as a startup script, resize the cluster, and you have ELK. This is how we manage one of our ELK deployments — oh, it's not able to handle the log traffic? One more node, one more node. Because it's not very critical. So script everything; there's a simplified sketch of this idea just after this part. The biggest advantage for us was that we went through a migration, so we were forced to script, and it works very well for us now — we are still retaining those scripts. If you look at this, this is the set of all the scripts and configs we have. It's a big-ass repo that we keep, and all the miscellaneous files, everything, goes in there. We update them regularly. Scripting everything on its own is not enough — it's not going to help you unless you keep it updated. So update the scripts, and always use the script from the Git repo; that will always work. Next, use Debian and RPM packages. We use a lot of Debian packages — we are mostly deployed on Debian — but if you are on CentOS or Fedora or whatever, that's still fine; packages still work. Sure, Docker is the way forward, but for stateful nodes you should always use packages.
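Coming back to the "script everything" idea: the sketch below shows, in heavily simplified form, what a startup script for an auto-scaled ELK Elasticsearch node could look like — discover the current cluster members from service discovery, render a minimal config, and start the service. The Consul service name, file paths and settings are made up for illustration; this is not the actual Carousell script.

```python
# Heavily simplified sketch of a "script everything" style startup script for an
# ELK Elasticsearch node: discover current cluster members from service
# discovery, render a minimal elasticsearch.yml, and start the service.
# The Consul service name, file paths and settings are illustrative only.
import subprocess
import requests

CONSUL = "http://127.0.0.1:8500"
SERVICE = "elk-es"                                   # hypothetical service name

# Ask the local Consul agent for the healthy members of this cluster.
nodes = requests.get(f"{CONSUL}/v1/health/service/{SERVICE}",
                     params={"passing": "true"}, timeout=5).json()
peers = [n["Service"]["Address"] or n["Node"]["Address"] for n in nodes]
hosts = ", ".join(repr(p) for p in peers)

config = f"""
cluster.name: elk
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: [{hosts}]
"""

# Write the rendered config and (re)start the service; the node then joins the
# cluster and starts taking shards.
with open("/etc/elasticsearch/elasticsearch.yml", "w") as f:
    f.write(config)
subprocess.check_call(["systemctl", "restart", "elasticsearch"])
```

With something like this baked into the instance template, "resize the group" really is the whole operational procedure for adding capacity to a non-critical cluster.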
Instance groups for stateless services — I don't think I can stress this enough, because they allow you to do auto scaling. We have a CPU target, so you can say, try to maintain this CPU utilization, and your instances will scale across zones. These are different availability zones, and they'll automatically scale across them. This is available on most providers — we use Google Cloud, but most providers should give you this, except for the provider we were on. So in the end, what I want to tell you is: have more cattle and fewer pets. Then if something goes wrong, you won't end up crying.

Okay, Kubernetes, right? Yes, okay, I heard you. It's a very big topic; I don't think we can cover all of it — we might need another event for that, and if you are interested, we might actually do that. How many of you know what Kubernetes is? Do you know Kubernetes as in the name, the term, or do you know what Kubernetes actually is in technical terms, the project? Basically, this is one project that is very much in focus and is being actively developed by Google, and it allows you to manage your containers. Okay, how many of you know what Docker is? I'm sorry, I have to ask these questions. If you don't know what Docker is, let me give you a brief overview. Docker basically allows you to package your application into a small container. Since you are in Singapore, you would have seen the shipping containers, right? You have definitely seen them. It's basically that idea. The container is a fixed size, and you can put whatever you want inside; the ship doesn't care. The ship says: I'm going to take this container from this place to that place; I don't care what you put inside. It could be soft squishy toys or a sports car — I don't know, I don't care — I'm going to follow the same mechanics, do exactly the same things with it. And Kubernetes lets you do that shipping: it manages these containers for you. Kubernetes doesn't care what you put inside the container; it says, you tell me what you want and I'll get it done for you — I'll use this container as an isolated component, I don't care what's inside, you decide what you want to do. And that containerization is done by Docker.

So at Carousell, following the same principles we discussed, we obviously wanted to move to Kubernetes. We wanted to move all our stateless services to Kubernetes. We experimented a bit with stateful services as well, but I don't think it's there yet. We have been running Kubernetes in partial deployment since October 2016, and in full deployment since November 2016. All the production traffic for some of the major services is being served from Kubernetes. And we are using Google Container Engine, because it's easier and they have a much better SLA than what we can provide ourselves — so if something goes wrong with the container engine, we create a support ticket and go to sleep; somebody else manages it for us. We have around 30 deployments — when I say deployments, these are Kubernetes Deployments; I'll go over them — and we have run close to 500 containers at peak. And again we autoscale on CPU targets, though not all services are onboarded to Kubernetes yet.
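As a rough illustration of the CPU-target autoscaling just mentioned, here is what attaching a HorizontalPodAutoscaler to an existing Deployment looks like with the official Kubernetes Python client; the deployment name, namespace and targets are placeholder assumptions, not Carousell's actual values.

```python
# Rough illustration of CPU-target autoscaling in Kubernetes using the official
# Python client: attach a HorizontalPodAutoscaler to an existing Deployment.
# Deployment name, namespace and targets are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()   # or load_incluster_config() when running inside the cluster

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="main-app"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="main-app"),
        min_replicas=5,
        max_replicas=50,
        target_cpu_utilization_percentage=60,   # scale out when average CPU crosses 60%
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```

The same effect can be had with a one-line kubectl autoscale command; the point is that the scaling policy lives with the cluster, not in anyone's head.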
So we have been running Kubernetes in production for three months, and in full production for about two months, and it has worked out very well for us. If you look at these services, this is the output of the kubectl get deploy command. We use Deployments. We didn't start with a very old Kubernetes, like 1.2 or something; we started recently, on 1.4.7 — 1.4.6 actually. So we have used Deployments, and if you look — I'm not sure if that image is very clear from far away — these are all our workers. These are all the workers that Harshit was talking about, the workers fed from RabbitMQ. We have individual Deployments for each worker. The reason behind that is that we can autoscale them: we can control the consumption rate of the workers just by setting how many desired containers we want to run. So in case we are seeing a lot of traffic, or we see our DB getting loaded, we can reduce this size and get some room; and if we see the queues piling up — these are all fed from RabbitMQ — we can just scale it up. Let's say you are all sharing a lot to Facebook, and suddenly we see there's a huge spike in Facebook latency, so our queues go up. We just double this number — we should be able to scale, right? — so from five we bump it up to 10. Now we have 10 workers running. Nobody had to do anything, no code push was required; you just go there, you click, done, now we have 10. I'll cover how we do this in the next slides.

The other thing to note is that we do not use Kubernetes Ingress or Services. For those of you not familiar with Kubernetes, Ingress and Services provide what our service discovery provides — they allow you to do that — but there are some problems. The problem is that you can't really use it outside the cluster: that service discovery doesn't work outside the Kubernetes cluster, and we have a lot of services outside. Ingress has another issue: you cannot have a private load balancer there. Ingress gives you a load balancer in front of your cluster for whatever traffic is coming in — it goes to your cluster and you don't have to worry about anything. If that is your use case, it's very good, but that is not the case for us, so Ingress doesn't really work for us. Thankfully, we have the config service. We run the config service as a DaemonSet. What a DaemonSet does is run exactly one instance on every node. Multiple containers can be scheduled on a node: let's say you have a node with 16 cores and some number of GB of memory. You say, I want to run a particular container, and you give it a CPU request — I want two CPUs and 10 GB of memory, or two CPUs and 4 GB, whatever. Kubernetes will schedule it and shuffle it around; you don't have to worry about it or care where it's running — Kubernetes does that part for you. So you end up running multiple containers on a single host. We run a single Consul agent as a DaemonSet, which guarantees exactly one such container per node — whatever nodes are in your cluster, there will be only one per node for a DaemonSet. Then we use the same health check mechanism to figure out which containers are scheduled on that node, and add them to our service discovery.
What this gives us is no change in our config and no change in our architecture — we didn't have to change anything to support Kubernetes. Since we were already doing it that way, it just worked out for us. Second, the service discovery I talked about — internal HAProxy, external HAProxy — still works, because with service discovery you can hand it some other IP and some random port and nobody cares; discovery is based on names. And again, auto scaling. The config service also allows us to run a hybrid model: we can have instance groups coexist with Kubernetes. There is a very interesting reason why you might want that: as a recovery mechanism, or when you're transitioning. We used it for transitioning, and we still keep it for recovery. So even though we have been running Kubernetes in production for more than two months now and haven't seen any major issues, we still keep a recovery mechanism, which is our old instance groups. Since with our config service a node registers automatically whenever it comes up, this is how we ran partially: some of our application nodes were in an instance group and some were in a Kubernetes cluster, and traffic was going to both. We monitored — we do very aggressive monitoring, as I already said — and we checked whether the performance we were getting from the Kubernetes cluster was the same as what we were getting from the instance groups. It was actually better. So then we transitioned to a full deployment. It was built for the transition, because we didn't want to do a cutover or a maintenance window for it; we can just migrate wherever we want and schedule instances wherever we want. Set the instance group size to zero and you're fully on Kubernetes; set the instance group size back to whatever your auto scaling needs and delete the Kubernetes cluster, and you're off Kubernetes. But your app still works — you never go down.

Now, coming to the pipeline. Deployments are very important, because a deployment is the single biggest change you make. We have all heard of the cases that talk about 100 deployments a day — we've all read those blogs and articles. I'm not sure you really want to do 100 deployments a day; I don't know what you'd be doing. We do two or three deployments a day, and it's fine for us. So I'll tell you about the build steps we have — this is a Jenkins pipeline, and I'll talk about that. The first step we do is build a Docker image. Then we deploy to a canary host. The reason we do a canary is that you should always test your code; we are still not CI/CD, as they call it, but I guess we might get there soon. So we deploy to the canary host and then we check it: are the errors higher, is it doing the same functionality? We have our own internal DNS entry that just points to that node, so you switch your DNS, run the app, and you are hitting just the canary host. Then we promote the image to Google Cloud. Then we deploy to prod — this deploy-to-prod step is the instance group deploy part; it doesn't take long now because there are no instance groups any more, but earlier most of the time was spent here. Deploy to Kubernetes is again a Kubernetes deployment: we literally just run the command kubectl set image and give it an image. That is the deployment for us.
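For readers who haven't seen it, the Kubernetes leg of such a pipeline really can be that small. Here is a sketch of the kubectl set image flow just described, wrapped in a script a Jenkins job might call; the deployment, container and image names are placeholders, not the actual Carousell job.

```python
# Sketch of the Kubernetes leg of a deploy pipeline: point the Deployment at the
# new image and wait for the rolling update to finish. Deployment, container and
# image names are placeholders, not Carousell's actual ones.
import subprocess
import sys

deployment = "main-app"
container = "app"
image = sys.argv[1]   # e.g. gcr.io/my-project/main-app:build-123

# Trigger the rolling update.
subprocess.check_call([
    "kubectl", "set", "image", f"deployment/{deployment}", f"{container}={image}"])

# Block until the rollout completes (non-zero exit if it fails, which fails the job).
subprocess.check_call(["kubectl", "rollout", "status", f"deployment/{deployment}"])
```

Pausing, resuming and reverting map onto kubectl rollout pause/resume/undo in the same way, which is why the pipeline jobs for those stay tiny.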
We do rolling updates, so that works, and then we deploy our workers. So the configuration and everything that you see is what we deploy. As I said, we use Jenkins Pipeline. It's good — it's Groovy, not a language I like, but it works for us. The pipeline triggers our existing jobs, so you have the option of running whatever you want out of order: you can go to one build and run it manually, in case you want to revert or in case there is an issue or whatever. But normally we just use the pipeline for a normal deploy. So the pipeline triggers the existing jobs we have, and we can deploy to production — a large number of containers, a large number of hosts — in three clicks. This is the step I was talking about: you can say, oh, my build is done, I want to deploy to canary, I'll deploy to canary. And there are approval steps, so we know what was deployed, when, and who did it. Whenever you want to deploy to prod, there'll be a box, you click, yeah, deploy — and it does the deployment. And we have jobs to pause, resume or revert deployments. So in case something goes wrong, you can run any of these jobs: you can pause the deployment, check — oh, something went wrong — and revert it; or you can pause the deployment, realize it's not a deployment issue, and resume and continue it.

We track everything in Slack channels — as I told you, if Slack goes down, we'll have issues; please don't try to bring Slack down. If you look at these, these are all the steps. You'll see I have given out the hostname here, jenkins.carousell.io, but these all resolve to private IPs, so nothing's going to happen. We use the VPN aggressively: everything we do, every action we take, is over a VPN; without the VPN it doesn't work. So if you try hitting this — sure, go ahead — it won't do anything. We do deployments and put the notifications out in Slack channels, and we try to keep those channels as low-noise as possible. You know how Slack channels can get insanely, massively noisy because of all the GIFs and emoticons; it becomes really difficult. So we keep these channels very quiet. The idea is to be able to figure out what could have caused an issue: you keep one channel where you put all the notifications, and when something goes wrong you go back, look at that channel, and it tells you, around this time this happened — this could have caused whatever problem you're trying to debug. So it gives us an audit trail. And soon we'll transform it into CI/CD — though, I mean, I had to put that buzzword in there; I don't think we're actually going to do that, but yeah.

Now, coming to monitoring. This image is the actual TV we have — we really have a TV set up, and it keeps rotating through all the metrics. This is one of the screens. It keeps rotating, and anyone who's passing by can have a look if there is something wrong; if we don't notice, somebody else will. So always monitor things — monitoring is very, very critical. But before monitoring, you need to know your infrastructure. Unless you know what components you have and which components could go wrong, your monitoring will never be right. Whatever monitoring you do has to be based on your current infrastructure, and it should adapt to your current infrastructure. Capture everything, always. There is no harm in it. You may never use it.
Or you might require it at some moment — two o'clock in the night, whatever. Capture everything; it's cheap. And use proper tools. Here at Carousell we use Prometheus very aggressively. We use ELK, we use Sentry, we used StatsD — actually, we have phased out StatsD in favour of the Prometheus StatsD exporter now, so those StatsD metrics also go to Prometheus. We use New Relic and we use OpsGenie. These are the services we actually use in production as of now. The important point, as I said, is capture everything — but if you capture everything, you're capturing an insane amount of data, so you need to know your retention. You always have to set your retention: okay, these are load average numbers, I want to keep them for a week; these are my ELK access logs, they might be too huge, I'll keep them for just a day. You need to identify how long you want to keep each kind of data; otherwise it becomes insanely difficult to do anything.

So, the bare minimum metrics that you should always, always capture — if you're not capturing any of these, go today, I'm saying today: deploy Prometheus and just use node exporter, which gives you all of these. It won't take too long; you probably already have a deployment pipeline, so you can go ahead and use it. The most important metrics: load average, CPU percentage, and memory available. These will help you debug a large number of issues; they're the simplest things to capture — this is what top also gives you. The next thing you want to capture is network bandwidth and network connections. This will help you identify problems where your network connections might be breaking or things are just failing out of the blue. The next thing you want to measure is disk IOPS. How many people know what IOPS is? No? Okay, nobody from a systems background here, huh? We both were from a systems background, so we always look at IOPS. IOPS means I/O operations per second. Say your DB is running very badly and whatever queries you run take a minute or two to execute — that's not normal; it shouldn't take one or two minutes. You look at your IOPS and you realize: am I hitting the limit? Is this the max the disk can do? Then maybe you need to change something, or change providers. And monitor your disk usage, because that's important. Most people don't configure logrotate; your logs will just pile up and cause a downtime one day. You have HAProxy, its logs sit there, they keep building and building and building, and suddenly — boom — the disk is full and nothing works.

The next thing we do here, very aggressively, is build dashboards — lots and lots of dashboards. This is one of the very simple dashboards I was telling you about, for node exporter; in Prometheus, this dashboard more or less comes with it. You don't have to do much — you just get the dashboard out there. And it's very easy to identify outliers with a dashboard like this. You see, this is for our ES listings — ES here is Elasticsearch, listings are the listings; whatever you post goes into Elasticsearch so that you can search it. And if you look at this node here, you'll see there is one node that's running very high. I didn't have to do anything, I didn't have to go anywhere; I just opened the graph and saw, oh, one of the nodes is high. Everyone else is within this band, but this guy is an outlier.
Now, those are the basic metrics, but there are other things you can capture. Take Kafka; I'm guessing most of you run Kafka. You can capture most of its metrics, like GC count; Java GC is a big issue, right? Then you can capture Postgres overviews. Our QPS is not that high; this is one of the slaves, by the way, so there are no updates or inserts, it's just read-only traffic. And we capture these metrics too, so if somebody is running some job, we'll figure out what's happening. Then the next thing is ELK. We feed all our access logs through an ELK cluster, and we have built something we call the internal dashboard. What this allows us to do is drill down in real time into what's happening. I'm not sure if you can see this, but this is data for the last 15 minutes: these are the number of errors, these are the response codes we are returning, these are API endpoints, these are IPs. So if somebody is hitting us or somebody is having an issue, we can identify them right there. If you guys go home tonight and try to run ab against carousel.com, we will see it, and you will probably face some issues. These things give you a lot of freedom, and this is very easy to build. As I told you, you probably already have the pipeline to set this up; it's not that hard to do. So how does it all work together? Config service, logs, failovers. If you look here, this was the graph I told you about: it logs that one of the nodes went down, but we don't log exactly which node, because we are autoscaling, so a large number of nodes go down or come up. Then we use Slack for notifications. I'm not sure if most of the engineering team here is aware, but whatever you run through Salt, we get to see. So whatever actions you take: if you look at this, this was done by Jenkins on behalf of this job, and this was the command that was executed; we have the trace of it. This guy was running some other jobs; we got that too, and we know when he ran it, it was today. So monitor everything. The reason to monitor all of this is not to find the person who caused an issue but to find what might have caused the issue. We don't care about the person who caused it; we care about the code. That is very important. The next thing we do is on-call. In case you haven't noticed, I'm not on-call today, but Rajat is; sorry, Harshad is. So if you see him running out in the middle of this at some point, you know that we have an issue. So we do on-calls, and the way we do them, this is OpsGenie by the way, is that we have a large number of alerts that trigger OpsGenie, and it then notifies whoever is on-call at the given time.
If the person who gets the notification does not respond within a set amount of time, which I think is two minutes now, so if you don't respond to that notification in two minutes, we call, I think, six people, and one of them will identify the issue and get it fixed. Again, the point here is not to find someone to blame, but to get the problem rectified. That is the important thing. We do on-calls so that we know when we can't get drunk, right? There is a reason there is Monday off, right? You went out on the weekend and, oh, okay. Everyone should get at least one weekend to enjoy; you can't always be on-call. So on-calls really help you. Part of the reason why we have the availability we currently have is on-calls; they are really powerful. The next thing you should really care about is alert blindness. This is a very big problem, and I'll tell you what it means. Alert blindness is when you're constantly seeing alerts, constantly seeing alerts, and you're like, oh man, that happens all the time and it never causes an issue. And then one time it will, and you will ignore it. You do not want that to happen. So what we do is we do not put all the alerts on on-call; we have a sort of promotion mechanism. We put in an alert and we look at it during the day. We don't really look at it during the night, because we know that if it causes a real issue, one of the critical alerts will get triggered anyway; that means a downtime, but it's still fine. We promote these alerts whenever they turn out to cause issues. So we are seeing this, we are seeing this, and then we say, oh, whenever this happens our latency goes up, or whenever this happens a large number of users are scraping us. Then we can put it on on-call with the right priority. Otherwise there are a large number of alerts, like, say, high sockets in use: up to a particular point it's fine, but if it goes beyond that, it triggers. This is one Slack channel that has nothing but alert notifications. We normally keep it muted because it's very high volume, and we do very aggressive monitoring. This helps us identify issues before they happen, so you're able to predict: oh, this is going down, there could be an issue. The second thing is, when there is an issue, what do you do? You should always keep links at hand. You shouldn't have to worry about, oh, what was the IP of the Elasticsearch server, what was the IP of this server, what was my load balancer. So this is a service we built that we just call Go Slash. I think Google has a similar service, Flipkart also has a similar service, Amazon doesn't. So, coming from that background, we got pretty used to it. It's very handy, and we just hacked it together; I think it took an hour or two, just one HAProxy server. And as we told you, we are all on VPN, so whenever somebody connects to the VPN, their DNS search domain becomes carousel.io, and they can just type go and it resolves to go.carousel.io. That is still a private IP, by the way, if you guys are wondering. So you type go slash and it gives you the list of all the links; you can search through that. And yes, I want people to use Vim, so it has Vim-style bindings. Anyway. So we have the production New Relic links, we have Elasticsearch stats, we have metric stats, we have our fleet stats, we have our dashboards. This is the internal dashboard I was talking about, go slash dash. It's very handy when you know there's an issue: go slash dash, oh, it's this; go slash ES stats gives you the Elasticsearch stats. It just works, and you don't have to worry about remembering those IPs. It's a different story that I think we remember half the IPs of all the nodes, because we have gone through them so many times, but you shouldn't have to. Somebody call it out: if you start remembering IP addresses, there's a problem. Yeah, well, there are a lot of problems with me, right?
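For anyone who wants something similar, a go-link service really is just a redirect map behind one well-known hostname. Here is a minimal, hypothetical sketch; the link entries and port are made up, and ours actually sits behind HAProxy and only resolves over the VPN.

```go
// Minimal sketch of a "go/" short-link service: a static map of names to
// internal URLs behind one well-known hostname. Entries and port are made up.
package main

import (
	"log"
	"net/http"
	"strings"
)

var links = map[string]string{
	"dash":     "https://dash.internal.example/",     // internal dashboard
	"esstats":  "https://es.internal.example/_stats", // Elasticsearch stats
	"newrelic": "https://newrelic.example/",          // production New Relic (placeholder)
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		name := strings.Trim(r.URL.Path, "/")
		if target, ok := links[name]; ok {
			http.Redirect(w, r, target, http.StatusFound)
			return
		}
		// Unknown name: dump the full list so people can search it.
		w.Header().Set("Content-Type", "text/plain")
		for k, v := range links {
			w.Write([]byte(k + " -> " + v + "\n"))
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```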
Okay, the next thing: always schedule your jobs. As Harshit told you, we do DB backups every three hours. This is the job again, and it all goes to Slack: the chat backup started at some time and ran for a certain duration. This one is quick because it's Elasticsearch; it's just a snapshot, just an API call, you can look it up, it's very handy. Then we do database snapshots, we do the groups database snapshot, everything is automated, and whenever a job fails, alerts. So: schedule your jobs, and next, automate. Automate as much as you can. Scripting is your first step; the next step is automation. Ideally you want everything to be handled automatically, not manually. Harshit, do you want to take over? Sure. Okay. So, let's see what our future plans are. The main one is hire more engineers, because obviously we are running very lean in terms of engineering. Move more services to Kubernetes: as Ankur mentioned, we still have a lot of services that we are yet to migrate, and since it has shown its promise and impact, we want to move everything we can to Kubernetes. We also want to move away from Postgres, as I mentioned, because one, we don't need ACID, and two, scalability. Transition to microservices: we want to break our backend into more manageable pieces of functionality that are easier to manage in terms of adding features, more maintainable services, as well as transition to a platform that gives us better performance. Improve monitoring further: you can't monitor enough, as Ankur shared, so improve it even more and get as detailed as possible. And be more fault tolerant: embrace failures and ensure you can tolerate even greater amounts of failure, more frequent failures, bigger sets of nodes going down, and still keep running. So, sharing a bit about the microservices plan. We are in the process of writing a microservice in Golang, Go-kit inspired, with various horizontal pieces in place to ensure that the whole thing doesn't turn into a murder mystery after we transition to microservices. We will mostly be using Cassandra for storage because it's scalable, Elasticsearch for eventually consistent lookups where strict consistency is not really required, and gRPC for RPC communication between the services. This is an important one: Hystrix for real-time monitoring. Since each service will be calling a bunch of other services, you need to be aware of how each service is performing within, say, two or three dashboards; you shouldn't be debugging every service separately by going to that machine and looking at timeouts. Hystrix is one place where you can see how much latency each service is adding and how much throughput or traffic is flowing through each service. We used it at Flipkart and it worked out well. Zipkin for request tracing: given any request coming into the system, you should be able to trace its entire path to the end, which services it hits, how much latency each service took, and if there was a failing service, which one it was, before going deeper into that particular service. And Prometheus for metrics, because it is working for us and we want to continue using it and make even more use of it. All of this is so that the transition to microservices is actually successful and we gain the benefits of microservices, rather than just following the buzzword; ensuring that we do it in a meaningful way.
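To show roughly what we mean by Go-kit-inspired "horizontal pieces", here is a hand-rolled sketch, an assumption rather than our real service code, of an endpoint type plus a middleware that records per-call latency; this is the kind of plumbing you want in every service before the debugging murder mystery starts.

```go
// Rough sketch of the "horizontal pieces" idea: every endpoint has the same
// shape, so cross-cutting concerns like latency metrics are plain wrappers.
// Names and the metrics sink are assumptions, not the real service.
package main

import (
	"context"
	"fmt"
	"time"
)

// Endpoint mirrors the go-kit style: one request in, one response out.
type Endpoint func(ctx context.Context, request interface{}) (interface{}, error)

// Middleware wraps an Endpoint with extra behaviour.
type Middleware func(Endpoint) Endpoint

// latencyMiddleware records how long each call took; a real service would
// feed this into Prometheus/Hystrix-style dashboards instead of printing.
func latencyMiddleware(name string) Middleware {
	return func(next Endpoint) Endpoint {
		return func(ctx context.Context, request interface{}) (interface{}, error) {
			start := time.Now()
			resp, err := next(ctx, request)
			fmt.Printf("endpoint=%s latency=%s err=%v\n", name, time.Since(start), err)
			return resp, err
		}
	}
}

func main() {
	getListing := func(ctx context.Context, request interface{}) (interface{}, error) {
		// Pretend to fetch a listing by ID from the data store.
		return fmt.Sprintf("listing %v", request), nil
	}

	wrapped := latencyMiddleware("GetListing")(getListing)
	resp, _ := wrapped(context.Background(), 42)
	fmt.Println(resp)
}
```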
Okay, so, flash sale. Who all is tired now? Nice. Nobody? I am. Okay. So, having built this infrastructure, the best test case, the best stress you can put on it, is a flash sale, because handling natural growth of traffic is relatively simple compared to taking a punch to your infrastructure. That is essentially what a flash sale is. Do you all know what a flash sale means? No? You all know? Raise your hands. We are almost at the end; look at the slide number, just bear with us for a few more. I hope it has been interesting so far, right? Nobody? Okay, somebody's leaving. Okay. So what is a flash sale? As Harshit said, it's a punch, like a punch to the face: a huge spike in traffic that happens at one instant because some offer is given out, because at that moment you want a lot more people attracted to your platform within that one instant of time. Like Black Friday, or Singles' Day, something like that. So it's the ultimate test of scalability. For us it was the first time doing a flash sale, and we really didn't know what we were looking at. We didn't know what to expect, there was no baseline to follow, we were shooting in the dark. So we planned for 2X throughput, 2X of the peak throughput, and we assumed that throughput would multiply within a short amount of time. That is the actual curve we saw when we executed the flash sale. This basically shows the number of requests coming in; sorry, we had to hide the numbers for some obvious reasons. So think of this as the baseline, the normal peak traffic, and you see the jump: that is what happened when the offer went out. That is what a flash sale means in terms of traffic. Imagine that much throughput hitting an infrastructure suddenly; your servers can't autoscale when this arrives. And that is what our latency looked like throughout the flash sale: the spike is nowhere reflected in the latency. That is when we said, okay, we have done a flash sale successfully. So what did it take to do this flash sale? Cache read calls at multiple layers. You need to aggressively cache whatever you can, and at multiple layers. Care has to be taken at each layer with invalidation, because that is where you will get it wrong. Another thing to take care of is that the cache doesn't leak data across users, because when you cache, you might end up bypassing some of the validations your code normally does. So be really careful, but do cache at multiple layers.
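As a toy illustration of that "cache the read path, let entries expire" idea, here is a tiny read-through cache with a TTL. The one-minute TTL and the loader are placeholders; in production the layers are things like memcached and Redis rather than an in-process map.

```go
// Toy read-through cache with a TTL: entries expire instead of being
// explicitly invalidated. TTL and loader are placeholders; the real layers
// are memcached/Redis, not an in-process map.
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

type TTLCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	data map[string]entry
	load func(key string) (string, error) // hits the real data store on a miss
}

func NewTTLCache(ttl time.Duration, load func(string) (string, error)) *TTLCache {
	return &TTLCache{ttl: ttl, data: make(map[string]entry), load: load}
}

func (c *TTLCache) Get(key string) (string, error) {
	c.mu.Lock()
	e, ok := c.data[key]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.value, nil // cache hit, possibly up to one TTL stale
	}
	v, err := c.load(key)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.data[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}

func main() {
	cache := NewTTLCache(time.Minute, func(key string) (string, error) {
		return "value-for-" + key, nil // placeholder for a DB/ES lookup
	})
	v, _ := cache.Get("listing:42")
	fmt.Println(v)
}
```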
Upsize your Elasticsearch nodes. While we were taking a look at the nodes, we realized we should be increasing the memory on those Elasticsearch nodes. We ended up replacing the entire set of 75 nodes across the Elasticsearch clusters without a downtime. So, zero downtime: we replaced the entire cluster by rolling each shard off the old nodes; 75 new nodes were created, 75 old nodes were deleted, all while serving production traffic. Local SSDs for the slaves. Google Cloud offers a category of disks called local SSDs, which give you something like 100K read IOPS; if you choose the NVMe option, it goes to around 170K. For comparison, a really high-IOPS disk normally tops out around 25,000: 20,000 normally, 25,000 if you take the biggest 32-core machine. But a local SSD gives you 100K IOPS per disk. We took three such disks and created a RAID 0 out of them, and RAID 0 basically adds those IOPS together, so we were looking at a theoretical 300K IOPS per machine, and we created three such slaves. So the IOPS bottleneck that could have been there was removed. Even if you only get, say, 100K IOPS out of the theoretical 300K in practice, considering CPU bottlenecks, kernel overheads, everything, that's still fine; you will choke your network, your CPU, your interrupt-handling core before you choke the disk. That is how we removed the IOPS bottleneck from the database. The other key was to identify network bottlenecks, because there are proxy servers through which the traffic flows, and if those proxy servers don't normally see such traffic, even at peak, there may be hidden bottlenecks that only get hit when the traffic reaches those spikes. So the key is to go into each particular node, see what the bottleneck is there, and check ulimits and connection limits. Whatever ulimits are set per proxy server, on whichever server is handling connections, check those as well as any application-level connection limits. That is the key here. You might have configured them for your normal day-to-day traffic, but for a flash sale you need to re-tune everything to a different set of values, which you may want to revert when you're done with the flash sale, if you're not going to do another one really soon, which ideally should be the case.
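One cheap trick along those lines, a sketch rather than our actual code, is to have the application log its own file-descriptor limit at startup, so a forgotten ulimit shows up long before the spike does. The warning threshold here is purely illustrative.

```go
// Sketch: log the process file-descriptor limit at startup so a forgotten
// ulimit is visible before a traffic spike, not during it. Linux-oriented.
package main

import (
	"log"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}
	log.Printf("open-file limit: soft=%d hard=%d", rl.Cur, rl.Max)
	if rl.Cur < 65536 { // illustrative threshold, tune to your own traffic
		log.Printf("warning: soft limit looks too low for flash-sale traffic")
	}
}
```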
The important one is to build and keep a standard operating procedure handy. While the flash sale is happening, you should not be thinking; most of your failure scenarios should be rehearsed and known by heart. It should come from your spinal cord, not from your brain. So what we did, if you can see the board behind us, is write out our standard operating procedure, so that if anything goes wrong we just have to look back at it and do it, that's it. That is how it looked. It's dirty, but it worked, apparently. So that is the infrastructure; now, talking about the team at Carousel. We manage around 400-plus servers and thousands of requests per second across multiple services. For production issues, we ensure that critical issues get a set of eyeballs in less than five minutes; that is how our alerting works. I'm not talking about the less critical issues, but yeah. That's how we've managed an uptime of 99.95, measured on a monthly basis, for a few quarters now. Failures don't result in outages: individual server failures now don't result in total outages. There might be some degradation, but it doesn't turn into an outage. All thanks to planning, monitoring, and automation: having detailed plans, thorough monitoring, and proper automation. Whatever you can automate, you should automate; that helps keep your team really lean but effective. So, the final key takeaways. Isolate stateful and stateless components, obviously, because that helps you scale; that is the key thing when you are building a scalable system. And isolating compute is equally important, because if your compute leaks into your storage layer, it's a no-go for scale; you can't scale your data layer out of that. That includes complex SQL queries running inside the database, which you shouldn't be doing. Compute has to happen on nodes dedicated to compute, meaning the application layer, the stateless layer. Choose your data stores really carefully; you won't be changing them frequently, you don't want to change them frequently, and you shouldn't have to. So while you're choosing, choose really carefully, not simply because you're familiar with one of them even though it doesn't really fit the use case. Everything has to be matched properly; the CAP theorem is your friend here. Use abstractions only after understanding them. Don't use ORM libraries or DSL libraries without going into what happens inside them, how connections are handled, how the request is translated into the actual query, because without understanding them you will never be able to debug at the operations layer if you ever need to. So use abstractions only after thorough understanding, testing them out, and reading the code inside. That is what we practice. Whenever there is an issue, perform root cause analysis, not just workarounds and isolation. Workarounds and isolation save you for a short amount of time, but a thorough root cause analysis and fixing the problem that actually caused the issue will help you not repeat it. We strive to never repeat a particular issue: we ensure that if an issue happens, it never happens twice. That is what we attempt, and we go a bit further than that; we even enforce it with our new cloud provider. Half of the tickets we raise are, okay, what happened, tell us what actually happened. So we don't just do this at our level; we try to do it with whatever services we are using. We want to identify what has happened and what has gone wrong, and they actually understand that and make sure it won't cause another issue; thankfully, they support us in doing this work. Identify bottlenecks and nip them in the bud; a stitch in time saves nine. Monitor everything; you can't monitor too much, and we can never monitor enough. Finally, blame the code, don't blame the person. Never blame any person.
This is very important, because there were a few companies, I'm not going to name them, I think I actually did, where that wasn't really followed, and it becomes really bad. People will leave; a large number of people will not stay, people won't remain happy, and nothing will work, it just falls apart. I can't stress this enough: you should never blame the coder. You don't want people trading blame; you want people to not repeat mistakes. If you're blaming the coder, you have bigger issues than scaling. Thank you, everyone, for attending. You have listened to us for, what, an hour? More than an hour now, wow. We're sorry if it was a struggle to listen to our voices for that long. This is the time for you to correct us wherever we went wrong, or to ask any questions you would like answered in more detail. One question on Kubernetes: did we observe any latency issues accessing services through the Kubernetes networking layer, compared to hitting the host interfaces directly, and which network driver or overlay do we use? I'll answer that. So, we use GCE for it, so we don't really get to manage much of that. But if you are facing those issues, I would suggest you run the workload in privileged mode on the host network. Not the best way to do it, but you get the host network. Otherwise you'll always have some proxy in between, because the moment you put in Services, Kubernetes will try to balance them and open a NodePort, and the moment you have a NodePort, multiple containers have to be mapped, and that ends up going through a proxy. The place we hit this was Consul, because we actually run a Consul agent on all the nodes, that's the config service, as a DaemonSet, and Consul has very aggressive timeouts; it uses gossip. So we figured we could run Consul in privileged mode, first of all because it's fine, it's well tested, and everything is on a private network anyway; we don't have public IPs at all, other than the load balancers, everything is private. So we run Consul in privileged mode, and that's what we did. For the apps and such it wasn't much of an issue, because we use persistent connections, so reconnection normally doesn't happen; it's still just sending the packet and getting the acknowledgement. We've got a question from the stream: someone asks, why did you guys decide to use Golang for microservices? What about Scala, Elixir, et cetera? One thing I'll say is, we do want to hire people, and I don't know many people who do Scala and Elixir and all the other newer languages. Not to say that they are bad. I would happily write the entire infrastructure in Erlang; that's one of my favourite languages, I'm not sure if you've heard of it, it's a functional language and I really like it. But if I start doing that, I won't have a job, and I very much love my job. So we prefer Golang because it works. It performs better than Python, Ruby, or PHP, or whatever language you have there; it is much better than Java; and it's compiled, so it's fast. That was the main reason: it's compiled and type-safe, you get single-binary deployments, and it's cool these days, so we will find a lot of engineers who know it, which makes hiring easier. Yes, the other languages are good as well, but we are not particular fans of the JVM, as Ankur said; not that we haven't worked on it, we have worked on it a lot, and that is exactly why we are not fans of it. That is the reason we are choosing Golang. One more thing: good concurrency primitives as well, goroutines, channels, everything.
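Purely as an illustration of those primitives, a toy example that has nothing to do with our codebase: fan out a few health checks on goroutines and collect the results over a channel. The URLs are placeholders.

```go
// Toy illustration of goroutines and channels: fan out HTTP health checks
// concurrently and collect results over a channel. URLs are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	urls := []string{
		"http://service-a.internal/health",
		"http://service-b.internal/health",
		"http://service-c.internal/health",
	}

	type result struct {
		url string
		err error
	}
	results := make(chan result, len(urls))
	client := &http.Client{Timeout: 2 * time.Second}

	for _, u := range urls {
		go func(u string) { // one goroutine per check
			resp, err := client.Get(u)
			if err == nil {
				resp.Body.Close()
			}
			results <- result{url: u, err: err}
		}(u)
	}

	for range urls {
		r := <-results
		fmt.Printf("%s -> err=%v\n", r.url, r.err)
	}
}
```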
Yeah, got it. Okay, one more question over here. [Audience question, partially inaudible: they are considering moving to GCP as well, but GCP's nearest region is not in Singapore, so how did the added latency work out after the move?] Sure, let's talk about that. So, the latency benefit after the request entered the data center, since we are hosting in GCP Taiwan, was large enough that it offset the roughly 40ms of latency between Singapore and Taiwan. That was one benefit. Another thing is Cloudflare: we have Cloudflare in front, and Cloudflare to Google Cloud Taiwan is a fast link, so that latency is reduced by around another 5ms on average. But we are okay with having 30ms on top; it really doesn't affect us any more, because we were a lot worse than that earlier. Let me see if I can go back to that slide; we did cover a lot of things. [Partially inaudible follow-up from the audience while searching for the slide.] If you look at the graph, this is average latency, and you should never track average latency, but for some reason New Relic does that. The average latency was at least 200ms, and we went down to close to 80ms. Even if you add 30 or 40ms on top of that, it is still far better than where we started; 100 versus 120ms doesn't really make a difference, and it's not something we are worried about. Singapore-to-Taiwan latency is manageable for us; we're not doing high-frequency trading here. The next question was whether we kept the entire architecture the same during the migration or also changed things. The architecture, pretty much everything, stayed the same; it's mostly the platform that changed. But once the platform was settled, we did do classic architecture changes. We changed data stores. As an example, chats used to be stored in Postgres; we moved them out into two stores, Cassandra and Elasticsearch, and we moved chat out as a separate first-class API component. There were some latency issues because chats were living on the same set of app nodes, and the traffic patterns didn't mix: requests that were held longer in the worker versus chat requests that were shorter in the worker didn't intermix well, so we had to move them out. Those are the kinds of architecture changes we did after that. And we actually resharded the Elasticsearch indices while we were doing it, because we were doing a dump and restore anyway.
It was a very good opportunity for us to actually reshard. I would say there was a concern, which a lot of people pointed out, that you should keep the same infrastructure in the cloud you're moving into as in the one you're moving out of; wherever you move, it should be the same setup. We took a call that we didn't want any more downtimes: we were already taking a maintenance window, we could reshard the data in the same window, and because of practice, practice, practice, we knew it would work out. Next question: why did you choose Google Cloud instead of AWS? What was the deciding point? Quite a few benchmarks, and obviously other things as well. Google Cloud's network and disks are way superior to AWS if you consider the price point. At a given cost point, the IOPS and network bandwidth you get per node on Google Cloud is way higher. Think about it this way: we have a lot of internal HAProxies, and they all add a hop, but we are okay with it, because on average we observe latency between nodes of somewhere around 150 microseconds on the Google Cloud network. I don't want to quote the equivalent number we observed on AWS, but it was much higher than this. All right, we have a question from the stream. JJ asks: you mentioned ACID as one of the reasons for moving away from Postgres; can you elaborate? So, with the traffic pattern and the use case we have, we don't really need ACID. ACID enforces a lot of things. For those who don't know what ACID is, I'm not talking about the drug. If you look at our typical use case, creating a listing, keeping it around, and letting buyers view it, none of these are really ACID scenarios where your transactions need to be serializable. You want your data to be durable, but you don't necessarily need the transaction isolation part. We run at the read committed level in Postgres, so we are already running at a lower transaction isolation level anyway; essentially we don't need serializable, or even, say, MVCC snapshot isolation transactions. Maybe in future, if we do something else, we might need it, but then we will use an ACID database only for that particular application, because scaling ACID is tough. There are solutions that give you horizontally scalable ACID databases, but they are vendor-specific, proprietary databases. And if you want to scale, you need to look really carefully at whether this level of consistency in the transactions, in the modifications you do, is actually required. What if it is not there? What percentage of users or what percentage of traffic would be affected by it? If it's less than 1%, or less than 0.1%, you really shouldn't solve for it in a very obscure and complex manner. You can apply the 80/20 rule: solve the 80% of the problem, because the last 20% will take you far longer. And obviously, the performance: having a scalable ACID database is a performance nightmare, and maintaining performance there is a real problem. The reason we keep talking about IOPS is that we have to worry about a single node; the moment we scale out to multiple nodes, we won't have to worry about it any more, or at least not as much.
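Since the read committed point comes up a lot, here is a minimal sketch, assuming the standard database/sql package and the lib/pq driver (table and DSN are placeholders), of pinning a transaction to that isolation level explicitly rather than asking for anything stricter.

```go
// Minimal sketch: run a Postgres transaction at READ COMMITTED with
// database/sql. Driver choice, DSN, and the "listings" table are placeholders.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // one possible Postgres driver
)

func main() {
	db, err := sql.Open("postgres",
		"postgres://user:pass@localhost/appdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()
	// Ask for READ COMMITTED explicitly; no SERIALIZABLE or snapshot
	// isolation is needed for a listing-style read/write.
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelReadCommitted})
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.ExecContext(ctx,
		"UPDATE listings SET title = $1 WHERE id = $2", "new title", 42); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```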
Sorry, you had a question, yeah. [Audience question, roughly: was the migration from a traditional data center into Google Cloud, or were you already on a cloud provider, and do you still have nodes on the old provider?] We were already using cloud providers before, so it's not just that we were on cloud provider X; we are on all the major providers, and we have some pieces of the architecture running on pretty much all of them. This was the major chunk, and this was the major problem. Do we still have some of the old nodes? We still have some nodes on cloud provider X, and some on other providers as well, but this is the majority, the main bulk. So it depends, actually, on what features your data center had or lacked, and what features the new one has. You definitely need to re-architect considering the peculiarities of the new one; it depends on what the differences are. Let me tell you this: we have actually solved this before. We have moved from a cloud provider to a private cloud, their own data center, at one of our previous companies, and we have also moved from one private provider to another private provider, different regions, different data centers. We have done both. The problem in most cases is what features you need and how complex your infrastructure is. Say you own your entire data center: then you have control over your L3, the L3 of the TCP/IP stack. You lose that when you move to cloud. There are things like keepalived and gratuitous ARP; you will lose most of those on a cloud provider, but in a data center, because you own those pieces, you can use them. On the other hand, they are much harder to manage at scale and much costlier. The other main challenge is how you move the data out, because data residing in one particular data center has to be transferred somehow. [Follow-up from the audience, roughly: it's a banking domain, the data cannot leave the data center; only the app layer would move out.] Okay, so you just want the app layer to move out; that should be a fairly simple task, but it depends on compliance. Do you have to worry about PCI DSS? I think the problem will be PCI DSS, not the cloud provider; maintaining that audit requirement is the bigger issue. Just moving the stateless application nodes shouldn't be a problem. You do need to worry about the latency between the app nodes and the database you're talking to, that is the main problem, and about encrypting the traffic, because it's a public network and banking data, so you want strong encryption in transit. [Another question, roughly: the snapshots are every three hours, so could you lose up to three hours of data?] That is the RPO we took, yes, but there's another thing: we log almost every transaction in Kafka, and we keep all the transactions in Kafka for seven days. So even if we lose the data since the last snapshot, we have a transaction log in Kafka; remember that analytics pipeline where all the transactions are logged. So straight away, the restore is at most three hours behind, and then we can replay the transactions from Kafka. Replaying from Kafka won't take more than about two hours.
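A rough sketch of what that replay job could look like, using the segmentio/kafka-go client purely as one option; the broker address, topic name, and the apply step are all assumptions, not our actual pipeline.

```go
// Rough sketch: replay the transaction log from Kafka after restoring a
// snapshot. Broker, topic, and applyTransaction are placeholders.
package main

import (
	"context"
	"log"

	kafka "github.com/segmentio/kafka-go"
)

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers:     []string{"kafka.internal:9092"},
		GroupID:     "replay-after-restore", // fresh group, so StartOffset applies
		Topic:       "transactions",
		StartOffset: kafka.FirstOffset, // replay from the earliest retained message
	})
	defer r.Close()

	ctx := context.Background()
	for { // simplified: runs until an error; a real job would stop once caught up
		msg, err := r.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		applyTransaction(msg.Key, msg.Value)
	}
}

// applyTransaction is a placeholder for whatever re-applies a logged change
// to the restored database.
func applyTransaction(key, value []byte) {
	log.Printf("replaying %s", key)
}
```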
Yeah, sure, sorry. You mentioned that you aggressively cache; do you do anything special for cache invalidation or cache consistency? It's a good question: ensuring that the cache is expired properly. If you're explicitly invalidating the cache after every write, that is very hard to scale; if you're doing that, you are basically maintaining your data in the cache, and it's no longer a cache, it's your data. So when we talk about caches, we don't obsess over invalidation; we mostly let TTLs handle it. The moment you use, say, Redis as your data store, what do you call it, your cache or your actual data? At that point your cache is your data, and it's not really a cache any more. Treat a cache as a cache. Don't worry too much about consistency, don't worry too much about invalidation; yes, you need to expire it frequently, but treat it as a cache and make sure it expires. For us, invalidation is, on one side, expiring things quickly, and on the other, handling the critical paths explicitly. For very high-throughput paths where we've identified that data up to a minute stale is acceptable, we don't explicitly invalidate those caches; for others we do, on a case-by-case basis. For the critical ones we have to invalidate, we go ahead and delete that key or whatever it is. Next, a short question about whether we auto-scale the search layer. We don't auto-scale it; we still scale it, just manually. If you consider our Elasticsearch cluster, everything is scripted, so it's a matter of specifying the number we want it to grow to. We don't let it auto-scale on CPU, because you don't want shard migrations happening when you are not prepared for them. When we want it done, we set the thresholds properly, we check that traffic is not at its peak, and then we run the script manually. The follow-up was what happens to the data when we scale down, whether with two instances data gets lost across the nodes. You always have replicas; in Elasticsearch you can always set up replication. And what we do when removing a node is disable shard relocation first, manually move those shards to different nodes, and then terminate it, making sure the node holds no data. Or, in the case of adding, we disable allocation, add the node so it joins as a blank node, and then re-enable shard allocation. Is that all manual? For Elasticsearch, more or less, yes; and for Postgres we have scripts to create and attach slaves, but nothing more than that. That's partly the reason why we do snapshots every three hours: if you have to create a Postgres slave, you want that data at hand, and syncing or copying a huge amount of data over takes very long. So we attach the snapshot, we run a script, and we copy it. Okay.
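For reference, that "disable allocation, drain, then terminate" dance is just a couple of calls against the Elasticsearch cluster-settings API. Here is a rough sketch; the cluster address and node IP are placeholders.

```go
// Rough sketch of the scale-down dance against the Elasticsearch settings API:
// exclude a node by IP so its shards drain (or disable allocation entirely
// while adding blank nodes), then terminate the empty node.
// Cluster URL and node IP are placeholders.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func putClusterSettings(esURL, body string) error {
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings",
		bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("elasticsearch returned %s", resp.Status)
	}
	return nil
}

func main() {
	es := "http://es.internal:9200"

	// Drain the node we want to remove: shards relocate off this IP.
	if err := putClusterSettings(es,
		`{"transient":{"cluster.routing.allocation.exclude._ip":"10.0.0.42"}}`); err != nil {
		log.Fatal(err)
	}

	// When adding blank nodes instead, temporarily stop allocation with
	// {"transient":{"cluster.routing.allocation.enable":"none"}} and set it
	// back to "all" once the nodes have joined.
	fmt.Println("waiting for shards to move before terminating the node...")
}
```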
On the config service, do we do any kind of validation or testing of a change before it reaches the application? For the config service we don't really have automated validation, but we have stage. If it's a configuration update, say you're adding a new key or modifying a key, you first test it out on stage, the staging environment. Is it a complete, live staging environment? Yes. So you update it in staging, you say, okay, I've tested it, it works fine, and then you move to production. We also have a setup where you can start the Docker container on your own machine and it connects to stage, because you're over the VPN, so you can reach all of that. You run a local Consul agent, you pull the container, and you can get all the information you want. Okay, I think we're at the end of the session. Any more questions, or anything? Feel free to grab us afterwards; you must be really tired as well. If anything comes up, we're around after this, and we are hiring, so...