I'm a co-founder and CTO of Fiji Networks and someone who has been working on FIER Open Source for a very long time. I got my start at Anayana city in Aarasi Falls, if you have heard of it. That is where I learned the ropes, and we have been doing this for seven or eight years now.

So, quickly moving on to the agenda of the day. The first thing I want to convey is how difficult it is to get something on the cloud that scales, and scales seamlessly at any point of time, without getting your hands dirty with the underlying technologies of the cloud. A lot of vendors have tried to make life simpler for customers by building out large ecosystems around their public clouds. While that may serve as a quick way to get your POC done, in the long run it has its own challenges: unless you understand it fully and navigate it correctly, you can run into real trouble scaling up. And one of the biggest issues is interoperability on the cloud. Your workloads cannot be guaranteed to run across clouds; there is no standard, no interface, no API gateway of sorts where you can just pipe your workload into an interface and have it run across multiple clouds. And of course Bracken is there; they ran a survey and found that vendor lock-in on public clouds is one of the key concerns for a lot of players out there.

What we are trying to answer, and probably to ask a few more questions myself, is how to do this scaling without being hampered in any way by the underlying technologies, while being portable enough and stable enough to go that way. Let's look at the overall concern a lot of people have, which is where cloud agnosticism comes in. One of the big challenges with cloud native is that once you have really large deployments, you cannot control costs: once you start scaling, the underlying resources become effectively infinite, in the sense that a lot of instances spin up and you are no longer able to keep your costs at a reasonable level. There have been quite a few studies on how to get this balance right. Any cloud provider, AWS, GCE or anyone else, has some absolutely critical native services that probably no one can avoid to some extent, and if you need those, you can still use them. But what we also want you to do is convert the rest, roughly 80% of that cloud-native infrastructure, into standalone, cloud-agnostic infrastructure that runs multi-cloud. You could go to a dedicated server provider, someone in the U.S., get bare metal boxes, and run your container workloads across different locations. There have been a lot of studies, and I am not quoting the papers here, showing that you can potentially see savings of up to 40% just by going cloud agnostic with a minimum of cloud native.

Now, let's set aside this whole cloud-native versus cloud-agnostic discussion.
Let's get into the actual technology: when to scale, how to scale, what to scale. That is the most important question. Of course, in this day and age this slide hardly needs saying: everyone knows there is an equivalent for every cloud-native service out there; it is just that some time and effort is involved in getting it working. Name any cloud-native service and there is a free and open-source equivalent. Take Lambda, one of the more sophisticated pieces of engineering out there, and you have IBM's OpenWhisk. So that argument is moot; the technology exists for every cloud-native service that is available.

So let's get back to the most pressing question of the day for me: what do you scale, when do you scale, and why do you scale, if that matters at all. It is also important to understand what cannot be scaled. One of the biggest mistakes we all make is assuming everything is scalable, which could not be further from the truth for anyone who has seen production workloads and massive amounts of traffic coming in during, say, an exponential traffic rise. You could go viral, you could get on Reddit in a matter of minutes, some celebrity might tweet you and bring a humongous amount of traffic to your site. How do you prepare for that? Those are the questions we want you to answer.

First things first, one of the key challenges is getting the right box. If you do not have the right box, no matter what kind of scaling technology you have, it will fail. It will fail no matter what kind of containers or microservices you have, no matter what auto scaling groups or App Engine you use, simply because you do not have the bare minimum unit of capacity that can serve a request on its own. So the first thing you need to define is what it takes for your application to serve a request, or say ten requests per second: what is the minimum unit of compute, memory, and IOPS that I require for that? Unless you answer this question, it is not wise to start scaling.

You also have to over-provision on a per-box basis. Again, you have to do some statistical analysis to figure out how much headroom you can afford given your overall budget, but typically I would recommend that 20 to 30% headroom per box really helps. The reason is that it is so easy to get carried away by the word "scaling" that you stop making these decisions. No matter what technology you use, there is still an inherent latency in getting a new node up. Go to AWS, go to GCE, go to any cloud provider: there is still a finite latency, and if you talk to their architects they will actually recommend that you allow some buffer time just to get a node out. Even if you say, look, I can spin up a thousand Docker instances in no time, that still takes time. Spinning up a thousand instances is the easiest piece of infrastructure code you can write, but there is still a finite latency, and you cannot sacrifice your users for that window. So it is only common sense to over-provision on a per-box basis, so that you have that buffer until your backup plan, whatever array of technology and code you have written to take care of that scale, kicks in.
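To put rough numbers on that sizing exercise, here is a minimal back-of-the-envelope sketch in Python. Every per-request cost, box spec and latency figure in it is a made-up placeholder you would replace with measurements from your own application; the shape of the calculation, capacity per box limited by whichever resource runs out first, minus 20-30% headroom, is the point.

```python
# Rough sizing sketch: how many boxes for a target request rate, keeping
# headroom on every box? All per-request costs below are hypothetical;
# measure them for your own application first.

from math import ceil

# Measured (here: assumed) cost of serving ONE request on your stack.
CPU_SECONDS_PER_REQUEST = 0.05      # 50 ms of CPU time per request
MEMORY_MB_PER_REQUEST   = 8         # working memory per in-flight request
IOPS_PER_REQUEST        = 2         # disk operations per request
REQUEST_LATENCY_S       = 0.2       # wall-clock time a request stays in flight

# What one candidate box gives you (also assumptions).
BOX_CORES     = 8
BOX_MEMORY_MB = 32 * 1024
BOX_IOPS      = 3000

HEADROOM = 0.25   # keep ~25% of every box free until new nodes come up

def requests_per_second_per_box() -> float:
    """Capacity of one box, limited by the scarcest resource."""
    usable = 1.0 - HEADROOM
    by_cpu  = (BOX_CORES * usable) / CPU_SECONDS_PER_REQUEST
    # memory limits concurrency; convert to req/s via Little's law
    by_mem  = (BOX_MEMORY_MB * usable / MEMORY_MB_PER_REQUEST) / REQUEST_LATENCY_S
    by_iops = (BOX_IOPS * usable) / IOPS_PER_REQUEST
    return min(by_cpu, by_mem, by_iops)

def boxes_needed(target_rps: float) -> int:
    return ceil(target_rps / requests_per_second_per_box())

if __name__ == "__main__":
    per_box = requests_per_second_per_box()
    for rps in (100, 1000, 10000):
        print(f"{rps:>6} req/s -> {boxes_needed(rps)} boxes "
              f"({per_box:.0f} req/s per box with headroom)")
```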
So it is very important to get that right. And network planning is again super important; I cannot emphasize this enough. People make mistakes here: I have seen people under-provision bandwidth, set up their networks the wrong way, and so on. How many of you know Atlassian HipChat? Some of you might have heard of it. They went down massively because a gateway's bandwidth was too low, something like that; of course you never know the true extent, but this is what they shared as an RCA. You cannot over-emphasize getting the network setup right. I have seen people skip encryption between VPCs, which is ridiculous; traffic between VPCs should always be encrypted. Those are the kinds of things you have to get right before you even scale. This is why I keep saying: make your sizing decisions and your network decisions right, so that you are not sacrificing anything, and even when you scale, you can absorb it.

Now, the traditional model of scaling tells you to monitor a lot of metrics from your monitoring system, set up very fancy data pipelines, Kafka and all that, and watch every metric you can possibly get. But that could not be further from the truth. You have boxes with load averages of 10 and 15 running very smoothly, without any problems, because the machine has 72 cores; who cares about a load average of 15 on that? You need to get at the data that actually counts. For the end user, if you talk to the business people in your organization, they will tell you that an error in the browser, someone seeing something go wrong on your website, is the single biggest differentiator. If average page load times are high, the user bounces; if there is a 504 that gets tweeted out during a major sale, that is not good publicity at all. So it is the 5xx errors and the page load times. And surprisingly, people still have not figured out the TTFB problem in this day and age. You think that by using all sorts of caching mechanisms, Ajaxified requests, and request-response cycles in which you do not do any data lookups, you have solved the TTFB problem; but look at the major sites and they still have it, the time to first byte is still on the higher side. These kinds of things are, or should be, your triggers to scale, not the typical monitoring metrics. More of your attention should go into tooling built around these end-user signals, preferably ElastAlert, the ELK stack and so on, which can push out notifications whenever they see a pattern of high load, high latency and so on. That is the most important thing.

And again, this is one of the big questions you have to answer: what is the mode of scaling that you want? Do you want the proactive approach, where you see that a huge sale is coming in and just over-provision all your instances by a factor of one or two, double the instances? You can get through the sale, but the unit economics are not going to work. Either way, you inherently assume a lot of things here: for example, you assume there will be latency in getting new instances, no matter what technology you use, and you also assume there will be warm-up problems. No matter how fast you spin up a node, you still need to spend some time getting it up to speed, making sure the caches are built and so on, before it can actually be productive. Otherwise, the more nodes you add, the more you exponentially degrade your existing customers' experience.
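Going back to those end-user triggers for a second, here is a minimal sketch of the kind of check a tool like ElastAlert runs for you, assuming your access logs already land in Elasticsearch. The URL, index pattern, field names and threshold are all assumptions, not anything specific to my setup; the point is only that the trigger is an error ratio your users actually see, not a load average.

```python
# Minimal "scale trigger" sketch against Elasticsearch access logs.
# Index pattern, field names and threshold are assumptions; adjust them to
# whatever your log pipeline actually writes. ElastAlert or Watcher can do
# the same thing without custom code.

import requests

ES_URL = "http://localhost:9200"          # placeholder
INDEX  = "access-logs-*"                  # placeholder index pattern
ERROR_RATIO_THRESHOLD = 0.02              # act if >2% of responses are 5xx

def count(query: dict) -> int:
    """Count documents matching a query via the _count API."""
    resp = requests.post(f"{ES_URL}/{INDEX}/_count",
                         json={"query": query}, timeout=10)
    resp.raise_for_status()
    return resp.json()["count"]

def should_scale_out() -> bool:
    last_5m = {"range": {"@timestamp": {"gte": "now-5m"}}}
    total  = count({"bool": {"filter": [last_5m]}})
    errors = count({"bool": {"filter": [last_5m,
                                        {"range": {"status": {"gte": 500}}}]}})
    if total == 0:
        return False
    ratio = errors / total
    print(f"last 5m: {total} requests, {errors} 5xx ({ratio:.2%})")
    return ratio > ERROR_RATIO_THRESHOLD

if __name__ == "__main__":
    if should_scale_out():
        print("trigger: user-visible error ratio above threshold -- scale out")
```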
So, to come back to it: that is one of the challenges of scaling, you can end up making the problem worse than the one you started out to fix. You have to assume all of these things. And typically you have to have this clear segregation in your architecture between what is scalable and what is not; once you have made that determination, you can answer this very easily. Then you can make it slightly smarter, almost machine-learning-like: set up your data pipelines, assume rapid provisioning using containers, where typically in under a second you have a new container, and apply a probabilistic offset where you always keep around 20% spare capacity in containers. You are overcommitting on CPU and RAM anyway, so you should be able to afford that much spare on your containers. And use new-age data pipelines: forget about those monitoring alerts that come from your Nagios, Zabbix and all that, they do not help at all. That is one single piece of advice I would like to give here: do not trust them. Trust the new-age data pipelines; look at your ELK stacks, look at your dashboards, your Grafanas and so on, they make much more sense. Sensu is again making a lot of sense today.

It is also important to understand what cannot be scaled. For example, your DB: no matter what kind of help you have, you cannot scale that in an instant. At most you can add a few slaves and basically delay the onset of the big problem, but you cannot solve DB challenges just like that. It requires patience, it requires a lot of re-architecting, a lot of sharding. Even if you have to do, say, multi-master replication with UUIDs in your application, you cannot do that on the fly. One fine day your product engineer comes in and says, tonight I want this master-master replication: it does not work like that. You will be in a serious soup if you think you can get away with DB scaling like that. It is important to understand that these things require patience and planning; if you understand that, then you are fine.

And again, network setup, like I told you before. We have seen instances where someone was getting bottlenecked on DNS lookups, because they were doing millions of lookups to update their customer apps with location data and weather data for each customer's region, and so on, millions of requests per second. The TTLs were quite short, and the rest of the infra was fine. So we had to set up things like Unbound, which can do millions of lookups per second without breaking a sweat. Those are the kinds of things you have to benchmark, run Siege and tools like that, to make sure you understand what you have to do.
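On that DNS point, the simplest way to see whether a local caching resolver like Unbound is earning its keep is to time lookups the same way you would run Siege against an HTTP endpoint. A crude sketch, with placeholder hostnames, that you could run before and after pointing the box at the local resolver:

```python
# Crude DNS lookup benchmark: time N lookups through the system resolver.
# Run it before and after pointing /etc/resolv.conf at a local caching
# resolver such as Unbound and compare. Hostnames are placeholders.

import socket
import time

HOSTS = ["example.com", "example.org", "example.net"]   # placeholder names
N = 1000

def bench(host: str, n: int) -> float:
    """Average seconds per lookup over n lookups."""
    start = time.perf_counter()
    for _ in range(n):
        socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
    return (time.perf_counter() - start) / n

if __name__ == "__main__":
    for host in HOSTS:
        avg = bench(host, N)
        print(f"{host}: {avg * 1000:.2f} ms per lookup over {N} lookups")
```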
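And going back to that probabilistic offset idea from a minute ago, keeping roughly 20% of your containers spare at all times, a rough sketch of the reconciliation arithmetic looks like this. The busy and running counts would come from your load balancer or orchestrator stats, and the actual scale call is left as a comment because it depends on which orchestrator you use; this is only meant to show the arithmetic, not a real autoscaler.

```python
# Sketch of an "always keep ~20% spare capacity" policy for a stateless
# microservice. busy/running counts would come from your orchestrator or
# load balancer stats; the real scale call (Rancher service scale, a
# Kubernetes replica update, ...) is deliberately left out.

from math import ceil

SPARE_FRACTION = 0.20   # fraction of containers we want idle at any time
MIN_CONTAINERS = 2

def desired_containers(busy: int) -> int:
    """Enough containers so that `busy` of them still leaves 20% idle."""
    return max(MIN_CONTAINERS, ceil(busy / (1.0 - SPARE_FRACTION)))

def reconcile(busy: int, running: int) -> int:
    target = desired_containers(busy)
    if target > running:
        print(f"scale out: {running} -> {target} containers")
    elif target < running:
        print(f"scale in:  {running} -> {target} containers")
    else:
        print(f"steady at {running} containers")
    return target   # in real life: call the orchestrator here

if __name__ == "__main__":
    reconcile(busy=9, running=10)   # buffer nearly gone -> add capacity
    reconcile(busy=4, running=10)   # plenty spare -> shrink
```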
And this is how a typical cloud-agnostic infrastructure looks. You have a lot of incoming traffic, and everything terminates on the firewall and the HAProxy, basically. You have your own HAProxies, and that should be the single entry point, nothing else. The first thing, I always say, is run HAProxy no matter what kind of traffic you have: stateless traffic, stateful traffic, plain TCP traffic, whatever it is. It takes care of connection management, so that overhead is taken off; that is a monkey off your application's back. Once you do that, it is essentially just building simple microservices, using bare metal or your Dockerized infrastructure. For example, the bottom layer here typically goes onto pre-provisioned machines with good IOPS and so on, and the top layer is a simple web-and-API kind of setup. The more complex it gets, the more microservices you have, and this layer is what gets stressed in a typical high-scale scenario. If you have a major issue in the data layer, then you need to go back to your application, because that layer inherently cannot be scaled. You can manage by over-provisioning, setting up massive systems, one-terabyte-RAM boxes where you just load everything onto MySQL; you can do those things, but it is inherently not scalable unless you spend that kind of money to ride out whatever traffic you have. The good scaling problem to have is here: if you are able to convert all your scaling challenges into this app layer, you are good to go. If the problem is anywhere else, you essentially need to re-architect your application and figure out what needs to be done.

And this next slide is slightly low resolution, but this is that big segmentation I was talking about. You want one part of your network with absolute reliability: no major changes, nothing changed randomly, because these are critical pieces of infrastructure, for example your VPN deployments, your firewalls, or even your major DB deployments. That part cannot be changed casually. I will just read this out: low rate of change, high cost of change, and low tolerance for disruption. Changing your DB to a sharded mechanism or a master-master setup is not going to happen without its own downtime, and how much downtime you can afford in a production environment is always open to question. The app-specific part of the network is where you need maximum agility: high rate of change, low cost of change, and you can tolerate a lot of disruption, because everything there follows the no-single-point-of-failure paradigm and every instance or container is a throwaway instance. You lose one box or one rack full of servers and you should not break a sweat over your scaling challenges. That is the entire idea behind designing an app-specific network with high agility.

How do you get this balance right? Again, there are some very specific open-source tools for this; Rancher is what we use. Of course, there are other competing open-source technologies very similar to it, but this is one we have had very, very good results with, and it supports all the container orchestration platforms of today. Kubernetes, and I do not want to start a flame war here, looks like the one slightly ahead in the race; Mesos is also getting quite popular, but Kubernetes is what a lot of people are betting their next-generation products on, and I am not talking about hobbyists like me and most of us here, but public cloud providers betting their stuff on it.
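Since HAProxy is the single entry point in front of everything in this picture, the piece you end up automating is regenerating its backend list as containers come and go, which is exactly what the service discovery I am about to describe gives you. Here is a minimal sketch of what gets generated, with a placeholder service name and addresses; in a real deployment Rancher's load balancer, or something like consul-template or confd, produces and reloads this for you.

```python
# Sketch: render an HAProxy backend stanza from a list of discovered
# instances. In practice a service-discovery integration regenerates this
# and soft-reloads HAProxy; this only shows the shape of the output.
# Service name and endpoints are placeholders.

from dataclasses import dataclass
from typing import List

@dataclass
class Endpoint:
    host: str
    port: int

def render_backend(service: str, endpoints: List[Endpoint]) -> str:
    lines = [
        f"backend {service}",
        "    balance roundrobin",
        "    option httpchk GET /health",
    ]
    for i, ep in enumerate(endpoints):
        lines.append(f"    server {service}-{i} {ep.host}:{ep.port} check")
    return "\n".join(lines)

if __name__ == "__main__":
    discovered = [Endpoint("10.0.1.11", 8080), Endpoint("10.0.1.12", 8080)]
    print(render_backend("orders-api", discovered))
    # Write this into haproxy.cfg and soft-reload HAProxy (haproxy -sf <pid>)
    # whenever the discovered set of containers changes.
```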
And Rancher provides a lot of infrastructure services in terms of service discovery, where you do not have to write Ansible scripts just to add a node to the HAProxy. Those days are gone. Now, the moment a node comes up, there should be auto-discovery: service discovery, DNS names written for it, traffic routed to it, and automatic detection if that container goes down. That is the new paradigm; no one is writing scripts to do this manually anymore. So you get service discovery, a load balancing layer, and it also provides an overlay network which you can secure with encryption if you want. And it runs on Linux bare metal and you can hook it up with any of the clouds, so you can be truly multi-cloud using this kind of approach.

And again, being multi-cloud, you need to be prepared for how you are going to handle latency in your data stores across geographical regions. Some data stores are very amenable to cross-geographical replication; for example, Mongo is something a lot of people run across different continents and it still works quite decently. But you have to be very careful about which data stores you choose for that kind of cross-geographical replication, to make sure you can meet a true multi-cloud requirement.

Rancher also has a good set of catalogs, so you can set up something like a Percona XtraDB Cluster in no time: it is a single click, you just select the number of containers you want and you are done. You can build your catalogs, build out your microservices approach, and scale it up. What Rancher automatically does for you is monitor the health of each and every container, which is something Docker has solved recently but was a very big concern area for Docker deployments: how do I monitor my containers, what do I do if they go down, how do I write those detection algorithms, and so on. Rancher essentially solves all of that, along with using HAProxy for your health checks, so that it detects if a container is under duress, kills it, and restarts another container if required, and your application as a whole does not even notice. And you can integrate with your existing CI/CD pipelines, your Bamboo, GitLab, whatever it is.

Right, so this is one simple multi-cloud deployment that we have: two instances on AWS and one on Google. These are the three containers I am running, a Postgres container among them, and if you can see this, they are actually split between these two boxes. You can set container affinity rules: for example, you can say that this particular microservice will go into, let's say, Azure, and this particular microservice will go into AWS. Or this particular thing may not even be a microservice: it can be a traditional VM, and it will go onto a bare metal box with a huge amount of IOPS, a VM, an ASG, whatever it is.
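To go back to the health-check point for a second: Rancher handles the detect-and-replace loop for you, but purely to illustrate the idea, a naive version against the plain Docker CLI might look like the sketch below. The container name is a placeholder, and a real orchestrator would reschedule on another host with proper backoff instead of blindly restarting in place.

```python
# Illustration only: a naive watchdog that restarts a container when its
# Docker healthcheck reports "unhealthy". Rancher / Kubernetes do this for
# you with backoff and rescheduling; the container name is a placeholder
# and the container must define a HEALTHCHECK for this to report anything.

import subprocess
import time

CONTAINER = "orders-api"      # placeholder container name
CHECK_INTERVAL_S = 10

def health_status(name: str) -> str:
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", name],
        capture_output=True, text=True,
    )
    return out.stdout.strip() if out.returncode == 0 else "missing"

if __name__ == "__main__":
    while True:
        status = health_status(CONTAINER)
        print(f"{CONTAINER}: {status}")
        if status == "unhealthy":
            # A real orchestrator would reschedule, possibly on another host;
            # here we just restart the container in place.
            subprocess.run(["docker", "restart", CONTAINER])
        time.sleep(CHECK_INTERVAL_S)
```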
So there is a school of thought that says some applications should not be containerized, and that is a very divisive topic. One school of thought says that everything that can run on a VM can run in a container, and another conservatively says that, look, not everything should be containerized. With that in mind, you can design your application accordingly. If I go back to this slide, this entire region is going to be your containerized microservices infrastructure. Having 50 or 60 microservices for a major deployment is no big deal these days; people have hundreds of microservices, with different cross-functional teams handling each of them. And this is where most of the action is going to be when a scaling challenge happens.

And yeah, this is my summary. Go for cloud-agnostic deployments over pure cloud native for cost efficiency and also for performance improvements, because the moment you get down to bare metal you have a lot of scope for further optimization. For scaling up your app layer, our recommendation is to use microservices, and to use new-age metrics like I said, not the traditional monitoring methods. And the future is that any truly web-scale deployment is going to be multi-zone and multi-cloud; you just cannot escape that, and even if you do not want to, you will be forced into it. So you have to be prepared for multi-zone and multi-cloud deployments across multiple public clouds. That is how deployments are going to be.

I think we have five minutes, a couple of minutes actually, for Q&A. If you have a question please raise your hand, we will come to you with a mic and then you can ask.

Two questions from me. The first one: you are talking about those microservices in the app architecture, right? There is a general problem: as long as all my communication stays within the box, it is pretty fast, but once I break it up into multiple microservices running independently, I have to go over the network. Is there something you have thought about, like how to handle that split?

So, one simple way this congestion happens is that you have not provisioned enough IOPS on your boxes. A lot of the time it is actually not network bandwidth but local IOPS bandwidth that becomes the bottleneck for containers across boxes to talk to each other. The second challenge is that if containers are talking to each other directly, that is an overhead, which is why what Rancher does, and what we also do in our deployments, is make every entry point to a microservice a load balancer. You should not be using one container to route requests to another and so on; that should be avoided entirely to solve this bottleneck problem, because HAProxy is the one thing you can literally pump millions of requests into and it will not break a sweat. So if you have, say, 60 microservices, the entry point to every one of them should be an HAProxy load balancer. That typically solves the problem. In practice, what we have done is deploy our infrastructure on ten-gig networks, with enough IOPS on a per-Docker-host basis, which is why that sizing decision comes up again: what is the right box? In a Docker context you also have to understand what the right box is at the bare metal level, where you have enough provisioned IOPS.
And depending on what kind of workloads you run, you have to decide whether a low-frequency processor, say 2.4 or 2.5 GHz, is enough or whether you need something like 3.2 or 3.3 GHz for your app. Once you get that box right, and once you have those entry points with HAProxy as the load balancer, it typically works out well. Sure, sure, let's do that.

The other question: on that last point, you said multi-cloud, but you also said multi-zone. Is there any specific reason you want to highlight multi-zone?

Multi-zone is, see, the least we want people to be able to do: at the very least, do a multi-zone deployment for DR reasons and so on. That is not escapable at all. But multi-cloud is how it is going to be, and if you want to be multi-cloud you have to be agnostic; you cannot just move your workloads from one cloud to the other.

Multi-zone is generally the safer first step, then.

Exactly, exactly. That is the low-hanging fruit. If you cannot go cloud agnostic today, go multi-zone. That will highlight the challenges in your architecture, where your bottlenecks are with cross-geographical replication and so on, and then you can graduate to multi-cloud. Thanks.