Good evening, everyone. Welcome to the fourth episode of the Scaling from First Principles series. The theme of today's episode is building products for scale. We have three engineering leaders with us today: Ajay Gore, Manjot Pahwa and Puneet Khanduri. Ajay is an operating partner at Sequoia Capital India and SEA. He served as Group CTO at Gojek and has been part of Gojek from the very early days, so he brings to the table a great perspective from Gojek's scaling journey. Manjot works as a product manager at Stripe. She was a product manager at Google earlier, responsible for product development and execution for Kubernetes and Google Kubernetes Engine networking, so she brings the experience of Google scale to the table. The third panelist is Puneet Khanduri, CEO of Sn126, where he is building Isotope, a product that takes care of API regression testing. Previously, he led the development of ML infrastructure and AI services at Twitter. I think we have an amazing panel with great experience of building products for scale. I would like the session to be informal, so please feel free to ask any questions you may have by typing them in the chat window, either in Zoom or on YouTube. I will review the questions and bring them to the panel at an appropriate time. Let's get started. Thanks Ajay, Manjot and Puneet for joining.

Thanks for having us, Anand.

It's my pleasure. Let's start with each of you telling us something about the products you have built at scale from your past experience. Maybe we'll start with Ajay.

Hey, thanks for welcoming me and for hosting me here. It is great to be here. I was at Gojek for five years and we saw crazy growth. I was very fortunate to be part of the journey where we scaled from literally some 10,000 orders per day to more than 5 million orders a day, which was a great learning experience.
On top of that, we became the first super app of Southeast Asia, with more than 17 products on offer. If you go to Indonesia, you'll have an amazing experience; whatever we do over there, we do everything possible. We used to name our products "Go" something, like Go-Car, Go-Ride, Go-Food, and then somebody started calling us Go-Everything. That was literally an amazing moment, because when I talked to people in India, I always used to say: take Paytm, MakeMyTrip, Swiggy, Zomato, Uber, Ola, whatever you can think of, and you can find it all in one app. And on top of it, we send a masseuse to your home, we deliver your groceries to your home, we fix your car on the spot and we deliver your water as well. We became a lifestyle app where you do not have to go to multiple places to get things done. First we used to say we buy people time, and then we said we bring time to your life. So that's what Gojek is all about, and it will be good to talk about some lessons from that side.

Thanks, Ajay. Let's see what stories Manjot has.

First of all, thank you so much for organizing such amazing events and bringing all the panelists together. I will talk about one of the crazy scaling experiences that I have fortunately been a part of in the past. This was related to Kubernetes. At the time I joined the team, it was basically one of those niche, not super popular projects which only very specific communities in the open source world and in technical infrastructure and DevOps would know about.
We saw it grow from there to becoming the mass movement that it is today, where we saw a massive change not just in how Kubernetes itself and the ecosystem built on top of it evolved, but in the nature of open source development itself and how people think about building developer tools and companies: the whole change that came with the Kubernetes movement. I would love to chat more about some of the experiences we saw there. Even within that, there is one aspect of managing scale with respect to the features we needed to support and the absolute traffic we needed to handle. But there is also scaling in terms of teams, scaling in terms of processes, and scaling in terms of tools, which sometimes does not get as much attention as it should. Those are the things that truly make a difference when you're going through those massive growth periods of a product and an ecosystem. Besides that, I've also tried to run scribe.ml in the past, which was my attempt at building something from scratch, and a whole different ballgame compared to working at Google in a super large team, where every single gear is moving at 100 miles per hour and you basically have to keep up with it. At a startup you're working on every single thing, and you're obviously keeping up with the community, but at the same time you're first trying to get to a place where you can even think about scaling. That journey is extremely different from working on a product that already has product-market fit and is in the growth phase, and I think the learnings on both ends can really complete the picture in terms of how to think about scale and when to think about scale.

Very interesting. Let's now hear what stories Puneet has.

Thanks, Anand.
I'm going to be talking about my experience at Twitter and the experience we've had building Sn126, which are very different experiences. At Twitter, I joined in 2013 as we were ramping up for an IPO, and we had these sour memories from the 2010 World Cup, when any time anybody scored a goal, Twitter would go down and the fail whale would come up, because suddenly everybody starts tweeting all at the same time and you get a spike that the infrastructure is just not able to handle. We're literally talking about physical scale, and our systems weren't architected well enough to handle that. We were a Ruby on Rails monolith, and we had to go through the painful process of migrating to a Scala-based microservices architecture. In 2014, with every goal, regardless of who scored it, we felt like we were the ones winning, because the infrastructure was not falling over. Having gone through that journey with the company, the kind of toll it took on the organization, on the code base, the conflicts among the engineers and so on, those are some of the stories I am hoping to share as part of this discussion. The second part is Sn126, where I made the mistake of thinking that since we were starting small, we didn't have to think about scale at that point. But being in the business of monitoring our customers' infrastructure to understand all of their key business scenarios, which we then use in our simulations, we had to monitor 100% of their traffic. So we had to match our customers' scale from day one. Even if we didn't have scale, the customers we were going after had scale, and our infrastructure had to match their scale.
As a result, we ended up re-architecting our systems at least three times within the first six months, just so we could get to technical maturity with our alpha customer. Those are some of the experiences I hope to talk about as part of this discussion.

Thanks for that, sounds pretty interesting. So let's start by taking up one use case and talking about it: one of the products you've built at scale, and what the journey has been like. We'll start with Ajay again.

I thought it was a random round robin! So, early days: Go-Ride was one of the biggest products, and it was growing 5% to 10% week-on-week. That growth is explosive once your traffic ramps up. It does not sound like a lot, because 10% week-on-week means you have 10,000 today and 11,000 next week. But the problem is that the week after that you have 13,000, the following week 18,000, then 20,000, and within a few weeks you have doubled the traffic. Within two more months you double the traffic again, and within two quarters you double it again. Just to give you perspective, we had, as I said, around 10,000 orders a day in June 2015. By December 2015, we were at 250,000 orders a day, and by March we had half a million orders a day. Once you have that crazy growth, you have to rewrite a lot. We were also on Rails, and the strategy we chose was that our OMS would remain on Rails, but we would quickly move towards asynchronous microservices for performing the critical functional jobs. Let's talk about those. There are two functional jobs which are very critical. One is ingesting the live locations you get from the drivers.
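As a brief aside, the week-on-week compounding Ajay describes can be made concrete with a few lines of arithmetic. This is a minimal sketch; the function name is illustrative, and it simply counts whole weeks until traffic at least doubles at a steady growth rate:

```python
def weeks_to_double(weekly_growth: float) -> int:
    """Count whole weeks until traffic at least doubles at a steady rate."""
    traffic, weeks = 1.0, 0
    while traffic < 2.0:
        traffic *= 1.0 + weekly_growth
        weeks += 1
    return weeks

# At 10% week-on-week, traffic doubles in about 8 weeks;
# at 5% it still doubles in about 15 weeks.
print(weeks_to_double(0.10))
print(weeks_to_double(0.05))
```

At 10% week-on-week, four doublings fit into a bit over half a year, which is roughly the 10,000-to-250,000 trajectory described above.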
And our driver numbers exploded at that point in time. Till August 2015, we had around 200,000 to 250,000 drivers. By August, September, we had half a million drivers, and then we had a million drivers. Just to give you context on a million drivers: every driver sends you a GPS ping every 10 seconds, so you get six pings a minute. There are 1,440 minutes in a day, and if you are getting six pings each from a million drivers, that's around six million pings per minute, which comes to over eight billion events a day. That's crazy. We have to store them, update them, and do many more things with those events. So that was one thing we had to carve out, because if we don't store those pings appropriately and don't batch them, we can't allocate the nearest driver to you. The second problem we had to scale out was the driver allocation problem. It started very simple: when our team first landed, we used to do a geo search on Redis with a Lua plugin. That wouldn't scale. So we went to Mongo, which provided geo queries, and that wouldn't scale either. We were new, and we did not have any idea. What saved us was proper algorithms, and that's where algorithms became super important. So we actually got into algorithms and geometry: how do we solve this? We started using the S2 libraries and creating tons and tons of ephemeral workers, bringing them up and bringing them down. We ended up rewriting our allocation engine three or four times within the first six months, and we ended up rewriting our mobile app around three to four times as well.
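The core trick behind cell-based driver allocation can be sketched in miniature. The real system used Google's S2 geometry cells; the toy version below uses a plain 0.01-degree lat/lon grid (roughly 1 km at the equator) purely for illustration, and all names are invented. The point is the same: bucket drivers into cells as pings arrive, so "nearest driver" becomes a lookup in the rider's cell and its neighbors rather than a scan over a million drivers.

```python
from collections import defaultdict
from math import hypot

CELL = 0.01  # grid size in degrees, roughly 1 km at the equator (illustrative)

def cell_of(lat: float, lon: float) -> tuple:
    return (round(lat / CELL), round(lon / CELL))

class DriverIndex:
    def __init__(self):
        self.cells = defaultdict(dict)   # cell -> {driver_id: (lat, lon)}
        self.where = {}                  # driver_id -> current cell

    def update(self, driver_id, lat, lon):
        # Each GPS ping moves the driver into its current cell.
        old = self.where.get(driver_id)
        if old is not None:
            self.cells[old].pop(driver_id, None)
        c = cell_of(lat, lon)
        self.cells[c][driver_id] = (lat, lon)
        self.where[driver_id] = c

    def nearest(self, lat, lon):
        # Only search the rider's cell plus its 8 neighbors.
        ci, cj = cell_of(lat, lon)
        best, best_d = None, float("inf")
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for drv, (dlat, dlon) in self.cells[(ci + di, cj + dj)].items():
                    d = hypot(dlat - lat, dlon - lon)
                    if d < best_d:
                        best, best_d = drv, d
        return best
```

A production version would use proper S2 cell coverings and great-circle distances, but the bucket-then-search-neighbors shape is the same idea.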
And here's what used to happen: every day around three o'clock, systems would break, because once you don't have drivers, the customer will try again, so your concurrent requests go up really high. The first thing we learned is that you need amazing, graceful traffic throttling for customers, saying "please wait, we are trying to find a driver for you", or "there's no way we can give you a driver right now", so that systems don't cascade when one system goes down. At peak we were dealing with around 1.2 million concurrent connections from the drivers and around three to four million concurrent connections from the customers who were trying to get somewhere. The biggest learning we had was how to do traffic shaping: drop the traffic as soon as possible, don't let it go to the application layer, then authentication, then authorization among microservices and all that. Even now, our actual OMS is still in Rails; we still process those orders in Rails. Our functional pieces are in Clojure and Go, and we have a massive message bus on Kafka. When we were implementing Kafka, it was literally alpha, gRPC was barely getting mature, and Kubernetes was not on the scene. So we were trying to build on the latest technologies, and we were only 50 of us. It was crazy. So here are the things we learned. First, resilience is a first-class concern; spend a lot of time on resilience. Second, asynchronous is the biggest tool you have; use asynchronous as much as you can. Synchronous microservices are useless; they'll still behave like a monolith. And third, try to drop traffic as soon as you can figure out there's something wrong with it, at the proxy layer, using headers or parameters, whatever it is. Those are the first three learnings from the first six months.
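The "drop traffic as early as you can" idea can be sketched as an edge gate that caps in-flight work and rejects overflow with a graceful message before the request ever reaches authentication or the application layer. This is a minimal single-process sketch with invented names and thresholds, not Gojek's actual implementation:

```python
class EdgeGate:
    """Caps in-flight requests at the edge; rejects overflow cheaply."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def admit(self):
        if self.in_flight >= self.max_in_flight:
            # Reject at the proxy: no auth, no DB, no app server touched.
            return (429, "Please wait, we are trying to find a driver for you.")
        self.in_flight += 1
        return (200, "admitted")

    def done(self):
        # Called when a request finishes, freeing a slot.
        self.in_flight -= 1
```

A real gate would live in the proxy tier (and use atomic counters or shared state across instances), but the shape of the decision is the same: count in-flight work, and answer the retry storm with a friendly "please wait" instead of letting it cascade downstream.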
We can talk about more learnings later, but that was what we learned in the first six months.

Thanks, Ajay. It's very interesting to see that when you're going through a massive scaling path like that, it feels more like firefighting every day, right?

I'll tell you a joke we used to have. There is a river, and we have to take a boat from one shore to the other. We are all in diving suits with oxygen, and the boat has a lot of holes. We plug our fingers into those holes and just ride across the river every day, so that the boat does not sink, with all our fingers in. It was literally like that: all hands on deck every day. What is going down? Which database is failing? We learned the value of caching. The thing is, we had found a crazy product-market fit, and very few products go through this kind of 10% week-on-week growth. That was both a blessing and a curse, because every day you have to go and tell the business owners why we went down today, and we always had the same answers. And they're like, why can't you just put in 100 more machines? No, it doesn't work like that; you have to change the fundamental architecture. You can't carry a big truckload using 10 more cars; the cars will break down. You can't put in 10 engines to make something bigger. So yeah, the re-architecture was on the fly. It was like changing the pistons of a running engine every day.

Interesting. So let's now look at the other side.
If you're working at scale already, you know you're building for scale from day one. Manjot, can you tell us how it was at Google, building products when you know Google scale is already there?

So Anand, I was actually going to tell another Kubernetes story, but listening to Ajay's answer on graceful degradation, I remembered a past life I had before I moved to the dark side of product: I used to be an SRE at Google. I'll tell a very fun story in which you'll see how systems, as well as people and organizations, evolved simultaneously. When I joined as a new SRE, I was handed something big: hey, you're basically responsible for all of Google Photos, make sure it scales, and do whatever it takes. One of the problems I noticed was that we used to have a lot of cascading failures, and at organizations of the scale of Google, one failure like that is potentially a catastrophe in multiple systems. So I'd seen one of those failures, and I came across an interesting document from a person on another team. This particular document spoke about load shedding: the importance of shedding requests if a particular task is overloaded when it receives a request, or if that particular client is globally exceeding its limits and essentially abusing the system. I reached out to the author of that document, and it turns out I earned myself an assignment to make it work for Google Photos. So then started the journey of me adding small features here and there to what was essentially a load shedding library that would take a complex set of factors into account (RAM, memory, and so on).
It looked at that individual task, that particular process, in terms of the number of concurrent requests it was handling, and it would also try to make a prediction by looking at the global picture through an external system: is this particular client (and by client I mean things like Gmail, Photos, or YouTube sending me requests) actually abusing the system overall, and should I drop this request before sending it downstream? Which is exactly what Ajay was also referring to. As we kept improving that particular library, we ensured that it was generic enough, and as we kept adding features, I would sometimes get pinged by random teams. At that time we were four people, all working on different teams, just collaborating on this one interesting side project. So we completed the first couple of versions and actually used it in production. The thing about outages is that if people knew when they were about to happen, they would obviously prepare for them, but nobody knows when they're about to happen. One day, when we already had the load shedding library integrated into our backend services, another outage started happening, where we suddenly had a flurry of re-uploads from the Android Google Photos app. That is when we saw the beauty of it, and how seamlessly it handled things. Without those particular load shedding protections in place, it would have been a catastrophe, not just for Google Photos but for downstream Bigtable and other storage services, because the way these services work at Google, they're actually horizontal.
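The two checks described here, local task overload and global per-client abuse, can be sketched as one small decision function. This is a hedged illustration, not Google's library; the names and the idea of an externally supplied global QPS figure are assumptions standing in for the real signals:

```python
def should_shed(local_in_flight: int, local_limit: int,
                client_global_qps: float, client_quota_qps: float) -> bool:
    """Decide whether to drop a request before sending it downstream."""
    if local_in_flight >= local_limit:
        return True   # this task is itself overloaded: protect it
    if client_global_qps > client_quota_qps:
        return True   # this client (Gmail, Photos, ...) is over its global limit
    return False
```

The real system also factored in RAM and predictive signals, but the skeleton is the same: a cheap local check plus a global view fetched from an external system, evaluated before any downstream work is done.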
So these are shared services that Photos, YouTube, all these services use. The best part about being an SRE is that nobody really notices you until something really bad happens, and then you get to say: okay, this is how we were protected. So yeah, it was very interesting to see how we evolved from one special use case for Google Photos to keeping the library generic enough that it essentially found internal product-market fit within Google. All sorts of teams inside Google started integrating their services with this library, so we had around 300 to 400 different services inside Google using it. What happened along the way was that people recognized, hey, this is an amazing body of work and we should have these people working full-time on it. So that ad hoc working group with a side project became an actual team. And not just that: what started off as a simple load shedding library became a set of four applications with four area leads. The first was, as I mentioned, the actual core server throttler and load shedding library. The second was an asynchronous pub/sub processing system, so that you could offload batch processing. The third was a caching system for RPCs. The fourth one I'm actually forgetting, but if I remember, I'll definitely mention it. So yeah, that was a very interesting journey, not just of how systems scale but also of how teams and organizations scale along with them.

Very interesting. I have one question around that. As an SRE, you said your presence gets felt only when something goes wrong. So now that you've built this kind of library and it's at Google scale, how do you test it and make sure it's really working?

That's a great question. I mean, there are the obvious tests, right?
While you're actually building these libraries, you make sure you write unit tests and things like that, which are obviously never enough, and that's why we have Puneet's awesome product; I'm sure he'll be the right person to talk a whole lot more about that, and I'll make sure he does. So there's unit testing and there's integration testing, but particularly for testing this load shedding library, we actually built an internal load testing framework. We designed a whole load testing framework, and we would try out different types of load and different types of internal traffic replay. Even that was not enough to truly verify whether it worked or not. Again, these are all learnings, exactly as I mentioned: one thing goes wrong, you fix that, and then another thing goes wrong. So eventually we also had a teeing system, where we sent a certain percentage of live production traffic to validate that particular load shedding library and whether it worked or not. And this is where I think Isotope today is doing a phenomenal job.

Thanks so much for that plug, I appreciate it.

Quite interesting. I think we'll come back and look at tooling for scale a little later, but let's move on to Puneet and see what story he has.

Yeah, so at Twitter we had a lot of innovations following suit from Google. We hired a lot of people from Google who came over and then had the opportunity to build these systems from scratch, and they ended up open sourcing a lot of their work. A bunch of observability tools came out of Twitter as its alumni started companies, Wavefront being one of them. Matt Klein, for example, helped build TFE, the Twitter Front End, and then went on to work at Lyft and built Envoy, which is pretty phenomenal today. I myself did some work on an open source tool called Diffy, and then went on to build Isotope at Sn126. One of the stories that comes to mind in relation
to load shedding, specifically because it's been a hot topic today, is about a developer who made a very subtle change to their code, where they ended up reading the same downstream value twice. They were querying a data store and were supposed to reuse the value they had read from the downstream data source, but they made a small change in their Scala code, not realizing that the identifier they were using as a variable was actually a function call tied to an async downstream call to a data store. They just replaced it everywhere, and then they basically doubled the number of downstream calls, reading the same thing over and over again. This was one of those disasters that would have happened but actually didn't. This code was going to be deployed on a 10,000-machine cluster. That cluster was sending about a million RPS down to the data store, which would have suddenly doubled, and the underlying data store was multi-tenant. When it got hammered with twice the traffic it expects from its most expensive customer, it would have gone down, and subsequently all of its upstream clients, not just the one with the bug, would have fallen over as well. That would have caused a tier 0 event at Twitter: nothing would be happening, the entire business would be down until that bug was resolved. This bug actually didn't happen, and the way it got caught was because of the work we did on simulations that identify: here is a disaster that is going to happen if you deploy this code; these downstream calls are going to double. Because the developer was able to see these kinds of problems before they actually happened, they were able to fix their code and then ship it. So a lot of these problems of scale, multi-tenant services and so on, you don't have to worry about when you're
starting out small, but at larger companies you have to worry about things like quotas, and having reasonable load shedding mechanisms is definitely an alternative to having strict quota requirements.

I think the common thread here is that sophisticated tooling has to be there to build products for scale. Can you expand on the tools you mentioned? What kind of tooling did you use, for example, that could catch the bug before it went into production? And maybe something about how Isotope can help in getting there.

That's not the way things started out, so I'm going to talk a little bit about the genesis of how we got there. 2014 is when we had a really big disaster. It was Oscars night, and there's this famous selfie; I wish I had it up here right now for you guys to see, but if you've seen it, you'll remember it. Ellen DeGeneres was about to take a selfie, and then Bradley Cooper, Angelina Jolie, Brad Pitt and all these celebrities dogpile and photobomb the selfie. As soon as the selfie gets posted on Twitter by Ellen, it gets 3.2 million retweets within a few hours. The story that most people don't know or don't remember is that a few hours later, my team (I was part of the core services engineering team) deployed a code change to production in one of our core services, the Tweetypie service, which is the tweet service. This particular code change had a bug in it: if you delete a tweet which is a retweet, it automatically deletes all the tweets connected to it, so if it is a retweet, it will delete the source tweet. A chain reaction happened, and all the 3.2 million tweets disappeared. We were covering our faces in shame; this was a big egg on our faces. So while it's an iconic moment in Twitter's
history that the company, as a business, should be very proud of, it's a humbling reminder to us as engineers that when we screw up at scale, this is what that screw-up looks like. It took us a while to figure out what had gone wrong, fixing the data was a mess, and the business loss that had already happened just could not be recovered. But that prompted the company to take a really hard look at what we were going to do to make sure we would never end up there again. They put together a working group, which I actually volunteered to be part of; I wanted to be a part of it. That's where the tool I was talking about, Diffy, was born. Fast forward a year later, and we had already caught a bunch of bugs. One example was a bug that would have broken things in a very subtle way: if a new user tried to sign up, no matter which username you picked, it would always give you the same error saying this username is already taken, so sorry, Twitter can't create a new account for you. User growth was extremely important for Twitter at that time, as it still is, and that metric was being compromised by this bug, except you wouldn't find out until weeks later that it was broken. That was the kind of bug we automatically caught with a Diffy-based simulation, which helped prevent that kind of disaster from happening. Pretty soon the whole company started using it; the company ended up funding it, and I was asked to lead that group and hire for it. We hired some of Manjot's friends from Google who were excited about it; they said, hey, this is cool, we had these kinds of things at Google, but this is a little bit better. It was a good pat on the back to get Google alumni as well to join their
team. And then we ended up open sourcing it; it got picked up by Airbnb, Mixpanel and a whole bunch of other companies. We've been building on top of that open source foundation with Isotope, where we have more advanced simulation capabilities. We basically eliminate the need for a staging environment to exist: we can instrument your production cluster to automatically capture all the distinct business scenarios that your code is going to experience, and then bring those business scenarios in. And by the way, it takes weeks for some of the long-tail business scenarios to even happen in production, right? So we condense all of this traffic down to a few thousand distinct scenarios that can then be run within a matter of minutes. With Isotope, a simulation is basically running a compressed version of the last two weeks' worth of traffic in two minutes, and that tells you: everything that could happen to your code in production has already happened to it, it's not broken, so it's safe to deploy to production. That's the thesis and premise behind Diffy and Isotope. To talk about another example, this one on the Isotope side with Sn126: one of our clients is an online marketplace, and they always have people trying to game the system with expired coupon codes. If two years ago they were running a campaign with 75 percent off, then people are still trying to use that same 75-percent-off coupon code today, and the correct behavior of the system is that it should be rejected, because right now I'm in the business of making more revenue, not losing money to acquire customers. They had a subtle bug in the code that would have started accepting these expired coupon codes instead of rejecting them. And what would have happened is that none of the observability tools would have been able to
help them, because when we talk about scale, we think about CPU, we think about memory, we think about network, and all of those metrics would have looked perfectly normal: no instances running out of anything, nobody getting PagerDuty alerts. The way they told us they would have found out is when the finance guy ran the numbers four weeks later and came back asking, how come the transaction volume is up but the revenue is down? And by that time they would have already lost north of a million dollars.

And that's a quick overview and summary of how we create value for our customers and the kind of tools that are really required when you're operating at scale, not just in terms of traffic volume, but also scale in terms of the complexity of your business, which is a very different dimension. As Ajay was talking about his business, I couldn't help but think that when your app does so many things, how do you keep track of everything, making sure that release after release, among all the thousands of things you have, nothing is broken? Getting that comfort, getting that peace of mind, as the number of things you're doing just goes up. When we look at even simple microservices with our customers, we're seeing that people have anywhere from a few thousand to tens of thousands of business scenarios per service. That's an insane number of tests for people to write. Nobody writes that many tests; it's humanly impossible. So that's a different dimension of scale that we address with our tools.

That's quite interesting. So I have a question here. When you say you can replay the traffic and identify things going wrong, does it mean you have to have a copy of the entire infrastructure running on the side? 

For us, we don't need a copy of the entire infrastructure.
What we do is intelligently sample traffic from your live production systems. We sample the traffic, make sure that we're getting a very diverse sample set, and constantly update it in real time, so that any time you want to run a simulation on a new version of your service, a new version of your code, that latest sample set is available to you. So it's very traffic focused. You don't need to create expensive infrastructure, because for a unit of code that needs to be tested, we're automatically mocking all the dependencies. Let's say the service that you're trying to test has a dependency on a MySQL database, a MongoDB instance, and a bunch of other services that speak, say, gRPC, Thrift, or HTTP. You don't have to deploy any of those underlying dependencies. We automatically mock the behavior of those services because we know how they behaved in production. So we intercept any outbound traffic that your service needs to create in order to respond to the request it received from upstream, and then we dynamically mock the behavior of those dependencies. We basically reduce your live traffic down to portable tests that can run locally on a developer's laptop without deploying any expensive infrastructure. Your server will start locally, just your service, without talking to anything in the outside world. It will sit inside this simulation where it thinks it's talking to a real database, but it actually isn't, because everything is getting automatically mocked. Very interesting. I think I remember Uber did something similar for their machine learning loads. I don't remember what product it was, but they open sourced it, I think as part of Michelangelo.
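The approach described above (sample production requests, record how dependencies behaved, then replay those requests against a new build with every dependency auto-mocked) can be sketched roughly like this. This is a hypothetical toy illustration, not Isotope's or Diffy's actual implementation; all class, function, and request names here are invented:

```python
# Toy sketch of record-and-replay testing with auto-mocked dependencies.
# Hypothetical illustration only -- not the real Isotope/Diffy code.

class RecordingProxy:
    """Wraps a real dependency in production and records its responses."""
    def __init__(self, real_dependency):
        self.real = real_dependency
        self.recorded = {}  # request -> observed production response

    def call(self, request):
        response = self.real.call(request)
        self.recorded[request] = response
        return response


class MockDependency:
    """Replays recorded responses in a simulation; no real backend needed."""
    def __init__(self, recorded):
        self.recorded = recorded

    def call(self, request):
        if request not in self.recorded:
            raise KeyError(f"no recorded response for {request!r}")
        return self.recorded[request]


def simulate(service, sampled_requests, recorded):
    """Run sampled production requests against a new build of the service,
    with dependency behavior mocked from what was seen in production."""
    mock = MockDependency(recorded)
    return {req: service(req, dependency=mock) for req in sampled_requests}


# --- toy usage: the expired-coupon scenario from the discussion ---
class FakeDB:
    def call(self, request):
        return {"get_coupon:SAVE75": {"discount": 75, "expired": True}}[request]

proxy = RecordingProxy(FakeDB())
proxy.call("get_coupon:SAVE75")  # "production" traffic gets recorded

def checkout_service(code, dependency):
    coupon = dependency.call("get_coupon:" + code)
    # correct behavior: expired coupons must be rejected
    return "rejected" if coupon["expired"] else "accepted"

results = simulate(checkout_service, ["SAVE75"], proxy.recorded)
```

The point of the sketch is only the shape of the technique: the simulation needs no database or staging cluster, because the mock answers from recorded production behavior.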
No, I think it's not part of Michelangelo, it's something else. Sorry, I don't remember exactly what it was. Very interesting. It feels like quite sophisticated tooling for building products at scale. I want to get back to Ajay and expand on the points he mentioned earlier about doing async as much as possible and dropping traffic as soon as possible. Ajay, can you expand on that a bit more: what does it really mean, and how did you do it? Okay, so the thing is, you should think about building dams, with internet traffic coming from one side, and you throttle the traffic at every stage. You say: I'm going to release only 10,000 cubic meters of water, right? That is your first dam, your first proxy. You can put in whatever traffic-shaping proxy you want, HAProxy or whatever. Then an authentication or authorization proxy: a request came through, but is the request itself valid? Does it have a proper authentication token? Is somebody trying to hack using somebody else's valid token? Those kinds of things happen. Then you go towards: is this data read-only or read-write? If it's read-only data, does it exist in cache? So you try to avoid going to the database. So you basically put that thought process in place before using app servers. When your traffic reaches the app servers, it's only when processing absolutely has to happen. That is the first principle of scaling on the front side: bringing the traffic in and dealing with it. The second thing is, there are two thought processes. One is fire and forget. Fire and forget is something like this: hey, I want a driver.
And the response to you is: okay, I will get you a driver and I'll let you know when a driver is found, correct? And then you go back to the app and wait for the driver. A worker will then read the data from the queue (you can increase the number of workers on the queue) and figure out whether a driver is there or not. And the reason you say fire and forget: once you say, okay, I got your request and I'll get you a driver, that means you close the connection. Because we were dealing with around 120 to 350 million requests per second across our infrastructure. If you think about it, 120 million requests is crazy, right? So shut down your network traffic as soon as possible; you have to close the connection. You also have to tweak some of the default sysctl configurations: what your network window is, what your network timeout is, all that stuff. Then you deal with cache, you have this workflow, and eventually, once you find the driver, what do you do? How do I let somebody know that I have a driver? You drop a message again and tell somebody: go notify that customer. So you send a silent push notification, the customer's app wakes up and updates. Or you can do a poll-based thing where you poll every five seconds, instead of waiting for a notification to come. We have tried both. Both have worked; both involve a different thought process. If you do poll-based, you have to make sure the mobile side doesn't poll exactly on the stroke of every five seconds. Otherwise you're getting a million requests every five seconds, because everybody polls at :00, :05, :10, :15, all at once. Instead, you should randomize the interval and distribute the polls accordingly.
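Ajay's fire-and-forget flow plus jittered polling can be sketched as follows. This is a minimal in-memory illustration (a real system would use a durable queue like Kafka and real push notifications), and all names here are invented:

```python
import random
from collections import deque

job_queue = deque()   # stands in for a durable message queue
job_status = {}       # stands in for a status store the app can poll

def request_driver(booking_id):
    """Fire and forget: accept the request, enqueue it, and close the
    connection immediately instead of holding it open during matching."""
    job_queue.append(booking_id)
    job_status[booking_id] = "searching"
    return {"status": 202, "message": "request accepted, we'll notify you"}

def worker():
    """Background worker drains the queue and does the slow matching work;
    scale out by adding more workers reading from the same queue."""
    while job_queue:
        booking_id = job_queue.popleft()
        job_status[booking_id] = "driver_found"  # pretend matching succeeded

def next_poll_delay(base_seconds=5, jitter_seconds=2):
    """Randomize the client's poll interval so a million clients don't all
    hit the backend on the same five-second tick (thundering herd)."""
    return base_seconds + random.uniform(0, jitter_seconds)

resp = request_driver("booking-1")   # connection closed right away
worker()                             # processing happens asynchronously
status = job_status["booking-1"]
```

The key design choice is that the synchronous path does almost nothing: the expensive work happens behind the queue, and the client learns the outcome later, by push or by jittered polling.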
And somebody at the backend level needs to keep track of it: you poll me every third second, like 3, 13, 23; you poll me every fourth second; something like that. So that is one way. The second way, you can use PubNub or some messaging service and all that. So that is the second thing we did. Fire and forget is super important. Everywhere you do something, think about walking into a fast-food place; a good analogy is McDonald's or Burger King. You go to a cashier, place your order, and the cashier says: okay, please wait. Then somebody inside sees it on their screen, prepares this, prepares that, and brings it out against the token on your receipt. Somebody calls out: order number 155, here is your food. That's oversimplifying it, but that's how it is. That's how the architecture looks almost everywhere, and that's how it should be in most cases. Yeah, I couldn't agree more with you. Can I jump in on this? Yes, please. I couldn't agree more. With any kind of distributed system that you want to build, if you can do an actor-based system like this with a message-passing interface, that's obviously the most efficient thing. And if you can do it at the protocol level, you can go all the way down to UDP and optimize packet size; there's nothing more beautiful than that. One of the problems that I ran into was that traffic-wise that's golden, but look at it from the application-layer perspective: you've got a product that you need to build, and you have to think in that distributed manner, where the context is something that the regular application programmer has to figure out how to manage.
Unlike a typical asynchronous library, which implicitly manages that context for you, whether through an underlying persistent connection being multiplexed across tens of thousands of contexts or however it's done, with a message-passing architecture the programming interface for the developer ends up being fairly sophisticated, and it's hard to find programmers who can write code for those kinds of systems. That's one of the reasons why systems like Akka didn't take off, and why Erlang, for all the success it has in the telcos, didn't become mainstream. So these message-passing interfaces, while being extremely efficient at the traffic level, are a trade-off at the application level. They are, and the thing is, when you deal with developers who are encountering this for the first time, and every API of yours returns HTTP 200 OK or 201 and nothing else, they get confused: I got this 200 OK and nothing else, I got a 201 and nothing else, I got nothing as a response. Why doesn't it give me the actual result? And you say: because the request you made succeeded, and they'll call you back. And the callback goes to the webhook you registered with the microservice, if you're within the service; and if it's a mobile client, then whatever push token you passed, they'll call back to that mobile phone token and the app will get it. Getting your head around that is hard. So what I used to do is give this McDonald's example, saying: this is what it is.
You are one of the actors, and you're supposed to do this one piece; on your own you don't get the full picture, and you only get the full picture when you move from the kitchen towards the order counter and see what is happening. So you should sit between these two roles. If you don't work on the frontend, and you don't work on the backend, and you don't sit across the roles, it is super confusing. But that happens. It's very efficient, but you need a very different thought process, a very different paradigm shift in how you write programs and how you write the overall service itself. A lot of the time people write microservices that are actually not microservices; they're a whole monolith. And then I used to get a little angry: look dude, can you rewrite this in like two weeks? No? It takes a month? Then it's not a microservice. You should be able to manage it, you should be able to discard it, you shouldn't have any love for it, and you should be able to implement it better. If you can't do all of that, then it's not a microservice; you're writing a monolith. And if you are writing a monolith, then don't do asynchronous either; just try to increase the performance and the throughput, and we're okay with that. Yeah, I guess distributed monoliths. Yeah, the distributed monolith used to happen a lot; it happens everywhere actually. One of the basic things I have seen: they'll have one monolith, then they'll make 10 copies of that monolith, put them behind a load balancer, and start calling it microservices. It's not. They forget that they can increase the app servers as much as they want; that alone isn't going to get you anywhere. You have to think very differently. Microservices should each have their own databases as well; you can't share a database among microservices. And that's, again, a crazy thing.
The way I saw this happen at Twitter: the kind of microservice architecture we had was one where, if you look at a monolith, it's basically a call stack, right? Any request can be perceived as a call stack, functions calling functions. The thing Twitter did was write a library that made these function calls asynchronous, and that led to layers of division: you take a chunk out, and that becomes your microservice. So what used to be a shared library, called Birdhouse or Birdcage or some such. There were a lot of bird names at Twitter, which is another thing that was wrong with Twitter: everything was a bird name, and people didn't know what anything was. My service was called Gizmoduck, and people would ask: what the hell does Gizmoduck do? It was the user service, and then people would ask me: why isn't it just called the user service? Well, you know, the people who originally named the service... Anyway, that's a detour from what I wanted to say. But this idea that you can take a monolith apart, look at the layers of functions that are being called, and make those function calls themselves asynchronous, that gets you the network separation. Now you're calling across from one server to another, and if you try to optimize that, it leads you down the path of multiplexing and client-side load balancing: you get rid of the load balancer in between, you have service discovery where your client library automatically discovers all the servers that belong to the cluster it needs to target in that namespace, and then it basically round-robins between them, keeping a pool of persistent connections with some of them.
And on that pool of persistent connections, whenever you have calls, it multiplexes those calls, so the protocol stack got a little bit more complex. This was the whole Finagle library that Twitter built and open sourced. All of this optimization was the work that went into trying to maintain the procedural programming paradigm that developers are so familiar with: function A calls function B, which calls function C, and developers are used to thinking that way. Whereas if you remove the top half and bottom half, it's fire and forget, and the callback path is completely up to you: figure out how you're going to deal with that callback and how you're going to retrieve that context yourself. That's a very different paradigm which, at least at Twitter, I didn't see people having the courage to go down the path of. So I want to commend Ajay for having achieved that, and I wish I had the opportunity to see some of that code. Yeah, the thing is, we did not get there just like that; we learned the hard way. But I want to go back to one point you made, and I think it was a very important point: naming the services. While it's not exactly about scale, it is at least part of scale. In the early days we had Stan Marsh and Alice and Wonderland and whatnot. We used to call our primary production environment Wonderland, because Alice was the service that would allocate the drivers: the allocation service was Alice, and it lived in Wonderland. And people like me would get confused all the time: what is Stan Marsh, what is Alice, what is this, what is that?
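The client-side load balancing pattern Puneet just described (service discovery plus round-robin over a pool of persistent connections, in the style of Finagle) can be sketched roughly like this. This is a toy Python illustration, not Finagle's actual API; all class names, service names, and endpoints here are hypothetical:

```python
import itertools

class ServiceRegistry:
    """Toy service discovery: maps a logical service name to live endpoints."""
    def __init__(self):
        self.services = {}

    def register(self, name, endpoint):
        self.services.setdefault(name, []).append(endpoint)

    def resolve(self, name):
        return self.services[name]


class ClientSideBalancer:
    """No load balancer in the middle: the client resolves the cluster
    itself and round-robins requests across its endpoints (a real client
    would also keep a persistent connection per endpoint and multiplex)."""
    def __init__(self, registry, name):
        self.endpoints = registry.resolve(name)
        self._rr = itertools.cycle(self.endpoints)  # round-robin iterator

    def call(self, request):
        endpoint = next(self._rr)  # pick the next server in rotation
        return f"{endpoint} handled {request}"


registry = ServiceRegistry()
for host in ["user-svc-1:9990", "user-svc-2:9990", "user-svc-3:9990"]:
    registry.register("gizmoduck", host)

client = ClientSideBalancer(registry, "gizmoduck")
handled = [client.call(f"req-{i}") for i in range(6)]
```

In the real pattern the registry is something like ZooKeeper or DNS, and the client re-resolves as servers join and leave; the sketch only shows the shape of discovery plus client-side rotation.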
And we killed that very, very early on, saying: okay, we're going to call the user service the user service, the comms service the comms service, the food service the food service. You might want to call the order management service Waiter, but we'll call it OMS; you can put your lovely service name in brackets, but we'll call it food OMS, not Waiter. Or search service, or whatever. We all love naming; humans love naming. But once you're looking at 800 services with various cute names, you lose the context of what the hell is happening. We had 17 products, the largest being food, transport and payments, and each had a lot of services. Given that, we had to get the naming right. So my request to everybody listening: one of the things you should always do is call the service what it does, not what you'd like to name it. Absolutely agree on that. At Google we actually saw a similar setup to what was described. Internally we had a framework, and that's where the load-balancing library also fit in. This framework took away the task of ensuring that all calls are asynchronous; everything, client-side load balancing, client-side throttling, all those things were taken on as responsibilities of the framework rather than responsibilities of every single individual business team, so that teams could make progress as quickly as possible. And eventually we developed a whole SRE platform on top of it. One of its tasks was exactly that: automatically generating names on the basis of configured graph nodes. To give you some sense: like every single product on Earth, Google Photos was a gigantic monolith for the longest time, until we started breaking it down into microservices.
At one point, that one gigantic monolith had been broken down into at least 100 microservices, and it was still a monolith despite being broken into 100 different microservices. And we had to take care of a lot: as an SRE, when you're debugging a service, you have to know exactly what is happening in that particular binary, and also be able to trace down who is responsible. That SRE platform was a very beautiful one that took away a whole lot of guesswork around, A, what the service is doing; B, which team is responsible; and C, what type of core function it is supposed to serve. Most importantly, the mission of that platform was day-zero onboarding by SREs. In the past, every single time we had to onboard a new service, there was a whole two-to-four-month procedure, with a checklist of n number of things to be done. But if a service was built using this framework, the promise was: we'll onboard your service on the same day that you're ready to put it in production. So that really changed the game around managing services. The other tools and features we built around it included, for example, releases, which is a very critical part: automatically releasing services on a daily basis. So yeah, fun times. I think this brings up two broad themes I want to touch upon: one is developer productivity when you work with such complex systems, and the other is failure modes inside complex systems. But before getting to that, I want to contrast building products at scale in a large organization with building products at a startup, and how the process is different. And maybe, what's the right time to think about scale?
I think we touched on experiences earlier, but I'm sure there may be conflicting opinions, and I want to hear from all of you. What does it feel like to build products when you don't have scale, or you're anticipating scale at a later point in time? How do you approach building then, do any of these lessons apply there, or do you have to think differently? I have a very strong opinion on this. Or I just have an opinion, and you can call it strong if it qualifies. I think when you start, always start with a monolith, always start with MVC, the model-view-controller (or model-template-view) kind of way, whatever it is. You will be surprised that monoliths can actually go on for a very, very long time. When Twitter, where Puneet was, moved away from Rails to something else, you might have seen that they were already at a very large scale. A lot of the time people say that Rails doesn't scale or Django doesn't scale; I don't believe that. Second thing: our two popular RDBMSs, MySQL and Postgres, are very resilient and very good at many, many things, until you hit a very large number of records. So you don't have to use anything fancy up to 100,000 or even up to half a million; you don't have to do anything fancy, that is what it is. The next thing you might want to put in is a caching server in between, and a good load balancer; understand load balancers very well.
If you start getting more and more requests, the first thing would be to structure your API so that you can redirect read and write requests to different clusters: there are read clusters and write clusters, and things will just change. Because in a lot of products, around 60 to 80 percent of the traffic is reads, or more. On Twitter a lot of the traffic is reads; on Wikipedia a lot of the traffic is reads; writes are very small compared to consumption. And that is true for almost every product. Look at an e-commerce site, look at anything: you browse a lot, and you place an order maybe once a week, but you're browsing almost four days a week. So: read/write division. Those are two or three things; we can go into more detail. I would say: a simple load balancer, and understand your load balancer, use NGINX or HAProxy or whatever you want. Have a very nice proxy, something standard like NGINX. Use a cache, use Redis. Use your Django or Rails or whatever, just use your MySQL or Postgres database, and you're good. And if you serve good APIs, then you can use any frontend framework and it will just work beautifully for the longest period of time. That's my one piece of advice. I completely agree, and I would like to add one more thing: data partitioning. Thinking about it up front would probably be another good weapon that can take you even longer. Yeah, you can; if you go really crazy, then yes, you can do data partitioning; you can partition by user and you're perfectly fine as well. A lot of the time, when data is time series, you just partition by date or month or something. Data rotation, data archival, all that stuff you will start learning. But my point is, just to start with, you don't have to do any of that stuff for the longest period of time. Just NGINX and Postgres will take you an amazingly long way.
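The read/write split plus cache that Ajay recommends can be sketched with a toy router. In a real deployment NGINX or HAProxy sits in front, Redis is the cache, and the replica is a Postgres read replica; this in-memory sketch (all names invented) only shows the routing logic:

```python
class ReadWriteRouter:
    """Toy read/write splitter: reads go to a cache first, then a read
    replica; writes go to the primary and invalidate the cache entry."""
    def __init__(self):
        self.primary = {}          # stands in for the primary database
        self.replica = {}          # stands in for a read replica
        self.cache = {}            # stands in for Redis/memcached
        self.replica_reads = 0
        self.primary_writes = 0

    def write(self, key, value):
        self.primary_writes += 1
        self.primary[key] = value
        self.replica[key] = value  # pretend replication is instantaneous
        self.cache.pop(key, None)  # invalidate any stale cache entry

    def read(self, key):
        if key in self.cache:      # most traffic is reads: serve from cache
            return self.cache[key]
        self.replica_reads += 1    # cache miss: hit the read replica
        value = self.replica[key]
        self.cache[key] = value    # populate cache for subsequent reads
        return value


router = ReadWriteRouter()
router.write("product:42", {"name": "peanuts", "price": 10})
first = router.read("product:42")   # cache miss, served by the replica
second = router.read("product:42")  # cache hit, replica never touched
```

Since 60 to 80 percent of traffic is reads, this shape keeps the primary database nearly idle: it only sees writes, while the cache and replicas absorb the browsing load.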
Really, really amazing. Thanks. Let's see, do you want to talk about your experience of building at a startup, building when you don't really have scale; how do you approach it? So, I would also advocate for starting with a monolith until you're absolutely sure you have product-market fit, you are experiencing scale, and you start hitting those problems. The product that we were building, Scribe.ml, was essentially an experiment tracker for data scientists. Imagine you're writing in your notebook, and you want to make sure, as a data scientist and as a data science team overall, that all your artifacts (your datasets, your features, your models, your metrics, your graphs and plots, etc.) are tracked in a meaningful way in one place and shareable with your team. My team had experienced some of these things as data scientists themselves, and being able to track those things seamlessly can help your team become a lot more productive; so that was the problem statement. Our architecture: on the client side there was literally a Python package we provided that you would import in your notebook server, and it would send some of this data across to our backend. And then we had a dashboard that would display all of it. A very, very simple architecture and application. And we were using Postgres. We reached a scale of a couple of customers; we were just in the pilot phase, so not gigantic amounts of scale. That said, if you have a team of 30 data scientists, there are so many models and runs and experiment runs that you're already uploading a lot on a daily basis.
So we didn't hit any scaling problems with the backend or the frontend services, or the client library for that matter. I would also really advocate going with a monolith when you're just beginning to build your product and figuring out product-market fit. The interesting dimension of scaling we hit was how to display all of this information to the user. The scaling problem we were solving initially was: now that our backend is processing so many models on a daily basis, even displaying that amount of information to a user is an interesting problem to solve for. But it definitely doesn't have anything to do with scaling our backend infrastructure to handle that load. Great, so I think we're running out of time, so let's pick one more topic, and there are also some questions coming up. Okay, we'll close at 7:45. Before we get to questions, I want to get Puneet's take on building. Sure. Yeah, so I agree with the other two panelists here that the monolith is the way to go, with the exception of our current startup. In previous lives, even at Twitter, when I was working on a stealth project where we were asked to go build a new app from scratch that was going to help us get the next 70 million users out of India, we went with the monolith, and there are a lot of advantages to that. I want to look at this from the development-speed perspective: in the early days, the only thing that matters is agility. The only thing you've got going for yourself is speed, and you cannot give that up, especially if you're trying to move fast towards an opportunity, competing with a bigger guy, or anything like that.
The only thing you've got is your agility and focus, so the monolith is by far the best thing you can do for yourself in terms of agility, because if something goes wrong you can trace through the entire stack trace, all your code is in one place, and you can make a change and deploy very, very quickly. All of those efficiencies are working for you. Never do any kind of distributed architecture because it's cool. There's a lot of tools addiction that happens with engineers: all the big guys are doing microservices, so I, starting a zero-dollar company, should also do microservices. That's really not the way to think about it. Do it when you have to, when you have a good reason to. Even in our case, with SN126, we started with a monolith. We knew that was the fastest way to get going, and when I say we ended up re-architecting within the first six months, that was because we had stuck with the monolith for as long as we could. We diluted the product requirements down to what would fit inside a monolith, even when that meant not looking at all the traffic, basically dropping the quality of coverage we were giving to our customers, because we did not want to give up that agility until we found paying customers. It was only after we found customers who were willing to give us real money that we said: okay, now we have to deliver quality, and for that, in our particular business, we were required to do microservices and distributed systems. That meant we now had to take a hit on how efficiently we can debug our systems, how quickly we can iterate on things, and so on.
But we got used to all of those things soon after. So again, I agree with the other panelists and recommend the monolith as the only place to start, to anyone who's looking to start a new business. I guess that brings me to my other question. Like you said, developer agility is what suffers when you go to a distributed architecture or build for scale. Now I want to turn it around and ask: what do you do to make sure that developer productivity doesn't take a hit because you are building for scale? What techniques, tools or processes do you follow to make sure that developers continue to be productive even when you're building for scale? Who wants to take it? I want one of the other panelists to start with this one, because if I start ranting about this I'm not going to stop. This one is too juicy for me to take. In my view, the way we worked it out was: I wrote a blog post about it, a checklist to reduce what I call software infant mortality. Basically, whenever you deploy, it breaks things, and we don't know what happened. So we created a checklist: these are the things you should check before you deploy. That was the first thing, and the second thing is that we aggressively monitored business metrics. Because what happens with business metrics: your network infrastructure will say everything is fantastic and fine, and yet you see your orders going down. Now, how do you know where the problem is? One thing is very fundamental: the traffic that was there yesterday is there today, so why is processing slow? Why are we processing fewer orders? When you go a little deeper down, you will figure out some service has gone slow, and that is the weakest link.
Even where you have auto-scaling: suppose your user profile service auto-scales, but your database is hitting its IOPS limit, which is quite possible, and then it gets slow. Those kinds of things. So first: checklists. Implement as many checklists as you can, for everything. I'll use the simplest example. One of the rules was: don't delete a column. And if you're adding a column, tell us why you're adding it; are you putting an index on it; if you're putting an index, what is the default value; are you putting it in cache; when are you going to deploy this; and how many rows does that table already have? Suppose you did a migration that added a column to a table with, like, 300 million rows. Nothing is going to move for the next one and a half hours, because the database is just going to be adding the column to all of those rows, with no default values, and then building an index on top of it. So create checklists; I think checklists are very good for everything. Create templates, and monitor your business metrics. These are not very fancy things. I don't want to get into a toolset and rant, because there are a lot of things we could talk about; we could talk about this for hours, I can tell you that. But those are the three things which I've found work anywhere people are just getting into it. Later the tracing comes, the logging comes, and a lot of other things come which can give you more and more signals. What you're looking for is a beacon, a signal in this whole thing; finding a fault in a distributed architecture is like finding a needle in a haystack.
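The 300-million-row migration trap Ajay describes has a standard mitigation: add the column as nullable first (a cheap, metadata-only change on most databases), then backfill the default in small batches so no single statement rewrites the whole table under a lock. A toy sketch of the idea, with an in-memory table standing in for the real one (all names invented):

```python
def add_nullable_column(table, column):
    """Step 1: add the column without a default. On most databases this is
    a metadata-only change: no rows are rewritten, no long table lock."""
    for row in table:
        row.setdefault(column, None)

def backfill_in_batches(table, column, default, batch_size):
    """Step 2: fill in the default in small batches. In real life you
    commit and sleep between batches so normal traffic is never blocked
    for long; the index is built afterwards, ideally concurrently."""
    batches = 0
    for start in range(0, len(table), batch_size):
        for row in table[start:start + batch_size]:
            if row[column] is None:
                row[column] = default
        batches += 1
    return batches

table = [{"id": i} for i in range(10)]   # stand-in for a 300M-row table
add_nullable_column(table, "loyalty_tier")
n_batches = backfill_in_batches(table, "loyalty_tier", "bronze", batch_size=3)
```

This is exactly the kind of rule a deploy checklist encodes: never run a single ALTER that rewrites every row of a huge table during business hours.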
So start with these three things; that will give you the basics, and you'll learn over time. And go read books. There are amazing books about this. Books are nothing but the failures of other people, or the experiences of other people, or the successes of other people, written down so that we don't repeat them. So please read books, read a lot of books. Thanks.

I think we'll come back to reading recommendations once we finish this topic. So, who wants to take it next?

I actually have one last comment. I'm not going to answer that question, because this is one of those topics all of us could go on about for hours, so it's a cliffhanger for Anand and content for the next talk. But I do have a question of my own: how do you measure developer productivity? I think that in itself is a very interesting problem that I have been thinking about.

Yeah. You can't measure developer productivity very well, but one symptom you can measure, if you're disciplined about it, is how much time you are spending on processes and production upkeep. That is one thing that tells you where your systems are. The second thing we did, together with product management, was to log every bug against the story it originated from, and start looking at which team produces the most bugs. I'm not trying to blame developers, please don't react like that, but there are multiple problems behind bugs: the analysts are not writing proper stories, the acceptance criteria and specs are not done well, unit testing is not done well. So there are multiple areas where bugs originate, and you try to fix those problems. There is no fixed way of measuring productivity, but these are things that can improve it.
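The bug-origin tracking described above needs nothing more than tagging each bug with the team whose story it traces back to, and counting. A toy sketch (team names and data are made up):

```python
from collections import Counter

# Each logged bug records which team's story it traces back to.
bugs = [
    {"id": 1, "origin_team": "payments"},
    {"id": 2, "origin_team": "payments"},
    {"id": 3, "origin_team": "search"},
    {"id": 4, "origin_team": "payments"},
]

bugs_by_team = Counter(b["origin_team"] for b in bugs)

# Teams with the most bugs point at process gaps (unclear specs, weak
# acceptance criteria, thin unit tests), not at developers to blame.
for team, count in bugs_by_team.most_common():
    print(team, count)
# payments 3
# search 1
```

In practice this tag would live on the bug tracker's tickets; the analysis stays the same: rank teams by bug origin, then investigate the process, not the people.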
While you can't say productivity is exactly 1 or 1.5, whatever your baseline is, these things will improve it substantially. And then spend a lot of time on platform engineering and platform tooling that lets developers handle repetitive tasks in a much more automated way.

I just want to add a slightly different perspective. There is a plethora of tooling out there, and I think it's a good idea not to dive into it at this point in the discussion. But when you use that plethora of tooling at larger companies, it often leads to the realization that there are a lot of problems. Distributed tracing, for example, will give you insight into the fact that a single request pinballs across 30-40 different services. And the question that comes to mind is: why is this pinballing across 30-40 different services? Is it actually required? Is there redundancy in the architecture? At one point I came to the realization that we had more than 450 microservices at the company, and no single person in the organization had a clue what the bigger picture looked like. That was a terribly scary thought, because I found instances where the same business logic had been written three different times by three different people. How do you make sure no redundancy is being built? Because ultimately, the system you have is a reflection of the organization you have. So distributed systems isn't so much a code or engineering problem; it's more of an organization problem: how do you organize people in a way that you don't lose that central technical leadership?
And this goes a little bit into the people side of things too. As you look at leadership hierarchies, the higher up you go, the more these are people-focused roles: engineering managers are essentially people managers, their leaders are senior engineering managers, then directors, then VPs of engineering, and so forth. The CTO is supposed to be a super technical role, but the CTO does not have a parallel hierarchy of tech leads, roles that own just the technical aspect of the organization. So when you have people leaders leading the organization, it often leads to situations where the distributed-system architecture becomes so incredibly complex that your productivity starts dropping. You end up with a lot of communication overhead: just to understand how to use a service or get a feature shipped, you have to schedule ten meetings with five different stakeholders and then run a quarterly planning exercise, and so on. That's not the kind of agility startups are used to. At a startup, if I have an idea now, I should be able to have it in production within a week, and within two weeks I should be able to kill it if it's not working. With larger organizations you lose that agility. At least, that's what I felt; I'd love to hear comments from the other panelists along these lines.

I would say the converse holds as well, right? I mean...

Yeah, I think we're almost past time. Let me take some of the questions from the audience and then we'll quickly close, okay?
So one question we have here, from Peter Thomas, is: we talked a lot about scaling backend services, but how is it on the front end? People talk about micro frontends and so on. Does anyone want to comment on scaling the front end with a microservices-like architecture?

So basically, what we did comes down to this: in software, design by abstraction is the best approach. Even with microservices, we are designing by abstraction. So whether you call it micro frontends or a monorepo, what matters is that you put an abstraction in there. We created a UI platform, a UI engine, which provides components that people can use to build front-end features. That gave us two kinds of freedom: people working at the network or device level can focus on that, and people at the platform level can focus on designing UI components, so the app looks very similar on both Android and iOS. Then on top of that you reuse those components, so the user gets a consistent experience, because we do a lot of similar things across products. We also developed our own design language. This is only for scale; it is not for the early days. If you're in the early days, please just go native, or use React Native, or whatever you're comfortable with. But once you head toward scale, design by abstraction is one of the principles you should apply everywhere, and it gets you both productivity and speed.
It also gives you a MECE, mutually exclusive and collectively exhaustive, kind of thought process: you don't have a lot of conflicts in the source code, or much conflict anywhere else, and people can still work independently. That's what micro frontends give you: component-based UI development.

Very interesting. I think we could go on telling these stories forever, but we're at the end of our time, so I want to stop with one last question. Okay, so someone is asking: any good recommendations for learning how to scale systems, or even teams?

The SRE books from Google. Please read those two books; they're available free of cost. That's where to start. Anything else?

A Philosophy of Software Design by John Ousterhout. That's a very good book.

Can you type it in the chat, if you don't mind, so people can find it?

I'll just search for the exact title: A Philosophy of Software Design.

Yeah, I'll coordinate with you and post it on the page so everyone gets it. Okay. Great. Anyone else want to add more references?

I mean, I think there are different books for different things, but I would say there's still a lot of white space if people are thinking of writing in depth about scaling machine-learning architectures. I haven't been able to find that perfect book, so that would be an interesting one.

So, we've had an amazing discussion, and I'm sure we could go on talking about interesting success stories, and also about what didn't work.

These are all failures, if you notice. These are all failures; none of them started as a success, the success came afterwards. You basically messed up, then you fixed it, and now you're telling people how you fixed it.
Yeah, actually, I think that is indirectly the best answer to the question I just asked: the best way to learn is to fail. So fail at large-scale companies, that's the best way to learn how to scale, and then come back and start up.

Yeah, nothing teaches you more than battle scars.

Yeah, and one of the things about today's world: iterative development, failing iteratively, is much, much better than trying to go straight for a perfect product. It's okay to ship the product with bugs and fix them later. But there's no silver bullet; there's even a famous paper, "No Silver Bullet", written back in the 1980s, that says exactly that. So fail through iterative development, as much and as fast as you can; better to fail in the first week than in the twelfth.