Hello. Welcome everyone. Welcome to the second episode of the Scaling with First Principles series. Today we have Kailash Nadh, CTO of Zerodha, with us to discuss scaling with common sense. Zerodha, as many of you already know, is India's largest stock brokerage. That means a big percentage of the transactions on Indian stock exchanges goes through Zerodha. That's real scale. It would be interesting to know what kind of technology and processes go into running such a system. Let's welcome Kailash Nadh. Welcome, Kailash.

Thanks, Anand.

Yeah, so I'm really glad to have you on the series. I'm quite curious to know what kind of tech goes on behind Zerodha. What you've been able to achieve with such a small team is quite remarkable, and I'd love to know the inside story of that. So, for the benefit of everyone, can you start by telling us what Zerodha is?

Yeah, Zerodha is a stockbroking firm. We started out as a discount stockbroker. Then, some seven years ago, we started building technology for everyone to invest and trade in the markets. And over the last seven years, we've transformed from just being a pricing-based discount stockbroker into an investment technology platform. So we offer investment and trading services to everyone. Anybody can sign up, download our apps or use our web apps, and just start investing and trading. So yeah, that's what we are. And today we're also the largest stockbroker in India in terms of user base.

So can you tell me about the kind of scale that you work with, to give us an idea: the number of transactions that go through, the amount of money that gets exchanged, and so on?

Yeah, so I think we're approaching a client base of four million users, but what matters to us is concurrency - the markets are all about concurrency. On peak days, we have more than a million people who log in and transact on our platforms. I think on our biggest recent peak day we clocked between eight and nine million transactions from end users, and on an average day we do six to seven million transactions. And a key facet of trading in the markets is consuming market data - prices and ticks, live, via our platforms. So when users log on to our trading platforms and track market data, we're streaming out ticks. Just to give you a rough idea of the scale at which we stream out market ticks: at any given second, we are streaming out probably 30 to 40 million packets of price ticks that flash to hundreds of thousands of concurrent users.

That sounds like pretty great scale. So can you tell us what the tech stack is like, to set the context of what we're dealing with?

The first versions of the trading platform were built in Python, because we wanted to build it quickly, but we've since migrated most of our user-facing systems to Go. We really like Go's concurrency primitives and we like writing Go. Kite, our trading platform, and all our systems that handle heavy load and traffic, are written in Go. We use Redis heavily as an in-memory database for low-latency reads and writes. And for non-critical, passive storage, we have Postgres databases - there are many instances; some are a few GBs in size, some are several terabytes in size.
So: Postgres, Redis and Go. And for non-critical systems, or systems that don't really have to handle a lot of traffic, we still heavily use Python. Most of our back-end number-crunching systems, the back office systems, are still in Python.

Yeah. So, how do you host this? Do you manage your own servers, or do you go with the cloud? How does that work?

We self-host all of these systems. We use AWS for hosting these applications, and we also have a presence in physical data centers. You have to have those in this industry, you know - terminating physical leased lines, etc. - it's the norm in the stockbroking industry. But the applications I just described, our databases, etc., are hosted on AWS. And all these instances, we self-host and manage ourselves - we run our own Redis instances on bare EC2 instances. We don't really use managed database services, or managed services at all. Most of the big, critical applications are on raw instances that we self-manage. For many smaller applications, we recently started using Kubernetes and Docker, just for the ease of deployment and to standardize the deployment process - not necessarily for scale.

Yeah. I think you also use ERPNext pretty heavily to build a lot of your stuff, right?

Yeah, yeah. So a stockbroking firm - for that matter, any financial firm - will have a ridiculously large number of backend processes, you know: accounts, ledgers, transactions, all sorts of complex financial, legal and regulatory things that you have to do. And you'll have lots of different departments, different people with different roles and permissions, who need to generate reports, who need to run tons of processes. So the bulk of our back office systems - the crux; in fact the system is called Crux, we actually call it Crux because it's the crux of the business - is built on top of Frappe / ERPNext.

That gives us context about the kind of technology we're dealing with. Okay, so now I want to ask you a broad, general question: what's your approach to building software, in the context of scale? It's a quite generic question, but I feel you have a very different perspective from many people in the industry, so I'd like to hear about that.

Personally, I think I have a very hacker, tinkerer sort of perspective. No matter what you're building, you start out with the smallest unit, the simplest unit - it doesn't matter what you're building, and it doesn't matter at what scale it has to operate. You focus on that problem and pick the right tools to build it. It could be Python, it could be Go, it could be Java, it could be anything - you don't start by saying "I am going to only use Java or Node.js" or whatever. So the right tool for the right job is definitely a big principle, and starting small, with small units, is another. You can't foresee infrastructural or scaling requirements on day one of any business, especially a complex financial business. And scale you can't even quantify, because it could be the number of users, it could be the number of transactions, it could be a few users and massive amounts of transactions generated by those users - so you can't really quantify and foresee those things. So you don't start out by saying, "I'm going to create X number of servers using Y technology and create this sort of infrastructure on day one."
So you start out really small - it could just be one server instance with one process running something - and you build it up from there. Once you have that sort of framework running in a business environment, let's say a stockbroker, the next time, or the tenth or twelfth time you start a project, you will have a pretty decent idea of what stack you have to use, because 90% of it is probably similar to the rest of the projects you've done. So you always start out small and simple.

I think one of the things you've mentioned, in one of your blog posts, is the "keep it simple" principle - the UNIX philosophy, right? Can you expand on that - how do you apply it to software? You also keep saying "use boring technology", so I think you've always started with proven technologies rather than going with the cutting edge. Can you expand on your views on that, and how it played a role in actually building your stack?

Yeah, it's a trade-off. It's a little difficult to describe the principle one must use when picking technologies. Like I said, boring technologies - boring is a synonym for battle-tested. So when you pick a battle-tested technology, as opposed to something that was released just a week ago, the benefits are obvious: you get years' worth of battle-testedness to back up your choice. That doesn't mean you only use boring technologies - you also have to use bleeding-edge technologies sometimes, and that's where you take that bet, that's where you make that trade-off. We've used some bleeding-edge technologies, and some of those have proven to be great choices. For example, we decided to ditch our mobile applications and rewrite them in Flutter when Flutter was alpha, or even pre-alpha. But that doesn't mean, you know, we just read Flutter's marketing page and decided to pick Flutter. We in fact built a prototype over several weeks, built up confidence, and then decided to use that bleeding-edge technology. So it basically comes down, again, to the right tool for the right job: battle-tested wins 90% of the time, and 10% of the time you make an informed bet - a trade-off - on an experimental or bleeding-edge piece of technology. But staying away from hype, definitely, for sure - you never read a technology's marketing page and just decide to use it. You have to build, experiment, prototype, whatever, and build objective confidence before picking it. That's the rule we apply.

So there are a couple of questions coming in. I'll take the ones relevant to the discussion now and take the remaining ones later. One of the questions is: what about regulatory compliance, and how do you manage that? It's an ever-changing industry, right, with a lot of regulations. How do you keep up with that, and what's your approach?

Oh, I think "ever-changing" is putting it lightly. In the last 12 to 16 months there have been so many massive regulatory changes - how the industry has worked for 10-15 years has literally been rewritten, with deadlines of, let's say, two or three weeks. So there's no playbook for dealing with crazy regulatory changes. But if you have a good enough software architecture, even inside an entity as complex as a stockbroker, it gives you agility - and that also means regulatory agility.
So we've, like I said, kept everything really simple, even complex units within the business, with different departments, etc. The systems interact via, you know, unified APIs, and there's no hacky memory sharing or database sharing, for instance - there are no undocumented connections between APIs. So these units - I don't like calling them microservices, because they're not microservices; some of them are really large services - have been architected in a way that they don't get in each other's way. Just common-sense architecture. So when your software architecture is malleable like that, with enough moving pieces that are all independent of each other, that gives you a lot of regulatory agility. And I would say that, thanks to our simple software architecture, we've been able to cope with huge regulatory changes in really small amounts of time, without actually having to go rewrite everything that we've done.

So one of the things people talk about in the software world is "move fast and break things", right? I want to get your views on that. Okay, I know that you won't agree, but I want to hear your side of the story.

Yeah. "Move fast and break things" - I mean, that phrase has been abused so much, and used to justify so many random things, that it's lost its meaning. I would say: move fast enough, and you shouldn't just outright fail - failure is a part of the game. Move fast enough that you don't really fall flat on your face, because you can't really, you know, 100% predict success or failure. Like I said, we are very agile, thanks to our systems - regulatory changes, etc. can completely change how the business works - but despite all of that, and despite the number of features we release, internally we are extremely conservative. We may actually dwell on a feature for two months, three months or six months. We may dwell on one tiny change up until we are confident that it can be pushed out. So that's not exactly, you know, "move fast and fail", and you can't really do that in a critical environment like financial transactions. Ironically, because we are so cautious and conservative, and we've taken the time to design these systems in whatever ways we've done them, that has in fact made us faster. We don't really have to move at breakneck speed to innovate - it happens at a reasonably fast pace, despite us being extremely conservative. So yeah, I don't like that phrase; it's again a trade-off.

So, I mean, I think you're saying you attribute that to your architecture and how it's grown up.

For sure. And the way our team has evolved - the culture of the team.

So, can you say more about that? There's a question from Abhishek: what does keeping architecture simple really mean? Can you expand on that? What do you mean by keeping architecture simple?

It's very broad. There are many complicated parts of a stockbroking system, but just to give you maybe one example. Earlier I said that we have hundreds of thousands of concurrent users who are connected to our systems using WebSockets to receive live streaming market data. Now, you might imagine it to be a system with hundreds, if not thousands, of instances - compute nodes - accepting and processing these connections and streaming out data.
But that would be a really complex distributed architecture, and in fact that's not what we've done. We just have, I think, a handful - a dozen or so EC2 instances on AWS - that come up, and each instance has just one process, one binary, written in Go, that runs. That one process, with no dependencies, just starts up, connects to a NATS feed - it's a pub-sub system - and starts receiving market ticks. That one process can handle up to, let's say, 50,000 WebSocket connections coming from our end users and stream data to them. So it's literally one process on one EC2 instance, with zero dependencies and zero distributed architecture, handling 50,000 concurrent users and streaming market data to them. If you want to handle hundreds of thousands, you just add one more instance and run one more copy of the process - or you can just have one big server, let's say a 16 or 20 core EC2 instance, and run two or three copies of this. Even to handle massive concurrency and massive low-latency tick streaming - and we broadcast ticks to every single user twice or thrice every second - you just need literally one binary that runs on a bare EC2 Linux instance. You don't need a complicated distributed mesh architecture or anything; it's just one process. And you can apply this to many of our services internally: they're just independent, single processes that come up and do the job.

Yeah, that's great. So one point that you mention in your blog posts is that in most of these applications you typically build, the bottleneck is always the database, right? I'm sure you must be handling a lot of data and databases and serving them through your APIs. So what's your approach to that? What are the practices you follow when building applications with databases?

You start out with the problem at hand. So, Console is the name of our back office platform - that's where you go and fetch reports, etc., and these reports can go back several years. The database underlying Console is sharded, and it has terabytes of data, billions and billions of financial records. When you pull a report, it might take one second, or two, or three, or even four seconds, but that's a fair trade-off. The leeway we have is that it's okay for a report on Console to take three or four seconds to fetch. On Kite, everything happens in about 50 milliseconds. Kite is our low-latency trading platform, right - everything there has to be instant. When a user places an order, the order request should go through instantly - that's the 50 milliseconds - it cannot wait for n seconds. So when we build applications, there are two categories: low-latency applications, and applications where latency is acceptable and tolerable. In low-latency applications, we make sure that we reduce the database IO - let's say Postgres database IO - as much as possible. When you place an order, when you click on a button to view your portfolio, etc., we're not really hitting a Postgres database at all, because with an RDBMS, or any sort of distributed database, the latency will be unpredictable: it could be two seconds, it could be 100 milliseconds.
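[Editor's note: to make the streaming setup described above a little more concrete, here is a minimal sketch of a single Go process that subscribes to a NATS subject and fans ticks out to WebSocket clients. It is an illustration under assumptions, not Zerodha's actual code: the gorilla/websocket and nats.go libraries, the "ticks" subject name, the port, and the lack of cleanup for dropped clients are all choices made only for the example.]

```go
// One process, one binary: subscribe to a NATS feed of market ticks and
// broadcast every message to all connected WebSocket clients.
package main

import (
	"log"
	"net/http"
	"sync"

	"github.com/gorilla/websocket"
	"github.com/nats-io/nats.go"
)

var (
	upgrader = websocket.Upgrader{CheckOrigin: func(*http.Request) bool { return true }}
	mu       sync.RWMutex
	clients  = map[*websocket.Conn]bool{} // connected subscribers
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // e.g. nats://127.0.0.1:4222
	if err != nil {
		log.Fatal(err)
	}

	// Every tick published on the "ticks" subject is pushed to all clients.
	// NATS delivers messages for a subscription serially, so writes to each
	// connection are not concurrent here.
	nc.Subscribe("ticks", func(m *nats.Msg) {
		mu.RLock()
		defer mu.RUnlock()
		for c := range clients {
			c.WriteMessage(websocket.BinaryMessage, m.Data) // errors and dead clients ignored in this sketch
		}
	})

	// Each end user opens a WebSocket; the connection is registered for broadcasts.
	http.HandleFunc("/ws", func(w http.ResponseWriter, r *http.Request) {
		c, err := upgrader.Upgrade(w, r, nil)
		if err != nil {
			return
		}
		mu.Lock()
		clients[c] = true
		mu.Unlock()
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Scaling this shape of service is then, as described, a matter of running more copies of the same binary on more cores or more instances.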
So you just avoid the database as much as possible on those paths - that's where Redis shines. Redis's latency, its performance, is deterministic, very predictable: it'll always give you a result in, let's say, one millisecond. So in low-latency applications, in all our hot paths, we use Redis for storing stuff; for long-term storage, we might move that data, in parallel, to Postgres. So that's what we do: just avoid RDBMS database IO - anything that is disk-based, anything that commits or flushes to disk - completely, on low-latency hot paths.

And - you're on mute. There are a lot of questions from participants about how you scale Postgres, and what data sizes you operate at. Can you tell us more about the typical database sizes you work with? And have you tried any distributed databases - if not, why not?

Like I said earlier, we start out small. A few months ago we sharded our largest Postgres database - I think it was several terabytes; I can't really remember if it was two or four or five or six terabytes, but definitely several. And that was a single Postgres node. Of course it had a backup and all of that, but it was a single Postgres node with several billion rows - tens of billions of rows, I think 100+ billion rows for some of our financial breakdown data. And that was a single node. Now, if a single Postgres node with terabytes of data and 100 billion records works just fine, why even look at a distributed database, right? So when we felt that maybe we were approaching the peak of what a single node could handle, we sharded it. And we already had logical partitions within it. That's our Console, our back office. So it's very organic. We wouldn't just go pick a distributed database because we're anticipating 100 billion rows, you know, ten years from now. You push every system to its max, and as you feel you may be approaching its threshold, you try to push it more - that's sharding, or whatever, making it distributed. If you feel it might not scale, that's when you look at a completely different solution. But with battle-tested systems like Postgres, which have been familiar to us over the last 20 years, you can very confidently keep pushing. So yeah: hundreds of billions of rows, terabytes of data, even on single nodes, simple architecture.

Wouldn't that cause a single point of failure? There are some questions again about how you deal with availability. You have backups and replicas, of course. Have you ever faced a situation where a database node went down and you had to switch over to a replica?

Yeah, that happens. It can happen. It's very rare, and it's completely unpredictable, but you have to have hot failovers. We've had, for mysterious reasons, our main Redis nodes on, let's say, Kite, the trading platform, become inaccessible - but within the next couple of seconds... Redis has a really simple thing called Redis Sentinel. It's a little agent that runs between Redis nodes, and it will detect failures and promote a replica to master. So those things kick in automatically. And the whole point of a hot failover is that you shouldn't even get to know - it should just happen, you shouldn't have to be involved. And then again, like I said, it's a trade-off.
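[Editor's note: as an aside, here is a minimal sketch of what a Redis hot path behind Sentinel can look like from the application's side in Go, assuming the go-redis client; the master name, Sentinel addresses, key and value are illustrative assumptions, not Zerodha's code.]

```go
// A hot-path read/write that goes only to Redis. The failover client asks
// Sentinel who the current master is and reconnects automatically if Sentinel
// promotes a replica, so a failover stays transparent to the application.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",
		SentinelAddrs: []string{"127.0.0.1:26379", "127.0.0.1:26380", "127.0.0.1:26381"},
	})

	// Write and read an order snapshot with predictable ~millisecond latency.
	// Long-term persistence would happen separately, e.g. a background job
	// moving data into Postgres in parallel.
	if err := rdb.Set(ctx, "order:12345", `{"status":"OPEN"}`, 24*time.Hour).Err(); err != nil {
		panic(err)
	}
	val, err := rdb.Get(ctx, "order:12345").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("order:", val)
}
```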
If you want hot replicas and instant failovers, there may be additional complexity, and for a trading platform that is absolutely necessary - it should just switch over. But take the back office platform, which people only really use at the end of the day to look up their reports: if it's not available for a minute while something is getting switched over, that is actually a fair trade-off. It doesn't really affect anyone. I'm talking about a specific kind of report, for example - a passive report, something that is not critical in nature. It's okay for that to fail over over several seconds or a minute. So you don't apply the same yardstick for failovers and replicas everywhere; you do it based on the problem at hand.

So do you see yourself reaching the threshold of what you can handle with Postgres yet?

I mean, at 100 billion rows it works just fine. But we sharded recently, because we just felt that we'd been running this for a long time. It was working just fine, but we thought maybe we should just shard, and we sharded. Of course there were backups and replicas, but I'm only talking about sharding.

So now you may have, like, two or four nodes?

I think two nodes.

Two nodes, okay. So you still see a long path ahead before you have to worry about it not being enough, right?

Absolutely. I mean, even this was arguably premature optimization. That's why battle-tested technologies are beautiful, right? The whole world has been using Postgres for two decades; we know how much it can be pushed. So you don't have to think on day one, "I might get billions of rows, let me just go get Cassandra" or whatever. These things work. Battle-tested technologies work well.

Yeah. So there are a lot of questions pouring in. I'll take questions as and when they're relevant to the current topic. Everyone, please bear with me - I'm going to go over all of them, but only as they become relevant to the discussion. So, one thing a lot of people are asking: you talked about how you're self-hosting all of it, right? So why do you self-host? I mean, isn't it better to just take a managed service? The general wisdom in the startup world is that you want to move fast, you want to spend your energy on actually building the product and use all the services around, rather than building everything yourself. But you seem to be doing it completely the other way. Can you tell us the idea behind that?

There are multiple reasons. There are business reasons, and there are personal reasons - from my perspective, and from ours as a team, given the kind of decisions we've taken. If you're running a service and your business depends on it - imagine a stockbroker's database, right - I think it's important to have 100% control over every aspect of it. In a managed scenario, you might really not get all the control that you need. The kind of R&D that we've done on our Postgres instances, the kind of experiments we've been able to do because we had access to the underlying file system, for example, is immense. There are these trade processes, etc. that happen at the end of the day, when we get massive data dumps from stock exchanges.
Those things, seven years ago, before we had our own tech stack - when we had a thousandth of the data that we have today - would take eight hours on external, vendor-based systems. Over the years we have indeed pushed the databases to their limits and figured out lots of tricks, all because we have complete, 100% control over the systems, to optimize them to a point where what took eight hours a decade ago now takes ten minutes. So I think it's important for a business to have 100% control over its stack and to completely own it.

There's also, personally - I'm very wary of vendor lock-in. Whether it's my personal files or the data that the business holds, any hint of vendor lock-in - I'm quite paranoid about it. And today the cost wouldn't matter, but at some point the cost also mattered quite a bit. Zerodha is bootstrapped - we've never raised funding, we're a really lean company - and one of the reasons we were able to scale the way we have, not just the technology but everything around it, is that our infra costs have been really low. Like I said, today it's not a factor, but five or six years ago, when we started pushing these apps out, every single dollar mattered. So there was the aspect of cost, there's the aspect of vendor lock-in, there's the aspect of having 100% control of your data within your stack, and 100% control of the infra you manage, so that you can push it to the next level. And it's also about visibility: if you want to scale something like Postgres on a single node, you need to have full visibility into its innards. A managed service - it could be anything, I'm not just talking about Postgres - only gives you high-level abstractions running on somebody else's infra. So it's about that confidence also. I'm personally more confident about these systems, and about the extent to which they can scale, when I have full visibility, and I think all of us in the team operate like that. Our stack decisions are all based on these principles.

So that probably means it's a complete no to any SaaS service, right?

Largely. Pretty much everything inside the tech stack - absolutely everything - we self-host and manage.

Yeah. I guess that also comes at a cost, right? Everything, you have to invest in yourself. I'm sure there's a long-term benefit to doing that, but in the short term, is that a price you're consciously ready to pay?

I think it's a misconception that if you self-host something, the cost - not just in terms of money, but in terms of expertise required, efficiency, energy, etc. - has to be massive. That's clearly not the case, as we've proven in our own case. I think there's this irrational fear of the cost you have to pay to self-host, and to a large extent a lot of SaaS and cloud companies have perpetuated that. I don't think it's true.
So how can you expect to build a solid tech stack that can achieve massive scale at lean cost if you're not ready to understand a core system of your stack - for example, how a Postgres database works - if you're not willing to spend time and energy on Postgres, or on running your own queue or whatever it is? How is it possible to just skip the fundamentals and go to market faster - it maybe saves you one to a couple of months. The thing is, ironically, if you do that and you end up building everything on managed services, the cost in terms of money that you'll end up paying at some point, when you actually hit whatever scale, will be far greater than the few months you would have saved by not learning these things. And more importantly, I think it's very important for the developers who build these systems to understand everything about them. You should understand how the underlying systems work, how all the components work, how the database works, how latency affects these things. You can't really be a good developer if you just offload the bulk of your work to some, you know, SaaS API.

So there are a lot of questions about how you manage Postgres - what kind of replication you use, what kind of file system you use, and so on. There are many questions; can you talk about what the Postgres setup is like?

So - Satya, if you're watching this, we have to write a blog post. The Console team - all our teams internally are three to four members max - does all the R&D and management of the Postgres databases, and a blog post is long due. The bulk of the data that we store in the Postgres databases isn't really real-time, and that's actually a big boon. As I said, the back office processes in this industry are all end-of-day. When the market closes, we get massive file dumps from all the exchanges, depositories, etc. - rows and rows of trades, positions, all sorts of things. What we're supposed to do is take all of these files, collate them, run them through many different trade processes, figure out accounting, ledgers, etc., generate the results, and bulk-load them into databases - Postgres or whatever. Some of the Frappe instances that we use also have huge MySQL databases underneath them, by the way, because Frappe by default uses MySQL - so we have that too. So all of our Postgres tuning is organized around time-stamped, end-of-day bulk loads of data. When I say time-stamped, I mean by date: everything that happens in the stock markets is stamped with a particular day - the trades that happened today, today's accounts, today's end-of-day balances, today's positions, etc. After lots and lots of trial and error, experimentation, etc., we figured out that logically partitioning these massive bulk loads works better when they're partitioned by month. We had also, I think, experimented with partitioning by date, but then you end up with too many partitions. So we hit the sweet spot for our back-office financial databases with logical partitions done by month. That helps immensely - that's the right trade-off between reads and writes for us. But you can already see that logically partitioning massive databases by month will not be relevant to most use cases out there, most business use cases. So yeah, there's literally no formula.
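[Editor's note: for illustration only, monthly range partitioning of the kind described might be set up like the sketch below, with the end-of-day dump then bulk-loaded into the parent table. The schema, table names, partition key and connection string are assumptions for the example, not the actual Console schema.]

```go
// Create a date-stamped table declaratively partitioned by month, plus the
// partition for an upcoming month, before an end-of-day bulk load lands in it.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/console?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Parent table, partitioned on the business date every EOD record carries.
	_, err = db.Exec(`
		CREATE TABLE IF NOT EXISTS trades (
			trade_date date   NOT NULL,
			client_id  text   NOT NULL,
			symbol     text   NOT NULL,
			qty        bigint,
			price      numeric
		) PARTITION BY RANGE (trade_date)`)
	if err != nil {
		log.Fatal(err)
	}

	// One partition per month keeps the partition count manageable while still
	// letting report queries prune down to the months they actually touch.
	_, err = db.Exec(`
		CREATE TABLE IF NOT EXISTS trades_2020_09 PARTITION OF trades
		FOR VALUES FROM ('2020-09-01') TO ('2020-10-01')`)
	if err != nil {
		log.Fatal(err)
	}
	// The end-of-day file dump would then be bulk-loaded (e.g. with COPY) into
	// the parent table and routed to the right partition automatically.
}
```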
You just have to figure out the right trade-off for your business and the kind of data you're storing. But yeah, logical partitioning with Postgres is one of the biggest growth hacks we've done for our financial databases.

What file system do you use - ext4, or ZFS, or something fancy?

Just ext4, yeah. Nothing fancy.

Yeah. And what about replication - how do you do replication?

For instance, for Kite, we have a master and we have a hot replica, and it's just standard Postgres master-slave replication. And we've experimented with PgBouncer - in fact, we still use PgBouncer; it's a little proxy sort of system. So if you have multiple replicas and you want to redirect some read queries to the replicas to offload the database, you can just use PgBouncer. So our setup is, again, really simple: master-slave replication, with PgBouncer in places for splitting loads.

So is it synchronous or asynchronous replication?

Async replication. As I said, for us, our biggest databases are all end-of-day, so we don't really have to worry about real-time, low-latency syncing. So that's the use case.

Got it. Okay. So now - one of the things underlying how you build software is also how you build the teams. I think that's also a very important part of how your tech stack can evolve into the shape it's in, right? So tell me about your team: the team size, how many people, how do you hire, and how do you find the right kind of people?

I think it reflects how we build our systems - starting from a small unit is exactly how we ended up building our team. When the technology team was born in 2013, it was just me. There was one tech person, and suddenly we had this thing called Zerodha Technology. In the initial days there were no grand plans or goals; we didn't even have plans to build a trading platform. I would just go sit with everyone in the company - sales folks, support folks, risk management folks, etc. - sit with them at their terminals, understand their problems, and build little Python automations to fix them. File downloads from the exchanges that would take someone an hour to process manually, a Python script could just download in 20 seconds. So that's how it started, completely organically, and I kept on doing that. And after a couple of months, a few months, I figured that there were too many things to solve - all organic problems that we discovered - and we needed one more person. Then we hired our first engineer after me; now the tech team was two people. The same process continued: after a few months we had more things to fix, more little projects to do, and we said we need to add more hands, and the third engineer came in. So as the problem space expanded, the team also expanded, very slowly. And how we hired - most of our hires have been from HasGeek's job board. So there's that. We'd just post, and we'd speak to people, and the criteria were really simple: hobbyist programmers with some cool little hobby projects. Educational qualifications didn't matter, work experience didn't matter. In fact, 90% of the people in the team have had no prior industry work experience. So the team grew and evolved very organically. We got our first designer, I think, in late 2016 or early 2017 - so on day one we didn't say that we need a designer.
So when we felt it was time to have someone look at design, focus on design full time, we got a designer. Then, I think a year and a half later, when we felt we needed a pair of extra hands to do design, we got another designer. And everyone who came in - they're all hobbyists. So that's how the team has grown, completely organically. We never planned on day one that we need a DevOps team, we need a deployment team, we need a database team, we need an engineering team, a product management team - never had that. So today there are 30 of us: we have two designers, everybody else is a full-stack developer, and two people focus on DevOps. And that's recent - the DevOps team was formed, I think, several months ago, late last year or earlier this year. Karan, who's a full-stack developer, one day just stood up and said he wanted to do DevOps, and suddenly we had a DevOps team. But even these little departments are organic focuses: when somebody felt it was time to focus on something called DevOps - you know, figure out deployments, make a system around it - we suddenly had a DevOps team. So the whole team has evolved very organically, and the only criterion has been, you know, full-stack hobbyist developers.

And what's the team size?

Yeah, sorry - it's 30 of us. That includes two designers and two people who focus on DevOps, but apart from the two designers, everybody else is a full-stack developer. Oh, there's actually one person, Shantanu, who's not a developer - he joined us last year from another department, and he helps us deal with all the communication that happens with the rest of the company. Zerodha is 1,100 people strong, so Zerodha is a large company, but the tech team is 30. So we need a pair of full-time hands just to deal with the communication with other departments.

So a lot of questions are pouring in - I don't think I'll do justice to all of them, but let me try to ask some. One is: how do you manage your logs and monitoring at scale?

So Karan, in fact, has written a really good blog post on our entire monitoring setup. We use Prometheus, we use Grafana, and we have different kinds of agents and exporters embedded into all sorts of things. They extract real-time metrics from Postgres, Redis, our processes, HTTP requests, AWS metrics, and everything is aggregated on Grafana dashboards - we have lots and lots of dashboards on the Grafana instances. It's all backed by a VictoriaMetrics DB - it's a time-series database. I think it has billions of records; again, even a single node can scale really well. Lots of alerts and alarms are set on that Prometheus plus VictoriaMetrics plus Grafana setup. So that works really well - we get good insights into all corners of the system, and not just system metrics but app-level, semantic metrics too. We use Metabase for aggregating higher-level business metrics and numbers as well. Sorry, there was something else that you asked - I missed that.

Logs.

Oh yeah, right. So we have Kibana. We are required by law to store certain kinds of logs for up to three years, so we have different indexes with different retention policies for that. And we use Kibana to aggregate logs from many different sources.

And what kind of scale is that at? I'm sure it must be quite atypical.
Yeah, the underlying storage is several terabytes in size, and the ELK stack is reasonably slow for really large queries, but that's a fair trade-off for passive logging.

Got it. So let me see if I have more questions. Yeah, another question a lot of people are asking: what are some failures that you've handled, and what are the lessons learned?

There are countless tiny failures that we've handled, and thankfully, with our real-time trading platforms, we haven't really had big internal failures. We had significant issues in 2018 and '19. In fact, today we had an issue where our ticks stopped for 60 seconds. That's because a point-to-point leased line abruptly died; it's supposed to automatically fail over, and it has in the past, but you know how technology is - it didn't, and then it came back up. So it was 60 seconds' worth of a drop in market feeds because one of the lines just died, went blank. So there are lots of these little lessons, just like in any complex technology stack. But one of the biggest lessons we've learned being in this industry is about building scale on top of institutional infrastructure that is not built for scale. The exchanges, the depositories, all these institutions on top of which we've built our scale - they were never really built for 2020 levels of scale, you know, millions of trades a day, etc. So it's been very painful, the journey of building the scale that we have in our stack on top of these limited things. For example, take an exchange leased line that you have to pull: up until two years ago, the max capacity of a leased line, I think, was 400 messages per second - which is, let's say, 400 orders that you can push to the exchange per second - and each leased line takes months to commission; there's bureaucracy, there's all of that. So you apply for a line to process 400 more orders per second, and it takes n number of months for it to be commissioned. So how do you scale on top of that? In 2018-19 we had tons of those institutional issues. Of course, it's still Zerodha's issue as an entity, but unfortunately most of these things were completely outside our stack. So that was a really painful experience: when something goes wrong and you know where it's gone wrong, but there's nothing you can do about it, because there are inherent institutional limitations. How do you build on that? How do you patch that, or build hacks and scale on top of that? Those lessons have been extremely valuable and painful. The exchanges recently upped the limit to 1,000 messages per second, but if you think about it, that's really nothing: one physical leased line that takes months to commission can do 1,000 messages per second. Imagine two months from now our client base suddenly doubles and we're supposed to process 2x the orders, but the exchanges don't even have the capacity to accept so many orders. How do you deal with that? So you always have to over-provision. So yeah, these are the biggest pain points we've had, and the biggest failures have all originated from things outside of our control. You just have to live with that, building stuff on somebody else's stack.

I think my question was more about the things you've done architecture-wise that didn't work out as you expected, where you had to roll back or do something else.

I think you could have interrupted me.
I went on a rant for like three minutes and I wasn't even answering it.

That's quite valuable, but I also wanted to look at the other side of it.

Yeah, so the first version of Kite, when we launched it in late 2015 or early 2016, was written in Python - a CherryPy server, an HTTP server - and we used AngularJS as the front end. You can scale Python, but because of its design, because of the heavy runtime, it's not really meant for super low-latency, high-throughput services. One single Go service today that we have can handle 100,000 HTTP requests per second - no dependencies, nothing; you just run that binary and it works. But you can't do that in Python: you have to have workers, each worker consumes resources to hold the whole Python VM - and we tried, we tried uWSGI, we tried Gunicorn. But thankfully we only had a thousandth of the scale we have today. So we figured that Python - or pretty much any interpreted language with a reasonably heavy VM - was not ideal for building a trading platform. Or, if you had to do it, you would have had to expend a lot of resources on hardware to make it scale. So that was an extremely valuable lesson, and switching to Go has in fact been very pivotal for us. I was experimenting with Go in a personal capacity around 2014 - and I don't think Go was very big in early 2014 - so we started porting small components that we'd written in Python to Go, and eventually scaled it up from there. That has been an extremely valuable lesson. But other technology choices, like Redis and Postgres, or even MySQL in some places - they've never really failed us. And also, the fact that we use Python and Frappe for our non-latency-critical, heavy data stuff in the back end - that also works well. So yeah, I think that's it: switching from Python to Go, for high concurrency and throughput with really low resources, was the single biggest technology switch that we did, and everything else has been quite all right.

So what are your plans for your tech architecture - how is it going to evolve? There are some questions about that.

Like I said, we don't really have grand plans at Zerodha as a business. We solve small little problems one at a time: you solve one problem, and once that's sorted, you might see something else worth solving - talking about user experience, business - and the technology and architecture kind of follow that. Another important thing is that we're very wary of technical debt, so we've rewritten some of our systems multiple times. In fact, the entire Console back end has been rewritten from Python to Go - we were just doing a final review yesterday - and we found it to be like 10x faster, easier to deploy, and it uses a fraction of the resources. So internally, when we're not building new features, we're always cleaning up systems, cleaning up architecture, rewriting components. The program that streams market data has been rewritten six times. It has gone from using a custom TCP protocol for communication, to other messaging protocols, to ZeroMQ, to nanomsg - all of these things - to finally using NATS; and from using channels to mutexes. I mean, just to give you an example.
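[Editor's note: a rough, hedged illustration of the single-binary point being made here, not Zerodha's code. A minimal Go HTTP service compiles to one static binary, and net/http serves each request on its own goroutine, so there is no interpreter VM and no pool of worker processes (uWSGI/Gunicorn-style) to provision and tune. The endpoint and payload are made up for the example.]

```go
// One static binary; each incoming request gets its own goroutine, so a single
// process can serve very large numbers of concurrent connections.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/api/quote", func(w http.ResponseWriter, r *http.Request) {
		// Illustrative payload only.
		json.NewEncoder(w).Encode(map[string]interface{}{"symbol": "INFY", "ltp": 1000.5})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```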
So continuously cleaning things up, keeping systems lean and reducing technical debt is how we evolve the stack. We don't really have a goal, for example, to change everything into something else entirely in the next 12 months - we don't work like that. We always look at little pieces and solve them meaningfully. It could be reducing latency in a process from 20 ms to 10 ms, or it could just be rewriting a whole system for the sixth time because it can just get better. And into this you bring the process of building new features and evolving the product as the business and market needs change.

Yeah. So on the other side: you also have a very strong footprint in free software - you're one of the people who started FOSS United, along with Rushabh. So what are your ideas on free software, and how does that relate to building an organization?

It's unfortunate that we even have to have this discussion in 2020, if you think about it. It's as simple as this: I wouldn't be sitting here today talking about this stuff if it wasn't for FOSS. I wouldn't have become a programmer, I wouldn't have learned the things I have learned, I wouldn't have done the things I've done. Zerodha wouldn't exist without free software. I don't think the world as we know it today would exist without free software. So it's a no-brainer. I mean, even the question of whether you'd use proprietary versus free - in 2020 it's meaningless. Like I said, Zerodha wouldn't exist if it wasn't for free software, be it Go, Redis, Postgres, MySQL or Frappe. Absolutely everything that we use is FOSS. The scale we've achieved is thanks to FOSS. The technological efficiency - thanks to FOSS. The regulatory agility - thanks to FOSS. Being able to build all of this with zero external funding, keeping the team lean, keeping the stack lean and efficient - thanks to FOSS. So that's my philosophy. As a programmer, I started learning programming by looking at and copy-pasting code from the internet that other people had posted. I had no concept of free software then, but it just seemed natural to me: I copied somebody's code, modified it, played around with it, learned, and I posted it back on the internet. Eventually I figured out that it's called FOSS and that there are licenses, but it just seemed natural. If it wasn't for that in the early days - this was some 20 years ago - I definitely would not have become a developer, or built anything meaningful. So, unquestionably, yeah.

So, expanding on that - let's say you had to give advice to an engineer, or someone still in college. What would your advice be? What should they focus on?

The one thing I can say from my personal experience as a developer is that you have to keep developing. You have to keep writing code, you have to keep building tools, whether you're doing it for a company, where it's your job, or you're doing it as a hobby. Unless you experiment constantly, unless you write code, unless you build little tools - or even large tools - that are useful to people, that solve problems, it's very difficult to mature as a developer.
You can read all the books you want, you can listen to all the podcasts you want, but if you don't have experience writing software from scratch, connecting to a database from scratch, understanding how SQL or NoSQL works from scratch, and building software on top of it to solve people's problems - so that you get a perspective on usability and how human beings interact with software - it's very difficult to mature as a developer. So if you're a developer, you should develop software that solves little problems, so that it helps others and you also grow as a developer. That would be my one piece of advice: you have to constantly write software and experiment.

Thanks. There are a lot of questions that should maybe become blog posts - for example, how do you scale Redis, how do you use Redis, how do you run Postgres, etc. A lot of people are waiting to hear more detailed stuff, so I think many of those you can plan as blog posts. There are also some questions about hiring - what's your hiring process? There are many questions, so I thought it would be good to bring it up.

Cool. On Redis: Redis is pure magic. I'm absolutely a huge Redis fan - it is just pure magic. You install Redis, you run it, maybe you disable writing to disk - flushing - and handle that with replicas to remove the latency, and Redis will practically scale forever as long as it has enough CPU and RAM. It's beautiful to work with, it's a beautiful piece of technology, and it is so usable. Unlike many other things - even Postgres; Postgres is a little hard to install and set up, you know, create a user, run stuff. Redis just nails all of these aspects. So scaling Redis is easy: you install it, you run it, that's it - it just scales, there are no knobs to tune.

Hiring, as I said - we've never looked at industry experience, and we don't really have a hiring quota. This year, I don't know, I think maybe we've added one person, or maybe that was last year. We only hire when we feel, as a team, that we need more hands to solve a problem, or we've reached a limit as a team - that's when we hire someone. And for that someone, their educational qualifications don't matter, their work experience doesn't matter, but their aptitude as a developer matters. As I said, if you do hobby projects, if you're passionate about doing hobby projects, I think that's a good indicator of how you'd be as a developer, because you're doing these things because you like them - not for a particular job, and not because somebody asked you to. So the hiring process is simple: you look at somebody's portfolio - what they've done, the hobby projects or whatever they've worked on - and you speak to them. That's about it. There's actually one really simple task that we send out when we're hiring developers for, you know, beginner to intermediate positions. It's literally a Python script which scrapes a particular page from the NSE website and runs a CherryPy server with a search box - it's really simple. It is not intended to judge somebody's technical proficiency, because that can be learned; it's meant to judge people's attitude. When the task is so simple, you can tell a lot by how someone approaches it.
Some people just don't care at all - there are no comments, for example - and all of that shows attitude. So when we hire, the single biggest factor we look at is: is the person a hobbyist hacker, do they like developing?

There's one question I think I can answer. The question from Sagar Manjali is: do you still code, Kailash? I think he does - there were two minutes of waiting earlier and he was banging on the keyboard, so you know what he was doing.

Oh yeah. I mean, if you don't code then you're no longer a developer, and honestly the only thing I like being, really, is a developer. So yeah, of course I write code every single day. All of us in the team are full-stack developers; we write code every single day - 6 hours a day, 8 hours a day, 14 hours a day, yeah.

I think we've pretty much just crossed our time - 7:30. I think one last question, and then we'll see if there are any more; I've gone through most of the questions in the chat. So, when you look at most of the startups right now that are starting to scale - a lot of the time, most of the funded startups have team sizes that go into hundreds of people. Do you see any problem in the way the industry is approaching scaling or building teams, or is it that Zerodha just happens to be in a different kind of space? How do you see that?

I think it's a serious problem. I think massive tech teams are a huge problem: the business slows down, the communication slows down, the efficiency goes away. You have lots of departments that now need to coordinate; you probably have five people trying to solve the same problem where one good developer could just solve it. And most importantly, I think developers become jaded. Imagine you're a mutual fund platform and you have 300 developers. What are 300 developers doing, right? Everyone will probably get a tiny little thing to work on, and they can feel no ownership of it. They get a requirement, they write a few lines of code, they fix a bug or whatever, and they push it. There's no sense of ownership, no sense of growth as a developer. But in a small team, every single person in the team, including me - all of us own many different stacks that we operate on. All of us are principal developers on one or another of the hundreds of projects that we maintain internally. So it's just good for developer morale. And I think, again, with our experience - if you use us as a case study - smaller teams are far more efficient and scalable. So even by taking things cautiously and slowly, we are still far ahead of, let's say, our competition; we push out way more things than anybody else, with really tiny teams. It shows that people are generally happier, and the work is more meaningful, because you get to own things, and the communication between people is easy. We're like a social club, our team - we're like a social club: we have fun, we work together, there are no departments within the team. It's a very casual environment. So this tendency of growing really large teams, I really don't understand. And even the fact that certain companies, startups, when they start out, on day one they say: we need a DevOps team with four people, we need a whatever team, with front end - we need six front-end developers, we need eight back-end developers. You can't quantify developers into units; you can't say that this project requires two human developers to build.
You might require 10, you might require 20, you might just require one. But setting these goals on day one and hiring big teams - I've never understood how that works. I think it's pretty meaningless to quantify developers into units, human units, where you say we're going to hire four front-end developers and they'll build a website. That's not how it works.

So I think we're open for more questions. I'm kind of done with the stuff I could think to ask, but I guess this can continue for 10-15 minutes. So people can keep putting in your questions and I'll pass them on to Kailash. If no one has anything, I have some of my own - let me go with my list. Oh, I think I just got a pop-up - I think Karan is sending some nonsense on our tech group; I think he's watching this. Yeah, so one of the questions, from Shubham, is: how can one be confident that they'll be able to make good technical decisions?

Always and never. I know it's a weird answer, but it has to come with experience - and that could be two years, that could be 20 years - and it entirely depends on the problem. There's no such thing as "a technical problem that needs a solid decision"; there are an infinite number of technical problems of varying complexity. A technical problem at a stockbroker might be completely different from a technical problem at another company. So unless you are in it, unless you have your hands dirty, unless you've been doing it for a while - building experience, writing systems, building and breaking systems (not in the, you know, "fail fast" sense) - it's impossible. So there's no metric for that. Personally, lots of decisions are really easy for me from personal experience, and lots of decisions scare me. I think that will always be the case for any developer; it all depends on the environment.

One of the questions, from Abhishek, is: you said that when you're hiring people you're not looking at their technical skills, but only at attitude. So how does mentoring work at Zerodha once you take them in? Is there a period where they work with someone who mentors them? How does that work?

It's not true that we don't look at technical skills - of course we look; we would not hire non-developers, right? That's why we hire hobbyists, and hobbyists have work to show - they've programmed little things - so of course there's a clear indication of their technical prowess when we hire them. But for beginners, mentoring is simple, because we are a flat team. And until, you know, until Covid happened, all of us would just sit together on one floor - there were, you know, 20-30 of us; 30 of us now. We're a flat team, so everybody would just sit with everybody else and work with each other. And beginners, when they come in, always get a real task to do - something really non-critical. It could just be a simple automation script, something like that. And once they've done that, whoever gave it to them - it could be the Console team, it could be the Kite team - would sit with them and tell them: this is right, this is wrong, this is how it should be done. So you do one tiny real project, then a slightly bigger real project, then a slightly bigger real project - the keyword being that these are all real little projects. And gradually you graduate to a more serious project, and after that you just become embedded in one of those projects.
So Ganesh has a question: how do your internal small and large services communicate with each other? Sorry, I didn't catch that. How do the internal services talk to each other? Internal services talk to each other via HTTP APIs, and that's when it's back-and-forth communication. For a lot of our systems it's just one-way communication, so we have a Kafka backbone. We use it like a company-wide bus into which systems push data, and lots of little consumers from completely different departments and segments consume from it. For instance, say there's a fund update: systems that don't have access to each other and don't even know about each other, say a back office system and the trading platform, might need the same data, so all of them just connect to this central Kafka bus and consume it. That is only for one-way communication; for pretty much everything else we use HTTP APIs.
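As a rough illustration of that one-way bus pattern, here is a small Go sketch using the segmentio/kafka-go client: one service publishes a fund update to a shared topic, and any interested system reads it with its own consumer group, without the producer and consumers knowing about each other. The topic name, broker address and message shape are made up for the example; the transcript doesn't specify them.

```go
// Sketch of a one-way Kafka "bus": a producer publishes fund updates,
// and independent consumers read them via their own consumer groups.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Producer side: a system that knows about fund updates publishes them
	// to a shared topic without knowing who will read them.
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"), // hypothetical broker address
		Topic: "fund-updates",              // hypothetical topic name
	}
	if err := w.WriteMessages(ctx, kafka.Message{
		Key:   []byte("client-123"),
		Value: []byte(`{"client_id":"client-123","available_margin":100000}`),
	}); err != nil {
		log.Fatal(err)
	}
	w.Close()

	// Consumer side: e.g. a back office service and the trading platform each
	// run a reader with their own consumer group, so both receive every update.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "fund-updates",
		GroupID: "backoffice", // each consuming system uses its own group
	})
	defer r.Close()

	m, err := r.ReadMessage(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("update for %s: %s\n", m.Key, m.Value)
}
```

Because each consuming system uses its own consumer group, adding a new consumer later requires no change to the producer, which is what makes a bus like this useful for systems that have no knowledge of each other.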
One of the questions is: since major parts of Zerodha are written in Frappe, can you still say you're agile while depending on a vendor? No, there's no vendor. Frappe is a framework we use, just like we use Django; we use it internally, we built all our modules ourselves, and we host it ourselves.

The next question is actually about the industry, not really about scaling: how do you overcome the fear of financial penalties if a mistake happens in the fintech industry? I think the fear of financial penalties is the least of our fears, to be honest. Penalties you can pay and move on. But imagine having a catastrophic event like a bunch of leased line failures or a data center failure, where millions of people are unable to make crucial financial transactions. That fear, and the inherent risk in stockbroking, which is effectively infinite risk for a stockbroker because we do leveraged trades and complex derivative instruments, those two trump all our other fears. To be honest, if there were no issues other than paying penalties, that would be much simpler for the whole industry; but every broker has a hundred other really crazy fears they have to deal with. It's a part of life, really.

That's interesting. So another question, again about teams: how do product managers and tech teams work together? We don't have product managers. As I said, everybody is a full stack developer, and there are two designers; we form small units around them, and the two designers we have, they've worked with different applications and teams. At this point some of us have emerged as product developers plus product managers. Vivek and I, for instance, product manage Kite and also develop Kite, and in the Console team the developers themselves are the product managers. All of us work with the designers, so we don't have dedicated product managers; the developers themselves work like product managers. And when it comes to certain financial things that we may not really have insight into, we get help from the people who sit with Nitin. Nitin is the CEO of Zerodha, and he has a team called the Z team. The Z team has expertise and knowledge in a lot of things pertaining to the markets: market behavior, traders' psychology, and so on. Ideas come from them and we sit with them. So yeah, it's very pragmatic product management; we don't have dedicated product managers.

I think there are still some more questions, but I'll have to leave them because we're already 15 minutes past time. So yeah, thanks, Kailash. It was a really engaging discussion, and it's fascinating to hear something that isn't the usual story about building an organization or building tech stacks. Usually we hear people talk about scale in the sense of throwing more resources at the problem rather than approaching it from first principles. It's quite interesting to see something I myself try to believe in shown as a success story, with someone coming on the show to talk about it. Thanks a lot. I hope everyone really enjoyed the series. Thanks everyone, and we'll be back with one more next month. Thanks. Thanks, Anand. Thank you. Thank you. Bye. Bye everyone.