Good afternoon everyone. Is the font and whatever is written visible at the back? All right, OK. So, hi again. My name is Mahesh Lal. I work at Gojek, which is an Indonesian startup, and we have our tech office, basically our R&D center, in Bangalore, India. I've been working at Gojek since about April last year, and have been a part of their journey since slightly before that. Rather, I was aware of what they were doing and how they were growing from before that. I have been lucky to be a part of the team that develops the payment software at Gojek. We call it GoPay, along with the associated services around payments. And this is basically a recollection of the journey that we've had to come to this point: whatever tech decisions we took, whatever worked for us, whatever product decisions we took, whatever team structures we have built over time, more on the tech, less on the team. Please understand that what we are trying to do here is share our knowledge and try to learn from the community. It is not gospel. I can't guarantee that whatever we do or whatever we say is going to work for you. All I'm saying is that this has worked for us. So to begin with, we are basically a logistics company. Or rather, Gojek started as a logistics company. The idea was to allow people to book ojeks, which are basically two-wheeler cabs, from the convenience of their homes. The company itself started in 2010. At that time, people used to call a call center, and a person there used to call up the ojek driver and say, hey, why don't you go and pick this person up from this particular place? In January 2015, we launched the app, and within a year, we had 900x growth. After that, what we realized was, considering two-wheeler taxis are legal in Indonesia, why not actually go ahead and deliver anything that can be delivered on a two-wheeler? That opens up a lot of avenues. You can send parcels. You can order food and have it delivered.
And food was the next orthogonal thing for us. So to give you a certain idea, we actually deliver more food orders in a day in Indonesia than all the food aggregator companies in India put together. To date, we have delivered about 150 million food orders, and we have done more than 100 billion rides. Given all of this, and given the spending patterns in Indonesia, we actually thought that payments is a natural extension for us. If you are, let's say, ordering food, or if you are taking a ride to work, it's easier to just take the package from the person, or just get off the ojek and get to your workplace, rather than trying to pay from your pocket and then having to deal with cash. That is where the whole idea of cashless came in. And in Indonesia, cashless is a pretty unique thing, because for a lot of people, this is their first cashless experience. Indonesia is far less banked than we are. They are actually more averse to using cards than Indians are. But for some reason, we are able to handle this, and in Indonesia, we are able to do well on this. So version one is basically when we built a monolith that sat inside the original Gojek monolith, and we were able to handle most of the things that we wanted to do for the day-to-day products and offerings that Gojek had. But we would not want to restrict ourselves to whatever services we have, and just a tidbit, Indonesia has twice the per capita GDP of India. So that is a huge, huge market. If you restrict yourself to just logistics, you are actually going to lose out on a major chunk of the market. And that is why we are moving to version 2.0. Or rather, we already have version 2.0 in production, being used by all customers and all drivers that Gojek has. So the considerations that we needed to have while building 2.0 were: you need a system that can handle a really high throughput, and there should be predictability built into the system.
That is, if, let's say, I do a transaction and the system accepts the transaction, we need to take the transaction from start to end and not reject it. We need to give SLAs around how much time we are going to respond in. Because we are actually trying to move away from the monolith, we tried to build microservices. Ease of debugging was a major concern that we had, and how we do it. Also, considering it's a bunch of services working in tandem, you don't want one part of the system to go down and pull the rest of it down. That's why we needed isolation. And obviously, you need repeatability in whatever you do. Let's say you do a transaction once and you try the same transaction again. Money shouldn't be deducted twice, and so on and so forth. So the upfront decisions that we took were to use gRPC and Protobuf for our communication. Protobuf has extremely low overhead of serialization and deserialization. It also allows you to form a contract between the client and the server, and the client and server both need to have the same copy of the Protobuf. So what ends up happening is you are defining: these are my values, these are the data types. And any change in that means you have to do incremental changes. You can't actually do breaking changes; otherwise, people consuming this won't be able to use it. The original system was already built on Java, so we decided to stick to the JVM because we had tech capability within the teams. For systems where you need to say, OK, these need to be statically typed, we decided to use Java; where that was not as important, we decided to use JRuby. We have started to use Clojure also, and at places where you need a low-footprint, high-throughput kind of scenario, we are using Go. The other bit was we tried to keep it as simple as possible. There is no ORM layer as such. Whatever is there is basically JDBI and some hand-written code on top of that. No magic. The DI itself is hand-coded.
Whenever the server starts, it actually creates everything, just bundles up a server, and gives it to run. We are actually not doing a lot of normalization. We are repeating data across our services. While a few of us might frown at it, it actually allows us to stay away from joins, which in our context would actually slow down the system. And we've seen that happening. For whatever transactions we do, we try not to have DB-level locks. Instead, we try to lock at an application level. And that is something we are doing because we don't want to end up with deadlocks, which would be a nightmare to debug. Eventually, for debuggability, we decided that our IDs should be smart. So when we look at an ID in our system, we know exactly what can be done using that particular ID, what capabilities it has, which type of user it belongs to, and multiple other things. So this is one of the things that I had said stupidly when I was at one of our partner places: what can go wrong in 60 seconds? But because the system is spread across multiple machines, things will go wrong, and you have to be aware of that. I'll give you a few examples, or a few stories, of what we've had on the floor and what we had to do to mitigate that. In the early days, we had gRPC as an endpoint. We had actually separated out the Gojek systems from the GoPay systems, and the Gojek systems talk to GoPay systems using gRPC. We were using HAProxy and TCP-level load balancing. But what ended up happening was we realized that HAProxy does not do that good a job of load balancing on TCP, at least when it comes to gRPC. So on the gRPC side, even if you took a service off the HAProxy, it kept on receiving requests. It basically meant that if something is burning on the gRPC side, it's going to take down everything. And apart from that, there were false positives, false negatives, whatever you want to call it.
We had a 30-minute time to live for every connection on gRPC, and because of that, we thought our requests were taking 30 minutes, while that was really not the case. gRPC is fairly new, and what we realized was it's probably important to spend a lot of time on the various mailing channels. From that, what we ended up understanding was we were doing gRPC wrong. We were creating a channel for every request that we had. Rather than that, you should create a pool of channels and reuse them over the lifetime of your server. Also, apart from that, we learned that you should be doing client-side load balancing, for which we have actually built the pool. And apart from that, we are using Consul for ensuring that we do rolling deploys rather than HAProxy. Funny bit: a couple of days ago, once we were done with the Consul integration, we were able to deploy a service with full uptime at peak hour. That is something we hadn't been able to do with HAProxy. The other bit that we have figured out is you can't let anything, or any error, just default to the gRPC error-handling mechanism. If you do that, there are a lot of chances that your service will pull the peripheral services down, and you'll need to avoid that. The other bit we have also done is added circuit breakers at the integration points with our services. So at the external surfaces, wherever Gojek services call GoPay services, we have added Hystrix circuit breakers so that if anything is going wrong on GoPay, the Gojek services aren't affected. This is a particular date that I'll always remember. The reason being, everything was fine till the first half. Suddenly, at 3 PM, which is about 4:30 Jakarta time, we started getting a lot of PagerDuty calls. Turns out our DB disk had run out of space, which means that we were actually generating a ridiculous amount of data. We hadn't thought that would be the case.
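Before moving on, the channel-pool fix just described can be made concrete with a rough sketch. This is not Gojek's actual code: `Channel` is a stand-in for a real long-lived gRPC channel, and the pool simply hands out pre-created channels round-robin across known server targets, which also gives you crude client-side load balancing.

```python
import itertools
import threading

class Channel:
    """Stand-in for a long-lived gRPC channel (hypothetical)."""
    def __init__(self, target):
        self.target = target

class ChannelPool:
    """Create N channels per target up front and reuse them round-robin,
    instead of opening a fresh channel per request."""
    def __init__(self, targets, size_per_target=4):
        self._channels = [Channel(t) for t in targets
                          for _ in range(size_per_target)]
        self._cycle = itertools.cycle(self._channels)
        self._lock = threading.Lock()

    def get(self):
        # Cycling through channels spread across targets acts as a
        # crude client-side load balancer.
        with self._lock:
            return next(self._cycle)

pool = ChannelPool(["gopay-1:8080", "gopay-2:8080"], size_per_target=2)
first = pool.get()
assert pool.get() is not first   # next call reuses a different pooled channel
```

In the real setup the list of live targets would come from service discovery rather than being hard-coded.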
Our mistake was that we did not put any sort of monitoring on how much disk space was being consumed. So in this case, what would you possibly think? You'd think, OK, we'll run into heavy losses. You'll probably have to do a rollback of whatever services you have rolled out, right? But we actually lost absolutely no money that day. The reason was, Ajay is our CTO, and we keep discussing a lot of tech stuff with him. Back in October, we were trying to figure out, OK, how are we going to handle the avalanche of requests that are going to come in? And Ajay had said that it's not possible to do it all synchronously. So we decided to figure out, OK, what part of the system can be synchronous? What part can be asynchronous? For that, you actually need to figure out the app behavior. You need to figure out how people can use the app. And then we came to a conclusion: OK, whatever reservations need to happen can go on a separate service. That can be synchronous. You keep your monitoring on that. The transactions can keep coming in on a Kafka queue, and if the transaction workers go down, nothing really happens. Nothing goes wrong. All you'll need to do is basically switch on the Kafka workers, and that's exactly what we did. We swapped out the hard disk, put in a new hard disk with a much, much larger capacity, put in enough monitoring, and switched it on. And for the next five minutes, I think we saw the highest load in the history of GoPay, because all the transactions that had backed up suddenly started executing. And even then, nothing went down. So considering we now have queues, obviously Kafka, even though it's a distributed system and it's available and all of that, there are chances that it might go down. So what do you do then? For that, we have actually built a Redis cluster that backs Kafka, so that, let's say you're trying to post a message on Kafka and something goes wrong, you post that message onto Redis.
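That fallback path, post to Redis when the Kafka publish fails and let a worker drain the backlog later, might look something like this sketch. The `FakeKafka` and `FakeRedis` classes here are in-memory stand-ins, not the real clients, and all names are illustrative.

```python
class KafkaDown(Exception):
    pass

class FakeKafka:
    """In-memory stand-in for a Kafka producer (hypothetical)."""
    def __init__(self):
        self.messages = []
        self.up = True
    def publish(self, msg):
        if not self.up:
            raise KafkaDown()
        self.messages.append(msg)

class FakeRedis:
    """In-memory stand-in for the Redis fallback list."""
    def __init__(self):
        self.backlog = []
    def push(self, msg):
        self.backlog.append(msg)
    def pop_all(self):
        drained, self.backlog = self.backlog, []
        return drained

def publish_with_fallback(kafka, redis, msg):
    # Try Kafka first; on failure, park the message in Redis.
    try:
        kafka.publish(msg)
    except KafkaDown:
        redis.push(msg)

def drain_worker(kafka, redis):
    # A background worker re-publishes parked messages when it can.
    for msg in redis.pop_all():
        publish_with_fallback(kafka, redis, msg)
```

If Kafka is still down when the drain worker runs, the messages simply land back in the Redis backlog and get retried on the next pass.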
And from there, some other worker actually picks it up and puts it back on the Kafka queue whenever it can. That actually happened this morning as well, because at 6 AM, I was getting PagerDuty calls about, hey, you know what, there's a message lying on Redis. But by the time you actually look at it, it's already taken care of. But this then opens another problem. Let's say a message is put on the queue, sorry, it's put onto Redis. And if, for some reason, the worker that's picking it up from Redis and putting it onto the queue does it twice, or if there is some bug there, what do you do? That is where repeatability becomes important. That is where we are building idempotency into the system. What we have done is used keys that the external system that interfaces with us provides, so that we don't actually do the same transaction twice. This is not necessary for something that just reads the database. For example, let's say you're doing a balance inquiry. That doesn't need to be idempotent. What needs to be idempotent is, let's say, when you're actually deducting money from someone's wallet. You should not be deducting twice. That is where we are actually using request IDs. Now, as I mentioned earlier, Gojek and GoPay are actually based in two different data centers. We have to do that because of legal constraints. We can't actually mix up the payments and the logistics. The data center lag is 50 milliseconds, and the only current consumer is the phone app. What ends up happening is that at times, if there is a lot of traffic, your total time between when the request comes in and when we send a response goes to greater than 100 milliseconds. This especially happens in the morning when people wake up and want to go to work, or in the afternoon when they're trying to order food, or in the evening when they want to go back home.
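Coming back to the request-ID idempotency just mentioned, a minimal sketch of the check is below. The processed-key store is just a Python set here, standing in for whatever durable store the real system uses, and the function names are invented for illustration.

```python
class Wallet:
    def __init__(self, balance):
        self.balance = balance

def deduct(wallet, amount, request_id, processed):
    """Deduct money at most once per external request ID.

    `processed` stands in for a durable record of request keys
    supplied by the external system that interfaces with us."""
    if request_id in processed:
        return "duplicate"        # same request replayed: do nothing
    wallet.balance -= amount
    processed.add(request_id)
    return "ok"

w = Wallet(100)
seen = set()
deduct(w, 30, "req-42", seen)
deduct(w, 30, "req-42", seen)     # replay of the same request
assert w.balance == 70            # money was deducted only once
```

A read-only call like a balance inquiry would skip this check entirely, exactly as described above.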
So basically you have an avalanche of balance inquiry requests coming in, and that is where we were observing that we were taking most of the time. To add to that, we had actually set a conservative limit on Hystrix, saying that if there is 120 ms of lag, you should break the circuit and not slow the Gojek systems down. Because of this, what ended up happening was, obviously, a lot of timeout errors were received. So we tried to think of a solution, and we put in a cache. But as everyone knows, cache invalidation is one of the hardest issues we have, and it is here again that we had to think a bit. We decided to invalidate the cache every time you made any sort of transaction. So let's say you receive money, you invalidate the cache. You reserve money, you invalidate the cache. You pay for something, you invalidate the cache. All this considering most of our transaction flow is completely async. We actually decided to put these cache invalidation commands onto a queue, from where a Kafka stream picks them up and keeps firing them to the REST-based proxy that we have on the Gojek end, which maintains the cache there. So every time a request comes in, it only checks the cache. If it is not in the cache, it tries to get it from the backend services. We still do get a lot of traffic at the backend, but it's not the same load that we get when suddenly everyone wakes up in the morning and tries to figure out what their balances are. Another lesson we've learned: one of the evenings, I think one of our databases actually went down for some reason. I think someone was trying to take a snapshot and the database restarted. That was hardly for a second, but the next 15 minutes were a nightmarish experience, because that particular service actually held most of the wallets, and that was trying to pull everything else down.
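Before going on, the invalidate-on-every-transaction scheme above can be sketched as follows. The queue and Kafka stream between the wallet service and the Gojek-side proxy are collapsed into a direct method call here for brevity; all names are illustrative.

```python
class BalanceCache:
    """Stand-in for the cache held by the Gojek-side REST proxy."""
    def __init__(self, backend):
        self._cache = {}
        self._backend = backend      # fallback to the GoPay services
        self.backend_hits = 0

    def get_balance(self, wallet_id):
        if wallet_id not in self._cache:
            self.backend_hits += 1
            self._cache[wallet_id] = self._backend[wallet_id]
        return self._cache[wallet_id]

    def invalidate(self, wallet_id):
        # Fired (via a queue, in the real system) on every
        # receive / reserve / pay transaction for this wallet.
        self._cache.pop(wallet_id, None)

backend = {"w1": 100}
cache = BalanceCache(backend)
cache.get_balance("w1")          # miss: goes to the backend
cache.get_balance("w1")          # hit: served from the cache
backend["w1"] = 70               # a payment changed the balance,
cache.invalidate("w1")           # so the transaction invalidates
assert cache.get_balance("w1") == 70
assert cache.backend_hits == 2   # only two trips to the backend
```

The morning avalanche of balance inquiries then mostly hits the cache, which is the load reduction described above.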
What we realized that day is we need to think of what happens when your external sources, and when I say external sources, I mean sources that are external to your code, it could be a DB, it could be another service, and so on, what happens when they are down? How do you handle your graceful degradation? And that is when we decided that, while we already used to do TDD, now we'll also write some chaos tests. The idea of a chaos test is very simple. You pass in some parameters that are invalid for a connection to a service or a DB or any external resource that you connect to, and you ensure that your system throws a proper error rather than starting to flap. A few things on the team structure that have worked decently well for us: one is the fact that each of us is what I would call a Swiss Army knife, because we don't just dabble in delivery. It's not that because I'm a dev I'm just going to be worried about how to deliver this product and get done with it. I need to understand what the product roadmap is, suggest how we can do things better, or suggest alternative means so that we can actually deliver faster. In terms of XP practices, we do pairing more or less consistently, unless there is some burning issue which needs to be sorted and pairing is not possible. TDD is non-negotiable. We have tests for almost everything. Discipline in terms of the times you come to work, or ensuring that you get more face time with the team that you're working with, that's very important. Time off is something we encourage taking, and let's say we observe that someone is working far too hard, we try to ensure that person gets some load taken off them. And last but not least is respect. What needs to be understood is that every person in the team is actually working to the best of their capability, and irrespective of what the outcome is, it's not that they have not given 100%. One of the things that I've found really works is changing the terminology.
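Going back to the chaos tests for a moment, the shape of such a test might be roughly this: point the client at a deliberately invalid endpoint and assert you get one clean, typed error instead of a hang or a cascade. `WalletClient` and `DependencyUnavailable` are hypothetical names, not Gojek's actual code.

```python
class DependencyUnavailable(Exception):
    """The error we want surfaced when an external resource is down."""

class WalletClient:
    """Hypothetical client for an external wallet service."""
    def __init__(self, host, port):
        self.host, self.port = host, port

    def get_balance(self, wallet_id):
        # Real code would attempt a network call; here we just model
        # 'connection refused' for an obviously invalid endpoint.
        if self.port <= 0 or not self.host:
            raise DependencyUnavailable(f"cannot reach {self.host}:{self.port}")
        return 100

def chaos_test_degrades_gracefully():
    client = WalletClient(host="", port=-1)   # deliberately invalid
    try:
        client.get_balance("w1")
    except DependencyUnavailable:
        return "graceful"                     # one proper error
    return "flapping"                         # anything else is a bug

assert chaos_test_degrades_gracefully() == "graceful"
```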
Often we talk about, OK, we want to allocate resources for a particular thing to be done. They're not resources, they are people. If you start with that, you've gone a long way already. Actually, that's all I really have. I'm happy to take questions, but at the same time, I'll caveat this: my responses are really limited. You must ask the right question. I really can't share numbers, and I can't share details about what the data center setup is and all of that, partly because of legal issues. Thank you, Mahesh. So can we have some questions? Yeah, the gentleman in the red shirt. Hey Mahesh, this is Benki. Hey. From version one to version two, what was your migration strategy? Was it like a greenfield version two approach with a big bang cutover, or did you somehow transition? That's actually interesting, because I think we took a long time to migrate, and it wasn't a clean cutover for sure. In fact, even now, while we are live, the older system is still running as a backup, and if we want to switch back, we can actually switch back to it. Initially, what we did was try to ensure that every new wallet that gets created in the system gets created in our system as well. So whatever was getting created on the Gojek side was being created here. For the other wallets that we had, we tried a big bang approach. It really did not work well for us. We had to take a four-hour downtime, and that's really not doable, right? So we came up with a thought: OK, we just migrate the wallets, and then we start migrating the balances cut by cut. So one of the things that we did was start with migrating balances for the wallets that had interacted in the past hour, and keep on doing this. That way you have a decent overlap of the people who have actually used the wallets. And then also, the other bit that we followed was we did not release to 100% immediately.
We released in a staggered manner, and the release basically, sorry, once we had deployed GoPay in production, we were already getting all the transactions. So the balances were more or less matching. Wherever they didn't match, we had to do some fixes. But otherwise, what we really had to do was just switch where the balance gets read from. That way we were able to release in a staggered manner, and I think over four or five batches, we were able to do a full rollout somewhere towards the end of December. The migration itself, I think, took about a month and a half, and I think that is where most of the problems lie, because the data might not match between the two systems. The rules might be slightly different, because you would have wanted to correct all the mistakes that you made earlier, and you might have made newer mistakes. Any question here? Yeah, so currently version one and version two are both live, right? How are you maintaining backward compatibility, given that when version two goes down, version one is the backup? Has that also been tested, and are the new features being implemented in both version one and version two? So hello, yeah. So as such, what version two offers you is a full-blown system; it offers you a lot more auditability, and it offers you isolation of the systems, right? In terms of features, it's basically you're deducting money from a wallet and putting it into another. That is what version one was already doing. So in terms of maintaining the, what would you call it, sorry, in terms of maintaining the same sort of balances and all, we are actually transacting on both ends. So a transaction that goes to 2.0 also goes to 1.0. If you have to switch over, you can still switch over, and the switchover is controlled by a flag. If you change the flag, the switch will happen. So it's just a matter of where your balance comes from.
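The dual-write-plus-flag setup just described can be sketched as below: both systems see every transaction, and a single flag decides which one the balance is read from. All class and field names here are illustrative, not the actual implementation.

```python
class WalletSystem:
    def __init__(self):
        self.balances = {}
    def apply(self, wallet_id, delta):
        self.balances[wallet_id] = self.balances.get(wallet_id, 0) + delta
    def balance(self, wallet_id):
        return self.balances.get(wallet_id, 0)

class DualWriter:
    """Write every transaction to both v1 and v2; read the balance
    from whichever system the flag currently points at."""
    def __init__(self, v1, v2):
        self.v1, self.v2 = v1, v2
        self.read_from_v2 = False     # the switchover flag

    def transact(self, wallet_id, delta):
        self.v1.apply(wallet_id, delta)
        self.v2.apply(wallet_id, delta)

    def balance(self, wallet_id):
        src = self.v2 if self.read_from_v2 else self.v1
        return src.balance(wallet_id)

gopay = DualWriter(WalletSystem(), WalletSystem())
gopay.transact("w1", 100)
gopay.transact("w1", -30)
assert gopay.balance("w1") == 70    # served by v1
gopay.read_from_v2 = True           # flip the flag: instant switchover
assert gopay.balance("w1") == 70    # v2 agrees, nothing else changes
```

Because every transaction hits both systems, flipping the flag in either direction is safe at any time, which is exactly why the old system can stay on as a backup.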
But yeah, we would like to kill that off soon enough, because we want to do other things like, OK, how do you show the history, right? You need to show the history of the user. You need to be able to ensure that the user can track exactly where the money went. So while that is all available in one place, sorry, in both places, it is kind of becoming difficult to maintain. So yeah, you're right. As long as we keep doing transactions at both points, it is fine, but it's not a sustainable thing. We need to move away from that. And this is just for the temporary bit. As soon as we are sure that everything is perfectly OK, we are going to switch. There's a question here. Sorry. Yeah, just in front, the purple shirt. Hi. It's okay, it's okay, it's okay. Hi, this is Akshay from PayPal. Sorry, you'll have to be a bit louder. Loud, okay. Yeah, thanks. So I just had a question. In the beginning, you had a slide in which some five important points were listed down. All of those seem to have been covered. The last one, the smart ID thing, is something that I didn't get exactly. It seemed like a very obvious thing if you're saying that the IDs are unique and you're trying to maintain that, but what is the smartness part? Right, I'm sorry if I did not elaborate enough on that. So as a system, right, you'll be recognizing different types of users, or you might want to build different rules around different types of users. For example, if a merchant wants to interact on your system, their balances would be different, and the way they interact would be different. So what we have done is build sequences into our IDs, and we have slots of those sequences. To give a very rudimentary example, let's say 100 to 120 is a merchant; anyone else is some other category. And we have those small slotted windows that are there. So the last three digits of the wallet ID tell you what type of user it is.
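To make the slotted-window idea concrete, here is a toy encoding along these lines: a creation date, a merchant code, a per-day sequence number, and a trailing user-type slot. The exact layout below is invented for illustration; only the principle, an information-dense, self-describing ID, comes from the talk.

```python
from datetime import datetime, timezone

# Toy layout (invented): <YYYYMMDD><merchant:3><sequence:5><type:3>
USER_TYPES = {range(100, 121): "merchant"}   # e.g. 100-120 = merchant slot

def make_wallet_id(created, merchant_code, sequence, type_code):
    return f"{created:%Y%m%d}{merchant_code:03d}{sequence:05d}{type_code:03d}"

def parse_wallet_id(wallet_id):
    created = datetime.strptime(wallet_id[:8], "%Y%m%d").replace(tzinfo=timezone.utc)
    merchant = int(wallet_id[8:11])
    sequence = int(wallet_id[11:16])
    type_code = int(wallet_id[16:19])
    user_type = next((name for slot, name in USER_TYPES.items()
                      if type_code in slot), "other")
    return {"created": created, "merchant": merchant,
            "sequence": sequence, "user_type": user_type}

wid = make_wallet_id(datetime(2018, 3, 1), merchant_code=7,
                     sequence=421, type_code=105)
info = parse_wallet_id(wid)
assert info["user_type"] == "merchant"   # last three digits: type slot
assert info["sequence"] == 421           # 421st registration that day
```

Given only the ID, an engineer can immediately see when the wallet was created, who registered it, and what kind of user it belongs to, which is the debugging benefit described here.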
Part of the wallet ID actually tells you which merchant registered this particular user. A part of the wallet ID also tells you, in a particular day, let's say, if a thousand users are registering, what number this particular user was at. And those numbers are unique across those different slots. So by looking at it you get that information, and we also have the timestamp of the creation of the user encoded into the ID. That way the ID itself is very information-dense. We try to follow this across most of our systems. For example, when you're trying to book a ride on Gojek, your money is first reserved. So the reservation itself contains what time it was done, which was the calling service, and, apart from that, what the wallet ID involved in the reservation was, and a bunch of information around that. So by looking at the ID, you can start debugging. And this becomes important as your databases are spread across multiple services, and the same ID will be used across different services for different purposes. When I say the same ID is used across different services for different purposes, I mean you might have a service that just fronts everything and does authentication, and it also acts as some form of a user information service. But your wallet might be somewhere else. So the user information service needs to have a handle for the wallet and all. So if you have these smart IDs, your debugging becomes far easier. Does that answer your question, or is it still unclear? Sure, yeah, we can discuss that later. I'll take a question from back there. I think right at the back. Yeah, that's right. Yeah, hi, this is Shiva from PayPal. Hi. On the response time, which you mentioned, less than 100 milliseconds.
Is it for the round trip, or just for the balance, or is it actually reaching the processor, and the actual response is reaching the user within less than 100 milliseconds? Okay. The response time of less than 100 milliseconds is basically from the process which is calling our service. So it could be any backend. It could be anything that integrates with us directly. And that is chosen right now because we have that 50 ms time lag and Gojek is the only thing GoPay serves. But I think as we go forward, things will have to change according to the SLAs that we end up having with different vendors and merchants. And as a follow-up, right, in case it breaches the time limit, do you do a stand-in, or do you fail the transaction? I think that is behavior that will be owned more by the consumer. I don't think we can take a call on that. If it takes more than, let's say, 100 ms in this case, right now it probably would time out. And then, let's say, for example, if you're trying to book a ride and we do a reservation, and it takes more than 100 ms, then the consumer should be calling a cancel reservation immediately after that. I don't think we can handle it. Okay, there were some more questions. There's one here. Hi, Mahesh. Good talk. Thanks. So I'm curious to know, with your experience with denormalized data, how do you keep it eventually consistent? Do you have housekeeping jobs that you run behind the scenes? Essentially you give up locks or atomicity for performance, right? Rather than doing joins and whatnot. Yeah, so thanks. Giving up database locks does not necessarily mean that you'll give up on atomicity. There are different ways to do locking. So what we try to do is use a Redis cluster to do some of the locking. So whenever we are trying to modify a particular wallet, we lock it, and then we make the modification there.
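That wallet-locking answer, lock in Redis rather than at the database, can be sketched like this. `FakeRedis` mimics just the one Redis primitive such locks typically rely on, `SET key value NX` (set only if the key is absent); the surrounding names are illustrative.

```python
class FakeRedis:
    """In-memory stand-in exposing the SET ... NX semantics commonly
    used for application-level locks."""
    def __init__(self):
        self._data = {}
    def set_nx(self, key, value):
        if key in self._data:
            return False          # someone else already holds the key
        self._data[key] = value
        return True
    def delete(self, key):
        self._data.pop(key, None)

def with_wallet_lock(redis, wallet_id, fn):
    """Run fn only while holding the application-level lock for this
    wallet; on contention we back off instead of risking a DB-level
    deadlock."""
    key = f"lock:wallet:{wallet_id}"
    if not redis.set_nx(key, "owner"):
        return None               # another worker is modifying it
    try:
        return fn()
    finally:
        redis.delete(key)

r = FakeRedis()
assert with_wallet_lock(r, "w1", lambda: "modified") == "modified"
r.set_nx("lock:wallet:w2", "other-worker")        # simulate contention
assert with_wallet_lock(r, "w2", lambda: "modified") is None
```

A real Redis lock would also set an expiry on the key so a crashed holder cannot block the wallet forever; that detail is omitted here.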
On the denormalization bit, what we are trying to do is, let's say you have a customer, and there will be multiple details around them, right? Instead of putting that in two tables, we are actually putting that in one single record. That is what I meant by denormalizing. You obviously need to have some sort of a check. For example, if a wallet creation fails, you don't create the user, or you put a request somewhere where you say, OK, whenever the wallet creation is successful, just call back and update this guy, right? The callback and update is not done. What we do right now is, let's say a wallet creation fails for some reason, we just fail the creation of the user, and that is how we try to maintain consistency across the two different services that we have. Does that answer your question, or does it raise more questions? Sure, yeah, sure. Okay, we had some more questions. We'll take this one here. Hi, this is Nemo from Razorpay. Oh, one second. Please use the mic. Oh, I think that mic might be on. Hi, this is Nemo from Razorpay. Hey. You mentioned Consul in passing, as in you do deployments using Consul. Can you elaborate on that? So whenever a service comes up, it registers itself with Consul, and we try to do service discovery before actually sending across a particular call. Also, whenever a service is going down, it, sorry, mentions to Consul that it is going down, and Consul doesn't show it in its active services anymore. That is something similar to HAProxy, where you can say that I'm in drain mode and HAProxy won't send more requests across. So it is similar to that. But HAProxy can't fully do this, because HAProxy is not aware of the channels that are being created in gRPC. If you're aware, gRPC actually uses channels to communicate, and those channels are long-lived. HAProxy is not aware of them, so it can't kill them.
But on the other hand, out here, when the service goes down, the channel itself is killed anyway, and Consul does not expose it as a live service anymore. So it actually routes the request to the other two live services. Okay, we can have one or maybe two last questions. Anybody else? Yeah, we have one question here. Anybody else after this? So one second, just hold on. Anybody else has a question after this? Okay, that one. I think we'll end it at that. So please, just one second, hold on for the mic. Just one second, just hold on for the mic, yeah. Could you please talk more about why distributed payments are prone to failure? You mentioned it in the slide. Can you rephrase that question a bit? I'm not sure exactly what you want to know about that. Can you go back to the slides? Yeah, distributed systems are prone to failure. Okay. So when I say that distributed systems are prone to failure, you'll obviously face network issues. You'll have other issues around that, right? For example, I was in Jakarta, having dinner with the team, and our CEO was also there with us. Unfortunately, he actually made a top-up at that point in time. Five minutes later, he didn't have his money. And that was because of a network issue that was there. So you will have network issues, and you'll have other issues. For example, the other day, we had a disk full, right? While that can happen on the same system, on the same system it'll be easier to detect than on another system that is being spoken to. So that's why I said that distributed systems are prone to failure, and you need to prep for that. Yeah.