Thank you, Edwin, for the introduction. So you've heard about Grab's scale and how fast we've grown. I hope most of you used Grab to get here. As Edwin said, we're doing almost three million bookings, three million rides a day. And rides alone are actually not a good indicator of load, because bookings are what we care about: how many requests actually come to our system, and how many drivers do we have online. We've had a lot of learnings on how we've improved reliability. There's a huge amount of content, enough that we'd probably be here the whole night, so I've tried to cover only a few aspects of it.

Basically, I want to cover five things. The first is how we set up standardized monitoring and alerting, because that's one of the most important things to get right. The second is what we monitor and what we alert on. This is one of the things that I feel took us a long time to get right, and it led to a lot of sleepless nights. The third is about handling failures, and how we handle them better. The fourth is something that, again, took us a very long time to drive across the org, and that is operational responsibility. Usually people think it's the sysops team's job, or that only the engineering team is supposed to keep the system up. And the fifth is where we go from here.

So the first thing we made sure of concerns any infrastructure on AWS, because we're pretty much entirely on AWS. We used to run into a lot of incidents where someone provisioned an EC2 instance for a service and then forgot to add monitoring to it. The CPU grows consistently over the days, and suddenly we only find out when, boom, it hits 100% at peak hour and everything is down. It's just a human slip-up. So the first thing we started implementing was ensuring that any infrastructure that's provisioned always has standard monitoring configured, and I'll come back to how exactly we do this.

Now, the thing with standard monitoring is that it's hard to get right for everyone. It's not easy to say that everything has to measure the same set of things. Different resources have their own metrics, and you need to figure out what you think is important. Usually this comes through trial by fire: you learn it after some incidents. One of our painful ones was an ElastiCache downtime. We'd been running ElastiCache for more than two years, and suddenly with this downtime we had no clue what was going on. It was just slow. The only thing we had done that day was create a read replica for it, and everyone was confused. What we found out later, after contacting AWS support and so on, was that we had hit network saturation on that node. Because of the replica, the node was now doing twice the network traffic, and we hit saturation. We had never considered adding network alerts for ElastiCache. So these act as baselines; people will have to tweak and configure them, but that's where we start off.

These are the main tools we use to keep track of all of this. CloudWatch, because it comes out of the box with AWS; you'll inevitably use it, and it's pretty good.
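As an aside, one way to make the rule that every provisioned resource has monitoring enforceable is a periodic audit job. Here is a minimal sketch in Go, assuming aws-sdk-go and the convention that EC2 alarms carry an InstanceId dimension; this is hypothetical tooling for illustration, not the mechanism Grab actually uses:

```go
// Sketch: audit that every EC2 instance has at least one CloudWatch alarm
// pointing at it. Hypothetical tooling; the region is an example.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("ap-southeast-1"),
	}))

	// Build the set of instance IDs that some alarm watches.
	covered := map[string]bool{}
	cw := cloudwatch.New(sess)
	err := cw.DescribeAlarmsPages(&cloudwatch.DescribeAlarmsInput{},
		func(page *cloudwatch.DescribeAlarmsOutput, lastPage bool) bool {
			for _, a := range page.MetricAlarms {
				for _, d := range a.Dimensions {
					if aws.StringValue(d.Name) == "InstanceId" {
						covered[aws.StringValue(d.Value)] = true
					}
				}
			}
			return true // keep paging
		})
	if err != nil {
		log.Fatal(err)
	}

	// Flag every instance that no alarm covers.
	err = ec2.New(sess).DescribeInstancesPages(&ec2.DescribeInstancesInput{},
		func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
			for _, r := range page.Reservations {
				for _, inst := range r.Instances {
					if id := aws.StringValue(inst.InstanceId); !covered[id] {
						fmt.Println("unmonitored instance:", id)
					}
				}
			}
			return true
		})
	if err != nil {
		log.Fatal(err)
	}
}
```

A job like this, run on a schedule and piped into Slack, catches the forgotten-monitoring case before the CPU quietly climbs to 100%.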
Then there's Datadog, which is basically a graphing and monitoring tool; I'll show some pictures of it later. We've integrated CloudWatch to pull its stats into Datadog (there's a bit of a delay), and we use Datadog as our main alerting platform, so that we have one consistent way of alerting. Scalyr, which I think was alluded to in Jack's talk as well, is a log aggregation tool, similar to Splunk or Sumo Logic if you use those, or to running your own ELK stack; they say they're not on Elasticsearch, that they have their own secret sauce. Then we have Slack. We're pretty much always on Slack, and everything goes to Slack: you'll find audit logs in Slack, alerts in Slack, acking of alerts in Slack. And finally we have PagerDuty for the when-shit-hits-the-fan situations.

So that's how we configure everything. We use Datadog as our main platform: every single metric we have is sent to Datadog in some shape or form, and we use that to configure all our alerts. Datadog is a pretty good platform. This is how it looks. In Slack, you'd get an alert saying, hey, something's failing, and it'll tag some people. If something is going really bad, it'll trigger a PagerDuty alert, and that also shows up in Slack, even though you're being paged. As you can see, databases are our common problem. We also use Scalyr to do log analysis, and it'll pull out stuff like: hey, Go is panicking on this server, go look at it. This really works well, because people are always on Slack, so you always have eyes on it. Even if the on-call isn't responding to the page, other people are aware, at least those on Slack at the time. And most of our incidents happen during working hours, when things are changing.

So the first thing everyone starts measuring is the technical metrics: CPU, memory, latency, throughput, network throughput (if you didn't before), and so on. That's good for a start, but it's really not enough. (I have no idea what that slide is; maybe it's the next slide after that.) I'm missing my key message here, but it's supposed to say: you should measure your business metrics, and that's what you should really alert on. Grab's main metric is rides. Netflix's main metric is what they call plays: how many times you clicked play and actually got the content. Amazon.com's metric might be purchases. You should figure out your own main business metric at the platform level, as a whole company. Then each service figures out its own metrics. For example, our pricing service will probably have a business metric for the average price we're charging in Singapore, and our POI service might track the number of successful calls it's returning. Each service should define this and have an entire dashboard dedicated to it, and that becomes your authoritative source of truth.
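As a sketch of what emitting such a business metric can look like, here's the open-source DogStatsD client for Go (datadog-go); the metric name and city tag are invented for illustration:

```go
// Sketch: emitting a business metric to Datadog through the local
// DogStatsD agent. Names and tags here are made up for illustration.
package main

import (
	"log"

	"github.com/DataDog/datadog-go/statsd"
)

func main() {
	// The Datadog agent listens on UDP 8125 by default.
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Count every completed booking, tagged by city.
	if err := client.Incr("bookings.completed", []string{"city:singapore"}, 1); err != nil {
		log.Println("statsd:", err)
	}
}
```

A per-city tag like this is what lets a dashboard slice rides this week against last week, per market.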
This is a chart that everyone looks at on Friday evenings, because Friday evening is when we hit our highest figures. It shows our rides compared to the previous week. Unfortunately, I had to remove the y-axis, but the blue line shows today and the purple line shows last week. This graph is very peaceful, and there's nothing more comforting than having all the alarms going off, looking at this, and seeing that nothing's actually wrong. (Yes, that's Datadog.) All your alarms are going off, and you can come back and say to the business: yes, we know something's going on, but people are still getting rides; nothing is actually stopping a passenger from taking a ride. That visibility is really important to have. You can tell the business exactly what's happening, and you know how fast you need to respond. If that line has gone down to zero, you do whatever it takes to get the system back up. But if it hasn't gone to zero, take your time to actually solve the problem, because you might inadvertently bring it down to zero yourself, as we have done.

The next part: you need to constantly tune your alerts. That's again something that requires people to take an interest. You need to look at them all the time, because if you're over-alerting, people start ignoring alerts, and that's what happens all the time. We used to have some databases that alerted all the time (databases again, sorry). In my opinion, it's always better to under-alert than over-alert, because there's nothing worse than missing an alert because it fired and you ignored it. It's better that you weren't alerted at all, and that every alert that does come is something you actually need to look into; you get much better response times that way. So we go through a pruning process: these are non-actionable alerts, let's remove them and keep only the actionable ones. I won't say we're perfect there, but we try.

The next aspect: you've got your monitoring set up well, you've got your alerting set up well, and things will still fail. That's why there's this thing called failover. Things will fail, and you've got to be prepared to handle them. The classic analogy, if anyone's heard it before, is cattle, not pets. With pets, you give your servers cute little names like Tom and Harry, you know them really well, and if one of them dies, you're really sad. With cattle it's not like that: you've got so many that if one goes missing, it's okay, it's not the end of the world. I did not come up with this analogy, by the way; I'm not that cruel to animals. But that's how you should think of your machines. Your machines have to be replaceable, and you should prefer ephemeral things over stateful things. That's why databases are actually the hardest part: they cannot be ephemeral; they have to hold state. But things like your EC2 instances, try to keep them as ephemeral as possible. You don't want your service to depend on an attached EBS volume in such a way that if it goes away, your service crashes. You want it so that if any instance goes down and a new one spins up, that's fine. You'll probably have a short period where you may be overloaded; great if you can handle that as well.
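One small piece of making instances genuinely replaceable is graceful shutdown, so that a terminating instance drains its in-flight requests instead of dropping them. A minimal Go sketch, assuming a plain net/http service; the port and drain timeout are illustrative:

```go
// Sketch: drain in-flight HTTP requests on SIGTERM so an instance can be
// replaced at any time without dropping traffic.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for the scheduler or autoscaler to ask us to go away.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Stop accepting new connections; let in-flight requests finish.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Println("forced shutdown:", err)
	}
}
```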
The third thing that really helped us a lot is the concept of circuit breakers. The design pattern works like this: say service A is hitting service B and getting errors. The natural response is to retry. But if multiple services are talking to service B and they all retry, you're bringing it down even further, and as they keep retrying you're just causing a storm. The circuit breaker pattern says that after a certain threshold, you stop. You're now in a red state, and you wait a certain amount of time; then you go into what's called a yellow state, where you try a few requests. If they pass, you gradually increase the number of requests you let through until you're confident things are good, and you go back to the green state, where all requests go through. If the requests fail again during the yellow state, you go back to red and wait. This gives the service time to recover, and it also prevents the cascading effect. Every time you hit a failing service, you're waiting for a timeout, maybe one or two seconds, and each timeout makes you slower to respond to your own client. If you're doing microservice architecture, you've probably got ten clients stacked all the way back up, each timing out one after the other, and now you've got a ten-second response time back to, in our case, the apps. If the first client fails fast instead, you can return an error back much faster. And the moment a circuit breaks, we know something is wrong.

The final thing is to actually test your assumptions. One of our biggest failings was that we implemented circuit breakers but never tested them, so our thresholds were all wrong. Our system was down, but the circuit breaker never fired, because someone had set the threshold to 100 seconds, which we were never going to hit: everything times out after 20 seconds, because that was the timeout value. So you need to test your assumptions. What we do now, not quite Chaos Monkey style, is controlled downtime on production, late at night. Hopefully none of you have tried to book a ride after clubbing and found us down because of a controlled outage. We generally don't bring the system down completely; we bring down a small component that doesn't affect rides, and we test to make sure things are actually working and our assumptions are right. If an assumption turns out to be wrong, that's a good thing to learn, because finding out then is better than finding out at five PM when everyone's trying to take a ride.
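A minimal sketch of that green/yellow/red behavior, as a toy single-goroutine breaker; real libraries add locking, metrics, and a gradual ramp-up in the yellow state, and the names here are invented:

```go
// Toy sketch of the green/yellow/red circuit breaker described above.
// Single-goroutine use only.
package main

import (
	"errors"
	"time"
)

type state int

const (
	green  state = iota // healthy: all requests pass through
	yellow              // probing: let a trial request through
	red                 // tripped: fail fast, give the service time to recover
)

// ErrOpen is returned instead of waiting on a doomed downstream call.
var ErrOpen = errors.New("circuit open: failing fast")

type Breaker struct {
	st        state
	failures  int           // consecutive failures seen while green
	threshold int           // failures before tripping to red
	cooldown  time.Duration // how long to stay red before probing
	openedAt  time.Time
}

func (b *Breaker) Call(fn func() error) error {
	if b.st == red {
		if time.Since(b.openedAt) < b.cooldown {
			return ErrOpen // don't even try: avoids the retry storm
		}
		b.st = yellow // cooldown over: allow a probe through
	}

	err := fn()
	switch {
	case err == nil:
		b.st, b.failures = green, 0 // success closes the circuit
	case b.st == yellow || b.failures+1 >= b.threshold:
		// A failed probe, or too many consecutive failures: trip open.
		b.st, b.failures, b.openedAt = red, 0, time.Now()
	default:
		b.failures++
	}
	return err
}

func main() {
	b := &Breaker{threshold: 5, cooldown: 10 * time.Second}
	err := b.Call(func() error {
		// ... call service B here ...
		return nil
	})
	_ = err
}
```

Note that threshold and cooldown are exactly the knobs that were mis-set in the 100-second story: they need to be exercised under real failures, not just configured.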
So all of that is great. Now you've got this problem: features, deployments, velocity, versus stability. That's the trifecta of things people ask from an operations team, and they always want both sides, and the two always seem to be at odds with each other.

The first way you can drive operational responsibility across teams is to start reporting. We started this really late; we only started doing it properly last year. We send weekly uptime reports to everyone. Everyone in the company gets them: they know what our platform uptime is, they know it for each service, and we even track AWS itself, based on how we use AWS. We track whether any of it went down, and so on and so forth.

Then we conduct postmortems on every incident, and we send those out to everyone as well, because postmortems are generally not only technical learnings but also process learnings. We've had postmortems where we went to marketing and said: you can't do push notifications to every passenger in Singapore all at once, because they all open the app at the same time and we're not ready to handle that burst in traffic. So, okay, let's do batches. These are the kinds of things people learn from a postmortem, and it has to involve everyone.

The next thing, which we still do in some form, is that we actually restrict velocity in certain ways. We have something called the Hall of Overlords: a gated process for getting to deploy to production. Basically we have slots, and only one service can be using a slot at a time. What this gives us is that if something goes wrong at 10:30, and the last service deployed at 10:20, and only one deployed, it's most likely that one. It's better to roll that one back immediately and see if that fixes the problem. It prevents overlapping mistakes, and it makes rollback easy. This is hitting its limits right now as we grow, because there are just so many services and limited slots, and it's only open during working hours, roughly 10 to 5. So it works in the short term, but we'll probably have to get rid of it at some point.

How many of you here are devs? Because you'll probably give me dirty looks now: devs should be on call. This is something we thankfully started very early. If you put a service in production, you're responsible for it and you're on call for it. That's when you care for it, and that's when you'll do the things that prevent you from being woken up at three o'clock in the morning. The ops team cannot know everything, especially now that we have around eighty services in production; you just cannot know everything that's running.

The other aspect is illustrating the cost of downtime. You need to show the actual impact of going down for one hour. Thankfully, our business side knows that really well, because all the drivers call in; they understand it. But your business teams might not.

Another thing we did: small issues usually add up, and they usually get sidelined and brushed under the carpet. We have a separate process for tracking them. These small issues, reported by one or two passengers, come into customer support; tech support picks them up and creates tickets, people go through them periodically, and we have reporting on that as well.

And the final thing: at one point we had no uptime targets; we were just freestyling it. We finally set targets: this is where we want to go. That creates a sort of internal checks and balances, because you realize, okay, we've only got about ten minutes of downtime budget left and two more weeks to meet our goal, so let's try a bit harder to keep things stable, let's do a bit more preventive work, because now it's a team effort. That really helps as well.
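For reference, the "minutes left" arithmetic is just an uptime target converted into a downtime budget. A throwaway illustration; the 99.95% target is invented for the example, not Grab's actual number:

```go
// Turn an uptime target into a monthly downtime budget.
package main

import (
	"fmt"
	"time"
)

func main() {
	target := 0.9995             // example target, i.e. 99.95% uptime
	month := 30 * 24 * time.Hour // a 30-day month
	budget := time.Duration(float64(month) * (1 - target))
	fmt.Println(budget) // 21m36s of allowed downtime per month
}
```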
Okay, so we have limited time, as I said; there's a lot more we could cover, hopefully in another talk, or you can catch me later. But there's a lot we did in terms of architecture patterns: improving reliability by tackling how we design our software and our services. The second thing, which we're rolling out now, is SLAs between services, internally, and holding each other accountable. The third thing, which we have rolled out in production, is distributed tracing. In a monolith it's easy, but when you have so many microservices it's very difficult to pinpoint where exactly an issue is happening, and distributed tracing helps with that. You can check out the OpenTracing framework; we use that. The fourth thing is runbooks. We don't do this really well yet; people are picking it up now. But runbooks simplify things, especially at the rate at which we onboard people: no one can know the whole system at this scale, especially when you're fresh at Grab. The fifth thing is training: how do you prepare people for on-call, and especially for on-call lead roles? Before, it was basically Edwin, me, my boss, and my intern being available all the time, because if something went wrong, no one else really knew all the interactions. This is mostly trial by fire, but there are ways to improve it as well. And then deployment pipelines, how you improve those, and so on. So feel free to talk to me about it; there's a whole lot of content.
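Since OpenTracing came up: a minimal sketch of what instrumenting a call path with it looks like in Go. Tracer setup (Jaeger, Zipkin, and so on) is omitted, so this runs as a no-op, and the operation names and tag are invented:

```go
// Sketch: annotating a request path with OpenTracing spans.
package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

func handleBooking(ctx context.Context) error {
	// Start (or continue) a trace for this unit of work. Across services,
	// the span context is injected/extracted on each call, which is what
	// lets you pinpoint the slow hop in a microservice fleet.
	span, ctx := opentracing.StartSpanFromContext(ctx, "handle_booking")
	defer span.Finish()

	span.SetTag("city", "singapore") // tag spans with business context

	return priceRide(ctx)
}

func priceRide(ctx context.Context) error {
	// Child span: shows up nested under handle_booking in the trace UI.
	span, _ := opentracing.StartSpanFromContext(ctx, "price_ride")
	defer span.Finish()
	// ... call the pricing service ...
	return nil
}

func main() {
	_ = handleBooking(context.Background())
}
```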
Yeah, so we're just getting started; there's a long way to go. The SRE org is basically only one year old at Grab, and Grab is five years old, so this is just the start of it. Any questions? Yes.

At Grab, how do you manage server access? Because you need a place that everyone has access to. The console? Yeah, or maybe a T2 or C1 or whatever.

So, all engineers have full read access. Full read access. Only what we call the SysOps team has pretty much admin access, with write access, so no one can make infrastructure changes other than them. That's a model we're actually having problems with right now, because we have 250-plus devs and only about eight SysOps engineers. That's a bottleneck we're trying to address, but right now that's the situation. Yes.

You mentioned that Grab is in 65 cities. Yeah, 65 cities. So some of those cities might not be that close to where you're hosted. How do you handle that latency?

We can't really do much about it at the moment, other than ask AWS to create more regions. Unfortunately, well, fortunately actually, we are pretty much entirely on AWS, and they're only in Singapore here. But we don't really notice it too much, because the networks in Indonesia, which is one of our biggest markets, are so bad that the latency to our servers isn't the big difference; the network providers are just terrible in most cases. If you've gone and experienced it, it's a world of difference from here. You get stuck in the app wondering what's happening; the app is very unresponsive. Yes.

How do you follow up on the metrics you've decided to measure? What actions do you take when you see a dip or an increase in some metric?

It depends on the metric. If it's rides, that's basically a drop-everything, pull-everyone-in situation, and we pretty much have to do that, because usually when rides go down, there are a bunch of other alerts firing as well. If it's something small and service-specific, that team figures out the level of escalation and discusses with the business teams what they're seeing, and so on. For our symptom-based alerting, like CPU or memory, we look into what's happening. If it's database CPU, people go and check whether slow queries are triggering, and so on. If it's ElastiCache, they go and check what's happening there. Generally some of the engineers will go look, whoever owns that piece of infrastructure.

We do have a staging environment as well, and that one is a bit more lax: more engineers have SSH access and more control over it, and there's no deployment window and so on, but they do not have write access. Yes.

On your databases and your EC2 instances, do you allow devs or QA access during troubleshooting?

On staging only. And on production? On production, a few people have SSH access, but generally you won't need SSH access to troubleshoot, because we pull out all those stats and publish them into Datadog or similar, so everyone should have visibility from outside. It's only in very rare cases, where you want to do a TCP dump or something, that you'd actually want to SSH in.

How about the databases? You mentioned you have a couple of hundred databases. How do you manage the users, do you grant access manually to everyone each time?

So, we do have something called an engineering replica, which is specifically meant for ad hoc use, for people to run queries to find out certain things and so on. But that's only for certain databases, not all, because some databases hold sensitive information.

You said devs only get staging access. So even for debugging, they won't access production?

No, it depends on the severity. Most of the time, all of the metrics should already be pushed into Datadog, so they have visibility from outside. If they do need access and don't have it, a SysOps person is also on call; they can ping or page them to get online to help, or request temporary access immediately. Yes.

So, like I said, we have that gated deployment window, on working days, 10 to 5 Singapore time. Each person has to request a slot, then they update the docs, and then they can go. Deployment is automated in the sense that we use Jenkins and Ansible, so it's a rolling deploy that happens there. Is there anything else specific to that point? Yeah. So that's generally how we do it, but it's gated at the moment.

On the deployment window: how does it look when you have different regions? Do they have their own deployment windows?

The only other deployment window we have is for our overseas engineering team.
Because our Singapore-time deployment window is pretty much night time for them, they have a separate window, when they're awake, that they deploy in. For all of Asia it's easy; India, for example, fits the same window. Yes.

And what has been your worst experience ever, the kind where you go, oh my God, we never even thought of that?

When the Singapore MRT went down in 2015, when the green line and the red line both went down. I'm not kidding. There was nothing we could actually fix at that point, because we only had one database back then, and the database just got bottlenecked on the load. This was quite some time ago, 2015, or maybe even 2014, I can't remember. But long story short, we literally couldn't do anything other than manually shed load at the API layer, to just reject bookings, because there weren't enough drivers in Singapore to accept all of them anyway. Right after that incident, we introduced proper ways of load shedding on the API layer, and thankfully that has proved to be a very useful technique as well. That should actually have been one of my slides.

It's always cat and mouse here. You can only take reliability up to a certain point, so it's basically: how much downtime are you willing to accept, how much effort are you willing to invest, and what are your targets? Yes, sure.

Do you have dedicated Jenkins instances that actually build the code?

We have two separate Jenkins setups. One Jenkins is for CI, in the sense of making the build itself, because we use Go for most of our apps: getting the binary out, running the tests and so on. Then we have a separate Jenkins cluster to actually do the deploys. They are completely separate.

No, we do rolling deploys. Rolling, that's all. All right, thank you very much.
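A footnote on the MRT story above: load shedding at the API layer can be as simple as a cap on in-flight requests. A minimal sketch, assuming a plain net/http service; the limit and handler are illustrative, not Grab's actual implementation:

```go
// Sketch: crude load shedding. Reject with 503 once in-flight requests
// exceed a cap, instead of letting the database drown under the backlog.
package main

import "net/http"

func shed(limit int, next http.Handler) http.Handler {
	inflight := make(chan struct{}, limit) // counting semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case inflight <- struct{}{}:
			defer func() { <-inflight }()
			next.ServeHTTP(w, r)
		default:
			// Over capacity: fail fast so accepted requests stay healthy.
			http.Error(w, "try again later", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	bookings := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("booked"))
	})
	http.ListenAndServe(":8080", shed(1000, bookings))
}
```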