Hello everyone. Welcome to the second part of the three-part talk series, "Doing Site Reliability Engineering the Right Way", by Piyush Varma. Thank you for attending this session; we'll be starting off now.

A little bit about me: I'll be your moderator for the evening. I work in the operations team at Autodesk. I'm basically a typical Linux guy who still fondly remembers the last time he ran fsck — it's been a long time. I'm moderating this session because, like some of you, I too am searching for an answer to the question: who am I? Am I a DevOps guy, an SRE, or a sysadmin who just knows how to code? I found answers to some of those questions in the previous session, and I'm sure I'll learn a few more in this one.

To conduct this session we have the inimitable Piyush Varma with us, CTO of Last9.io. The first part of the series laid the foundation for why an org needs an SRE team and the benefits it brings. In this session Piyush will build on that and share his hard-earned knowledge and experience around the approach, choices, and tradeoffs SREs must make when choosing the right tools of the trade. I'm sure you will get your time's worth — I know I will — and walk away with useful, actionable insights.

Before I hand over to Piyush, a few simple guidelines. The audience can ask questions both during and after the session: please use the Q&A feature of Zoom and paste your question, and I'll relay it to Piyush whenever possible. This event is being live streamed; some participants are on YouTube as well, so check it out — if you can't forward the Zoom invite, please forward the YouTube link to your friends. The slides used in this session will be posted by Piyush later and made available. And lastly, please do not switch off your phones: some of you might be on call, and it would be really ironic if you missed a PagerDuty alert while attending a session about doing SRE the right way. Alright then, over to you, Piyush. All the best.

Thanks — I hope my Wi-Fi holds through this time round. So, this second part is about understanding the mindset and the tools. Part one was more about setting up a culture and why SRE is different from DevOps; here I'll move slightly deeper into what it's like being on call, and what tools and mindset to use. I've already been introduced, so I don't need to do that again.

A few frequently asked questions I get: which one is better — should I go Kubernetes all the way to deploy my applications, or should I use Ansible? With Kubernetes coming in and serverless coming in, is serverless the right way to go? And there's one of the oldest choices out there, going codeless, where the code gets generated for you — is that the right way? At one point of time I used to believe that the answers to these would get me closer to reliability, and things would begin to fail less. But the question we really want to ask ourselves is: when something breaks, does it make a sound? How else would you know? So, if I go step by step: will this make a sound when it breaks? Would this make a sound when it breaks? Or would this make a sound when it breaks? For a moment, take this one as well — this is a piece of code, I'm running something.
Will this make a sound when it breaks? I think it's time to ask the audience this question. Why don't you take a quick poll and tell me: would there be a sound when any of these systems breaks?

Yes — everyone on the call, there's a poll button in your Zoom window. Just click on that; the first poll has come up. Would this make a sound when it breaks, yes or no? Click whatever answer you feel is right and submit. There will be a bunch of these polls during the talk. We see people answering — it's a decent split; some say yes, some say no. Still coming in. A few more seconds and we can close this. Yeah, let's stop there.

Pretty interestingly, 64% of people said it should make a sound, and 37% said it should not. For the answer, I'm going to go back to one of the famous questions philosophy has asked us: if a tree falls in the forest and no one is around to hear it, does it make a sound? Similar question: if a bug happens and no user is around to see it, should it raise an alert? That's the more important question to ask. More often than not, we get carried away, or paged, into attending every issue as a priority. Is it really a priority? Is it something that should be addressed right now? There's a lot of economics of reliability behind this. What if I told you this bug was in a logout button at 4 a.m.? Should it make a sound right now? Probably not. It also depends on how frequently it's occurring. It's never a plain black-and-white question of "there's an error, you've got to rush towards it." At the same time, I'm not saying be careless about it either. Regardless, this is exactly what leads to a lot of what we call post-bug stress: we get paged — now what do we do, how do we go about it?

There's a small incident I'll walk you through — the journey of an incident that was one of the most remarkable ones I've gone through. Everything was in place and in order, and yet something fell through the cracks. It was around 7:30 a.m., almost 25 hours before a country launch we were doing, when PagerDuty went off. All our buzzers started going off that something was breaking. We went into Elasticsearch and searched: there were just a few — five or six — 500 (5xx) requests. We started debugging. There were a lot of logs coming in, at roughly a megabyte per second, and there wasn't a correlation ID, so we couldn't isolate where in the tons of logs the exact issue was happening. Precisely five minutes later, the 500s stop. PagerDuty is out of the story. We're like, fine, must be something, not very high priority, we can come back to this later. Five minutes later, exactly sharp by the clock, Pingdom starts alerting PagerDuty again that the public API is unreachable. Every time you see a red font on the slide, I'm introducing a new tool here — most of the tools you could think of were already in place. There's Grafana as well, which has been set up and which also pages PagerDuty on 500s. Alerts are going to Sentry, which also triggers PagerDuty. Elasticsearch, ElastAlert, Sentry — the standard route, on and on it goes. 7:45, exactly five minutes later, the 500s stop.
PagerDuty is out of the story again, and the cycle repeats: five minutes on, five minutes off, five minutes on, five minutes off. When these systems fail, the very first thing we ask ourselves is: was there a deployment? Quite obviously, because that may be the easiest answer — more often than not we say that most bugs lie between the computer and the chair, so surely somebody must have deployed something. We go in and check, and come back: there wasn't a deployment. Looks good. We call up the release manager — an actual phone call — was there a new deployment? Either he's denying it or he genuinely doesn't know of one, but there wasn't a deployment.

Am I still on the call? Is my internet holding up? Yeah, yeah, all good. Okay, perfect.

So we call up the on-call SRE: was there a change? Per the last on-call SRE, there wasn't a change. Grafana looks okay — the dashboards are doing their job. Metrics look okay, APM looks okay. Where exactly the bug is, we don't know. We check the firewall: is it dropping any traffic? No, it isn't. This whole ordeal, believe it or not, took us 20 hours, and 20 hours later we realized there was a simple mount command which hadn't run on one of the DB shards. Because of that, data was being written to memory; the system got rebooted and the data was wiped. Now, this machine wasn't supposed to be commissioned, but we had fixed it, so we put it into circulation and service discovery brought it up. One of the commands that should have run via Ansible did not run on that one machine — it was left out. One command did not run, the machine rebooted, data gone — but only a certain section of the data was gone. What that means is that only a certain type of request started failing, which made the load balancer mark the box as an unhealthy upstream and then bring it back — and that's why the oscillation.

The bigger question here: we figured it out — what is the next step? Quite obviously the next step is to fix this. True, but the more important questions we ask while on call are: where else is this failing, and how do we avoid this? These are the two important questions we learned in part one as well. This is what gets my curiosity going whenever I see a bug. My first reaction is "where else is this failing?", because systems are designed by culture — if I've made a mistake in one place, it's quite likely I've made it in other places too. And the second thing is really getting to the bottom of it: how am I going to prevent this situation from ever happening again? To avoid a situation from happening again, we must know what the root cause of the problem is. What would be the root cause in this case?

Let's raise the poll again for this. What do you think the resolution could have been — what root cause do we pick: "use Chef", "use Ansible", "use Kubernetes", "use Terraform", or "nothing you could do"? Audience members, you can check the poll and put in your answers. It's interesting — when you say "nothing you could do", you mean nothing I could do, nothing one could do. Cool, let's give it 10 more seconds and then we can stop the poll and share the results. I'd encourage everybody to vote.

How interesting: 80% of people say there's nothing you could do, and 10% say Ansible was the reason — so it was Ansible's wrongdoing. Well, we did use Ansible to run commands, but a machine was brought into circulation.
And when it was brought into circulation, one of the Ansible runs hadn't happened on it — it was a newly provisioned machine, but service discovery picked it up. "Use Kubernetes"? Exactly the same thing would have happened; we were using Nomad at that point of time, and because the machine was available, the dynamic workload would just land on whichever machine was free. If anything, had I gone really old school and not used any of these, maybe we would have avoided it. And "nothing you could do" — sure, but business won't take that as an answer. How do we really solve this? We can't end an RCA there.

This is what goes into building a good root cause analysis. A root cause analysis must identify what exactly happened and what the loss was. Here the loss was data loss, which resulted in a significant amount of embarrassment, if not a financial loss — which we couldn't even measure, because the system was actually down. There has to be something to avoid this; it can't end with "sorry folks, we can't do anything about it." To understand the root cause, one of the important questions we must ask ourselves is: what are the reasons things fail in the first place? Only then can we identify the category this particular failure belongs to.

So let me raise a few choices here — failure reasons. This is borrowed, almost shamelessly, from Google's SRE book, plus a lot of other studies that have been done. There are a few broad classifications of why systems fail. Yes, you can leave the poll on; people can answer. The question here is: in what ways do systems fail? This is a summation of most of the broad categories. What do you think is the single most prominent reason — or pick two, pick multiple — why software fails? Incorrect expectation usually comes from "I expected something to do something, and it didn't", which is interesting. Network failure; traffic load — throughput increasing, a sudden surge or spike; service provider faults would cover everything from your ISP failing to AWS failing. I encourage the remaining five people to vote as well. Five more seconds and we can stop the poll. Sure. Interesting. All right, let's look at the results.

Sweet. The majority of you believe that configuration change and new deployment are the biggest causes of failures. That's right — those are the single biggest reasons things fail. And this goes back to talk one: reliability is a constant tension between stability and release. Our first principle is: how do we increase the velocity of releases? Teams aim to ship more and more software, and that is when bugs happen — yet we have to enable that velocity and avoid the bugs as well.

So these are the reasons failures happen. But interestingly, I want to raise a point here: we never say failures happen because of individuals, right? Yet more often than not, when root cause analyses come out — also coined as RCAs — we end up writing something like this (hopefully no one involved is on the call): "We forgot to execute a mount command." But if you go back, we just said that individuals are not the reasons; the reasons are these categories. So the RCA cannot simply say that an individual failed to do something.
It has to be something else. How do we improve this RCA? We take another stab at it: "The infrastructure team forgot to execute the mount command." But that's not really actionable either. Somebody forgot — so what's the resolution, that we're going to boost their memory? That's not possible. So this RCA has no actionable element. Yes, it summarizes the situation, and it's blameless, as folks like to call it, but it's not actionable.

So what's actionable? Is "why was it even possible to SSH into the system?" the right question to ask? Should we disallow that access? But that hinders the work, right? There's a reliability team in charge of the system, there's a deployment team, there's a DevOps team, there are system administrators — they are all going to SSH into the system. So that's not the answer either. One thing I often say: what's the best way to actually enforce security? It's not logs, it's security cameras. Logs can be broken, they can be avoided, but security cameras keep capturing; the awareness of being observed is the biggest deterrent. Now, in this case we're not advocating fear — the point is that preventing somebody from doing their job is not going to make the system reliable. So this RCA asks a question, but it's probably not the right resolution; it will only hamper other things.

We take another stab at it. Maybe this is the right question to ask: why don't we have a tool which actually checks and alerts on a configuration mismatch? If all my systems in production had a way to compare their configurations and raise an error every time there was a mismatch, that would be the best thing — because now people are not burdened and can take their decisions more freely, since the mundane work of validating a configuration has been offloaded to a computer. This comes from a very simple underlying principle that we've read over and over: to err is human. Failures are inevitable; as long as there are humans in charge of systems, they are going to make mistakes.

That takes us to another question: do you think changing the personnel in charge would have changed anything in this overall situation? That's the bad apple theory — it says that if you take the bad apple out of the basket, what remains is really good quality apples. So what we're asking is: if we had replaced the person in charge, would we have gotten a better result? Would that solve the problem? Let's wait for the poll results. Anyone else still voting? Okay, we have a good number. Good that the polls are anonymous. Saravan, are you polling as well? No, no, as a panelist I can't. Okay, let's give it five more seconds.

Okay, sterling results. 80% of people think it's not going to solve it, and 20% think it will — that's almost Pareto thinking; we can apply the 80/20 rule here. But here's the thing: skill and ownership are among the most important things. I would say it really depends — I wouldn't say either camp is entirely right here, nor entirely wrong. We say that changing personnel doesn't really solve anything, but at the same time we have to be pragmatic and cognizant of the fact that skill and ownership always go hand in hand.
And what that means is not less skill — a "bad apple" is not necessarily somebody with less skill; it may be somebody who doesn't have real passion for, or real ownership of, what he or she is handling.

Sorry to interrupt — some of the attendees on Zoom cannot see the poll results, so it would help if you could call out the numbers. Oh, okay: 80% of people think that changing personnel will not change anything; 20% think that changing personnel might improve things.

So skill and ownership go hand in hand. You have to have the right-skilled people, and you have to give them the right ownership as well. Even once this is in balance, mistakes are still going to happen; at that point you basically accept, okay, I've got close to the best setup here, but it doesn't mean mistakes are eliminated forever. So we need to go back to finding the real root causes and improving by putting technology in place to do the mundane tasks. This is exactly what an SRE does, and what our tools should look like: tools which catch the minutest of details — the things a person can overlook and miss, whatever is so mundane and so boring that it is going to result in an error. Anything you're not paying attention to is going to fail at some point or the other, and that is where these tools come in really handy.

So if I take another stab at the real root cause here, this is it: I should have isolated it down to a system configuration validator, of sorts. There was no system validating the state of the machines every now and then. We built one, and it became one of our bread-and-butter tools before and after launches: we would run those configuration validators to see if everything looked okay — whether all ports could be scanned on all machines, whether the connectivity was fine — and obviously the scope kept increasing. One of the best tools that fits here is osquery, a great tool by Facebook. You can make SQL queries against systems, compare the results across systems, and see which ones differ.

Second, introduce failure mode effects analysis. FMEA is a slightly older term borrowed from the industrial sector; these days we call it chaos engineering. Failure mode effects analysis basically says: run the application in a failure mode and observe the impact. If, in our labs, we had actually run these systems against failures — fault injection techniques — we could have caught some of this. Again, we're not saying failures would be eliminated; we're saying failures would be reduced.

Another thing that was missing was non-latent configuration validation. There's a lot of good research around this — papers which classify configuration errors as latent and non-latent — and what they found was that, if I quote the number correctly, around 72% of failures could have been reduced if configuration validation were not latent. What does that mean? More often than not, in our codebases, we want to write to a file, read from a directory, hit a URL. The configuration variable is passed through the system, but the code point that actually uses it — the invocation — happens very late. If it was a bad URL, then after my application has been deployed and running for a while, only when that code path is hit does the validation happen, correct or not. So the advice is: bring it forward to system startup, validate every single configuration variable that is there, and you may be able to reduce your failures by that much. It's one of the techniques, not the only technique.
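To make that concrete, here is a minimal sketch of what startup-time (non-latent) configuration validation can look like. The config keys and the checks are hypothetical; the point is simply that every value gets exercised before the application starts serving, instead of failing weeks later when the code path is finally hit.

    import os
    import sys
    import urllib.request

    # Hypothetical config; in real life this would come from a file or env vars.
    CONFIG = {
        "upload_dir": "/var/data/uploads",
        "payments_url": "https://payments.internal.example/health",
    }

    def validate_config(cfg):
        errors = []
        # Fail at startup if the directory we will eventually write to is missing.
        if not os.path.isdir(cfg["upload_dir"]):
            errors.append("upload_dir does not exist: " + cfg["upload_dir"])
        # Fail at startup if the URL we will eventually call is unreachable.
        try:
            urllib.request.urlopen(cfg["payments_url"], timeout=3)
        except Exception as exc:
            errors.append("payments_url unreachable: " + str(exc))
        return errors

    if __name__ == "__main__":
        problems = validate_config(CONFIG)
        if problems:
            print("\n".join(problems))
            sys.exit(1)  # refuse to start with a bad configuration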
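And for the fleet-level configuration validator mentioned above — the kind of check that would have caught the missing mount — here is a hedged sketch using osquery's mounts table. It assumes osqueryi is installed on the host, and the expected mount point is hypothetical.

    import json
    import subprocess
    import sys

    EXPECTED_MOUNT = "/data"  # hypothetical: where the DB shard should be writing

    # osqueryi lets you query system state with SQL; ask whether the mount exists.
    query = "SELECT device, path, type FROM mounts WHERE path = '%s';" % EXPECTED_MOUNT
    result = subprocess.run(
        ["osqueryi", "--json", query],
        capture_output=True, text=True, check=True,
    )
    rows = json.loads(result.stdout)

    if not rows:
        print("ALERT: %s is not mounted; data would land on the root disk or in memory" % EXPECTED_MOUNT)
        sys.exit(1)
    print("OK: %s mounted from %s" % (EXPECTED_MOUNT, rows[0]["device"]))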
These three we identified as the real causes of this situation. And once we addressed them, we did not hit the same bug again — and when I say the same bug, similar issues either. One could say maybe the bug simply never happened again, or maybe our techniques ensured we don't run into those issues again.

Coming back to the same question — we have a question from the audience. Yes: does the bad apple theory contradict blameless RCA? Wouldn't it lead to blame-driven questions after each incident? It doesn't, because what we are technically saying is — I'm not an advocate of the bad apple theory; I'm an advocate of balancing skill and ownership, and that's the point I want to focus on. If I have a reliability team, but they don't have the right authority or ownership to actually call off a deployment, it's going to result in an issue. So don't call out names; ensure that enough ownership rests with the right-skilled people, and also ensure that you invest in those skills, because that's the balance that needs to be maintained — which is what the whole idea of this point was. I can discuss it in detail at the end of the call if we still have questions. Sure — does that address it? I think it does. In case further elaboration is needed, please call out that question again at the end of the session. Thank you.

Coming back to the same curiosity, the same two questions: where else is this failing, and how do we avoid it? The avoidance here was done by writing a good RCA — good in the sense that, as we saw, we didn't hit the same or a similar bug again, so maybe it was working. But how do we move towards eliminating such failures from ever happening again? More often than not, I've seen the industry move in the direction of more managers. More managers doesn't mean fewer failures. Why? Because more managers means more people, and we just established that where there are people, there will be bugs. More people doesn't amplify my stability — if anything, if we say a system is 0.6 reliable because of one person, then with two such people in the chain it becomes 0.6 multiplied by 0.6, which is 0.36. It multiplies down; it doesn't add up. Does that mean more processes will introduce fewer failures? No. The simple question we ask ourselves — do we disallow SSH? — isn't really going to solve the problem, because those processes are only going to slow you down; they require more approvals and authentications, which result in more failures, not fewer.

So what leads to fewer failures? One of the biggest things is curiosity — the curiosity to ask this question.
While I'm sitting idle — downtime in terms of my downtime, not the system's downtime — while I don't have an active bug to chase but I'm on call, it's the curiosity of asking myself: can I run a prototype of this? Can I deliberately inject faults into my system — which is the precursor of failure mode effects analysis, the same theory, or what we now call chaos engineering? Can I take this application, try to break it in as many ways as possible, and record that down? I think I should have pasted a snapshot of one of the FMEAs we used to do — remind me when I distribute the slides and I'll add a photo there. We started running our applications against detailed failures: take the disks out; in a cluster setup, actually bring a node down and see how long it takes for the cluster to sync back; check how long the application takes to talk to the database again when the database is unreachable; or, in a non-clustered, master–slave setup, how long it really takes for the slave to get promoted, and whether there is any data loss in between. If we record these down, the answer we get is: this is the impact it is going to have. And then we can weigh that impact, because there is a cost to improving as well. This is what leads to fewer failures.

Another technique: standardization leads to fewer failures. This is where I really, really encourage everything as code — infrastructure as code, policies as code. And by code I don't mean just configuration; I mean really well-designed code. I was having this conversation with somebody a week back: infrastructure as code doesn't mean you're copy-pasting a lot of configuration variables. You really have to get into the essence of modularity and reusability — it applies there just as much. For example, if you're writing Terraform modules, ensure your Terraform code is written in a way that reuses things, because it is code after all; you can do so much with it. Don't go about copy-pasting X number of variables from one section to another, because that's only going to result in bugs. Spend some thought on it.

Policies as code: again, all your decisions need to be brought into code, onto a platform. I forget the name of the tool in the Kubernetes world — can someone remind me? Open Policy Agent, yes — someone in the audience called it out. Terraform Sentinel is one way as well, but that's paid; you can still go ahead and use it. Open Policy Agent is what's available in the Kubernetes world. Basically, policies are decisions, and decisions otherwise get buried outside the code. Think of any decision we make — "a security group should not be open to the world", "we should have two NATs across each AZ", "we should have a replication factor of three". These are decisions. The code will only reflect the how, not the why. If the why is captured somewhere, it becomes easy for anybody to follow it up later. This sets up that culture of transfer of knowledge. So those three words — policies as code — actually have a deeper impact on the culture of the entire team.
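Open Policy Agent policies are written in its own language, Rego, and Sentinel has its own as well; purely as an illustration of the idea — decisions captured as executable checks rather than tribal knowledge — here is a minimal, hypothetical sketch of the kind of gate a CI step could run against a planned change. The plan structure and the rules are made up.

    # Hypothetical policy checks run in CI against a planned infrastructure change.
    # Real tools (OPA, Sentinel) consume their own plan formats; this only shows the shape.
    def check_policies(plan):
        violations = []
        for sg in plan.get("security_groups", []):
            if "0.0.0.0/0" in sg.get("ingress_cidrs", []):
                violations.append(sg["name"] + ": security group open to the world")
        for db in plan.get("databases", []):
            if db.get("replication_factor", 1) < 3:
                violations.append(db["name"] + ": replication factor below 3")
        return violations

    plan = {
        "security_groups": [{"name": "web", "ingress_cidrs": ["10.0.0.0/8"]}],
        "databases": [{"name": "orders", "replication_factor": 2}],
    }
    for violation in check_policies(plan):
        print("POLICY VIOLATION:", violation)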
One other thing that changes and improves reliability is clear service level objectives. In the previous session we had a question about transactions per second versus concurrency. A lot of times, when we define throughput, there is a big confusion: how do I define the throughput of a system? I go to my business team and ask what throughput I should design for, and they say 10 requests per minute. Now, does 10 requests per minute get broken down into transactions per second, or does it just mean that if each request takes one second I'll have 10 of them in a minute? I can literally serve those on one single server; I don't even need to go concurrent — even with a response time of 10 seconds I could still serve those requests. Is that the throughput I'm chasing? Or is it concurrency we're chasing — at a given point of time, how many requests are open simultaneously? Ten million users using the same system at the same time is vastly different from 100,000 users spread over a span of 100 seconds where each request takes one second. It's a massive difference in design, because concurrency is what's going to kill you — it takes you into the classic C10K problem. A clear SLO makes it much easier to design systems that fail less, and this is something that needs to be cleared up front.
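A quick way to see why this distinction matters is Little's law — the number of requests in flight is roughly the arrival rate times how long each request stays open. A minimal sketch, with made-up numbers:

    # Little's law: L = lambda * W
    #   L      = average number of requests open at once (concurrency)
    #   lambda = arrival rate, W = time each request stays in the system
    def concurrency(arrival_rate_per_sec, latency_sec):
        return arrival_rate_per_sec * latency_sec

    # "10 requests per minute, 1 s each" -> you barely need one worker.
    print(concurrency(10 / 60, 1.0))        # ~0.17 requests in flight

    # "100,000 users spread over 100 seconds, 1 s each" -> about 1,000 in flight.
    print(concurrency(100_000 / 100, 1.0))  # 1000.0

    # Versus 10 million users genuinely at the same time: that IS the concurrency,
    # and it is a vastly different design problem (C10K and far beyond).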
Going further, we say that better tools are the ones that actually help you avoid failures. But the biggest question is: how do you define which tool is better? One of the questions people get asked — I was asked as well — is: you're site reliability engineers, why do you need to keep building these tools? There are so many tools off the shelf that you can just take; it's just plumbing one pipe into another and your job is done. The problem with that is that most of our tools are Clever Hans.

Clever Hans is an interesting story. Around 1904 there was a horse, Hans, who lived in Germany, and there was a claim that Hans could do arithmetic — computation, even floating-point calculations, and a lot more. People first thought it was a hoax, so the owner went around the whole of Germany taking the horse on exhibition, and different evaluators were brought in. It turned out Hans could actually do it: regardless of the owner, irrespective of who the evaluator was, Hans could solve mathematical problems and simple questions. Somebody would ask Hans, "if the 10th is a Monday, what date is the Friday?" and Hans would show it's the 15th. How did Hans show it? He would either tap on the floor 15 times or wag his tail, and there was the answer. Interestingly, questions could be asked in writing or spoken aloud and Hans would still get them, and this stumped a lot of people. It went into a lot of psychology studies. So how did Hans do it? Hans had learned to read human emotion, every time. And how did they validate this? They ran a simple experiment: when the observer stood in front of Hans, where Hans could see him, Hans got about 95% accurate results. Then they put the observer behind Hans, where Hans could not see, and accuracy dropped to 5% or 6%. The reason was that Hans could read anxiety in humans. If the evaluator knew the answer, Hans knew the answer; if the evaluator didn't know the answer, Hans didn't know it either. Every time a question was asked, Hans would start tapping; the evaluator was bound to get slightly anxious and excited, and by the 14th tap the tension would peak, on the next one it would release, and Hans would stop. Very smart horse. No one quite knows what happened to him after the World War; the last records are from around 1916.

But coming back to the point: most of our tools right now are like Clever Hans. When I look at Grafana, when I look at my dashboards, I'm only looking for the ways I already know a system can fail — and obviously, once I know how a system can fail, it won't fail that way again, because now I've fixed it. So all those graphs made to catch failures are only there for the first time, not the second. It's like the old joke that you don't need a parachute to skydive — you need a parachute to skydive twice. So our existing tools limit us to what they offer; they only show as much as I already know, and they won't take me anywhere beyond that. That's why we need to keep building tools that are able to predict and to think beyond what we ourselves are able to analyze and process — and this is why you see the industry slightly shifting, with a big word going around that site reliability engineers need a lot of machine learning and rule-based systems. This is exactly why: we don't want systems that are merely Clever Hans.

Coming back to it: there are no better tools, there is only better usage of the tools. I'm going to go back to asking a bunch of the questions we started this talk with — it was written in the description as well — which one is the better tool? So let me ask: which do you think is the better tool to use here, overall? I wish I could vote on this one. Go ahead, you can say it, you can influence people. No, no — I did not set that option, to be honest. Pulumi is new, Terraform is old.
We see a hand being raised — okay, we didn't see that earlier. If you have a question, go ahead and shoot. I think it's an accidental raise of hand, probably. Okay, cool. Five more seconds and we can stop the poll.

Interesting results. I think I can stop talking — you've already got the gist of it: 77% of people think it depends, 23% say Terraform, and nobody has actually voted for Pulumi. To be honest, I would have said Terraform, interestingly — it's that theory that if the only tool you know is a hammer, everything looks like a nail. Before I answer, I'll quote this tweet — it was just frustration I was venting a few days back, because we were bringing up some infrastructure repeatedly: no amount of Terraform or Pulumi can overcome the fact that if you are bringing up infrastructure repeatedly, there are going to be new failures each time, no matter how many times you've done it before. It doesn't eliminate anything.

But coming back to how we choose — am I going to leave the answer at "it just depends"? No; here is a framework to evaluate it. Let's look at the design choices carefully: what does Terraform give, and what does Pulumi give? Pulumi's belief is that everything can be exposed as an SDK, so you can write proper code — logic that is deeply complex — and that gives you an insane amount of power. You can write it in Python, you can write it in TypeScript, you can use Golang and a couple of other languages, and the SDK is available, which is great. However, there's a catch: the moment Pulumi does all of this, I'm raising the bar for the people who can get in and write that content. I'd be overlooking the fact that Terraform started as a simple DSL — and this comes from a conversation I was just having with Saurabha Dahi: every DSL either stops being used or lives long enough to evolve into a programming language, and that's what Terraform's DSL is doing. But at least it's very well documented; the use cases are narrower, so you can find plenty of cookbooks and recipes out there to get the job done. That is not yet available for Pulumi. On the other hand, if you are a power user, you will find Terraform very restrictive.

So you need to answer this question yourself: what is your team made of, what is your culture made of? If you have a team that is already perfectly capable of writing this kind of code and just wants another extension of it, why would you not go with Pulumi? You would. But if you think, "I have a team which cannot dedicate time to this, and this is a skill I need to hire from the market," you will find it much more easily with Terraform, because it's a battle-tested tool that is widely run. So do we say Terraform is better? Not really — because say you now hit a bug in Terraform. Since it's a DSL, some things are genuinely hard: the simple case of M-by-N instances — you just cannot get it right in Terraform. When you change one, it's either going to add one more in the wrong place or add one more instance incorrectly; you cannot simply do a plus or a minus. With Pulumi, for the same problem, the onus is on you — it gives you more power, and now I'm responsible for my own bugs. It's just an SDK, and if I want a try/except I can have one easily. Just the other day we ran into an issue with Terraform: the remote state won't work if there's a third-party role ARN to assume — it just won't work. If I'd had something like Pulumi, I could have handled it better, because the code is in my hands.
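For a flavor of what that looks like, here is a minimal Pulumi sketch using its Python SDK (it runs inside a Pulumi project via pulumi up). The AMI ID, instance names, and sizes are purely illustrative; the point is that an ordinary loop, and ordinary error handling, replace count/index gymnastics.

    import pulumi
    import pulumi_aws as aws

    # An ordinary Python loop instead of count arithmetic; adding or removing
    # one server is just a list edit, without renumbering anything else.
    instance_names = ["api-0", "api-1", "api-2"]

    for name in instance_names:
        server = aws.ec2.Instance(
            name,
            ami="ami-0123456789abcdef0",   # illustrative AMI id
            instance_type="t3.micro",
            tags={"Name": name},
        )
        pulumi.export(name + "_public_ip", server.public_ip)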
So, coming back to it: it really depends, and these are the choices you will have to make, with a bunch of factors to weigh. The simple answer is that neither of the two will solve the problem if you are not using it the right way — it's the usage of the tool, not the tool itself. Similar questions: is Ansible greater than Chef, or Chef better than Ansible? Not really. If you have a team that is already very capable of writing Ruby, don't reinvent the entire stack; if you have a team very capable of writing Python, that's a different choice. Is Prometheus greater than InfluxDB? That leads to the bigger question which is coming up next — which one is the better tool — so let me park that.

This is the poll I want to take here: Elasticsearch versus Humio, a very interesting question I get asked. When we evaluate these tools it's also important to understand the breed of the tool: Elasticsearch is an indexed logging system; the Humio folks are the advocates of index-free logging systems. Which of the two would be better? Index-free logging is basically like: I give you raw access and you have a bunch of greps running on it, as simple as that. Would you rather run those greps, or would you rather write those Elasticsearch queries?

Okay, I think we can stop the poll — and while the poll was going on, someone asked a question in chat: Piyush, are tool wars really the crux of SRE? That will be answered on the very next slide; I think my slides were neatly ordered. Okay, cool, ending the poll. So 75% of people believe indexed logging is the right thing to do, 20% believe index-free logging is the right thing, and 7% believe it depends. I would like to go with the 7%. But Saravan, what do you think — which one would you prefer? I would have said indexed logging, primarily for a similar reason: never used Humio, know Elasticsearch.

Right, and that's a great point. I forget who said this: do boring things. There's a very famous phrase — do boring things — because reinventing these really doesn't help; do what you know best. So if the majority of you have been using indexed logging for a fairly long time, just continue to use it. But at the same time, be cognizant that there are inherent challenges that come with every single tool. Elasticsearch does the job well, there's a massive ecosystem out there — you set up Elasticsearch, you get Kibana, you get everything around it — and it works really well. If it is something you are very capable of running, run it. It also depends on the scale you're running at: for example, the footprint that Elasticsearch grows into after a while is going to be massive.
I haven't seen a single person who has stayed happy with Elasticsearch over a prolonged duration of two years — that's almost its expiry time: how does this thing scale? And that's the primary pitch of the index-free people: it doesn't matter how much data you dump in, a simple grep — massively parallel grep commands — will be fine. It also depends on who you have on your team: if you have a team that is massively skilled in Kafka, Humio would be an excellent choice — you already have Kafka, you can plug into it, and you just need a pretty user interface on top.

So, to recap and flow back to the question: when we ask Terraform vs Pulumi, Ansible vs Chef, Prometheus vs InfluxDB, Elasticsearch vs Humio, and a lot more, what we are really asking is that there is a different layer of reliability for each piece of software. If we look at the essence of it, infrastructure is the first layer, and that's the crux of the whole site reliability story — this array of tools is going to keep growing. If I'm a company whose infrastructure goes up and down, I shouldn't be surprised that it's going to fail at some point, and I shouldn't assume it's perennial, that it will last forever. I tell people: every time you build a system, build it on the assumption that it won't be there tomorrow. What I mean is: don't treat any single server you're storing data on as permanent, because it's going to wipe out, it's going to destroy itself when you're not looking. So design your systems in a way where repeatability and idempotency are baked into every step of the way. Idempotency is really key, because if something can be brought back to the same state it was in, the expenditure — both money and time — to keep it maintainable drops massively, because it has become a commodity: fine, the machine went away; the infrastructure script runs again and it comes right back.

I remember I was once in charge of running some VPNs, and every night at 3:50 the VPNs would go down. Every night. I don't know what it was — maybe some route table used to get flushed or something — and this went on for a month. Then I asked myself: can I just automate this? And I did. Initially I had taught myself the muscle memory to press enter in my sleep and it would get done; later I designed a small Slack command that would do it. You keep automating it. Point being, this is a layer you need to take care of.

Above the infrastructure layer is the configuration management layer. Do not change these things by hand — if you see yourself doing that, stop. Sure, with one or two servers it doesn't make a difference, but with a massive set of them it makes a big difference, because you would remember today, but these products, these companies, last beyond people. Tomorrow there will be somebody you need to pass this information to, or you will move on to do better things and somebody else will need to take care of it — so that decision, that policy, has to go into the code.
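As a flavor of the "automate the 3:50 a.m. fix" idea above, here is a minimal, hypothetical sketch — the health URL and service name are made up, and the check-then-act shape is what makes it safe (idempotent) to run from cron over and over.

    import subprocess
    import urllib.request

    VPN_HEALTH_URL = "https://vpn.internal.example/health"  # hypothetical check endpoint
    SERVICE = "openvpn"                                      # hypothetical systemd unit

    def vpn_is_healthy():
        try:
            urllib.request.urlopen(VPN_HEALTH_URL, timeout=5)
            return True
        except Exception:
            return False

    # Idempotent remediation: it only acts when the check fails, and restarting
    # a service that was just restarted is a no-op, so cron can run this freely.
    if not vpn_is_healthy():
        subprocess.run(["systemctl", "restart", SERVICE], check=True)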
The next point I bring out is observability — one of the very hyped keywords out there. How do we actually do it? Observability just means that, no matter what infrastructure you use, your system is actually observable — and then you take a call on what the right tool is for you. What that means: I analyze a lot of metrics, where I look at trends and spikes every day. I really like to ask: is this abnormal for 7 p.m. on a Tuesday night? Is this abnormal for 5 a.m. on a Monday morning? These are the questions I constantly ask myself. How do I answer them? By building some sort of pattern on the data — even if it's nothing but a simple bell-curve analysis over a handful of data points. Say I'm taking the data for Monday morning 5 a.m.: how many Mondays are there in a month — four or five? Multiplied over a year, that's still not many points; you need data over that kind of period to do any sort of analysis. Get into the habit that your system should be able to answer your questions. That amount of observability — the curiosity to ask your system these questions — is where both of those tools come into play, because you can't keep an entire year's worth of metrics, which come in at a terabyte a month, in the one store you're familiar with. You will have to design tools, technologies, and strategies that go beyond these, so this array is going to expand — to a point where you might run both of them, because they solve different purposes at different times.

Same thing with logs: logs carry so much valuable information — so ask yourself, is my logging even worth it? What am I logging? A lot of times I've seen people log plenty of information but miss something as simple as a thread ID, so they cannot tell which log lines belong to which request. In a concurrent system with hundreds and thousands of requests coming in, you won't be able to tell. So it's not which tool you use for logging — the quality of logging matters far more than the tool. What do you do with those logs? Do you analyze them later? Because if they just come in and go away, and you only refer to them when something fails, by that time you may not have the right information — maybe you're not even logging the correlation ID. And a lot of the time I've seen people log really sensitive customer information into those logs as well; the moment we develop that habit, there's a price we pay, because now our logs have to be very well protected, and each layer of protection comes at its own cost, with a different policy around it. So this is the stack we divide reliability into, and then we pick tools and choices around it and see which one fits where. But again, the question is not what tool we use, but what culture we have around it.
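A minimal sketch of the log quality being described — the handler and fields are hypothetical, but every line carries a correlation ID so that, in a concurrent system, the lines belonging to one request can be pulled out, and no customer data goes into the line.

    import logging
    import uuid

    logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
    log = logging.getLogger("api")

    def handle_request(payload, correlation_id=None):
        # Reuse the caller's ID if one was propagated, otherwise mint one.
        cid = correlation_id or str(uuid.uuid4())
        log.info("cid=%s start keys=%s", cid, sorted(payload))  # log field names, not values
        try:
            ...  # actual work goes here
            log.info("cid=%s done", cid)
        except Exception:
            log.exception("cid=%s failed", cid)
            raise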
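And going back to the "is this normal for 5 a.m. on a Monday?" question above, a hedged sketch of that simple bell-curve check: compare the current value only against history from the same hour and weekday, and accept that with four or five Mondays a month you need many months of data before it means much.

    from statistics import mean, stdev

    def is_abnormal(current, same_slot_history, k=3.0):
        # same_slot_history: e.g. requests/sec at 05:00 on previous Mondays.
        if len(same_slot_history) < 8:
            return False  # not enough data to judge yet
        mu = mean(same_slot_history)
        sigma = stdev(same_slot_history)
        return abs(current - mu) > k * max(sigma, 1e-9)

    history = [120, 131, 118, 125, 140, 122, 128, 135]  # made-up Monday-5 a.m. values
    print(is_abnormal(260, history))                    # True: worth a look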
Summarizing this in a line: culture will eat tools for breakfast, lunch, and dinner. If there isn't a culture of curiosity and of aiming towards reliability, tools are not going to help — they will only assist the cause, but the cause has to be there. If I had to summarize how you attain reliability — or at least how it has worked for me, and may work for you as well:

First, build a culture. Instill the curiosity that failure is going to happen — what can we do about it? There's a very good, very old paper by the MemSQL folks where they asked: if you go to war, what actually wins it — assuming you know it all, assuming you're the smartest, or having the best reaction time? Assuming you know it all is a fallacy — it's never the case; there is always somebody who knows it better than me. Assuming I'm the smartest — never the case either; the definition keeps moving, and as a species we're getting more intelligent by the day. So, not true. Reaction time is the best — reaction time is...

Sorry, we have some audio issues on your end, so maybe switch off your — no, that's okay, everything came through, but there were some issues. I'm on audio now. Okay, cool.

Next, adopt these tools and embrace standardization. Adopt a tool into your strategy and standardize it all across. If you have a design decision where you say, "I understand this is a compromise I'm making for tomorrow," make sure you standardize it anyway, because then you can come back and fix it. Do very frequent and very detailed RCAs — don't just say what failed, but also why it failed and how it will improve. Improve this culture by sharing that knowledge across the team; keep the knowledge open, don't hide failures. What we tend to do early in our careers is to keep failures to ourselves. That's not right, and the onus isn't only on us but also on whoever we share the failure with: if you're a senior running a team right now, build a culture where you allow people to come forward, look at the failure, and do an in-depth, brutal analysis of why it failed and how to prevent it from happening again. Sometimes those conversations are really tough, because a lot of the time we're not able to detach the person from the problem, but one has to do it — that's a culture that needs to be built as well. And lastly, accept that there are going to be failures. They are inevitable, and one important thing here is that we don't learn from success as much as we learn from failure.

So that's pretty much it. There's a link here as well to our fortnightly newsletter, which you can subscribe to — we're pretty open and vocal about RCAs and would like to share them.

That was a very interesting session, Piyush; I think the polls really gave weight to the audience's input as well. Do we have any questions from the audience? You can paste them in the chat. While someone calls out a question, I think we can revisit the earlier one: does bad apple theory contradict blameless RCA? I would like to speak to that a little bit. Basically, bad apple theory says that your system would be safe if it were not for those few unreliable people in it, while blameless RCA says that when you are doing an RCA, don't blame the people, but rather
find the root cause, and the technology gap at the base of it. So I would say that bad apple theory is a foundation for blameless RCA rather than a contradiction of it — but maybe Piyush can shed some more light.

That's exactly it, rightly said. As I keep going back to the same point: are the right-skilled people holding the right set of ownership? Beyond that, the blameless part is that the question is not "who did it" — the question is "why will it not happen again?" It's not even "why did something happen?" I learned this from a friend of mine a while back; he had a unique way of conducting RCAs. When you approached him with a problem, his first question would be: "why do you think this should have worked in the first place?" When you ask that very simple question, you unravel and walk through the layers of why it should work — because I've done this, and I've done this, and... wait, have I done this? A very simple question: tell me exactly why you think this will not fail again. That is what brings out the essence of blameless RCA, and it also moves you towards making sure these tools are built in the right stride and direction. It helps.

Alright, so while questions are coming up, there's one more: how can you try to change the culture in an organization when you are not in a senior position? This is in reference to trying to make the culture more open and more knowledge-sharing — so it's mostly about how to influence without authority. Piyush, you should be able to speak to that.

It's a slightly tricky question, actually. Why would a culture not change? A few reasons. One: people don't know what the improvement is. Two: they buy into the improvement but don't know the way to get there. Three: they don't want to improve at all. That last one sounds negative, but the reason behind it could also be "we don't have time for it right now, we'll come back to it later" — and the final variant, which I don't want to discuss, is where they think it's simply not worth it. That's a very toxic one, so let's eliminate it, because that's something you can't change; there is literally no way to address a situation where somebody doesn't want to improve despite knowing they need the improvement.

So, the first case: they don't know what the improvement is. Here, junior or senior makes no difference, because you can very well go in and educate; be forthcoming about the fact that "I've learnt this new thing, this may help," because a lot of the time we simply don't know — I learn so many things every single day because there are so many things I just didn't know. The second case: "I understand, but I don't know how to get there." This is where you take the initiative — use a bit of your off time, your weekends, come back with proofs of concept and say, "look, I did this, and this is how it can be improved," and then see how people follow. This is your own growth path as well, because if you do this a few times over, you won't remain a junior. The third case — I'm focusing on the word junior here because you brought it up — is where they understand, but
they say, "look, we don't want to do this because it's not the right time for it." That's a valid answer, because maybe what you think is the right thing doesn't fit into the scheme of things right now. Every reliability effort requires code to be written, more testing to be done, and culture to change — somebody has to change a habit of how they've always done things for the new pattern you're suggesting. That requires adoption, and you don't want to bring down an existing system while there are other business goals to chase. The pragmatic thing is to take that into consideration and agree to come back to it in a few months' time. And the last case: if you know you're right, and they know you're right, but they still don't want to change — I won't say it, but maybe it's time for you to change.

Alright. I would also add to that, because I've been in situations — I was lucky enough to be in situations — where the culture was directly tied to an automation output. For example, when encouraging people to use infrastructure as code, you can show results and move on. But, as Piyush rightly said, if everyone knows everyone is right and nothing is happening, then you have to take the hard way out.

Alright, one question in the chat: do you think DevOps should have the same time-budget limitations as developer projects? And a follow-up: it has been challenging, since there are too many dimensions and unknowns in tools and versions. So basically: should we, and if yes how should we, do time estimations for DevOps projects?

I think I said this in the first talk and I'll say it again: reliability has to be seen like a product; it cannot work any other way. What I mean by that is there is prioritization in reliability as well — you do not go from 0 to 100, you do not go from 90 to 99.999 in one step. It's an incremental journey; you can refer to that ladder of the journey of 9s that I drew earlier. You take a stab at it, you go incrementally: change a few things, set them in motion, then address the next thing. And when you look at it like a product, time budgets are always part of it, because that's the first thing we do in product sprint planning: these are my tasks, this is the cost — small, medium, large, days, hours, whatever planning method you use — and then it gets rolled out as per a schedule, and even that rollout goes via alpha and beta. The same holds for reliability; it is no different. Reliability does not mean you've got a new tool and you just deploy it out there. It goes through the exact same cycle: take a stab, apply the 80-20 principle all over — what is the least amount of work I can do that gives me the maximum benefit, keeping the long-term horizon in mind — reusability, every single aspect of software engineering that you've studied or read about applies here as well.

One more question, around culture: what should my strategy be around SRE improvements within my organization — like choosing the right tools, apart from culture? So basically, what strategy to use when evaluating tools. Again, this comes back to the same problem: how do we choose whether to use Python or Ruby for the next API?
Alright. One more question around culture: what should be my strategy for SRE improvements within my organization, like choosing the right tools, apart from culture? So basically, what strategy should be used when evaluating tools?

Again, this comes back to the same problem as how do we choose whether to use Python or Ruby for my next API; it is a very similar question, because it depends on a lot of factors. One: familiarity. What is my existing team made of? That makes a big difference to adoption. I'll cite you an example. This was 2012, I think; there was a startup we were doing and we had started hitting a certain amount of load. The root cause analysis we did showed that we were using a dynamically typed language, Python, and a majority of our bugs were strings arriving in place of ints, or schemas that were not valid. In 2012 we didn't have this culture of schemas everywhere; that was a long time back, almost a decade from today. A lot of RPC mechanisms existed, XML among them, but XML was too clunky. The lack of these things meant our customers were catching the bugs, which we didn't like at all. So the decision was: what language do we choose? We zeroed down on Golang. We had the choice of picking up Clojure as well back then; the reason we picked Golang was that the team was all C++ and C developers, we were all familiar with Python, and the choice of introducing an immutable, functional paradigm seemed too expensive, no matter how lucrative it was.

Drawing a corollary and a parallel here, tools go through the same evaluation criteria. What is the team made of? What problem are you solving today, and what can be adopted tomorrow? Let me make it more tangible, for when I am evaluating one tool over another. If your immediate need is to do certain rate analysis, I would say just install Prometheus; don't even bother about installing InfluxDB. But at the same time, as I said, every decision is a tradeoff: if you're doing batch analysis of existing data, Prometheus won't help, because in Prometheus you cannot add records with older timestamps; InfluxDB can do that. So the requirement and the usage that you have will automatically start driving your tools. What we do is basically create a fishbone kind of thing, a matrix, where we plot it all down. We do that very often; in fact we did this yesterday, when we had to pick a graph database and ran into the choice of which one to use. We list all the candidates, we draw the dimensions of the selection criteria: query language, ingestion, durability, how the customer interacts with it, and so on. You lay these down, starting with your non-negotiables, and then you see which one fares better than the other. Now, in case of a tie, where two tools tie with each other and we still can't decide, we start looking at maintenance and infrastructure: what is the footprint it is going to take up? If the footprint is also a tie, then we ask whether there is a cloud provider available for it which suffices all my use cases, so I can just offload the problem. I'm describing the multiple decision fragments that you go through, and I would suggest you go through something similar as well. I probably took longer than needed to answer the question.

No, I think doing the matrix is more useful in scenarios where there are opposing views on products; when you put them on paper and give numbers to them, you really substantiate them, and that helps. But before that, the bigger thing is: what are your non-negotiables? Know those.
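As a rough illustration of that selection matrix (the tool names, criteria, weights and thresholds below are all invented for the example, not from the talk): the idea is to hard-filter on the non-negotiables first, and only then compare weighted scores.

    # Toy tool-selection matrix: hard-filter on non-negotiables, then rank
    # the survivors by weighted score. Candidates and scores are illustrative.
    candidates = {
        "tool_a": {"query_language": 4, "ingestion": 3, "durability": 5, "ops_footprint": 2},
        "tool_b": {"query_language": 3, "ingestion": 5, "durability": 4, "ops_footprint": 4},
    }
    weights = {"query_language": 3, "ingestion": 2, "durability": 3, "ops_footprint": 1}
    non_negotiables = {"durability": 4}   # e.g. "must score at least 4 on durability"

    def qualifies(scores):
        return all(scores[criterion] >= minimum for criterion, minimum in non_negotiables.items())

    ranked = sorted(
        ((name, sum(scores[c] * w for c, w in weights.items()))
         for name, scores in candidates.items() if qualifies(scores)),
        key=lambda pair: pair[1], reverse=True,
    )
    for name, score in ranked:
        print(name, score)

If two candidates still tie, you would simply extend the criteria with maintenance footprint and managed-service availability, exactly as described above.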
Right. Okay, one more question: how do you incorporate SRE into data, research and ML teams, where the work is more experimental and tooling is less settled?

I would say I've been fortunate enough to have worked on a reliability cause with all three of them in one of my previous jobs. Data teams first: reliability in data teams is massively important; we just undervalue the use case. I'll tell you why. Data pipeline jobs are mostly asynchronous, and the problem with asynchronous work is that it's like an arrow that has been shot: you cannot reclaim it, it's not a boomerang that is going to come back. If you run a bad ingestion pipeline over hundreds of gigabytes while you're not watching, the data is gone; it has already been ingested into the system. So it's very important that you build a decent amount of guardrails around the quality of the data that comes in. Just as we have SLIs and SLOs on request infrastructure, we also have them on data infrastructure: how many records came in after a job was run, how many records were produced, how many were written out, how long did it take, what was the cardinality of the record set that came out, how many cores did it consume. All of these are very important questions, and they need to be asked periodically, because each one gives us an answer: did the data arrive on time or not, did the data exit on time or not. While these systems are asynchronous in nature, any delay in them is going to massively degrade something else that is waiting on them. So that's the data part.

Research and ML, same thing: reliability is important because if somebody is experimenting with the data, the quality of the data becomes equally important, and what they are changing in an algorithm also becomes important. What that means is, if certain features were being extracted in a job and the number of features has altered, you need a way to measure it, because that instantly tells me that one of my recent pushes has degraded the number of features being captured for a certain data set. The same principles apply. I am happy to talk more about this, but it is a broader topic and I have only given you very high-level touch points. I think Google's SRE workbook has a chapter around something very simple, which is handling data pipelines; I would recommend that you read it. It's available online, you don't need to buy the book.
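To make those pipeline SLIs concrete, here is a minimal Python sketch (the field names, thresholds and the job/alert hooks are assumptions for illustration, not anything from the talk) of recording per-run numbers and checking them against simple SLOs:

    import time

    # Illustrative SLIs for one pipeline run: compare what went in against
    # what came out, and alert when the run breaches a simple threshold.
    def run_job(job, records_in):
        started = time.time()
        records_out = job(records_in)
        duration_s = time.time() - started

        slis = {
            "records_in": len(records_in),
            "records_out": len(records_out),
            "duration_s": duration_s,
            "cardinality_out": len({r["key"] for r in records_out}),  # assumes each record has a "key"
        }

        # Toy SLOs: keep at least 99% of records and finish within an hour.
        if slis["records_out"] < 0.99 * slis["records_in"]:
            alert("dropped more than 1% of records", slis)
        if slis["duration_s"] > 3600:
            alert("job ran longer than expected", slis)
        return slis

    def alert(message, context):
        # Stand-in for paging/metrics; in practice this goes to your alerting system.
        print("ALERT:", message, context)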
Cool. Alright, one more question: how can we ensure we have SRE in place when we have multiple projects, many of them with different technologies, like AWS and GCP, or using Ansible and Chef? Should we focus on making small tools around that, or can we go project by project?

I can speak a bit to that because I have been in that situation. I worked in a team where we used both Ansible and Chef, and not because they were the right tools for the right situations; it was because one guy hated Python and the other guy hated Ruby. If that is the reason you are in such a situation, it's mostly a culture thing. However, as Piyush mentioned earlier, standardization on tools is important, rather than chasing that extra 1% of features. If you genuinely have different technology stacks for valid reasons, like AWS and GCP with different Ansible and Chef setups, then in my opinion you can only build expertise on one thing at a time, deliver it, and then move on to the next thing. But Piyush, if you could speak a little to that, it would help.

I don't think I have anything to add; you summarized it really well. All I would say is that in such cases the need for reliability becomes far more important. In a heterogeneous environment, reliability has a far bigger role to play than in a homogeneous one; homogeneity is usually the result of reliability practices being applied to a heterogeneous environment. So I would say start yesterday.

Great. We have a few questions coming in on YouTube as well. One of them is: what should be the parameters for an MVP for site reliability? So mostly, what are the minimum things you would need if you put an SRE team in place?

Usually I go about it the same way. First, as I said, I put down on paper what my non-negotiables are, based on what has been failing. To make it more concrete, there are the usual dimensions on which you evaluate a system, the ways it can fail: durability can fail, correctness can fail, uptime can fail, freshness can fail. These are a few parameters on which you classify failures: which one is getting affected the most, which one do you want to fix first. It is a very data-driven decision. So, to answer your question, step one: measure. Measure your failures and the impact of each failure that is happening, and classify them to find the broadest category, so you can say that 80% of my failures belong to the same cause or the same category. For example, say we're building Twitter: I make a like, and I am charging people based on the number of likes, but when people refresh the page they don't see the likes. This is freshness of data. If that is dear to your business, the very first thing you invest in is making sure freshness is always measured; the first tool you build is the one that catches freshness problems. If you are a payment system, you cannot have incorrectness in the data, or you need uptime; those are the first things you measure, and you check whether your errors are happening around them. Measure that, fix that. That is how you go about building an MVP: measurability first. The other thing is that reliability is a three-part exercise: first measurability, then control, then observability. You first measure something, then you control it to reduce the errors and bring it to a bounded state, but to do that you need a massive amount of observability, and the tools you build fall into one of these buckets. Measurable and observable are really two parts of the same thing, so it's a circle. Okay, I hope that answers the question.
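As a hedged sketch of "measure freshness first" (the probe key, the write/read hooks and the one-second SLO below are invented for illustration): a freshness SLI can be as simple as writing a marker through the write path, reading it back the way a user would, and recording how stale the read path is.

    import time

    # Toy freshness probe: write a timestamped marker through the write path,
    # read it back through the read path, and record the observed staleness.
    def check_freshness(write, read, slo_seconds=1.0):
        written_at = time.time()
        write("freshness_probe", written_at)

        observed = read("freshness_probe")
        staleness = time.time() - observed if observed is not None else float("inf")

        if staleness > slo_seconds:
            print(f"ALERT: read path is {staleness:.2f}s stale (SLO {slo_seconds}s)")
        return staleness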
One more question, from Ankur on YouTube: how do you use synthetic monitoring of apps? This is mostly around New Relic and that kind of tooling.

For the synthetic monitoring that we do in our case, we mostly use New Relic and whatever it provides by default. However, to speak to the broader domain of synthetic monitoring, I think Piyush might have a better perspective on that.

I wouldn't say there is one single choice I would pick over the other. The broader classification here is, again, what level and what nature of monitoring do I really require, and that depends on what failures I have had. So I actually don't know how to answer this in the abstract; my view is that it's good, and you need it where you need it. Maybe a more concrete question would help.

I can speak a little more in detail about that. Synthetic monitoring mostly involves trying out customer paths on a website or an app and seeing how they behave, usually using toolsets for that. The good part is that the closer you are to customers, the more behavior you can see and the better your SLOs can be mapped, and that really helps. However, the sheer number of customers and their spread is so large that trying to get a single number out of those observations is hard. So yes, we use synthetic monitoring; it's good, but that limitation is there.

And this is where my answer may look like an endorsement of Last9, but it does go back to the reason for doing all of this: to ensure that our customer actually gets the best experience. If we don't observe a system from the outside in, the way a customer is going to look at it, we don't know which aspect of it is failing at a detailed level. I'll give you an example. I was speaking with one of the biggest television houses in the Netherlands, and they had a very interesting requirement. They deliver videos, and because they have a lot of foreign media coming in, the level of accuracy they need to test is whether, when a video is delivered, the subtitles are actually playing in sync, because subtitles are usually overlaid. How do they do this? They first run simulations on different kinds of ISP backgrounds, they run screen readers where they play a video, and then they periodically measure, each time a subtitle is laid, which time frame the cue of the video is at. They are that particular about it, because with foreign media content it really spoils the experience if the subtitles go out of order. The point I'm coming back to is that observability of these systems is massively important, and it requires a great depth of coordination as well. Some of the existing tools out there in the market don't help at that level at all. There is nothing out there which can connect the journey from my browser hitting the CDN until my request hitting my server and tell me where exactly I lost quality or what exactly is failing. I don't know if any synthetic monitoring tools out there are capable of doing it at that level, and that's where site reliability engineers step in and have to build a lot of this in-house, over and above what the tools provide. But then they run into the problem that vendors don't give out data, you can only work on their platform, and so you need to bring in that data yourself. I probably took a very long stab at your question.

I think you did answer the question and added new perspectives as well; the subtitle example especially was very interesting.
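A minimal sketch of the kind of home-grown, outside-in check being described (the URL, expected text and latency budget are placeholders, and this stops well short of the CDN-to-server correlation Piyush is asking for): replay a customer path from outside and fail on wrong content, not just on the host being down.

    import time
    import urllib.request

    # Toy synthetic check: exercise a customer-facing path from the outside,
    # failing on wrong content as well as on slow or missing responses.
    def synthetic_check(url="https://example.com/login", expect="Sign in", budget_s=2.0):
        started = time.time()
        try:
            with urllib.request.urlopen(url, timeout=budget_s) as resp:
                body = resp.read().decode("utf-8", errors="replace")
                latency = time.time() - started
                ok = resp.status == 200 and expect in body and latency <= budget_s
        except Exception as exc:
            return {"ok": False, "error": str(exc)}
        return {"ok": ok, "latency_s": latency}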
We don't have another question, but someone has commented on the earlier point about the bad apple theory; there seems to be a gap there, maybe you can fill it. The comment says that the Google SRE material says the bad apple theory is demonstrably false, specifically in relation to RCAs.

I think there is a misinterpretation here. What we are essentially trying to say is: trust people. Then people will be more inclined to fix the problem, to attack the problem, rather than get into "he did it versus she did it". So what we are saying is exactly that the bad apple theory doesn't work. I am happy to have a detailed chat on this if you still think there is a misinterpretation here, or on any other point.

Okay, alright. I think we are on time and the questions are done. To sum up, there is one takeaway I would like to share with the audience, which was mostly an eye-opener for me: when you said that a system's observability is important, and that the tool you use for observability is not as important as the system's ability to be observable. That reminds me of the time when, for any tool that went to production, we used to put in unit tests, and someone told us that unit tests are overrated because an integration test could still fail, which is true; however, writing code so that it is testable is more important than any discussion about the right unit testing framework. So the biggest takeaway for me from this talk was: Ansible versus Chef, Terraform versus Pulumi is always going to be there, but rather than looking at the next shiny new tool, if I can take whatever I have, work with it, and weigh its limitations against the tradeoffs I get out of it, that is the right approach. Then, whenever I do move to the next tool, I will at least have extracted everything I need from the current ones.

Okay, alright, thank you everyone for joining in. We will make the slides available after this session, and if you have any other queries, you know Piyush's Twitter handle. This is on the HasGeek site, and you can also call out with the Rootconf hashtag. That's about it. Thank you very much for your time. Thank you very much.