Awesome. All right, thank you for having me here. It's great to be back. I was kind of hoping for the rock-star style entrance where I walk in and everyone goes, "it's good to have you back," but that didn't happen, so I'll skip that part. For those who don't know me, I'm Yagnik. I've recently moved back to India and I work at Snapdeal on cloud and infrastructure. My job is essentially to scale their operations.

Also, every time I go home after a talk, both my sisters are on my case saying, "you never mention us." I had huge debates about this, it's a technical talk, my siblings have no place in it. This time they made me promise that when I got on stage I would mention them, so, quote unquote, don't judge me: I'm very thankful to my sisters, and I exist because of them. Clearly they do not understand human reproduction. And you're the best.

Now that we're past that, there are a bunch of topics I wanted to discuss: discovery, failures, and a whole set of issues around how you scale infrastructure and how you care about operations. There have been quite a lot of good talks already. I saw a few this morning that were amazing, and Antoine is sitting here, who talked about crash software, which was also awesome. So I don't really have to cover those; people much smarter than me have them covered. What I did want to cover is the one aspect nobody else is talking about, which is the cause of roughly one-third of failures in operations across pretty much every organization. I'm basically pointing at all of you, including myself: humans cause a huge share of the failures in most systems. So how do you scale humans? Not in the reproduction sense; I mean humans and operations, working together.

As always, I'll start with a story, to make you feel my pain, so you understand what I want to say and why I'm saying it, and that I'm not just some crazy guy shouting "this is how you should do things." I'll tell you the story of how Snapdeal came about, how it scaled and why it scaled that way, and the pain points we started feeling as we scaled humans and operations. Once we're there, we'll walk through the possible solutions: how we solved it, or rather how we're thinking of solving it, how far we've gotten, how others have done it, yada yada yada. That's the plan. The other thing I really wanted was a cool wing chair in the middle of the stage where I could sit and go, "come around, kids, let me tell you a story," but that didn't work out either.

So, going ahead: Snapdeal was started by these two guys, Kunal and Rohit, way back in time. I say "way back" to give the story some weight; really it's not that old, it's fairly new. These two came along and said, "we need to build something in e-commerce." Great, go for it. Except e-commerce wasn't really a thing in India at that point, so they actually had to build the product. And to build something in e-commerce you need to fulfil the "e" part, the electronic part, and lo and behold, you need developers and people to make that thing. That's what they did: they set out to build something along the lines of a deal-based website, and for that they needed people.
Not that hard, right? You need some developers to start building the product, so they bring in a few developers who come in, write code, build features, build whatever tools you need, and put code into production. Initially it looked like three or four people punching out code, pushing it to production and running it. They didn't really care about reliability or how fast they were moving things, nothing like that.

But as Snapdeal started growing, it got money from VCs, the whole funding thing entered the startup, and they wanted to scale even further. They added more people, and that meant more commits going to the servers running production code. It wasn't just one or two people anymore; there were around 20 or 30 people constantly pushing code to production servers, and more often than not they were stepping on each other's toes. For example, I would go in (I wasn't actually there at the time, but for the sake of this story let's assume I was): I'm doing my production build and putting code onto production servers, and some guy who must not be named, who I really want to curse right now, but let's skip that part, goes in and tries to deploy code at exactly the same time, stomps all over my work, and basically brings the site down.

This started happening quite often. That part is real; everything except me being there is real. It started happening more often than not. Deploys got shittier, overall production quality got worse, and features started conflicting with each other. Don't ask me how that happened, because I'd expect everyone to merge into master; that wasn't true at that point.

So the number of people committing to these servers kept growing, until one of the leads goes, "enough. This is not how we're going to do things. We need a gatekeeper." And welcome, your first operations guy. I put in the beard because every cool operations person I know has a beard; I'm trying really hard to grow one so I can be cool like them. But until then: operations guy with a beard. Now it became this guy's responsibility to get the code together and put it into production. He's the only one responsible for it. He's also the one taking care of incident management and the reliability of the code on production systems. Your quote-unquote sysadmin guy.

That's when Snapdeal started looking something like a regular e-commerce site. Snapdeal, Amazon, they're similar-looking sites; pretty much every e-commerce site looks the same to you and to me. So that's what happened. But obviously that has problems, because now there's more money, people want more money, people are building out more features, faster, all at once. What happens is people bring code to one person, more and more of it, and he's trying to deploy onto more and more servers. The ops guy, let's call him Bob, is awesome, but he also has limits. He has two hands and a 24-hour day, and in that time he has to do a billion merges, a million deploys, and probably a hundred thousand incidents.
No, that doesn't happen in a single day, but the point is that every single day he would go home and work on incidents, because things would go down at night, and during the day he was deploying code. He started looking something like this guy right here. I feel their pain. I've never been the only operations person, but that's what I imagine it feels like: so much stress of "I'm constantly putting things into production, and oh my god, the site is down, and the co-founder is calling asking, dude, what's happening?"

Clearly this wasn't scaling either. Same problem as before: you throw one person at it, but that person has limits too. We have money, and we have lots of people in India (I love India, but population is a problem here), so that's how we solved it. Throw more people with beards at the problem. Why not? Because we can. That's what they did, and it worked out pretty well. We scaled from 20 to 50 developers, more and more people coming in, more features being built. Suddenly we had the cart feature, we had things like search and all the fancy-schmancy stuff you see on every e-commerce site. And it's great: five people handling the load of 50 developers, why not? This is how Snapdeal looked at that point. To be honest, I still feel it's relatively the same, but apparently it has way more features now. It does look like every other e-commerce site, doesn't it? No reaction? Okay, that's all right.

As we kept scaling, the graph of sysadmins versus developers started looking like this. That's the hockey stick everyone wishes for and most people never get. Snapdeal was one of the lucky ones that actually got the hockey-stick growth, and it hired developers left, right, and center. We went from 20 to 50 to 500 to 1,500 devs, and against that we now have about 50 DevOps people actively supporting them, but the job hasn't changed. Those 50 people are deploying code, building code, responding to incidents, monitoring production systems. From my perspective they're basically gods, and that's what I like to call them. But as you can see, the more servers, people, and features you add, you can't keep scaling people the same way, and we started noticing this problem a lot.

At this point the whole cycle started repeating. Feature work was getting backed up, production issues were happening quite often again, operations wasn't stable at all, and devs weren't happy either: they weren't allowed to put code into production, and new feature releases were stalled along with it. What operations looked like next to dev was something like this: operations completely siloed, one single block that handles only building and putting code into production, and devs who only care about features, neither having any idea what the other is doing. To give an example of the stories that happened, and I'm pretty sure some of you have experienced this: the operations guy puts some code into production for you, but it's not actually working, and he goes, "not my fault, talk to the dev," and the dev goes, "dude, this works on my machine, this guy has totally screwed it up."
That has happened to me more than a few times, but the reverse also happens. I don't know if you've experienced this, where your code changes behaviour based on the environment it's running in, or somehow knows it's on a production machine and that this is the right time to crash, whereas every other time it never crashes. As a developer, that has happened to me so many times that I've stopped counting. Maybe it's me, but somehow my code knows it has to crash in production. I'm not proud of that, but that's what happens.

So what does the developer's view look like when these painful things happen? Obviously I can't ship new features because things are down. Something is better than nothing, and right now I don't even have the something; everything's down, so I can't put new features out. The other thing is that every single change to production needs an ops ticket. Some of you who use JIRA, or whatever tracking system, will be familiar with this. Until recently, the way we did this at Snapdeal was to walk over to the DevOps guy, tap him on the shoulder, and go, "do this." I'm not even joking; that happened. Every single thing had to go through this set of 50 people, who did everything from building code to deploying it to making a small change in your configuration, you name it. Essentially, if you want to touch production, you go through these 50 people, and that cycle became vicious. Again and again, as a developer, it was "great, I have to go to operations again; maybe I should bring a bribe so they do my stuff earlier." It didn't work, by the way, no-bribe policy. That was supposed to be funny.

Anyway, that's the developers' side; operations is the other side of the story. Developers were building more features, but the more they built and the more they wanted to deploy, the less stable the system became. As an operations person, once I put something in production and it's working fine, I never want to touch it again. That's the general sentiment of most operations people I've met; if you've met a more enlightened operations person, please introduce me so I can learn what else they feel. I've come to learn that change is good, everyone in my family says that, though they're after my life for different reasons, but in general, operations prefers stable systems.

The other thing that happens when an incident hits production is that operations people don't have much context on the code. Another common story I've seen a lot: you have an issue that's specific to how an application runs. In certain scenarios it calls out to some external system and in others it doesn't, and operations has no idea this happens. The only way they figure it out is usually netstat or something like it; they have no idea what the application itself does. That context is critical when you're debugging production systems: I want to know what the application does so I know what impact my actions could have. For example, at my previous company, Shopify, we ran Kafka, and when I was maintaining Kafka my job included knowing when I could restart it. Is it safe right now? I only knew the answer because I knew the internals of Kafka, whether this was the right time to restart, or whether I could go ahead and do a rolling restart.
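To make that concrete, here is a minimal sketch of the kind of rolling-restart safety check I'm describing: restart one broker at a time and only move on once the cluster reports no under-replicated partitions. The host names, the ssh/systemd service, and the use of the standard kafka-topics tool are assumptions for illustration, not how Shopify or Snapdeal actually ran it.

```python
import subprocess
import time

# Hypothetical broker hosts; replace with your own inventory.
BROKERS = ["kafka-1.internal", "kafka-2.internal", "kafka-3.internal"]
BOOTSTRAP = "kafka-1.internal:9092"

def under_replicated_partitions() -> int:
    """Count under-replicated partitions via the standard kafka-topics tool."""
    out = subprocess.run(
        ["kafka-topics.sh", "--bootstrap-server", BOOTSTRAP,
         "--describe", "--under-replicated-partitions"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len([line for line in out.splitlines() if line.strip()])

def restart_broker(host: str) -> None:
    """Restart the Kafka service on one host (assumes ssh access and systemd)."""
    subprocess.run(["ssh", host, "sudo systemctl restart kafka"], check=True)

def rolling_restart() -> None:
    for host in BROKERS:
        # Only proceed while the cluster is fully replicated.
        while under_replicated_partitions() > 0:
            print("cluster not healthy yet, waiting...")
            time.sleep(30)
        print(f"restarting {host}")
        restart_broker(host)
        time.sleep(60)  # give the broker time to rejoin before re-checking

if __name__ == "__main__":
    rolling_restart()
```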
Knowledge like that comes from the context of the application itself, and in most cases operations doesn't have it. So: developers unhappy as hell because no features are going out, why am I even coding? Operations also unhappy because the system's not stable and someone is going to come shout at them. And from the organization's point of view, everything is bad: no code shipping, no uptime, no new features.

War between operations and developers was pretty common. I don't know if you've experienced this, but I have: people were actually afraid of going to operations, afraid their code wouldn't get pushed into production. Or, the thing I heard even more often: "I wrote my code in a day and tested it, but it's going to take two weeks to get into production." That shouldn't be the case. If your code works, it should get into production right away; but at the same time, if your code works, it shouldn't take down the system. And on top of all this, those 50 people doing incident management and everything else across the board were overworked, and overworked, stressed-out people are cranky and unhappy. I sit with the DevOps folks right now and they get really cranky a lot of the time, so trust me on this: operations people can get very, very cranky. That's the organization's view.

At its core, keeping to today's theme, we failed at pretty much everything we set out to do. We failed at shipping features. We failed at being reliable. We failed at building a culture that would thrive on change. Pretty much every aspect you look at, given where we were and what we did, we failed. The good thing is we realized we had failed, and obviously we wanted to make changes. But we weren't the only ones who failed at these things and realized it later; we weren't the only company facing this problem. A lot of bigger companies hit these problems before us, the biggest being Google. They're probably the biggest, close enough; I don't know their market cap as of today, but they're pretty big. They faced exactly the same issues with operations. They realized operations was siloed and everyone was following the same patterns we followed, except they went ahead and found a solution for us, and they called it Site Reliability Engineering, or SRE.

What's the goal of SRE, and why was it created? The person who coined the term and built the team is Ben, who still works at Google, and I really like his definition, so I'm not going to muck around and put it into my own words. Essentially, he said: take a software engineer and ask them to take care of production. So it's not a sysadmin who just knows Linux commands; it's someone who writes code, knows what test-driven development is, knows good coding practices, understands algorithms. A software engineer, not a sysadmin. That's what SRE came to be: a bunch of software engineers trying to make the whole system reliable.

The first goal of SRE, which is probably critical to every SaaS and pretty much every operation out there, is uptime. If you're not up pretty much all the time, you're going to die soon.
I appreciate that people name names when sites go down. I don't want to name names here, but Snapdeal was up throughout Diwali while its competitor was down. Things like this happen quite often, and this is why you want uptime. At the end of it, you want something like five nines of uptime, with everything up essentially all the time.

The second goal is to solve the reliability problem, which was the original problem: the site going down every time you push code to production. You want systems to be reliable and you want to be able to push code whenever you want. Third is to make devs happy: code should be able to go out again and again, really fast, without taking down the system. Finally, and this is the part I always have a bit of a hard time explaining, if you look at my earlier graph, as the organization grew, operations was scaling linearly along with it. That's not sustainable in the long run. As long as you keep growing, it won't scale; you'd have to put a hard stop somewhere, and where do you put it? And if you do, does that also cap growth? Which is why one of the prime goals of SRE is: do not scale by adding things linearly. Whether it's machines or humans, don't just keep throwing more at the problem. Make them more efficient; make them more useful by doing things a little differently. I need to catch my breath, sorry.

So how does SRE work, and what exactly do you do? As I already mentioned, SREs are software engineers, usually really good at UNIX or networking. That's what Google says, and in my experience it's what I've noticed. At my previous company, Shopify, we had amazing people doing the same thing under the name production engineering, and at Snapdeal we're doing the same: finding people who are software engineers but are excellent at networking and understand the whole stack inside out.

Second, they focus on engineering; they don't do sysadmin work. There is no more walking up to someone's seat, tapping them on the shoulder and asking "can you do this for me, please? Pretty please?" That doesn't happen anymore. It's engineering: you build a system, and the moment you see something you had to do manually, you automate it for the next time. I've already mentioned automation; anything left over in operations should be clicking a button to deploy or clicking a button to run the tests, or, if possible, even that is automated with continuous integration and continuous deployment. But the idea of SRE is that you offload that to the team that owns the service. It's not operations' job to build the thing, and it's not operations' job to deploy the code. Operations' job is to make sure everything is running, and even that isn't specific to an operations team: any task specific to a service should go back to the team that owns it, and they should handle it. How do they handle it? You train them.

One of the questions people started asking at Snapdeal was, "okay, if I'm not the one deploying to production, does that mean I also don't own the service?" The usual answer is no; you both own it. You, the developer, own the code, and SRE helps you make sure it's reliable in production. It's our joint effort that gets us to something like five nines of uptime.

Finally, cross-training. This is the hardest part I've come across so far.
It's hard to explain to developers how operations works and why it matters: why reliability matters, why security matters, why privacy matters. These things don't come easily to most developers I've come across here so far. But the other side is also true. Operations people don't ask, "why did you make it work like this? If you're scaling out, why is your code doing this strange thing? You should understand it's a distributed system, it will run on multiple machines, design it accordingly." So a lot of the work in making SRE happen in our company is teaching these people what to do and how to do it: understanding each other's pains and each other's strengths, and making them appreciate what matters.

Digging deeper into what SRE cares about: I've talked about engineering, so I'll skip that. The second point is really my favourite and I absolutely love it: service level objectives. You've probably heard the term SLA. The idea is that a service has to be up for a certain proportion of the time, and that number, how much it has to be up or whether it even has to be up, should not be defined by anyone except your business. The business should tell you whether it needs to be up. I, as a business leader, should be able to decide whether the mobile app can go down at night because everyone is sleeping. They may or may not be sleeping, but the decision has to come from the business side, and they set a very high-level objective for how much of the time Snapdeal has to be up. Our goal is five nines of uptime, 99.999%, which means you're down for only a few minutes in the whole year. That's the objective. Are we there yet? No, but hopefully we'll get there.

Second is monitoring. Oh, I skipped error budgets, sorry. Error budgets. When you set a service level objective, it obviously won't always hold, because you will make mistakes; errors will happen, failures will happen. How do you handle the situation where errors happen and you still have to keep the uptime? You allow a little budget within which services are allowed to go down, provided they don't go down forever. At Google, for example, a service might get something like a 0.01% downtime allowance: with a 99.99% objective you can be down 0.01% of the time, and as long as you stay within that, you're still allowed to go ahead with deploys and all that stuff. But as soon as you burn through your error budget, you're not allowed to deploy anymore: no new deploys, no configuration changes. The system stays as it is, stable, until you make it up to us. A small sketch of this arithmetic follows below.

Monitoring: there are a lot of talks on monitoring, so I'm not going to dig into what it is, but SRE cares about monitoring all the distributed systems, and that's important because I can't react until I know what's happening. More than that, I shouldn't have to react: monitoring should not just be logs, it should alert and act on its own as much as possible. SRE is about automating all the tasks you would otherwise do manually, including responding to an incident. If a machine can do that job, let it do that job.

Emergency response is part of SRE too. Usually devs aren't too happy about being on call, but SRE, what's the word, "encourages", I think that's the politically correct term, encourages people to join the on-call rotation.
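To make the error budget arithmetic concrete, here's a minimal sketch of the kind of check you could run before a deploy. The 99.99% objective, the 30-day window, and the service names are illustrative assumptions, not Snapdeal's or Google's real numbers.

```python
# Rough sketch: translate an availability objective into an error budget
# and use it to gate deploys. Numbers and service names are made up.

MINUTES_PER_30_DAYS = 30 * 24 * 60

def allowed_downtime_minutes(slo: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Error budget for the window, e.g. slo=0.9999 -> 0.01% of the window."""
    return (1.0 - slo) * window_minutes

def can_deploy(observed_downtime_minutes: float, slo: float) -> bool:
    """Allow deploys only while the service is still within its error budget."""
    return observed_downtime_minutes < allowed_downtime_minutes(slo)

if __name__ == "__main__":
    slo = 0.9999  # 99.99% over a 30-day window
    budget = allowed_downtime_minutes(slo)
    print(f"budget: {budget:.1f} minutes of downtime per 30 days")  # ~4.3 minutes

    for service, downtime in {"cart": 2.0, "search": 6.5}.items():
        verdict = "deploys allowed" if can_deploy(downtime, slo) else "deploy freeze"
        print(f"{service}: {downtime} min used -> {verdict}")
```

In practice the observed downtime would come from your monitoring system rather than a hard-coded number, which is part of why SRE ties error budgets and monitoring so closely together.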
Coming back to on-call: devs and operations should both put their hands up for the on-call rotation and split the duty of who gets woken up at night when something goes down. Change management is the regular stuff, code deploys and configuration changes, so nothing surprising; it's just there for completeness' sake.

Capacity planning and efficiency is something I find super interesting, given my past and present experience. In the past, at Shopify, we did capacity planning by working out the numbers: how much throughput we need, how much load our servers can handle, and then planning ahead with some extra margin. At Snapdeal we do it a little differently: we performance-test every single system, see how much load each one can handle individually, and then scale accordingly. We make rough estimates, like how much traffic we expect to see next year, whether we're going to grow 2x or 3x or not, and plan for it (a rough sketch of that arithmetic follows in a moment). SRE's job is to automate that too: you should be able to tell whether, over the next year, you need to bring in new servers or go on-demand into the cloud, whatever the case may be.

The most common question I hear is, "well, if you're automating everything, isn't that just DevOps?" I heard a talk at LISA which put it well: DevOps is a way of doing things and SRE is a role, and that's a significant difference. A QA person writing scripts to automate their work is doing DevOps. An operations person writing deploy tooling is doing DevOps. People on our security team writing scripts for penetration tests, that's DevOps too. Pretty much anyone can do DevOps to add efficiency to their workflow; it's a way of working, not a role. Anyone can do DevOps.

So the question becomes: operations, DevOps, SRE, too many terms being thrown around. What am I doing, what should I care about, should I even care? If you don't care, you can walk out now, but I'd like you to care, and I'll come back to why right at the end. For now: SRE is different from DevOps. SRE is a role, DevOps is a way of doing things, but both rely heavily on automation. Both care about eliminating manual work and being very efficient in what they do and how they do it. But what DevOps doesn't do, at the end of the day, is solve our original problem: the confusion when information is siloed into one group. It doesn't solve the problem of operations only knowing operations and nothing about the application, and application developers only building the feature and not caring about scalability, security, or privacy. Both sides matter, and DevOps doesn't solve that. SRE, on the other hand, is a role that lets you build a cross-functional team that cares both about operations and about the code and its impact.

With that said, you shouldn't throw the baby out with the bathwater. What I mean is: keep the good parts of operations and sysadmin work, get rid of the parts that don't matter, and keep the good parts of SRE and get rid of the parts that don't matter. You don't have to take everything wholesale; you take the things that matter to you and to your organization.
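Here's that capacity planning sketch: take the per-server capacity you measured in performance tests, apply the expected growth factor, keep some headroom, and see how many servers you need. All the specific numbers are made up for illustration.

```python
import math

def servers_needed(peak_rps: float, growth_factor: float,
                   rps_per_server: float, headroom: float = 0.3) -> int:
    """Estimate server count for next year's peak, keeping spare headroom.

    peak_rps:       current peak requests per second (from monitoring)
    growth_factor:  expected growth, e.g. 3.0 for 3x traffic
    rps_per_server: sustainable RPS per server, from performance tests
    headroom:       fraction of capacity kept free for spikes and failures
    """
    projected_peak = peak_rps * growth_factor
    usable_per_server = rps_per_server * (1.0 - headroom)
    return math.ceil(projected_peak / usable_per_server)

if __name__ == "__main__":
    # Hypothetical numbers: 4,000 RPS peak today, planning for 3x growth,
    # each app server load-tested at 250 RPS sustained.
    print(servers_needed(peak_rps=4000, growth_factor=3.0, rps_per_server=250))
    # -> 69 servers, versus 48 with no headroom at all
```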
The things I care about keeping from the operations side: operations people care about how they respond, how fast they respond, and how careful they are while responding. At the end of the day they're on production systems; I can't just run rm -rf * and delete everything, or go "let's just shut it down and restart, that'll work." They really care about how they act on production systems. We should keep that, and SRE does keep it. The other thing to keep is privacy and security. Operations people generally like making things secure so you don't get hacked. Take that and bring it into SRE. Keep the things that matter to you and to the organization in general.

Things to get rid of: manual work. I've talked way too much about automation, so skip that. And one-offs. This has a funny story. Last Christmas, well, last-to-last Christmas, I was at Shopify, and this was right after the Christmas party, where everyone was pretty drunk, I believe. I went home and decided, I think I can upgrade our Hadoop cluster. Why not? I was a data engineer, I had the access, so I upgraded it, because I could. In hindsight: terrible idea. A, right after a party, please don't touch computers. Don't touch your phone, don't touch anything, go to sleep. B, the things I forgot about, so many of them. I missed a bunch of configurations I should have put in, things I should have done and didn't. Essentially: don't try to be a hero, because that's not what operations is about. It's about teamwork, getting things into production, and keeping them stable. I've seen it again recently; I believe just last week we had a Kafka incident, and someone on the spot went, "oh sure, I'll just restart everything," when anyone who works with Kafka knows it's normally a rolling restart procedure. Don't try to be the hero, the cool guy who fixes it single-handedly. That's not something to strive for in a team.

The other part is logging into machines. I feel guilty about this one, because I really like getting onto production machines and looking at the logs, so every time I make this point it's conflicting: you shouldn't allow people to log into production machines, and yet I do it all the time. But it's wrong, and I'm wrong there; I shouldn't be getting onto the machines. If you're building systems and services, make it so people don't have to get onto the machines. Make it so they can use some kind of tool, whether it's logging or monitoring, and get all the data they need without ever being on the box. The moment you're on the machine yourself, the chances of screwing something up are very high.

So that's the background; essentially, this is what happened at Snapdeal. Someone had already solved our problem, showed us the light, told us how to run things. So what are we doing now? We had three teams in operations: DBA, release engineering, and production engineering (plus some smaller teams that aren't important here). We're bringing them back together, because we realized that's not the right way to scale people.
We want to bring them back together as one single SRE team, which will handle the load the way it should be handled: by embedding with the product teams and working with them. The other thing we realized is that when you're steering a small ship, it's easy to course-correct when you see an iceberg; if you're the Titanic, it's much harder. With 8,000 people in total and 1,500 developers, it was really hard to course-correct while keeping everything running. We still wanted people to deploy, we still wanted everyone to keep doing what they were doing yesterday, but we also wanted to move to the new world where everything works better. So we decided that some percentage of our people would work full-time on automation while we reduce the operational load on everyone else, essentially handing the work back to the development teams, whether that's building, deploying, or whatever else. That's the second part.

So that's what we've done so far; I don't think I'm missing anything major. We got the teams back together, we're getting them to think about how they'll work with new services, and we took load off the operations people by offloading work back to the development teams and building the tools to make that possible. But are we there yet? I would say no; we're far from it. Operations is a huge part of Snapdeal and we have a lot left to do.

Things that are still missing: service level objectives. I mentioned each service telling us how much it needs to be up; we haven't figured out how to get that data yet. We haven't figured out which services are critical and which aren't. For us, everything is critical, but in reality that's not true. The other thing we haven't figured out is error budgeting: how do we say how much each service can tolerate, and what the overall impact of one service going down is on the whole system? Again, still figuring that out; I don't have an answer. Distributed debugging is one of the harder problems. People put in EFK stacks and monitoring and all that jazz, but it doesn't actually help much when you're on a production system; usually our people don't even look at EFK, they just go debug right on the box. So we need to rebuild our debugging tools and figure out the right way to do it. And finally, capacity planning and forecasting is also missing; you already know what that is, so I'll skip it.

I've blabbered a lot. I've talked about operations and about scale. Obviously Snapdeal is bigger than most startups and most of the companies of people I've met here, the same way Google felt much bigger than us, so maybe it feels like this isn't relevant to you, and I'm sure you have that same question. Is it really relevant to us? Do I care about SRE? Should I? Does it matter to me? I don't have the same number of people as you. You chose to scale operations this way; I'd scale it some other way. Or you could ignore that I ever spoke or existed, which is totally okay; a lot of people in my life ignore my existence. I hope you won't have to do that. What I would say is: I do encourage you to care about this early on. Whether you're a two-person team, a ten-person team, or as big as Snapdeal or bigger, care about SRE, care about efficiency in operations.
Think about how you're going to scale your overall operations without bringing in more people. Think about how you're going to scale the number of services you run without bringing in more machines. Think about how the culture works when operations and developers are fighting each other. Think about all the places where you could possibly fail, and think about them very early on. And one last thing: don't do the machines' work for them. Don't give it your blood; let machines do their job, and don't do manual work. Automate, automate, automate. I was trying to do the "developers, developers, developers" thing there, but it didn't quite land. Okay, that's basically it. If there are any questions, I'd love to answer them.

"Yeah, thanks for the presentation. I have six years of experience; I'm a cloud solution architect, into Linux, Windows, release, build, everything. You said we should know each other's roles, but I don't know Java or any programming language; I know shell scripting. So how can I fit into a DevOps team?"

The very first answer, from my perspective, is that you don't have to learn Java. Look at the things you do manually right now, identify them, and figure out how you're going to automate them. That's the first step. The second step is, once you understand what you're doing manually and how you're going to automate it, figure out how your applications get deployed and what impact your automation has on those applications. Once you're there, getting into the application and understanding the overall architecture gets you to a state where you can say: I know how everything is automated, and if at any point I have to fix something manually, I know exactly where to look. You don't need a developer to tell you where the code is broken or someone else to tell you where to look; you know things end to end. So when an incident happens, you're completely on your own and you're good. Thank you.

"Hi, here on your left. Looking at the whole SRE culture, the software engineering culture, and the point you make about getting good software engineers into operations to build systems: I think the major elephant in the room is the recruiting process. Most companies today recruit a lot from colleges, and the way you hire, and yes, money, matters in most cases. The reality is that if you want a really good software engineer who actually loves these systems, and your process and what you offer don't match up to that, then all of this is good in theory, but how practical is it?"

I'll give you my own example. I just moved back from Shopify. Shopify is an amazing place: their systems are solidly built, and their production engineering, before I had even learned the term SRE, handles things really well. I came here and a lot of things were done manually. As an example, when we deploy code, we actually go into one server, pull it out of the ELB, deploy onto it, and then put it back, all by hand. I would expect a script to do that without even thinking about it, but it wasn't there. And when I got dropped into this operations role, I was also handed about ten new grads as part of my team.
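As an aside, here's a minimal sketch of what scripting that manual deploy might look like: take each instance out of a classic ELB, deploy, and put it back. The load balancer name, instance IDs, hosts, and deploy command are all assumptions for illustration, not Snapdeal's actual setup.

```python
import subprocess
import time

import boto3  # assumes AWS credentials are configured in the environment

ELB_NAME = "web-prod"                      # hypothetical classic ELB name
INSTANCES = ["i-0abc123", "i-0def456"]     # hypothetical instance IDs
HOSTS = {"i-0abc123": "web-1.internal",    # instance id -> ssh host, made up
         "i-0def456": "web-2.internal"}

elb = boto3.client("elb")

def deploy_to(host: str) -> None:
    """Run whatever your deploy step is on one host; placeholder command."""
    subprocess.run(["ssh", host, "sudo /opt/deploy/run.sh"], check=True)

def rolling_deploy() -> None:
    for instance_id in INSTANCES:
        # Take the instance out of rotation so it stops receiving traffic.
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName=ELB_NAME,
            Instances=[{"InstanceId": instance_id}],
        )
        time.sleep(30)  # crude connection-drain wait
        deploy_to(HOSTS[instance_id])
        # Put it back and let it warm up before moving to the next one.
        elb.register_instances_with_load_balancer(
            LoadBalancerName=ELB_NAME,
            Instances=[{"InstanceId": instance_id}],
        )
        time.sleep(30)

if __name__ == "__main__":
    rolling_deploy()
```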
Coming back to those new grads: sure, there were a couple of senior devs too, but the new grads didn't know how operations works, and I'll be very honest, initially I was very frustrated. I kept thinking, this is a no-brainer. But then I realized most developers are like them: they don't know how operations works, and the first job was teaching them. At the end of the day, it doesn't matter; you need people who are curious and willing to learn. Once you cross that part, making them great at this isn't too hard as long as you give them exposure. That's what I learned. Hiring is a really hard problem for SRE in general. Hiring people who are really good devs and also want to do operations, even after they've seen operations done really badly in most places, is very hard. But people are coming around to it, like me: I was a dev, but I love operations.

Absolutely, silos are pretty much always going to exist as you grow, unless you've built the culture from the very beginning so that it isn't hierarchical. The first thing I noticed coming back to India is that everything is very hierarchy-based, and that hierarchy defines the silos you work in, including operations. But you can work on it. I can't fully break down the silo between dev and operations, but I can reach right into the dev team and say: you don't have to come to operations, because I've built all the tools you need to deploy your code yourself and manage it yourself, and I'm here to help you with your work. I make sure things are up, but you also make sure things are up; we're both on call, as primary and secondary. So you bring the teams together; even though the silos exist, the teams work very closely. Does that answer your question?

"I had two questions. How did you align the goals of the development teams and operations? What's the common ground you pushed for? And how does your error budgeting actually work; do you have common metrics?"

I'll start with the common ground. At the end of the day, we had to get our CTO to accept that the common ground is uptime plus new feature rollout. We need both of them, all the time. Building a new feature doesn't mean you get to take down the site, and you need to be reliable all the time. That's the common ground dev and operations, or SRE, have to work on together. As long as the goal is the same, whenever we get into a meeting room, that's what we're striving for: we're trying to make you work more efficiently without going down. So that was the common ground.

We haven't really figured out error budgeting yet, as I mentioned. We've started some initial work: asking every service what it depends on, and how much latency and how much downtime it can tolerate, essentially a first pass at system resiliency, but we haven't gotten to the point where we can say it just works, or that we have error budgeting as a whole. It's very early; honestly, it's not even well thought out yet. The only thing we've done so far towards error budgeting is collecting the numbers, like how much uptime each service expects to have, with no monitoring yet to tell us whether that's actually true or how far we are from those assumptions. Thank you.

Thanks, Yagnik. Anybody who has any other questions, please take it offline; sorry, we're running very late and the next talk is starting now. Kamal.