Good afternoon everyone — can you hear me at the back? Awesome. I know I'm the only person standing between you and lunch, and this is also a crisp talk: I only have 15 minutes and quite a bit of content to present, so I'll move quickly and take questions after the talk, over lunch.

The topic is how we made everybody at Indix — the developers — also take care of the infrastructure: the lessons we learned, what pushed us into that zone, how we ended up where we are today, and some key takeaways if you want to follow a similar practice in your organization. My name is Ashwant Kumar, and I'm a principal engineer at Indix — that's the fancy way of saying I also write code at Indix. That's my Twitter handle up there, underscore ashwantkumar; feel free to ping me. I can be a little late to respond — I still use a Windows Phone, so I don't get instant Twitter updates — but I check it often and I will reply.

Getting into the talk: I'd classify how Indix has grown into the way we do operations today into three stages — the early stage, the growth stage, and the later stage, which is where we are right now. I'll walk you through each of them.

In the early stage, the overall goal when we started the company was just to have working infrastructure. Like any other startup, we just wanted to get up and running and start delivering value to our customers. We were around five to fifteen developers, and we had a one-to-two-member ops team: one person was a developer who had to become an operations person, because somebody had to take care of the infrastructure, and the other was on-and-off contractors — we tried remote people, and it kept fluctuating. Their responsibilities included writing deployment scripts for the whole setup, both our internal services and the open source tools — by open source I mean we started off with Hadoop, HBase, the whole big data stack — so they had to write the code for managing and deploying those. They also had centralized control over AWS. I still remember the days when we had CloudWatch alarms for when our AWS bill would touch a thousand dollars — now we're ten or twenty times over that — but at that time we wanted control over who got to access those resources and how we managed them.

That also influenced our choice of tools. At the top, of course, is AWS, but the one AWS tool we used the most was the management console. That Launch Instance button was a life saver at two o'clock in the night when you had to bring up an instance and deploy the code so you could show the app to a potential investor — screenshots don't always sell, you know. So that was a very important tool in our stack.
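I mentioned those CloudWatch billing alarms; to give a rough idea, a minimal sketch of that kind of alarm with boto3 would look something like the following. The SNS topic ARN and the thousand-dollar threshold are just placeholders for illustration, and the account needs billing alerts enabled for the EstimatedCharges metric to exist.

```python
import boto3

# Billing metrics are published in us-east-1, regardless of where workloads run.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="estimated-monthly-bill-over-1000-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,              # billing data only updates every few hours
    EvaluationPeriods=1,
    Threshold=1000.0,                # alert when the estimated bill crosses $1000
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
)
```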
And because that whole culture of centralized responsibility and control was there, we used Chef, where an upstream Chef server managed the provisioning. It was a pull-based system: you update the cookbooks or recipes, and they eventually get pulled onto the individual servers and applied. The third tool was Capistrano, which is what we used for our deployment scripts — also written in Ruby, because Chef was in Ruby. And the last one is GoCD. We're huge fans of ThoughtWorks products, because a lot of our core team came from ThoughtWorks, so we use a lot of their open source tools; GoCD is the ThoughtWorks tool we used for our CI and CD process. So that's how it started.

The lessons we learned from that stage: the operations team couldn't really scale. They also couldn't contribute to our system design or architecture, because they were always overloaded with ad hoc requests — I'll come to those in a moment. And the mentality we had initially was that they were the first line of on-call support: with the minimal alerting we had, if something went down they would get the call and have to go debug a system at 3 a.m. without much context. That really wasn't useful — they'd be digging through logs trying to make sense of what a log line meant. If you've worked with Java you know the size of the stack traces it generates, and they're not much use if you don't have the developer context for where the error is coming from or what that code is used for. We were setting them up for failure, and it wasn't scaling well for us.

Developers, meanwhile, wanted to build and try out a lot of new things. I think Kiran was the one talking about how operations don't want to change their stuff — if it works, you want it to keep working — but developers constantly want to change things and try the new tools and new stuff being released in the market, and the ops team couldn't keep up. That was the ad hoc request part: every time somebody wanted to try out some tool that was out there — and in those days it felt like every big company was releasing a new open source tool every day — the operations team couldn't catch up with or manage those requests.

So the ops team started working with the devs, so the devs could take over some of the Chef cookbooks and recipes and manage their own infrastructure. That's when we had the aha moment: the devs didn't really like the Ruby code. Even for somebody who didn't come from a Ruby background, just looking at the way the operations team wrote code, the reaction was: no, this is not how you write code, your abstractions are fundamentally wrong, why is this a global variable when you could have just put it inside your function — a lot of such things. Devs had a lot to say about the whole operational setup, the scripts we used, how we ran them, all of that.

That led us to the next stage, the growth stage. This time, based on the previous learnings, the overall goal was decentralized access to our infrastructure. By this point we had grown to 15 to 30 engineers — we no longer say "developers", and I'll get to why — and we had two or three ops engineers on the team. Their primary responsibility was to continue where they had left off and educate developers on their own infrastructure. Earlier, only one or two devs had taken ownership of experimenting with the infrastructure, but we realized it made a lot of sense for individual teams to start managing their own infrastructure, so we wanted to educate developers: how does this thing work, how do you manage resources on AWS, and so on. The ops team also worked on the overall process — a sort of framework for how we do operations — and that again drove our choice of tools. We moved from the centralized Chef-server-based model, which is pull-based, to Ansible, which is push-based: individual teams had their own sets of playbooks and roles to manage, and their own CD pipelines that would bring up instances whenever they deployed, provision them, deploy their code, and so on.
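Roughly, each team's deploy pipeline boils down to a stage that pushes configuration and code out with Ansible. This is only a minimal sketch of that idea, not our actual pipeline — the inventory path, playbook name, and version variable are made up for illustration:

```python
import subprocess
import sys

def deploy(version: str) -> None:
    """One CD pipeline stage: push the given version out to this team's hosts."""
    subprocess.run(
        [
            "ansible-playbook",
            "-i", "inventory/production",        # the team's own inventory
            "playbooks/deploy.yml",               # the team's own playbook and roles
            "--extra-vars", f"app_version={version}",
        ],
        check=True,  # fail the pipeline stage if the playbook fails
    )

if __name__ == "__main__":
    deploy(sys.argv[1])  # e.g. invoked by a GoCD stage with the build label
```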
The key lesson here — I'd like to quote from a paper that came out in 2007, by James Hamilton, who was on the Windows Live Services team at the time. The quote is roughly: if the development team is frequently called in the middle of the night, automation is the likely outcome. Devs don't like doing the same repeated task over and over again, and from my personal experience that's so true. But if the operations team is the one being woken up, without much context on these systems, the reaction is "we need more people on the team, so we can learn, get some KT, and start working on it" — and that doesn't scale; the obvious choice becomes "I need a big SRE team" or "I need a big operations team". The paper is called "On Designing and Deploying Internet-Scale Services"; if you're serious about DevOps you should check it out, it has a lot of really cool insights, and I'll share a link at the end of the slides.

One of the key lessons was that some developers, once they started doing operations, loved contributing to it. Some of the things Ayush talked about earlier — Matsya, Vamana, and a bunch of the other stuff under OSS at Indix — came from developers who were asked to manage a piece of infrastructure. It started off as a bunch of cron jobs and bash scripts, then Java code, then Scala code, and it evolved over time, but it solved the problem: automation was the likely outcome. Individual teams took care of their own infrastructure and their own on-call. If you're the developer who pushed code to production at 12 at night, you're probably the one getting the call at 12 if something breaks — and you have far better context for fixing the issue than an ops person who was asleep, gets woken up with "something is down", and can't really help solve the problem at hand.

The downside is that now everybody has access to the infrastructure, everybody can spin up resources and do whatever they want, and cost becomes a problem. From Ayush's talk you'd have seen how we had a sudden spike from around 150,000 dollars a month to 200, 220,000 dollars. We ran a "mission: cost reduction" for almost a month, all hands on deck, to bring the spend under control — we don't want to hand all the money our investors give us straight to AWS; that doesn't really make sense. So cost control was extremely hard, and also super important.

And if you're giving out decentralized access, one other thing you should really take into account is backups. When I was just getting into this whole ops space, I was trying out a new tool to manage our Route 53 zones. I wanted to import a set of names into my zone file, but I ended up replacing the whole zone — the fifty-odd entries we had — with just the handful I was importing. I literally took my laptop, ran to every single person in the office, and asked them: what DNS names do you have in your bookmarks? Give them to me, I'll quickly add them back to Route 53 and we'll move on. Trust me, it was a very painful experience, and I'm thankful I didn't get fired for it. The first thing I did after getting the infrastructure back up was to put a backup in place: we now have Route 53 backups happening every 30 minutes, so worst case you lose the latest 30 minutes' worth of DNS records, and you just have to find which pipelines ran in that window and rerun them.
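To give a flavor of what such a backup job can look like — this is a minimal sketch, not our actual script, assuming boto3 and dumping to local JSON files (in practice you'd push the dumps somewhere durable like S3):

```python
import json
import boto3

route53 = boto3.client("route53")

def backup_all_zones() -> None:
    """Dump every record set in every hosted zone to a JSON file per zone."""
    for zone in route53.list_hosted_zones()["HostedZones"]:
        records = []
        paginator = route53.get_paginator("list_resource_record_sets")
        for page in paginator.paginate(HostedZoneId=zone["Id"]):
            records.extend(page["ResourceRecordSets"])
        # Zone IDs look like "/hostedzone/Z123..."; keep just the last part.
        zone_id = zone["Id"].split("/")[-1]
        with open(f"route53-backup-{zone_id}.json", "w") as f:
            json.dump({"zone": zone, "records": records}, f, indent=2)

if __name__ == "__main__":
    backup_all_zones()  # run from cron (or a pipeline) every 30 minutes
```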
And then the later stage, which is where we are right now. The goal is to continue from where we left off and have self-serve infrastructure — essentially to be like AWS, but internally, for the organization, so that the users (our developers and engineers) get to manage their own stuff. We're now a 30-to-50-member engineering team, but the operations team is still the same size, two to three people, varying up and down, and there's a lot of rotation: developers go into ops, ops folks come into the development teams, and so on.

The responsibility now is to be enablers, either through a process or through automation and tooling, so that engineers can deliver end to end. The ops team also wanted to contribute to the design and architecture of our systems, so that they don't feel left out. If you're in the ops space you can only do so much: you can stare at metrics for only so long, you can think about metric collection, log management, and instance costs only so much. The domain doesn't change — you can be in e-commerce, in banking, in any other domain, but in the ops space you're still working with the same instances, the same servers, the same Linux, the same containers, irrespective of which organization you work for. So how do you grow your catalog of skills? We bring the ops team into our architecture and design discussions, so that they come in with a focus on cost, security, and high availability.

That also drove our design decisions. Apart from Ansible, we moved to the Mesos and Marathon stack — these are resource schedulers. Imran touched on this a bit; I think everybody this morning talked about resource schedulers. It really helped us because it gave a unified view of all the underlying resources. As an ops member I can scale the backend to any number of instances, while for developers it's simply "I need five GB of RAM" or "I need one GB of RAM and one CPU". It's very difficult to find an instance on the cloud that matches exactly that profile, and harder still to keep track of them all, so a scheduler that handles placement really helps. Operations became a first-class skill for our developers, and development became a first-class skill for our operations team.
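To make the resource-scheduler point concrete: with Marathon, a deploy is roughly "here's my container, give me this much CPU and RAM, and run this many copies of it". A minimal sketch against Marathon's REST API might look like the following — the Marathon URL, app name, and image are placeholders, not our actual setup:

```python
import requests

MARATHON_URL = "http://marathon.example.internal:8080"  # placeholder endpoint

app = {
    "id": "/search/api",        # hypothetical app name
    "cpus": 1.0,                 # "I need one CPU"
    "mem": 1024,                 # "I need one GB of RAM" (Marathon takes MB)
    "instances": 5,              # scale out or in by changing this number
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/search-api:1.2.3", "network": "BRIDGE"},
    },
}

# Create the app; Marathon finds room for it on the Mesos cluster, so nobody
# has to hunt for an EC2 instance type that matches this exact profile.
resp = requests.post(f"{MARATHON_URL}/v2/apps", json=app)
resp.raise_for_status()
```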
The operability review, which I'll cover in the next couple of slides — I just have a minute, so I'll rush through it — helped us reduce the number of bugs we hit before making the first production push for any new system.

The downside to this whole approach — it's been good so far, but when you have decentralized access and you let developers try out new stuff, it leads to a lot of fragmentation in the deployment stack, because different people want to try different things and there isn't much uniformity in how systems actually work.

To handle that fragmentation problem, we adopted this notion of a tech radar. The tech radar — at least as I first heard of it — came from a company called ThoughtWorks; they publish one every quarter or two on the different techniques, tools, and platforms they use, what has been successful, what hasn't, their war stories, and all of that. This is our tech radar from Indix — that's the link on the slide. You have categories like tools, languages, techniques, and platforms, and then rings: adopt, trial, assess, and hold. Adopt means we've run it in production, we know it works, and we want to keep working with it. Trial is something we've found really promising, and one team is trying it on a small subset of systems to see whether it works well. Assess is something we've heard really good things about but haven't had the time to try out. Hold is something we've tried and don't want to use anymore because of issues associated with it. When you click on each of those items, it expands into a description of how it actually works.

I just need two more minutes. Okay — the last thing is the operability review. I guess a lot of organizations have this notion of a checklist they go through before they push to production. From the Confluence document for our operability review, the two key points are: it helps you identify issues before you make the production push for your system, and every first production push for any newly built system has to get a seal of certification from the operations team saying it has been through the operability review checklist. If it hasn't, that deployment is not considered production — even if you call the environment "prod", it's not production. This also gives the operations team a chance to influence the design and architectural decisions we make in our applications, on aspects like cost, security, availability, and a bunch of other things.

Some of the points we cover: benchmarking or load-testing results; what your SLAs are and whether your system meets them; your data store — is it a separate MySQL or other RDBMS that you run on your own instance, a managed service like RDS, or an embedded database like RocksDB, and how do you manage backup and restore; your security policies — we don't deal with user data, so for us security is mostly about whether the system is exposed to the public, which comes down to open ports on security groups; and what sort of scaling policy you have.
Then: how do you do deployments — is it automated on the Marathon stack, or is it something you do manually? What backup and recovery tools do you have for your data store, if you have one? What are your monitoring and alerting policies — what metrics do you want to monitor and get alerted on? And finally, cost: what are your instance types, how many resources do you need, and do you have an entry in our cost tracking so that we can start tracking your spend? These are some of the items, along with a bunch of others, that we track as part of the operability review. So yeah, that's it — thank you.