Hi, I am Mehul Ved. I work as a system admin for a company called NexSales, and today I'll be speaking about how we went from manually managing our systems to bringing in automation and scripting, so that our systems run and work much better. While we set things up: how many people here still do manual work to manage your systems? Anybody? One, two, three, four. So my talk will cover why we were still doing manual work, what we learned on the way to automation, and which tools we used. I don't really want to focus too much on the tools themselves, because there are a lot of awesome tools out there; I'll speak about what was most relevant to us, and something else might well be more relevant to you. So I won't be focusing on the tools for sure, but on the scenarios we faced and why we arrived at each decision. While we're still getting ready: one of the things we adopted was Ansible. How many people here use Ansible in production? Chef? Puppet? Okay, there's a sizeable number not raising their hands. Is anybody using SaltStack? No? CFEngine? Okay. So Ansible was the first tool we started using, it's one of the things I really love, and I'll be speaking quite a bit about it today. Oh, nice, the laptop is ready. I'll get going in a bit.

It's not a conference until you have one person whose slides don't work. We should be on shortly. As Mehul said, he's a sysadmin. He's very well known in the Bombay tech ecosystem, and in the running world, the swimming world, and the cycling world; he is basically an overachiever. He's going to talk about moving from completely manual deployments to how his company does automated deployments.

Can you guys hear me?
I'll be speaking about how we went from doing everything manually to bringing in automation. Some of the tools we picked were Packer, Terraform, Ansible, and Jenkins. They are awesome tools, but I'm not recommending them as such, because not everything fits every scenario; they are simply the tools we picked, and why we picked them is something I'll cover as part of the talk. You are free to pick whatever you really love, whatever you're used to, or whatever you're already using.

So, who are we? I work for a company called NexSales. It's a B2B marketing and sales solution provider. We had been providing services for the last seven years, and in 2013 we started working on our first product, Voice Reach. I was actually the second employee to work on that product, and it was the first time I had worked on a product at all. I had worked as a sysadmin before, but never on products, and I had never been exposed to a DevOps environment, or a corporate environment with very complex requirements. I was used to small setups where a lot of things were done manually. After Voice Reach, we recently started work on a new product called Right Leads, and we are also working on an API redesign for Voice Reach, called the Voice Reach API platform. These are all B2B solutions we designed for some of our partners. That's the introduction of what we have; now let's get into the real part I want to talk about today.

When we started off with the product, we did have cloud infrastructure; we started on Rackspace Cloud, but we had two machines and we never really managed them properly. Things were just done manually on those machines: we would log into a machine and edit whatever we wanted. That's how we used to manage them.
But at that point, it didn't really matter. The only people using the system were all within our company. They were testing the systems out, with some kind of production use, but it was all internal. If something broke, they would just come running to us: "Hey, our system is broken, please do something," and we would go in and fix things. The volume of usage was much lower. We were still transitioning from the old systems, so if something was critically broken, users went back to the old system; thankfully, we had a good way to migrate data between the systems at that point. So we could manage, but we realised very soon that this was not something we could stick with for long, and we moved into the next phase.

During this phase, we were just two people really doing the development of the application, so nobody was focusing on the ops part. It was just development, development, development, and no one was bothering about ops. Some of the things we were doing were pretty laughable and pretty basic: we were manually hand-editing files, and whatever edits happened, we would just go into a wiki and note them: "Okay, this was the edit, and this is when it was done." But as you realise once you start doing things, sometimes you are working under a very short time frame: management is pushing you, saying "release this today," and you don't get the time to document things. Then you have a change which is undocumented, and that spells disaster going ahead, once these things start piling up.
It starts leading to disasters, and we did end up there. We used to actually deploy applications by running a git pull on the server, and our deploys happened a month or two apart, maybe sometimes even more. Everything was just piling up; there was no way we could have done frequent enough deployments. We would put all the code together and, in the traditional way, release it and fix everything that came up at one time.

Some of the problems: documentation was not updated all the time. If we had reached a scenario where, say, a server crashed or was compromised and we needed to start over from scratch, we would probably have taken a good 10, 15, maybe even 20 days to rebuild things, which was a very scary situation once we went live into production. Code deploys were painful: there was one release where we took three days to debug all the problems caused by configuration errors while deploying the code. Tracking of configuration changes was poor: code changes were pushed to git, but configuration changes were not tracked properly anywhere except the wiki, which was not really reliable. And nobody was handling the ops tasks as a dedicated thing. We were doing the development work, pushing things to the server, and forgetting about them; whenever a problem arose, somebody would go in, and usually that was me, because I was the one more associated with the ops tasks and understood the servers better.

So what happened was that just before a production release, we hit a major, major failure: our configurations were not right, and neither was our code.
We were having git conflicts, and we probably ended up with a week's worth of overtime, running into major issues, and management was really pissed off at us: "What is this? This should not be happening. What are you guys doing?" At that point we really realised we needed something, and we needed to figure out what was out there. So after that release we sat down together and said, let's analyse our failures and look at what the next step could be.

Along with that, usage was increasing, and we realised one of the problems was that we had a single environment: everything was going there. So we decided to have two different environments. One was development, where everything would be pushed more regularly, and the other was much more stable and was what users would use, so a broken deploy wouldn't affect what was actually being used. We also started moving away from the monolithic application. It was one big PHP application at that point, and we started breaking it up into a few microservices, which gave us a lot more reliability, because some things had to be constantly processed in the background and we realised that could be handled much better with Node.js; we had a JavaScript developer with us by then.

Still, we were doing a lot of configuration by hand. What I tried first was a homegrown script of sorts: a git repo in which I was trying to organise the configuration files. But after a little work, I realised it was still becoming pretty complex, with various things needing to be updated together. What would happen is that I was making a change in one place, but the same change needed to go in another place too, such as a service which is a dependency of another service.
So both needed to have the same value, and this was not being reflected correctly when I made these changes. While dealing with that, one of the first things I came across, here at Rootconf three years ago, was this tool called Ansible; somebody introduced me to it. When I first visited the website it didn't make too much sense to me, and I was a bit skeptical about the tool, but I started using it anyway, and it's one of the things I love and still use to date.

With Ansible, the biggest change is that we now have very good tracking of the configuration changes happening to our machines. They are much better reflected: it's in code, it's in the git repo, all the configuration changes are in one place, and we have the dependencies between them mapped out correctly. This has been the biggest change we could have made; the reliability problems we were facing were way, way more simplified just by introducing Ansible at that point. We also added some deployment code. We were still doing a git pull, but the git pull would happen via Ansible. So where earlier I was doing things manually, now anybody could just run the Ansible script and do it themselves; it didn't depend only on me. I could hand it out to others, people could work with the same code, and that allowed us to delegate tasks and also get a lot more consistency.

Of course, this was only the first step, and it wasn't solving all our problems. We still had some really major ones as we moved along. We were doing deployments with just a git pull and git push, and we couldn't really track what was going into the dev machine versus the production machine, or when it was being done.
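As a sketch of what that early Ansible usage looked like, a minimal playbook that keeps a service's configuration in git and does a git-pull style deploy might read like this (the repo URL, paths, and service name are hypothetical, not NexSales' actual setup):

```yaml
# deploy.yml -- hypothetical playbook: config tracked in git, deploy via git pull
- hosts: appservers
  become: true
  vars:
    app_repo: "git@example.com:nexsales/app.git"   # assumed repo URL
    app_dir: /srv/app
  tasks:
    - name: Install the service's configuration from the repo
      template:
        src: templates/app.conf.j2
        dest: /etc/app/app.conf
      notify: restart app

    - name: Deploy the application by pulling the latest code
      git:
        repo: "{{ app_repo }}"
        dest: "{{ app_dir }}"
        version: master

  handlers:
    - name: restart app
      service:
        name: app
        state: restarted
```

Anyone on the team can run `ansible-playbook deploy.yml` and get the same result, which is exactly the delegation and consistency win described above.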
A code push might be done today, but the deployment could happen two days later, whenever we saw it as more appropriate, and there was no real way to get that data; we didn't have anything tracking it back then. Then there were the deployments themselves. We have a complete JS stack: Node.js in the back end and Angular in the front end. One of the things we used to do was run npm install during the deploy, and at times we observed that, maybe due to a network issue, maybe an issue with npm, our builds were not reliable; not all our dependencies would be fetched correctly. So the same build that worked completely fine in dev would not work in production, just because the npm install had failed. We realised one of the things we needed was a reusable build. I'm sure pretty much everybody here is familiar with CI systems: once your CI system does a build, the artifacts are available to you, and if you can test them on one system, you are at least sure the build artifacts work correctly; if there's a failure, it's somewhere else. That was not the case for us at this point.

We had managed to sort out the whole configuration issue, which by itself meant we could rebuild our systems, if we had to, in a couple of days. That was a big, big difference, and a relief, because business pressure would be high; taking some 10 days without anything up and running would be a huge business loss. But besides that, since we had moved to a microservices architecture from the initial monolith, there was no longer just one place where things happened, and with the various microservices, the code integration wouldn't always happen correctly. This was the bigger issue we needed to sort out, and it took us to the next step.
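The "reusable build" idea can be sketched in a few lines of shell: build once, archive the result, and deploy the exact same archive to every environment instead of re-running npm install on each server (the paths and file names here are hypothetical):

```shell
#!/bin/sh
# Sketch: build once, archive the artifact, and reuse the exact same
# archive for every environment. Paths and names are hypothetical.
set -eu

BUILD_ID="local"          # in CI this would be the build number
mkdir -p app artifacts deploy/dev

# In the real pipeline, `npm install` and the build run here, exactly once.
echo "build output" > app/bundle.js

# Package the output so every environment deploys the same bytes.
tar -czf "artifacts/app-$BUILD_ID.tar.gz" -C app .

# Deploying is just unpacking the already-tested artifact, never rebuilding.
tar -xzf "artifacts/app-$BUILD_ID.tar.gz" -C deploy/dev
```

If the artifact worked in dev, the identical bytes go to production, so an npm network failure can no longer make the two environments diverge.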
One of the big changes in this phase was that my role shifted from developer to full-time ops. Now I could actually look at the systems and manage them properly, and we needed that because we started having a higher volume of usage: we had more users, and we actually had more systems. So we needed somebody doing the ops work full-time, and we also hired a couple of other developers to manage things.

Besides that, we started experimenting with AWS, because we realised that even though Rackspace is pretty solid, the costs were already escalating. We had about 8 to 10 machines, and the cost per machine was really high compared to something like AWS. We were also looking at infrastructure that could be built up whenever required and only run while needed; a dev environment, for example, doesn't need to be always running, you only need it when you're actually working. So during this phase we started looking at what we could actually start doing there.

This is where our environments started getting a little more complex, and the frequency of releases increased. With more developers, things working and moving, and frequent new requirements coming in, we were actually doing a few releases every week, maybe one every one to three days. We were still nowhere close to continuous integration or being able to move quickly, but it was a much better place than phase one, where we just pushed everything together into one big release at the end of one or two months. Then we started working with project management tools to tell us what was happening and what was released.
What was to be released, the tracking of features, the tracking of when releases happened: for these things we introduced Jira, Confluence, Bamboo, and Slack. We did try to introduce a CI/CD pipeline at this point, but failed miserably with Bamboo. We didn't have anybody with expertise in it, and we didn't really know where to go with it; we tried to just connect Ansible with Bamboo to do all the work, and it didn't work at that point. So even though builds were there, we were not getting consistent builds across environments. Now that we had about 10 to 12 machines, with something in production, something in beta, and something in dev, consistency and tracking of deployments across environments is exactly what a CI tool should have sorted out by doing the CD part correctly, and we didn't get that right at this point. That was the biggest gap. And we were still manually managing the infrastructure: configuration was automated, but the base system was still manual. Package updates would probably happen with Ansible, but setting up the base system was still done by hand.

With the next release, the focus was on improving code quality, and we hired a person to do QA. We also realised that the person handling QA was good with Jenkins, so we moved to Jenkins during this release, and that was one of the big things that came into the picture. With Jenkins there and some QA tests in place (he wrote some integration tests at this point), there was still a big issue: with frequent releases, a lot of work happening, and more developers, we were running into a lot of Git conflicts.
So I organised a complete workshop with all my developers, and we went through the whole of understanding Git and sorting out Git conflicts locally, but we were still facing certain issues with the deploys and the server environment. I also started working with Ansible for dynamically building and destroying infrastructure. We were looking at AWS at that point, and AWS has very good support within Ansible: you can bring up your infrastructure, configure it, and have it up and running. So I started playing around with that.

Since we replaced Bamboo with Jenkins during that period, which was one major change, we focused on getting that part of the setup right: writing the integration tests, having all the builds done in Jenkins, and building the build-and-deployment pipeline, so that the builds done at the dev stage could be reused for beta and, if that worked, pushed to production. Around the same time, somewhere around mid-2016, we started work on a new project; as I mentioned, we have three products now, and we had started the work on the second product by then.
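The kind of dynamic provisioning Ansible allowed on AWS might be sketched like this, using the classic `ec2` module from Ansible of that era (the AMI id, key pair name, and region are hypothetical placeholders):

```yaml
# provision.yml -- hypothetical: bring up a dev machine in AWS from Ansible
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Start a dev machine in AWS
      ec2:
        key_name: devops-key              # assumed key pair name
        instance_type: t2.micro
        image: ami-0abcdef1234567890      # assumed AMI id
        region: us-east-1
        count: 1
        wait: true
        instance_tags:
          env: dev
      register: dev_machine

    - name: Add the new machine to an in-memory inventory group
      add_host:
        name: "{{ item.public_ip }}"
        groups: dev
      with_items: "{{ dev_machine.instances }}"
```

Subsequent plays in the same run can then target the `dev` group to configure the freshly started machine.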
So we applied all of this from day one on the new project, and that brought a lot more consistency; it also showed us where we had gone wrong with the first product. In the first product we had started applying these things much later, building on top of what already existed, which was a pretty humongous infrastructure with a lot of things we didn't know about; even pretty recently, we didn't know what had gone into some of it. So there were inconsistencies, and unless we were in a place where we could do everything from the start, those inconsistencies would remain. In the new project we ensured that we had full automation and full control of everything, the systems, the builds, the deployments, right from day one.

With more projects coming in, more developers, and more automation, costs also went up, because we had more and more machines, and with those machines the cost started escalating big time. So there was a push from management: you guys really need to start working on reducing this cost. On the other hand, the build time for infrastructure had come down from days to something we could do in a day, probably 8 to 10 hours, which was a major achievement: from two to three weeks down to 8 to 10 hours.

But again, one of the problems was that I was trying to do too much with Ansible. I tried to have it start new machines, bring up the load balancer, and set everything up together; I started writing code to manage the infrastructure within Ansible, and I realised that somewhere I was not doing this right: I was creating something way more complex than is manageable, and taking too much time. Then, luckily or unluckily, I don't know, the biggest hurdle came when we decided we were going with
Google Cloud and not AWS. Ansible uses something called libcloud to talk to Google Cloud; they were working on replacing it with native API calls, but that wasn't there when we were working on it, and I'm not sure if it's done yet. We wanted an L7 load balancer in front of our machines to balance all the HTTP traffic coming in, and we couldn't do that with Ansible. So I started looking for solutions and reached out to some of my sysadmin friends, and that is where we moved to our next level of automation.

The scenario: we had three projects by now, and we were close to having 20 to 30 machines, and we didn't need all of them running all the time. So we started working on how to cut down the cost and how to manage an infrastructure that much bigger than before. Some of my friends told me to bring in Packer and Terraform. With Packer I could do my builds and have an image ready with all my base things built into it, and Terraform would take the Packer image and start the machines from it. With Terraform I could actually define my whole infrastructure: the load balancer, the networking, the machines, the database, everything could be defined in a much, much simpler way. I didn't have to write logic about what to start first and what later; Terraform just understands the dependencies involved. It's also declarative code, which I realised is much, much simpler to manage: I didn't have to write logic on how to do things, I just needed to define what I wanted, and Terraform would give that to me. Then we still kept
Ansible; we just reduced its scope. We still do deploys using Ansible and we do configuration using Ansible, and it's working brilliantly. What we have done is tie all three things together and create tasks within Jenkins. Now releases don't require my involvement: everything has been fed into Jenkins, there are certain parameters my release managers understand, and they just set those parameters whenever they need to do a release. Even our devs can do it: they just specify the release number and the environment, and the release can be done right from the Jenkins environment. They don't need my help for every release; only when things fail do I need to look into it.

That has allowed us to reach a stage where we can actually do multiple code deploys a day. We are doing that in the dev environment, basically in the non-production environments; we are not yet at a phase where we are that confident about our builds, but we are working on it, and hopefully in the next couple of releases we will be doing multiple pushes to production in a day as well. With this whole setup, if I want to build new infrastructure, I can actually do it in some 15 to 20 minutes now. The code written for Packer creates a machine image and stores it in Google Cloud images, Terraform just takes that image and starts building the machines, probably done in another five to ten minutes, and Ansible kicks in and configures everything once the infrastructure is ready.
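The declarative Terraform style described above might be sketched like this, assuming a hypothetical project id and Packer-built image name; Terraform works out the ordering (network before instance) from the references between resources, with no start-up logic to write:

```hcl
# main.tf -- hypothetical sketch: a machine built from a Packer image,
# plus the network it depends on, declared rather than scripted.
provider "google" {
  project = "nexsales-demo"   # assumed project id
  region  = "us-central1"
}

resource "google_compute_network" "app_net" {
  name = "app-network"
}

resource "google_compute_instance" "app" {
  name         = "app-server-1"
  machine_type = "n1-standard-1"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      # The image Packer baked, with the base system already set up.
      image = "app-base-image"
    }
  }

  network_interface {
    # Referencing the network resource tells Terraform to create it first.
    network = google_compute_network.app_net.name
  }
}
```

A `terraform plan` shows exactly what would change before `terraform apply` touches anything, which is part of what makes the declarative approach easier to manage.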
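A parameterized Jenkins pipeline along those lines might be sketched like this (the playbook path, inventory layout, and parameter names are hypothetical, not our actual job definitions):

```groovy
// Jenkinsfile -- hypothetical sketch of a release job a dev or release
// manager can trigger by picking a release number and a target environment.
pipeline {
    agent any
    parameters {
        string(name: 'RELEASE', description: 'Release number to deploy')
        choice(name: 'ENV', choices: ['dev', 'beta', 'production'],
               description: 'Target environment')
    }
    stages {
        stage('Deploy') {
            steps {
                // Ansible does the actual configuration and deploy.
                sh "ansible-playbook deploy.yml -i inventories/${params.ENV} " +
                   "-e release=${params.RELEASE}"
            }
        }
    }
}
```

Because the job only asks for the two parameters, nobody needs ops help for a routine release; ops is only pulled in when the job fails.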
This has allowed us to have the complete DevOps cycle that you really want: the dev writes the code, it goes to our CI system, which does the build; the build is pushed into the system, which can then be brought up in minutes; and once the new infrastructure is up and a deploy has been done, the tests run on it and give us immediate feedback. We can have all of this tracked in a couple of central places: Jenkins already holds a very good amount of information, we are pushing things to Slack, and the next step is to push it into a monitoring system that gives us a complete dashboard of what happened at what point.

Of course, that is where we are today, but even today we have a few more problems that we'll be sorting out in the near future. One is that in certain places we have our code and data on the same machine, which means we cannot just bring down those machines and bring up new ones unless we handle the data correctly. So we are working on bringing in that separation: the data machines will be separate, with backups and everything else handled for the data on its own, and the code machines will become very trivial to scale up and down, just destroying one and bringing in a new machine whenever you want. Then, whenever we do a new release, we can bring up a completely new machine and, once it's ready, bring down the old one; we want to get into that kind of a phase, and this is one of the factors stopping us.

Another thing is that I've written a lot of code, but it was written while we were still figuring things out. One thing I would really love to do is rewrite my code so that if I have to start a new project at any point, I can just reuse the existing code and do all of this with very little work.
So that is still missing from our automation. The idea is to be at a stage where, if any eventuality occurs, we are prepared for it. For example, being in multiple regions: the cloud providers, including Google, give you multiple regions today, and we are not using that. We are still in one region, and if that region has a failure, my code is not ready to immediately switch to a new region; our data is there, and it needs to be backed up in such a way that we can bring it up in a new region and use it there. So some of these things still remain, and they stop us from having a very reliable infrastructure and a fully automated system that can really work without me having to look after it all the time. That pretty much concludes what I have in my presentation. Would anybody like to ask any questions?

Hi, yeah. So my question is: currently, in my company, we have migrated from monolithic to microservices, and we are doing deployment with Jenkins. But the problem is that we are currently having five million ad requests on a daily basis, which doubles on weekends. So now we are at a stage where we are thinking of auto scaling, but in Jenkins we need to configure all the servers manually. Is there any plugin, or a way, where something like auto discovery can happen in Jenkins, so that when we scale up we can deploy the code onto the scaled-up machines?

This is something that I am currently working on for our next release: the deploys don't happen to the systems; the deploy happens to my Packer build. Packer builds and keeps everything in the base ready, and when Jenkins has a new build, that build gets baked into my image. Using that image, I can start the new machines.
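A Packer template for baking a build artifact into a Google Cloud image might be sketched like this (JSON was Packer's template format at the time; the project id, base image family, and artifact path are hypothetical):

```json
{
  "builders": [{
    "type": "googlecompute",
    "project_id": "nexsales-demo",
    "source_image_family": "debian-9",
    "zone": "us-central1-a",
    "image_name": "app-{{timestamp}}",
    "ssh_username": "packer"
  }],
  "provisioners": [
    { "type": "file",
      "source": "artifacts/app.tar.gz",
      "destination": "/tmp/app.tar.gz" },
    { "type": "shell",
      "inline": [
        "sudo mkdir -p /srv/app",
        "sudo tar -xzf /tmp/app.tar.gz -C /srv/app"
      ] }
  ]
}
```

Every machine started from the resulting image already contains the tested build, so scaling up is just starting more copies of the image rather than deploying to each new machine.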
As for auto scaling, I have not used it yet, but with auto scaling I'm guessing you can specify which version of the image you want to use. So say you want to go from one machine to two: you start one machine with the new image, and once you are sure it's working, you can specifically take the old machine out and start one more machine with the new code. The whole thing is baked into your image, so Jenkins doesn't need to know which machines exist; Jenkins just deploys the code into your image instead of onto the machines. After that there is of course configuration required, and to know which machines are to be picked up, I use dynamic inventory in Ansible. So again, Jenkins doesn't need to know; it's Ansible that knows which machines are relevant. Basically, you need a configuration management tool, and Ansible is one. If you are in AWS you can use their own tools, and depending on the size of your company and your requirements you can figure out which tools are relevant to you, but Ansible is what we use. If you'd like to talk more about Ansible, I'm always happy to do that; there are probably some Red Hat folks around, and if anyone from the Ansible team is here, they would definitely be able to assist you. Okay, the next question is from here.

Hi, I have a question on the tools you talked about. Consider that you have a cluster already created using your system; tomorrow I want to upgrade the same cluster. How is it going to happen with zero downtime?

Sorry, this is a cluster of what? A cluster of the application, of the instances you spin up?
Okay, so we have Node.js, and let's say we create a cluster of Node.js machines. We have not really experimented with it, but what I can think of is that, instead of sending all the requests to the old version, we would create one machine with the new version of the code and have our load balancer only push