So, good morning everybody. I'm Madhu Ranjit, and I'm part of Apigee's DevOps team, or deployment team, or whatever you want to call it. We recently renamed ourselves to BAD, which stands for Build And Deploy, a name a lot of people can identify with. Today I'm going to be talking about how we use a combination of Vagrant and Puppet to spawn off environments on the fly. I might be using the word "planet", because internally we use the term planet to refer to environments. For example, production is the Earth planet, something like stage is Mars, and so on. So if I say planet, it means environment.

Okay, with that, let's get started. Brief agenda: briefly about what we do, what problems we were trying to solve, what we did about those problems, testing, and some challenges.

So, what does Apigee do? We're basically an API platform. If a customer comes and says, hey, I want to expose so many APIs to the outside world, maybe for my customers or my partners, we're a cloud slash on-premise platform for that. There are various components; for example, the cloud platform has a load balancer in front that receives all your traffic. Imagine us as a middleman: you have your customers, you are our customer, and we sit somewhere in between. Your customers' traffic hits our platform first, which then makes a call to your apps, gets the response back, does some transformation, and returns it. That's what we do, and we do some data analytics on top of it.

We take care of transforming the call on its way through. For example, if the backend returns XML, we can convert it to JSON and hand that back. We do some caching, and we do rate limiting; rate limiting is a big thing right now. We also do things like OAuth and so on, so that you can expose your APIs to your end users in a secure way.

Having said that, here's the problem we had. We are completely on AWS, and for a very long time we've had static environments. The size of an environment would be anywhere between 60 and 180 machines, and they would all be up all the time. You can imagine the amount of money we pay AWS just because of that. And we have roughly six to seven product teams.

There are other challenges with that too. For example, you cannot deploy directly into a stage environment, right? You have to go from your initial smoke environment, your first environment, to the next one, and promote the bits through the pipeline. The problem is, if I have six to seven teams all trying to deploy into the same environments, it's always going to be chaotic. We had those problems for a very long time: I deploy a new version of the software, someone else has been testing on that environment, and everything breaks. That's one of the things we wanted to solve.

The second thing was self-service. I think the previous speaker was also describing something like this: use an environment whenever you want, bring it up, test, and then bring it down. We wanted a self-service platform.

The other thing is that we're also on-premise, and we also do some hosting. So the tools we chose had to work directly on-premise or on any cloud.
It should not be AWS-only. If tomorrow we move to OpenStack or another cloud, or if our customers are on OpenStack, we should be able to support them. And finally, it should be usable by devs and by pre-sales. The pre-sales guys have their own dedicated set of machines on AWS which they use to demo things to customers. And even things like regression: a regression run can take anywhere between 10 and 80 machines, depending on the product mix.

So we came up with something called AutoPlanet. This slide is how we run the command, and it took us from here to there. I guess you can't read it, but what it essentially is, is a set of scripts which, given some input, will parse the whole thing, create machines, apply the software for you, and then wire it all up. Let me talk a little more about what we did.

Here's the flow we follow. We have environment profiles: each team can have its own profile describing what machines they want to spawn and what configuration they need. On a single machine I can have one or more applications, and there are some base applications which go on every machine. Take the regression team, for example: they would come and create a profile. The profiles are all in YAML, and we have a standard profile design, so people go and create whatever profiles they want. The regression profile, say, is 80 machines, so we'll have 80 machines in that environment. As part of the environment profile we specify the type of machine, what applications are supposed to run on it, the number of machines, and where to run it: on AWS, or on your local laptop. Currently it works really well for a developer or tester both on a laptop and on AWS. (I'll sketch what one of these profiles could look like in a moment.)

The secret ingredient to all this is actually Vagrant. When we spawn machines, we take those profiles, and we have a script which generates a Vagrantfile out of them. The Vagrantfile defines what machines you want to bring up and what configuration you want to run; I'll show some examples around that. Then we run the configuration and wiring part, and for that we use Puppet and Vagrant plugins. At the end of it, we need to test the planet. Because we use Puppet, and Puppet doesn't have something like Chef's minitest handler, we use the other cool tool a lot of people are using now, serverspec. With serverspec you define tests; it has its own DSL, very simple, written in Ruby, and with it you can test the whole thing. For example, for any machine, I can look at what applications it runs and go test each of them.

Okay, Vagrant. In hindsight, if we had been using Chef, I don't think we would have even needed Vagrant for all that we're doing today. One of the problems with something like Puppet is that there's no orchestration piece. People might use MCO, MCollective, et cetera, but that's more for saying "run this on this machine"; there's nothing that sits there and gives you one common interface for spawning off a machine.
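To make the profile idea concrete, here's a rough sketch of what such an environment profile could look like. The field names and layout here are just illustrative, not the actual schema:

```yaml
# profiles/regression.yaml -- hypothetical layout, not the real schema
planet: regression-1
provider: aws            # or virtualbox for a laptop run
machines:
  - name: zookeeper
    count: 3
    instance_type: m1.large
    applications: [zookeeper]        # plus the base apps every box gets
  - name: management
    count: 2
    instance_type: m1.xlarge
    applications: [management-server]
```

A script then walks a YAML like this and emits one Vagrant machine block per instance.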
Chef, by contrast, lets you say knife ec2 server create, or knife openstack server create, and so on. We needed something like that, and one of the best things we could integrate with was Vagrant.

So, how many of you have used Vagrant? Vagrant is a tool written by Mitchell Hashimoto, and his idea was very simple: one common contract that sits on top of VirtualBox, or AWS, or OpenStack. Those are the providers. You define whatever configuration you want in one format, and provisioners are integrated into it. The provisioners are your standard Puppet, Chef, plain shell. You put in whatever configuration you want, and when you say vagrant up, the machine comes up and the provisioner takes over: if you have Chef scripts it runs Chef, if you have Puppet scripts it runs Puppet.

Here, for example, if you look at the left side of the slide, this is a standard configuration for a ZooKeeper machine. That's the machine block, the ZooKeeper server, and we define all the Puppet modules and so on. (I've sketched a version of it below.) The interesting part there is puppet.facter; just keep that in mind, and when I talk about Puppet I'll say a little more about it. The configuration in red is not going to change whether I run on VirtualBox or AWS or any other provider. That's the cool part, because it means that if I have a script which generates that part for me, I'm really good. The only thing that actually changes is the provider configuration, for AWS or VirtualBox or whatever; the base configuration always remains the same.

So, Puppet. Problems with Puppet, right? We run close to 3,000 servers on roughly 12 Puppet servers. That's a problem in itself, because when a Puppet run starts, the server compiles the whole damn catalog right then, so we see crazy loads on our Puppet servers. One of the things we wanted when we built this was to get away from that and have something that scales well, which means going Puppet masterless. Once you're masterless, you don't have to call the Puppet server at all.

What we do for that comes from Jordan Sissel, the guy who made logstash and a couple of other really cool tools; he initially came up with this. He was having exactly the same problem, and his answer was something called the truth enforcer. In Puppet you have classes, and you can do inheritance and all those things you would generally do with, you know, your Java class or whatever. He said: instead, let's do what's called fact-based, or nodeless, Puppet. The facts are the configuration you see here, and they're what we pass in instead of the usual node classification. Things like: planet is this, applications are these. For applications, for example, you pass a JSON which says you have applications A, B, C, with all the information about them.
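Since the slide isn't readable here, here's a minimal sketch of what one generated machine block might look like. Box name, AMI ID, and fact values are placeholders, but puppet.facter and the vagrant-aws provider options are real Vagrant features:

```ruby
# Generated Vagrantfile (sketch): one ZooKeeper machine
Vagrant.configure("2") do |config|
  config.vm.define "zookeeper-1" do |zk|
    zk.vm.box = "centos-base"          # placeholder box name

    # Provider-agnostic part: identical for VirtualBox, AWS, OpenStack...
    zk.vm.provision :puppet do |puppet|
      puppet.manifests_path = "puppet/manifests"
      puppet.manifest_file  = "site.pp"
      puppet.module_path    = "puppet/modules"
      # Inject the "truth" as facts; the nodeless manifest dispatches on these.
      puppet.facter = {
        "planet"       => "mars",
        "applications" => "zookeeper",
      }
    end

    # Provider-specific part: the only piece that changes per target.
    zk.vm.provider :aws do |aws, override|
      aws.ami           = "ami-xxxxxxxx"     # placeholder
      aws.instance_type = "m1.large"
      override.ssh.username         = "ec2-user"
      override.ssh.private_key_path = "~/.ssh/planet.pem"
    end
  end
end
```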
Now, the thing the truth enforcer does really well is that you have the exact same manifest (the Puppet word for a cookbook is a manifest) running on every single machine. When it runs, it's basically a whole bunch of if conditions: if this machine has this particular application, include that class. If it has ZooKeeper, for example, we include the zookeeper class, which goes and runs the ZooKeeper manifest.

The other thing we learned, and a lot of people don't know this about Puppet, is to use parametrized classes. It's not very well publicized; maybe the documentation is there, but if you look at manifests in the wild, people do not use parametrized classes. That's something that helped us quite a bit: we pass in things like ports and applications and so on. (There's a sketch of the whole pattern below.)

So far, then: given an environment YAML, a bunch of scripts turn it into a Vagrantfile, and we run vagrant up, which brings up a whole bunch of machines based on that configuration.

One big thing I'd like to touch on is multi-region. If you go multi-region, our recommendation is to go with the AMI from Vyatta, a company that's really good with VPNs. You use AWS's VPN on one end, which gives you a customer gateway, and on the other end you have Vyatta. There's a paid offering as well, but the free AMI works really well.

Once the machines are up, we run Puppet masterless, driven by the facts I showed you, followed by the Vagrant plugins. One thing about our applications: just because Puppet has run, at the end of it the environment is not yet usable, which means you cannot actually send traffic. What we have to do is register each of the components with the management component, which in turn stores that state in ZooKeeper. So imagine, say, 20 machines coming up in parallel. At the end of it they're all configured, but you still need this wiring part, which registers each one with the management piece. You can do that well with Vagrant plugins.

Again, a lot of people do not use Vagrant plugins. If you're using Vagrant, I highly recommend them, because you can make Vagrant do whatever you want. I've put in a couple of samples here. For example, we run something like vagrant apigee management wire, which runs a whole bunch of REST API calls against the management component and wires everything up. Then you can do things like this: for the same ZooKeeper, at the end, if I run vagrant apigee status, we spit out a JSON which goes and figures out the status across all the machines, what's working, what's not, and so on.

We've also written plugins, which we plan to open source soon, for managing various other components: plugins to manage Route53, plugins to manage ELBs. It's very simple: Vagrant gives you a bunch of classes, you inherit from the provider classes like you would with any standard tool, and you write the whole thing. We use the Fog API at the backend. So there's vagrant elb list; this one we sort of borrowed from Chef, I'm a big Chef fan, so instead of knife elb list I do vagrant elb list.
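Going back to the fact-based manifest for a second, here's a minimal sketch of the pattern, in the style of Jordan Sissel's nodeless-Puppet example. The class names, fact names, and parameters are invented for illustration; member() comes from puppetlabs-stdlib:

```puppet
# site.pp (sketch): the same entry point runs on every machine
node default {
  # "applications" arrives as a comma-separated fact injected by Vagrant,
  # e.g. "zookeeper,cassandra"
  $apps = split($::applications, ',')

  if member($apps, 'zookeeper') {
    # Parametrized class: values come from facts, not node definitions
    class { 'zookeeper':
      port   => 2181,
      planet => $::planet,
    }
  }
}

class zookeeper($port = 2181, $planet = 'earth') {
  # Ordering is explicit: package, then config file, then (notify) service
  package { 'zookeeper': ensure => installed }
  -> file { '/etc/zookeeper/zoo.cfg':
    content => template('zookeeper/zoo.cfg.erb'),
  }
  ~> service { 'zookeeper': ensure => running, enable => true }
}
```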
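And on the plugin side, a bare-bones Vagrant command plugin looks roughly like this. This is a skeleton under assumptions, not our actual plugin: the module and command names are hypothetical, and Fog credential handling is elided:

```ruby
# vagrant-elb.rb (skeleton): adds a "vagrant elb"-style command
require "vagrant"
require "fog"

module VagrantElb
  class Plugin < Vagrant.plugin("2")
    name "elb"
    command "elb" do
      Command
    end
  end

  class Command < Vagrant.plugin("2", :command)
    def execute
      # Credentials and region would normally come from config or env vars.
      elb = Fog::AWS::ELB.new(region: "us-east-1")
      elb.load_balancers.each do |lb|
        @env.ui.info "#{lb.id}  #{lb.dns_name}"
      end
      0 # exit status
    end
  end
end
```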
And in Chef you would also have something like knife node show; similarly, we have a show command that gives you a lot more information about an individual ELB, or a Route53 entry, or whatever. So the recommendation is: use plugins. The whole wiring part I was talking about is done completely via plugins.

Now for the next part. Where we've got to: we have a profile, a bunch of scripts, and out comes a kind of massive Vagrantfile. Nobody ever needs to look at that Vagrantfile; it's generated by an automated script, and we can regenerate it as many times as we want. Then we spawn the machines. If I have 80 machines, one of the problems is that you do not want to do that serially. Vagrant introduced, and we contributed some of this as well, support for bringing machines up in parallel. If I have 80 or 90 machines, obviously 90 threads would drive the machine mad, so you can cap it at maybe 10 to 20. We've seen that beyond about 15 threads, AWS starts terminating machines it has just created, so we bring up around 12 to 15 machines at a time. Then we run Puppet masterless, and then we run the Vagrant plugins. So that's four steps.

The final step, the thing everybody wants before they log in: how do you know everything is set up correctly? For that, as I said, we use serverspec. If you look at the example, it's very simple: you say describe port, it should be listening; describe service, it should be enabled and it should be running. (The sketch below gives the flavour.) At the end, we run all of this through rake. At the end of machine creation we generate a JSON file with all the metadata about the machines: machine names, IP addresses, port numbers, all those kinds of things. As part of serverspec we have a Rakefile that just parses that JSON and says: these are all the ZooKeeper machines, go run the ZooKeeper tests on them; these are the management machines, go run the management tests. So you can parallelize the whole test run; worst case it takes maybe a minute to run everything, and then the environment is up. This slide is a case where a test has actually failed, so something is not right and you need to look at it; it's just some sample output.

So that's the whole flow, right from a simple YAML file that defines profiles for various teams. The way we onboard teams is to say: you fork the repo, create your own profile, test it out, and send a pull request. Then we generate the file, spawn the machines, wire them up, and run the tests.

Okay, what about the challenges here? Even with this, one of the things that kept happening was that we'd have seven or eight different people from various teams spawning their planets, and we had no visibility into what was running and for how long. So we created a Jenkins job which sends us a daily report of the number of machines running, because the business question has always been: if we're moving away from static planets to something like this, how much money are we actually spending?
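For flavour, here's roughly what those serverspec tests and the rake glue look like. The JSON layout and file paths are invented; the describe port / describe service matchers are standard serverspec:

```ruby
# spec/zookeeper/zookeeper_spec.rb
require 'spec_helper'

describe port(2181) do
  it { should be_listening }
end

describe service('zookeeper') do
  it { should be_enabled }
  it { should be_running }
end
```

```ruby
# Rakefile -- one task per machine, driven by the generated metadata JSON
require 'json'

planet = JSON.parse(File.read('planet.json'))   # hypothetical layout

namespace :spec do
  planet['machines'].each do |m|
    desc "serverspec for #{m['name']}"
    task m['name'] do
      # spec_helper reads TARGET_HOST to know where to SSH
      sh "TARGET_HOST=#{m['ip']} rspec spec/#{m['role']}"
    end
  end
end
```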
So, one of the things we're doing now: with EBS-backed instances we can just stop them. We say ./autosetup, name of the planet, pause, and it stops all the machines; next morning, when you come to the office, you do a resume. And by default, if we don't hear anything about a planet, every night at around 11 o'clock scripts run from Jenkins and pause all the planets.

User data is something that's really, really interesting, and I think all the platforms have this concept. When a machine comes up, it has to identify itself: who am I? In Amazon you can get that sort of information via a curl call to 169.254.169.254. As part of user data, what we do is bootstrap Puppet: the machine comes up, we inject facts into it, it installs Puppet, and then it begins to run the manifests. A lot of people don't do this, I don't know why, but if you're using something like AWS, please go ahead and do it; a similar concept exists on almost all the platforms, and it's a very useful thing.

Then there's parallel execution, which I've already spoken about. Another gotcha: until recently the stock images didn't allow this out of the box, because /etc/sudoers has a setting called requiretty, and if you don't comment that out, you cannot run sudo commands over a non-interactive SSH session. So as part of user data, the other thing we do is comment that out. It depends on your image: if you have a custom AMI it may already have requiretty disabled, but the standard AMIs do not. So we comment it out, install Puppet, and let it run. The provisioning part then uses this Ruby gem called parallel (you could use GNU parallel as well) to run all these jobs in parallel, and the Puppet runs happen in parallel too.

Even after that we constantly had problems, and the problems were because teams keep introducing new features, and the configuration files are part of the applications. So where do you store them? Do you store them with the source code, or in a place like Puppet or Chef? Initially, in our organization, we stored all the configuration files in Puppet as templates. That is actually a bad idea, because every time the real configuration changes, say you introduce a new parameter called foo, then unless you also go and update the Puppet template, the next time you deploy, the new RPM or tarball gets installed, but then Puppet runs and just overwrites the whole file. It's crazy. So the takeaway is: maintain your configuration files yourself, with the application, so that your Puppet manifest or Chef recipe is as simple as, for example, installing HTTP: you do a yum install, maybe modify a few values in the configuration file, and start the service. So we moved a lot of this embedded stuff into the source code.

That works really well for static environments. Now, I told you briefly about Hiera. Hiera is a key-value store which can be backed by plain YAML files, or Redis, or MySQL, and it works really well with Puppet.
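A minimal sketch of how Hiera looks when it's wired to the planet fact; paths and keys here are illustrative:

```yaml
# /etc/puppet/hiera.yaml -- hierarchy keyed on the injected planet fact
:backends:
  - yaml
:hierarchy:
  - "%{::planet}"
  - common
:yaml:
  :datadir: /etc/puppet/hieradata
```

```yaml
# /etc/puppet/hieradata/mars.yaml -- per-planet overrides
api::rate_limit: 1000
zookeeper::heap_size: '2G'
```

In a manifest, hiera('api::rate_limit') then resolves to the value for whichever planet the machine's facts say it belongs to, falling back to common.yaml.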
For example, Puppet does not really have the notion of environments. I mean, it's there, but it's not as good as the one in Chef. In Chef you'd have a JSON or Ruby file that you just check in, and every time a node runs in that environment, it picks that up. To pick up environment-specific values in Puppet, you store those values in Hiera: I have different values for different environments, and when Puppet does a Hiera lookup it picks up the right one.

Now, if you move all the configuration files into the source code, that works really well for static environments, but here things get spawned off on the fly, which means the Hiera values for a freshly spawned planet might not be there, and even when they are, they might not be right. So the thing we ask people to do is create a mapping file for us. We're mostly a Java shop, so this YAML file lists the properties file, the properties that need to be replaced during deployment, and the corresponding Hiera key. The one thing the developer has to do for a value that needs to be overridden is add the Hiera value and make an entry in this file. As part of deployment we parse this file and go and replace all the values. That got rid of about 90% of the failures we used to hit whenever somebody added a new configuration property. There's probably an even better way to do this, and I guess we're still learning, but we've seen a lot more success with it, because now the developer has to make at least one check-in saying which Hiera value should be picked up, so the developers themselves are aware of it, instead of knowing nothing about it.

Finally, and I think the earlier speaker said this as well, you need a single interface, some fancy UI, which does all of this. In the meantime, a simple, stupid solution is Jenkins jobs: people go to a create-planet job, it takes a bunch of parameters, including which environment profile to use, and gives you the result. I had a demo, but I'm using a shared laptop today, so I can't show it to you. So, any questions? Any questions on the choice of technology? No questions means either you understood nothing or you understood everything.

[Audience] Did you face any problems with Vagrant? When we used it, it crashed a lot, sometimes blew away the wrong VMs, and was generally unpredictable.
Okay, so there have been a lot of issues around that, especially with VirtualBox, but the newer versions, Vagrant 1.4 and VirtualBox 4.3 plus, are way more stable; they've fixed quite a few issues, and you can see that in the recent Vagrant changelogs. We run maybe two machines locally. The other thing is that even on a decent laptop, after about three VirtualBox machines it becomes really slow, so we're looking at moving to Docker and the like, so that we run only two virtual machines with maybe seven containers spread across them. Vagrant also has another provider now called vagrant-lxc that you can just plug in; next time I give this talk I should probably show vagrant-lxc, it works quite well.

[Audience] Can you compare Puppet and Chef in more detail?

Okay, I can go on about this for a while. I think Chef has evolved way more than Puppet. The people who started Chef came out of the core Puppet community, and one of the first things they introduced was knife, the management CLI, which Puppet does not have; I think that made a huge difference. The other thing is that Puppet says you, as the user, should define the ordering: if I just write a manifest and leave it, the resources can run in any order. The funny thing is, we were a startup and we wanted to ship fast, so people kept writing manifests, and at some point a lot of this ordering got missed. There were roughly 180 different modules, and each module has multiple manifests. I don't want to blame anyone, they had their pressures, but the ordering isn't there, so we actually run Puppet twice: on the first run nobody is sure it will converge, but by the second run everything it depends on is already in place. It's very, very important in Puppet to define the order of resources, and it's simple to do, so just do it.

The other thing with Puppet: the fact that I'm running 12 Puppet servers is something I'm really not proud of, and when we move completely to Puppet masterless, those go away. What the Puppet master does is compile the whole catalog in memory every single time you run it, and eventually hand you an insanely massive JSON file; the more manifests, the more compilation time. Chef doesn't do it like that: the state is already stored in the server's database, so it just serves it to you. If I have 40 clients hitting the Puppet server at the same time, that creates 40 threads, and if each has 20 to 30 manifests to compile, it takes a lot of time; we see the load spike on the servers. So today, if somebody were starting something new, I would actually tell them: go with Chef, you get good tooling, you'll have to invest a bit in it, but don't do Puppet. I think for most organizations the eventual move is away from Puppet.

You'll also use MCO quite a bit, and the problem is this: you basically have plugins in MCO where you say the equivalent of what you would do with knife. In Chef you would say knife search with an environment and a role, say web servers, and it does a Solr
query for you, gives you all the matching nodes, and then you SSH or whatever and apply things. MCO, by comparison, is supposed to scale better because it runs over a message bus, ActiveMQ, which sits in the middle of all the work. The problem is this: we manage a whole cluster of these ActiveMQ brokers, each one talking to the others, so when I fire an MCO query it goes to the first broker, which fans out to the other brokers, which fan out to all the servers. So every MCO query really does go to every machine; if you watch the logs you'll actually see "hey, are you this guy?" and on most machines the answer is no. We had a lot of issues with MCO at scale, at 2,000-plus machines: you run a query and it gives you only maybe 70 to 80 percent of the actual number of nodes, so if you expect 100 nodes you might get 70 or 80 back. That's the problem with that style of orchestration. knife, for example, doesn't do that: it just looks at its database, it's a Solr query, so it's way more stable and gives you all the nodes. SSH is slower, that is true, but it's more reliable.

So eventually, maybe six months from now, we'll have moved off Puppet completely. We've already started using Ansible; maybe at one of the next Rootconfs I can talk about how we're using Ansible alongside Puppet. We do a lot of our orchestration with Ansible, and it works quite beautifully. Puppet needing two runs is still a problem: when we bring up a new machine today, the machine itself comes up in about two minutes, but for it to be actually ready takes another twelve, so fourteen minutes in all, largely because of how long the Puppet runs take. Up to maybe 100 to 500 nodes, Puppet or Chef plus MCO should handle all your cases, but once you go past that, and we run 2,000-plus, everything starts to crack. That's when I say: if you're using Puppet, go masterless. One thing we do, and in fact a lot of people have done this as well, is commit to Git, push the manifests out to S3, and when the machine comes up, its user data pulls them down and runs Puppet locally. That's the right model to follow if you're doing masterless Puppet. And have some other system which collects all the node data and feeds it into your inventory and monitoring, which is the other thing everyone does anyway: make sure that every time the monitoring system runs, it adds new nodes and removes the ones that are gone. You have to do a bit more of that yourself with something like Puppet.

As for spot instance support: earlier I thought only Vagrant's AWS provider let you work with spot instances, from something I saw a couple of months back. I haven't worked much with Chef recently, but I believe it has support now; that's probably a good thing for you to confirm.

[Audience] Can you comment on your VPC setup, public and private subnets across regions?

Okay, so when you have a VPC with public and private subnets, you have a NAT instance in between. Now, when I connect one region to another,
to connect across to the other region, say I have two ZooKeepers sitting here and two ZooKeepers sitting in another region, and both sets are in private subnets, the ideal way to connect them is a VPN tunnel in between the two. We have not faced any performance issues; the only thing you have to monitor is the Vyatta itself. On one end Amazon already gives you the customer gateway, and you can say what your endpoint is; if it's Cisco, it gives you a Cisco configuration. With Vyatta you can configure it in a couple of ways to make sure the tunnel is always up. There's another company which does something similar as well. So your only real job is to make sure the VPN is up and running. Earlier this was not stable; people used to run two Linux boxes with a VPN between them and keep that alive. You can definitely still do that, but at least for test and dev we haven't had problems. For the last year we've used something like version 6.5, 64-bit, and it's been quite stable.

[Audience] Is Puppet your orchestrator here?

No, no. Puppet only runs on the box; it has no idea about anything outside the box. Puppet is like your chef-client: it gets some state and enforces it on the box. I need something which controls things from the outside, and that's MCO; people use MCO to control it from outside. So Puppet is my provisioner, sorry, my configuration tool, not my orchestrator. People have built some plugins around it, but I have not seen much use of those. No, it is not an orchestrator; primarily it's best as configuration management.

[Audience] Why ZooKeeper in all your examples?

No, no, ZooKeeper is just Apache software; I picked ZooKeeper because it's agnostic to any of our internal components. Think of it as just another machine. I picked it so that I could take one component and run it through the whole story, right from configuration to test. It could just as well be your web server, nginx: you bring up the machine and have scripts which pull everything in.

[Audience] Where do you run all of this from?

That's a good question. You can have a jump server from which you run all of this, and that's what we typically do. The other thing you can do is set up an OpenVPN server on your NAT instance; from your local box, with the certificates, you connect to it, and then you have the exact same access to the public and private subnets and can run everything locally. Once you have a VPN into the private subnet, it doesn't really matter where you sit. So two ways to do it: the jump server in the public subnet, or the VPN if you're using that.