Carrying on, for the afternoon, we've got Andrew Bowe from Catalyst talking about orchestration war stories.

Thank you very much. You can all hear me okay? Yep? Good. So, my name is Andrew Bowe. Firstly, I'd like to thank you for the opportunity to present to such an impressive crowd. I'd actually conceptualised a much smaller event, given that it was a mini-conf, but it's great that we've got the opportunity to share some of our learnings. What I've aimed to do with this talk is to talk about the things we've seen and done using the tool sets that were decided upon by the team and the client to solve orchestration and cloud-based infrastructure challenges.

So, just a little about what we've done. I'm the managing director of the Australian end of the Catalyst business. We've got about 35 staff, and we've had a fairly big involvement with AWS since the launch of the Sydney region in November 2012. Our broad experience with AWS as an infrastructure-as-a-service offering covers hand-rolled stacks and infrastructure, just standing things up using the GUI and connecting to machines; the golden master, red-green deployment, stateless immutable approach, which I'll talk about a wee bit later; and OpsWorks. Now, OpsWorks is closer to your real-deal orchestration tool, which allows you to define stacks as code, right? There are other methodologies, but OpsWorks looked very appealing to us at the time and we got a lot of buy-in from the client, which is actually one of the reasons we've been able to learn so much. Because even out there in the world of cloud, there's still a lot of what is essentially lift and shift, from what we're seeing: you take your existing infrastructure and your existing applications, pick them up from somewhere, whether that's physical infrastructure or VMware or whatever, and put them into a cloud using a subset of the available tools, without heavy re-architecting, because of budgetary constraints. And there's still worth in lift and shift, because you might spend less on infrastructure, and it might give you some better redundancy offerings and so on. But thankfully we were able to build an application, well, a stack actually, pretty much from the ground up, which was fun.

So, I'll just talk a wee bit about the golden master deployment model. Has anyone used this sort of deployment model, golden master, red-green deployment, anyone? I'll assume you're just shy. What we've got here, the magical instance, is this one called the EC2 admin instance. This is your typical LAMP-type stack, where you've got an application server and then bits connected to it: databases, caching servers, and a clustered file system at the back, GlusterFS down there. Over to the right you've got this autoscale group, which is the AWS magic of being able to define an autoscale group based on an AMI, an Amazon Machine Image, essentially a compute snapshot, right? These instances are considered immutable in the sense that they don't have any state. Well, technically they do have a bit of caching going on, but they're writing their syslog, their logging, out to an external server, the syslog server out on the right; they're doing all their file operations against a clustered, network file system; they're talking to RDS as their database; and they're using external caching servers. So there's that real cattle-versus-pets thing, where you can kill one and you just get another, right?

The way we build that snapshot, the way we create the instances to use, say when we've got an upgrade to an application of some sort, is that we do all the changes on the EC2 admin instance, which sits off to the side and is not in front of the internet. There is no URL you can use to get to it, although you can do a bit of cookery in /etc/hosts and get to it if you need to. We deploy changes to that. It still has a live connection to the files and the database and all those things, but it's not part of the compute muscle that's servicing requests from the ELBs; ELBs are load balancers in Amazon speak.

This has been quite successful. It was our first orchestration approach, and it gave us a lot of flexibility: we could use very conventional tools to manage the EC2 admin instance, because it's just a server, it's just a Linux server, and we're pretty much exclusively Ubuntu Linux. There were all sorts of ways we could manage that, and we were very comfortable with it, right? Then you fire off an API call into AWS land, it takes a snapshot of the admin instance, and you roll that into the load balancer: you simply kill off the old instances and move the new ones in. There are a couple of approaches, depending on what level of comfort you've got around things like upgrades running, but it's pretty flexible and it allowed us to do seamless upgrades.
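As a minimal sketch of that snapshot-and-rotate step, assuming boto3 and purely hypothetical instance IDs, group and launch-configuration names (nothing here is the actual tooling from the talk):

```python
# Sketch of the golden-master rotation: bake an AMI from the admin
# instance, point the autoscale group at it, then retire old nodes.
# All identifiers below are hypothetical placeholders.
import time
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")
asg = boto3.client("autoscaling", region_name="ap-southeast-2")

# 1. Snapshot the EC2 admin instance into a new AMI.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",           # the admin instance
    Name=f"golden-master-{int(time.time())}",
    NoReboot=True,                               # avoid downtime on the admin box
)
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# 2. Create a launch configuration that uses the fresh AMI.
lc_name = f"app-lc-{int(time.time())}"
asg.create_launch_configuration(
    LaunchConfigurationName=lc_name,
    ImageId=image["ImageId"],
    InstanceType="m3.large",
)

# 3. Point the autoscale group at it; new instances use the new AMI.
asg.update_auto_scaling_group(
    AutoScalingGroupName="app-frontends",
    LaunchConfigurationName=lc_name,
)

# 4. Kill off the old instances one at a time; the group replaces each
#    with a node built from the new golden master (red-green style).
group = asg.describe_auto_scaling_groups(
    AutoScalingGroupNames=["app-frontends"]
)["AutoScalingGroups"][0]
for inst in group["Instances"]:
    asg.terminate_instance_in_auto_scaling_group(
        InstanceId=inst["InstanceId"],
        ShouldDecrementDesiredCapacity=False,
    )
```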
So, after doing that for one of our bigger projects, the next step was a much more proper orchestration layer, and the thing that came to the attention of both ourselves and the client, we actually proposed it, but the client liked it, and they're very, very technically capable and really wanted buy-in on all these decisions, was OpsWorks. Now, OpsWorks has been around for a while now, in AWS terms. By our definition, orchestration is infrastructure as code, right? The Holy Grail is that your entire stack is defined as code that lives in a version control system somewhere. Or not, but that's obviously a great way of doing it. OpsWorks is an AWS service; I believe they bought it off someone, but I don't know the details of that. It's orchestration as a service. You don't actually pay for it; you just pay for the resources you use when using OpsWorks, which makes sense for AWS: it gets you using their infrastructure, and you get charged for that. It does provisioning, it does management, it does deployment, and it plays well with continuous integration and continuous delivery, which we were using heavily. So it was push-button deploys, all tied together with Jira and Bitbucket and HipChat and all that sort of stuff, which was very much driven by the client. Once again, the client was heavily involved in the technical decisions, really did know what they were talking about, and understood the rules of the game for this world.

So, a summary of what OpsWorks is.
I'm not trying to sell OpsWorks in any way, shape or form. It's stack as code, and it goes in a Git repo. It gives you a methodology to layer the stack. It's based on Chef recipes, not Puppet; there's lots of room for ideological discussions around Chef versus Puppet, but not here. ERB templates are the methodology for defining configuration templates on the compute nodes. There are lots of existing recipes. And quite recently, although I'm not sure I would, you can now actually run OpsWorks outside of AWS, on your own infrastructure. You run some agent, and I'm sure there are requirements around what you run it on, and you get charged by the hour in the same sort of way, even on your own kit, which in theory allows you to do the whole portable-workload thing that's the buzzword in infrastructure as a service at the moment. We haven't done that; it was quite a recent development.

In practice, you define servers in layers: your cache layer, your front-end, your Gluster, your RDS. OpsWorks does do auto-scaling, but not as well as plain auto-scaling; you don't have as many metrics by which you can make rules, which was limiting. You can do some interesting stuff with temporal rules, though, which is a really good thing for the whole pile of environments that lie around big enterprises, because people turn them on, forget to turn them off, and get charged for it. So it's really cool to be able to say, this should basically run during extended business hours, and you can still turn it on manually. The rules aren't super slick, but you can do that. (There's a sketch of what stack, layer and temporal-rule definitions look like via the API below.)

It has auto-healing functionality, which was a big problem for us when it came unstuck, because it depends on the mothership, which is in the States. At the point when connectivity was lost between the Sydney region and the States, it started doing some pretty strange things, even though the stack could see the internet and clients could still use it. We turned off auto-healing and that fixed that problem.

So, the things we don't like, and this comes straight from the team; we've been using it fairly heavily now for well over a year. It has some communications dependencies, some dependencies we're not fully comfortable with. We ended up doing the whole build-from-scratch approach: you build your compute node from scratch every time. You deploy a new EC2 instance, which is Ubuntu LTS something, then you start installing packages and running recipes and all that sort of stuff, which is good for some things but really not that good for others. Auto-scaling isn't very meaningful in that context, because it takes far too long to build an instance to save you from a load spike. Even in the best cases, auto-scaling isn't actually the magic wand it looks like in the beginning. We would have been faster going with an AMI approach, but that was a re-engineering no one was prepared to pay for.

Here's a contentious statement that other people have made, so I'll just repeat it: the learning curve for Chef is steeper than for Puppet. That's what some of my guys say; they had come from Puppet to Chef and they liked Puppet more. But then one of my guys went completely Chef, and now they have arguments about which one's better. We're probably leaning back in the direction of Puppet, in all honesty.
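As a minimal sketch of those OpsWorks concepts via boto3, assuming hypothetical ARNs, a hypothetical cookbook repo and recipe name; the time-based schedule is the "extended business hours" temporal rule mentioned above:

```python
# Sketch: define an OpsWorks stack, a layer, and a time-based
# ("temporal") instance via boto3. ARNs, the Git URL and recipe
# names are hypothetical placeholders.
import boto3

ow = boto3.client("opsworks", region_name="us-east-1")  # OpsWorks API endpoint

stack = ow.create_stack(
    Name="lms-stack",
    Region="ap-southeast-2",
    ServiceRoleArn="arn:aws:iam::123456789012:role/aws-opsworks-service-role",
    DefaultInstanceProfileArn="arn:aws:iam::123456789012:instance-profile/aws-opsworks-ec2-role",
    UseCustomCookbooks=True,
    CustomCookbooksSource={  # the "stack as code" Git repo of Chef recipes
        "Type": "git",
        "Url": "https://example.com/ops/cookbooks.git",
    },
)

layer = ow.create_layer(
    StackId=stack["StackId"],
    Type="php-app",                              # built-in PHP app server layer
    Name="Frontend",
    Shortname="frontend",
    CustomRecipes={"Deploy": ["app::deploy"]},   # hypothetical recipe
)

# A time-based instance: only runs during extended business hours,
# handy for the staging environments people forget to turn off.
inst = ow.create_instance(
    StackId=stack["StackId"],
    LayerIds=[layer["LayerId"]],
    InstanceType="m3.medium",
    AutoScalingType="timer",
)
ow.set_time_based_auto_scaling(
    InstanceId=inst["InstanceId"],
    AutoScalingSchedule={
        day: {str(h): "on" for h in range(7, 20)}   # on 07:00-19:59
        for day in ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
    },
)
```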
Look, there is an element of lock-in with OpsWorks, and I think that's probably the case for any complicated orchestration framework at the moment, because they do things in certain ways, and especially with AWS you're making use of offerings that no one else has. I don't like the term lock-in in this context so much as "there's a cost to change": a cost to moving away which is non-trivial.

So, where we got to, and where we are, is that we have the ability to do a full-stack rebuild at the press of a button. This means we can replicate a recently built production state. The data might be as much as an hour old, but at the press of a button we can replicate the entire stack, front-ends, databases, caching servers, Gluster, and the state of that data, within about 20 to 40 minutes, it depends, and a lot of that time is actually restoring the database. So it's pretty nifty: you press a button and away you go, there's another stack; press the button again, and there's another stack (there's a sketch of a press-button rebuild via the OpsWorks API after this section). That was very, very useful for some parts of the project. We could use it for disaster recovery purposes, launching an instance of the stack in another region, although there's a lot of discussion around that in terms of the data implications; everyone likes that idea, but it was never done. Obviously it's a lot nicer if the whole region doesn't just go away. And in terms of your ability to load test, it is awesome. You can just go, hey guys, I want to do a big round of load testing, press a button, and you've got a whole stack good to go; go nuts, load-test the shivers out of it, and then you put it back down again. That used to be quite challenging, because you didn't have physical machines lying around, that you were never otherwise going to use, that were one-for-one matched with production.

If we did it all again, we would use AMI-based, which is to say snapshot-based, solutions rather than "call home and tell me what I need to be", because you can get faster launches. OpsWorks wouldn't be suitable for all of our AWS stacks, because of the way they need to scale based on auto-scale rules. The golden master approach is probably still more flexible, although you step slightly further away from "my whole stack is code inside a version control system". You might have some stuff where people just go in there and change things, which is a bit closer to the old school, but there are times when that's really convenient, especially when the wheels are coming off and you can just go and make changes. There have also been discussions, of course, about re-engineering our own version of sort of what OpsWorks does, with some improvements, but funnily enough the client wasn't terribly keen on paying for it. They understand the pain points, but it's also a very sophisticated and mature piece of kit.
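A minimal sketch of what a "press the button, get a stack" driver can look like with the OpsWorks API, assuming hypothetical stack and app IDs; the database restore step, which dominates the 20 to 40 minutes, is elided:

```python
# Sketch of a press-button rebuild: clone the production OpsWorks
# stack, boot an instance in each cloned layer, and kick off a
# deploy. IDs are hypothetical; the DB restore is not shown.
import boto3

ow = boto3.client("opsworks", region_name="us-east-1")

PROD_STACK_ID = "11111111-2222-3333-4444-555555555555"  # hypothetical
PROD_APP_ID = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"    # hypothetical

clone = ow.clone_stack(
    SourceStackId=PROD_STACK_ID,
    Name="loadtest-clone",
    CloneAppIds=[PROD_APP_ID],   # carry the app definition across
    ClonePermissions=True,
)
new_stack = clone["StackId"]

# One instance per cloned layer (sizing kept simple for the sketch).
for layer in ow.describe_layers(StackId=new_stack)["Layers"]:
    ow.create_instance(
        StackId=new_stack,
        LayerIds=[layer["LayerId"]],
        InstanceType="m3.large",
    )

ow.start_stack(StackId=new_stack)   # boot everything

# Once instances are online, deploy the app to the clone.
app_id = ow.describe_apps(StackId=new_stack)["Apps"][0]["AppId"]
ow.create_deployment(
    StackId=new_stack,
    AppId=app_id,
    Command={"Name": "deploy"},
)
```

Tearing the stack back down after the load test is the reverse: stop the stack, delete the instances, delete the stack.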
So, just a bit about auto-scaling. When we first saw auto-scaling and the capabilities it offered, we thought, oh my god, eureka moment: we just don't need to worry, we're going to spend so much less money, we're going to be able to make these incredibly tolerant systems that can tolerate any load spike. The reality is, it isn't quite the magic bullet you think it might be. You cannot just broadly under-provision, because you will have a reasonable outage before auto-scaling kicks in enough to save the day. You still have to pay attention to what the requirements of your system are, and you can't completely under-provision in the hope that auto-scaling is going to save the day. Also, not all the compute instances you bring up in an auto-scaling event are actually completely ready to go, and they can become your bottleneck: one of the problems we found was that the APC cache in PHP needed to be pre-warmed before you could comfortably throw an instance into a heavy-load environment, because otherwise it might actually make things worse. You really need to load-test auto-scaling and understand the symptoms, the scenarios, around it launching, because if you get it wrong, suddenly you end up auto-scaling all the time and you spend more money than you need to. You need to understand what a suitable set of rules is around firing up new instances.

So, I'll tell my chat story. We've got a large application, a learning management system and a content management system, Moodle and Drupal, in that stack I already explained. Into that went a chat application, a PHP application, which was basically brokenly engineered: it polled the database from inside the browser window. You've seen the pop-up chat applications on the web; this one polled the database every 5 seconds, asking, have I got a new message yet? Have I got a new message yet? Have I got a new message yet? So while someone is on the page, off having a coffee for two hours, or lunch, that JavaScript is constantly pounding the server, and guess what: it sent RDS into a meltdown. It just completely drowned, flat-lined, the database. And we didn't get that much saving from the auto-scaled front-ends, because it was all about the back-end anyway; they were basically just passing requests through. So we managed to wrestle enough of the work into a caching layer, into front-end land, so the database was left alone. And what happened after that was, instead of having 4 front-ends working we had 20. So it scaled, we went up to 20, and the application dealt with the load, but the client now has to pay $2,000 or $3,000 more a month to run the infrastructure. And that's an interesting discussion: generally our background has been fix it, make it work, which we did, within the constraints of the task, and yet our solution still isn't completely good, because it's going to cost them a lot more. It makes them understand the cost of these things. What they really need to do is use a different chat solution, which they will, if they decide to; we've proposed a different one. But that was a different symptom of using infrastructure as a service, because in the traditional sense, if you just had X amount of servers, you would basically have either turned off chat, or maybe done some magical tuning of the database, or just watched it descend into a ball of flames; you wouldn't have had terribly many more options. This is new; this is what we're now able to do.
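As a minimal sketch of that "wrestle the poll into the caching layer" fix, in Python rather than the PHP of the actual application; the cache host, key scheme and database helper are all hypothetical:

```python
# Sketch: answer the "any new messages?" poll from a short-TTL cache
# entry so idle browser tabs never reach RDS. Host names, keys and
# the db helper below are hypothetical.
import json
from pymemcache.client.base import Client

cache = Client(("cache.internal", 11211))
POLL_TTL = 5  # seconds; matches the JavaScript polling interval

def query_unread_count(user_id: int) -> dict:
    # Hypothetical stand-in for the real RDS query.
    return {"unread": 0}

def poll_messages(user_id: int) -> dict:
    key = f"chat:poll:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # served from the front-end tier

    result = query_unread_count(user_id)   # only one DB hit per TTL window
    cache.set(key, json.dumps(result), expire=POLL_TTL)
    return result

def on_new_message(recipient_id: int) -> None:
    # Invalidate on write so a real message shows up immediately
    # instead of waiting out the TTL.
    cache.delete(f"chat:poll:{recipient_id}")
```

The effect is that N idle tabs cost at most one database query every POLL_TTL seconds instead of N queries, which is why the load moved off RDS and onto the (scalable) front-end tier.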
So, resource upgrade policy, quickly. You can very much get bogged down with an organisation when you have to ask them, can we use a bigger RDS, or can we use a bigger EC2 instance, and they sort of look at you blankly and go, I don't know. What we've actually done now is, we're just going to upgrade it: if the system is breaking and we need it, then we'll just turn it up, because it costs them 25 bucks a day. They need to talk about it, and we need to have a meeting and all that stuff, and that's fine, but in the short term we're just going to turn up the volume. Monitoring auto-scaling is also quite challenging, but I haven't got time to talk about that.

Spend. Once upon a time, the technical people didn't really need to think about spend so much in terms of managing infrastructure, because there was a cost for a number of servers, maybe a pile of data. But now the cost of this whole stack, and the staging and continuous integration environments, is really interesting in terms of how it affects the spend. You have to have discussions with the client that you didn't before, because they want to know why they suddenly spent $5,000 more in a month, and you have to tell them; it might be something to do with you, and it might be something to do with them. It's a new type of discussion, and the way we look at it, it gives us the opportunity to bring more value to what we're doing for them: we can say, well, what about this, how do you feel about doing this, how about you turn off some of the environments you've got running?

So, some Catalyst AWS tips. How many of you are using AWS in some way, shape or form? Right, for clients, okay. You should have two-factor authentication, with CloudTrail enabled. There is no argument not to. The concept of someone being able, maliciously or accidentally or in any way, shape or form, to get your login credentials and erase your cloud existence is pretty frightening, right? So you should have two-factor authentication enabled. It is not that hard to set up; there are a number of ways of doing it, and if you want to come and talk to me about the ways we've done it, please feel free. That should be business as usual, right? It's not fun for everyone, but it is just so important, together with CloudTrail, so you've got some level of auditability.
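A minimal sketch of switching both of those on via boto3; the trail, bucket and policy names are hypothetical, and the S3 bucket needs CloudTrail's standard bucket policy in place before create_trail will succeed:

```python
# Sketch: enable CloudTrail logging and add a deny-without-MFA
# guard policy. Names are hypothetical placeholders.
import json
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="ap-southeast-2")
iam = boto3.client("iam")

# Audit trail: every API call lands in the logs bucket.
cloudtrail.create_trail(Name="account-audit", S3BucketName="example-cloudtrail-logs")
cloudtrail.start_logging(Name="account-audit")

# Managed policy that denies everything unless the caller
# authenticated with MFA; attach it to users or groups as needed.
deny_without_mfa = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyEverythingUnlessMFA",
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
    }],
}
iam.create_policy(
    PolicyName="require-mfa",
    PolicyDocument=json.dumps(deny_without_mfa),
)
```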
The other tip we'll throw out there: no single person, no matter how trusted or intelligent or ubergeek they are, or whether they own the company, should have the credentials or the ability to delete all the copies of your snapshots. There should be no AWS account that can go in there and delete, delete, delete, delete and remove it all. There should be copies of the data in places, or managed using tools, such that it will take two people to delete that data. Because whether it's accidental, hey, people make mistakes: boom, it could all be gone; or it's a malicious staff member who gets fired and decides you guys are all dicks, I'm going to show you; or it's getting hacked, you want to be covered. We used to talk about data being stored in three different physical locations; now, as well as that, you want to be able to say that there is no one person in my environment, or in your environment, who can delete this data. They could delete a copy of it, but there will be another copy somewhere else, and you can come in and save the day. If you want to come and talk to me about how we've done that, I'm happy to discuss it, and there's a lot of flexibility in how you might do it.
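A minimal sketch of that pull-only arrangement, assuming hypothetical bucket names: a sync job running inside an isolated account that has read access to the primary account's backups and no delete permission anywhere.

```python
# Sketch of the "no one person can delete everything" pattern: a
# copy-only sync running in an isolated backup account. It pulls
# new objects with read-only cross-account access and never issues
# a delete. Bucket names are hypothetical.
import boto3

SOURCE_BUCKET = "primary-account-backups"   # readable cross-account
DEST_BUCKET = "isolated-backup-archive"     # owned by the backup account

s3 = boto3.client("s3")  # credentials from the backup account only

def pull_new_objects() -> None:
    have = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=DEST_BUCKET):
        have.update(obj["Key"] for obj in page.get("Contents", []))

    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get("Contents", []):
            if obj["Key"] in have:
                continue
            # Copy-only: if the primary account is wiped, the sync
            # finds nothing new but never removes historic copies.
            s3.copy_object(
                Bucket=DEST_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )

if __name__ == "__main__":
    pull_new_objects()
```

The two-person property comes from the account boundary: the people with credentials for the primary account have none for the archive account, and vice versa.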
So, more tips over a beer later in the evening. And there will be no demos. There will be no demos: when I asked my sysadmins, shall I do a demo, they sort of smiled, and I know that afterwards they said, don't let him do a demo, do not let him do a demo. So there are no demos, and really it's about the approach, as opposed to looking at an OpsWorks interface; it just looks like the AWS console in many ways. I would love to hear some questions; I think I've got time for one or two, maybe.

Did you play with CloudFormation at all before going to OpsWorks? And do you have an opinion between the two?

We did. When we went down the OpsWorks journey, when we went in that direction, we looked at CloudFormation, and it just seemed to be more work than OpsWorks to do what we wanted to do. But we're actually looking at it again now, because we think it might give us some things that OpsWorks doesn't. We wouldn't use OpsWorks again unless it was sort of mandated to us, and we probably wouldn't propose it; we've all fallen a little bit out of love with it. So, CloudFormation: yes, it didn't look easy when we played with it, but we're doing another experiment, another proof of concept, now.

Does it play well with OpenStack? Sorry?

Well, it's not agnostic to cloud providers; it's an AWS tool. The closest thing, from what I know, and I'm not an OpenStack expert, is Heat; Heat is sort of the same idea. I haven't seen a detailed examination of the feature match, but there are offerings in AWS that OpenStack doesn't have, so it's going to be different. This is very much an AWS tool.

Interoperability between the two environments, from your perspective?

From our perspective, that's a tricky one. Honestly, the jury's still out for me on these cloud-agnostic deployment methodologies, because you end up with the lowest common denominator. The reality is, a lot of what AWS has to offer, and I'm starting to sound like an AWS salesman, which I don't really want to, is bleeding-edge technology: a read replica for my RDS, or the queuing service. I know these things exist to some degree in OpenStack, but it's generally a subset of the offering. All those little bits and pieces that really make your life easier are what we tended to use. So yes, you could build these things, and that's why I talk about cost of exit as opposed to lock-in: you could do it all on any cloud platform, but there'd be some things you'd have to build, and it's a changing ecosphere. But at the moment we've been very centric towards AWS, and that was, once again, the client's decision as well: they were telling us, we want AWS; they weren't coming to us for a recommendation of a cloud provider.

One more question, and then we'll move on to you, Saban.

You said that you keep the data in two separate locations; I'm assuming they're in two separate accounts?

Well, they're actually in more than that: different AWS accounts.

So wouldn't the script that copies the data from one account to the other have to have access to both accounts?

Read access. Actually, the way we do it is: there are the Catalyst accounts, which have the stuff; then there's an account over there that's mine, and an account over there that's my operations manager's, and those just connect and pull data back, and they don't delete. So even if you wiped everything off our joint account, they would connect and try to sync, and, well, there'd be no data there, but they wouldn't go and wipe all of the historic data.

So if you compromised those central machines, all they could do is shut down access; they couldn't poison the state of the external servers?

There are a number of ways you could do that, and we've got some new ideas. Thank you.

Alright, thank you. Thank you very much.