All right, I think it's one o'clock, so we can get started. This session is "Loft Your Web Platform into the Clouds with Immutable Infrastructure." My name is Steven Merrill. I work at Phase2, where I'm the director of our DevOps discipline.

What I want to talk about today, as the session title suggests, is immutable infrastructure. In service of that, I want to talk about immutability in general: what it's used for in software development, and whether it's actually something you can achieve in infrastructure. Then, more specifically, I'll cover how you build these immutable artifacts, which are at the core of the idea of immutable infrastructure, how to use them to achieve AWS auto scaling, how you might do this if you're looking to use containers in one way or another, and some real-world examples I've seen with these practices.

So, mutability. Everything in a computer is mutable, right? A computer just mutates bits of state in memory; that's how it does things. Lots of things have side effects, and yet the topic of immutability comes up quite a bit in a lot of different fields and disciplines, and there are several good reasons for that. In software development, we find a number of places where immutability of one sort or another is used to particular effect. A decent number of language runtimes these days have at least the ability to declare a variable immutable. Scala and Haskell are two great examples of languages where you can say whether a variable is immutable or mutable, and using immutable variables or referentially transparent functions - functions that, given the same inputs, always return the same outputs - can give you performance benefits at runtime. In other words, if you have a function that only works with immutable values, and you know that a given input always produces the same output, you can just memoize it: if input A always returns output 10, you can use that to your advantage. When you have immutable data structures, you can also get better parallelization. You can say that if ten different things are going to work with a given value at a given time, there's no need to synchronize read access across concurrent threads.

So this idea of immutability is finding its way into software development practices. You'll even see things like immutable data stores that use a write-ahead log to keep the entire history of something, where each iteration is a snapshot in time. One cool example, if you haven't seen it, is that the Elm language offers time-travel debugging because of the way it stores its state: you can take a program and debug it forwards and backwards in time. You see this in other architectures too. A while ago we were using Mercurial instead of Git, and it's interesting, because Mercurial is far less mutable than Git. In Git, a branch or a tag is just a pointer, right? You have your develop branch and you move it ahead; that's totally mutable. Whereas a branch in Mercurial is actually commit-level metadata.
So in other words, once you make a commit on a given branch, it is always on that branch. That's a little less fun, maybe, if you're used to Git, but it also means that Mercurial's data structure can essentially be a pure write-ahead log, because things are never mutated in place. We even see this on the front end: React uses immutable data structures to limit the amount of mutation, and the amount of calculation you have to do to know when you actually need to update the DOM, when a state change really has happened.

Okay, so as I said before, immutable infrastructure. What is the idea behind this thing that's admittedly a bit of a buzzword? We've had the idea of infrastructure as code for a while: there should be some way, using configuration management, to define all of your infrastructure with one of the major tools that's available. The idea behind the immutable infrastructure movement is really just to build immutable artifacts and promote the exact same thing through your system, so that you only have to build once. It might not be perfectly immutable, but the goal is to limit mutation as you take your code and your configuration management and move them through a given system. So you build these artifacts - which could be machine images, maybe container images - and they're designed to help you eliminate drift as you work with your systems.

An example of where this came into play: we were working with a client who was trying to use auto scaling in Amazon Web Services, and their auto scaling process started from a completely blank Ubuntu AMI, then used their configuration management system to install all the packages and layer the code - the latest Drupal release - on top of that. Finally, when that instance started up, it would, in theory, come into service when the load balancer attached to that auto scaling group started sending it traffic. But there are some problems with that approach. In theory it's a very nice approach, because you always start from scratch and you can always make sure your configuration management recipes will provision from a base box. In truth, there were a couple of problems. One of them was simply drift. If you say to install the latest version of OpenSSL, and you have two auto scaling instances that just recently came up and two that came up a while ago, you aren't necessarily guaranteed to get the same version on all of those instances. So it's possible to get drift even if you do things from scratch every time. Another was literally just the time to provision. In this case, going from a blank Ubuntu AMI, installing all the software, then installing the code, then getting the services to start often wouldn't complete in under, say, five minutes. So when you really needed to add capacity under load, it was very tough to do so. And there was a hard dependency: they were using one of the configuration management setups with a client-server model, so if the configuration management server was ever having problems, you wouldn't be able to add more capacity when you needed it.

I will offer some counterpoints. As I said, truly immutable infrastructure isn't really possible. I doubt that most people who build AMIs and work this way actually mount their disks read-only.
You probably actually need a place to put some sort of state, although container systems as well as Amazon instances would let you run a read-only root filesystem. Another question: if you're really running immutable infrastructure, would you, for example, disable SSH? I think a lot of folks fall at different points along this spectrum. There's a good article where someone suggested actually tagging an instance as contaminated if someone SSHes into it, so you at least know that someone may have come in and made some changes. Then, if you later wonder why this one instance is different, you have an idea of where to look.

So even if you can't reach a theoretically pure "nothing ever changes" state, the whole idea is that the practice of building these artifacts, using them just once, and promoting them through a pipeline means that you can use auto scaling - and we'll talk about auto scaling in a couple of different ways later on. The whole idea is to move to a more ephemeral system where you have these artifacts, you can bring up one to n of them, and that way you can auto scale and also throw some of them away.

How many folks have heard of the cattle versus pets metaphor? Okay, so a lot of people, when they talk about moving to immutable infrastructure builds or to auto scaling, will say you're supposed to have cattle, not pets. The idea is that if you have a pet, you probably love it, right? You lavish attention on it, you give it special treats. Whereas if you want to live in the cold, hard world of auto scaling, you have to have cattle: you bring some of them in, they do their job, and then you send them off to the great server farm in the sky when they've completed their work. You may hear that a lot, and I think it's a really good way to encapsulate the ephemerality you need in order to get to a place where you're using auto scaling. It can also allow for easier rollback if you have a literal tagged image that contains the whole operating system and your code, and it opens up the possibility of canary or blue-green deployments. If you've got built artifacts for everything, you could potentially make one of your five instances run the new code, see if you get a spike in errors, and if not, complete the rollout.

For the next part of the presentation, I want to talk specifically about how these ideas can get you to the point where you're running with some sort of auto scaler. The really important thing for being ready for auto scaling - for being able to build these sorts of artifacts - is to have a good, repeatable build process in place. Honestly, most modern web apps are going to have some kind of build step. I've personally worked on several projects where you have to do a Drush make-based build to pull down modules for a Drupal site, and that's probably going to pull down some libraries from GitHub or somewhere else. You may have to compile source - not so much in the PHP world, but if you're working in a different runtime you might have to create a JAR or some class files. You're probably going to have some sort of CSS or JavaScript compilation, minification, and assembly. And certainly all these steps can take some time.
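To make that concrete, here is a minimal sketch of the kind of repeatable build being described - not the actual Grunt Drupal Tasks workflow, just the shape of it. It assumes a Drush make-based Drupal project with a Gulp-compiled theme, and the makefile, theme, and artifact names are placeholders.

    # A rough sketch of a repeatable build; names and layout are illustrative.
    set -e
    drush make --no-cache project.make build/docroot          # pull Drupal core and contrib into a clean docroot
    (cd themes/custom/example_theme && npm install && gulp)   # compile the theme's CSS/JS
    rsync -a modules/custom/ build/docroot/sites/all/modules/custom/
    rsync -a themes/custom/  build/docroot/sites/all/themes/custom/
    tar czf build-${BUILD_NUMBER:-local}.tgz -C build docroot # the artifact you promote everywhere else

Run from Jenkins, that tarball becomes the one artifact that every later step - an artifact store, an AMI, or a container image - is built from.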
If you've ever worked with a node_modules folder that has 40,000 files in it and weighs a couple hundred megs, you know this can take some time. These are the steps required to take your plain source code and turn it into something that's ready to run in production. At Phase2 we have an open source set of tools called Grunt Drupal Tasks that makes it easy to do a Drush make-based build and to compile a theme using Grunt or Gulp. It has a package command that lets you produce a folder or a tarball, and it can start you off with an example layout with custom modules, custom themes, and a Drush make-based build.

While we're talking about the build process in general, another interesting thing to think about - especially if you're going to be doing a big build that relies on npm, Drupal.org, and other libraries on GitHub - is vendoring. A lot of major organizations vendor their dependencies in one way or another. That might mean running a private npm mirror, or putting something like a Squid proxy in front of your build machines, so you can protect yourself from temporary outages - like when the left-pad module was unpublished from npm and suddenly React Native and a whole bunch of other upstream projects broke. There's a really great discussion about this and a couple of other things in the "Who Owns Your Availability?" episode of the Arrested DevOps podcast, where a couple of folks who've worked at large companies talk about what you should vendor, whether you should maintain your own package repositories, and the pros and cons. If you want this kind of build pipeline, it's worth considering: what happens if Drupal.org is down? What happens if npm is down?

To build these kinds of artifacts, you might have multiple different build targets. Maybe the simplest one - not necessarily related to auto scaling - is if you run on a platform as a service like Acquia or Pantheon. They each have a Git repository, and that's how you do deployments: you push to the Acquia hosting repo or to the Pantheon Git repo, and that triggers a deployment. But in all likelihood you have at least something you're compiling - maybe a Gulp-based build process for your theme, maybe Drush make-based builds - so the simplest build target you may have worked with is just: do a build, then make one big commit to a repository. Maybe the next simplest build target, if you just want to keep artifacts of your built sites, is a tarball. Like I said, Grunt Drupal Tasks has a package task that runs all the build steps - Drush make, compile your theme - and outputs a tarball. You could have Jenkins put that into Artifactory or S3, and then you have a built copy of the site for every single revision and you can work on promoting it up the chain.

For use in auto scaling, you're probably ultimately going to end up building some kind of machine image. In the case of Amazon or Google, both of whom have auto scalers for their clouds, that probably means building an Amazon AMI. And an AMI is nothing more than a complete filesystem.
Maybe CentOS plus all your packages plus the code base you just built. Google has the same thing with its private images concept: you can snapshot a running instance and then use that image in their auto scaler. If you're running on some sort of container scheduler, which we'll talk about later, this could instead mean a built container image: you start from a base image, install whatever packages you need, add in your build, and push that to a private Docker Hub or your own Docker registry inside your infrastructure. With those in place, you can horizontally scale your instances, whether they're actual EC2 instances or instances of containers running in a scheduler.

So, how many folks here have used Jenkins before? All right. I discovered Jenkins a long time ago, and I do want to say that I think Jenkins is a really great tool. I also think Jenkins can be abused, maybe because it is such a great tool: it's very easy to build automation into it, and it's also very easy to have people come and type into a text field and suddenly an important part of your infrastructure is something someone just typed into a text field. So you have to be careful on that side of things.

I'll tell you the personal story of where I ended up with Jenkins. When I first got Jenkins - this was probably eight years ago - I thought, oh, this is great, I can set up Jenkins build processes on each of my live web heads and, one by one in serial, run a Drush make on each of them. And that's terrible, because if one of the Drush makes fails, you're suddenly in an inconsistent state between two of them. This was me, eight or nine years ago, just thinking Jenkins is awesome, I can run all this stuff - but it's certainly not repeatable, and you can get into big trouble if one of your four web servers stops, or even worse, if a build fails on the second one and the third and fourth never run. My personal evolution from there was: okay, I'm going to build once in the workspace on my Jenkins server and then sync the code out to the different docroots. That's getting a little better. Even better would be to build in Jenkins, store that artifact, and use some kind of deploy tool like Capistrano where you can atomically switch symlinks, so you have very tight control over when things get deployed. And of course the ideal state, where you can do auto scaling, is to build once, build that artifact, and then build some kind of image around it so you can put it into an auto scaling group and into production.

I also want to mention, in terms of building images - thinking about AMIs for now - that you want to be ready to do two different kinds of builds. You're probably building for an application, right? You've got a Drupal site, you want to do regular releases of it, so certainly part of the equation when you build an AMI is that you need a particular build of your Drupal code on it. But the other part is that an AMI also includes an entire Linux filesystem, with packages and configuration files. And as you know, there have been a lot of Linux vulnerabilities coming out recently.
As people delve into OpenSSL, OpenSSH, Bash, and libc, there are a good number of vulnerabilities in key parts of the underpinnings of most modern Linux systems. So I would say you should be ready to build a new AMI or a new artifact both when it's time to do a new code release and when a CVE comes out - some kind of high-profile vulnerability. That way, with a repeatable build process, you're ready to build a new machine image that includes critical security fixes without new code, or one that includes new code, and promote it through your system without necessarily having to change the code that's on it.

All right, so auto scaling. I think this is the desired result of having this repeatable build process and building these immutable images. How many people here have ever sat on a call and just watched server monitoring graphs? Yep. Maybe it can't be avoided: you have a high-traffic event and everyone says, all right, we're going to get on a conference line, take a look at these server graphs, and make sure nothing happens. I think the number one reason to get to auto scaling, if you can, is to reduce stress. If you can get to a point where Amazon's or Google's auto scalers are watching metrics for you and handling scaling things up or down, that's a really great spot to be in. One example: at one of our clients, once we rolled out auto scaling, we could come in on Monday morning and say, oh, hey, there were some spiky traffic events on Sunday, so we auto scaled up to six instances and then back down to two as the traffic leveled off. So I think the reduction of stress for your dev and ops teams is probably the biggest reason to use auto scaling.

Another is automatic healing. If you have an auto scaling group with at least two instances in it, then if one of them fails for whatever reason - and transient failures do happen - the auto scaler will just replace it. So with at least two instances you get some form of automatic healing. Another is certainly cost. Using auto scaling, you can generally run, say, two or three smaller instances and then set a much higher upper bound - say from two to eight instances - and let the Amazon or Google auto scaler handle that for you. Automatic scaling based on average metrics across an auto scaling group is a great thing. And if you still have those big events where everyone says, I don't know, we're expecting a lot of traffic, you can also just increase the floor: instead of a minimum of two, make it a minimum of eight for the next six hours, and maybe that costs you an extra hundred bucks in exchange for some peace of mind.

For a more practical example, let's talk about what you would need to do to auto scale a Drupal site using Amazon Web Services. I'm using AWS because they're definitely the market leader; Google certainly has all these capabilities as well. A Drupal site running in production is really made up of three different things. First, you've got your code base living somewhere on an application server - that could be a LAMP stack or a LEMP stack, but ultimately you've got the Drupal code base and something that runs PHP.
The second is your relational database. I'm going to guess most folks here are using MySQL, so that's what I'll generally assume, but it could be Postgres as well. And then you have the files directory. The big thing, if you're not as familiar with Drupal, is that the default for Drupal is to use the public stream wrapper, and the public stream wrapper generally assumes that the files directory - say sites/default/files inside the Drupal docroot - is magically available on every web head. If someone hits web head number three while logged in and uploads a file, it's assumed that file is also there on web head number one. In a lot of cases where you're not auto scaling, a common solution to this problem is something like NFS or Gluster, so you have a networked filesystem mounted everywhere, and when something is written on one of the web heads it appears on the others.

There are some other optional parts of a Drupal hosting setup. Very often you'll also put Drupal's object caches into something like memcache. Interestingly, Drupal.org recently turned off memcache because they found that with modern MySQL it just wasn't as needed, and talking to David Strauss, he's also of the opinion that we may not need to store object caches somewhere else, especially as modern versions of MySQL and forks like Percona and MariaDB have improved. Similarly, Drupal 8 has a chained cache backend, so you can say: store everything locally on the server in memory, check the timestamp, and if you don't have the latest version, go grab it from MySQL. So it's possible that in the near future it will be less important to have something like memcache or Redis to hold the object cache, but it's still a very common thing to have right now. And you'll likely also have Varnish or a CDN in place to offload static images or anonymous traffic.

In terms of how auto scaling works in Amazon Web Services: an auto scaling group is attached to one or more elastic load balancers. You set up an auto scaling group and say, when this auto scaling group acts, which elastic load balancer should it add and remove instances from? Then you set up the elastic load balancer health checks and the timeouts, which help control those actions. With the attached elastic load balancer you can say, okay, use the ELB to control whether or not an instance is considered in service.

You then use CloudWatch alarms to trigger auto scaling. For each auto scaling group, you put in a variety of CloudWatch alarms that control when you scale up and when you scale down. It's worth noting that the CloudWatch alarms you use will generally work on averages across the pool. A very common one might be: if the average CPU of the machines in the auto scaling group is greater than some amount based on testing - 70%, 60% - for more than five minutes, add a new instance to the group. You also have control over how often that happens: you can say add a new instance after five minutes of elevated CPU and then wait five minutes, or wait only one minute. You have very granular control over the policies - on the AWS side that boils down to a scaling policy plus a CloudWatch alarm, roughly like the sketch below.
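This isn't the exact setup from the talk, just a hedged sketch of what the scale-up half might look like with the AWS CLI; the group name, policy name, and thresholds are placeholders you would tune from testing.

    # Simple scaling policy: add one instance, then wait 5 minutes before another scale-up can fire.
    aws autoscaling put-scaling-policy \
      --auto-scaling-group-name drupal-asg \
      --policy-name add-one-instance \
      --adjustment-type ChangeInCapacity \
      --scaling-adjustment 1 \
      --cooldown 300

    # CloudWatch alarm: average CPU across the group above 70% for 5 minutes triggers the policy.
    aws cloudwatch put-metric-alarm \
      --alarm-name drupal-asg-high-cpu \
      --namespace AWS/EC2 --metric-name CPUUtilization --statistic Average \
      --dimensions Name=AutoScalingGroupName,Value=drupal-asg \
      --period 300 --evaluation-periods 1 \
      --threshold 70 --comparison-operator GreaterThanThreshold \
      --alarm-actions <ARN-returned-by-put-scaling-policy>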
Then you can also define the policies for when instances leave the group - in other words, when the spiky traffic subsides, when should it scale back down? An example might be: if the average CPU is less than 40% for 30 minutes, take an instance back out of the pool. These will vary based on your workload and your traffic. I do want to point out that the reason I have a longer timeout on the scale-down is that Amazon charges you by the hour, so it probably makes sense to keep instances around for close to an hour, just to get your money's worth. Interestingly, Google charges by the minute - a minimum of 10 minutes, and after that you're billed by the minute - so you could go with a slightly more granular policy on Google and maybe save a little money. They make a lot of hay about this in their TCO comparisons against Amazon, but I think both are fine systems.

So you've got your auto scaling group, you have your policies, you tell it which elastic load balancers to add and remove instances from, and then you put in a launch config. A launch config is a one-to-one mapping where you say: launch this AMI at this instance size. So you might say to launch this particular AMI, which has a build of your code, on a c4.xlarge with 100-gigabyte disks. An auto scaling group can have one active launch config at a time. So basically, for each build you'll build an AMI, then build an associated launch config, and then flip the auto scaling group over to use that new launch config. Any time an auto scaling action adds instances, it uses whatever AMI and instance size you specified.

A quick AWS recipe for running Drupal might look like this: a Jenkins instance in a private VPC subnet that builds tarballs and then builds AMIs with all your configuration management around them, and an auto scaling group with your public Drupal instances in it. A very simple way to get up and running without having to manage a MySQL server is to use RDS. RDS handles replication for you, handles patch releases for you, and handles point-in-time restore, which has saved my bacon on more than one occasion. Granted, you could run these things on your own EC2 instances if you want to, but I think services like RDS and ElastiCache are places where you can get a decent operational benefit for your business: if you don't want to manage MySQL, you don't have to, and that's a pretty nice value-add.

You could use an S3 bucket for your files via the AmazonS3 module. There are actually two modules in this space - the AmazonS3 module and the S3FS module - and they both work pretty well; I'll get into that a bit more later. With either of them, you can set the image uploads and other files to use the S3 stream wrapper instead of the public stream wrapper. Basically, once you enable these, when an editor uploads an image it goes into S3, and when Drupal constructs the URL to that image, instead of a link to sites/default/files/my-image, it's a public link to an S3 bucket. Configuration-wise, that boils down to a few settings, roughly like the hedged snippet below.
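As a rough illustration only - assuming the S3FS module, with variable keys and a bucket name that are placeholders rather than a definitive configuration - the settings.php overrides might look something like this:

    // Hedged sketch: the keys shown are illustrative overrides for the S3FS module.
    $conf['s3fs_bucket'] = 'example-drupal-files';
    $conf['s3fs_region'] = 'us-east-1';
    // Make s3:// the default scheme so new uploads skip public:// entirely.
    $conf['file_default_scheme'] = 's3';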
In other words, that image is served straight out of S3 and never has to hit your server after that point. You can use ElastiCache to spin up a memcache instance and store your Drupal object cache. You could use CloudFront for page caching and distribution. CloudFront is the low-cost option - you don't pay for transfer between your EC2 instances and CloudFront. It's not as advanced as something like Akamai or Fastly, but you can use it for basic page caching. It can do conditional requests, so you can run fairly low TTLs for your dynamic content and have it check back in. And it finally, just recently, added wildcard invalidation, so you can actually clear all your cache; before that you'd have to make a new CloudFront distribution, which wasn't as great. You could even use SES to send out email via the SMTP module. By combining this suite of services, you really just have your code in the auto scaling group, and you outsource the rest - your relational database, your file storage, memcache, even email - to other services. That way, all those app servers have to do is run your Drupal app.

Here's a simple example of how you'd go about building those AMIs, which again means: take a base AMI, run your configuration management, and put your Drupal code into place. You want some sort of Jenkins build that produces a tarball or build artifact you can get into the AMI. I like to use Packer to automate these builds. Packer is an open source project from HashiCorp, the makers of Vagrant and Consul and Vault and a number of other really nice operations-focused tools. Packer has a notion of provisioners, so you could, for example, use the file provisioner to copy the build output from Jenkins onto the image, and then run your configuration management - it supports provisioners for any of the big four, or you can just run a shell script if that's what you have to get the image ready to run. Then you'd create a new launch config, and you could flip your instances over one by one. When you enable a new launch config in AWS, it doesn't do anything by default until a new auto scaling event occurs. So there are really two ways you can roll out code deploys. You could temporarily double your capacity - say my minimum is now four, which causes the auto scaler to launch two instances of your new build, and then you kill off the old ones - or you could just wait for them to roll out eventually. You'd probably want to flip them over one at a time.

So, an example of how Packer works. Packer lets you specify variables. You don't have to put your access key and secret key directly into your template; you can pass those in, but it has this notion of variables, so you can define variables and defaults for them. When we do our builds of the AMI, Packer will actually launch a temporary AWS instance, run whatever provisioners you've selected, and then stop and snapshot that instance to make the new AMI for you. So in this case, you can set all these different variables in place - something in the shape of the hedged template sketch below.
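This isn't the template from the slides, just a minimal sketch of the shape of a Packer JSON template with variables, provisioners, and a single amazon-ebs builder. The source AMI, paths, playbook name, and tag are placeholders, and it assumes Ansible is already available on the build instance.

    {
      "variables": {
        "aws_instance_type": "c4.xlarge",
        "build_tag": null
      },
      "provisioners": [
        { "type": "file", "source": "build.tgz", "destination": "/opt/build.tgz" },
        { "type": "shell", "inline": ["sudo ansible-playbook -c local -i 'localhost,' /opt/ansible/site.yml"] }
      ],
      "builders": [
        {
          "name": "east-ami",
          "type": "amazon-ebs",
          "region": "us-east-1",
          "source_ami": "ami-00000000",
          "instance_type": "{{user `aws_instance_type`}}",
          "ssh_username": "ec2-user",
          "ami_name": "drupal-{{user `build_tag`}}"
        }
      ]
    }

Declaring a variable as null, like build_tag here, makes it required, so you'd run something like packer build -var 'build_tag=123' template.json, and Packer will complain if you forget it.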
Variables like which subnets to use, which VPCs to build in, which AWS instance type to use - and these variables don't do anything by themselves. The AWS access key and secret key you could also pass in. Then you have your provisioners. One useful provisioner is the file provisioner, which just copies something from the machine where Packer is running to the machine where the AMI is being created. In this case, if both are running from your Jenkins server, you could say: go into the workspace for the other build, grab my tarball, and upload it to the AMI build machine as /opt/build.tgz. Then, as a simple example, I run ansible-playbook via a shell provisioner - there are provisioners for the big four too - so you could say, once that file has been uploaded, go run configuration management. At the end, if both of those provisioners succeed, the creation of the AMI is allowed to complete.

In addition to the variables and provisioners, the builders are where you specify the exact parameters for each build. This is an example of a Packer template with two different builds: one AMI in us-east-1 and one in us-west-2. There are different builder types - Packer supports building Amazon instances through several different methods, it can build Vagrant boxes in VirtualBox or VMware or QEMU, and it can even build Docker containers. Much like Perl or Drupal, there are many ways to do things with infrastructure tooling, but Packer does have a number of different builders. You can see that in the course of this, we're asking for several of the variables that were set, so if they weren't overridden when the Packer binary was run, it'll just use the defaults from the top of the template. You'll see that I'm specifying a region not from the variables, because the east build is probably always going to build in us-east-1. And in the AMI name down here, I didn't put build_tag in as a variable up top, because presumably you're going to want to pass in some sort of identifier to go into the name of the AMI that gets output. If you try to run this and don't pass in build_tag, Packer will complain and say you must specify build_tag. So this is a simple way to get to the point where you're building in us-east-1 and, similarly, us-west-2: it's the same type of builder, an Amazon EBS-backed instance, with a different name, a different region, and "west" appended to the name of the AMI that's output. When this runs, Packer runs the builds in parallel, and the names you use - the east AMI and the west AMI - show up in the output. Once they both complete, it tells you what AMIs it produced, and you can use those to set up your new launch configs.

Now, some sharp edges around the actual use of S3 with Drupal. By default, when you upload an image through either of these modules, it goes onto your server, into the temp directory, and then Drupal uploads it to S3. So if you're uploading a gigabyte PDF, your PHP will need a memory limit of a gigabyte.
This is probably not a problem if you're working with small files, but we've seen a case where a customer wanted to build a video platform where people could upload movie files, and that would require a huge PHP memory limit and not really be all that desirable. There are other solutions to this. There is a daemon called s3fs, which is a FUSE filesystem backed by an S3 bucket, so you can take an S3 bucket and make it look like a path on disk. Based on a number of production struggles, I would not recommend it: it doesn't work all that well and has some interesting, weird failure modes.

Question? Yes - in an unfortunate quirk of naming, the S3FS Drupal module has nothing to do with the s3fs daemon. The s3fs daemon is a FUSE filesystem, and I would advise you to avoid it. So yes, another not-so-ringing non-endorsement of the s3fs daemon. I haven't tried it in a couple of years and it may be slightly better, but honestly, working natively with the Amazon APIs is likely to be better. For this particular problem - hey, I have a multi-gigabyte upload, what do I do with it? - both of the major S3 modules for Drupal have a separate CORS upload submodule, which lets you enable direct CORS uploads to an S3 bucket. That way a user can write directly into S3 from their browser, and then Drupal can come back and get the reference to it, so it never has to pass through your infrastructure. You don't have to worry about slingshotting a huge file through your running PHP processes and on up into Amazon. I think that's definitely the better way to go about it.

There are some auto scaling quirks too. The ideal of auto scaling is very nice, but then Drupal has a schema update. Let's say you have a schema update that renames a field - an ALTER TABLE that renames a column is probably the worst case for a Drupal schema update. Before you run the update, your new code will throw errors and probably white-screen, depending on how severe it is; and once you've run it, your old code will probably white-screen, because it doesn't know where the old field went. There are a couple of ways you can handle this. One is to double the capacity: say you have a minimum of two, you flip over the launch config, you double the capacity so you have a minimum of four, and you'll have your two old instances and your two new instances. Presumably the new ones will fail until the schema update runs, and then your old ones will fail, and the auto scaler will kill them and replace them with the new code. That works fairly well in practice. You could also consider taking some downtime and using maintenance mode. Most stakeholders don't like this, but it can be an option to let Varnish or Fastly serve a "hey, we're doing some maintenance" page while you run a big, hairy schema update.

Another thing some people want, for a variety of very valid reasons, is a separate editorial pool. One thing we've commonly seen is wanting to make sure that nobody can log into Drupal unless they're on the VPN or coming from one of your offices.
So you could certainly build the same image and change the way some of its code works based on an environment variable - say, require basic auth if the environment is set to a certain value. We've even seen the case where you have editorial pets and then run your production instances in an auto scaling group. It can certainly be done.

I'd like to talk a little bit about one client where we did help set up auto scaling, and that's MLS Digital - we did a case study with MLS Digital yesterday. We've been helping them for about the last year on mlssoccer.com and all the club sites. MLS Digital maintains a 21-site Drupal 7 platform that handles both the league site and all the club sites; they're really a centralized service provider that offers this CMS and digital hub as a service. MLS is a great candidate for auto scaling because their traffic can quadruple when games are going on, when big events happen, when news about players being traded hits. When we got there, auto scaling was turned off, and Drupal was not set up as a multisite either.

In order to get MLS to the point where auto scaling worked, we first moved to a Drupal multisite. Drupal multisite is a divisive topic, and not every set of websites is well suited to run as a multisite, but in the case of MLS, all the sites already ran the same code base and they all used a sub-theme of a parent theme, so they were a really great candidate. It's not like an individual club was saying, well, I want Panels 7 and Context 13 even though the platform only has Panels 6 and Context 4. By moving to a multisite, we got the build artifacts containing the code down from about 500 megs to about 40 megs, and that ultimately saved about 1.6 gigs of OpCache or APC memory. That was a big benefit to the density - the number of Drupal docroots you could run off a given server.

The client I talked about earlier, whose old setup started from a base Ubuntu image and provisioned with Salt, was MLS. We had problems mainly with timeouts: sometimes it would just take more than five minutes to scale up. So we moved to a Packer AMI build that had all the software and all the code on it. We also specifically made it so that even though the auto scaling instances would join the Salt master, they didn't actually run Salt when they joined - that was just for emergency reporting and the like - with the benefit that an AMI can come up and serve traffic even if everything else is down. If the Salt master is having problems, or something else is, as long as Amazon's core auto scaling platform is up and running, a new instance can come up. And as I mentioned before, MLS is using all the technologies we talked about: Amazon RDS to handle the MySQL database, including replication and disaster recovery, Amazon ElastiCache to run memcache, and files stored in Amazon S3 via the S3FS module. It was a pretty good success. We went from four pets to two-to-five cattle in an auto scaling group serving traffic to public site visitors. Eventually, by turning off their old legacy data center, we got a pretty great hosting cost reduction, and later Drupal optimizations even let us lower the instance size.
This is actually the month of July, out of Datadog, when we first moved to auto scaling, and you can definitely see the spiky traffic patterns. The top graph is load average across the auto scaling group, and the bottom graph is the number of auto scaling instances. Generally, for the first couple of weekdays we stayed at two instances. Games tend to happen later in the week and on weekends, so then you can see the instance count jump up to three, even up to four a couple of times, but it keeps coming back down. The actual average over the month is about 2.3, so MLS could run an average of 2.3 instances instead of four in any given month. We then did some work - I won't describe the whole thing - but before a big migration we used New Relic to find some Drupal bottlenecks and really cut the amount of CPU Drupal was consuming. We went from an average of around 1,100 to 1,300 milliseconds to serve a page from the auto scaling group down to about 500 milliseconds with a couple of tweaks. And if you want to talk about reducing stress: once we rolled that out, we were pretty much solid at two auto scaling instances for a while, and we could go down an Amazon instance size and still, for the most part, stay at two instances at any given time. So that's the general idea: if you have an immutable build process where you can build some kind of machine image containing all of your software and configuration via configuration management, and can put your code onto an instance, you can avail yourself of auto scaling.

Next up: how many people have opinions on containers? Okay, so let's talk about containers. Containers are a very interesting and hot topic in our industry right now, and with pretty good reason. You have some potentially great benefits from - I'll generally say Docker, because it's definitely the most popular container runtime out there. You can get increased density, and a main reason you can get increased density is by running heterogeneous workloads: if you have a bunch of Node apps that mainly consume memory and a bunch of Drupal apps that will gladly eat all your CPU, you could potentially host those on one cluster of worker servers, get great density, and let each app use the different resources on the box while still being collocated. When you're running on bare metal, you get native CPU performance - they are ultimately just processes - and you get native disk performance in volumes. You don't necessarily get native network performance; most of the Docker networking setups will lose you a little bit. And you also get the ability to set resource limits. So there are certainly some great things you can get by moving to a containerized workload.

It's also not easy, right? Docker on a single host is fantastically easy: you write a Docker Compose file, you say I need a database container and a memcache container and a LAMP stack container, and you're off to the races. But once you get to the point where you're ready to run Docker in production on multiple hosts, there are a lot of different things you have to think about.
Certainly most of the setups you'll see use some sort of overlay networking, so that every container has an addressable IP address. You might have to think about network security too: if you don't trust apps A and B, can you let them be on the same overlay network together? There's also service discovery - how do I route to all the running instances of a given app container across these different physical hosts? How do I do scheduling? How do I do routing or load balancing? Secrets management, and certainly monitoring of containers, is far different from traditional monitoring.

With that said, I'll come back to what I like in this space, but even how you do Docker builds is interesting. If you're a company that has already used configuration management, you might be tempted to use configuration management to build your images, whereas a lot of people who have come to Docker without as much of an ops background will just say, this is amazing, I can use a Dockerfile. And certainly building an image from a Dockerfile is fantastic because it caches the layers: if you make a change only on the 13th layer, you can iterate really quickly. I don't have an actual answer on which is better. At Phase2, not all of our clients use Docker, so we certainly use configuration management as well, but this will all depend on your situation. Another consideration is dynamic templating: if you have a process that needs a configuration file on disk and can't run purely from environment variables, you may need a way to write a config file to disk before a service starts in a container. There are tools like confd that can do this for you. And then there are the build tools themselves: you can use docker build, you can use Packer, and every major configuration management system also has a way to handle building container images.

The Docker daemon does offer tagging, though, and by using tagging intelligently you can get to the same immutable infrastructure workflow, where you build a set of bits once and then promote it through a lifecycle. Ideally, I'd recommend that you tag your builds - preferably from your main Git repository, maybe with a Jenkins build number - so that you have one canonical set of tags. That way you can always say, I want to go back to build number 40, or to tag 1.1.0. And then you can also use Docker tags kind of like branches in Git: you can advance a tag to say, dev currently points at 1.1.0, and now we're going to move it ahead to 1.2.0. So you might build both using a Git tag with a version, and then also use a tag to act as a sort of branch.

As an example of this, if you're going to do a Docker build, you could say: do a docker build and tag it p2-site:v1.1.0. That builds the container image and tags it with v1.1.0. You can then run docker tag to apply the dev tag to the same image that currently matches 1.1.0, and then docker push both of them. The layers only have to be stored once, because it's the same image for both tags. But that way, if you have a service definition somewhere, or a systemd unit, that always runs the dev version, it can keep doing that - the commands look roughly like the sketch below.
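As a hedged illustration of that tag-as-branch workflow - the registry host and image name here are placeholders, not the actual setup being described:

    docker build -t registry.example.com/p2-site:v1.1.0 .
    docker tag registry.example.com/p2-site:v1.1.0 registry.example.com/p2-site:dev
    docker push registry.example.com/p2-site:v1.1.0
    docker push registry.example.com/p2-site:dev   # same layers; only the tag metadata differs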
And then when you have a new release, or when you want to promote this same release to your stage environment, you just re-tag 1.1.0 as stage and push that. That push, much like a Git push that advances a tag, won't actually re-push any image layers; it just pushes the metadata saying that this tag points at this particular build.

Given all the things I mentioned about Docker in production being kind of hard - and I'm going to go through this a little quickly - I really like Kubernetes as a solution. There are a number of Docker schedulers out there: you have Amazon ECS, you have Tutum, you have a number of platform-as-a-service systems that run on top of Docker, and you now have the built-in Docker Swarm and Docker networking. So there's a really large set of options there. I happen to like Kubernetes. It's an open source container scheduler from Google. Their pitch is that you can manage a cluster of containers as a single system to accelerate dev and simplify ops. It's based on some of the ideas behind Google's Borg system. Google has been running containerized workloads for many years and runs billions of containers a week; they're not actually using Kubernetes internally - they have an internal system called Borg - but the idea behind Kubernetes is to take the principles of that system and make them available as open source. Google will also host a Kubernetes cluster for you with their GKE product in Google Cloud.

I like Kubernetes partially for that reason and partially because it has gotten a pretty nice set of contributors. Red Hat has been contributing a considerable amount, because OpenShift 3, their platform as a service, is based on Kubernetes - Kubernetes is now effectively the canonical upstream for OpenShift 3, so a lot of new features land in Kubernetes first and then filter down. The Deis project rebased on top of Kubernetes. Even CoreOS, who wrote one of the first distributed schedulers for Docker, called fleet, have really thrown their weight behind Kubernetes; they're still going to finish fleet, but they have a commercial offering called Tectonic on top of Kubernetes.

So why Kubernetes? It does require you to bring your own overlay networking: before you can use it, it assumes that somehow, in your cluster, each container will have an addressable IP that any other container can reach. But it has a lot of nice features built in. It has a concept of service discovery, just called services; it will do load balancing and cluster DNS for you, and specifically this means that if you spin up a service, it will spin up an Amazon elastic load balancer or a Google Cloud load balancer for you. You say make a new service, it makes an ELB, and that ELB automatically routes to all of your running containers. It offers manual pod scaling, and it now has a beta horizontal pod autoscaler, so you can say: based on the CPU usage of my pods - which are sets of containers we'll talk about in a second - add more instances. And again, that's kind of the holy grail: based on your CPU or memory use, add more instances to balance the load. It has resource-aware scheduling and capacity limiting, a very nice REST API, and a very nice CLI client. It can do secrets management too, so you can put sensitive passwords and other things into it, and it will handle distributing them securely to containers; they'll actually never hit disk.
It uses tmpfs in the containers to handle them. And you can define all of your objects in Kubernetes via either YAML or JSON.

I'm going to go through this fairly quickly. Kubernetes has the concept of nodes, which are just machines where the Kubernetes agents are running. It has labels: everything in a Kubernetes system can be labeled, and the labels are arbitrary; you use labels to select which pods go into a service. A pod is a set of containers, and the big thing about a pod is that it can be multiple containers. Say you have to run Apache and PHP-FPM together - that's a fairly common thing to do. Pods are sets of containers that run in the same network namespace, so they can reach each other on localhost: if you have an nginx container that wants to hit PHP-FPM on localhost:9000, that's how a pod works. You say run them both together, and they'll be in the same confined network namespace. There's also a concept of daemon sets, which lets you run one copy of something on each node. So if you have to run a Datadog agent or a syslog shipper or something, you can say, my cluster needs this daemon set. And daemon sets, like everything else, can also be labeled, so you can say: for every node labeled prod, run my syslog shipper or my EC2 backup script or whatever it happens to be. Replication controllers and services are how you actually start and scale different pods, and a service has an IP address and a DNS name where you can reach all the running containers. There's also a new deployments object that can do rolling deployments: when you have a new artifact, you can switch out pods one at a time and roll them out that way.

I'll go very quickly here. Everything in Kubernetes can be done with YAML or JSON. You could say, okay, this is an example of a pod that has just one container in it, a LAMP stack container. You can make resource requests, which help the scheduler determine when a node is too full, and you can set limits to say, don't let this container consume more than this much RAM, or, in this case, two CPUs - they use the idea of 1,000 millicores being one core of a box. You could then set up a replication controller that says: run five of these, looking for pods labeled with a key of app and a value of lamp, and here's the pod template to spin up. You could then make a service for it, where you say: select all of the running pods that have an app value of lamp and connect to port 8080. If you're running in AWS or Google, that automatically creates a load balancer where the service can be reached, and as pods come and go, it connects to them.

So, in summary: immutable infrastructure is maybe not something you can achieve 100%, but the idea behind it is to build immutable artifacts for the things you deploy, mainly to help you limit configuration drift in your infrastructure. By having this repeatable build process, you can be ready both to build new code releases and to respond to security vulnerabilities in the underlying Linux systems where this code runs. And the main reason to do this is to get to auto scaling, since auto scaling can help you reduce your hosting cost, reduce your stress, and get self-healing benefits. So thank you. All right, we have about eight minutes for questions.
If you could please come up to the mic if you have a question, so we get it on the recording. Or if there are no questions - oh, cool.

Actually, can you elaborate on self-healing? Maybe I missed some of that when you were going through, but just to explain it a bit more.

Sure, yeah. Self-healing, specifically around auto scaling groups, means that as long as you run more than one instance, if one of them goes away for whatever reason - an Amazon hypervisor happens to go offline - you'll still have at least one running instance that the ELB can route traffic to, and the group will automatically spin up a replacement. So it's basic redundancy, plus the fact that the auto scaler, as long as one instance stays in service, can replace the second one.

Okay. As far as ElastiCache goes, do the memcache modules that Drupal has now do auto-discovery for clusters? Because I noticed that in ElastiCache you have a cluster endpoint and then your separate node endpoints, and I've been trying to figure out which is the best one to point to from Drupal's end.

Yes - I have not managed to get auto-discovery working. I know Amazon provides a separate libmemcached-based .so that's supposed to allow that, but in my experience the best thing to do - which unfortunately isn't automated and doesn't let you change your topology; you can't go from two nodes to four - is to put each of the individual node endpoints in the $conf['memcache_servers'] array, and let the PHP drivers do their consistent hashing across that pool of servers. So yeah, I have not had success with auto-discovery and ElastiCache up to this point.

Okay, thanks.

So your example with MLS was that you worked with them to convert to a multisite for the 21 team sites. Had they said no, what would you have done? Let's say the teams were all different - all 21 sites and docroots were different - would you have still gone with the same approach?

Oh, that is a good question. The main reason we were looking at it was the benefit of lowered server resources - the fact that you don't have to have 20 copies of Drupal in APC, basically. I don't think there would have been a way around it if that were the case. In the case of MLS, we did some back-of-the-napkin calculations and said, okay, 21 instances times about this much APC memory apiece would be about a gig and a half of APC memory. That actually probably wouldn't have been the end of the world - I think at that time we were running on instances with eight gigs of memory, and the memory limit was probably around 128 megs - so it would have meant probably four to six fewer PHP-FPM processes we could run. So I don't think it's the end of the world if you have to do that. I certainly wouldn't try to force multisite onto a set of sites that doesn't support that structural sharing; you'll probably just get more headaches than it's worth.

Okay, thank you.

Do you do anything specific to handle scaling the database? It sounds like you use RDS, but in terms of those traffic spikes, if the database becomes a bottleneck, is there anything you do there, or is RDS handling that for you?

Well, that's an interesting one. I would say, first off, that Drupal does have the ability to send read queries to a replica server.
It's not particularly great at it. You have to specifically mark in your code that something is safe to send to a replica server. And of course Drupal uses the slave terminology, so you have $databases['default']['default'], and by default, if you just have that in your settings.php, everything will use it. You can then have $databases['default']['slave'], and any of the queries that have been marked as safe will go over there. Now, there are some other nice benefits in Drupal 7; for example, if you're building things using Views, you can say, use a replica database for this view. So if a lot of your site is built with Views, you can send things that way.

The idea behind sending some of your read traffic to a replica, or several replicas, is that if a decent amount of your traffic is going to read replicas, you can add more read replicas without having to vertically scale the main database. And RDS will handle adding one or more replicas for you and handle that replication.

Beyond that, when you get to the point where you just can't sustain the write load on your primary database, you've got a couple of options. There are modules like autoslave, or even mysqlnd or MySQL Proxy, that purport to automatically do that splitting of read and write queries for you. I will say that I've personally had some bad experiences with the autoslave module; if you're using it, just do some performance profiling, or at least take a look at what kind of read/write split you're actually getting. The folks at BlackMesh have said that they've seen a lot of success using something like MySQL Proxy; I don't have any specific experience with that one way or the other. So I think that's one avenue: finding some way to automatically increase the number of queries that are going over to read replicas so that you can horizontally scale those.

Another interesting one that I want to play with but haven't had time to is that, specifically in Amazon, there is Aurora. Aurora is really just DynamoDB that speaks MySQL 5.6, and so with that you can spin up as many read replicas as you want, because it's just one super fast DynamoDB instance under the hood, and it purports to be MySQL 5.6 compatible. I don't have the experience, but I really want to play with it. Again, that's not necessarily a 100% solution, but you could theoretically get four to five times the threshold before you have to vertically scale your database again. That's their benchmark; I haven't had a chance to test it, but I am very interested in doing so.

All right, thanks, that's good information.

All right, one more question? Sure.

Can you talk about how you handle test environments, with things like file system configurations for the web server and so forth?

Yeah, absolutely. So I usually like to set up a number of different settings files, because you're probably going to have different database settings and different S3 buckets that you're going to hit. You might have just an auto scaling group of a single instance for dev and test, and then a real production auto scaling group. So usually what we'll do is set an environment variable and then load a different settings file if that's available.
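A minimal sketch of that pattern, assuming a hypothetical APP_ENV environment variable and the per-environment file naming described next:

```php
<?php
// At the bottom of settings.php: include an environment-specific override.
// APP_ENV would be set on the auto scaling instance or container; the
// variable name and file naming here are assumptions for illustration.
$env = getenv('APP_ENV') ?: 'prod';
$override = __DIR__ . '/settings.' . $env . '.php';
if (file_exists($override)) {
  // settings.dev.php, settings.test.php, settings.prod.php each define
  // their own $databases entries, S3 bucket names, cache settings, etc.
  include $override;
}
```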
So you might have settings.dev.php, settings.test.php, and settings.prod.php, and in those you can switch which database to connect to and which S3 bucket to use. Then similarly, in a lot of cases, what we've done is to set up something that does an actual SQL backup, say to S3, and then refreshes lower environments nightly or on demand, because one of the biggest things to do, especially around schema updates, is to pull a database down from a higher tier, run the schema update, see if any wackiness happens, and repeat it as many times as you need to. So yeah, I'd say definitely use an environment variable, which can be set either on auto scaling instances or on containers, to switch which settings file you use, and then in that file you have the connection information for your databases, S3 buckets, et cetera. You can then use Amazon tools to just copy from one S3 bucket to another to actually sync them together. So usually we'll do nightly syncs of the database and files (in this case, S3 files) down to the lower environments, and also have an on-demand method so that you can refresh lower environments as much as needed. And you also probably want to have a way to pause that; if you have a big thing happening on stage and editors want unfettered access, maybe you can pause those syncs for a while.

Okay, well, it is a little bit past two o'clock. Please go back to this session on the schedule page and leave any comments or questions, and thanks so much for coming.