All right, so I'm Daniel Dreyer. I work at Puppet as a senior SysOps engineer, and I'm the primary service owner for our Drupal web infrastructure. (Can't quite hear? My laptop is kind of up on the mic here, so I'll try to lean in. How's that? Cool, thank you.)

So I work on our web infrastructure, which pretty much means the Drupal websites: puppet.com and the PuppetConf sites, including the older ones. I also run the infrastructure for Puppet Forge, which is a Ruby application but has a pretty similar architecture in terms of the infrastructure backing it. I moved all of that, mostly from Linode, over to AWS, and I'm currently going through that process again for the Forge. That's pretty much the basis for this talk. Before I was at Puppet I was doing two things: I ran a small web hosting business focused on Drupal, and I did consulting, mostly for small and medium-sized Drupal web dev outfits that had done some kind of their own hosting and had outgrown it, or it had become too painful.

What I'm hoping to share here is the dozen-ish key decisions that, when you're building this kind of web infrastructure in AWS, dictate the rest of the architecture you end up with. My experience watching other people go through this is that they make some really important architectural decisions early on without realizing they're important. They'll make decisions about their DNS, their load balancing, their web servers based on what they've worked with before, what their colleagues or friends have worked with, or what they read in a blog post. The symptom down the road is that two months in, they're spending all day working on Varnish, or they're spending a ton of money on giant EC2 instances, and they don't necessarily realize that this is subsidizing architectural decisions that aren't objectively wrong, but are the wrong set of trade-offs for their organization.

Coming into the whole AWS thing, my experience was that it was this amorphous blob of technologies and questions and phrases, and it wasn't clear to me how to put it all together or how to make those decisions. What I'm hoping you all come away with is a more linear way of thinking about what the key decisions are, at least in my view, so you can go through the process more deliberately, put the downsides in places that are acceptable for your situation, and more consciously pick the upsides that you can get out of AWS but don't get automatically.

Before I get too far into that, just to get a sense of the experience levels here: how many of you actually run, or are involved in running, some kind of server or web infrastructure at all? Okay, cool, just about everybody. How many of those run Drupal?
That's what I expected, pretty much everyone. And how many of you are doing that in AWS? Cool, lots of you, like half. How about configuration management, is that something we're all doing? Much less so, okay, cool. That hopefully means this isn't something you all already know.

The first question I always get talking with people is: why are you even hosting your site yourselves? It doesn't get that much traffic. And the fact is that most of the time it's better to use some hosted service. One of the key factors for us at Puppet, which probably won't apply to any of you except maybe the people from Chef, is that in operations at Puppet we have a bias toward running as much of our infrastructure in-house as we can, in order to have a larger infrastructure so we can dogfood Puppet better. If we outsourced everything we possibly could, we'd have like 20 or 30 servers, and that would not be at all a representative user of Puppet Enterprise. The really big PE clients are large corporations that try to run everything in-house, and they'll also tend to run their web infrastructure in-house, or at least some of it. So that's one factor for us.

The things that probably have more in common with your concerns: we want to integrate with our other infrastructure, and we want to use our own workflows rather than the vendored workflows you get from a lot of even the really great hosting providers. Specifically, we want to send Graphite metrics and web server logs into our Logstash, and we want to provision access either using Puppet or by integrating with LDAP. That has been surprisingly difficult. I've made quite an effort to find it, and it's harder than you'd think, because once you start talking LDAP you end up in somebody's $5k-a-month enterprise tier, and I can't justify that for, like, the 2015 PuppetConf site that no one visits anymore. It doesn't make sense.

The last thing, and I hope there are some people from these hosting companies here, is the question of SLAs. You look around and see SLAs that are three nines, four nines, a hundred percent, but the commitments backing them are typically relatively weak, and there typically aren't published actual uptime numbers. So if somebody from marketing asks me whether I can stand behind a thing, and I have no real hard evidence that these providers are actually up that much — even though these companies have great reputations and I've personally had really great experiences with them — if I can't quantitatively support it the way I can with my own servers, which I can monitor the heck out of, it's harder for me to tell other people they should trust it.
Architecture-wise, I'm going to run through each of the pieces of infrastructure you'd touch when you make an HTTP request, so it's going to be almost a series of lightning talks: DNS, load balancer, et cetera. I'll have to move through it relatively quickly, because we only have about 50 minutes left before questions, and there's just more infrastructure there than you can reasonably talk about in an hour. There's detail in some of these slides that I'm going to skip; if you're interested in it, jump back to it later or pull me aside afterward. The idea is to present a relatively high-level overview. You're all competent enough to do the googling on the specific things, and there's no shortage of tutorials about how to Puppetize or Chef-ize or whatever you call that an Nginx and an HAProxy, and so on.

The first point of contact your user has with you — the way they actually find your server at all — is DNS. A lot of people go with whatever DNS they got when they registered their domain, and they don't realize that's limiting them in certain ways. With AWS specifically, the key decision is: am I going to use Route 53, which is Amazon's hosted DNS and is really great, or am I going to use some third-party DNS? We're using a third party, Dyn, because we were standardized on them before we started using AWS, and they've been totally solid. If that weren't the case, Route 53 would be a really compelling option.

The complications around Route 53 are IPv6, Elastic Load Balancers, and naked domains. If you want your domain to be example.com instead of www.example.com, the DNS spec says that has to be an A record, but an Elastic Load Balancer only gives you a CNAME. Route 53 gets around this with an integration — they basically cheat and let you have a CNAME-like record anyway. If you're not using Route 53, you have to figure that out yourself, and it probably means you can't use an Elastic Load Balancer for a bare domain. That catches a lot of people off guard.

The other question is IPv6, and I hope someone from Amazon listens to this sometime and improves the IPv6 situation, because you basically can't win. Route 53 will serve IPv6 AAAA records, but Route 53's own DNS servers are not reachable over IPv6, so you have to make the DNS request over IPv4. That's kind of a non-starter if you're trying to help people with an IPv6-only stack — which I didn't think really existed, but then we had an IPv6-only outage of yum.puppetlabs.com and apt.puppetlabs.com and heard from people the same day. So it turns out those people do exist. And unfortunately, Elastic Load Balancers are also the only way to get a public IPv6 address in Amazon. So if you want to serve IPv6, you have to use an ELB, which means you have to be on Route 53, which means your DNS isn't reachable over IPv6 anyway.

So we're using Dyn. They're a little more expensive than Route 53, but they're really, really well established, and in the almost two years I've been at Puppet we have not had any problems whatsoever with them, so we've just stayed. Obviously that means we can't use Elastic Load Balancers, but there are other reasons we couldn't use those anyway, which I'll get to momentarily, so that hasn't been an issue.
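For reference, that Route 53 bare-domain integration is the ALIAS record. A minimal sketch of creating one with the AWS CLI — the hosted zone IDs and the ELB hostname here are hypothetical placeholders, not real values:

```
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com.",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z2ELBEXAMPLE",
          "DNSName": "my-elb-1234567890.us-east-1.elb.amazonaws.com.",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
```

To resolvers this looks like an ordinary A record, but it tracks the ELB's changing IPs behind the scenes — exactly what a plain CNAME can't do at a zone apex.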
The next hop your packet makes is to whatever load balancer you're running. I've highlighted four options here; the sneak preview is that we're on HAProxy, and I'll go over each one fairly quickly.

Use an Elastic Load Balancer if you can, because it's by far the lowest management overhead. If you're thinking it really wouldn't be that much work to just stand up an HAProxy — it isn't, but keeping it running, being on call for it, and figuring out the edge cases of keeping it highly available across multiple availability zones turns out to be a lot more work than I had expected. The disadvantage, beyond the IPv6 stuff I described, is that an ELB is not quite as full-featured as some of the other load balancers if you want to do more advanced things at the load balancer. That's not a huge disadvantage, because you can typically do that once you get to the web server anyway. The big killer for us was the lack of a static IP address. We have enterprise customers whose IT security policies require them to whitelist every IP for every website their people can visit, and we want those users to be able to visit our websites. With an Elastic Load Balancer you just get a CNAME, and the IP changes. Given that hard requirement for a static IP, we ruled that option out entirely.

Nginx is another option, and it's what we were using when I started at Puppet. It's already familiar to a lot of you, and it's really nice to have the same thing for your load balancer as your web server, because you don't have to learn as many things. The real killer for me, and the reason we switched away from it, is that doing health checks of your back-end web servers in Nginx requires the commercial Nginx Plus, and the sticker price on that is something like $1,900 a year per node. Given that HAProxy will do that for free, it's difficult for me to justify, even though Nginx is a really cool company. If you had a need for commercial support, and you're already running Nginx as a web server and need commercial support for it, this would be a really good option. For us, we don't need that, so it wasn't.

The last thing I want to plug here: if you're coming from a big enterprise shop, you may already be running F5 — they're sort of the go-to big-enterprise load balancer — and they have virtual appliances available on AWS, so you can have the same load balancer workflow you would with your hardware stuff. They're really expensive, so if you don't have that specific need, it's probably not a great fit.

So we ended up with HAProxy. It's open source, it's free, it's super lightweight — we can run it on a t2.medium instance. And I really love how they do SSL termination. Every time I've set up SSL anywhere else, I end up googling: how do I point it at the cert and the intermediate cert and the key? I always seem to flub something. With HAProxy, you point it at a folder and dump all your certs in that folder. It loads all of them when it starts, checks which domains they can handle, and when a request comes in for a domain it just maps it and handles it. There's no configuration. I really like that, because there's a whole class of things you can't screw up anymore, and I don't have to google when I add a new SSL cert.
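A minimal sketch of that cert-folder setup, with hypothetical paths and backend names — each .pem in the directory is the certificate, its chain, and the private key concatenated together:

```
frontend https-in
    # Point crt at a directory: HAProxy loads every cert in it at startup
    # and uses SNI to pick the right one per request. No per-cert config.
    bind *:443 ssl crt /etc/haproxy/certs/
    default_backend web

backend web
    server web01 10.0.1.10:80 check
```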
The big downside — the big operational overhead — to running your own load balancers, and this applies to both HAProxy and Nginx, is that in Amazon, any service you run really needs to be in three availability zones, or at least two. Amazon's SLAs are on a per-availability-zone basis; there isn't an SLA for a single EC2 instance. What they're committing to is that the whole availability zone will basically be up so much of the time. So you need your stuff, especially key things like load balancers, highly available across zones. An Elastic Load Balancer just does that for you; with HAProxy or Nginx, you're doing it yourself.

The way we do it ourselves is keepalived, which uses VRRP — a protocol at the same layer as TCP or UDP. You get three HAProxy servers that all have a keepalived process on them. They form a cluster and do a master election, and a node is only qualified to become master if its HAProxy process is running; you can write other custom health checks, but we haven't really found it necessary. When one of those nodes wins the election, it runs a really trivial shell script that uses the Amazon command-line tool to take over the Elastic IP. When that node disappears, it fails the health checks with the other keepalived nodes, they do another master election, one of them wins and takes over the Elastic IP, and that whole process takes a few seconds, so you don't really notice an outage.

I'm not going to go through all the details of this config (can I mouse over that? No, doesn't look like I can). The nutshell: at the very top, the health check script is just checking that HAProxy is running, and it lists some unicast peers. By default keepalived wants to multicast, but you can't multicast in EC2, so a lot of people think you can't use keepalived there. You can — you just run keepalived in unicast mode and tell it what its peers are. We use Puppet to template this out, because PuppetDB already knows who those peers are, so that's easy enough to set up. One other thing to keep in mind, which I forgot to put in the slide: you will need to allow the VRRP protocol in your security groups, or else they'll silently drop the traffic and you'll waste a day like I did.

Here's the shell script it runs when it takes over; line 10 on the slide is the one you care about. Literally, it's just running the same shell command you would run if you were taking over the IP by hand. It's really dumb. We template this out with Puppet as well, so we won't have to create one of these for every Elastic IP we use.
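A sketch of both pieces, with hypothetical IPs, paths, and allocation IDs. (Note there's no virtual_ipaddress block here; the Elastic IP move in the notify script takes its place.)

```
# /etc/keepalived/keepalived.conf (sketch; IPs and paths are made up)
vrrp_script check_haproxy {
    # Node is only eligible to be master while HAProxy is running
    script "/usr/bin/killall -0 haproxy"
    interval 2
}

vrrp_instance haproxy_vip {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    # EC2 drops multicast, so list the peers explicitly (unicast mode);
    # we template these out of PuppetDB
    unicast_src_ip 10.0.1.11
    unicast_peer {
        10.0.2.11
        10.0.3.11
    }
    track_script {
        check_haproxy
    }
    # On winning the election, grab the Elastic IP
    notify_master "/usr/local/bin/takeover-eip.sh"
}
```

```
#!/bin/bash
# takeover-eip.sh (sketch): the same AWS CLI command you'd run by hand
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 associate-address \
  --instance-id "$INSTANCE_ID" \
  --allocation-id eipalloc-0example \
  --allow-reassociation
```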
There's another cool thing we do with HAProxy that I wanted to share, because I haven't seen other people doing it and I think it's a pattern that would be useful to copy. If you squint a little at the very last line here — oh geez, oh man, I'm sorry — imagine the very last line of this set. This is an HTTP request, and these are the headers we get back, and the very last line is a header called X-Unique-ID with a long, nonsense UUID in it. Every time an HTTP request comes in, HAProxy generates a UUID and injects it into that custom header, and when the response comes back from the back end, it injects the same UUID header into the response. So if you do a curl that shows headers, you'll see this UUID.

We have HAProxy, Nginx, and Drupal all configured with custom logging options so that they log that header. What's cool about that is the troubleshooting workflow, if you have a centralized logging tool where you can search all those logs at once. Your web devs, without talking to ops — if they have access to the logging tools — can say they're having trouble with some page; they're not sure if it's happening at the load balancer, the web server, PHP-FPM, or in Drupal, and they want to see all those logs. Instead of asking ops for help and grepping through logs, they can search all the logs for all those tiers for that UUID, and in one place they get all the relevant logs for that specific request.

The other cool thing you can do with it: if you want your users to be able to file better error tickets, have your front end discreetly display that UUID somewhere — you could add it to your custom 500 page, say. Tell them to copy-paste it or take a screenshot when they file an error, and then you can look up the logs for that exact request without going back and forth with them to figure out which request it was.
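A sketch of the HAProxy side of this — the header name matches what I described, but the ID format shown is the stock example from the HAProxy docs rather than a true UUID; adjust to taste:

```
frontend http-in
    bind *:80
    # Generate a per-request ID and inject it on the way to the backend
    unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
    unique-id-header X-Unique-ID
    # Add %ID to your log-format so HAProxy logs the same ID; newer
    # HAProxy versions can also echo it on the response:
    #   http-response set-header X-Unique-ID %[unique-id]
    default_backend web
```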
The last thing we do in HAProxy that I'm proud of — and it's a dumb hack. How many of you have had the problem where a botnet shows up and registers like 10,000 users with your site? Right. And I bet half of the rest of you who didn't raise your hand — the reason you didn't is that you're signed up for some kind of anti-spam service. So are we. We get these, and it blocks a hundred percent of spam registrations consistently; it's really great. But you can't cache the register or the login endpoint, so if you get a real high-volume registration effort by a large botnet, it functions like a DDoS. I've seen it at Puppet and at a couple of consulting clients I've worked with: their site goes down for like 20 minutes every few months, the web servers get really, really busy for a while, and they don't know why. Typically the reason is that some spam bot is trying to register all these accounts, and it burns so much CPU on all their web servers that it blocks the whole site.

So what we do in HAProxy is this: on the frontend we have an ACL that tags all requests starting with /user/login or /user/register with a drupal_sec ACL, and then everything matching drupal_sec gets sent to a backend called www_throttle. What www_throttle does is act as a separate queue: it hits the same web servers you normally would, but with a really limited number of connections — each back-end web server is only allowed to handle one of those requests at a time. So if you get a billion spam registration attempts, they queue at the load balancer, in a separate queue from the rest of your website traffic. If that queue gets slow as heck, the only people impacted are legitimate users trying to register or log in — and since 99% of our traffic is anonymous, and the people who are logged in are typically marketing people or really dedicated users who are logged in all the time, they don't notice, because they're not hitting those endpoints. That has completely solved the problem for us. It's kind of a dumb hack, because it doesn't actually get rid of the traffic, but it sidesteps the question of figuring out which traffic is malicious, because that's surprisingly hard to do. The config is sketched below.
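A sketch of that config, with hypothetical server addresses:

```
frontend http-in
    bind *:80
    # Tag login/registration traffic...
    acl drupal_sec path_beg /user/login /user/register
    # ...and send it to its own queue
    use_backend www_throttle if drupal_sec
    default_backend www

backend www
    server web01 10.0.1.10:80 check
    server web02 10.0.2.10:80 check

backend www_throttle
    # Same web servers, but each handles only one of these at a time;
    # a registration flood queues here instead of starving the site
    server web01 10.0.1.10:80 check maxconn 1
    server web02 10.0.2.10:80 check maxconn 1
```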
So we've gotten through the load balancer; somehow the request has reached something running Nginx and PHP, or Apache, or whatever. This is typically the part of a Drupal or other web infrastructure where you need the most compute, and you have to run that on something. Going into AWS, the two obvious options are spinning up EC2 instances — which, if you haven't used AWS, are virtual machines very much like what you'd get out of a Linode or a DigitalOcean — or using ECS, the Elastic Container Service, which is Docker hosting as a service that Amazon mostly manages for you.

The good thing here is that the decision is really easy. How many of you are using Docker? Okay. The people who raised their hands should probably look into ECS. The rest of you should probably use EC2, because ECS is a perfectly good service, but in my view it's not compelling enough to warrant a switch to Docker just to have it. Especially if you're using decent configuration management and you have good automation — if you have an in-house capability to run virtual machines the normal way — just spin up EC2 instances. You can run those as ephemerally as you would Docker containers, if that's important to you.

The next questions at hand, assuming that like us you've decided on EC2, are how you provision it and what size and type of instances you use. The general contours of that decision: more, smaller instances versus fewer, bigger instances. Fewer bigger instances is obviously less work if you're running things ad hoc, by hand. If you have good automation, it doesn't really matter how many you have, especially if you also have good centralized logging and monitoring, because then hopefully you're not SSHing in. We have like 12 tiny back-end web servers; I don't want to have to SSH into all of those to grep logs or whatever. That's terrible.

We use t2.medium instances. How many of you know about the T2 stuff? Okay, this is worth explaining. T2 instances basically have a pool of CPU credits: when the machine is not very busy, that pool grows toward a ceiling, and when you need to burst, it can go to about two and a half times its baseline, drawing down those credits. What's cool about that, compared to using autoscale groups to grow dynamically, is that it's instant. With an autoscale group, even if you have baked AMIs, you're waiting at least one, two, three minutes for a new machine to come up. A T2 instance can go from totally idle to fully bursted as fast as you can send requests to it. I really like that: you get the burstability without worrying about an autoscale policy or trying to predict what your load is going to be.

The only downside is that you can run out of T2 CPU credits, and then the machine slows to a crawl. The really dumb way we work around that: a t2.medium has about 20% of one CPU available when it's out of credits, so we did some benchmarking — I drew all the CPU credits down and benchmarked at about the traffic rate we normally get — and we provisioned enough T2 instances that we can basically run the site when they're out of credits. It's a little pokey, but it basically runs, and we can still burst for the kinds of increased traffic we get around PuppetConf. We also monitor those credits — we use Icinga 2 for monitoring — so even if we drop the ball and don't respond to a page, the site will basically stay up and we can burst.
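The credit pool is exposed as the CPUCreditBalance metric in CloudWatch, which is what our checks watch. A sketch of pulling it by hand — the instance ID is hypothetical:

```
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0example \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)"
```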
All right, moving along here. We'll get back to the details of how we provision these things a little later; right now I'm just working through the packet flow. The next interesting question, in my view, is shared storage, because you can find results in Google all day about how to run a decent EC2 Nginx or Apache setup — you don't need my help for that. Shared storage is painful. When I say shared storage — is this a concept that needs explaining? No? Okay.

There are pretty much four options here, and they consist of whether you're going to use some variant of S3 or you're somehow going to use NFS. Please come talk to me if there's another good option, because these are all kind of painful. I know you can use Gluster or whatever, but that's even worse.

The best option, if you can manage it, is S3 with a Drupal module that does native integration. I have been part of a couple of projects that tried to do this, and we all got about 95 percent of the way in, where it basically worked, but there was some module we couldn't get rid of that expected POSIX semantics because it bypassed the Drupal file system abstraction. It's really frustrating, but this has been my experience every time I've tried it, both at Puppet with Drupal 7 and Drupal 8 and with consulting clients before that. It should just work, and my experience has been that it doesn't quite — especially for a brownfield project where somebody has already locked in the module decisions.

The next way to use S3 is a FUSE file system mount. FUSE is a userspace file system layer; you can mount S3 so it looks like a local file system. There are two problems here. It's a leaky abstraction: S3 is not a POSIX file system, it's a lot different, and FUSE presents a POSIX abstraction over it, so you have kind of a messy interface there. It's also slow. I wanted to list it because I know a lot of people use it successfully, it's a lot lower operational overhead than running NFS, and it's a lot easier on developers than redoing everything so you can use S3 with a bunch of poorly behaved modules. So it is an option, but before you go into production with it, benchmark the heck out of it.

The most traditional option, which you all know, is to just stand up an NFS server. The failure mode of NFS is really gross. It will hold a lock on whatever it's trying to read, and it will try to read forever; your NFS server can disappear and it will just keep trying. The failure mode I've seen, if you're running PHP-FPM, is that it basically poisons each worker process as it tries to make requests: they start blocking, so when a worker makes a request for a shared file, that worker disappears until it times out and dies. If you Google, a lot of people will tell you it's easy to tell NFS to break these requests. If that's true, please pull me aside afterward and demo it to me, because I spent weeks trying — all these blog entries told me how easy it was, and I just felt dumb. I just couldn't do it. The only way I found to reliably break a hung NFS mount is to reboot the box. Of course, this is the option we use, but we've got a real dumb hack that's kept it from being a problem for us. I'll tell you about it in just a sec.
The last option came out like three weeks after we committed to using NFS and went into production: Amazon's Elastic File System. EFS is a hosted NFS service. It looks really cool — I haven't run it in production, so I don't want to tell you it's the bee's knees, but it looks like sort of the ultimate answer to this question, if it does as promised. The real caveats are that it's in beta, and you have to use Amazon's DNS for internal DNS resolution. So if you have a VPN link to your VPC and you have to use your corporate DNS for policy reasons, this may not be an option.

I don't have a slide for this, but here's what we do to make NFS less of a problem. This is almost an embarrassing hack to admit, but it's worked so well that I want to share it, because I know a lot of people run NFS. We have a cron job that, every two hours, rsyncs the NFS mount to a local cache directory, and we have an Nginx try_files directive that tries that local path first; if it can't find the file there, it hits the NFS mount. We don't have people uploading stuff very often, so well upwards of 99.9% of read requests get served out of the local file system, and anything that isn't there immediately comes out of NFS. It's faster, and it effectively solves that worker-poisoning problem: if your NFS server goes down, so few workers are hitting NFS at any given time that the handful that get poisoned will time out, die, and respawn before it becomes a problem. Your respawn rate can keep up with your loss rate, and the site stays up. So even though we have an architecture there that I wouldn't recommend — one NFS server for three availability zones, a clear single point of failure — we've been running like that for more than a year, have not had any problems as a result, and got a performance boost as well.
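A sketch of the two pieces, with hypothetical paths (the cache directory has to mirror the docroot layout for try_files to line up):

```
# Cron: refresh the local cache from NFS every two hours
0 */2 * * * rsync -a --delete /mnt/nfs/ /var/cache/drupal-files/
```

```
# Nginx: serve uploads from the local cache first, NFS only on a miss
location /sites/default/files/ {
    root /var/cache/drupal-files;
    try_files $uri @nfs;
}

location @nfs {
    root /mnt/nfs;
}
```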
Next part — yes? [Question about distributed file systems.] Yeah, so there are some really cool distributed file systems out there; I think Gluster and Ceph are the obvious ones, and if you're willing to go commercial there are others. They're super compelling. When I have experimented with Gluster, though, the operational overhead and the amount of knowledge it required to run were just beyond what I was willing to impose on my colleagues. All of us are on call for everyone else's stuff — I might get woken up for network gear I've never used, the network guys might get woken up for web server stuff — so it all has to be kind of an "every man a rifleman" situation, and Gluster is really, really cool, but it's not the kind of thing everybody should have to learn how to run. If you're starting your own Drupal hosting service, you should probably look at Gluster; if you're doing your own hosting, you should probably use one of these other options. Does that kind of answer it? Cool. No, I don't know that term — oh, cool, that sounds like a good thing to look into. I see one in the back too. [Question about Drupal 7 versus Drupal 8.] Right — and please correct me if I'm wrong here — I don't think Drupal 7 versus Drupal 8 will make any difference as far as your shared file back end; your decision-making tree is pretty much going to be the same there.

Cool. I'm going to skip forward because I'm running slower than I should be. I'm real strongly opposed to running your own database stuff in AWS. With Postgres there are a handful of reasons you might need to, like custom Postgres extensions; in the Drupal MySQL world, I've just never met a good excuse. So in that part of the decision tree of how to run Drupal on AWS, you have three options: you run your own MySQL, you use MySQL on RDS, or you use Amazon's Aurora, which is a MySQL-compatible multi-master database they run as part of RDS. Don't run your own. RDS is what we use, because we got set up before Aurora existed. The trade-off there is real simple: the only real downside to Aurora that I'm aware of is that the smallest instance size you can buy is really big, so it's a little bit expensive. [Comment from the audience.] Right, right — I totally agree, Aurora should be the default choice. If you're super price-sensitive, which I know some people will be because they're running custom hosting for some tiny outfit that counts every dollar — if you're tempted to run your own MySQL because it'll save you a couple of bucks a month — then you might be interested in the cost savings of RDS. If you're running development instances that only need a t2.micro-sized database, you might want to run those on RDS. But if you're running production and you don't have a compelling reason to stick with RDS, use Aurora, because it's multi-master, and the failure mode there when a node goes down is that you won't notice. With RDS you have a traditional primary/replica failover model; my experience has been that it fails over basically instantaneously, but that's not guaranteed — it can take a few minutes, and you'll be throwing database errors on your site while that's happening.

[Question about single-AZ RDS.] Yeah — the comment was that single-AZ RDS doesn't make sense, and unless it's production, I don't think it does either. What we do in Puppet is set the replication options in Hiera based on the stage: the dev version of the infrastructure is single availability zone, the staging version is multi-AZ, and production is multi-AZ. We just save a couple bucks on dev, because that's the putzing-around environment; I don't need HA for that one.

One quick thing here that surprises people: we got rid of Varnish, and we got rid of memcached. This is probably not the right answer for everybody, but it simplified our architecture, because we could just do some MySQL tuning. It turned out that when I came on board, the reason we needed Varnish to make the site perform acceptably was that MySQL was basically not tuned at all — it was at its default settings. That's a scenario I've seen with a lot of consulting clients, and it's the first thing I look for when I go into an engagement, because the default MySQL settings, if nothing else, give you something like 10 megs of memory to work with — and it does amazingly well with that, but you can do a lot better. Once we tuned MySQL correctly, we could get to about a one-second page generation without Varnish. That took a piece out of the stack and removed the question of how to clear those caches, because the Drupal caching works fine.
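What "tuned correctly" means depends entirely on your instance size and working set, but as a hypothetical illustration of the kind of settings that matter versus the defaults:

```
# my.cnf (sketch -- the sizes here are made up, not recommendations)
[mysqld]
innodb_buffer_pool_size = 2G     # the big one; the default is tiny
innodb_log_file_size    = 256M
innodb_flush_method     = O_DIRECT
query_cache_size        = 0      # often a net loss on write-heavy Drupal
max_connections         = 200
```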
I see a question in the back. Yep — so the question is whether to run memcached locally on the EC2 instance versus using the hosted ElastiCache service. I think it depends on your use case. What we used to do, when we had memcached, was run one on every web server, and because we wanted those caches to be consistent, we used Puppet to configure the same failover order everywhere, so that every web server would hit every other web server's memcached in the same order. That way, if one of them went offline, all of them would fail over to the same next memcached and the caches would stay consistent. I was really happy when we could just get rid of that and not think about it. So if you don't really care about consistency and you don't need to burn a ton of memory, you kind of might as well run it locally. If you want a shared instance and you do care about cache consistency, my impulse would be to use ElastiCache and let Amazon handle the high-availability question — point everything there and make it their problem, not yours. Oh — they don't do that for memcached? Ha. That doesn't sound like any fun at all. All right, we'll scratch that recommendation then; you're on your own.

The last thing I want to say here isn't really about Varnish or memcached, but the ultimate reason we were able to get rid of them is that we were benchmarking in a reasonable way. Most of the people I've worked with who run Drupal infrastructure benchmark in a very ad hoc way. The process I use, which has worked fairly well for me in brownfield-type environments, is: take existing web server or load balancer logs, parse the URLs out of them, generate a list I can feed into Siege, and then just generate traffic with Siege. I don't look at the numbers Siege gives me at all — I just want it to generate load; it's just a way to have users in my benchmarking environment. Then I use New Relic or some other tool like that, and I adjust how aggressive Siege is until the load numbers look about the same in my dev environment and my prod environment. After that, I have a reasonable degree of confidence that I have a similar load profile going in, and then I start doing performance tuning — but the only numbers I look at are what I get out of my metrics tool, not what Siege tells me. Siege is a great tool for really ad hoc stuff, and I might start off using its numbers, but you get so much more in-depth data from a more powerful tool. Your whole benchmarking process will be better, and you'll end up with better infrastructure for a ton less work.
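A sketch of that workflow — the log field position and the hostname are assumptions about your particular log format:

```
# Pull real URLs out of an access log and replay them as load
awk '{ print "https://www.example.com" $7 }' access.log | sort -u > urls.txt
siege --file=urls.txt --concurrent=25 --time=10M
# Watch New Relic (or your metrics tool), not Siege's own numbers, and
# adjust --concurrent until dev load looks like prod load.
```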
So obviously I'm going to be talking about config management. We use Puppet, obviously — we use it for just about everything — but I'm not going to talk about a whole lot of Puppet-specific stuff here, for two reasons. I don't want to just be a product-pushing person, and I also don't think I have anything to offer that's particularly unique, except for a couple of slides that I'll share, because you can google "how do I Puppetize" whatever and you'll get an answer. What I'm going to focus on is the provisioning side, because that's the one thing I have had a real hard time with: the question of how I get a node from not existing to being a Puppetized part of my infrastructure. That's the part where there's a shortage of tutorials.

Side note: sometimes I'll hear people say they don't want to use config management — that they have people who won't learn it, who aren't sort of technically sophisticated enough. That's a clue that you should be using a hosted service, and I don't mean that in a derogatory way. One way or another you're going to pay the cost: either the initial investment in config management, or a higher long-term cost through the pain of operating without good tooling. So either pay somebody else to do the config management and use their service, or bite the bullet and just go into it.

At a high level, the hard parts of config management as I see them are: managing secrets — there aren't great stories for that in most tooling; bootstrapping, which is what I'm going to focus on; and code complexity, which is an important one. Most of us who run infrastructure don't come from a software development background, and writing code to manage a lot of complex moving things ends up producing complex code with a lot of code paths. If you don't use some kind of reasonable methodology for refactoring, testing, and structuring your code, you'll end up with really messy code. You have to be aware of managing code complexity when you go into this, because otherwise you'll end up frustrated, you'll switch configuration management tools, and three years later you'll be in the same spot — because it wasn't the tool, it was the code complexity. I'm going to skip drift because I'm short on time.

The way we provision is with the puppetlabs-aws module. The Puppet master has just enough IAM permissions to create a tiny provisioner instance — like a t2.small — which in turn has the IAM permissions to provision more services. The main reason we do it that way is to contain complexity: it's easier to reason about a Puppet master that isn't also provisioning all my other stuff. We have one provisioner for each of the dev, stage, and production stages, so the dev one provisions all the dev versions of services, the stage one the staging versions, and so on. That allows us to test changes to how we provision without blowing up production.

The actual high-level path for how we get a node from not existing to being part of our infrastructure took a lot of work for me to figure out; I tried a bunch of different permutations. A bit of background: the way Puppet works is that when you connect an agent to a Puppet master, the first time Puppet runs it requests an SSL certificate to be signed by the master. If you don't have any automation, that means you either have to go into the PE console and click a thing, or use the command line to sign it yourself. There's a tool called policy-based autosigning that allows you to automate this, but it's more of an API than an end-user tool, so I wrote a tool called puppet autosign — it's public, and I'll include a link at the end — that generates a cryptographically signed, HMAC'd token.
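On the master side, policy-based autosigning is wired up by pointing the autosign setting at a validator executable. A sketch — the path here is whatever your validator tooling installs, not a fixed location:

```
# /etc/puppetlabs/puppet/puppet.conf on the master (sketch)
[master]
autosign = /usr/local/bin/autosign-validator
```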
When the Puppet compile happens, the provisioner generates one of these tokens and puts it in the cloud-init script it gives to the EC2 instance when it provisions it using the puppetlabs-aws module. The cloud-init script runs the first time the node boots, and it sends that token back to the master along with the request to sign the cert. The master can then cryptographically validate that the token is authentic; it also has a local registry that verifies those tokens can only be used once, and there's a timestamp signed into them, so they're not valid after more than an hour — maybe it's two. Anyway, the cloud-init script then installs Puppet, gets the cert, runs Puppet again a couple of times, and by the end of that the node is part of your Puppet infrastructure and you can do things the normal way. We're not managing AMIs particularly — we've sort of bypassed that problem, because we haven't had a need to bring up nodes that quickly. The provisioner also creates the RDS security groups and assigns the Elastic IPs, although unfortunately it can't provision the Elastic IPs themselves.

I'm running a little short on time, so I'm going to skip this slide — it's kind of cool, come back to it. I do want to make a point about using configuration management end to end. I know a lot of us use configuration management, but a lot of people don't have a process where a server can go through its whole life cycle without a person changing anything manually. There's a huge psychological difference, especially when it comes to operations and developers working together, because you get a culture change when you take the high-stakes, traditional-sysadmin "got to get it right the first time" thing out of the equation — when you can just delete a busted node, rerun Puppet, and it comes back up. Then a developer who doesn't really know much Puppet can experiment in a dev environment and make a pull request, bypassing the whole process of filing a Jira ticket for operations; they document the change they want just by making it and opening a PR. It's great. It has saved me so much time, because they do most of the ongoing maintenance now and they don't have to block on ops.

Another thing I wanted to point out that I haven't seen other people talking about — it's a really simple thing. I see a lot of people templating out config.php with Puppet or Chef or Ansible or whatever. It's really gross to template out executable code: reliably generating syntactically valid PHP from templates is a conceptually hard problem. Generating valid JSON from a data structure is a conceptually easy problem. So we generate JSON, and then have a trivial little bit of PHP in the config.php that just loads that JSON and sets values from it. So stop templating out PHP, because there's a better way to do it that's actually less work.
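A minimal sketch of the pattern — the JSON path and key names here are hypothetical:

```
<?php
// config.php: load values that Puppet templated out as plain JSON
$settings = json_decode(file_get_contents('/etc/drupal/settings.json'), TRUE);

$databases['default']['default'] = array(
  'driver'   => 'mysql',
  'database' => $settings['db_name'],
  'username' => $settings['db_user'],
  'password' => $settings['db_pass'],
  'host'     => $settings['db_host'],
);
```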
The last thing I want to get to is the operational experience of putting this all together, and the bottom line is that we ended up with pretty good uptime. We had some periods in there, while we were getting this figured out, where we were hovering around 99.4 percent uptime, and I wouldn't have included this slide back then, because it feels pretty bad. It made me wonder: why am I hosting this instead of having someone else do it? I'm hoping this presentation conveys some of the lessons we learned, so you can sort of skip ahead to the better part of that curve.

And I think we're just about at time to take questions. Are we done? All right, we don't have time for questions — come talk to me.