So, I'm assuming everyone's had enough of PowerPoint. I'm going to be using the latest in presentation technologies, known as less, for today's presentation — less is more, and this is a talk about DevOps-y stuff, so: command line.

Okay, so I'm talking about testing infrastructure. It's DrupalCon, so I'm talking about testing Drupal infrastructure. This talk is about what it means when you are configuring servers to run your app — when the server infrastructure has some kind of functionality in and of itself — and how one tests that. Lots of people have given excellent talks and written a lot about how you test your website, and there are lots of tools for that. That's not what this is about. This is about actually making sure the infrastructure behaves the way you think it will.

So, first — we have an advanced page-forward technique here. How's that? Good? Okay. First I'm going to talk about what a Drupal infrastructure looks like. We all kind of start here: one server, running Apache and MySQL, and it works pretty well. If this is your infrastructure, there's not a whole lot you need to do, though as you'll see before I'm done, there's still some things even here you'd want to test. But very quickly, assuming your site becomes more successful, you go beyond this.

The first thing you might do is separate your web server and your database server so you can scale them both independently and bring more resources to bear. Then — sorry, got my slides wrong — maybe you want a DB slave server so you can offload some of your heavy read queries. Then you want multiple web servers, so now of course you need a load balancer spanning your multiple web servers. And if you have multiple web servers and you're using Drupal, which tends to have the files directory written to the file system somewhere, you need some kind of file server — NFS or Gluster, or perhaps you're using S3 — whatever it is, you've got to store your files somewhere. And that's great; this works pretty well.

But now your system is running on six or seven servers, which means you're getting relatively successful, and someone's going to ask you: hmm, why do we have single points of failure? So you get into the HA game. You start off with, maybe, two load balancers in case one of them goes down. Eventually you need your file system to be highly available, and then you do your database. I'm not talking today about how you implement a master-master database with failover — there's a variety of ways, and you do one of them. You end up with a system with all these moving parts: multiple web nodes, some form of failover or redundancy or replication across all these components. And now this is a pretty complex system — and I haven't even added in Memcache or Redis and worker queues; Drupal 7 supports job queues, so you need queue workers. You end up with a fairly hefty infrastructure. So this is what we're talking about testing: the stuff in this picture, not the app itself.

So: server configuration is software.
When you have one server, you can run apt-get install apache2 and your site will come up; you can install MySQL too, it's not too hard, and you can edit your vhost file by hand. But pretty soon, especially when you've got that many machines going, you're going to realize you want to automate the configuration of your servers. There are tools that let you do that, and they're pretty good. On Acquia Cloud we happen to use Puppet, but Chef is out there, and Ansible, and there are others, with new ones coming out all the time. Generally speaking they're great, and they do fairly similar things: they can install packages and cron jobs, manage file permissions and users, and so on. But the important thing — the whole value of those tools — is that you're turning your server configuration from a manual task that your sysadmin performs into software.

Software has bugs. That means software must be tested. Like all software: if you go read the principles of continuous integration on any of the million web pages that write about them, you'll see that the ones pertaining to testing tell you that your tests need to be automated, they need to be as fast as possible, and they need to run in a clone of the production environment. It does no good — or it does less good — to run your tests in something that's not like your production environment, because obviously you'll test in something that's not production, push your code to production, and it's not going to work.

I wanted to put "Jaspan's Law" on the slide, but that seemed a little pretentious. So I just have this aphorism, which says: if it isn't tested, it doesn't work. And I cannot tell you how true that is. Brief example. On Acquia Cloud, customers can download their various logs: the Apache access and error logs, the PHP error log, their MySQL slow log, and all that sort of stuff. We recently rolled out a change in the way we implemented that, and after we rolled out the change, the only log users could download was their MySQL slow log. All the others didn't work. Turns out we had a test of our log downloading system that tested the MySQL slow log — so guess what, that one worked. We discovered we had messed something up: our testing environment wasn't complete, it wasn't an exact replica of production, and the mode on the file was different from what was expected, so you couldn't download them. We quickly identified the problem, rolled out a fix, and then all of the logs except one worked — because we had tested all of the logs except that one. If it's not tested, it doesn't work.

So — whoops, that's not what I meant; there we go — I want to take a brief diversion and talk about unit tests and system tests. A lot of people who talk about testing talk about unit tests. Unit tests are great; we should all have them. The way unit tests work is they isolate individual components and test those. If you have dependencies — say a module that talks to a database — you inject a fake database that hardcodes the queries it's going to return, so you can say: when I call this function, I expect this query, I'll simulate that return value, and then the rest of my code should behave in a predictable way. They're great. But it turns out they don't really work that well for infrastructure.
I mean, they can test some of your code, but you cannot mock out the entire real world; you cannot mock out the operating system. A very good example of this is something we encountered recently. We use Puppet, and we install six or seven cron jobs on one of our types of servers, and it had been working great for years. Then just recently, in the last month or so, one of our tests failed. I went and dug through the logs, and dug and dug and dug, and eventually discovered that it looked an awful lot like one of our cron jobs just hadn't run at the exact moment our test expected it to. And I discovered this great little bug in cron.

When you use the crontab program to install a new cron file, what crontab does is write it into /var/spool/cron/crontabs under the username — and then the file, I think it's called crontab, or maybe it's named after the username, I don't remember. The cron daemon, at the top of every minute, wakes up and stats all the files in those directories. If a file's mod time has changed, it says, oh, the file's new, loads in the new file, and then does whatever it says to do. Because we're using Puppet, we have six or seven cron jobs that get installed automatically, and Puppet is relatively fast at doing that: it installed all of those cron jobs within the one second at the beginning of the minute, between second :00 and second :01. After it had updated four of the six cron jobs, the cron daemon woke up, stat'ed the file, said "oh, it's changed," and loaded it in. Then, within the same second, Puppet wrote two more cron jobs to that file. One minute later, cron woke up and stat'ed the file again — but file mod times have a resolution of one second, so cron said "the file hasn't changed" and didn't load it in. There is no way any unit test of your software is ever going to catch issues like that. And our system test didn't catch it for a year or more, but eventually it turned up, we were able to fix it, and that's one more cause of random failures that we won't have.

So the important thing about system tests is that they're end to end. We write our system tests to exercise our infrastructure the way the application will — the way a Drupal site will. We launch real servers; we happen to run on EC2, so we launch real EC2 servers. We set real DNS entries for them with our DNS vendor's API. Everything the Drupal site can do, we exercise in the tests themselves, and I'm going to talk about exactly what those things are in a few minutes. The point is that we're operating outside the environment; we're basically doing black-box testing of our infrastructure.

It's hard. There's an unbelievable number of race conditions. There are things that come up in tests that actually would have been fine in production: you write your test to say "make sure such-and-so happens within 20 seconds," and one day it takes 22 seconds and your test fails. It would have been fine in production, but you had to put some timeout in, so your test fails, so you go fix it, and you end up iterating over this process for a long time until you get all the timeouts just right. These tests are the hardest thing I've ever done. I've been engineering for, you know, 30 years, and this is the hardest code I've ever written.
But without them, we could not run our infrastructure, because there's just no way we would ever keep it reliable — there are so many little details, so many little ways things can break.

Okay, so what are the kinds of things we do? First of all, we build servers: we launch servers in the cloud and we configure them with Puppet and some other software. We always start from a known reference base image — for example, the official Ubuntu 12.04 image. What we do not do is incrementally evolve images. We don't launch an image, install a package, and make a snapshot, and then next week say, oh, now we want Redis, so we install Redis and make another snapshot. When you evolve server images that way, what you end up with is a server image that you don't know how to reproduce, and I guarantee you're going to someday forget to write down in your changelog exactly what you did, exactly which config file you edited or which cron job you installed. So we write our Puppet configs to always start from a base image. Tools like Puppet and Chef make this the natural way to work, so most people are doing this these days, which is great. Unfortunately it can be really slow to build up an image: the first thing you have to do is download all the packages from the nearest mirror, which takes some time, and then running all those apt-gets or yum installs can take a while. I'll talk about how to optimize that a little later.

Okay. So you write your configs to build from a known base image. Then what you need are basic build tests: did Puppet work, basically? Are the daemons I think should be running actually running? Are the files in place? Are the file permissions correct? You scan syslog for errors — you might see Puppet errors, like a failed dependency because it couldn't download a package from the apt repo, or you might see errors from daemons. You end up having to come up with a list of known error messages: it turns out, for example, that NRPE, if you happen to use Nagios, logs a message containing the word "fail" that doesn't actually mean there's a problem. So you make up a list of regular expressions for the things you want to ignore, and then you scan syslog and ask: is anything else here? Does it contain the word "fail" or "error" or "failed dependency" or whatever? And you check that all the daemons are running. Basic build tests.

So now you've got — sure, go ahead. "You're building it from a known base?" Yes. Right. So the question is: if you're building from a known base and you're building it over and over again, why do you need to test that? And the answer is: because you're going to change the code you use to build your server. Today you're installing Apache; tomorrow you're going to say, oh, now I want some other Apache module to go with my standard Apache build. And perhaps you will forget to put into your Puppet configuration that the Apache module has to be installed after the base Apache server package. And if you're not testing that all the things you think are there actually are there, and you forgot that dependency, then it might not be there.
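To give a flavor, a basic build check doesn't need to be fancy. Here's a minimal sketch in Python of the kind of thing I mean — scan syslog for anything alarming that isn't on a known-ignorable list, and check that the daemons you expect are actually running. The ignore patterns and daemon names are invented for illustration; the real lists are specific to your stack.

    import re
    import subprocess

    # Hypothetical examples; your real lists will be specific to your stack.
    IGNORE = [re.compile(p) for p in [
        r"nrpe.*fail",      # NRPE logs "fail" in a message that isn't an error
        r"dhclient",        # routine DHCP chatter
    ]]
    ALARM = re.compile(r"error|fail|failed dependency", re.IGNORECASE)
    DAEMONS = ["apache2", "mysqld", "memcached"]

    def check_syslog(path="/var/log/syslog"):
        # Return syslog lines that look alarming and aren't on the ignore list.
        bad = []
        with open(path) as f:
            for line in f:
                if ALARM.search(line) and not any(p.search(line) for p in IGNORE):
                    bad.append(line.rstrip())
        return bad

    def check_daemons():
        # Return the daemons we expected to find that aren't in the process list.
        ps = subprocess.check_output(["ps", "-eo", "comm"]).decode()
        return [d for d in DAEMONS if d not in ps]

    if __name__ == "__main__":
        problems = check_syslog() + ["%s not running" % d for d in check_daemons()]
        for p in problems:
            print("BUILD TEST FAILURE:", p)
        raise SystemExit(1 if problems else 0)

The point isn't the specific checks; it's that this runs automatically on every freshly built server, so a silently broken build never makes it any further.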
What's really great about Puppet in particular — I've never used Chef, so I don't know if this is true there — is that Puppet is a declarative language where you define all your resources. You say "Apache is installed," and when Puppet runs that code it goes and looks and says, oh my goodness, Apache is not installed, so it installs it for you. But across all these different resources you have to specify the dependency graph: if I want to install the MPM worker package for Apache, I'd better do that after the Apache server package is installed. So you specify requirement statements — this has to come before that, and so on. And outside the context of your stated requirements, Puppet offers no guarantees about the order in which it will do things. As far as I can tell, it randomizes them; I think it's actually using an internal hash data structure, so if you change one little thing, your hash keys can get reordered. So your code can work for months, and then you change something in a completely different Puppet file — some Puppet file over on the right side of your code changes the dependency order — and something starts failing. If you're not testing it, you'll never know until you launch in production and your server doesn't come up. That's why. But you can still get random failures, right.

Okay. You also want to test the moving parts of your infrastructure. Lots of people have backups: you want to trigger a backup just like your system would, then trigger a restore, and make sure that what was restored is what was backed up. If you use message queues or auto scaling, you need to put at least enough load on them to trigger them — yes, I can scale up; yes, my load balancer works. If you have database failover, turn off one of your database servers and see what happens — that your system triggers a failover appropriately. Cause an event that should trigger an alert and make sure you actually got the alert, because otherwise the event is going to happen in production and the alert isn't, because your Nagios check script had a bug in it or whatever. So, right: moving parts.

These moving-parts tests we don't do via the application; we do them more directly. If we want to test database failover, we'll SSH to one of our active servers, shut off MySQL, and see what happens: does the daemon that monitors and triggers the failover do its job? So this is explicitly triggering each of the failover components — it's a little bit white-boxy. I'll get back to the other half of that topic in a moment, apparently.

Beyond all that, we do a lot of security compliance work on Acquia Cloud. I mentioned a moment ago that we'll SSH into a server and shut off MySQL. Well, we have security compliance requirements that say no one on our operations team — and even our automated infrastructure — is allowed to log into our boxes as root. You have to SSH in as a non-root user and then run sudo, and there's a list of commands you're allowed to sudo, and all that. What we realized is that if we wrote our tests to cheat — if we said, all right, during the tests I want to shut off mysqld, so I'm just going to log in as root and stop mysqld — then we're not testing in a clone of production.
For us to have code that enables SSHing in as root in tests means that we can't be 100% sure we don't have code that lets you SSH in as root in production. If our tests depend on it, we can't turn it off. So we had to go out of our way to write our tests not to log in as root either; our tests now run SSH and run sudo like everyone else. It wasn't that big a deal, but we had to have the discipline to say: no, we are not going to violate the assumption that test is the same as production. We use a two-factor login system for some of our servers, so our tests have to play with that if they want to log in and test those servers, and so on. Okay, so that's testing the moving parts of the system.

Then we do application testing. And of course, because it's Acquia Cloud, the application is usually Drupal, so we install Drupal sites — Drupal 6 and Drupal 7 and some distros — and then we test the infrastructure via the app. For example, first we'll just create a node and make sure it got into the database and that we can read it back. Then we'll shut off the primary database server and try to create a node again, and make sure the failover code works — that the logic we provide to our customers' Drupal sites uses the failover properly. So we know that when a failover occurs, all our customer sites will work: not just that the MySQL layer is working, but that we've configured the app properly. We'll turn off individual web nodes and make sure the load balancer keeps sending traffic to the web nodes that are still up. A variety of things like that. And we're doing all of this end to end: when we want to test the Drupal site, we'll SSH to the database server and shut off MySQL, and then we'll run curl and hit the front page of the site. What do you do? How does it work? Just to make sure the app really is functioning properly.

So, another set of useful things — oh, feel free to interrupt me at any time, I should have said that earlier. [Audience question.] Right — you've got to make sure that when you hit the site, the server really is off, and not that it's still in the process of shutting down when you happen to curl. The point he's making is that there are many opportunities for race conditions. If we decided to shut off one of our web nodes by running an EC2 terminate-instance call, well, that doesn't happen instantly. So we could easily trick ourselves: terminate the instance, send the request, hit the load balancer while that instance is still running, and we wouldn't actually have verified that the site works with that server down. What we generally do — starting and stopping servers, I have a slide on this at the end, is actually relatively slow — is, when I say we shut off the web server, we just run an Apache stop and check that the apache2 daemon has stopped. So we know the daemon has stopped. But you're absolutely right, there are plenty of race conditions.

A really good one: a couple of slides ago I said we test our backup and restore systems. What we do is load an initial database in, make a backup, and then delete a table and add a new one — we add the doomed table and we delete the necessary table — and then we restore the backup and make sure that the doomed table is gone, because it wasn't in the backup, and the necessary table is back, because it was in the backup. And we do this against both our standalone database servers and against our HA replicating pairs of database servers.
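A minimal sketch of what that round-trip check amounts to — the real test drives our own backup system, so make_backup and restore_backup here are stand-ins for whatever triggers a backup and a restore the way production would, and the mysql invocation assumes credentials are already configured:

    import subprocess

    def sql(host, query):
        # Run a statement against the test database over the mysql CLI.
        return subprocess.check_output(["mysql", "-h", host, "-e", query, "testdb"]).decode()

    def test_backup_restore(db_host, make_backup, restore_backup):
        sql(db_host, "CREATE TABLE necessary (id INT)")   # present in the backup
        backup_id = make_backup()                          # trigger a backup the way production would
        sql(db_host, "DROP TABLE necessary")               # mutate state after the backup...
        sql(db_host, "CREATE TABLE doomed (id INT)")       # ...so the restore has something to undo
        restore_backup(backup_id)
        tables = sql(db_host, "SHOW TABLES")
        assert "necessary" in tables, "necessary table did not come back from the backup"
        assert "doomed" not in tables, "doomed table survived the restore"

The value is in running this against the same code paths a customer restore would use, against both the standalone and the replicating-pair configurations.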
We had a bug in our tests for a while around that. We make our backups off of our replicas, because we don't want to take a live MySQL dump off of a running active server unnecessarily. But when we restore, we restore to the active — we're replacing the content anyway, so we might as well make sure both the active and the passive are current. The bug was that we would restore to the active and then go check whether the doomed table was gone and the necessary table was back — but we were making that check against the passive server, and it so happened that at the instant we made the check, the data we'd loaded into the active just hadn't replicated to the passive yet. So our test failed. If our test had waited a few extra seconds, it would have passed. Again, there's just an infinite number of little edge cases like that — although in this case it revealed a real bug, which was that our tests were using one of the libraries we'd written to decide which server to talk to, and in certain cases it was talking to the passive when we thought it was talking to the active. So that proved to be very useful. One moment.

Okay, so, application tests. Next: reboot tests. Exactly what it sounds like. After your servers are up and you've tested a variety of things, reboot all your servers. Do they come back? Do your daemons restart? Do your file systems all mount? Does your site keep getting served? Especially if you're running in a cloud environment — I can tell you for a fact this happens in EC2 — your instances will just spontaneously reboot sometimes, and if they don't work when they come back, your site will be down until someone goes and manually restarts the daemon. When we first added our reboot tests, we discovered that in our Puppet config, where we assert that the Apache package is installed and the Apache daemon is running, Puppet has an option to put it in the init system so it starts automatically at boot — and we hadn't set that to true. So when we rebooted our web servers, Apache did not come back up. Obviously we fixed that, and now we know that when our servers reboot, they will work.

In a perfect world we would run all of our tests, then reboot all our servers, then run all of our tests again. We'll talk about test timing and how long things take at the end — we actually don't re-run all of our tests. We kind of reboot halfway through and then do some smoke tests after the reboot: did the daemons come back, does a sample site still work? For our reboot test we'll install a site, reboot all the servers, and make sure that site is still up. But the things we already tested — like, when I install a new domain name, it gets put into the load balancer — we don't necessarily repeat all of those. We just do our reboot sort of in the middle and figure we still have half our tests left. You'll see why a little later.

Okay, so after reboot tests, we have relaunch tests. Something else you will learn on EC2 — excuse me, this podium keeps moving on me — is that in addition to your instances spontaneously rebooting, your instances will also spontaneously just die.
So it's one thing to configure your server from scratch — with an empty database, maybe using EBS or not, bringing up a brand new database server. But you also have to know that when your database server dies and you bring it back, however it is you're persisting your data — whether that's a persistent disk like EBS or re-seeding it from a replica — the server will come back, repopulate itself with its data, and reinsert itself into its replication pools or whatever it's going to do. This applies to the reboot tests too: if your load balancer is balancing across all your web nodes, you have to make sure that when a node reboots or relaunches, the load balancer will discover it again and start sending it traffic. So for these tests, we install a sample site on a collection of servers, then kill all the servers and bring them back. Is the data still there, and all that? Again, in a perfect world you would re-run all of your tests — first normal, then reboot, then relaunch. Depending on the extent of your system test suite, that may or may not be practical; again, we do this part way through our test runs.

You might think, oh my god — really? Do we need to do this? Well, for each one of these things, I can tell you what happened. When we added the reboot tests, we discovered a lot of our daemons weren't configured to come back. When we added our relaunch tests, we discovered this entertaining little wrinkle. On our monitoring servers, I think, we use Nagios, and we use a RAM disk — tmpfs — for some reason or other. And we discovered that, yes, the tmpfs was in /etc/fstab, so when the server boots it should be remounted. But in point of fact, when the server reboots and gets to mounting the file systems in fstab, it hangs. To this day we don't actually know why. For some reason — actually, I think this line on the slide is wrong; I think it should be on the reboot line — we cannot reboot a server with tmpfs mounted in /etc/fstab. I don't know why; if anyone happens to know, please tell me afterwards. But the point is, we discovered this problem, and in fact, in production, our monitoring servers cannot come back up completely unattended because of it. At least we know that now, and our ops team knows how to deal with it. I think the ops team had known the problem existed, but until we implemented the test we hadn't actually pinned down why. Anyway, moving on.

So: upgrade tests. Everything we've talked about so far has been testing servers that were newly built for the purpose of a test run. You start with your base image, you run Puppet or Chef, you install whatever other software, and then you run some tests. You're basically testing a new server, one you're launching for the first time. That's a very important thing to do, because as your service grows you will add more web nodes or more customers or whatever — your service is growing, you need more servers. But obviously you have old servers too, and as you add more features to your platform and start enhancing your configurations, you're adding more stuff to running servers.
So you also have to know that when you deploy your new version, all your existing servers will be upgraded properly and will keep working. We call it the upgrade test dance: we launch a set of servers based on the old version of our code, install some sites on them, and do some smoke tests. Then we upgrade them to master, to the current version — which of course means we have an automated upgrade process to go from the previous version to the current one — and then we run a bunch of tests on it. Again, ideally we'd run them all; we don't, we run some smoke tests for our upgrades. But we've found plenty of problems in this process too: code that works great when you run it for the first time but doesn't really work in an upgrade.

There's this sneaky little bullet on the slide: "requires a fully automated upgrade process." And my ops team members who are in the room are staring at me saying, you don't have a fully automated upgrade process — and I think I have more about that on a later slide. Right, in fact I have a whole slide on it. Doing upgrade testing means you need an upgrade release process, which ideally would be fully automated, but that actually turns out to be very difficult. Tools like Puppet and Chef cannot really do orchestration across multiple servers. They're great at saying "don't install this cron job before that package is installed," but not so great at saying "don't install this cron job before that package on that other server is installed," or "until this other task over there is completed." Actually, I was just speaking yesterday with Damien at Commerce Guys, and he was talking about a system he's built that tries to solve this problem as part of their hosting infrastructure. He said he's probably going to give a talk about it, maybe in Prague, and I look forward to hearing it, because orchestration across lots of different servers really is a very complicated problem.

Sometimes we'll have releases where, for this particular release, we have to update all of the load balancers before we can update the web nodes, or the web nodes before the load balancers, or the database servers before the job servers — all the possible combinations occur. Sometimes an upgrade means we're reconfiguring MySQL for some reason, some setting in my.cnf, which means we have to restart all of our database servers. We have customers who are paying us for HA databases, and they don't really like their database servers going down. So for those, we upgrade the right half and then the left half of each replicating pair. We haven't built a tool that automates all the possible variations of that dependency graph. We've talked about it, we've thought about doing it, and our ops team would probably like us a lot more if we did. For now, what we end up doing is the engineers write and test procedures and give them to our ops team for when they actually run the release in the middle of the night, with the guy in Australia who does our overnight releases. And because they're not automated, sometimes those manual release procedures fail. That sucks, and it just goes to the importance of trying really hard to automate as much of it as possible.
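To make the dance itself concrete, here's a minimal sketch of its shape. The launch, deploy, and smoke_tests callables are stand-ins for our real launcher, our automated upgrade process, and our smoke tests — this is just the skeleton, not the real thing:

    def upgrade_test(old_version, new_version, launch, deploy, smoke_tests):
        # Build the environment the way an existing customer's servers were built:
        # from the *previous* release, not from master.
        servers = launch(version=old_version)
        smoke_tests(servers)                    # the old version must work before we touch it
        deploy(servers, version=new_version)    # the automated upgrade process itself
        smoke_tests(servers)                    # and the upgraded servers must still work

The crucial part is the first line: launching from the old version is what exposes the "works on a fresh install, breaks on an upgrade" class of bugs.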
So, you know, nothing's perfect. Anyway, moving right along. Okay — I mentioned at the beginning that we always do our builds from base reference images, the base Ubuntu version we're using, and that it takes a long time. The fact that it takes a long time makes our tests slow, because we have to start by launching a set of servers and then doing something with them. On an m1.small — an inexpensive server — it can take 30 minutes for Puppet to complete, and that means our tests can't start for 30 minutes after they launch the instances. Incidentally, it also means that when a customer signs up and chooses an m1.small and we launch it, it can take that long before their server is ready.

So what we've been doing, starting late last fall, is automating builds and snapshot bundles on a nightly basis. At the beginning I said you can't evolve an image — you can't make one little change and snapshot, and another change and snapshot. What we do instead is, every night, build new servers from scratch of each of our types — web nodes, database servers, load balancers, job servers, etc. — and then make snapshots of those, so that for the next 24 hours, when we run our tests, they can start from the pre-bundled images. Now they come up in three or four minutes instead of thirty or sometimes more. That makes our tests a lot faster. We've been working with these for a while, and our plan is to start using them in production as well, so that when a customer signs up it takes only three minutes to launch their server. We've been waiting just to get really comfortable with the bundling process, because, as my bullet point says, you will hit unexpected bumps in the road — just like with reboot and relaunch, you always find the craziest things.

Here's one we discovered. When you install MySQL for the first time — or Percona or MariaDB or any of those variants — the package first puts all the files in place, /usr/sbin/mysqld and all that, and then it runs a post-install script. That script does things like create the mysql user, create /var/lib/mysql and initialize the data directory, and create the initial mysql database and the root user and all that. As I've hinted, we use EBS, Amazon's Elastic Block Store, for our databases. So when we bring up this machine whose sole purpose is to be launched and bundled as the database server template for the next 24 hours, we install MySQL, and it initializes the data directory — on an EBS volume that was attached to the server at the time we launched it for the purposes of bundling. When we're done, we snapshot the server, then we kill the server and delete the image. Now we relaunch that bundled image later. MySQL is already installed; it thinks /var/lib/mysql has already been initialized. But we're launching it with a brand new EBS volume, either for a test or for a customer. So we need to take the things MySQL did at installation time and reproduce them post-launch on the bundled server. It isn't that it was that hard — it's just something we had to do. We had to find all these issues and work around each one.
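Conceptually the nightly bundling job itself is small. Here's a sketch using boto3 purely for illustration — our tooling predates it, and the AMI ID, instance type, and the configure callable are placeholders for the real base image and our real Puppet provisioning:

    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    BASE_AMI = "ami-xxxxxxxx"   # the official Ubuntu base image (placeholder ID)

    def bundle(server_type, configure):
        # Launch a throwaway instance from the known base image...
        inst = ec2.run_instances(ImageId=BASE_AMI, InstanceType="m1.small",
                                 MinCount=1, MaxCount=1)["Instances"][0]
        instance_id = inst["InstanceId"]
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        configure(instance_id, server_type)   # stand-in for running Puppet etc. on it
        # ...snapshot it as today's pre-built image for this server type...
        name = "%s-%s" % (server_type, time.strftime("%Y%m%d"))
        image_id = ec2.create_image(InstanceId=instance_id, Name=name)["ImageId"]
        # ...and throw the builder away. For the next 24 hours, tests launch from
        # image_id instead of spending 30 minutes running Puppet from scratch.
        ec2.terminate_instances(InstanceIds=[instance_id])
        return image_id

The hard part isn't this loop; it's all the assumptions baked into packages' post-install scripts, like the MySQL data directory issue above, that have to be redone after a bundled image launches.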
We also had to make sure all the file systems were mounted via /etc/fstab — except, remember, when you launch a server in EC2, whether from a base image or one you've bundled, the process is: first you launch the server, and then you attach EBS volumes to it. So you can't be 100% sure, as it goes through its boot sequence, that those EBS volumes will be in place by the time fstab is processed. On a bundled server you can't just say "mount /dev/sdc on such-and-such," because /dev/sdc might not exist yet. So in our bundling process we had to take the daemons that expect those file systems to already be there and tell them not to start automatically at boot on the bundled servers; we wait until we've had time to attach the EBS volumes, and then we run Puppet, which installs all the init scripts, so that if the server ever reboots again it will come back with its disks in place and all the daemons will start. Again, none of it is rocket science — it's just plowing through each failure in turn. And of course we run our tests both on freshly launched instances and on bundled instances every night, so we can flush out these issues; it's through that that we're finally getting confidence in our bundled images, and we'll hopefully start deploying them in production pretty soon.

Alright — I think I actually said this already — we build these development images nightly, we bundle them, and then we test with them for 24 hours. If it's the middle of the day and I'm a developer with a new feature and I want to run a test, I use the bundled images because it's way faster. And we also have an automated test run against the bundled images, so we know that nothing went wrong in the bundling process and that our script which does all those tweaks I was mentioning — fixing up the MySQL directory and so on — keeps working, on a nightly basis. Okay, how are we doing? Good.

Okay, so, as I've said, these tests are slow. You launch an EC2 server, which takes a certain number of minutes — I'll actually talk about that. Then you install a bunch of packages, you make backups, you do SSHes, you bring the server up and down. You want to make sure the load balancer takes a stopped server out of rotation and sends traffic elsewhere, and then you start the server again and make sure the load balancer puts it back in rotation — and you've probably configured your load balancer not to put it back in rotation immediately; maybe it goes back on the first successful health check, but it takes some time. There are a lot of delays inherent in the process, so even a successful test can take several minutes to run, and when you have hundreds and hundreds of them, it adds up. Back last fall, before we added parallelization, it would take around six to eight hours to do one test run. That is death for a developer who's trying to work on a feature: they want to run a test, it takes six hours, which basically means they start the test and they're not going to know until the next day whether it passed. It totally kills your development velocity. Since then we've added a bunch more tests — I think our serial test runs would be up to 12 hours now — but we never run them that way anymore, because we created a parallel test runner, which very simply breaks our tests up into chunks and runs them on parallel sets of servers.

One interesting wrinkle: we thought, couldn't we launch one set of servers and just run our tests against it in parallel? And some of them could run that way — tests where we install a Drupal site and do something to it, make sure Drupal works, install a new domain name and see that it goes into the Apache vhost, those probably could run in parallel. But other tests do things like, as I said, kill the web server. If one test running in parallel kills the web server while another test is making sure the load balancer is sending traffic, it doesn't work: your tests step on each other. So when we run tests in parallel, we launch completely independent sets of servers. That infrastructure diagram I drew at the beginning — if we're running eight sets of tests in parallel, we launch eight copies of that, and run tests against them in parallel.

We then looked at what else we could do to make our tests as fast as possible. You can add more workers, and that certainly works, though only up to a limit: as I said, launching the servers themselves takes a certain amount of time. On our nightly runs, where we're using non-pre-bundled images and starting from scratch, it might take 30 minutes to build up all those servers. Say we have 100 tests and each one takes five minutes; if we said, all right, we're going to run 100 workers and launch 800 Amazon servers all at once — well, it's a little silly to spend half an hour building up your test environment, run one test for five minutes, and then tear it all down. So the number of workers we use is correlated to the setup time: as we shrink the setup time we can add more and more workers, and we also find that, past a point, until you get that setup time smaller and smaller, adding more workers just doesn't help much.

Another thing we found: don't run your slowest test last. If you've got eight workers running in parallel, all churning through, and the very last test takes an hour, then that one worker is going to be busy for an hour after everyone else, and your whole run can't really say it's finished until that last laggard finishes. It turns out there's a whole field of mathematics around scheduling algorithms for exactly this: you have a variety of tasks, you don't know exactly how long each will take but you have some historical data, and you want to organize them to minimize the overall time. There are some really interesting results in that branch of mathematics — there are scenarios where you have six workers and a set of jobs, and adding a seventh worker makes the whole run take longer, because what used to be your longest test fell somewhere in the middle with six workers, but with seven the balance of jobs works out differently and now your longest test runs last, so the whole thing takes longer.
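If you do tackle it properly, the classic greedy heuristic from that literature is "longest processing time first": sort the tests by historical duration and always hand the biggest remaining one to the least-loaded worker. A minimal sketch in Python — the test names and timings here are invented:

    import heapq

    def schedule(tests, workers):
        """tests: list of (name, historical_seconds); returns worker -> list of names."""
        # Longest-processing-time-first: greedily give the biggest remaining job
        # to whichever worker currently has the least total work assigned.
        assignment = {w: [] for w in range(workers)}
        heap = [(0, w) for w in range(workers)]      # (total_seconds, worker)
        for name, seconds in sorted(tests, key=lambda t: -t[1]):
            load, w = heapq.heappop(heap)
            assignment[w].append(name)
            heapq.heappush(heap, (load + seconds, w))
        return assignment

    # e.g. schedule([("shared_site_provisioning", 2400), ("nagios_solr", 50)], workers=6)

As you'll see in a moment, we never actually had to implement anything like this — but it's the obvious next step if the long-tail problem ever bites.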
What I can show you is a little graph from one of our runs — this is some job I ran last week. We had six workers, down the left column, and each of the little colored regions is one test. As you can see, some of our tests are very, very fast: this Nagios/Solr test took all of 50 seconds. Whereas other tests — whoops, let's see what this one is called — our shared site provisioning test, the thing we test when you come sign up for the Acquia Cloud free tier and then convert to paid, takes two-thirds of an hour.

By the way, this orange region at the beginning: the way our system works is that we have a central server, the master or controller server, where we store all our data about all the servers and customers and SSH keys and Git repositories that exist, and then we have all the servers themselves, on which we install our customer sites. So what you see at the beginning of every worker's run is the setup task — launching the master, launching all the servers, and some amount of testing that goes with that. Every worker has to do those two things, and that's our setup time. Until we can get that a lot smaller — which we are always working on — we might run eight workers, but we're not going to run 20, because it just wouldn't make very much sense.

Different tests take different amounts of time, and we thought we would actually need to really think about taking our slow tests and running them first, so we didn't have that problem of a long tail on one worker. In point of fact we never did any of that, and I think we're just getting lucky — as you can see, they all kind of finish at about the same time, but that's dumb luck. This is just a little Google spreadsheet: the test class we created that runs all our workers in parallel says "I ran this test and it took that many seconds," we grovel through the log file, take out all those numbers, throw them in a spreadsheet, and Google makes a nice chart for us. So we can say, oh look, they're pretty balanced, it's good enough. It's also a really fun graph because we can say, hey, let's speed this one up today — I don't know why that one's so slow — or, this one really confuses me: why should validating Varnish, of all things, take that long? Varnish is really fast. I don't actually know what that test is doing, so it's a prime candidate for optimization.

[Audience question.] The question is, do we test the test machine? I'm not sure what you're asking — we run all these tests from a Jenkins server, and our tests are basically a program that goes and launches a master and launches some servers and installs a site and moves a domain name and whatever, so we are testing all of those servers. What do you mean by testing the test server? Oh, you mean the Jenkins server. Oh, there are problems with our Jenkins server all the time, and we do notice. Jenkins runs out of memory — it's written in Java, and it enters some spin loop, maybe it's garbage collecting, I don't know — and it fails, and so periodically we come in in the morning and our nightly run didn't run because Jenkins blew up. Jenkins is awesome, but it is not perfect.

Incidentally — back to the very beginning, when I drew that diagram of what your infrastructure looks like — if you use Jenkins to automate things, say you have a code repo and you want to push changes to a production environment every time you make a commit, and you have a system for pushing to production, then what you're going to do is write jobs, little shell scripts, and run them under Jenkins. Well, that's software; you have to test those too. That's exactly the sort of thing you need to test: if I make a commit and it triggers the Jenkins job, because it's tied to Git, does the job run, and does it actually push my code to production? A perfect example of the moving parts of your infrastructure that you need to write tests for.
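As a purely hypothetical illustration of what such a deploy-pipeline test could look like — the /VERSION endpoint, the repo layout, and the timings here are all invented; the point is only that the test exercises the real commit hook and the real Jenkins job rather than faking them:

    import subprocess
    import time
    import urllib.request

    def test_deploy_pipeline(repo_dir, site_url, timeout=600):
        # Push a trivial commit and verify the Jenkins-driven deploy actually lands it.
        subprocess.check_call(["git", "commit", "--allow-empty", "-m", "deploy test"], cwd=repo_dir)
        sha = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=repo_dir).decode().strip()
        subprocess.check_call(["git", "push"], cwd=repo_dir)
        deadline = time.time() + timeout
        while time.time() < deadline:
            # Hypothetical endpoint that reports the currently deployed revision.
            deployed = urllib.request.urlopen(site_url + "/VERSION").read().decode().strip()
            if deployed == sha:
                return          # the hook fired, the job ran, the code is live
            time.sleep(10)
        raise AssertionError("commit %s never reached production" % sha)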
Okay, great. I promised a little detail about EC2. We have learned a lot of things about EC2 over the four-and-change years we've been running Acquia Cloud. As I said, we launch these test workers in parallel, so we launch a lot of servers. Our base set of servers per test system is currently, I think, 16 — a certain number of web nodes and database servers and file servers — and we will sometimes launch eight workers in parallel, so we're launching about 120 servers for every test run. And if I, Barry, decide I have a new feature and a patch and I want a test run, I do that, and meanwhile my buddy to the left is working on his user story in the sprint and he'll run a set of tests too. Our test account actually has a limit of 600 servers that we might be running at any one time. So we launch a lot of servers.

EC2 instance launches can and do fail, and they can fail in very entertaining ways. For example, RunInstances, which is EC2's basic API for this — if you use the ec2-run-instances command line tool, that's what it calls — can return you an instance ID which turns out not to be an instance ID; it's just wrong. Or if you try to describe that instance, it'll say the instance does not exist — if you waited two minutes, maybe it would. Everything in Amazon is eventually consistent, and the API which hands you the ID is not necessarily the same as the API that describes the ID; they're different databases. We have found instances that can transition directly from pending to terminated: you launch an instance, it's pending for a while, and then normally it's running, but sometimes it's pending for a while and just goes straight to terminated. So you can't even assume, if you get back an instance and you describe it and it comes back as valid, that it will eventually launch at all. What's really fun is that we've seen servers go from pending to terminated to running — what the heck does that mean? The most outlandish is that we've had instances come back to life six months after they were terminated. Poof — where the heck did this come from? That's really unrelated to testing, but everything you can imagine will happen sometimes. Also, instances fail to get network: they'll launch, and if you describe them it says running, but you can't ever SSH in and you can't ever reach them at all. That might be an Ubuntu issue, I'm not sure yet, but at any rate it happens. So in our launcher we have a lot of error checking and retries, etc.

It also takes a varying amount of time to launch these instances. We did a little analysis over about 10,000 launches several months ago, and what I found is that most of them — more than 95% — get to state "running" and get network access within three to four minutes. But we were waiting up to 15 minutes, and that remaining 2 or 3 or 4% is spread out after the three or four minutes: one will come in at six and one will come in at twelve, and they actually do succeed after twelve minutes. But it turns out that such a small percentage succeed after three or four minutes that what I'd say is: after four minutes, just kill it and start over, because you'll end up being a lot faster that way.
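In practice that recommendation turns the launcher into a retry loop with a hard deadline. A rough sketch, using boto3 purely for illustration (our real launcher predates it and has far more error handling); has_network is a stand-in for whatever SSH or ping check you use to decide an instance is actually reachable:

    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def launch_with_retry(ami, instance_type, has_network, attempts=3, patience=240):
        # Empirically, >95% of instances are running with network inside 3-4 minutes;
        # the stragglers aren't worth waiting 15 minutes for, so give up at `patience`
        # seconds, terminate, and try again from scratch.
        for attempt in range(attempts):
            instance_id = ec2.run_instances(ImageId=ami, InstanceType=instance_type,
                                            MinCount=1, MaxCount=1)["Instances"][0]["InstanceId"]
            deadline = time.time() + patience
            while time.time() < deadline:
                try:
                    state = ec2.describe_instances(InstanceIds=[instance_id]) \
                               ["Reservations"][0]["Instances"][0]["State"]["Name"]
                except Exception:
                    state = "unknown"   # describe can lag behind run_instances ("does not exist")
                if state == "running" and has_network(instance_id):
                    return instance_id
                if state == "terminated":    # yes, pending -> terminated really happens
                    break
                time.sleep(15)
            ec2.terminate_instances(InstanceIds=[instance_id])
        raise RuntimeError("gave up launching %s after %d attempts" % (ami, attempts))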
So: security groups. This whole section is obviously only relevant if you're working in EC2. EC2 has this concept of a security group, which is basically firewall rules: you can say my servers are accessible on port 80 and port 22 but nothing else, and we'll allow ICMP ping traffic, or whatever. When you change a security group rule, or the set of instances in a security group — which happens whenever you launch a new instance into an existing security group — all of the instances already running in that security group, even though their rules haven't changed, enter a transitional state where they lose and regain, and lose and regain again, their network connectivity. So if you've got 50 instances running, instances 0 and 1, which have been up for three months, might suddenly lose the ability to talk to each other for somewhere in the vicinity of seconds to a minute. One of our developers actually wrote a little web app for our demo day where we had two servers just pinging each other, and the web app was green when the pings succeeded and red when they didn't; we launched a third instance, and they went red and then green again — it just came and went.

What's really entertaining is that I might run a set of system tests for my own patch, and then the next developer will do one for hers, and those all run in the same EC2 account. Because of the way we wrote our software four years ago, they all use the same security group. In production we have one account, one security group, and all the servers exist within it; in testing we have what are basically independent installs of our product, but they're using the same security group name, because we were a little lazy. What that means is that my test can be in the middle of running, my neighbor starts 120 servers for his test run, and all my test servers can suddenly start sporadically losing access to each other. So just getting the tests to be reliable even when the code is right can be entertaining. You end up putting in a lot of retry code — I got a connection timed out, I'll just wait a few more seconds and try again.

One thing we do is, after we launch our initial set of servers — one worker launches its set of 16 or however many servers, runs Puppet, does all the setup, waits for DNS to be set — it then logs into each one and runs netcat to poke each of the others on a port it should be able to reach them on, and it just waits for that to work. There is literally a comment in our code that says "wait for the security group to stabilize," and we're just sitting there waiting for everyone to be able to talk to everyone else.
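That stabilization wait is conceptually just a full-mesh reachability check. A sketch, where ssh_run is a stand-in for however you run a remote command as your non-root user and return its exit status, and port is whatever port the servers need to reach each other on:

    import time

    def wait_for_mesh(hosts, ssh_run, port=22, timeout=300):
        # From every host, poke every other host with netcat until the full mesh works.
        # Until this passes, assume the security group is still settling and don't start tests.
        deadline = time.time() + timeout
        while time.time() < deadline:
            ok = True
            for src in hosts:
                for dst in hosts:
                    if src == dst:
                        continue
                    # nc -z just checks that the port is reachable; exit status 0 means success.
                    if ssh_run(src, "nc -z -w 5 %s %d" % (dst, port)) != 0:
                        ok = False
            if ok:
                return
            time.sleep(10)
        raise RuntimeError("security group never stabilized: full mesh unreachable after %ds" % timeout)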
Before we added that wait, we got a lot of launch failures where Puppet on one of our servers would try to talk to the Puppet master and just time out. The flow was: we'd log into the server that was coming up and run Puppet to initiate a certificate signing request — Puppet uses SSL with certs — then we'd log into the master to sign the request, and then we'd log back into the server to run Puppet again, which retrieves the signed cert and then does the Puppet thing, does all the work. And what would happen is: we'd log into the server and try to run Puppet; it would talk to the master and fail with a connection timed out. Then we'd log into the master to try to sign the request — we'd succeed in logging in, and the master would say, I don't have a request from that server, because the connection had timed out. Then we'd log back into the server to try to get the signed request, and it would say, oh, okay, I'm talking to the master and I'm issuing my request now. And these steps are two seconds apart. So within a time frame of two seconds we could reach both of our two servers, but they couldn't talk to each other the first time and they could the second time. It was like madness.

Anyway, the recommendations are: if you haven't written your code yet, create separate security groups for different test jobs, so that at least your test jobs don't step on each other when you run them in the same account with multiple developers working in parallel. As I said, after your servers are all up, just run netcat to touch each one from each one to assure stabilization. And then expect failures anyway. EC2 is a very entertaining environment.

EBS volumes. EBS is one of the weakest parts of Amazon; however, it's also how you save persistent data without having to worry that if your servers all blow up at once, you're dead. We have discovered things like: we can have a volume attached to a server that describes as terminated, but the EBS volume is still attached to it — whatever that means. Sometimes this will come back if you wait a little while; other times it won't, and the EBS volume just remains attached forever. There is a force option on the API call to Amazon, so you can use that. 99.5% of the time that will work, and if it doesn't and you still can't get your volume back, then that test fails and you just log it and go on. What you end up with is tests that are 99% reliable, and yet on a regular basis there's still a sporadic failure that you have to go look at, and you say, oh, okay, I understand what happened here, this was an EBS attach failure. And when that happens, you tighten your code up — oh wow, I'm almost out of time — the first time we found this, we said, we'll try again with a forced attach, and that got us another power of ten in reliability, so now we're at 99.9 instead of 99. You can keep improving your retries and your logic; we've never quite gotten to 100%.

Okay, very quickly: you need to clean up. You need to tear all these things down, and make sure your tests are very reliable at that. There are third-party tools that will help you — Netflix has its Simian Army; they have something that will go kill all your instances, which is good, because your teardown code is also software and it will have bugs in it. And since it's your tests tearing down the servers that they launched, if you want to try to write tests to test that your tests tear down their servers, it gets very confusing. So having a sort of unrelated system that just kills stuff after 24 hours might save you a lot of money. We have an EC2 account that is just for running our tests, and what we know is that at any time we can kill absolutely everything in that account and it doesn't matter. That's very handy: if our tests get out of control and just start spawning snapshots, we can wipe it all, and we know we're not killing anything in production.

Okay, I'm unfortunately almost out of time, but I have a couple of minutes for questions if there are any — go to the microphone, please.

[Audience question] What does this mean for a customer of Acquia? What kind of tests should we be running if we are going to launch Drupal on Acquia?

If you're on Acquia Cloud, you don't have to deal with any of this. You should write your application tests — Drupal tests — and make sure your website works. This is all our problem. All right, one more? Nope. Okay, well, thank you all very much.