So, thank you for the introduction, Francois. My name's Lindsay Holmwood. I'm going to be talking about burning down silos: a more technical introduction to what DevOps is and how we can implement it. There's a lot of content in this talk; you can think of it as a lump of coal that I'm slowly compressing into a diamond. So, without further ado: it's all about DevOps, right? And I'm not really focusing on the airy-fairy business side of DevOps; I'm looking at the hard technical stuff behind it, the way that you actually go about implementing this within an organization. And we're doing this from the perspective of a case study about applying DevOps principles with technology. In this particular case, we're talking about a high-profile fundraising website that runs in the month of November. There's strong siloization: the dev and operations teams are not only completely separate, they're actually in completely different companies, which amplifies a lot of these DevOps problems. And the last thing here is that we have a 100% uptime business requirement for the duration of the campaign. So I'm going to go into three concepts that I'll keep referring back to throughout this talk: consistency, repeatability, and visibility. I'll run through those quickly now. Consistency: what is it all about? It's about ensuring identical behavior within an environment or across multiple environments. In a typical HA website environment, you're going to have multiple stages where you're deploying the application, and it's going to be promoted through those before it can get up into production.
So it's vitally important that there is consistency across each of those environments, so that you know that if something works in one environment, it's going to work exactly the same in the others. Technology-wise, we're looking at configuration management here, and a whole bunch of testing, whether that's manual or automated. Puppet is my configuration management system of choice. Actually, interesting idea: who here has used Puppet before? Put your hands up. Okay, fantastic, that's about 50% of the room. So I'll do a really brief introduction to it and then move on to the low-level technical stuff, because it becomes a bit more difficult when you're managing large Puppet installations. Puppet is a language for describing how you want your machines to be configured, a library for applying that configuration to a machine, and a client/server system for distributing that configuration around. So this is an example of a Puppet manifest, the language that you use to configure your systems. We're saying here that we want the apache2 package to be installed on the machine, and we want the apache2 service to be running. Generally you'd wrap that in a thing called a class, which is just a way of grouping similar bits of configuration together; you can think of it as a building block, like a Lego block, I suppose. And the traditional Puppet workflow that pretty much everyone uses is: write your manifests, apply those manifests to a machine, and then debug them, because you probably never get anything right the first time. Good example of that: if we go back to the class we were looking at a second ago, Puppet is declarative.
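The slide itself isn't reproduced in this transcript, but a minimal sketch of the kind of manifest being described (resource names chosen for illustration) looks like this:

```puppet
# Install Apache and keep its service running.
class apache {
  package { 'apache2':
    ensure => installed,
  }

  service { 'apache2':
    ensure => running,
    enable => true,
  }
}
```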
So you can't guarantee that when you write something in a manifest, it'll be executed in the order that you've written it. For example, there are cases where that apache2 service will try to start before the apache2 package is installed on the machine. So what you do is set up an explicit requirement of the service onto the package. That's Puppet relationships 101, and it starts getting very, very complex when you have lots and lots of manifests in your environment. In this particular case study, we had about 137 manifests describing all of the different things in our environment, so there are a lot of complex relationships there. The other thing that can muck things up sometimes, in a commercial environment at least, is the dependence on proprietary software. You might be getting a database from a vendor, or maybe some sort of application service from a vendor, and quite often proprietary software is not conducive to being automated. It means you spend lots and lots of time debugging, so you get really familiar with that testing lifecycle I was talking about a second ago. A good way we found to get around this was VMware snapshots. You can do snapshotting with whatever virtualization stack you're using: you basically take a snapshot of the system before you make any changes, apply those changes, and if it didn't work, okay, revert. Things start getting a bit more complicated when you have these multiple deploy environments, because you've got to maintain consistency between all of them, right? One problem you can have is configuration drift: if you don't group the configuration and parameterize it well enough, you end up with a lot of duplication across your different environments, because you've got your stage environment, your production environment, your UAT environment.
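Under the same assumptions as the earlier sketch, the explicit requirement just described is a `require` on the service pointing at the package resource:

```puppet
service { 'apache2':
  ensure  => running,
  # Don't try to start the service until the package is installed.
  require => Package['apache2'],
}
```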
You might want to make a change in stage and then make sure that it applies cleanly to all the other environments as well. So one little abstraction we came up with for this was the concept of a role, built using Puppet defines, which are sort of like parameterized classes. So this is our example role: we've just got an app server role, and we're saying that it's got a base server configuration, Apache 2, MySQL, and so on, and we've got a bunch of different options that we pass into it as well. That's really important because up in the node configuration files, it's very obvious what data is being passed around to Puppet. One of the common patterns in Puppet is that you just have a whole bunch of classes that you apply to a node, and then variables floating around in the global namespace, and that introduces lots and lots of problems. Using roles like this, it's very explicit what you're passing through to all these different components, and you reduce a lot of duplication as well, because you're using the same app server role across your staging and production environments. The next thing that's really cool is regex matches on the node names. With the regexes, you don't have to write a new node block for every machine that you're adding into the cluster, because Puppet already knows the configuration for it. Using that, we were able to get down to closer to 30-minute builds for any of our application servers when we needed to add them to the cluster. So that rounds out consistency. The next thing is repeatability, and the question you're probably asking yourself now is: how is repeatability any different to consistency?
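To make the role idea concrete: a sketch of what such a define and a regex node block might look like. All names here are hypothetical, not the actual manifests from the case study.

```puppet
# A role is just a parameterized define that bundles up the
# classes and data an app server needs, in one explicit place.
define role::app_server($vhost, $db_host = 'localhost') {
  include base_server
  include apache2
  include mysql::client

  apache2::vhost { $vhost:
    docroot => "/var/www/${vhost}",
  }
}

# Any host matching app1, app2, app3... picks up the role
# automatically; no new node block needed per machine.
node /^app\d+\.example\.com$/ {
  role::app_server { 'charity':
    vhost   => 'charity.example.com',
    db_host => 'db.example.com',
  }
}
```

Because the role's parameters are the only way data flows in, there are no globals floating around, and adding a server to the cluster is just a matter of bringing up a machine whose hostname matches the regex.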
Well, repeatability is really a function of consistency. It's all about automating to remove human error, and increasing speed by shortening feedback loops. A really good example of that is automated application deployments, and configuration management goes hand in hand with that. For application deployments, we're using Capistrano. Capistrano is a Ruby DSL around SSH in a for loop; the sysadmins in here are probably groaning inside a tiny bit right now. It's simple, it's powerful, and it will blow your legs off if you don't use it properly. That's actually a big problem with the simplicity of it: it provides a whole bunch of base constructs, and you can chain them together to do arbitrary things, so it's very, very easy to put in too much automation and try to do too much with your Capistrano tasks. It's not a substitute in any way for configuration management. You should be using Capistrano to automate the repetitive tasks, but basically as a trigger for something else that does the task. The particular application we were dealing with was PHP, so we're using a thing called railsless-deploy, which is a plugin for Capistrano. What it does is remove a whole bunch of the Rails-isms that come with Capistrano, which was designed around working with Ruby on Rails; that makes it absolutely fantastic for use with PHP applications. The next thing we used was Capistrano multistage, and this is the really, really cool part of Cap. If you look here at the stages, the first four lines are pretty standard Cap configuration, but the last part is the really interesting bit you get with Cap multistage. What we're doing here is saying we've got three different stages that we're going to be deploying to: UAT, staging, and production, and there's a configuration file for each of those.
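A Capistrano 2 multistage setup along the lines being described might look like this; server names, repository URL, and paths are made up for illustration:

```ruby
# config/deploy.rb -- the "pretty standard" part
set :application, "charity"
set :repository,  "https://svn.example.com/charity/trunk"
set :deploy_to,   "/var/www/charity"
set :user,        "deploy"

# The multistage part: one configuration file per stage.
set :stages,        %w(uat staging production)
set :default_stage, "uat"
require 'capistrano/ext/multistage'

# config/deploy/staging.rb -- two app servers, one static server
role :app, "app1.staging.example.com", "app2.staging.example.com"
role :web, "static1.staging.example.com"

# config/deploy/production.rb -- an extra app server in production
role :app, "app1.example.com", "app2.example.com", "app3.example.com"
role :web, "static1.example.com"
```

Each stage file only describes the servers for that stage, so a deploy can never accidentally touch the wrong environment.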
So it means that you can apply customizations to each of those stages. This one here is for staging: we're saying that we've got two app servers and one static server. And for production, we've obviously got different names of the servers that we're deploying to, and there's an extra application server. Then when it comes to actually deploying to each of those environments, what you do is run cap, then the environment that you're deploying to, then deploy. So in this particular case, you'd be doing cap staging deploy, and that will pull in all the information for the staging environment and only talk to the staging environment when doing that deploy. Then you test it and go through normal QA and all that sort of thing, and then you've got cap production deploy, which does exactly the same thing but for the production environment. Capistrano behind the scenes also requires a tiny bit of bootstrapping for it to just work. You can't just go, okay, I'm going to Capify my application and then deploy it: Cap expects certain directories to be in certain places, certain users to exist, that sort of thing. So this is the configuration management side of things, where we're basically automating as much of that as possible, so that we can fire up an app server and just have it all work. Really simple, I'll just run through this quickly: we've got a deploy user and a deploy group, which is what the deployments run as, so Cap is just SSH-ing into the machine as the deploy user. We've also got a bunch of SSH keys here, so all the deployments are passwordless. And on top of that, we've got a capistrano_site define, and these are the directories that I was talking about a second ago.
So we're making sure that a bunch of different directories that depend on one another exist: a place for logs, a place for configuration files, that sort of thing. This is the interesting part, where we can basically say capistrano_site charity.com and apply it to all the app servers, and the awesome thing about this is that you can have multiple capistrano_sites on a single machine. So that provides all the infrastructure and does all the legwork for you behind the scenes, and that makes deploying to a new application server as easy as taking the existing configuration and adding a new line. It's all done, it's all there, it's all ready for the developers to go and do whatever they need to do. Or you can even refactor it into something a tiny bit simpler, so you can just increment a number, basically, and it'll magically work for you, without all those extra lines of configuration. Okay, next thing on the automation side: a git-svn mirror. Why git-svn? Well, to give you some background here: the application being deployed is 182 megs in size, because there's a whole bunch of static assets in there as well. So we've got 182 megs of application being deployed to 20 application servers, and the data center is in Sydney while the SVN server is in Melbourne. That's a lot of traffic that's got to move every single time you want to do a deployment. Capistrano has a little thing built in called remote cache. What that does is keep an SVN checkout in some corner of the disk on each of the application servers you're deploying to, and when you actually do a deploy, it just does an svn up in that checkout and copies the working copy across; it's using copies, not doing a fresh checkout or anything like that. The problem is that it doesn't work particularly well with SVN tags.
And SVN tags are a fundamental way to version releases, right? There are a bunch of corner cases that basically make it unusable. A really simple way around that is just using git-svn and cron: using git-svn to mirror the repository, and launching the mirroring out of cron every few minutes. The cool thing about that is it gives us fast clones: whenever you're interacting with the repository to do a deploy now, it's just a git clone, which is very, very fast compared to SVN. It meant that we had commit access, so if we ever needed to go in and change something quickly or update a configuration file to do a deployment, we could do that. And it's just 21st-century technology. So that rounds out repeatability. Next thing: visibility. Visibility is really all about keeping one eye on the past and one eye on the future, and technology-wise, I don't think I need to say anything more. Next thing is code changes: we want to be able to see changes that are coming down the pipeline, so we want to know what's happening at the application level and at the configuration management level. Monitoring is vitally important for visibility, and reporting as well. On the metric collection side, my metric collection tool of choice is CollectD. CollectD is a lightweight statistics collection daemon with an emphasis on collection, so you can think of it more as a platform for collecting time series data. It's plugin-based: the default installation of CollectD has almost 100 plugins now that do all sorts of things, like Apache, MySQL, and low-level system statistics such as memory usage. It's network-aware, so any of the data you collect on a machine can be piped someplace else on the network.
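The git-svn mirror mentioned a moment ago can be sketched in two parts; the repository URL, paths, and schedule here are hypothetical:

```shell
# One-off: build a git mirror of the SVN repository.
git svn clone https://svn.example.com/charity /srv/mirror/charity

# /etc/cron.d/git-svn-mirror -- pull new SVN revisions every 5 minutes.
*/5 * * * * deploy cd /srv/mirror/charity && git svn fetch --quiet
```

Deploys then clone from the local mirror rather than going back to the remote SVN server every time.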
So it's really great if you just want lightweight collectors on all your front-end machines that aren't writing a lot of stuff to disk, but are forwarding it to a data collection server behind the scenes. And it's got fantastically well-defined APIs: a very well-defined API for writing plugins, with language bindings for Perl, Python, and others. The network API is also very well defined, so it's easy to write network code that talks to it, and it's all done over UDP. One quick example of that is curl_json. curl_json is a plugin that you get with CollectD out of the box on most modern distributions. This is an example configuration: what it's doing is polling a particular URL at a set interval, generally about 10 or 20 seconds depending on how you configure it, getting a whole bunch of JSON back, and extracting different bits from different keys. This makes it really easy to instrument a whole bunch of statistics within your applications. For example, in your application you just expose a URL like /metrics, with a bunch of sub-URLs underneath it if you want, and then CollectD can just talk to that, rip out those bits of data, and store them internally like any other CollectD statistic. On code changes: application and config management changes are really what we're talking about here. We want to be able to see changes to the code that are coming down the pipeline and going to be deployed, and the configuration management changes that are progressing through the different environments as well. I can't recommend GitHub strongly enough; if you want something open source, there are plenty of other options out there too. The best thing about this, though, is the news feed.
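Tying together the two CollectD features covered above, here is a hedged sketch of a curl_json block polling a hypothetical /metrics URL, plus the network plugin forwarding everything to a central collector; the endpoint, key name, and collector host are invented:

```apacheconf
LoadPlugin curl_json
LoadPlugin network

# Poll an application-exposed JSON endpoint and record values
# from it as ordinary CollectD statistics.
<Plugin curl_json>
  <URL "http://localhost/metrics">
    Instance "app"
    <Key "donations/total">
      Type "gauge"
    </Key>
  </URL>
</Plugin>

# Forward everything collected on this host to a central
# collection server over UDP.
<Plugin network>
  Server "collector.example.com" "25826"
</Plugin>
```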
It's really vitally important if you've got lots of disparate teams: if you can get a news feed for an organization, you can basically see all the changes that are happening across all the different repositories. It's a really great way to keep an eye on what other people are doing, pick up errors, and do code review, that sort of thing. On monitoring, this is more of just an interesting quirk that we discovered. We found that with MMM, the MySQL replication manager, the cluster would actually block under high IO if you ran mmm_control show. The interesting thing about that is that if your monitoring system is doing an mmm_control show behind the scenes, all of a sudden your monitoring system will report: holy crap, my MySQL cluster has disappeared entirely. But if you just connect to the socket and do a show and a quit, it works perfectly, because everything actually is working perfectly behind the scenes. So obviously the solution to that is just to open up a socket. Okay, last thing: reporting. Reporting is really important from a managerial perspective: being able to know what changes are happening over time and how they're affecting the overall performance of the site. One really simple example of that is mk-query-digest and logrotate. mk-query-digest, from the Maatkit package, is basically a MySQL slow query log analysis tool. We had this run every day out of logrotate, and what it would do is give us a ranked list of the slowest queries on the database. It would send that to the developers, so they knew which queries to focus on tuning, but it also sent it to us, the operations team, because we were able to see how we could tune MySQL behind the scenes to make it run faster for those slow queries. So, okay, those are the fundamental tenets. I'm just going to do a couple of really brief retrospectives about different times this sort of stuff came into play during the campaign.
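That daily slow-query report might be wired up with a logrotate rule along these lines; this is a sketch, with file paths and mail addresses invented:

```apacheconf
# /etc/logrotate.d/mysql-slow
/var/log/mysql/mysql-slow.log {
    daily
    rotate 14
    missingok
    # Summarize the day's slow queries and mail the ranked list
    # to both the developers and the operations team.
    prerotate
        mk-query-digest /var/log/mysql/mysql-slow.log \
            | mail -s "Daily slow query digest" dev@example.com ops@example.com
    endscript
}
```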
So one of the events that happened was a slave explosion near the end of the campaign, when we were taking the most donations. What does that actually mean? Well, we had MySQL replication all being managed by MMM, with two masters and four slaves, and both the masters and the slaves had floating IPs. We got a replication fail on one of the slaves. That's not too uncommon, it happens occasionally with MySQL, not a big problem. So we were down to three slaves in the cluster, which increased the cluster load, as you would expect: if you've got a machine taken out of the cluster, the three machines remaining are going to have to do a lot of work to play catch-up. Then we noticed a replication delay on another slave, but only on one of the slaves, not the other two. Quite interesting, quite curious. That took us down to two nodes, because MMM will take the replication-delayed node out of the cluster. So we did a bit of an inspection on the delayed slave, because we wanted to know what was different about it that was causing this one to fall over when the others weren't. And we found that it was swapping like mad, and it only had half the memory allocated to it. That was pretty simple to fix: you shut it down, you do an upgrade, you do a reboot, everything's fine. It rejoins the cluster, no problems at all. And we were able to do that really, really easily; it took less than probably 20 minutes to work that out, because we had fantastic visibility over what was happening on that machine. We were using CollectD here, we had metrics, so we knew that the machine was swapping like mad, and we were able to use that data and just go: okay, we're having a swapping problem here, let's work out what's memory-related. Oh yeah, we've only got half the memory, maybe we should fix that.
And that's really a problem with consistency as well: we had fantastic consistency at the software level, but not necessarily at the virtual machine provisioning level. So it's an interesting anecdote. Okay, database connectivity. This is another incident we had at the very beginning of the campaign, when we were doing a soft launch. We noticed a whole bunch of PHP connection errors happening randomly, and we couldn't work out why. We found that the configuration file itself was parsing and loading, and everything seemed to be okay, but the application server still couldn't talk to the database and was timing out. So what we did was ask the developers to add a configuration dump URL to the application, so that we could see how the configuration on disk was being interpreted by the application and the application framework behind it. That's really a way of increasing visibility: it doesn't really matter what's on the disk, what actually matters is how the configuration on disk gets translated into something that's running in production. We were using something sort of similar to curl_json for that; it was a JSON dump, and we might have been able to plug it into CollectD if we'd wanted as well. So after that we did a redeploy, we waited a bit, we looked, and then we discovered that there was a typo. The interesting thing about that: there were two reviewers of the configuration management change. So, all about that visibility: we knew all the stuff that was coming through, the person making the change knew what they were doing, and we also had a reviewer looking at all of it. But both of those people were in the operations team, and that's really a visibility problem.
If we had better visibility of all the code changes, if we were using something like the news feed feature in GitHub, then perhaps we would have had more reviewer diversity. Developers would be looking at the configuration changes that we were making in the ops team in relation to their application, and they'd be able to go: oh, hey guys, there's a typo here, maybe we should fix that before we go live. The last thing is data consistency. In this particular case, we were doing a new release of the application, and there were a bunch of database migrations, because a new report was being added to the application. We had the standard release promotion cycle, where we take a release, deploy it to UAT, and if it passes there, it goes to stage and then to production. In this particular case: passed UAT, no problems at all. Then we did another deploy, went to stage, no problems there either. Went to production: holy crap, it exploded in our faces. That's basically the worst-case scenario, where all your QA testing doesn't catch a bug and it goes into production. So what's the thing that's different in all these environments? Well, the configuration is the same, because we've ensured consistency. We know that everything is repeatable, because it's working the same across all those environments, no errors. The thing that was actually different behind the scenes was the data, because the data in the production database wasn't being synced back to what was in stage and UAT. In particular, the first time the report was run, it was doing a create table foo, which should have been a create table if not exists foo. And the interesting thing there is that we found the production database was initialized from a slightly different source than the stage and UAT environments. Again, a problem with repeatability, right? And consistency.
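The difference in that migration is a single clause; the idempotent form is safe to run against a database that already has the table. The column list here is invented for illustration:

```sql
-- Fails on a re-run if the table already exists:
CREATE TABLE foo (id INT PRIMARY KEY);

-- Safe to run repeatedly:
CREATE TABLE IF NOT EXISTS foo (id INT PRIMARY KEY);
```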
So the easy way to get around that was to take the production database and sync it back to stage, and then take stage and sync it back to UAT. Easy fix. And that's really all about repeatability. So hopefully I've given you an overview of how you can implement DevOps at a technical level within your organization, and you can take a lot of these ideas and extrapolate them. Just to go over them one more time. Consistency: ensuring identical behavior within an environment or across multiple environments. Repeatability: a function of consistency, all about automating to remove human error and shorten feedback loops. And visibility: one eye on the past, one eye on the future. But there's one key point that I missed out: communication. If we didn't have any communication between all the different teams, if we weren't actively working on our communication, none of this stuff would have worked. Thank you very much.

We have time for one or two questions. Does anybody have a question?

Do you have your configuration published someplace? Released publicly?

Yeah, why not? That's a nice idea. Some of it has been released indirectly, but most of it hasn't: that's the secret sauce, especially the proprietary component stuff I was talking about before. There's a lot of R&D that went into making that work correctly.

A couple of quick comments. Firstly, if you search for DevOps on the internet, there are some groups that are basically trying to put together the sort of repositories you're talking about: this is how you would do a best-practice install of all of these things, that you could then pick up on. And secondly, for people who are interested in DevOps, we've got a couple of talks in this miniconf, including one at the same time tomorrow, by DevDes, and you might want to come along to those as well.
Yeah, the interesting thing about that is you're talking about best practice there, and I'm a firm believer that best practice is the enemy of innovation, right? You're basically saying that with a best practice, we can't possibly ever get any better than that, and that's absolutely impossible: you're always going to be changing, always going to be improving. What I've talked about here are fundamental principles that we've applied to our environment, looking at the ways that work and the ways that don't.

Last question.

More of a comment: it's never best practice, it's always best current practice.

Sure. Good-enough practices. I mean, it can always be improved; it's the best we've got right now.

Thank you very much, Lindsay. Thank you.