Hello everybody! I know there's not a lot of people in here, but hello everybody! Hello! There we go. Did you put that over here? Did I do that? You did that? I should be a little more careful with this. So, I'm like a pest. This is Ryan, this is Rudy, and Narayan is someplace not here at the moment, but he'll normally come in a little later. For the recording, because I think everybody can hear me, including all the people in there, because I'm somewhat loud: you're here to learn about Drupal infrastructure, some of the pain points we've had, and some of the take-home lessons from that. Oh, this is Narayan. I will give you my seat, sir. Narayan is a very interesting presenter. He is so talented he can present with his back to the audience.

So, we should probably start with a little background information. First of all, I've got to give credit for all these slide images. Rudy did a great job of coming up with some crazy-looking images. So, a little background on Drupal.org. Everybody know what Drupal.org is? Anybody not familiar with this site? One person. Are you at a WordPress convention? So, Drupal.org is not as small as you think it is. Drupal.org is a very large property. I realize this image is really hard to read at the projector resolution we have in here, but who can name a component of Drupal.org? Go ahead, just yell it. There is no memcache. There is no memcache, okay? So, no memcache. The association website. There's obviously Drupal.org. What else? api.drupal.org. Drupal CI. Git. Release packaging. Localize. Camps. Security. Good answer. Events. Issue queues. Updates. DrupalCode, which is the frontend Git instance. All the dev sites we have around Drupal.org. I'm going to look at the little easier-to-read sign. We have Solr running on Drupal.org. Who knows what Solr is? So, Solr is a really awesome way to search content on a Drupal site. Really easy to use. Yeah. Is it better than Elasticsearch?
We are not starting flame wars today. We do not have Composer on this slide, but we now have Composer endpoints available. So, Drupal.org is not your typical Drupal site. Well, Drupal.org, the one site, www.drupal.org, is one code base, but what makes up the Drupal.org infrastructure is not one code base. It's multiple sites that all interact together in multiple ways. Rudy, Ryan, anything to add, Narayan? Okay. I don't know why it's...

So, the other thing to know about the Drupal.org code base is that it is, for the most part, really old. Like, this site has been around with the current database for a very long time. Drumm, you want to put an age on it? It's older than Drupal. That's a good point. It is older than Drupal. So, what's the longest-running Drupal site you have out there? The site you've got that you've been running for the longest amount of time. Anybody top five years before just blowing it away and starting over? Six years. Currently maintained sites. Drupal 5 does not count. A Drupal 7 or 8 site that you've upgraded from, let's say, Drupal 4. Okay. It gets a lot of technical debt. Is it anywhere near as complex as www.drupal.org? So, one of the things that happens when you keep building off of yourself as you keep upgrading Drupal is you incur a lot of technical debt.

And we have to be... Yeah, these images need to be a little larger, but we have to be a little careful because Drupal.org has some very tight interlinking between content. If you just go to the front page of Drupal.org, you see some basic news information and some download links, but there's a lot of related content on Drupal.org, and a lot of that content gets used by third-party tools and put in different formats. So, all the project releases get turned into XML for update status. It's a really complex project.
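The update-status flow described above can be sketched in miniature: a Drupal site pulls a release-history XML feed and compares its installed version against the published releases. The feed here is a simplified stand-in for illustration, not the exact schema Drupal.org serves.

```python
# Sketch of an update check: parse a release-history-style feed and
# compare against the locally installed version.
import xml.etree.ElementTree as ET

# Simplified, illustrative feed; the real one lives on updates.drupal.org.
SAMPLE_FEED = """<project>
  <short_name>views</short_name>
  <releases>
    <release><version>7.x-3.14</version><status>published</status></release>
    <release><version>7.x-3.13</version><status>published</status></release>
  </releases>
</project>"""

def latest_release(feed_xml):
    """Return the first published release version in the feed."""
    root = ET.fromstring(feed_xml)
    for release in root.iter("release"):
        if release.findtext("status") == "published":
            return release.findtext("version")
    return None

def needs_update(installed, feed_xml):
    """A site compares its installed version against the feed."""
    return installed != latest_release(feed_xml)

print(latest_release(SAMPLE_FEED))            # 7.x-3.14
print(needs_update("7.x-3.13", SAMPLE_FEED))  # True
```

Multiply a check like this by every module on every Drupal site phoning home, and the bandwidth numbers discussed later in the session start to make sense.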
Here is as complete a list as I could come up with of all the components we're running. Now, there may be some people in the room, I'm looking at Narayan, who may be able to come up with something that I forgot on here, but we've got on the left-hand side all of the websites that we run. So, these are all of the websites that exist on the Drupal.org infrastructure and, of course, we've got packages on there. Okay, good. I added a few things. Thank you. So, we run a lot of websites on our infrastructure. And then on the other side is all the services we run with the data from those websites. I'm not going to read these to you because people can read.

Oh, I'd just like to add that there's something missing from this slide, and that's some of the things that fell out of those 13 years of technical debt. Like, we used to have MongoDB, and there used to be a Cassandra instance, and there were multitudes of things where we asked: why is this here? How did this end up being part of our stuff? We could probably do another slide twice this size with all the things we've removed from the infrastructure, but we're looking forward, not backward. Yes? We've seen the largest Jenkins instance you've ever seen. We missed Jenkins on here. We have a whole slide dedicated to Jenkins. Yes, there is a large Jenkins instance at play.

So, let's take a look at some of the numbers. What's the date range on this? April 2016. So, four months of, oh, just one month? Yeah. So, for one month, we've served 1.7 million requests. Billion. Billion requests, sorry. 1.7 billion requests, and is it 24 terabytes of bandwidth? Anybody close to that? So, Drupal.org is really big and has a lot of traffic. That's kind of the point of all this. Of course, all of that traffic went over all of these different services that we run. Okay, this clicker really does not work well. You want to talk? Yeah. Oh.
Isn't there supposed to be an image on this? There it is. So, click it. There we go. So, we send around 100,000 emails a day. Now, that doesn't include the promotional content you get from the Drupal Association. These are emails that are sent mostly programmatically by Drupal.org. If you're working on an issue and you get an email update, you get one of these emails. If you're on a listserv run by Drupal.org, you get one of these emails. These are the emails that we send through our mail relays every day. 100,000 emails a day. And surprisingly, none of them are spam. Well, depends on your personal opinion. And if you found the magic button that lets you subscribe to everything, could you not? There is a magic button that does let you subscribe to every issue. Don't do that. It's a lot of emails.

We have 1.4 million nodes on Drupal.org. This is just Drupal.org. It doesn't include any of the other sites. We have 1.9 million users. 600, I'm sorry, 6 million comments. I feel like I'm in kindergarten. You've got to, like, read the number in front of you. 6 million comments. 39,000 Git users and almost 700,000 commits in Git. And of course, like... Just admire the animation here. I'm just going to keep watching this for a little bit because it's very entertaining.

So of course, like every large complex site, we've had some failures. And hopefully that's why some of you are here today: to learn about what we've done, what we've changed, and why we've changed it. So one of the things that we are constantly fighting is spam. How many people fight spam? I guarantee you, we fight more spam than you do. Ryan, do you want to talk about spam? Yeah, I can talk about spam. One of the ways we've been fighting spam is Mollom, of course, and the Honeypot module. Mollom is a content analysis tool for spam.
It takes what people have submitted and runs some heuristics on it to figure out whether or not what people are submitting happens to be spam. It's somewhat effective, but a lot of things still get through. And so the next thing we had was Honeypot, which stops people from behaving like computers. People can still program their spamming tools to look more like people, but we don't let people slam the site really rapidly. But spammers are still getting through. And so we implemented another feature on Drupal.org recently: we've been working with a vendor called Distil that does a browser-hashing sort of algorithm to figure out whether or not people behind proxies are the same individuals. We turned it on initially with some rules to block people automatically, and it was a little overzealous. There were a few false positives, and we didn't want to lock too many people out from registering. So now it's just collecting data, and we're going to use it for manual intervention. What we'll be able to do is determine that a spammer who's running through 50 different proxies is actually all the same spammer, block that fingerprint, and we won't see them come back until they figure out what they've done wrong. And of course it's an arms race with spam. But that's one of the things we've added. So browser fingerprinting is a really cool mechanism that basically looks at how your browser's configured and assigns it a unique ID. And it's pretty good at determining that you are your browser. It's been really great at helping prevent spam.

The next big change we've made is Fastly. Fastly has been really, really helpful at serving our traffic. In 2014 we were on a previous CDN, and we were actually hitting the 30-megabit VLAN cap we have, which of course dropped packets, and the CDN really didn't have very good monitoring for itself.
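The browser-fingerprinting idea described above can be sketched like this: hash the attributes a browser exposes, not the IP address, so the same client behind 50 proxies collapses to one identifier. The attribute names here are illustrative, not the actual signals a vendor like Distil collects.

```python
# Sketch of browser fingerprinting: derive a stable ID from browser-level
# attributes while deliberately ignoring the (easily rotated) IP address.
import hashlib

def fingerprint(request_attrs):
    """Hash browser attributes into a short, stable identifier."""
    keys = ("user_agent", "accept_language", "screen", "timezone", "plugins")
    material = "|".join(str(request_attrs.get(k, "")) for k in keys)
    return hashlib.sha256(material.encode()).hexdigest()[:16]

bot = {"user_agent": "Mozilla/5.0 ...", "accept_language": "en-US",
       "screen": "1024x768", "timezone": "UTC", "plugins": ""}

# Same browser behind two different proxy IPs -> same fingerprint.
a = fingerprint({**bot, "ip": "203.0.113.5"})
b = fingerprint({**bot, "ip": "198.51.100.7"})
print(a == b)  # True
```

Once the proxies all resolve to one fingerprint, a single manual block covers the whole botnet of exit points, which is exactly the workflow described above.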
So it was hard to tell what was going on at the CDN level, or where in the stack the errors users were seeing were coming from. And for those of you who aren't familiar with a CDN: a CDN sits in front of your site. Users' traffic goes to the CDN. The CDN sees if that data's in a cache. If it is, it just serves it to your user. If it's not, it makes a request to what's called an origin server, or in our case, our servers. And then it takes that data and stores it in a cache for the next person to come. Now it can't cache, you know, logged-in traffic, because those pages are dynamically generated based on who you're logged in as. But the beauty of it is that most of Drupal.org traffic is anonymous. So if 50 anonymous users hit your site very quickly, you might have one request to our server for those 50 users, and that keeps amplifying. So if you've got 5,000 users, you might have one request. Give or take a few. So yeah.

I just want to say, on that list of things, the reason it says updates here is because a huge chunk of our traffic, and really what was causing us to hit the VLAN cap, was updates.drupal.org. So more Drupal sites means more calls back home to check for updates for modules, things like that. It's written in a not-so-optimized XML format, and the processing of updates on the Drupal sites themselves is bandwidth-hungry relative to the amount of information it's actually calling back for. So: do I have an update for this module, this module, this module? It pulls a lot of XML down to process that. So updates was the biggest chunk of bandwidth, and it continued to grow over time as the project grew. And it's not just update status. It's also Drush pulling the same files that update status uses to figure out what files to download. So that XML that we're talking about is created from the Git information that we hold and is then propagated to a bunch of different sources that use it.
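The cache amplification described above — 50 anonymous requests, one origin hit — is easy to model. This is a toy sketch, not how Fastly is implemented, but the hit/miss accounting is the same idea.

```python
# Tiny model of CDN behavior: anonymous requests are served from cache
# after the first miss; logged-in traffic always goes to the origin.
class MiniCDN:
    def __init__(self, origin):
        self.origin = origin          # callable: path -> page body
        self.cache = {}
        self.origin_hits = 0

    def get(self, path, logged_in=False):
        if logged_in:                 # dynamic pages can't be cached
            self.origin_hits += 1
            return self.origin(path)
        if path not in self.cache:    # cache miss: one trip to the origin
            self.origin_hits += 1
            self.cache[path] = self.origin(path)
        return self.cache[path]

cdn = MiniCDN(origin=lambda path: f"<html>{path}</html>")
for _ in range(50):
    cdn.get("/")                      # 50 anonymous users...
print(cdn.origin_hits)                # 1 ...one origin request
```

The ratio only improves as anonymous traffic grows, which is why a mostly anonymous site like Drupal.org benefits so much from fronting everything with a CDN.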
So there are large chunks of the Drupal ecosystem that are dependent on that XML being generated in that format. security.drupal.org uses it to figure out if there's a supported version of a module to get security coverage. Yeah, so we had a CDN previously, and it was helping with just the bandwidth load, but, you know, every five minutes there's a cron job running or something hitting us. We'd see these spikes, we'd run out of room to send packets down the pipe and drop them, and sites would slow down, or you'd get an unable-to-connect sort of error.

So one of the reasons we chose Fastly was because it's basically distributed Varnish. Who's familiar with Varnish? So, yeah. If you're not familiar with Varnish, we're going to pause 30 seconds for you to Google it. It's worth Googling. So, okay, you are a presenter. Narayan indicated it is a thin veneer over the problem. You should be a little careful when Googling varnish because you're not going to find computer software, you're going to find wood varnish, so you probably need to add varnish cache or varnish software. Sorry. It's okay.

Well, since most are familiar with Varnish already: Fastly is not just a distributed Varnish, but a well-architected distributed Varnish, with things like built-in support for origin shielding, which is a way to have all of your points of presence for the CDN distributed around the world call back to a predetermined origin shield before calling back to our origin. Our servers are located in Corvallis, Oregon, at Oregon State University, at the Open Source Lab there. The closest point of presence, which is actually peered with our upstream ISP, was in Seattle. So our origin shield in Seattle handles all the requests that are coming back for the *.drupal.org sites that are fronted with Fastly, including updates.
And so, turning that on, and also turning on soft purges, resulted in an update request coming from Hong Kong going to Fastly's point of presence in Hong Kong, which sees if it has it or not. If it doesn't have it, it pulls back to Seattle. Seattle probably has it, because someone else has probably looked for a Drupal 8 update or whatever they're looking for, right? So that request never actually hits our origin in Corvallis. If Seattle doesn't have it either, Fastly is smart enough to pull it from our origin, and pause and hold any other requests for that update until it reaches the Seattle origin shield. So instead of sending 30 requests back to the origin because 30 requests came in, it does one request back to our origin, up to the origin shield, and then back out to all the rest of the caches.

What's a soft purge? So, when we're packaging releases and updates data, when a new release gets packaged, there's new update data there. We call out to Fastly's API and tell it to soft purge. A soft purge lets it fetch, in the background, the new XML that was generated when the release was packaged, without stampeding the origin or serving up an uncached thing. So it basically lets it seed the origin shield in the background to avoid a big rush of things back to the origin. And there are other products out there that do this. Fastly is not the only provider in this space, but Drupal.org uses Fastly, so that is the one we are talking about. Yeah, and Fastly actually has an open source alliance program, so they give us a nice benefit for being a non-profit open source project. We get a big discount on using it, because CDNs aren't cheap.

So, let's talk about development environments. Drupal.org, just Drupal.org itself, not all of *.drupal.org, but Drupal.org itself, is large. The code base is large, the files are large, the database is large.
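The two-tier origin-shield flow described above can be sketched as edge POPs that each have their own cache but funnel every miss through one shield before touching the origin. Again a toy model, with made-up paths, to show why thirty edge misses cost one origin fetch.

```python
# Two-tier sketch of origin shielding: many edge caches, one shield cache,
# one origin. All edge misses go through the shield.
class Tier:
    def __init__(self, upstream):
        self.upstream = upstream      # callable: path -> body
        self.cache = {}
        self.misses = 0

    def get(self, path):
        if path not in self.cache:
            self.misses += 1
            self.cache[path] = self.upstream(path)
        return self.cache[path]

origin_hits = 0
def origin(path):
    global origin_hits
    origin_hits += 1                  # each call is a trip to Corvallis
    return f"release XML for {path}"

shield = Tier(origin)                          # the Seattle shield
edges = [Tier(shield.get) for _ in range(30)]  # 30 points of presence

for edge in edges:
    edge.get("/release-history/drupal/8.x")

print(origin_hits)  # 1: thirty edge misses, one trip to the origin
```

A soft purge, in these terms, would refresh the shield's cache entry in the background instead of deleting it, so stale-but-servable content keeps flowing while the one refetch happens.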
Neil, how large is the database for Drupal.org? Guess. Oh, you have that? Yeah. How large is the database? 223 gigs. All the databases together are 223 gigs. Okay. Drupal.org itself? It's 200 gigs. So, developing locally against Drupal.org is really not an option. It's too big to try to develop Drupal.org locally. There's too much data. I mean, how many people have a 300-gig laptop or desktop hard drive to put the entire database on? It's just not feasible.

So what we do is we provide development environments for people who want to contribute to Drupal.org. Speaking of which, there's a sprint on Friday. And so what we do is we basically build these development environments: we sanitize the database, and we then spin up a development environment per user. This process used to take four hours per development environment, which, you know, you show up to a sprint, it's 9 a.m., you're like, hey, I want a development environment to work on this issue. And then you've got to wait four hours. We've got that down to about 10 minutes now through a combination of a bunch of stuff. That's why there's... There's a lot of bullets here, and it's been kind of an ongoing hard problem to solve. So right now, and it's evolving, we're using a combination of Docker and a whitelist sanitization process that sanitizes out a ton of data. I mean, we go through and we run queries that pull not all the data from Drupal.org, because it's such an old site: we limit by time, limit by association, or it's private data and we don't want it at all. Or wipe it, it's cache, you know, cache tables, stuff like that that's easy to purge out. We purge it all out. So we have a custom sanitization process right now.
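A whitelist-style sanitization pass like the one described above boils down to: drop ephemeral tables outright, scrub personal data, keep only what developers need. This sketch uses SQLite and made-up table/column names, not Drupal.org's real schema or tooling.

```python
# Sketch of database sanitization: truncate cache tables, replace real
# email addresses with per-row placeholders.
import sqlite3

def sanitize(conn):
    cur = conn.cursor()
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        if table.startswith("cache"):          # ephemeral: purge outright
            cur.execute(f"DELETE FROM {table}")
    # Scrub personal data: every user gets a synthetic address.
    cur.execute("UPDATE users SET mail = 'user' || uid || '@example.com'")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER, mail TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'real@person.org')")
conn.execute("CREATE TABLE cache_page (cid TEXT, data BLOB)")
conn.execute("INSERT INTO cache_page VALUES ('front', x'00')")
sanitize(conn)
print(conn.execute("SELECT mail FROM users").fetchone()[0])
print(conn.execute("SELECT COUNT(*) FROM cache_page").fetchone()[0])
```

The real win is that everything dropped here is data you never have to copy, snapshot, or protect in a dev environment — which is how a 200-gig database shrinks toward something distributable.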
We use Docker in kind of a backwards way to build images of the database data, which is why this is in the failure section, because, turns out, people don't want their development environments to just disappear, which is what was happening. So, the database data was in Docker's kind of custom file system, which I won't get into too many details about, but don't use it for any sort of data that you don't consider ephemeral, that can go away. But that also gave us a 10-minute site deployment, because the data was in a container, essentially. So spin up the container, wait for it to start, we could spin up a bunch, run the Drush make, and build out the sites, and we have a development environment. So that solved that problem, probably not in the best way. So it's a continual evolution of things. What we'll probably end up doing is using something like XtraBackup to do binary snapshots of the sanitized data, and with MySQL 5.6 you can actually import the binary data into MySQL tables with IMPORT TABLESPACE, I think, is the command. So, do something like that to give us the speed and keep the data around so people don't lose their development environments. We didn't have full-time staff until about 2014, except for a few of us, and now that we have it, when dev environments go down, it's not a productive week. People lose a lot of work, and it's kind of just a big stopping point.
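For reference, the MySQL 5.6 transportable-tablespace flow mentioned above looks roughly like this. Table and file names here are illustrative, and this is a sketch of the documented steps, not Drupal.org's actual procedure:

```sql
-- On the source server: quiesce the table so its .ibd file is consistent.
FLUSH TABLES node FOR EXPORT;
-- (copy node.ibd and node.cfg out of the datadir, then:)
UNLOCK TABLES;

-- On the target server, with an identical table definition in place:
ALTER TABLE node DISCARD TABLESPACE;
-- (copy the snapshot's node.ibd into the target datadir, then:)
ALTER TABLE node IMPORT TABLESPACE;
```

Because the .ibd files are copied as-is instead of replayed as SQL, a restore is bounded by disk throughput rather than by re-inserting every row, which is what makes fast dev-environment rebuilds plausible.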
So, some failures there, and we're continuing to evolve them to the point where, you know, maybe we can sanitize better, sanitize more, make something that's small enough, smaller than the 10 gigabytes or so that we have them down to now, that would work better in a local development environment. Give us a faster refresh of development environment data, like being able to run a Jenkins job that just wipes your database and runs the updates that you've done against the latest data. Some kind of middle ground between killing your site completely when you've done your development and being able to merge in different changes, do some integration through the development environment. So that's kind of where they're at now. They haven't fully been improved yet, so we put a Band-Aid in place, which is to EBS-snapshot the Docker database every night.

I do want to pause on this slide for one second. Drupal.org actually does use Drush make files for everything. It used to use BZR. That should give you an idea... How many people know what BZR even is? Okay, so BZR was kind of... I don't know if I want to call it a precursor to Git, but it was definitely a competitor to Git. Same tool set, distributed versioning. All the projects used to be in BZR repos, and we moved to actually building Drupal.org sites through make files, so there's no permanent repo that contains everything. We build them as needed. So when you want to update a Drupal.org site, you change a make file and it rebuilds the servers in production. Just a little bit of random history while we're at it. BZR was complex, and it kind of belongs under a failure option because there was really only one person who totally understood what happened there. Yes, I'm looking at you, Neil Drumm. And so it was complex because the tool wasn't widely used. If you ever want to look at some history, go play with BZR.
We had a major schema cache bug. How many people have ever gotten a white page on Drupal.org or a fastly error page on Drupal.org? So we ended up with this fairly complex schema bug issue that was a core issue that would take a web node offline and not have it be able to respond to requests. And then eventually all the web nodes would go down if you didn't step in and do something. So you could sit there and refresh Drupal.org and get a page load, refresh it again, and it'd be redirected to a different web server. Drupal.org runs four web servers. So you have a 25% chance of hitting any one of the four assuming they're all working properly. And what wasn't happening correctly is nothing was actually figuring out that one of the servers was down. And so it was still directing traffic to it. That traffic generated the white screen of death and you got a white screen. And we went through several different steps of trying to fix this. Yeah, most of the steps we went through was trying to do forensics as to what's going on. And there was a lot of core dumping was happening, but we had a lot of things that were stopping core dumps and so we had to fix that and wait for the next random issue to happen. And so eventually we got to the point where we could actually trace through the core dumps and find out where PHP was failing and found the issue on Drupal.org at the same time Neil discovered it and this is an actual bug in Drupal 7. And the... I'm not sure if somebody else had that bug or if they just saw it as a potential maybe someone will run into this, but when you have as much traffic as we do, we ended up hitting this bug. And so there was a patch for that bug on D7 and now we're running that patch and it's been fine since then, right? Like no more... Yes, it's been fine since then. 
And we're not sure why it always had to happen at, like, one in the morning to three in the morning. That's when it would manifest itself, except sometimes it would be in the middle of the day, so... Yeah.

So we've got some solutions. You know, whatever you do on your infrastructure, you should have good monitoring tools in place, and we could probably do an entire session on which monitoring tools do what. Drupal.org is using mostly Server Density, New Relic, Pingdom and PagerDuty, and I'll go through them; on the left-hand side I list some other tools that are available. Server Density is a paid service with a Python-based agent. It runs on your server and it collects data about what's happening on your server: CPU usage, memory usage, how many Apache children there are, what MySQL is doing. And it reports it back to a central dashboard, and on that dashboard you can write all the alerting rules. So if I'm using more than X amount of memory, trigger an alert.

Pingdom is a really simple tool that pretty much just pings your site and sees if it's responding, and you can do some more complex things in there, like trying to log in as a user. So one of the things that's a little bit dangerous with Drupal.org is if you just say, hey, give me the front page of Drupal.org, that doesn't actually tell you if Drupal.org is up. And the reason for that is that that's actually a test of Fastly delivering the page to you. It doesn't actually go back to one of the Drupal.org servers and determine if it's working. So one of the things we did with Pingdom is we configured it to log in as a user, so we kind of forge that login request. And if that times out or doesn't respond in a normal amount of time, then something's wrong. PagerDuty is... How many people know PagerDuty?
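The login-shaped health check described above — probe past the CDN by doing something only the origin can answer, and time it — can be sketched like this. The URL, form fields, and time budget are all illustrative, and the HTTP transport is injected so the probe logic stands alone.

```python
# Sketch of a synthetic login probe: a cached front page only proves the
# CDN works, so probe with a POST that must reach the origin, and time it.
import time

def login_probe(post, budget_seconds=5.0):
    """post: callable(url, form) -> HTTP status code.
    Returns (healthy, elapsed_seconds)."""
    start = time.monotonic()
    try:
        status = post("https://example.org/user/login",
                      {"name": "probe", "pass": "probe"})
    except OSError:                       # connection refused, timeout, ...
        return False, time.monotonic() - start
    elapsed = time.monotonic() - start
    return status < 500 and elapsed <= budget_seconds, elapsed

# Exercising the probe with a stub transport instead of a live site:
ok, took = login_probe(lambda url, form: 200)
print(ok)  # True
```

The key design point is the same one made in the talk: the check must be uncacheable, otherwise your monitoring happily reports a dead origin as healthy.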
So PagerDuty, OpsGenie, VictorOps, they're all kind of in the same realm, and basically, when something happens, they figure out who to page. So you can do on-call rotations. You can keep notes on an issue. You know, if an issue is handed to me and I'm not able to deal with it, because I'm not available or because I'm not technically competent enough to deal with it, I can say, hey, someone else, take care of this. And so it'll alert someone else. It's a really great tool if you've got a team of people, especially if they have on-call fatigue. It's not really useful if you're a one- or two-person shop.

And New Relic is a really nice toolset that basically works as a PHP module. And so it knows the internals of what's happening in PHP and can give you great detail as to what's going on inside your PHP processes. It gives you speed: how fast is your server filling requests? Is it generating errors? It's got some metrics that it builds in there. It's got a really nice phone app that kind of lets you watch things in real time. I know Rudy and I were watching the phone app during the keynotes because traditionally, that's when Drupal.org has a problem historically, although we were kind of saved by the wireless or better infrastructure or Fastly or a combination of all the above. Yeah, it can do specific... It's got Drupal plugins, so it'll actually tell you what module is taking up the most of your site rendering time, what function has callbacks. Neil, you want to talk about New Relic? Which query is bad? That's at the MySQL level. It knows what's going on inside the PHP stack and gives you great visibility in there. Yeah, and we have it integrated with our deployment, so there's a little...
It'll have a point in time on the graph that says, oh, there was a deployment, and we can track, like, if we're going back for historical reasons to see, like, the schema cache bug or something like that: oh, what deployments did we do that may have caused this? Or go back in time and see what was going on there. So, yeah, we have a script that executes and basically does a git pull on the server. It then goes out to New Relic and says, hey, I just deployed at this time due to this Jenkins job. And we'll get to Jenkins in a second.

So, the other thing to keep in mind is when you have lots of servers... We need to show the server slide. When you have lots of servers, and we have lots of servers, you need a central place to look at your logs. It is not efficient, if you have four web nodes and you're trying to troubleshoot a problem, to log into four servers and look through four log files simultaneously. There are a lot of tools out there that do this for you. Splunk is probably one of the better-known ones. It's kind of expensive. Elasticsearch, Logstash, and Kibana are an open source version of this, sometimes referred to as the ELK stack, that you can "just" set up and run. I should put that in quotes because it's a little harder than just, you know, installing some packages and getting it to run. There are a lot of operational challenges there. There are a lot of cloud providers that now do this, where you can basically just send your logs off to the cloud. And, of course, there's the old time-tested syslog forwarding, where you just let syslog forward to a central logs host. We are currently doing the syslog forwarding mechanism. We've played around with the ELK stack in the past. And syslog forwarding works really well. It's simple, it's basic, and it is more than tried and true.

So, Jenkins... Drupal.org's instance of Jenkins. How many jobs do you think are in there? I couldn't even... Narayan, you want to guess? Yeah, there's a lot of jobs in there.
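Going back to the syslog-forwarding point above: normally the syslog daemon itself does the forwarding, but the same idea is visible from application code with Python's standard-library `SysLogHandler`. The host name is illustrative; point it at whatever your central logs host is.

```python
# Sketch of syslog forwarding from the application side: every web node's
# messages are shipped over the syslog protocol to one central host.
import logging
import logging.handlers

def make_forwarding_logger(logs_host, port=514):
    logger = logging.getLogger("webnode")
    logger.setLevel(logging.INFO)
    # Default transport is UDP, the classic syslog forwarding mechanism.
    handler = logging.handlers.SysLogHandler(address=(logs_host, port))
    handler.setFormatter(logging.Formatter("www1 drupal: %(message)s"))
    logger.addHandler(handler)
    return logger

# On a real web node you would do something like:
# logger = make_forwarding_logger("logs.example.org")
# logger.warning("schema cache rebuild took 34s")
```

With rsyslog or syslog-ng doing the same thing at the OS level, you get one searchable stream from four web nodes without running an ELK cluster, which is the trade-off the talk lands on.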
So what we're using Jenkins for is not the traditional use of Jenkins. Traditionally, you use Jenkins to automate your builds. You do a code commit to Git. Jenkins then picks up on the code commit, runs tests, and tells you: is this passing, is this failing? And you can then have Jenkins build it on, like, a Mac, on a PC, on Windows 95, 98, ME, yes, Windows ME, Windows 2000, and see that your build works everywhere.

We use Jenkins to run tasks on slaves. So slave servers are servers out there that have access to different things, and we have jobs defined that we go in and run on these different servers, and we do that for a couple of reasons. One, it provides us the output. Jenkins keeps track of the output of the job build, so we can go back and see what happened. Some jobs run on a schedule, like run this job every 24 hours. Some jobs run when someone pushes a button. And so we keep track of who ran the job, when it was run, and why. We use this for pretty much everything you can imagine. If you want to build a dev site, back to our earlier slide, that's a Jenkins job. Oh, yes, that's pretty good. If you want to deploy Drupal.org from a Git repo, that's a Jenkins job. If you want to run cron, that's a Jenkins job. If you want to sanitize the database, that's a Jenkins job. Make a backup of the database, that's a Jenkins job. And Jenkins jobs are nice in that you can do dependencies. So I can say, hey, run this Jenkins job, and if it succeeds — and you can see that the second job there failed because it's got this big red circle around it — if it succeeds, run this Jenkins job, then run this Jenkins job. Jenkins can also alert you via email, IRC, HipChat, Slack. It can create JIRA tickets automatically. Jenkins is a really awesome tool. And if you've got more than one server and you're SSH-ing into servers to run jobs, or you've got crontabs configured on multiple servers, you should look into Jenkins. There's a competitor.
I don't want to call it a competitor. There's a newer tool called Rundeck, which is kind of trying to play in this space. Jenkins has been around a lot longer, and I think it's a better tool, but I haven't looked at Rundeck in over a year. So I would still stick with Jenkins. It's a really great tool, and it saves us a lot of energy when trying to figure out what happened where.

How many people are using a configuration management tool? I'll ask that another way: who can log into a server and edit apache.conf or php.ini directly and save it? I see some people like doing this. So Drupal.org uses... Oh, cool. Drupal.org uses Puppet to pretty much configure its servers. Rather than logging into a server and saying I want this package installed, I want that package installed, and I want this config file to say this and that config file to say that. Oh, I typoed that config file, I'm going to spend the next three hours editing it. And then of course, when you're dealing with a multi-node web cluster, all the config files across all the nodes have to be identical. And if they're not, you end up with some challenges. So Puppet lets us do that really easily. We can define the files and their configuration as resources in Puppet, and then we basically go onto a server and say, hey Puppet, this server is a web server, and the configuration for a web server is pushed down to that server. If we edit the configuration for a web server, say adding another vhost, then that just gets pushed out to all of our web servers. It lets us do things like, when there's a vulnerability that comes up and you've got to update glibc across all your servers, rather than logging into all your servers and running a command and then, you know, kind of hoping that you remembered all your servers, you go into a Puppet configuration file and you say, hey, ensure that the glibc package is the latest version. And then you can go to a dashboard and watch all your servers update.
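The pattern described above looks roughly like this in a Puppet manifest. Everything here — class name, file paths, package names — is an illustrative sketch, not Drupal.org's actual configuration:

```puppet
# Illustrative role class: every node classified as a web server
# converges to exactly this state.
class role::webserver {
  package { 'httpd':
    ensure => installed,
  }

  # One vhost definition, pushed identically to every web node.
  file { '/etc/httpd/conf.d/drupal.conf':
    ensure => file,
    source => 'puppet:///modules/role/drupal.conf',
    notify => Service['httpd'],
  }

  # The glibc-style emergency: change one line, watch the fleet update.
  package { 'glibc':
    ensure => latest,
  }

  service { 'httpd':
    ensure => running,
    enable => true,
  }
}
```

The point of declaring state rather than running commands is that a node that drifted (or that you forgot about) gets pulled back to the declared state on its next Puppet run, which is exactly the "hoping you remembered all your servers" problem.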
Or, if you're really impatient, you can go update all your servers yourself, but watch them in the dashboard as they update so you know if you've missed one.

We also use some DRBD configuration for our database servers and for our NFS servers. And Git servers. Thank you. Yeah, actually, database is on there because we use the Percona Replication Manager, which is a set of plugins for Pacemaker, Corosync, and Heartbeat, which are part of the Linux high-availability project. So it's a cluster suite, essentially, and those plugins manage the replication of the database pair that's in production. It's on GitHub, it's just a set of tools, and there's decent documentation around using it. What it does is let us manage the cluster using the HA tools. It tracks replication between the primary and secondary database servers and manages which one's primary, which one's secondary, or whether one or both are out of rotation; all of that gets managed by the Linux HA suite.

And then DRBD. Who's used DRBD before? A few people. It's essentially RAID 1 over a network, is how I would describe it, simply. It manages block-level replication of drives. So if you have two servers and you have something like Beanstalk, which writes its queue to a bin log, it doesn't really have HA built in the way MySQL does, where replication happens in the software itself. You can put that bin log on DRBD and it will replicate it. One node can be active at a time. If the active node fails for some reason, DRBD can be failed over, using the Heartbeat/Pacemaker cluster suite stuff, to the secondary, which can then mount the DRBD device. It manages failover, essentially, for things that don't have that built in out of the box.
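The property that makes this failover safe, which comes up again in the Q&A later, is that DRBD under protocol C is synchronous: a write is only acknowledged once both nodes have it, so the secondary is always an identical copy. A toy model of that idea, not DRBD itself:

```python
class Node:
    """A storage node holding replicated blocks."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block number -> data

class ProtocolCPair:
    """Toy model of DRBD protocol C: a write returns only after BOTH
    nodes have stored the block, so a failover loses nothing."""

    def __init__(self):
        self.primary = Node("primary")
        self.secondary = Node("secondary")

    def write(self, block, data):
        self.primary.blocks[block] = data
        self.secondary.blocks[block] = data  # synchronous replication
        return "ack"  # acknowledged only after both copies exist

    def failover(self):
        """Promote the secondary, as Heartbeat/Pacemaker would on failure."""
        self.primary, self.secondary = self.secondary, self.primary
        return self.primary.name

pair = ProtocolCPair()
pair.write(0, b"beanstalk bin log entry")
promoted = pair.failover()
survived = pair.primary.blocks[0]  # identical copy on the new primary
```

With an asynchronous protocol, the write could be acknowledged before the secondary had the block, and a failover at the wrong moment would lose it; protocol C trades some write latency for that guarantee.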
We use it for Git, we use it for Solr, we use it for NFS and Beanstalk, and pretty much anything else where we're like, oh, that's a relatively low-traffic thing, we can just throw it into a two-node cluster.

The other thing we should mention is that we use Fastly as our load balancer. In our previous environment, we had load balancers, and traffic would come in and those load balancers would figure out which server to point it to. What we did is we got rid of those load balancers and basically said, hey, Fastly, we have four web servers, route traffic accordingly. And Fastly can actually check whether those web servers are healthy; it's a feature in Varnish. If they're not healthy, Fastly, in theory, doesn't route traffic to those web servers.

So, you know, this is something that happens when you're running a volunteer infrastructure for so long: you don't really have planning. It's, oh hey, we need to do something. Or even worse, we need this working, and we need it working 15 minutes from now, because we're in the middle of a DrupalCon and the server's going down. So we haven't always had a really good planning mechanism in place, and in some instances, planning has been however Narayan was feeling that morning. So, you know, we can't really stress this enough: when you're building an environment, plan. Plan everything. What is your load? How are deployments going to be done? Are you going to test? How many servers do you need? What can you handle? What happens if you get a traffic spike? What are the business continuity rules around those things? Plan everything, and then test it, preferably before you're in production. Have a staging environment so that you can test these things without taking down your primary site. You should be really diligent about this. Don't just assume that because your provider who's selling you an instance says they can do X, your site can do X on their instance.
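The Fastly health-check behavior mentioned a moment ago can be sketched as: probe each backend and only keep the healthy ones in rotation. This is a stub in Python, not Fastly's or Varnish's actual probe implementation; the node names and statuses are invented:

```python
def healthy_backends(backends, probe):
    """Keep only backends whose health probe returns HTTP 200, so
    traffic is never routed to a dead web node. `probe` stands in
    for a real HTTP health-check request."""
    return [b for b in backends if probe(b) == 200]

# Stub probe results standing in for real HTTP checks; www3 is "down".
status = {"www1": 200, "www2": 200, "www3": 503, "www4": 200}
in_rotation = healthy_backends(["www1", "www2", "www3", "www4"], status.get)
```

In real Varnish/Fastly configuration this is declared as a backend probe (URL, interval, and a threshold of how many recent probes must pass), rather than written as code.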
And I had a customer come to me a while ago, and they're like, hey, you told us that my site would run on your hardware, and it's not working correctly. And I'm like, well, it should work correctly, you've got a dedicated server there. And I'm looking through their site, and they had 572 modules enabled. Oh, well, that's why it's not working. So, you know, this is probably one of the most important slides in here. Plan, plan, plan. Have procedures in place. Know what your limits are. Know what your environmental limits are, or your provider's environmental limits if you're not hosting yourself. Yeah, and tools like Puppet, things that can kind of enforce that plan, are good to have. Somebody edits a file on a server somewhere, and Puppet's like, nope, back to the plan. So yeah, I mean, planning is one of those things where you've got to be on top of it before you run into a catastrophe, especially if you don't have a full-time staff of SAs, system administrators, sitting around to help you when things go bad. And, you know, if you are running your own infrastructure, you should have a vendor contact available to call in the event you get stuck, because your bosses are going to expect that environment to be up and running 24/7, and if something goes wrong and it's outside of your scope, you should have someone you can call on.

We've got some new features available on Drupal.org, such as... throwing what? Okay. So, I assume this is what we're doing to Util once we decommission it; Util is a very old server. So we've run 138 tests since July... 138,000 tests, I'm sorry. We've had Drupal CI, our new testing infrastructure on Drupal.org, essentially stable since Barcelona: 138,000 tests since July, with minimal feature enhancement since December. We've added testing to the test runners, and we've got JavaScript testing now working with PhantomJS, as that was a core maintainer request.
Drupal CI is a really cool project for the Drupal community, and it has really enabled Drupal 8 to kind of come over the final mile. So, we have this new site called packages.drupal.org, which Ryan has been working really hard on. If you were in the last session, you know it's basically a Composer endpoint for Drupal.org that allows you to use Composer to get data out of our system. One of the challenges with Composer is that it wants this data in a certain format, and you can kind of fake it if you really want to, but since we have all this metadata on Drupal.org, we can put it in the format Composer needs, and that's what packages.drupal.org does. It's at alpha status, but you can use it now, test it, and give feedback. Out of curiosity, has anybody in this room tested that yet? Aside from us?

So, my favorite part of this: ask some questions. Anybody have questions on Drupal.org, on infrastructure? You've got three, four people up here; come up to the mic. We've got four people here, and we've hit almost every problem imaginable in one form or another.

I was wondering... is this on? Yes. Oh, DRBD. DRBD. Have you guys tried other things? What was the reasoning behind choosing that, or was it more like, it was like that and we don't want to mess it up? It started a long time ago. DRBD actually started with the NFS servers, because this was way back when there were really no other options, and our NFS servers haven't actually changed significantly since I built them originally. One of the ways you can run NFS is on top of DRBD, because it's not asynchronous, it's entirely synchronous, depending on the protocol you choose, and we run it under protocol C, which is the synchronous one.
So, you can put the NFS metadata on the synced volume and have an absolutely perfect copy that is synchronous, which means you can run a failover NFS node and it will actually work, assuming you change some things in the NFS configuration, because from the client's perspective nothing has changed; it's just that the NFS server has gone away and come back, and it looks exactly the same. It turns out that synchronous quality is useful in some other applications, and also it has proven itself really well, and that is basically why. I think DRBD still has a place in some situations just because it is so proven, and it's volume level. It was said to be disk level before, which it is, but you can actually do it at a volume level and make sure that absolutely everything on that volume is synced over exactly, at the block level, and there are applications where that's really useful.

It sounds like you have an alternative; would you like to say what the alternative is? Yes. I haven't seen it in production, but I've heard about it, and some people have horror stories... I've heard some, and have my own, about Gluster. Right, so it's definitely a blast from the past; it's not one of the new tools, it's been around for a while. Actually, a very common setup with DRBD was not the master-slave architecture that we have, necessarily, but having it sync in a master-master setup with a clustering file system on top of it that has a locking subsystem, so it's only writing to one, and that's really where it started coming out. DRBD, in a few places we use it, it's the right decision; in a few places, it's just because we're used to it. Gluster I have some other personal problems with, but that's unrelated. There are other options between DRBD and Gluster that would be good to look at; the ones I'm interested in right now are things like CephFS, though those are still only marginally useful in parts.
Gluster is really useful for virtual deployments, because there just aren't that many players in that game. I'm also looking, in a few places, at using more of a document store exposing an S3 API, but that's kind of niche.

Yes? I'm very familiar with New Relic and the value you get out of that, but I'm not very familiar with Server Density. Can you provide some examples of what you guys are using Server Density for? Cool. Yeah, so, it's a live demo. Server Density replaced the sort of shared Nagios that we were using with the Open Source Lab, as a way to not run our own Nagios server. It actually has a Nagios API integration plugin, so we could kind of migrate some of our things over from Nagios to it, and we use it to do the basic system monitoring stuff that we were doing with Nagios. Why Server Density over something like New Relic Servers, which also does a lot of this? The reason was that the New Relic stuff only worked on very specific operating systems that we weren't running everywhere, whereas Server Density worked everywhere, they had a Puppet module, it was easy to deploy, and they do all of it for us in a nice way.
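The kind of threshold-based alert rules this monitoring provides can be sketched like so. The thresholds and actions below mirror the examples given in the talk, but this is an illustrative Python toy, not Server Density's API or the real drupal.org alert configuration:

```python
def check_thresholds(metrics, rules):
    """Compare each sampled metric against its threshold and collect
    the alerts that should fire, Nagios/Server Density style."""
    alerts = []
    for metric, (limit, action) in rules.items():
        if metrics.get(metric, 0) > limit:
            alerts.append((metric, action))
    return alerts

# Illustrative rules: threshold -> action to take when exceeded.
rules = {
    "cpu_percent": (95, "open incident + Slack"),
    "disk_percent": (90, "open incident + webhook + Slack"),
    "mysql_connections_per_sec": (400, "page everyone"),
}

# One sampled snapshot of a server; only disk usage is over its limit.
sample = {"cpu_percent": 40, "disk_percent": 97,
          "mysql_connections_per_sec": 120}
fired = check_thresholds(sample, rules)
```

Hosted tools add the rest on top of this core loop: per-minute historical samples, server groupings so rules apply to a whole cluster, and the notification fan-out (incidents, webhooks, Slack) described below.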
Server Density has been really easy to use. It's basic, I mean, it's really simple, and the New Relic configuration can get difficult. So we've got groupings of servers, and you can click on a server grouping and write some rules in there: if CPU utilization gets above 95%, trigger an alert; if disk usage gets too full, trigger an alert; if MySQL connections get greater than 400 a second, trigger an alert. And for every alert you have different actions set up, so I can see that for disk usage we're going to open a critical incident, trigger a webhook, and send a Slack message, and for something like MySQL connections we're going to alert a bunch of people and do all of the above. Those are set up at the grouping level, but when you get into the individual server level, you've got historical data going back as far as you want to go back. This may take a second, it's not that fast. So I can look at the load average over time, the memory usage over time, the disk I/O over time, the CPU utilization over time, our disk usage over time.

What was going on there? Good question. I deployed a set of scripts that basically checks the cluster for which bin logs have been read and applied, and they're running on those servers now to purge the bin logs, because we were hitting a point where that volume was getting close to full frequently. Yeah, so it was getting full because we were writing more to the database, because we moved not all, but most, cache tables back into MySQL, essentially field cache. So a lot of queries started hitting the database and writing to it, not just selecting from it but actually inserting, and those inserts get written to a bin log, which gets moved over to a drive, and MySQL's granularity on expiring the logs is one day at a time. And we were writing enough to fill up... well, it would fill up the 137 gigs that we had dedicated to it, but, you know, at the rate we were writing things, we needed like a
400 gig volume there. So I implemented some scripts, and that's reflected in this tool; you can see quite the cutoff here. You know, the other thing Server Density does is give you a snapshot of every minute of every day, so you can see what happened. You can see that the loads on our database server are low; look at the CPU utilization, the memory usage, the ethernet traffic going on. It's just a nice visual interface for pretty much everything going on within your environment.

It also has the ability to monitor services. This... I did not mean to click that. So this is a little bit like what Pingdom does, but what it's basically doing is making a request to a 404 page and verifying that it gets a 404 back, from a bunch of different locations around the world, and if it doesn't get a 404 back, it triggers an alert. The 100% you see here is basically because it's testing Fastly. This other one, because it's a POST request to a 404 page, won't get cached by Fastly, so it goes straight past Fastly to one of the actual web nodes that we host. I'm testing a 404 because it goes right through to the web nodes with no cache involved, and I'm doing it as a POST request to really make sure there's no cache involved; Fastly is configured not to cache POST requests.

Any other questions? We can show New Relic while we're here. So, we're running MariaDB 10.0,
whatever the latest point release is, mostly because we haven't had a reason to upgrade, and the database is the most critical part of all the infrastructure. Database updates and things like that are easy to do, but they have the most potential for breakage. We have some stories from doing MySQL upgrades and Drupal doing things that we weren't planning on seeing; similar to the schema cache bug, you hit things and you're just like, oh, we found that stress case.

So, yeah, here's New Relic. With New Relic, it's breaking down the amount of time the web servers are spending processing PHP, processing MySQL requests, processing memcache requests, and then processing external requests that other things are making in there. Then we've got which functions are actually taking the most CPU time, and it breaks it down by the four web nodes that we're hosting, and they're all pretty much the same speed. Some of these are different hardware, so you can actually see that www7 is better hardware than www1, by the average amount of time it's taking to respond to requests, the CPU utilization, and the amount of memory it's using. Oh, I'm sorry, I should not do that. So it's a really nice tool; they offer a free tier and a paid tier, and we don't get any... well, I don't get any kickback from New Relic.

Any other questions? So, I'd like to remind everybody: come get involved with the Drupal community. You can work on Drupal.org or any project. And please fill out the session evaluations for this session. Thank you so much!