Good morning. The time's now 9:25. The subject we're discussing today is highly available Drupal. If that's not what you came for, you still have a bit of time to leave. But if you are looking for highly available Drupal, that's what we're going to discuss.

Let me first introduce myself. My name is Bram, or "Bram" for the English speakers. I used to be in molecular biology, at least that's what my professors at university told me. I got into big data before big data was a thing in the IT world: we produced about a million data points per week, which means you have to write code to get through them. So I became a developer. I joined the dark side, as my colleagues said, and spent my time writing bioinformatics code. In the academic world, when you're the developer you're most of the time also the server guy, so I built my own academic clusters, which meant I joined the dark side again and became the ops guy. That's when Inuits, my current employer, picked me up and said: this is the guy you want to work with. From afar it looks like a strange path, but to me it was a gradual sliding scale.

So what we're discussing today is also the ops view: what it would take to run Drupal in a reliable way. At Inuits we have a couple of reliability principles. This is the stuff we teach our developers and our operations people: what would it take to have confidence in our infrastructure? Not just the infrastructure, but also the actual platforms: the platform, the infrastructure, the code, the monitoring, the processes that are in place, or should be in place, before we can reliably serve a product to a customer or serve one of our own products. So let me step you through the principles we have come up with.

First of all, and the previous speaker already mentioned it: stick it in a version control system. If it doesn't exist in version control, it doesn't exist. If it lives in your home directory, however well meant it is, it does not exist: you will leave the company, you will go on holiday, and then we need to hunt down wherever your script is running from. So we stick it in version control, which gives us the benefits of version control: history, blame, easy diffs. Some artifacts you produce are binary, and then version control makes it harder to work out what the actual diff is, but whatever yields the binary artifact still needs to be in version control. If it's not in there, it doesn't exist. This is a very strict rule we apply to pretty much anything.

We deploy artifacts, not code. This is a very important principle. What we see with a lot of customers or beginning teams is that they go to their production system, SSH into it, cd to the directory and do a git pull. This is dangerous, because you will never, ever know whether that process will succeed until you try it, and that is too dangerous on a production system. So we have a promotable artifact: a pipeline, a process where we create the artifact. In our case we like to package it as a native package. We're a CentOS and Red Hat shop, so we produce mostly RPMs. Which brings another benefit, because RPMs, or any built-in package manager, are how you stand on the shoulders of giants. They've been around for literally decades, since the beginning of operating systems.
They give you the benefit of history. They give you known version updates, upgrades, downgrades, dependency management. It's all built in; you don't have to reinvent it yourself. And the other benefit: most native package managers I know of have a feature that tells you which package provided which file, and even whether a file has changed on disk since the installation took place. Not only does that catch colleagues who "helped out" on production by fixing something directly, it also helps you track down hackers. On more than one occasion when we were brought in to find an attack vector, we basically looked at which files had changed on disk. It's as simple as that: if something has changed that we shouldn't have changed, then probably someone else did.

If you automate things, as the previous speaker already mentioned, having the pipeline build an RPM out of something is only a small extra step. I believe the official Debian packaging guideline is about 800 pages, but there are tools out there that basically make it a one-liner, like FPM, which we use heavily. It won't yield packages that would be accepted directly into Debian or CentOS, but it works just fine.

While we're on the subject of artifacts: we only create artifacts based on resources we have in-house. That's another strict rule we apply. The first reason is a bit of a teaser: did I mention angry upstream developers? Most language ecosystems have a central package repository, think RubyGems, think Composer in our case, where upstream developers have published packages that everybody else relies on. If one of those packages goes away, your build fails. Some people will say "meh", but if that happens at the moment you want to push out that emergency deploy, you're screwed. It means manual intervention, and manual intervention means straying away from the thing you know best, which is your automated pipeline. When you're in a stressful situation, you do not want to deviate from your standard patterns, or you want a very well scripted deviation. So we build in-house, because then we know we have the tools in-house. We are responsible, but we also have the opportunity to fix things when something is broken. Composer goes down, RubyGems goes down; I don't know if people are still around from the early GitHub days, but GitHub used to go down a lot on Fridays. People built processes around it, "don't deploy on Friday", which is weird anyway, because if you have enough confidence in your process you can deploy at will on any day of the week. Also, we work in a lot of corporate environments, and corporate firewalls mean that Composer might not be available to you and will not be made available to you, so you need something in-house. In our case it's a combination of Pulp, which is our RPM package repository, and Nexus. Other tools are available, like Artifactory, which is actually better but comes with a heavy license fee. So: build in-house. This is something our junior developers and engineers actually struggle with. For Composer lock files we have a built-in check: if your Composer lock file has a github.com URL in it, we fail your build, because we have a Nexus setup that proxies those packages in-house. This can still give you problems if Composer isn't available at the moment a developer wants something new, but that's already a much smaller risk than what we started with.
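That lock-file check doesn't have to be fancy. A minimal sketch of the kind of build step we mean, written here as a small PHP script; the file path, messages and mirror policy are illustrative, not our actual tooling:

```php
<?php
// check_lock.php - fail the build if composer.lock still references github.com
// instead of the in-house mirror. Paths and wording are illustrative examples.
$lock = json_decode(file_get_contents('composer.lock'), true);

$offenders = [];
foreach (array_merge($lock['packages'] ?? [], $lock['packages-dev'] ?? []) as $package) {
    // Both the dist (zip) and source (git) URLs should point at the mirror.
    foreach (['dist', 'source'] as $section) {
        $url = $package[$section]['url'] ?? '';
        if (strpos($url, 'github.com') !== false) {
            $offenders[] = $package['name'] . ' (' . $url . ')';
        }
    }
}

if ($offenders) {
    fwrite(STDERR, "Build failed, packages not coming from the in-house mirror:\n  "
        . implode("\n  ", $offenders) . "\n");
    exit(1);
}
echo "composer.lock is clean.\n";
```

Run it as one of the first steps of the pipeline, before the artifact is even built, so a bad lock file never gets near production.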
The only way to get to prod is pipelines. This is an example of a Jenkins workflow, one of the newer visualization tools. We have a multi-step process, same as the previous speaker: the only way to know whether it works is building it and deploying it, but not into production first. For the people looking closely: we have nice names for our environments. We do not call them dev, UAT, prod; we have names like practice, theory, reality. Because if it works in theory, then we can deploy to production.

We also have multiple productions. This is something people struggle with. For instance, our demo site gets launched before the general release. That's where people normally get into trouble: it's not fully production, but it's not UAT either, because in UAT we run quite destructive tests which we don't want customers to see. So we have an additional production platform.

This one was important enough for me to put in caps: config does not belong in a package. Config has a different release cadence from code. For instance, if I want to change my config on production, I don't want it in the original artifact, because then I need to promote it all the way through to production without any benefit to the underlying environments. How we actually deploy it we'll discuss on the next slide, and I'll show a small settings.php sketch of this in a moment. By separating your config from your code, you also get built-in opportunities for canary testing and A/B testing, and you can deploy and feature-switch as a two-step process: deploy your code, and when you're ready to toggle feature X, deploy your config. That is much quicker than doing full deploys. Full deploys mean bigger artifacts that take longer to propagate through your network, whereas config is mostly text files. So it's kilobytes versus megabytes, or even gigabytes at some point if you're deploying Java code.

We use config management for all the things, otherwise known as infrastructure as code. Our poison of choice is Puppet. This lets us model not only our code but our infrastructure as if it were code, and because it is now code, we get all the benefits of code: we can stick it in version control, we can test it, we can actually version it. So it's not just "at git hash X"; at any given point we know exactly which version we're modeling our infrastructure with. It also allows us to have environments for the infrastructure itself. This is actually an L-shaped setup: we have a production pipeline that goes to prod, and our production Puppet code is deployed for dev, for UAT, for prod, for the platform itself. So we can develop Apache, HAProxy and MySQL settings without interfering with the active development that's trying to get to production.

And measure all the things. Monitoring is not an afterthought; the next part of this presentation goes into stuff we actually saw in our monitoring. So do not accept "no budget, no time": it's built in, it's part of the work you're actually trying to do. And the last one should actually also be in caps: we tell customers we do not go to prod if we do not have meaningful logs, metrics and alerts. Because if you pay me to wake up in the middle of the night, I only want to be woken up by stuff I can actually fix. It needs to be actionable, and actionable means I know what it is, I know how to fix it, and it actually needs to be fixed right now.
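To make that config-outside-the-package idea concrete in Drupal terms, here is the kind of settings.php split we mean; a minimal sketch, where the /etc/drupal path and the file name are illustrative choices, not necessarily what we run:

```php
<?php
// settings.php ships inside the code artifact; everything environment-specific
// stays out of it. Config management (Puppet, in our case) drops this file on
// each environment, so config can change without promoting a new artifact.
// The path and file name are illustrative choices.
if (file_exists('/etc/drupal/settings.local.php')) {
  // Database credentials, trusted hosts, feature toggles, cache backends, ...
  include '/etc/drupal/settings.local.php';
}
```

A config-only change then means pushing a new /etc/drupal file through config management, a few kilobytes, instead of rebuilding and re-promoting the whole artifact.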
And actually, this next one should be the obvious one. If you want to be reliable, if you want to be able to survive catastrophic failure, then there needs to be more than one of everything. It's an obvious conversation to have, but not always obvious how to fix it. What does N+1 mean? In our case we run everything on our own hypervisors, so it needs to be on a different hypervisor. But does it also need to be in a different rack? Does it need to be in a different DC? Inter-DC will give you weird latency problems. Does the customer actually want to pay to run at another DC at a different supplier? Suppliers don't go under too often, but we've had one customer where the entire supplier went offline, quite dramatically, quite quickly. So they now pay us to run at two suppliers, which means two times your processes, and we try to keep those as close to each other as possible. In the end it's a conversation with your customer about what they're willing to pay for. How resilient do you want to be? What level of catastrophic event do you want to be able to survive? Banks want to be always online; a website serving cat videos is not going to care whether the Netherlands floods.

So in our case, the simple summary is: a working service is an automated combination of our application code, our infrastructure code, and the availability of monitoring.

Now that we've built up our principles, we're going to discuss surviving the hurt. This is our basic setup. Someone on the Internet connects to a virtual IP. There are two HAProxies in two different data centers, which cross-connect to Apache, where our Drupal actually runs. Those then cross-connect via a virtual IP to our MySQL, and we have shared storage on GlusterFS. We're going to build our way back from right to left.

First of all, shared storage. It's probably obvious, but Drupal can generate a lot of small files very quickly, and that will basically kill your shared storage, so please tune it for small files. The other problem we had was with Twig. We were originally a Symfony shop, before that a Drupal shop, and we were moving back into Drupal, so we ran into a lot of surprises. When Twig writes out its cache, it uses a unique cache identifier. If you have two instances of Drupal writing to the same directory, that hash will most likely change during a deploy, so one node will be clearing the cache while the other one tries to fill it, and that gets you into race conditions quite quickly. So when you run this setup, please move your Twig cache, especially the Twig hash, off your shared storage; there's a small settings.php sketch of this a bit further on.

The other thing we saw is that Drupal is quite sensitive to geographic latency problems. We had one MySQL cluster spanning two DCs, and those DCs were quite far apart: one in Germany, one in France. When a customer switched from the German MySQL to the French one, latencies went up quite quickly. Our monitoring actually said everything was fine, and the 95th percentile response time looked fine, but when we looked at the populations within the response times, we saw that the requests going across countries were slowing down considerably. So we fixed that by moving the clusters, the DCs, closer together. Luckily we have providers whose locations we know, so we can pick an optimal distance. Which has its own problems: if you have a local catastrophe, like the power grid going down, you're more likely to be hit on both sides. But then, if your customer is willing to pay for it, you can split across four data centers in two countries. In our case, customers are not really willing to pay for that.
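Coming back to that Twig cache for a second: in Drupal 8 the compiled templates live in the php_storage backend, and settings.php lets you point that at fast local disk instead of the shared GlusterFS mount. A minimal sketch, assuming Drupal 8's php_storage settings; the directory is an illustrative choice, not our exact layout:

```php
<?php
// settings.php snippet: keep Drupal 8's compiled Twig templates on fast local
// disk instead of the shared (GlusterFS) files directory, so nodes stop
// fighting over the same cache directory. The path is an illustrative choice.
$settings['php_storage']['twig']['directory'] = '/var/cache/drupal/twig';

// The same override also takes 'class' and 'secret' keys if you need to tune
// the storage backend itself; core's default.settings.php shows the pattern.
# $settings['php_storage']['twig']['secret'] = $settings['hash_salt'];
```

Each web node then compiles its own templates locally, and a deploy clearing caches on one node no longer races the other node filling them over the network.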
Master-master: we use a simple MySQL master-master setup, which means MySQL uses staggered increments for primary keys. That bites you in a table like semaphore, which actually wants IDs that are unique and the same across all databases. So we had to either put code in place that stays away from semaphore, or have reconciliation mechanisms in place.

Moving your cache out of the database also helps with the geographic latency problems. Most people on Drupal will do this naturally, but we had to experience it ourselves: filling the cache can actually destroy your database when you have too many active users. So we moved it into Redis. First we moved it into local Redis, which gave us orchestration problems: you basically have two local caches, so if your customer updates an article on one node and then gets routed to the second node, they don't see their changes. So we quickly moved to a failover setup, where one Redis is primary and we only flip over when the primary goes down; that's not shared, that's failover. And now we have orchestration tooling and code in place that flushes the cache on both sides. There's a small settings.php sketch of the Redis switch at the end.

Then Composer optimization. We were a PHP shop, we knew Composer, and the way it builds up its class map can be optimized. It's called the authoritative class map, which basically forbids the PHP autoloader from walking through all your code trees; instead it builds one authoritative file, which feels counterintuitive, but it's basically one big key-value map of class to location. We've had problems getting Drupal to accept this, because some places generate proxy classes which are not automatically put into your Composer class map, so you have to manually or semi-automatically go through your tree and add those.

Post-install scripts. Because we use RPMs, most deploys are almost automatic, but we allow developers to write their own scripts, the post-install scripts, and that's where the trouble starts. We found they wrote tasks that took seconds on their laptop with a small data set, but minutes in production with the big data set. So we aggressively review and refactor those: anything that can happen in the pipeline happens there, even if it takes longer or it's a bit annoying to build things like caches at build time. That's how we try to work around 20-minute deploys, and I think I'm actually a few minutes over. So thank you for joining. Tomorrow there are more places to collaborate, and if you'd like to rate me, here's a link.
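For reference, the Redis switch mentioned above is, in Drupal 8, only a few lines in settings.php once the cache backend module is in place. A minimal sketch, assuming the drupal/redis contrib module and the PhpRedis extension; the host name is an illustrative placeholder, not our actual setup:

```php
<?php
// settings.php snippet: send Drupal's cache bins to Redis instead of MySQL.
// Assumes the drupal/redis contrib module is installed and enabled and the
// PhpRedis PHP extension is available; the host is an illustrative placeholder.
$settings['redis.connection']['interface'] = 'PhpRedis';
$settings['redis.connection']['host'] = 'redis.example.internal';
$settings['redis.connection']['port'] = 6379;

// Make Redis the default backend for all cache bins.
$settings['cache']['default'] = 'cache.backend.redis';
```

Because this is config, not code, it lives in the environment-specific settings file described earlier, so pointing an environment at a different Redis is a config deploy, not a new artifact.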