All right, my name is Jason Ford, CTO of BlackMesh. I've been doing security for most of my life. I want to talk about some of the challenges we see and the kinds of things we've gone through with our customer base. I'm not going to be talking about our customer base, though, and I'm not really going to be talking about what we do, per se. I'm going to talk more about frameworks, which, if I say the word security and you get excited, you're going to love this. If you go, oh my god, it's security, I don't want to deal with that, I'm probably going to bore you for 30 minutes, but bear with it. Some background on myself: like I said, I've been doing security since '97. I started in the FedCiv space in the FBI headquarters building and then went around the civilian sector for a while. As an entity, we're PCI DSS Level 1 certified, we have FISMA moderate, and we have FedRAMP moderate as a platform and as infrastructure. We were founded in 2003; I'm one of the founders. And we have customers across the board, on the commercial side as well. That'll be the last BlackMesh thing you see up here, besides the logo in the corner. All right, so security versus compliance. A lot of you out there are probably familiar with this concept; maybe you aren't. Security is how you actually do those things, how you actually implement those controls. Compliance is a snapshot of how you implement those controls. A snapshot really doesn't give you a holistic picture of what's going on; it's just how things looked at that point in time when you looked at them. Security is the framework where you actually see everything associated with that system or that plan. Some pretty notable things in government, for this space anyway: the White House runs a version of Drupal. We actually host the Department of Energy's main website. 
Department of Ed, obviously; recovery.gov; drought.gov; Homeland Security runs it. There are a lot of them. So the point is, it is a secure platform. And that's one of those things a lot of people seem to have to fight over, especially in the government summit yesterday, where we talked a lot about that: how to present Drupal as being secure. That's really the hardest part. If you can get past the Linux part and say, well, Linux is secure, you're doing that with STIGs in the DISA space, you're doing that with other things as well, then why shouldn't other open source technologies like Drupal, like other CMSs, be accepted in that same framework? So frameworks: at the end of the day there are a ton of them, but here are the main ones. PCI for credit cards. FISMA came in 2002, when the federal government decided it wanted some kind of compliance around this, so it created FISMA off of the NIST 800-53 framework. NIST, if you're not familiar with them, is the entity that advises the government and others on how to secure not only their IT infrastructure but all their infrastructure, financial and everything else. ISO standards are more global. HIPAA is obviously healthcare. FedRAMP is the virtual version of FISMA at the end of the day. And then DISA has its own thing as well in the DoD space. Compliance in general really comes down to just these six steps. I know it's hard to boil compliance down into anything, but at the end of the day it's really just these things: going through and figuring out what you have to monitor, what you have to report on, how you assess those controls, how you actually implement them, and then how you monitor those things after you implement them. At the end of the day, it's as simple as that. How you actually have to do that process is not so simple, and that's what we usually run into; those are our problems. 
So, going into a couple of these compliance frameworks in a little more detail, I'm gonna focus on FISMA and FedRAMP more than anything else, just because those seem to be the harder ones to do. PCI is also difficult; HIPAA is kind of a joke compared to the other ones, honestly. But from a framework perspective, like I said, FISMA came out in 2002 and was based on the NIST 800-53 framework. At that point in time, all systems actually had to get an authority to operate, an ATO, based upon how the system looked at that point in time of implementation. Some of those requirements: you have to have a system inventory, you have to basically say what your network diagram looks like, what the system looks like, get all these things in place, then figure out what your risk factors are, and then try to figure out, all right, if these are my risks, are they acceptable risks or not? And it doesn't matter if this is on the software side or the hardware side or any of those sides; it's all of it, including yourself as a human in the system, right? You're part of this; you can be hit with a spearphishing attack, for instance, and compromise the system. And then you go through a certification process and you go into continuous monitoring. I'm kind of flying through this pretty quickly, but we spent literally eight hours yesterday talking about security and compliance and we didn't even scratch the surface. Thirty minutes here hardly does it justice, but I just want to give a high-level overview. Within that framework there are different categorizations, so you can have low, moderate, or high, and it depends upon what that system needs to have. If you have a high system, for instance, and I always use this example, if you're in the energy space and you have nuclear codes, then that's obviously gonna be high, right? 
And there are gonna be a lot more controls wrapped around that, whereas a low is gonna be more like a public-facing website that you may not even collect data on, right? It might be just a brochure site or something like that. It is based upon FIPS compliance. If you've done any federal compliance work at all, or anything in the security world, you know FIPS already. It is the way that encryption happens and how those things are encrypted, whether that data is at rest or in transit, and how you are interacting with that system to actually see that data, whether it's in the clear or not. Just a couple more: security controls, things of that nature, and how we categorize these things. All those things get captured in an SSP, or system security plan. Those are things that you have to write as a system owner. So you as a developer, you have to write an SSP whether you know it or not. You may just be answering questions for your security person, who then takes those answers and writes that plan out. Really it's just saying how your system works. If you're saying you use two-factor authentication for single sign-on, you have to say that you do that and how you do that, right? Do you have LDAP behind that? Do you have something else? Are you using CAPTCHA for your forms? All that stuff matters, right? And you have to document those things. Once you have all that documented, then it goes into those control families, which I think I'll talk about in the next slide. There's a really well-known template format for FISMA, but it's pretty lenient from the perspective of how you write those things; it's really just a Word doc with these control families inside of it. Then you go through the risk assessment, like I said, trying to figure out vulnerabilities and threats. 
It's no different than what you do for Drupal updates or any other CMS: trying to figure out what is actually gonna impact the site and how someone is gonna attack it. And then from that, you either put compensating controls in place to deal with those attacks and mitigate that risk, or you figure out a way to write the control or put a better system in place to deal with it. Lastly is the independent assessment. This is the most painful part. You have an auditor sit with you for long periods of time, and they ask you to take screenshots of things and give them those screenshots so they can see all this stuff happening and actually verify what you say. If you say you have password complexity turned on in your application: show me, right? Screenshot it. And that's gonna be part of the compliance piece of it. So the certification comes out of all of that, that assessment, and all of it goes to, usually, the head of security for that agency, typically in the FISMA space, because the agency typically deals with that. And once they have that and they look things over, if the risk is acceptable to them, then they will give you an authority to operate, an ATO. That's the authorization piece. Continuous monitoring is the last bit of the FISMA requirements, as we saw from the first piece, before we move on to something other than FISMA. Continuous monitoring is just the ongoing things you're looking at. Do you do penetration testing? Do you do credentialed scans? Do you actually update your antivirus? If you have intrusion detection on your stuff, do you actually update those policies? Things of that nature. Do you actually rotate your passwords? Are there logs for that? All these things go in that continuous monitoring bucket. When auditors come in and look at you, they will ask you for all this data. 
That's part of the evidence, and that's the important bit for actually getting that recertification piece at the bottom. If you don't do all that stuff, then they flag you and either you have a very short amount of time to remediate or you lose your ATO; that's what it comes down to. So that was all the FISMA side. FISMA is really a one-time authority to operate for that system as it was deployed. If you make any major changes to it, then you have to change the whole thing and redo it again. FedRAMP is ATO once, and then you can operate it multiple times. It's also virtual versus physical. FISMA is like physical servers, bare-metal servers; FedRAMP is all virtualized, so that'll be Docker containers, cloud stuff, things of that nature. Same kind of concepts, but at the end of the day, what they've done is change some of the control sets to deal with virtualization and multi-tenancy. FISMA is done at the agency level, so whatever agency you're dealing with; FedRAMP is done at the GSA level. GSA owns that program. The PMO office is run by Matt Goodrich currently, and he and his staff basically deal with all these incoming packages from all these different cloud service providers, and I'll get into some of that here in a minute. Anybody who's doing virtual services for the government has to go through this process. That includes infrastructure as a service, platform as a service, or software as a service. So if you came up with a Drupal platform and, say, you wanted to offer multi-tenancy for multiple agencies inside of something, you'd have to be a SaaS provider. And then you'd have to find someone who is an infrastructure or platform service provider to help you with that, or you do it all yourself; either way. It was basically geared at having a standardized approach for offering virtual services, which didn't exist until this program happened four years ago. 
And like I said, there's software, platform, and infrastructure as a service. The whole idea and concept of this was that it can be used by multiple agencies. So for instance, for us, if we got FedRAMP certified, another agency could actually use our ATO and not have to go through that whole process again, unlike with FISMA, right? FISMA is laborious, it's tedious, it's all those things. So having the ability to leverage that ATO from another agency means you get to inherit a bunch of things that you don't actually have to go through again and want to cry yourself to sleep at night as you're trying to deal with this. There are two methods: there's the Joint Authorization Board, or the JAB, and then there's agency sponsorship. Agency sponsorship is the fast path, because the agency in question that gives you the ATO is actually responsible for doing the continuous monitoring efforts with you, whereas the JAB is done by volunteers inside the GSA. There are three high-level security officers from different agencies who volunteer their time to the GSA. How many government agencies or officials do you know who wanna actually volunteer time to read 1,200-page documents? It's a long process and it's awful. I mean, it's bad. There are about 100 CSPs right now in the JAB queue. The average time to get through that is about two to four years, depending upon how fast you move. Agency sponsorship can be much, much less. Same kinds of things we talked about before on the FISMA side for the requirements. Again: hardware, software, network diagram, data flow. Data flow is new, right? So how does your data flow through your application? Where is it encrypted? Where is it at rest? Things of that nature are all important here. And on top of that, if you have to have a trusted internet connection, a TIC, to talk from your agency to your system, that all goes into that data flow section as well. Again, same thing, FIPS 140 based. 
It's actually FIPS 140-2 now. So: how those things are encrypted, how those things get pushed around. Right now, there's only low and moderate in FedRAMP. There are no highs; GSA doesn't know how to deal with it. Not only that, no one wants to put high data in cloud service providers these days. There's an equivalent cloud in DISA space, which I'll get to in a minute. DISA's actual cloud is high, but it is inside of the DoD and only DoD people can use it. So again, security controls, again the 800-53 framework: FISMA and FedRAMP look a lot alike. The controls are written in different words, but they mean the same thing. Access control is still access control, password complexity is still password complexity, incident response is still incident response. It's the same thing. So if you do one, it's pretty easy to transition to the other. Not super easy, but easy enough, right? Low systems typically have about 120 controls. Controls are broken up into those control families like I talked about before, and I'll show those in a minute. Our system is roughly about 300, and that's before an agency can actually add more stuff onto you. Same kind of thing as FISMA with the system security plan: you gotta write that. Just the template is 350 pages before you put any words to it. Once you start writing paragraphs and paragraphs for each control, it can go over 1,000 pages; ours is currently about 1,200. As time goes forward, you keep adding to it. And you have to have supporting documentation; just answering the controls isn't enough. You have to do things like have a user guide on how to use your system, right? How do you do privacy controls? How do you do encryption at rest? How do you do rules of behavior? How do you do security training? All those things are attachments on top of that. So you're starting to see it's a long, complex system of stuff to deal with, right? 
That's why it takes so long to get through it. Usually the first time through, you don't get it right; you have to go back and modify and edit those things. Risk assessment, again, is very similar, and you're seeing a lot of commonalities between FISMA and FedRAMP: figure out those threats and vulnerabilities, and again, do the risk analysis. Here's the different part: the assessment's not done by the agency. It's done by a 3PAO, a third party assessment organization. Those are all certified by the GSA, so they're actually commercial companies that go and get certified. There are about 30 of them today. But here's the difference: not all of them will do accreditation. Some of them will help you write your documentation, so they focus on the up-front work. How do you get your rules of behavior done? How do you get all your documents in place? How do you actually get a successful package through the first time? Because it's costly and it's difficult to do. Certification and authorization, again, are very similar to FISMA: either you get the authorization granted by the JAB or by the sponsoring agency you have. If you get it from the JAB, you get a P-ATO, a provisional authorization; if you get it from the agency, you get an ATO, an authority to operate. Continuous monitoring, again, is kind of the same thing here. You have to do these things more often than I wish to say, because it's really a lot. But at the end of the day, what I'm trying to show here is that you're always going through this agile environment of trying to figure out security. All these things that you guys are doing to develop code, it's the same process for security, and it can be a difficult task to try to find everything and plug every hole when they change all the time, right? 
So again, showing here infrastructure and platform as a service for FedRAMP: you have this ATO for CSPs, and then what we're showing here is different dev shops having stuff on top of this, applications, or system owners in this case. If you're using a platform or infrastructure service and you're not doing multi-tenancy inside your application, then you can use FISMA for your application. You don't have to go through this FedRAMP process, which is much, much lighter that way. And on top of that, you get to inherit a lot of controls out of the CSP's package to leverage for the SSP for your application. All right, so that's FedRAMP, and I'm seeing I have 10 minutes. The next one is DISA. DISA has their own thing, which is awesome because that makes things so much easier to understand. You basically have impact levels in DISA space. Some of these translate over, some of them don't. FedRAMP moderate will translate into DISA impact level two. FedRAMP Plus, which I haven't talked about yet, which is an additional 25 controls on top of FedRAMP moderate, will translate into DISA impact level four. What usually comes with going from four to five is you have to go through background checks, you have to have facility clearances, you have to have a security officer on site. I mean, it's a whole laundry list of things, plus you have to have encryption between all your stuff and the agency. The list goes longer and longer. And then there's actually a six as well; I'm not aware of a six that's out there today. So, some of the basic controls. I just want to pull out some of the stuff that would apply to Drupal in particular, or any CMS really. Some of these things, like AC-7, failed login attempts and account lockout: if you need to lock out an account, right, you need to do that, and you need to log it. Not only do you have to do it, you have to actually report on it. 
And then you have to have some kind of automated alerting system saying that that happened, right? Session lock, AC-11, same thing. You have to make sure that the inactivity timer actually meets those times. So if you have a session lockout, it's not only for Drupal but also for PHP sessions, right? It's both. With Drupal you have to worry about one thing, but you also have to worry about the undercarriage, which may or may not work the way you want, until an auditor finds it, right? And then you have to fix it. And then you go into remediation and you do a POA&M, a plan of action and milestones; basically that's an after-action piece where you have to fix those deficiencies. So, some Drupal modules. We talked about these yesterday in pretty heavy context in some ways. Paranoia, the first one, was used heavily for D7; it hasn't been updated for D8 yet. Security Review is one that has been. The guys at CivicActions have been doing a ton of work on that, and we actually came out of yesterday talking about how they're going to, along with other people who were in the room, come forward with a plan to basically push that even faster through D8. These are just a bunch you can look at; some of them are updated for 8 and some aren't. Session Limit obviously gets you past that AC control, right? Password Policy, same thing: you want to have complex passwords. And the last one should go without saying: just upgrade your stuff. I know it's a pain, but upgrade your modules, upgrade core, keep it upgraded. That's the shortest way to not end up in a remediation step, right? And do that while having some, not continuous monitoring, but actually tracking your changes: change review, that kind of thing. You have to go through that process as well and show that you can document it. All right, that's all the federal stuff, I think. Now, quickly through PCI and HIPAA. 
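To make that AC-7 lockout control a little more concrete, here's a minimal sketch of the logic an auditor would expect to see: count failed logins, lock the account after a threshold, and write an auditable log entry. The thresholds, names, and data structures here are assumptions for illustration, not Drupal's actual implementation or the control's exact wording.

```python
import time

# Hypothetical sketch of AC-7: lock an account after N failed logins
# within a time window, and record the event for the audit trail.
MAX_FAILURES = 5          # allowed failed attempts (assumed threshold)
WINDOW_SECONDS = 900      # 15-minute counting window (assumed)

failures = {}             # username -> list of failure timestamps
audit_log = []            # events an auditor would ask to see

def record_failed_login(user, now=None):
    """Record one failed login; return 'locked' once the threshold is hit."""
    now = now if now is not None else time.time()
    # Keep only failures inside the counting window, then add this one.
    recent = [t for t in failures.get(user, []) if now - t < WINDOW_SECONDS]
    recent.append(now)
    failures[user] = recent
    if len(recent) >= MAX_FAILURES:
        audit_log.append((now, user, "account locked"))
        return "locked"
    return "ok"
```

The point of the `audit_log` list is the part the talk keeps coming back to: doing the lockout isn't enough, you also have to be able to show evidence that it happened.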
So, PCI. I'm sure you guys have dealt with credit cards at some point in your life; you probably have a couple in your pocket, you've bought things. In the service provider industry, PCI DSS is the thing. There's a new version of PCI that just came out, I think at the end of last year. They did a major revamp of the control sets, and it makes it actually more difficult to get through. But that's a good thing, because now credit cards are, in theory, less easy to compromise. It's very different from how FISMA and FedRAMP work. You basically get a report on compliance, a ROC, then you have a QSA, a qualified security assessor, come in, and then you get an AOC, an attestation of compliance, which is the same kind of thing as an ATO. That process is pretty straightforward. And you can do this for your application, for the infrastructure, for your platform; those are all different things. I'm sure you've filled out an SAQ, a self-assessment questionnaire, before if you've done PCI. The healthcare side is very light compared to the other ones. HIPAA, without HITRUST, is a very easy process to go through compared to PCI and the others. It still has sensitive data, because you're dealing with PII, right? The harder part is just making sure that stuff is encrypted. That's it at the end of the day: make sure things are encrypted, both at rest and in transit. Then there's the agreement you have to sign between you and the party you're dealing with, a BAA, a business associate agreement. It's literally just a contract form, usually two pages long, with some stipulations saying what you have to deal with and how you're gonna deal with it, and it has some incident response stuff inside of it. That's about it; pretty straightforward. So, tools. OpenSCAP is out there. God help you if you try to write policies for it. It's a pain, but it is there. It is customizable. 
It is open source. You can install it on any Red Hat system just like you'd apt-get anything on Ubuntu. The SCAP content comes out of NIST, so you can get some policies and some control sets by default. You can write your own. You can create regression testing; if you wanna drive it out of Jenkins and then do code review, all that stuff is here. We talked yesterday about the code review module on Drupal actually hooking OpenSCAP into that. It's a module, so whenever you do a checkout from Jenkins or something happens, some event fires, you can run some of these policies on top of that and they can fail the build. Things of that nature is kind of what the process seems like. Another tool is Nessus. If you've never heard of it, it wouldn't surprise me; Tenable is the name of the company that owns it. It is commercial, and fairly inexpensive. At the end of the day, what it does is credentialed scans, or scans on your database or your code for vulnerabilities. For PCI, you have to have an ASV, an approved scanning vendor, actually scan your site, every, I forget how frequent it is, but it's pretty frequent. Tenable is actually one of those ASVs, like Qualys, if you've ever dealt with Qualys before; that type. In DoD space, STIGs are popular, which are awful. Again, if you haven't caught the recurring theme: security is not fun. It's consumed my life for many, many years, and as much as I love doing it, it's still a challenge in every aspect. They are DoD, DISA-backed. You can download these things and apply the templates to your boxes. In the Linux space, only RHEL, Red Hat Enterprise Linux, has STIG templates. Ubuntu, Debian: nothing. I mean, there's nothing. You can write your own, but they won't be certified. In the Windows space, there's obviously stuff for Windows. Again, you can go look at these things if you want. 
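The Jenkins idea mentioned above, run a scan on some event and fail the build on bad findings, boils down to a simple gate. Here's a hedged sketch; the findings format is invented for illustration (a real OpenSCAP run emits XCCDF XML results you would parse instead), and the function name is made up.

```python
# Hypothetical CI gate: fail the build when a scan reports any
# high-severity failing rule. A real pipeline would parse OpenSCAP's
# XCCDF results file; the tuple format here is made up for the sketch.
def scan_gate(findings):
    """findings: list of (rule_id, severity, result) tuples.

    Returns 0 when clean, 1 when any high-severity rule failed.
    """
    blockers = [f for f in findings if f[1] == "high" and f[2] == "fail"]
    for rule_id, _, _ in blockers:
        print(f"BLOCKING: {rule_id}")
    # A non-zero exit status is what makes Jenkins mark the build failed.
    return 1 if blockers else 0
```

In a Jenkins job you'd end the step with something like `sys.exit(scan_gate(results))`, so the non-zero return fails the build.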
There's also a STIG viewer out there that someone wrote that you can use to actually try to figure out what those things are. Again, it is extensive. Logging is obviously something I've talked about a good bit here. Some tools you can have in open source land: ELK, right? Elasticsearch, Logstash, Kibana. The hard part there is you have to know how to do the Kibana and Logstash stuff; Elasticsearch is fairly simple. Splunk has an enterprise and a free version. You can set up alerting inside that; you can do all kinds of fun stuff in there to basically automate your life into your inbox, to the point where you have so many alert emails coming in that you won't be able to do anything else. And then OSSEC is the other tool you can use, which I think Trend Micro is the one that backs. It's basically a host-based IDS, so you can see file integrity and things of that nature, and it also acts from a firewall perspective as well. All right, I think I actually did 35 slides in 30 minutes, so that was right on time. Three minutes left before the next people come up. Any questions? If you could, stand at the mic so we can get it recorded. Sweet, I bored you to death. All right, cool, thank you. How's everyone doing today? Good? So, how many of you shop online? Pretty much the whole room. I've been asking that question for 10 years, and over the last 10 years, obviously, more and more of the room has said yes. To the point that I was just driving by my local Sport Chalet and it was going out of business, and I realized that I probably haven't walked in there in two years, just like everyone else. And what I've also realized is it works the same way with user experience and the websites you shop on. So you all shop online now; how many of you have your favorite websites to shop on? 
Right, because you've decided that's the user experience you like. It used to be, I don't want to get in my car and drive to a store anymore, it's too inconvenient. It's now come down to a matter of seconds, right? Well, this website is two seconds faster than that website, so I like that user experience better; I'm gonna go on that website. That's where we've come to. And studies show that you convert more on sites that are faster and where the user experience is better. User experience is more than just the look and feel of the website. There's a piece of it which is just the passage of time and the patience we've all lost over the last number of years waiting for a website. So it's become very important for your website to have high performance. The greatest benefit of Drupal is that it's open source, but that's also its greatest weakness, right? Because anyone can do whatever they want to it. They can customize it however they want, which in the right hands is an amazing thing, and in the wrong hands can cause more problems than almost anything I've seen over my career. So even though Drupal's speed can set your revenue on fire, it can also slow you down and cost you a lot of money. So how does performance impact your business? Did you know a one-second delay reduces conversions by 7%? They looked at this a couple of years ago, and I saw the research, and it's pretty amazing, because we're talking about what could be up to almost two and a half million dollars annually for a site that sells $100,000 a day, which is a lot of money. And what we're gonna do here today is start at the beginning, going through some of these numbers generically, but then we're gonna show a case study with real-world examples from one of our clients, to show you that the better your performance is, the more people come to the site and the better conversion rates you'll have. 
Is Mark in the room from Perfectly Posh? So the case study we're gonna do is for our client Perfectly Posh. We have one of their employees here with us, and Christian's gonna come up in a minute after we go through this, and we'll walk through the case study of bringing them from another hosting company over to the platform and the success they've had on the performance side. And what's really important to understand is that performance isn't just related to the hosting company. That's very important, but just as important is the partnership among the client, the hosting company, and the developer. In Posh's example, they are also the developer; they do it internally. But having a partner for the development side is just as important. So what you'll see is that as page load times increase, you have lost opportunity, and again, we'll be able to show this in real-world examples with Posh. What was happening on the other hosting company is that when they had a rush on a flash sale, pages would take seven, eight, nine, 10 seconds, and as more people showed up, the time to check out was getting longer and longer because of the architecture they were on. So people were bouncing, and sales that should last an hour or two were lasting three, four, six, eight, 24 hours because the site was slowing down. So conventional wisdom today, and we were talking about seconds, says that you should have a sub-two-second page load time, which personally I think is unrealistic because of how complicated websites are today. I think it should be closer to about four seconds, and I'll explain that to you in a second. So, does anyone know what the IR 500 is? That's the top 500 largest e-commerce websites in the world. I spent some time reviewing them and looking at the average page load time in 2014. 
When I did this, a couple of years ago, the average was 3.67 seconds, and what's happened since then is it's actually gotten slower. So what are some of the ways it can get slower? I'll explain. I had a client call me once and say, my website's slow. I said, let me take a look at it. I brought it up, did a trace of the site, and it was 1,050 calls to generate the page, something like 17 seconds to load. He sold books, and I said, how many books do you have? What's your SKU count? He said 1,000. I said, you see the scroll on the bottom that shows about six books? When you scroll through it, it's actually loading all 1,000 books. What you need to do is lazy load, so it just shows the six that are there, and as you scroll it loads the others. He did that, and it took the page from 17 seconds down to about one and a half seconds. These are all things you have to be cognizant of, both on the hosting side, where you can do things like caching and compression, and just as importantly on the site itself: the number of calls on the web page, the size of the page. All of these things in concert are what make the user experience great. And as I mentioned, to really do this correctly, you have to have a true partnership. You can't just hire a hosting company and walk away: I give them the site, they're hosting it, from there on out it's their problem. Or: I've hired a development company, they're supposed to develop my site. It's a partnership, and that partnership doesn't just include the client. It includes Drupal, if you're choosing that as your application; it includes the Drupal developer; and it includes the hosting partner. And what you really need to do is work together in order to have the best experience. 
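In server-side terms, the lazy-load fix from that bookstore story amounts to serving one slice of the catalog per request and letting the front end ask for the next slice on scroll. A toy sketch (the function, the six-item default, and the catalog are all illustrative, not the client's actual code):

```python
# Toy sketch of lazy loading a catalog: return only the slice the
# viewport needs, instead of rendering all 1,000 SKUs on one page load.
def product_page(catalog, offset, limit=6):
    """Return up to `limit` items starting at `offset`."""
    return catalog[offset:offset + limit]

books = [f"book-{n}" for n in range(1000)]  # stand-in for 1,000 SKUs
first_screen = product_page(books, 0)       # 6 items served, not 1,000
```

Each scroll event then requests the next offset, so the initial page load carries six items' worth of work instead of a thousand, which is the whole 17-seconds-to-1.5-seconds difference in the story.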
So when we went to Posh — this was with Dallas, who's not here today; that's Mark's boss — they came to us, and we architected a custom stack built for them because of the unique way they do things, which I'll explain in a second. They needed someone who could scale very quickly, and their hosting company at the time couldn't do that: they were either having a lot of downtime or longer and longer page load times. So we were able to architect a custom stack just for them, which is important. What does Perfectly Posh do? They have something called a splurge sale once a month. They have 45,000 independent consultants who all get the opportunity to buy one item. There are 25,000 units, and that one item might go on sale from $25 down to six, seven, eight, nine dollars, and they all rush to the site to buy it before it runs out, because once it's gone, it's gone for good. This caused a rush to the site, but that rush was crashing the site or slowing it down, so something that should have sold out in a couple of hours was taking multiple days. They went live with us in March, after we custom-engineered a plan for them, and after the first splurge this was the feedback we got from their independent consultants: things like "magical" and "easiest splurge ever" — all of which shows you pretty clearly the problems they were having at the other hosting company. And what was the problem? They were throwing hardware at it, and they were throwing the wrong hardware at it. There were really multiple issues. At one point they had 20 web heads; at some points they had 30 and 40. I can tell you we're currently at 10 and we're not running out of resources — actually, we just added an 11th. They were at 30 or 40, throwing the wrong hardware at it. It's crazy.
So what did they do when they moved to Platform? I'm going to bring Christian up now, and he's going to walk us through what we did, in the different phases, to get them from all the problems they had on the other guys to the most magical splurge ever.

Hi everyone. The typical deployment we do for an enterprise hosting solution has three hosts, all load balanced across each other, for web requests, for database requests, for caching, and for search, if you use a search service like Solr, for example. With Posh we wanted to scale out beyond that because they had so many requests coming in — we knew they were typically at 20 web heads with their previous solution. So we needed to do something slightly different. We went through their requirements, looked at where they were bottlenecked according to the performance reporting from their previous host, and worked out a solution we thought would work. But we left a lot of time to plan, evaluate, and test that solution before the go-live point, because we wanted success out of the gate. We didn't want to have to provision two of these environments for testing in the future — that would have been very expensive for them — so we left room to use the production environment in a test phase before we deployed. Let's see, I'll leave this up here for a sec. So the first thing we did: because we needed to load balance across more web heads, we deployed an additional set of web heads that are all the same size, which makes them easy for the Elastic Load Balancer on AWS to work with, and we left what we call the core servers — the original three hosts — to serve just the database, caching, and Solr. They weren't serving any more web requests, and the main reason we did that is both for the load balancing and because we wanted them to use their full amount of CPU and I/O.
We put those hosts on AWS instances that are storage-specific: they use local SSDs rather than EBS, the network-attached storage on AWS, because we wanted high I/O performance. They had seen a lot of database trouble at their previous hosting company, and by splitting those roles we were able to optimize each set of hardware for the performance results we needed. We also planned to use Redis for caching, which is what we typically support — they had been using Memcache. We'll come back to this later; it becomes important. And we decided to add Fastly as a full-page cache. The advantage of Fastly — essentially, Fastly is Varnish as a service — is that it let them implement a few things in a custom VCL file, a custom Varnish configuration, where pieces of the page are served from the cache and other pieces are lazy loaded into the page via AJAX. So this is the system. Normally it would just be the three hosts you see in yellow, green, and blue — app one, two, three and DB one, two, three would all be together. By moving those apart, we were able to pick the correct hardware for the DB hosts and the correct hardware for the application hosts. As Doug said, we initially structured them with 10 application hosts that just run nginx and PHP-FPM; we have since added an 11th, just last week. The DB one, two, three hosts run on instances optimized for I/O that are also much larger in CPU terms. Those hosts run MySQL in a Galera cluster. They also have Redis instances and Solr instances that know how to fail over to one another in a high-availability setup. And there's a little backup host off to the side that is also replicating from the Galera cluster — we should adjust this diagram slightly — and that's just a read-only copy.
So in the event of a complete, disastrous failure of all three database hosts, we still have a backup we could restore from that is isolated from those machines. That was the custom engineering on the hardware side. Then we said, okay, we think we're ready to move into a testing phase. We looked at a number of load testing tools and decided to go with BlazeMeter — some of the engineers at Posh had used BlazeMeter before. We said, we need you to write the initial JMeter scripts that will drive this performance testing, because while we know what a Drupal Commerce application looks like, their application was quite custom. Their workflows were custom. We needed their expertise to say: this is what a customer workflow will be, these are the pages it will hit, this is what a sale experience looks like to us. When that was done, we would toss it into BlazeMeter and say, okay, I want to run 600 of these at a time, and observe what happened. And we had enough time spaced out before their launch that we were able to do this repeatedly and evaluate, in all of our monitoring tools — mostly New Relic, but also Blackfire and others — where the difficulties were. (Whoops, what did I do? I think your laptop suspended. I'm clearly not exciting enough for this laptop.) So this is an example we have from New Relic, across both sets of hosts. On the right side you can see another thing we discovered: in high-load situations they were crashing due to Memcache becoming unavailable, and the majority of the page request time there is from Memcache — it was taking almost 80% of the request time. That's because the Memcache instances they had were not load balanced and simply were not enough.
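The load-testing loop Christian describes — run hundreds of scripted customer workflows concurrently, then look at where time goes — was done with JMeter scripts executed by BlazeMeter. As a toy illustration of the same idea in Python (every name here is invented, and a `time.sleep` stands in for real HTTP calls):

```python
# Toy version of the load-testing idea: run many copies of a scripted
# customer workflow concurrently and collect per-step latencies.
import concurrent.futures
import random
import statistics
import time

def customer_workflow():
    """One scripted visit: landing page, product page, checkout."""
    timings = {}
    for step in ("landing", "product", "checkout"):
        start = time.perf_counter()
        time.sleep(random.uniform(0.001, 0.005))  # stand-in for a real HTTP call
        timings[step] = time.perf_counter() - start
    return timings

def run_load_test(concurrency: int = 50):
    """Run `concurrency` workflows at once and summarize checkout latency."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: customer_workflow(), range(concurrency)))
    checkout_times = [r["checkout"] for r in results]
    return {
        "runs": len(results),
        "checkout_p50": statistics.median(checkout_times),
        "checkout_max": max(checkout_times),
    }

report = run_load_test(50)
```

The point of the exercise, in the talk as in this sketch, is the report at the end: you repeat the run, change one thing, and compare the percentiles.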
What we discovered as soon as we put this on Redis on our side was that this was actually a bandwidth issue: the hosts talking to one another only have limited bandwidth within AWS, and with that number of requests they were peaking at well over one gigabit per second of raw data transfer — because Posh's developers had done a pretty good job and were caching absolutely everything they could, which is normally a good decision, and you don't anticipate that bandwidth is going to be your problem. So what I did is quickly throw together an upgrade to the Drupal Redis module that compressed that data and uncompressed it on the fly. This has a CPU cost, but we said: adding CPU is easy, adding bandwidth to AWS instances is hard, so we'll solve the problem the quick and dirty way. And that worked extremely successfully. Because we were working so closely together during this load testing phase, we were able to identify and fix that within a matter of days. Once we got that solved, we saw a much more normal request profile, with the majority of the time spent in MySQL requests — the places you would expect for a Drupal application — and we were able to knock the average time down substantially even under very high load. Another thing we ran into, and this shows how effective the partnership was: we saw initially during the load testing that they were hitting database calls that were locking in a way you would not normally expect — a particular table was being locked, and that was slowing down all subsequent requests. Well, as it turned out, the load testing script itself was assuming a profile that was not accurate. It assumed all of these requests would go to a single consultant's website rather than being distributed among their 45,000 consultants, so they were really testing as if the company were several orders of magnitude larger than it actually is.
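The quick fix Christian describes — compressing cache payloads to trade cheap CPU for scarce inter-host bandwidth — can be sketched generically like this. The real change lived in the Drupal Redis module; here zlib and a plain dict stand in for the module and the Redis connection:

```python
# Sketch of trading CPU for bandwidth: compress cache entries on write,
# decompress on the fly on read. A dict stands in for Redis; the actual fix
# was an upgrade to the Drupal Redis module.
import json
import zlib

class CompressedCache:
    def __init__(self):
        self._store = {}  # stand-in for a Redis connection

    def set(self, key, value):
        raw = json.dumps(value).encode("utf-8")
        self._store[key] = zlib.compress(raw)  # fewer bytes cross the network

    def get(self, key):
        blob = self._store.get(key)
        if blob is None:
            return None
        return json.loads(zlib.decompress(blob).decode("utf-8"))

cache = CompressedCache()
page_fragment = {"html": "<div>" + "consultant storefront " * 200 + "</div>"}
cache.set("fragment:123", page_fragment)

raw_size = len(json.dumps(page_fragment).encode("utf-8"))
wire_size = len(cache._store["fragment:123"])
# Repetitive markup compresses well, so wire_size is far below raw_size.
```

Cached HTML fragments are highly repetitive, which is why this trade paid off so dramatically at one-gigabit-per-second transfer rates.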
With multiple pairs of eyes on it, we were able to identify that relatively quickly and get it fixed. That sort of back-and-forth, almost agile process and partnership enabled us to move forward very quickly. Here are some comparison numbers, which we got from Dallas, their engineering director. The average time for their queries under similar load went from two milliseconds down to 0.2 — eight times faster with Redis — with substantial improvements at the database layer as well. And as I mentioned, Memcache had been accounting for a lot of that time. We suspect the problem with Memcache at the other host was similar to the bandwidth problem we discovered; we didn't have enough information to confirm it, but it's a strong suspicion. Having gone through that load testing, we felt relatively well prepared. Go-live was on March 5th. We did a prep call for the first splurge, on March 21st, and decided we needed to upsize the database hosts. For us, upsizing is a relatively simple process, because everything is in a high-availability setup: if I turn off one of the hosts, the other two respond normally, and if the host I turned off happened to be the leader, it fails over automatically. So we upsized one of the database hosts and brought it back up — in their setup, because they're using SSDs on ephemeral storage on Amazon, it takes about 10 to 15 minutes to sync all of the data back over to the rebooted host — and then we made that larger host the primary so it could accept reads and writes. The splurge went well. The site was under severe load for roughly 17 minutes; during that time about 14,000 of the 25,000 units sold, and at that point the site wasn't under any stress anymore.
At that point the remaining units sold out over the course of the next two to three hours, to get through the 25,000. In comparison, the previous splurge, in February, was a total disaster: the site went down and could not be restored for many, many hours, and it took several days for those items to sell out. The January splurge was not as bad at the other hosting company — they had somewhat fewer requests per minute — but the server time to fulfill those was roughly —

[At this point the presentation laptop ran out of battery. "Sorry everyone, we're almost done though. We're extremely good at DevOps, I promise." After a couple of minutes of audience Q&A while power was restored:] We're back, sorry guys.

Exactly. So we saw load times at roughly 20 to 25% of what they had been, at both the database level and the server level — the time to take a request and load a page had been reduced by nearly 75 to 80% for the March splurge. And like I said, that was about 17 minutes of very heavy load, where the database hosts were essentially maxed out on CPU; that was the bottleneck. We took steps to alleviate that, both identifying problematic queries and making upgrades on our side at the database layer. So approaching the April splurge, we did the same thing: we upsized the database hosts and prepared for the moment of truth.
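The upsize procedure described above — take one clustered host down, let leadership fail over, resize it, let it resync, then promote the larger host — can be modeled as a small state machine. This is purely a toy model of the workflow; host names, sizes, and the quorum rule are illustrative, not Platform.sh internals:

```python
# Toy model of a rolling upsize on a three-host HA database cluster:
# take one host down, the other two keep serving, resize, resync, promote.
class Cluster:
    def __init__(self):
        self.hosts = {"db1": "large", "db2": "large", "db3": "large"}
        self.online = {"db1", "db2", "db3"}
        self.primary = "db1"

    def can_serve(self):
        # A Galera-style cluster stays available while a quorum is up.
        return len(self.online) >= 2

    def upsize(self, host, new_size):
        self.online.discard(host)          # take the host down
        if host == self.primary:           # leadership fails over automatically
            self.primary = next(iter(self.online))
        assert self.can_serve()            # the other two respond normally
        self.hosts[host] = new_size        # resize the instance
        self.online.add(host)              # rejoin; data resyncs (~10-15 min)
        self.primary = host                # promote the larger host

cluster = Cluster()
cluster.upsize("db1", "xlarge")
```

The key property, as in the talk, is that the cluster serves reads and writes at every step — the only cost is the resync window on the rebooted host.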
And during the April splurge we saw much different performance characteristics. Peak requests per minute went up from just above 7,000 in March to just above 9,000, their gross sales per hour went up 31%, and they saw higher peak users on the site. This meant the site was under load for a full 45 minutes rather than just 17. Response times went up slightly because of the increased concurrency, but they sold out all 25,000 items within those 45 minutes, rather than the three-plus hours we saw just the previous month. So what we've seen is an immediate change in customer behavior: as soon as these customers realized the hosting problem had gone away, they knew — oh, we need to get in here and get these items really quickly, because otherwise they'll be gone. We saw that and realized we hadn't prepared quite adequately, because under that sort of load we saw increased response times. So we explored additional things we could do, which included increasing the pool of PHP workers, adding an additional web server, and asking Posh to look at a specific database query that was causing a lot of trouble. We also added a new feature — this isn't on the slide — which allows us to load balance read requests among the three database hosts, which are typically there just for high availability. What this endpoint allows us to do is say to the application layer: okay, use this as your secondary database in the Drupal configuration. Drupal has the ability to point many of its queries — views in particular — at a secondary database.
On our side, that secondary database endpoint load balances across all three of the database hosts, so we anticipate that the next splurge will see all three database hosts under substantial load, but nowhere near the maxed-out load we were seeing with just the one. So we feel very well prepared, and we're hoping to get the sale down from 45 minutes to 20 or 30, depending on customer behavior. All right, we have just a couple of charts showing bandwidth per minute. I think we peaked at about 500 megabytes per minute of bandwidth coming out of Fastly, and around 30,000 requests per minute at Fastly, which translated into just over 9,000 at the origin — so Fastly was serving the bulk of the requests as cached static assets and cached page frames that were then filled in with customer data after the fact. The takeaway here is that performance is driven by both components: what you code, and how well the infrastructure is suited to it. The partnership we've offered on the Platform.sh side — working with customers on the back and forth of fixing each of these issues in tandem and making progress on them together — is something that has benefited Posh immensely. Their conversion rates and their revenue are directly tied to that performance. This is an extreme example, obviously; most sites don't have the sort of load where they're going to crash during a sale, but that's why this was such a great experience — we knew the effects would be dramatic if we could solve these problems. The profiling step was really important: being able to say, we know exactly where these problems occur and which ones we need to solve. We leveraged server-side tools to do that, both to see where the problems were and then to go back, analyze the code, and optimize it.
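The read/write split described above — Drupal sends writes to the primary and many reads (views in particular) to a secondary endpoint that fans out across the database hosts — can be sketched as a small router. This is a generic illustration, not the Platform.sh implementation; the endpoint names and round-robin policy are assumptions:

```python
# Sketch of read/write splitting: writes go to the primary, reads are
# round-robined across replica hosts behind a secondary endpoint.
import itertools

class QueryRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        # Writes must go to the primary; reads can be spread across replicas.
        if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.primary
        return next(self._replicas)

router = QueryRouter("db1:3306", ["db1:3306", "db2:3306", "db3:3306"])
targets = [router.route("SELECT * FROM node") for _ in range(3)]
write_target = router.route("UPDATE node SET status = 1")
```

In a Galera cluster all three hosts can technically accept writes, but routing reads this way is what spreads the CPU load Christian describes, so no single host maxes out during a sale.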
And so we feel that partnership has been a benefit both to us, in being able to improve our product, and obviously to Perfectly Posh, in being able to improve the experience for their customers — and it's clear to us that their customers agree.

I actually want to go back for one second, because I love this graph. If you think about it, we said that on the other guys they were peaking at about 6,000 requests per minute and then crashing; we got up to 30,000 going through Fastly. That gives you a sense of the magnitude of the performance benefit we provided them. And it's working still — we've been able to scale it five times already within two months of going live, and it's going to keep getting better. If you want to ask Mark, he's back there, and I'm sure he'll answer a few questions — I could even bring him up if you have any questions for him now. Nothing? So we're at booth 524 in the exhibit hall, right by the coffee. If you have any questions or want to see a demo of Platform, Christian is giving demos all day, so please come by. And as a side note, we already host some of the largest live Drupal 8 sites in the world — I think we have one with five or six million page views a month, which isn't massive, but for Drupal 8 it's one of the largest out there — and we're just going to keep adding more and getting bigger. So if you have any questions, please come to the booth. Great.