Good morning, everybody. Today we're going to be talking about performance optimization. My name is Scott Anderson. I live in Port Douglas in North Queensland, and I work for Technocrat. Rather than bore you with 30 slides of technical details and instructions, what we're going to do today is tell a story. This story is about a project I worked on for Adairs.com.au. Some of you may be familiar with Adairs: they have stores in a lot of shopping centers, and they sell bedding, sheets, quilts, lots of exciting things like that. And they have an e-commerce site, which gets quite a lot of traffic during sales. What I hope we can get out of today's session is to know what to do when you've done all the right things performance-wise and your site is still running slow. That's what we found with this particular project: all the normal things that you know to do to get your site running fast, we did those, and it was still terrible. So what do you do next? We're going to look at some tests we can use for diagnosing performance issues, and some tips for analyzing test results. A lot of the tests you run when you do this kind of work spit out tons of data, and it can be very tedious looking through it all and trying to find what you actually need to find. So we'll look at some ideas for that, and at some tools that can assist us in testing, analyzing, and debugging our site. So this is a Drupal Commerce D7 website which we inherited. On the outside it looks pretty nice, and you wouldn't know that there were major issues with it. On the inside, it had some scary problems. When we first got it, there were a lot of bugs which we had to work through, serious ones that caused incorrect prices to be charged and things like that, which was obviously not good. But even once those were resolved, there were some significant performance issues.
And the performance problems were most notable during these triple discount sales. These sales happen twice a year, they run for a period of about two to three days, and they start at 4pm on the first day. At that time, over a thousand concurrent users converge on the site. The traffic ramps up really quickly, and people are desperate to get massive discounts on their sheets, so they start checking out as fast as they possibly can. We'll look at it in more detail in a minute, but just quickly: the site overall was quite fast, because we had a Varnish cache on the front, so a lot of the pages would load quickly. But if you tried to add an item to the cart, or load the checkout page, that would load really slowly, during these sales in particular. So in preparation for these sales, we did all the normal things that you do when you're trying to get a Drupal site to run fast. We set up the Varnish cache on the front, and we configured it so that even after a user had added an item to the cart and now had a session, we would still strip out the session cookie from the request and return them the statically cached Varnish page. Then we'd load the cart item counter in the top corner by Ajax. So that was the one personalized little piece of content, and the rest of the pages, all the product pages and category pages, were still being returned by Varnish. That was important, because it took 90% of the requests away from the web servers and served them from the Varnish servers. We also set up Memcached, which handles all of Drupal's caching needs, and we set up Redis, which is similar, but we used that just for the sessions and for the form cache. We also migrated the site to an AWS environment, and I'll show you a diagram in a minute.
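To make the cookie-stripping idea concrete, here is a small Python sketch of that caching decision. It is an illustration only, not the actual VCL we ran: the path prefixes and the `SESS` cookie prefix (Drupal's default session cookie naming) are assumptions for the example.

```python
# Sketch of the Varnish-style caching decision described above.
# Path prefixes and the "SESS" cookie prefix are illustrative, not the
# real configuration.
PERSONALIZED_PATHS = ("/cart", "/checkout")

def cacheable(path):
    """True if the request can be served straight from the Varnish cache."""
    # Cart and checkout are the only pages that must reach the web servers;
    # everything else is served from cache, with the cart item counter
    # filled in afterwards by an Ajax call.
    return not path.startswith(PERSONALIZED_PATHS)

def backend_cookies(path, cookies):
    """Cookies forwarded for a request: the session cookie is stripped
    from cacheable requests so every user hits the same cached page."""
    if cacheable(path):
        return {k: v for k, v in cookies.items() if not k.startswith("SESS")}
    return dict(cookies)

print(cacheable("/products/flannelette-sheets"))          # True
print(backend_cookies("/products/sheets", {"SESS1a2b": "x"}))  # {}
```

The point of the design is that having a session no longer disqualifies a request from the cache; only the handful of genuinely personalized paths bypass it.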
It was an autoscaling AWS environment, so when you hit a certain threshold, the web servers would autoscale: fresh instances were created from images, and it could theoretically grow as large as it needed to to handle the traffic. We also had the ability to scale up in advance, because we knew these sales were coming. We had a Jenkins task that would run and increase the size of all the AWS services. RDS is the AWS MySQL database service, so you could scale that up before the event, and what we were doing was scaling up to the largest RDS database that AWS provides. So during the sale in October 2015, despite all this preparation, we still managed to crash the database, the largest one that AWS has, and during the peak traffic periods New Relic was showing server response times of 15 to 18 seconds. The site was still loading fast on the product pages, but as soon as somebody went to the checkout to complete their order, they'd hit the checkout button and, at least 15 seconds later, the page would load, which is obviously not ideal. So this is a diagram of the AWS infrastructure. At the top is Squixa, which is a CDN cache that's returning all the static assets. Then a load balancer, then the Varnish servers, then another load balancer, and then the web server layer, which was autoscaled. During the large sales we actually got up to 100 web servers running, which is really an indication of very poor code, not something to boast about. There are also a couple of Solr servers, don't worry about those, and on the left there you can see the RDS database and two ElastiCache servers, which were running Memcached and Redis. And that RDS database, being the largest one available, has 244 gigs of memory and 32 virtual CPUs. So what this proved to us is that you can have the most amazing and powerful environment available, and if your code is rubbish, you can still crash the site. So we had to diagnose the problem.
Obviously you can't fix this during the sale, there's not enough time, so the task after that was: okay, now what do we do? We'd set up everything, Varnish was working as well as it possibly could, and we couldn't scale the environment up any further; running on a LAMP stack, we'd hit the ceiling. So we had to actually dig down into the site to see what was going on. It reminds me of the medical field: what we have to do is run a series of tests, then look at the test results and work out why this patient is so sick. Test one is an easy test to run, and you've probably done it before. You can use the Devel module and the Memcache module and just turn on your logs. Turn on the SQL query log and the memcache log, and all that does is print out all the queries and memcache requests at the bottom of every page. So for a particular page, you can see every single query that goes to the database and every single request that goes to memcache. Now, as I said before, you get a ton of data. There are hundreds and hundreds of queries and hundreds of memcache requests, and looking through all of that can be really tedious. So what do we actually look for in that data to identify a problem? First, too many requests for the same thing. If it's loading the same entity, or whatever it is, the same item over and over again, that's probably not a good idea; there might be some custom code in there that's doing the wrong thing and not caching where it could, for example. Second, requests for things that we don't need on this page, and in a couple of cases that was exactly what we found. You shouldn't be loading something that's completely irrelevant to the page you're on; if you are, you should find out why. Third, we want to look for database queries for items that should be cached.
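The first of those checks, spotting repeats, is easy to mechanize. As a rough sketch of what that analysis looks like (the log format here is made up, not Devel's actual output), you can tally a page's query log and flag anything requested suspiciously often:

```python
from collections import Counter

def suspicious_requests(log_lines, threshold=10):
    """Count identical queries/cache requests from one page load and
    return the ones repeated more than `threshold` times -- a hint that
    some code path is reloading the same item instead of caching it."""
    counts = Counter(line.strip() for line in log_lines if line.strip())
    return {query: n for query, n in counts.items() if n > threshold}

# Made-up log: one entity being loaded 37 times in a single page load.
log = ["SELECT * FROM node WHERE nid = 42"] * 37 \
    + ["SELECT name FROM users WHERE uid = 1"]
print(suspicious_requests(log))  # flags only the nid = 42 query
```

The threshold is arbitrary; the useful part is that a sorted tally turns hundreds of log lines into a handful of suspects.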
So if this is a warm cache request, if you've already got your caches warmed up, then what you want to make sure is that everything that should be getting cached, is. You don't want to see database queries going out for, say, product information, a product entity, because that should already have been cached. And fourth, you don't want unnecessary updates to cache items. We were seeing cache sets on page refreshes: the caches had already been set, but for some reason they weren't getting used and were being set again and again. So this is the first problem we came across, on loading the checkout page, scrolling through the massive list of SQL queries. We came across this one, which looked suspicious. If you look at what it's doing, it's selecting from the coupon code field data table, where entity ID is IN, and then there's a very long list of entity IDs. I'll just switch over and show you what it actually looked like. Here's the checkout page with the SQL query log, and when we get down further, we can actually see these really ugly queries. Now, you probably can't see it from there, but there are 2,141 entity IDs in that list, which happens to correspond to the number of coupon codes in the database. So the question was: why are we loading every single coupon code on the first load of the checkout page? At that point, no coupons have even been entered. This falls into the category of loading data that we don't need. Not only that, it's not just one query; there are multiple queries, and it goes on for a very, very long time. You can scroll and scroll. Anyway, I'll stop scrolling, because that would take the rest of the session. So, was I excited when I saw that? Absolutely, because I thought, okay, this is it, we've fixed it. So how did we fix that?
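Conceptually, the difference between the behaviour we saw and the behaviour after patching looks like this. This is a Python sketch of the idea, not the Commerce Coupon module's actual code; the order structure and function names are invented for illustration:

```python
# All coupon codes in the database -- imagine ~2,141 of them, as we had.
ALL_COUPONS = {f"CODE{i}": 0.10 for i in range(2141)}

def load_coupons_eagerly(order):
    """Buggy pattern: pull every coupon row on every checkout page load,
    whether or not the order references any coupon at all."""
    return dict(ALL_COUPONS)

def load_coupons_lazily(order):
    """Patched pattern: only look up codes actually attached to the order."""
    return {code: ALL_COUPONS[code]
            for code in order.get("coupon_codes", [])
            if code in ALL_COUPONS}

fresh_order = {"items": ["sheets"], "coupon_codes": []}
print(len(load_coupons_eagerly(fresh_order)))  # 2141 rows for nothing
print(len(load_coupons_lazily(fresh_order)))   # 0
```

With five coupons in the database, the eager version is invisible; with thousands, it's thousands of rows fetched on every single checkout load.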
Well, fortunately, I was able to search and find a patch for the Commerce Coupon module that prevented every coupon loading on every checkout page. Obviously it's not a major issue if you only have a few coupons; if you're just loading five coupons, you wouldn't even notice. But because of the number of coupons, and some other factors which we'll see later, it was causing major issues. So that was an easy fix: with the patch, that problem went away. The only reason it hadn't been done earlier was custom code in the site that relied on that particular version of Commerce Coupon to work; it would have been updated earlier had that custom code not depended on it. Okay, test two. At that point I thought, okay, we're winning, and my job is almost done. So now I'm going to do some load tests and prove that it's all good. For the load tests, I used JMeter scripts to try to replicate real-life transactions: basically, add a product to the cart, go to the checkout page, and complete the checkout process. I used a service called redline13.com; if you've never seen it and you want to do load testing, I highly recommend it. On the front page it says "almost free load testing", and it is. All you need to use it is your own AWS account. You plug your AWS credentials into it, you can load up your JMeter scripts, and it asks you how many EC2 instances you want to spin up, what size, what regions, and then it just runs the script for you. It gives you real-time results while it runs, and at the end it gives you some pretty charts. Very useful for load testing, and not expensive at all, because we probably ran 300 load tests by the end of this project. Now, obviously, when you're running load tests like that, you can't use the SQL query log, and you can't use Xdebug to step through the code while a load test is running.
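In spirit, a load test is just many concurrent simulated users timing their requests. A toy version of that idea, pointed at a throwaway local server rather than a real site (nothing like a full JMeter test plan), might look like:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen

# Throwaway local server standing in for the site under test.
server = ThreadingHTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

def one_user(_):
    """One simulated user: fetch the page and time the round trip."""
    start = time.perf_counter()
    with urlopen(url) as resp:
        resp.read()
        status = resp.status
    return status, time.perf_counter() - start

# 8 concurrent "users", 40 requests in total.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(one_user, range(40)))
server.shutdown()

statuses = [s for s, _ in results]
timings = sorted(t for _, t in results)
print(statuses.count(200), "OK responses")
print("median response time:", timings[len(timings) // 2])
```

Services like redline13 do essentially this at a vastly larger scale, distributing the simulated users across EC2 instances and aggregating the timings for you.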
So you need some way of looking at what's happening on the inside during the load test. For that we used New Relic, and I'll show you in a second how you can use it. It's super useful for identifying what's happening during a load test that's causing things to run slowly. What we're looking for are slow requests and transactions, and slow function calls within a transaction. Let me quickly switch to New Relic; this is for a different site, but we'll look at it in a second. What happens is you run your load test, and you'll be able to see on your New Relic chart where the load test starts and ends, so you have this period of probably increased response time. What you want to do is highlight that period, so that you're only looking at the data within it. So you highlight the period of time you need and then click on Transactions, and on the right-hand side you'll get a list of the requests that took the longest. Find the worst ones, or the ones that are occurring the most, and click on that particular request, and it will load the details for that request. It takes a second. There we go. It tells you what took the longest time, which particular function, with a pretty chart, but the most useful thing I found was the trace details. This breaks down all the function calls, how long each function took to run, and highlights the ones that ran slowly. That's super useful, because with that you can figure out where the slow point is, and you can start to investigate in the code why it's slow at that point. So that's the way we ran these tests, and this is what we found. This is a screenshot from the trace details page, and what you can see here is we're trying to load an entity. So we've got an entity load; it comes down through all these caching functions; then it gets to memcache and does a memcache get; and then it does a lock wait.
And I'll explain what that is in a minute. So there's a lock there, and it waits for five seconds; you can see on the left, in the yellow-orange colour, it takes five seconds. Then it finishes the lock wait, tries the get again, and now it's lock waiting again. Then it tries again, and there's another lock. So by the time you get to the bottom of the screen, it's waited 15 seconds, and this is one request. So let me explain what lock wait is. Some of you may be familiar with stampede protection. When you use memcache, you can turn stampede protection on, and what it does is this: a request comes in for a particular cache item, and when it's not there, so it's a miss, that request creates a lock for that item, and then it goes away to populate the item. Any other requests that come in in the meantime just have to wait, because they say, okay, someone's already populating this, we can wait until it's populated. Once it's populated, the lock is released and everyone gets the fresh cache item. That's what stampede protection does, and when it works, it's very useful. You can see the logic of it in this diagram. What we were seeing in that trace was that a request came in for a particular entity, stampede protection was on, so it went into that lock wait phase. But with the default settings, it waits for five seconds, and it tries that three times. So it waits five seconds; there's still a lock; it waits another five seconds; still a lock; waits another five seconds; and then it gives up and returns false, a miss. So the question we had to ask was: why are these locks not getting released? What's creating the locks and then causing them not to be released? After looking at a lot of the New Relic data, we were able to find that the most common cause was actually coming from loading a taxonomy term.
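To make that waiting behaviour concrete, here is a small sketch of a stampede-protected get, with the timings shrunk from the real five-second default so it runs quickly. It mirrors the idea described above, not the Memcache module's actual implementation:

```python
import time

LOCK_WAIT = 0.01   # stand-in for the real 5-second default
RETRIES = 3

locks = set()
cache = {}

def cache_get(key):
    """Stampede-protected get: on a miss while a lock is held, wait and
    retry. If the lock is never released, we burn RETRIES * LOCK_WAIT
    seconds waiting, then give up and treat it as a miss."""
    waited = 0.0
    for _ in range(RETRIES):
        if key in cache:
            return cache[key], waited
        if key not in locks:
            break  # nobody is rebuilding it; the caller may rebuild
        time.sleep(LOCK_WAIT)  # the "lock wait" phase from the trace
        waited += LOCK_WAIT
    return None, waited

# A stuck lock that some earlier request acquired and never released:
locks.add("taxonomy_term:123")
value, waited = cache_get("taxonomy_term:123")
print(value, round(waited, 2))  # None, after sitting through 3 full waits
```

With the real defaults, those three waits are the 5 + 5 + 5 = 15 seconds we were seeing stacked up inside a single request.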
So I had to use Xdebug and go through line by line to figure out what was happening. What had happened was, there was a particular taxonomy term that had been deleted at some stage, but it was still being referenced in a menu somewhere, so the code was trying to load it on every page. It requests that taxonomy term, goes to the cache, and it's a miss. So it acquires a lock, saying "I'm going to populate this from the database." Then it goes to the database, and it's not there either, so that's another miss. And then, instead of releasing the lock, it just says, oh no, there's nothing here. But because it forgot to release the lock, all the other requests that have come in in the meantime are now waiting 15 seconds, because there's a lock on that item. The fix for that was just a few lines of code to say: okay, if you acquired a lock and then got a database miss, release the lock anyway, so at least all the other requests can see that there's nothing there. Does that make sense? In the end the fix was five lines of code, and that made another massive difference, if you think about the impact of that mechanism. [Audience question] It hasn't been posted yet, and yes, I'm going to post a thread about it. It's in the Entity cache module, and the particular version we were using wasn't the latest. I actually tried the latest to see if that resolved the problem, and it did, but the code was quite different and caused some other issues, so I went back to the version we had and fixed that. But I'm going to post it anyway, because it might be helpful if someone comes across the same issue. Okay, so that fixed a lot of problems. Next test. This one was actually using the memcache log, so we're looking at all the requests that were going to memcache to retrieve cache items.
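Before moving on, the shape of that lock-release fix can be sketched like this. The real change was a few lines of PHP in the entity caching code; the names and structures here are illustrative Python:

```python
locks = set()
cache = {}
database = {}  # the term was deleted, so it's not in the database either

def load_entity(key):
    """Cache miss -> acquire the rebuild lock -> try the database.
    Crucially, release the lock even when the database comes up empty."""
    if key in cache:
        return cache[key]
    locks.add(key)             # acquire the rebuild lock
    try:
        value = database.get(key)
        if value is not None:
            cache[key] = value
        return value           # None means genuinely missing
    finally:
        locks.discard(key)     # the fix: always release, even on a miss

result = load_entity("taxonomy_term:123")
print(result, "taxonomy_term:123" in locks)  # None, and no stuck lock
```

Without the `finally`, the double miss (cache, then database) would leave the lock behind, and every other request for that item would sit in the 15-second lock wait shown earlier.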
And after we saw what was happening, I had to use PhpStorm and Xdebug to actually figure out why. This one follows the same tips as before, but what we're actually looking for here is requests for things that we don't need on this page. So this is a snapshot of the memcache request log, and what you can see is that it's returning these ctools_export views data items, and on the right-hand side, all these different views: commerce cart, wish list, commerce orders. You can't see the full list there, but it's actually a really long list, and what the list represented was every single view in the website. And this log was appearing on every single page that we navigated to. So on every single page, we were reloading the default settings for every view in the website. If you think about the combination of something like this with the previous issue, where requests were getting locked up, it was obviously disastrous for performance. Now, why was this happening? This was where Xdebug came in. What happens is that Ctools, on its first load, caches a version of the default settings for every view, and then that's just sitting there in cache. Then if you want to load a particular view for a page, it goes to that cache and says, I already know about this view. But what was happening was that there was a particular menu view, so it's on every page, and when Ctools tried to get this menu view from its cache, it was never there. So it would say, okay, if this view isn't in cache, we'll refill the whole cache and fetch all the views we need. And it would refill the whole cache, but it still wasn't putting that particular view in the cache. The reason it wasn't is that the view existed in the database, but not in code.
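The refill loop we were seeing can be sketched abstractly like this. Again, this is not Ctools' real code; the view names and settings are stand-ins:

```python
rebuilds = 0
cache = {}

CODE_VIEWS = {"commerce_cart": "settings", "wish_list": "settings"}
DB_ONLY_VIEWS = {"menu_view": "settings"}  # in the database, not in code

def refill_cache():
    """Rebuild the whole views cache -- but, like the bug we hit, only
    code-defined views ever make it in; database-only views are skipped."""
    global rebuilds
    rebuilds += 1
    cache.clear()
    cache.update(CODE_VIEWS)   # DB_ONLY_VIEWS never get cached

def get_view(name):
    if name not in cache:
        refill_cache()         # expensive: reloads every view's defaults
    return cache.get(name)

# The menu view sits on every page, so every page load triggers a refill:
for _ in range(5):
    get_view("menu_view")
print(rebuilds)  # one full cache rebuild per page load
```

The item that is never cached is the one on every page, so the expensive rebuild happens on every single request.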
That's how it was defined, in the database; it wasn't in code anywhere, it wasn't in a module, it wasn't in Features or anything like that. And for whatever reason, Ctools was not storing it in its cache, just because it was a database-only view. So the solution for this one was simply to export the view to a feature. Just the fact that it was in a feature meant that it was in code, and now Ctools was happy to cache it every time, and that prevented reloading the settings for every view on every page. Does that make sense? Okay, so on to the next triple discount sale. After all this, we'd done lots of load testing, and we'd load tested on the production site at one in the morning one day, swapping out the production database with a backup, and we were confident it was handling things well. One thing I learned from that, though, is that when we tried to send about 2,000 users simultaneously to the site, we crashed the load testing servers. So you need more EC2 instances for that. But yeah, I was confident it was going well, although you can never tell 100% until you get real-life traffic. So the next triple discount sale started, and I was sitting in front of a computer with New Relic open, just watching the chart, because you've got the response time there. The resting heart rate of the response time was about 300 milliseconds, so I was just watching that to see what was going to happen, and it was getting close to four o'clock, when the sale was going to start. So the sale hits, and one of our other developers sent me this Google Analytics screenshot: 1,600 people on the site. And this is what I was watching. You can see on the bottom right is the throughput, and you can see the traffic increasing significantly. And what you see in the main chart here is the server response time: a very slight increase, but in the end we were still only doing about 300 milliseconds.
If you could have seen the server response time for the previous sales (I never got a screenshot), at that point, at four o'clock, it would have spiked to 15,000 milliseconds. So that was a very exciting thing to see. Big win. This is some of the data that comes out of section.io, which used to be called Squixa, and you can see the median load times for different kinds of pages. Category pages at 2.4 seconds, product pages at 2.9 seconds, and then the checkout page at 3.9 seconds. That's a pretty awesome achievement, because all those other pages are Varnish-cached, and the checkout page is not Varnish-cached. So that was good; at the previous sale, the checkout had been sitting at about 16 or 17 seconds median load time, so that was also very heartening to see. A few resources. There's a book, if you haven't seen it, that you can buy online: High Performance Drupal. It's a great overview of all the basics for how to get your Drupal site running fast in high-traffic environments. And there's plenty of info online on JMeter, Redline13, and debugging with PhpStorm, if you've never done that. And yeah, that's all. Any questions?

MC: Sorry, guys, just with the Q&A: I'll pass around the mic, so just raise your hands, I'll bring the mic around, and just hold it to your chin. That's great. Thanks.

Q: How many servers did you need once you'd done all the optimisation? And did you need the RDS instance that big?

A: Well, that's actually one of the things I was watching; I was able to assess it because I could see what load the database was actually under. [Partly inaudible] I can't remember what size we ran it at, but it was definitely not the biggest one.
For the sale, we scaled out to what we thought we'd need beforehand, and watched how things went from there. [Partly inaudible]

MC: Put your hand up if you've got a question.

Q: With New Relic, did you also look at profiling with Xdebug and tools like that?

A: Yeah, they're after much the same kind of thing. I'd used New Relic before, and I found it a very good way to identify where the problems are at that level. [Partly inaudible]

Q: I've got a quick question; sorry if you've answered this before and I wasn't paying attention. Did it add to the company's bottom line?

A: Yeah, absolutely. [Partly inaudible answer about the results of the sales.]

Q: And did they appreciate your efforts there?

A: Yeah.

Q: I don't have a question, but thank you for the talk. Would you consider writing up some of the Ctools caching issues somewhere?

A: Yeah, I should put it on drupal.org, because there are probably other people out there suffering from it.

MC: That's all? Next.

Q: So, you were using Redis and Memcached.
What was the reasoning behind that, rather than just Memcached?

A: Two different purposes. We had one ElastiCache cluster running Memcached and one ElastiCache cluster running Redis. Redis held the sessions and the form cache, which need to be consistent, and splitting them out meant the general cache traffic on Memcached couldn't interfere with them. [Partly inaudible] You could run Memcached yourself on EC2 instead, but then you lose the managed side of ElastiCache.

Q: One more. Have you attempted any sort of server-level optimization? This whole session was about optimizing the Drupal application; have you tried server-level optimization on top of the application optimization?

A: Server-level, yeah. We did do server-level tuning as part of building out the stack, things like the cache layers we talked about. [Largely inaudible]

Q: Sorry, this isn't directly related to caching and performance, but you mentioned one of the solutions was to push a view to Features. I'm wondering, because from our workflow we build locally, push everything to Features, and push it up through git. Is there a reason you wouldn't do that initially?
A: [Answer largely inaudible.]

MC: Cool, any more questions? That's great. So please give Scott a round of applause. And don't forget, all these talks are going to be posted online at the end of the week, or end of next week, sorry, on the Drupal South website. So please share these talks with your colleagues, and you can use them to review yourself. Thank you.