Hi, my name is Josh Wahee. I work at Acquia and have been there now for coming up on 10 months, and I've been in the Drupal community for coming up on eight years. In my career I've been predominantly a developer; I moved into Drupal architecture and then out into more infrastructure work. Nowadays I work with some of Acquia's elite accounts, where I deal with enterprise stuff. There could be lots of users, lots of traffic, lots of data; it's always lots of money. Today I deal a lot with performance issues. It's a pretty common thing, and I wanted to talk about what I think are the building blocks of performance, how to understand them, and how to avoid making bad choices as early as you can, so that you save yourself money, whether that's hosting costs, development costs or support costs, and also so you keep your customers happy at the end of the day. As much as this is a DevOps track and we're talking about technical things, we do all of this to keep our customers happy, and I think sometimes we lose sight of that a little, so I'll be talking about that as well. Before we get going, though, I do have one disclaimer: this talk is about performance, and that's not the same thing as availability. Availability is about what I can do to ensure that my site never goes down, and performance is certainly a factor in that. A slow site has a higher chance of going down, so performance factors into availability, but it isn't availability itself; in fact, the infrastructural choices we make often come at a performance deficit, but we accept that so that we get high availability. I mention this because I don't want to be talking about architectures that are unrealistic or outside the industry standard, but I'll be focusing predominantly on performance rather than availability. This talk focuses on three areas.
I've broken the building blocks down into three main areas, and this is going to be fairly generic, because there are a lot of technologies around and a lot of ways of doing things, so we need to take a broad approach. The first one is understand: it's really about learning up front what the requirements of the site are. Regularly I've seen a one-size-fits-all approach in Drupal, and it's tempting because Drupal is so flexible and dynamic and can do a lot of things, and we have a great community that provides very stable ways of doing things, so we're very quick to jump into those and just move forward. The second one is build: actually building the thing, going and building against what we understand rather than jumping straight in before that part. And the third part is optimise: an iterative process of load testing, making sure the application can do what we intended it to do. So let's jump into it; the first part is understand. To me, performance means purpose-built, and that's a really hard concept to grasp in the Drupal community, because Drupal is so flexible. Drupal is often chosen because it can be applied to so many different use cases: it can be a social media site, a commerce site, a campaign site, or all of those things in one, and so we go "yeah, Drupal can do the job" and we just throw it up there. But think of Drupal like a person: every person is different, and athletes in particular focus their lifestyle and condition their body to perform the task they want to do. You're not going to see a weightlifter run the 100 metres sprint, or if you do, you'd be a very lucky person; that would be a special moment.
But the point is that if you want something that's going to do heavy lifting, you want a lot of muscle, and if you want something quick and nimble, that kind of build is better suited to sprinting. So I always try to start with understanding: what is it that you want to do, and what are the characteristics of the Drupal site, of the users, of the data you have, that will determine the kind of infrastructure you might need? Let's think about the areas we need to cover when we're talking about understanding things; again, I break those down into three. The first is data: what are you storing inside of Drupal? The users, the content, the categorisation. Sometimes you might have 100,000 users, 100,000 pieces of content and maybe 40,000 taxonomy terms, but when you're cross-categorising content with terms, you end up with millions of rows in the database because of how heavily you're using it. At that point you do not want to be leaning on the taxonomy index table. So you need to understand the kind of data you're storing, how much of it there is, and how complex the use of that data becomes. The second is understanding your traffic. Just because you have 100,000 nodes doesn't mean you're going to get a million hits per day; you might have a lot of storage requirements without necessarily having a lot of traffic requirements. So validate that: if there's a pre-existing site, look at the old analytics and see what the traffic is, because obviously what you need at the front end is going to be different from what you need at the back end. The third is automation: what kind of heavy-lifting tasks might happen that aren't traffic-bound? Things like content ingestion; things that enterprise systems might do that a number of Drupal sites might not.
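To make the cross-categorisation point above concrete, here's a back-of-envelope sketch in Python. The numbers are illustrative, not from any real site, and the one-row-per-pairing model is a simplification of how Drupal's taxonomy index table grows.

```python
# Rough sizing sketch: Drupal's taxonomy_index table stores roughly one
# row per (node, term) pairing, so heavy cross-categorisation multiplies
# rows far beyond the node count itself.

def taxonomy_index_rows(node_count: int, avg_terms_per_node: float) -> int:
    """Estimated rows created by tagging every node with several terms."""
    return int(node_count * avg_terms_per_node)

# 100,000 pieces of content, each tagged with 30 of the 40,000 terms:
rows = taxonomy_index_rows(100_000, 30)
print(f"{rows:,}")  # 3,000,000
```

Three million rows from only 100,000 pieces of content is the kind of multiplier that's easy to miss if you only count nodes.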
So not just cron running, but maybe things like bulk reporting or reconciliation with commerce payment gateways. Here are a couple of examples of what that might look like when you do these evaluations. A campaign site could be high in traffic, mid-range in data and have hardly any automation; think of an election campaign site or something driven by social media likes. You'll get a lot of traffic through, but there's not a lot of groundwork, and because it's predominantly anonymous, in most cases you can get away with a very small build, because you can leverage a whole lot of page caching. In contrast, on a community site you have a lot of authenticated users. The user rate might be lower, but you're going to need bigger servers. So in contrast to the campaign site where you had a million hits in a day, you might only get, say, a thousand users visiting the site per day, as I've put up here; but because they're logged in and there's more engagement between the server and the user, you have to have bigger resources to deal with the additional load. Finally, with a commerce site you have a lot of automation: you're making a lot of transactions, you've got to talk to the payment gateway, you've got to build reports for the business people so they can see how well the site is performing. You might want to consider taking your automation, things like cron tasks, and running it on an independent server, so that it's not impacting things when you have traffic-bound load coming onto your site. So, to summarise understand before we go into the build part: understand your application's resource requirements so that you can build the infrastructure that meets the needs of the project, and not just a stock-standard, out-of-the-box idea.
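As a rough illustration of why the mostly-anonymous campaign site above can get away with a small build, here's a sketch of how a full-page-cache hit rate shrinks the traffic that actually reaches Drupal. The 98% hit rate and the traffic figure are assumptions for the example, not measurements.

```python
# Sketch: with anonymous traffic and full-page caching (e.g. Varnish),
# only cache misses ever reach PHP/Drupal.

def backend_requests(total_requests: float, cache_hit_rate: float) -> float:
    """Requests that miss the page cache and hit the Drupal backend."""
    return total_requests * (1.0 - cache_hit_rate)

# Campaign site: 1,000,000 hits/day, assumed 98% page-cache hit rate.
misses_per_day = backend_requests(1_000_000, 0.98)  # ~20,000/day reach PHP
misses_per_second = misses_per_day / 86_400         # ~0.23 requests/second
```

A million hits a day sounds frightening, but at that hit rate the Drupal backend sees well under one uncached request per second on average; that's why the anonymous campaign profile needs so little hardware.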
So let's talk a bit about build, and I have a catchphrase here, which is "scale vertically, then scale horizontally". Have you heard those terms before in architecture? I'll talk a little about what they mean; the key thing is that you use the first before you use the second, so you always scale vertically before you scale horizontally. Again, like understand, build is broken down into three areas. First, the architecture: I'm going to cover the different kinds of architecture we have in the industry and why they're used in particular ways. The second is the software that's common in the Drupal industry. The third is the application, looking at things from inside Drupal itself; this talk actually doesn't cover the application component much at all, but those are things for your developers to look at and consider as well. So let's start with architecture and how it works in the industry. The smallest thing we have for a Drupal build is the single tier, more commonly called the LAMP stack. Whenever we get into Drupal development, this is the thing we become most comfortable with: hosting everything inside a single server. Everything is made up of the web server, which might be something like Apache or Nginx, and that offloads requests to Drupal, which is run by PHP (there are a number of different PHP runtime environments you can use). You also have your files partition, where Drupal saves its file data, and then you have the database, which in most cases will be something like MySQL, MariaDB or Percona. That's your single-tier architecture in a box. And it's pretty cheap, right?
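As an aside on what a box like this can actually handle, here's a crude capacity sketch. The per-request CPU time is an assumed figure for an uncached Drupal page render, not a benchmark.

```python
# Crude capacity model for a single CPU-bound box: if every dynamic page
# burns avg_cpu_seconds of CPU, throughput can't exceed cores / seconds.

def max_dynamic_rps(cpu_cores: int, avg_cpu_seconds: float) -> float:
    """Upper bound on uncached requests/second before load starts climbing."""
    return cpu_cores / avg_cpu_seconds

# 2 cores, ~0.7s of CPU per uncached Drupal page render (assumed):
ceiling = max_dynamic_rps(2, 0.7)  # ~2.9 requests/second
```

That lines up with the "around three requests per second" collapse point described next, and this model ignores the database, web server and OS all competing for those same cores, so the real ceiling is lower still.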
It's just one box; you can buy a VM or a dedicated server now for, you know, five bucks a month or cheaper, and you can run Drupal on that. It's not going to do your campaign site; you're not going to be able to serve more than a handful of users, but it'll do the trick, and it works really well for developers and in development and staging environments, and you can upsize the hardware a bit if you like and get a few more users through it. But the key problem with this architecture is that all of that software is fighting for resources inside the one operating system. Imagine a typical server with, say, two CPUs, and let's say this is Linux just to make things easier. The resources for those CPUs are managed by the kernel, and the kernel will handle things and juggle load as it needs to, but there are only two CPUs, and there are four pieces of software running here, so you can only do two concurrent things; everything else waits until the other piece is finished. So as soon as you get up to, say, three requests per second, things start to collapse, the load starts ramping up, and you can't really go beyond that because of the limitations. So we start scaling vertically, and to go vertical means we break the architecture outward. The first step is to separate the database from the web server. That means that when PHP gets CPU-bound, and it does, because it's running Drupal and that's just part of what it is, that CPU load doesn't affect the database. The database continues to perform, and later, as we scale out horizontally, it allows us to operate more effectively. It also starts to introduce the idea of turning our resources into components and being able to diagnose and understand them at a performance level. So what happens when you
start getting, say, a thousand hits per second? When that happens, it's really hard to understand what's going on inside a single server: which request made the load go that high, and why is everything collapsing? As soon as you put things onto different servers it becomes a lot easier, because the operating system is really good at measuring itself, if not the software. You can just look at it and say the CPU load is high on the Drupal server and not on the database server; therefore the problem is not in the database, and you can continue to look deeper into the application. From there we go into what I'd call HA, or multi-tier: high availability. This is where we scale the architecture out a little further, and because we have multiple servers in play instead of just one, we also have to put a balancer in front that can distribute requests across the two different web servers. So we've done two things here: another vertical step, adding a load balancer in front, but we've also scaled out horizontally, and the balancer is necessary to do that horizontal scaling. It means we can now spread our load across multiple web servers, and in the Drupal architecture the Drupal component is the part that's horizontally scalable; the database is not really that scalable horizontally, unless you really want to roll your sleeves up and get into it quite deeply. Ideally, if you are going to have a bottleneck anywhere in the infrastructure, you want it to be CPU-bound in the web server layer, where you can scale it out; if it's somewhere else, you have to do application optimisation to get rid of it. For the availability side of this, we'd also double everything, so for example in an Amazon architecture you'd see a line down the middle and you'd have two
availability zones inside a single region, with each component duplicated in each zone, so that if one availability zone was to collapse, you'd still have half the architecture left over. You do things like database and file replication so that you can switch over without compromising anything; that's all availability stuff, which we won't go into too deeply, but in an industry-standard architecture you'll have those additional components that I just popped up there, so that you can roll over to them if you need to. Once you have that in place you can fan things out, so you continue to scale out your web servers as the load requires, and you don't have to scale out your database server. If we think about the very first slide, this one: if we had tried to scale that horizontally without doing the vertical scaling first, the database would have been inside every single web server, and that's just pointless, because Drupal can't write to several different databases. So it's very important that you go through these steps, and for your project and your application you need to figure out which level is the best one. This, by all means, would be the ideal situation for everybody, but it also happens to cost the most, so you need to find the balance of what you can afford versus how much gain you can get out of it; but from an infrastructure perspective, doing that vertical scaling is really advantageous. The question was: are both balancers in load, or is one just a spare? The answer is it's a spare, because in most cases that's a Varnish server, and if you're load balancing across two balancers then you're also splitting your cache, and your hit rate, in half. So you want to send all your requests through a single one, and if it fails you roll over to the other one. There are other tricks you can do in Varnish to link the two together if you want to, but that's extras for experts. There are also other components inside of
Drupal that you can scale vertically as well, and a lot of people forget about this part. Drupal by default is constrained by the LAMP stack: as a community project, you want it to work on the least amount of technology necessary. But to really build a proper application, there are components inside of Drupal that you want to fork off into their own dedicated server, or onto software that's better at the task than Drupal is. Some examples: the caching layer is a big one. That's normally written to the database, and if you think about what a transactional SQL database is designed to do, it's not designed to store cache; it's just the only persistent storage Drupal has, so it uses it. Another is the locking mechanism. Drupal has a locking API that lets you do atomic operations, but again the database is where the locks are stored, and if you write too frequently to the rows in that table, once you get some concurrency going, that system starts to fail as well. The queuing system is another one. Drupal actually has a pretty good queuing system and it's reliable up to a point, but when you start sending messages to the queue at high velocity, again you start to tax the database, and remember the database is being used to pull out content and render it, and the Views module is sending bad queries to your database server. You need to save your resources for those bad queries; you don't want to spend them on the queuing system. And because these are separate concerns, you don't want them to fail when the database becomes slow, and you don't want the database to become slow because they're failing; that's why you want to abstract these components out. The final one is search: everyone knows that core search performs really badly, and options these days like Solr or Elasticsearch take care of it for you. So, as I mentioned, there are a number of
different alternatives here. For cache, there's Memcache, Redis and APC, though APC works on a per-server basis, depending on the implementation of PHP. For locking you can use Memcache or Redis. For the queuing system there's Amazon SQS if you're on AWS, there's RabbitMQ or ActiveMQ, or Redis again. And then there's Apache Solr and Elasticsearch, as I mentioned, for search. When you plug these components into your architecture you start to get a bigger picture; in this next slide I really didn't have enough space to put everything in, but essentially you can expand your architecture out a lot further by adding these additional components. You obviously take on latency and network connectivity, and in exchange you get a freer application. In this diagram I put cache and search in the same box, which is actually a really bad idea, because they're both really RAM-intensive, so that wouldn't work very well at all. But the idea is that you don't want one component of your architecture to pull the rest of your application down; if you separate them out, they fail alone rather than together, and having search unavailable on your site is certainly better than not having a site available at all. In this part I'm going to go over a couple of graphs like this one, which talk about some of the technologies, methods and architectures we can use. None of this is statistically sound; it's just how I score things out of five. I don't have any data to back it up; it's completely my opinion. When you're buying servers, there are three ways you can do that: raw metal, virtualisation and cloud. Raw metal means you actually buy a proper server; you've got to pay for that server, you've got to pay for hosting that server, and you don't get a lot of flexibility out of it. If you need another server, you've got to go buy a whole new one, but you get a lot of performance out of
that. Some experts in this area might tell you that the performance gains of raw metal are not all that much more than what you might get in the cloud, but usually the reason someone buys raw metal these days is either the security component, if they're hosting it in their own data centre, or because they think it's more performant. What you usually find, because we now have architectures designed for horizontal scaling, is this problem: if you don't know how much load is going to be on your servers, you don't know how many you need to buy. You can spend too much money and have all the servers underutilised, or you can spend not enough, go over capacity, and then have to order a new server, wait a couple of weeks for it to arrive, and deal with a whole bunch of overhead. Alternatively, cloud and virtualisation offer a lot more flexibility. For those wondering how virtualisation differs from cloud: virtualisation is where you might take a raw metal server and put a virtualisation layer on top of it. Unlike cloud, where a single instance can move across multiple machines, a virtualised machine continues to exist on that one machine and taxes the resources underneath it, but you can say at the host level that the virtualised machine is only allowed x amount of resources, so it gives you compartmentalisation between each component. Out of curiosity, who's hosting in the cloud these days? A good chunk. Who's doing virtualisation? A fair chunk. And people on raw metal? Wow, a few of you guys, cool. So I think this is a fair picture, and there's probably more than this to the reasons why you end up in those areas: a lot of people go to cloud because it's cost-effective and flexible, some people want to do virtualisation
themselves, and others do raw metal for whatever security or performance reason. Web server technology: in most cases, what I try to steer towards in this talk is this. If you're sitting here thinking "what sort of web server should I use? I just went to an earlier session and apparently Nginx is the way to go", and you're unsure which web server to use, I'd say go with Apache, even though it's the poorer performer, just because it does Drupal better, and you can do other things like putting Nginx or Varnish in front of it, which will help speed up the connection handling. The web server part is purely about running Drupal; it has little to do with the performance component. When you think about how many PHP processes you can run on a server, it's not massive; it doesn't go to 128 or 64 concurrent PHP processes. It's more like: if you're at 20, you're about at your limit. So why would you need more than 20 concurrent connections, unless you're serving static files? The performance requirement isn't necessarily there, therefore I weigh in and think the community support is more vital, because it does more of the work for you. But certainly Nginx is pretty fast, and it's got a lot of community support in Drupal as well, and there are definitely ways to do that stuff if you want to, so by all means jump in and do it. I was going to talk about the PHP runtime, and then I realised it's not really worth doing, because there's only one real option at the moment, with the exception of the future state of things like HipHop and later versions of PHP. At the moment we have three things: FPM, FastCGI and mod_php, and the two other than FPM are actually so flawed that they're not worth using, so if you're using them at the moment, consider changing; I'll explain why. FPM came about because PHP CGI wasn't doing a good enough job: it's terrible at keeping itself alive, so you have to have something monitoring it to keep it running. mod_php has no limit on how many PHP processes it can open up other than your MaxClients setting in Apache; therefore it's quite possible, if you haven't configured Apache, that your server will expand out to up to 256 PHP processes, which at a 128-megabyte memory limit means around 32 gigabytes of RAM just to operate. Of course you don't have that, so you're going to go to swap, you don't have the CPU to deal with it, you'll get overloaded, and you'll have to restart the machine, which means you need human resources just to keep your site operational. So get rid of mod_php, migrate away from PHP CGI, and start using FPM, because it has so much more value: you can even have multiple PHP applications running in different pools, or different pools with different purposes. It's just fantastic. Also, a pro tip about the APC cache: if you're using FPM, the APC cache is stored in the parent process, so every child uses the same cache, which is not the case if you're using CGI; that means you can use a smaller amount of memory. Database software: I think most people here will be using MySQL or a variant of it, but there are other ones out there, and there are certainly people who will suggest you use others. For the most part, MySQL would be the general choice, just because it has the most compatibility with Drupal. I know some organisations will say "no, we standardise on Oracle" or on MS SQL Server, and if they force you into it, it just makes things more complex; the reality is that Drupal doesn't use the advanced features of those servers that make them performant, so you get no advantage from using them. So really the only choice in most cases is MySQL or Postgres. SQLite is a single-file database; you can't actually do concurrent connections with it, so don't even think about
that. So we'll quickly move through the other abstraction layers. Solr, Elasticsearch and Google are some other options, as opposed to core search. For caching, there's certainly Nginx, Varnish and Squid; you can do caching in Nginx if you like, and a lot of people rate it over Varnish, though personally I think the configuration is a bit trickier than writing VCL. Varnish has really great support and it's almost a standard now in our industry, so that would be my recommendation. Drupal core does it out of the box if you don't really have the resources to do external caching; there are several reasons why that's a really bad idea, but if you've got nothing else, then at least have that. As opposed to the front-end stuff, for caching inside the application, Memcache, Redis and APC are the likely options, and these aren't necessarily mutually exclusive: you could use Memcache for a whole bunch of caching, use Redis where you need to do wildcard purging, and use APC for your bootstrap cache, where the localised lookup registry is really important, for example. Queuing software: I just put AMQP, which is the standard for messaging; things like SQS, RabbitMQ and ActiveMQ all support it, and then Redis is an alternative which does queuing as well. And locking software: Memcache and Redis again; they've both got all the support in the Drupal community and are pretty quick as well. Sorry, I went a bit quick, if you wanted to take a snap. Yeah, sorry, Memcache is under Redis there; it's really bad graphing, but the two are pretty much on par in that aspect. So, summary: scale vertically first, then scale out horizontally, and use the software that best fits your abilities. If you think you're ready for the challenge of taking on a new piece of software, then by all means jump into it; but if you're thinking of using it purely because you've been told it was a good idea, perhaps rethink that and use something you're more comfortable with, because
ultimately, when a performance problem hits, you need to know how to fix it, and if it's a brand new technology and you're out of your depth, then you're just left with an application that doesn't do what it's supposed to. The third part we're going to talk about is optimise, and that's made up of three areas. The first one is testing; I think this is a really important one, and one that people get wrong a lot: building accurate test plans and accurately mimicking a production environment. Then there's measuring: putting things into your architecture that let you measure what's going on in your system. That's things like monitoring, whether that's Nagios or Cacti (there are lots of options out there), and logging as well, so that you can look back through your logs, replay what happened and figure out what went wrong. And then analyse: looking through all of that data, figuring out what the problem is, identifying it, fixing it, and then reiterating, going back to the load testing plan and sorting it out. I can't stress this part enough, and I think this is really the most important part for the business: if you can't simulate your production load, then there's no point in doing your load testing. The reason for that is that a lot of this comes back to the confidence of your customer. The customer wants to know: is it safe to launch? All the money we've put into this project, is it ready to go live, is it going to fall over, and is it going to make us look bad? If we can't do a load test and say to the client, "yes, we're going to be fine", then there's not really a point to it. Quite often that happens because we load test the wrong things. We as technical people get intrigued about what will happen if we try to do a thousand page views per second at this one URL, and go "oh wow, look at that, it's amazing", and it
doesn't really matter, because we're never going to get to that rate; but if there's some Ajax request that you didn't include in your load test, and it actually turns out to be a POST or something, it'll take your site down, and you need to find that very quickly. So don't measure technical metrics first: things like page views per second, hits per second or megabytes per second. Not bandwidth, not how many page impressions you get or Drupal requests you serve; that's not how you want to set your targets. It's fine to have those numbers, but when you figure out what your goals are, these should not be your goals. Instead, you want to be thinking about business metrics. How many users do you have at peak? You can usually find that from historical analytics data; it tells you per hour. Or you can ask the client. I ask my clients "how many page views do you get per day?", and they'll tell me, and then "what percentage of that goes through in your peak hour?". It makes them think, it shows how well they know their user base, and it gives me a number to work with; and once they've answered, they've told me what success looks like, and I can test against that. I then also find out how long a user stays on the site. One constant source of confusion is that customers might say "I want to serve 100 users concurrently", and when they say that, developers hear "they want 100 hits per second", and that's never actually the case. They just mean: if there are 100 users on the site at one time, what does that look like? And of course those 100 users are in several different states: some are requesting, some are reading content, some are doing something else; they aren't all putting load on the servers. So we figure out what the session duration is and how many page views they make during that session, and once we have those three pieces of information, it becomes trivial to map them over into
technical metrics. So here is an example: I have 10,000 users at peak, each user stays on the site for a 20-minute session, and during that session they request 10 pages. Mapping that over to technical metrics: I take those 10,000 users and divide by 60 divided by 20; in other words, I take that 20 minutes and figure out how many times it repeats over a one-hour period, because the 10,000 users was the peak for that hour, so it's an average. That comes out at 3,333 concurrent users at any given time; that's the concept of concurrent from a user perspective. Then the user request rate is 10 divided by 20: take the number of page views and divide it by the session time. So they request on average 0.5 pages per minute, or in other words one page every 2 minutes, and we can convert that down to a per-second figure. We also have a check here, so that we know our maths is correct: the number of pages we should see requested on the site in an hour is 10 times 10,000, so 100,000 page views. Then we can finally convert that over to a technical metric and say that page views per second equals our user request rate times the number of concurrent users, which comes to roughly 28 per second; multiply that by 60 seconds to get the minute value and then by 60 again to get the hour value, and we get a number very close to 100,000. It won't be exactly the same because of the rounding. Next: get to know your tools. This is about things like logging and monitoring; we'll have questions at the end, so perhaps we can do that then. Now I'm just going to talk about a bunch of tools that I've used and that I rate, as well as ones that I don't use, but I'll show them nevertheless. The first one is New Relic; it's a paid-for tool. Anyone using this today?
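The worked example above can be written out as a short script; the three inputs are the business metrics from the talk, and everything else is derived from them.

```python
# Business metrics (the inputs you get from the client):
peak_users_per_hour = 10_000  # users during the peak hour
session_minutes = 20          # average session duration
pages_per_session = 10        # pages requested per session

# A 20-minute session repeats 60/20 = 3 times in the peak hour, so:
concurrent_users = peak_users_per_hour / (60 / session_minutes)  # ~3,333

# Each user requests 10 pages over 20 minutes:
pages_per_user_per_min = pages_per_session / session_minutes     # 0.5

# Technical metric: page views per second across all concurrent users
pages_per_second = concurrent_users * pages_per_user_per_min / 60  # ~27.8

# Sanity check: should recover the hourly total of 10 x 10,000 page views
hourly_page_views = pages_per_second * 3600
print(round(hourly_page_views))  # 100000
```

Note the per-second figure: 3,333 "concurrent users" translates to under 30 page views per second, not 3,333 hits per second, which is exactly the misunderstanding described earlier.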
Cool, cool. So those of you using it at the moment and continuing to pay these guys must value the product, and I certainly rate it. Here we've got a graph of your performance over time, and this is really important: we tend to think of performance as a constant number that represents the site, but in actual fact it comes and goes depending on the load you have, and under different loads things behave in different ways, and New Relic has a really great way of showing that over time. So this is a time graph, separated out into four different values: the green represents web traffic, how long it took to transfer the content; the blue is Memcache; the orange is the database; and the teal at the bottom is PHP, how much time was spent in PHP. This is really great because when you're looking at a page view that took three seconds, you can immediately see the breakdown of where those three seconds were spent, and if something looks odd you immediately know where to look. If the database was the part that was spiking, you can say, well, I need to go and look in the database for slow queries or whatever went wrong there. If it was PHP, you can start looking into the application. In fact, the other thing New Relic does is stack tracing: if you pay for the pro subscription you get stack traces and can look inside your application and find the exact function where all the time is being wasted away, so you can look at optimising that component. It's really very valuable, if you can afford the budget for this kind of thing. There are open source tools as well that do a close enough job to what New Relic does. The first one is XhProf. XhProf, just like New Relic, does stack tracing; it was contributed by Facebook, who used it to trace their own problems, and it's really quite powerful, though not quite in the way New Relic is, because it only profiles per request, whereas New Relic aggregates across all the requests on your entire site, so with XhProf you need to look at a few different profiles to truly understand what's going on. All of these monitoring tools, New Relic and XhProf both, have a performance impact on your page load, but I would say if you really need to do a lot of performance optimisation, I'd rather have them than spend days trying to theorise what the problem could be, attempt a fix, and find that wasn't it. XhProf lets you order things by memory usage or by wall time, how long was spent on that task, and that's really key because it lets you find very quickly which were the offending functions, and you can trace up the stack to find the appropriate area to remediate. It also does call graphing: this is a map of the PHP call paths, and in this particular image it shows a clear trail of where the performance defects lie. Another one, which I haven't really used all that much to be honest, is Xdebug. You can take profiling output from it and load it into a cachegrind viewer, KCachegrind for example, and in a very similar way to XhProf it lets you look at what happened during that page load, drill down, dig into the innards of it and figure it out. I think at the very least what you want to do in your application is log explicitly just the Drupal requests to a log file. By all means have your Apache access log, or whatever access log it is, logging every request that comes in, but have a dedicated log file for just Drupal as well. The reason I say that is that it becomes really important to understand the requests that cause the most impact on your server. Static assets are really cheap, they're just file lookups served straight back, and if you have I/O issues you'll see those in things like Nagios graphs, whereas PHP is always CPU-bound, and so that's where your web server is going to be
spending a lot of its time. So make sure you know how many Drupal requests are coming through per second, so that you can match that up against how many PHP processes you're giving your application. If there's one thing you take away from today, write this formula down. This formula is what I use when I think about performance optimisation and when I figure out whether the application, in its current state, is going to live up to what we want it to do. In most cases, with the exception of things like Ajax requests or private file system files, you can almost say that a page view correlates to exactly one Drupal request. There are plenty of exceptions to that, but let's go with it for now. If the client gives us their business metrics and we figure out what that means in page views per second, but we don't have the page load time average, then we don't know whether we can serve the site at capacity, so we use this formula to figure that out. Page views per second, in a theoretical sense (remember, earlier we mapped it from a business sense, so those are two very different metrics), is the number of PHP processes you have available on your infrastructure divided by your page load time average. Say you have 10 PHP processes running on your application and a load time average of two seconds: it takes two seconds to put a page together and send it back, which means you can only serve five page views per second, because each PHP process is occupied for two seconds before it can serve the next request. As soon as incoming requests exceed five page views per second, Apache will start queuing them up, they get processed as processes free up, and your page load time average gets bigger and bigger, because the server can't deal with the load. So you need to scale out how many PHP processes you have.
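As a rough sketch of that formula and the queuing threshold it implies (the function names here are my own, not from any particular tool):

```python
# Theoretical serving capacity: page views/sec = PHP processes / average
# page load time in seconds. Once incoming traffic exceeds this number,
# requests queue and the page load time average starts to climb.

def capacity_pages_per_second(php_processes, avg_load_time_seconds):
    return php_processes / avg_load_time_seconds

def will_queue(incoming_pages_per_second, php_processes, avg_load_time_seconds):
    # True when the request rate exceeds what the process pool can serve.
    return incoming_pages_per_second > capacity_pages_per_second(
        php_processes, avg_load_time_seconds)

# The example from the talk: 10 PHP processes, 2-second average load time.
print(capacity_pages_per_second(10, 2.0))  # 5.0 page views per second
print(will_queue(8, 10, 2.0))              # True: 8 > 5, requests queue
print(will_queue(4, 10, 2.0))              # False: 4 < 5, the pool keeps up
```

The same arithmetic works in reverse: given a target page views per second from the business metrics, it tells you either how many processes you need or how low the average load time has to be.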
That being said, you can't simply add more PHP processes: they have to be scaled appropriately to the CPU you have. If you have too many processes for your CPU, your page load time will go up simply because the CPU is burdened by the amount of load and has to divide its time across however many processes you have. I use a rule of thumb of five PHP processes per CPU on the server, so two CPUs means ten processes per server; I don't need any more memory than what's going to facilitate that, and then I scale out horizontally from there. But that all depends on your application. Usually, the more memory your application uses, the longer the page is going to take, simply because it takes longer to deal with that memory, so the shorter you can get the page load time without having to worry about the CPU, the better off you're going to be. Anyway, my friend Mark gave me this graph. He did some profiling to test what would happen if we had too many PHP processes configured, so this graph compares the same size of server with, he's got some numbers there, 15 PHP processes configured versus 32 PHP processes configured: one has fewer, one has more. What he found from the results is that as the number of requests increased, with the smaller number of PHP procs your response times became more sporadic, and with more PHP processes you could deal with more concurrent connections, so your load times also went up, but they were more consistent; everything performed a bit more badly across the board. So this is something to consider. My preference would be to keep that number low and keep your servers safe; in other words, you don't want to get to a place where your load goes so high that the server becomes unresponsive and you lose it altogether. I would rather scale things out horizontally and have the more
sporadic load times, with some requests returned quite quickly, and always have a performing architecture, so that nobody can say Drupal got slow because the CPU was over-taxed, rather than having the page load time for everything go up slowly. But that's something for you to determine for yourselves. So, to start summing this up, I wanted to do a sort of open session and get you to brainstorm with me and diagnose an issue in an infrastructure. Sometimes I feel like I'm Gregory House, not that I'm an asshole or anything, well, I don't think I am, but more in the sense that you have to reason from the data you have observed to what the problem could be. With experience that comes to mind more and more readily, but you have to use the tools to theorise what the problem is and figure out how to solve it when you're trying to eliminate performance defects. So let's consider this architecture. We have a single Varnish server at the front. We have six Drupal servers, also running Memcache. We have a single database server at the back that holds the file partitions for both the database and the files, and the files are mounted onto the web servers over Gluster, let's say it's Gluster. We've got 28 PHP procs per server, there are some numbers in there that aren't relevant, two CPUs per server, 5.7 gigabytes of RAM per server, and Memcache consumes 256 megabytes on each server. That's our infrastructure. Now, when we look at the New Relic data coming back, we're getting three-second load times at points, in a very sporadic pattern. So my question to you is: from the data I just gave you, both the infrastructure and the New Relic graph we see here, what do you think the problem could be, and what, from an infrastructure perspective, might improve the situation? Yeah, if we look at this graph here, the orange part is the database line, and you can see
it's the smallest component of the page load. Yes, so the answer there was to use fewer Memcache servers, reduce the number down, or use sharding. Yeah, and actually, I've just realised this wasn't clear, I should have explained that the Memcache servers are shared: one Drupal server over here on my left will connect to every single Memcache server so that it can distribute the cache across them. But you're right. You can see here that the Memcache line is exactly the same thickness as the PHP CPU line, and we know that Memcache and Drupal are on the same servers, so whenever a server is CPU-bound, Memcache is affected; your caching system degrades exactly when Drupal comes under load. If we were to separate them out in this architecture, it would decouple them, and the caching system would continue to perform just as well when the load gets high. That would mean not only that the Memcache portion takes less time, but also that Drupal isn't waiting on it, so the Drupal part would take less time as well. We'd get a huge performance improvement from making that change. Cool, well, that actually took a lot less time than I thought it might, so that's really cool. Thank you, that's all I had for my session. Thanks. If you have any questions, there's a microphone in the centre of the room. Great, I guess there is one: what is the best way to optimise community sites, sites where you have logged-in users, where Varnish and all these other caching mechanisms don't work that well? Right, so, in most cases your community site is not going to have page caching, and because of that every Drupal request hits Drupal and has to be processed. That just means you have a lot more request load, and that might mean you ultimately need more servers to deal with it, you know, more
PHP processes to be available. There are some other ways you can optimise that at the application level. Like I was saying earlier, you can continue to serve anonymous pages in some circumstances and then render in the authenticated parts with something like Ajax or ESI, though for a community site where you're talking about forums, that may not be so beneficial. One thing you should always have on your checklist is to make sure the dblog module is never enabled; that's logging to the database, which just creates unnecessary writes. And you want to ensure you have the Entity cache module enabled, usually only if you're using something like Memcache or Redis as a cache store. That lets you store loaded entities in cache so Drupal doesn't have to redo that work, which reduces your page load time average, and the sooner a PHP process can be returned to the pool after serving a request, the sooner it can serve another one, so you save money because you can serve a higher concurrency rate. How can I know how many PHP processes are running on my server? So, if you're using mod_php, what comes in a standard LAMP install, that's 256 by default, and Apache limits it there. But if you're using something like FPM, it has a configuration setting, I want to say in the ini file but I think it's in conf.d, called pm.max_children, and FPM will limit itself to that many children. You can also say how many it starts with, how many stay idle, and what your maximum is, which is the max children, so you can say always have five but allow it to go up to 20, and when you're only under a small amount of load it can drop them off again and keep a few idle for when you need them.
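For reference, a php-fpm pool along those lines might be configured like this; the directive names are real php-fpm pool settings, but the values are purely illustrative:

```ini
; Illustrative php-fpm pool: cap the pool at 20 children, start with 5,
; and keep between 2 and 5 spare (idle) workers for bursts of load.
[www]
pm = dynamic
pm.max_children = 20
pm.start_servers = 5
pm.min_spare_servers = 2
pm.max_spare_servers = 5
```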
Yes, it's a great question. I myself haven't had enough time to play with Redis to answer that in full. What I can tell you is that Memcache does a really terrible job of caching from a Drupal perspective. It's a fast key-value store, but one thing the Drupal cache system does and supports out of the box is wildcard cache clearing, and that's something Memcache does not support, so the Memcache module in the Drupal community has to find an application-layer way of dealing with wildcard clears, and that unfortunately means it has to do additional caching work. Redis, on the other hand, supports wildcard purging, so I would definitely recommend giving it a go, though I don't have the experience with it to say for sure. Yeah, like I said, I'd need to look into it a little bit more, and of course you're talking about persistent storage as well: if you want to do master/slave, you care about the data inside Redis, which is good for things like queues, perhaps not so relevant for things like locking or a cache store. Cool, thanks. Come up to the microphone if you don't mind. Talking about stack traces, I've used XhProf, and the main problem I see with it is that you've got lots of multiplexer functions, like module_invoke_all() or theme(), where everything's calling them, and they could be calling 50 different functions depending on where the call is coming from, and finding a way through that and working out where the choke point is, is really tricky. Any way of getting around that? Yeah, that's a really good question, actually, because it's something I had to figure out myself as well, so I'm going to go back to the screenshot here. This is what you'd see by default, and like you say, in Drupal a function like module_invoke_all() calls a whole lot of different functions and gets called a lot of times itself, and there are things like call_user_func_array(). What you'll end up doing is start to understand what it is in Drupal that gets called a lot, but it's not the fact that it gets called a lot that's the issue; it's what's calling
it a lot that might be the problem. In most cases the hook system isn't abused by the most well-supported modules, so you wouldn't find module_invoke_all() itself being the issue, but there are scenarios where other things might be. In those cases you'll see on the side here that there's a column called calls, which tells you how many times a function got called, and in those instances, in XhProf you can look at a single function and it'll tell you the child functions it calls, and also the parent functions that call it, or the number of different parent functions that call it. So normally, if you see a function that gets called a lot of times and that's a problem, you look at the parent functions, and in those places where a lot of different paths all point to a single function and then fork out again, ctools is a good example of this, you can work back through the parents. The other alternative is to actually render a call graph and look at it that way. Again, you can see here that it has that yellow trail that tells you where the performance loss is being caused, so in those instances you can use the call graph to find very quickly which part is really causing your issue. Just to add one thing I found quite useful: when you're running as administrator, as user one, Drupal does a lot more checks when building the menu system than it does when you're not. When you're not user one it can cache it, but for user one it never caches it, and if it's doing access checks on every single link, that can really expand the page time. So that's an interesting one. Cool, thanks. Is there anybody else? Otherwise I can release you all for beer. Go forth! Thanks.