Thank you, Todd, for that wonderful introduction, and thanks everyone for making it here. Over the next 15 to 20 minutes I'm going to talk to you about capacity planning. What I'd really like to do first is talk about why capacity planning is important in the first place, because it's something that's often overlooked in our drive to roll out product and feature enhancements and keep customers happy and satisfied. We tend to neglect this aspect, so I want to spend some time on why it's important to do it up front and not in a reactive mode. The next thing I'll cover is what's actually involved: is it really complicated, or is it fairly straightforward? I hope that by the end of this talk you'll see that it's really simple to do and anybody can do it. And the last thing I want to touch on is how you can apply the learnings from your capacity planning exercise to set up auto-scaling policies for any application you deploy in AWS.

So, moving on to why capacity planning is important. I'm sure none of us wants to be in that position; as they say, a picture is worth a thousand words, and I definitely don't want to be where this guy is right now. The idea is that you want no site downtime. You really want your system up and running, and you don't want to be scrambling to debug issues in real time with real customers waiting online for the site to come back up. As for the benefits of capacity planning, essentially it ensures high availability of your systems: no site downtime and no late nights. It also improves cost efficiency, so you end up provisioning only as much as you need, perhaps with a buffer built in so that you're prepared for spikes in traffic.
It's also really helpful for defining your SLAs up front, figuring out how you're going to meet them, and having a proper process and plan in place. Another benefit of a capacity planning exercise is that it provides very valuable insights into your top application workflows. I'm sure all of you are familiar with the 90/10 rule, wherein 90% of your customers are using 10% of your workflows. This exercise is one of the things that will help you drill down on those top workflows and make sure they're really fine-tuned and ticking along well. It also helps you look at other workflows which may not be as critical but which may impact your top workflows, so you get a proper bird's-eye view of your system and can plan well in advance. The other thing it helps you with is understanding your average and peak utilization. Are there spikes in your usage patterns? How is your system going to scale for those spikes? You obviously don't want to over-provision your system beyond what you need, so it helps you think about how much over-provisioning you need and how you're going to build that into your deployments. That covers the overall benefits of doing a capacity planning exercise.

The next thing I want to go over is what capacity planning actually entails. I've tried to break it down into four high-level steps, and we'll go over each of them in the subsequent slides. First, you want to have a perf environment set up that mimics your production environment to the largest extent possible.
To do that, you'll have to go back and understand your component and deployment architectures, which is always a useful thing to do. Once you have that environment set up, the next step is to determine your single-host load pattern, so you can figure out your call distributions and the load for each of the top APIs coming in through your site, and build scripts that mimic that pattern. So next you develop scripts that mimic your production distribution patterns, and then you run your load tests. The last step, which is probably the most difficult and the most important, is analyzing your results and extrapolating them to your production systems, because whatever you run will be on a scaled-down performance environment, and you have to figure out how that maps to your real-world system. That's probably the most critical part of the entire exercise, and it will also help you figure out where your SLAs degrade, understand the bottlenecks or root causes, and decide what mitigation steps to put in place.

Moving on to the first step, setting up the perf environment, there are some considerations I want to drill down on. Here I have a picture of our application, and I'm not going to walk you through every single component in the system. The point I'm trying to make is that most applications are complex. They're composed of multiple components, and there are interdependencies between those components, some of which you or your team will own and some of which will be downstream dependencies.
So it's very important that you take a good look at your system to understand your downstream dependencies and the SLAs defined for them, because anything you build on top is going to be a delta over and above those predefined downstream SLAs. It's very important to understand that before you communicate any SLAs of your own. Understanding your component architecture also helps you drill down on environment-specific configurations, because a common mistake people make when doing capacity planning is overlooking a downstream environment they depend on. If you haven't edited your application configuration to point to the downstream perf environments, you may accidentally or inadvertently end up pointing to a downstream production system or a beta cluster, and you're going to have a huge blast radius every time you run your tests. So you want to be really careful that you've combed through your application configuration files to make sure you're not pointing to any downstream production or beta environment, so that you don't impact anybody else in the process of your testing. Hopefully you'll be doing this exercise on a regular basis, so you don't want any recurring problems being introduced into the system.

The other thing you want to understand is your deployment architecture. For example: what kind of load balancers do you have, and what policies are set up? Are there any service gateways? How many AWS instances are provisioned in your production system, and what size are they? What are your caching policies? What are your database management policies, and what kind of replication do you have in place?
What kind of failover do you have in place? All of these are things you want to understand at least to some extent up front, and at least call out the gaps in your knowledge or understanding so you can pull in the requisite experts on an as-needed basis.

Once you have your performance environment set up, the next thing to do is analyze your load pattern and call distribution. Where I work, we have so far been using New Relic and Splunk, to a large extent, for monitoring load patterns and call distributions at the individual team level. The idea is that you go into New Relic and figure out your highest-throughput top 4, 5, or 10 API calls, and the relative distribution across those APIs. Then you move into Splunk, look at your last 30 days, and for every single API figure out the peak call day; from that day you drill down to the peak hour, and then eventually to the peak minute, so that you end up with the maximum number of requests per minute that each API is sending to your production systems. Once you have that call distribution for every one of your top identified APIs, you can go ahead and build your load test scripts and run your load tests. JMeter is a commonly used tool for mimicking these distribution patterns and building these API scripts. A few things I want to call out here: you want to ensure you have the right call assertions in place for error-rate monitoring, because as we all know, a 200 response is not necessarily an indicator of success.
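In practice you would run this drill-down as Splunk queries, but the peak-day, peak-hour, peak-minute idea can be sketched offline as well. Here's a minimal illustration, assuming request logs have already been exported as `(timestamp, api)` tuples; the function name and log format are hypothetical:

```python
from collections import Counter
from datetime import datetime

def peak_requests_per_minute(log_entries):
    """For each API, find its busiest minute and the request count
    in that minute. log_entries: iterable of (iso_timestamp, api)
    tuples, e.g. ("2023-05-01T14:03:27", "/v1/search")."""
    per_minute = Counter()
    for ts, api in log_entries:
        # Bucket each request into its calendar minute.
        minute = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:%M")
        per_minute[(api, minute)] += 1

    peaks = {}
    for (api, minute), count in per_minute.items():
        if api not in peaks or count > peaks[api][1]:
            peaks[api] = (minute, count)
    return peaks
```

The per-API peak figures this produces are what you feed into the load-test scripts as target rates.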
So you want the right assertions in place to make sure the API response actually contains the data you're expecting, not just a 200 status. The other thing to decide is where to cap your load test: where do you start and where do you stop? A common rule of thumb is to start with 1x your peak production load, increment linearly up to 5x, see how your system behaves, and then decide whether you still need to go further. As you run your tests you want to monitor throughput, but you also want to look at other metrics. On the client side you can look at your TP99 and TP90 latencies and your error rates from the end user's perspective; on the server side you can look at response times, CPU usage, memory usage, and thread counts and how they're doing. One more thing to keep an eye out for while your test is running: if an AWS instance goes down and a new one auto-scales up, you've definitely reached a break point in your system, so that's something worth watching. These are all things you may want to monitor as you run your tests.

Which brings us to what an ideal system profile looks like from an end-user perspective. Basically you're looking for a highly available system, so you don't have any site downtime, and low latency, so SLAs are met and you have happy customers who give you a higher NPS rating.
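Two of the ideas above can be sketched in a few lines: a response assertion that treats a 200 with the wrong payload as a failure, and the TP90/TP99 percentile calculation on client-side latencies. The response shape (`{"results": [...]}`) and function names are assumptions for illustration, not an actual API:

```python
import json
import math

def assert_search_response(status_code, body):
    """A 200 with an empty or malformed payload still counts as failure."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    # Assumed response shape: {"results": [...]} with at least one hit.
    return bool(payload.get("results"))

def tp(latencies_ms, percentile):
    """TPn: latency below which n percent of requests complete."""
    ordered = sorted(latencies_ms)
    index = math.ceil(percentile / 100 * len(ordered)) - 1
    return ordered[max(index, 0)]
```

In JMeter you would express the first part as response assertions on the sampler rather than in code, but the principle is the same: assert on content, not just status.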
You also look for graceful degradation under load, with proper alerting mechanisms in place so you can see your system starting to keel over even before it actually does, and mitigate in time. And you want to make sure there are no data corruption issues: you want atomic transactions, so that if anything fails at an intermediate point, everything rolls back. Those are the kinds of considerations for the profile from an end-user perspective.

From a monitoring perspective, what kind of usage profile are you looking at? One of the key signals I mentioned is CPU usage: are you seeing CPU usage exceed 50% for extended time intervals? If so, you'll want to drill down further. Similarly, watch your garbage collection frequency and the duration of each GC cycle. Basically, you want a GC cycle running every three to five seconds, and you don't want it to take more than 50 to 100 milliseconds, so keep an eye on that as well. You also have to keep an eye on memory consumption: are there spikes in usage? You definitely want to drill down on any spikes you see. And watch thread allocation: is there too much frequent context switching between your threads? Are there any deadlocks or blocked threads? These are the kinds of things you want to be monitoring for your application on an ongoing basis.

So what happens when you actually see one of these markers or alerts? For example, if CPU usage exceeds 50% for an extended period, you may need to upgrade your CPU or add more processors, or you might even have to go back and fine-tune your application.
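The thresholds above can be captured in a simple checker. This is a minimal sketch; the field names, the dict-based sample format, and the exact cutoffs are assumptions based on the rough guides mentioned in the talk, and real deployments would wire these rules into their monitoring tool instead:

```python
def check_kpis(sample):
    """Flag the KPI thresholds discussed in the talk.

    sample keys (assumed for illustration):
      cpu_pct       - average CPU usage over the window, in percent
      gc_interval_s - seconds between GC cycles
      gc_pause_ms   - duration of a GC cycle, in milliseconds
    """
    alerts = []
    if sample["cpu_pct"] > 50:
        alerts.append("CPU above 50% sustained: add cores or tune the app")
    if sample["gc_interval_s"] < 3:
        alerts.append("GC more often than every 3s: check allocation rate")
    if sample["gc_pause_ms"] > 100:
        alerts.append("GC pauses above 100ms: likely SLA impact")
    return alerts
```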
If you see a lot of spikes in GC duration or frequency, you know that's going to have an SLA impact, so you need to drill down further and look at memory utilization. Is it increasing despite GC? Then you probably have a memory leak somewhere and will eventually run into an out-of-memory error, so that's something to keep an eye on. And lastly, keep an eye on thread allocation as well. Watch out for too-frequent thread context switches; if you're seeing threads in a blocked state, you may be running into race conditions. Those are some of the KPIs you can monitor at peak load.

Once you've done that, found the bottlenecks in your system, put mitigations in place, and arrived at something you're pretty satisfied with, you can ask: how do I extrapolate those numbers from a scaled-down performance environment to the actual production environment? For example, in my case the performance environment was a single AWS instance that was processing, say, x requests per minute within SLA, and our production system had 8 AWS instances. Do we directly multiply that number by 8? Probably not. To be on the safer side you would probably multiply by something more like 6x, note the assumptions made in your setup in terms of caching and other factors, and account for those when you make your predictions. In our application, what we saw was a typical load of about 1.5 million requests per day, which is about 300 requests per minute, which is obviously not very high. Overall, though, we were able to process 2,500 requests per minute across the entire production application system.
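The extrapolation step can be sketched as arithmetic. The 0.75 discount below (using roughly 6x instead of 8x for an 8-instance fleet) is an assumption reflecting the "safer side" guidance above, not a fixed rule; pick your own factor based on the assumptions in your perf setup:

```python
def extrapolate_capacity(single_host_rpm, instance_count, safety_factor=0.75):
    """Scale a single-host break point up to the fleet, both naively
    and with a discount for caching and other setup assumptions."""
    naive = single_host_rpm * instance_count
    conservative = single_host_rpm * instance_count * safety_factor
    return naive, conservative

def utilization_pct(typical_rpm, capacity_rpm):
    """Percentage of estimated fleet capacity used by typical load."""
    return 100 * typical_rpm / capacity_rpm
```

With the numbers from the talk (single-host break point around 550 rpm, 8 instances), the naive estimate is 4,400 rpm, and a typical load of 2,500 rpm lands just under 57% of that capacity, in line with the figure quoted.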
The tests basically indicated that peak load was exceeded once we went above 550 to 600 requests per minute on a single host, with a composite mix of the top APIs I told you about earlier. That equated to about 4,400 requests per minute overall, which means our production systems were running at around 56 percent capacity, which is not necessarily bad. But we also knew we had a predictable spike: we always saw traffic climb to about 4,500 requests per minute at very specific times and for very specific intervals. That's where we decided this exercise would help us set up some autoscaling policies in AWS.

Essentially, we set up alarms to monitor CPU usage, and once an alarm is triggered, an autoscaling policy kicks in and scales up additional AWS instances to meet the incoming traffic and puts them into service, so that the additional incoming requests are shared across all the instances. One thing to be aware of when you set up your autoscaling policy is that there is a time lag between when your alarm goes off, when the autoscaling policy kicks in, and when your box is actually provisioned.
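The alarm-plus-policy pair described here can be sketched as the parameter dicts you would pass to boto3's `autoscaling.put_scaling_policy` and `cloudwatch.put_metric_alarm`. The names, the 50% threshold, and the instance delta below are assumptions for illustration, not the talk's actual configuration:

```python
def cpu_scale_out_config(asg_name, cpu_threshold_pct=50, add_instances=2):
    """Build parameter dicts for a CPU-triggered scale-out: a simple
    scaling policy on the Auto Scaling group, and the CloudWatch alarm
    that would trigger it."""
    policy = {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-scale-out",
        "AdjustmentType": "ChangeInCapacity",
        "ScalingAdjustment": add_instances,  # add N instances per trigger
        "Cooldown": 300,  # allow time for new boxes to boot before re-firing
    }
    alarm = {
        "AlarmName": f"{asg_name}-high-cpu",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,  # sustained for 3 minutes, not a blip
        "Threshold": cpu_threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    return policy, alarm
```

The `Cooldown` and `EvaluationPeriods` settings are where you account for the alarm-to-provisioned time lag mentioned above.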
So you want to account for that delta in time. If you're anticipating a peak or a spike in traffic at a particular time interval, you want these policies scaling up your AWS instances in advance of the spike, so that your system is already up and running before the spike comes in through your website. Once you do that, you can do the same thing for scaling your instances back down, so you're only using them on an as-needed basis. It's on-demand provisioning of systems, and that's where your cost efficiencies come in as well. That's basically all I wanted to cover. Thank you very much for your time, and I'm happy to answer any questions. Thanks.

Hi Lakshmi, can you hear me? I just wanted to ask how you've defined the ideal parameters for any system. Like you said, CPU percentage has to be less than a certain value, and all those factors.

Right, so I guess it depends on your application and what your SLAs are. There are some standard KPIs that we always monitor, like the ones I mentioned, and if you have something specific to your application, you can always add monitors for those as well.
Hi, so you were talking about auto scaling: whenever a peak spike in traffic is detected, you flex up. Similarly, do you have any flex-down mechanism in place, where you identify that a few of the boxes are just lying there doing nothing?

Yeah, that's a good point. These autoscaling policies can be set up so that you scale up or scale down for a specific time interval as well. In our case we saw that the spike in traffic lasted maybe one or two hours at a particular time of day, so once that interval had passed, we would actually scale our systems back down. It's not that once you scale up you're always in scaled-up mode. Typically, though, you want to balance how much micromanaging you do on the provisioning side: you want a certain number of boxes always up and running even if they're not fully utilized, because with a good distribution your response times are going to be much faster and you'll be sending responses back to your customers well within the SLAs you've defined. So you want to balance that, but yes, you can do both; there's a mechanism for provisioning up and for provisioning down. Thank you.

Right, in terms of defining the KPIs and keeping them documented in a specific location so everybody has access to them, is there a specific tool we use, or do we just maintain spreadsheets and things like that?

That's a good question. No, we don't actually use any specific tool; I'm sure there are tools out there that would be very useful for this. What we do have is CI/CD-based performance suites that run, benchmark all your key APIs, and publish those parameters on an ongoing basis. A report gets sent out to all the key stakeholders on every release that goes out, so people can keep an eye on whether the baseline is being maintained, degrading, or getting better. There are also organizational or team goals that say, okay, we're at this percentage but we want to improve by maybe 10 percent in the next release, so that kind of monitoring happens on an ongoing basis. But we don't use any specific tool; we just have these scripts running in a CI/CD fashion, so the tests are run, the information is collated, and it's published to a set of stakeholders. Thanks.

Okay, thanks everyone. Thank you all.