the 32nd lecture in the course Design and Engineering of Computer Systems. In this lecture, we are going to understand a little more about how to analyze the performance of your computer system. What is the workflow in building computer systems that we have seen in the course so far? You first design and develop your system, put all the components together, run a load test, and measure performance. We saw in the previous lecture what measuring performance means: you vary certain input parameters, vary the incoming load into your system, and measure various metrics like throughput, response time, errors, utilization, and so on. The next step is to understand whether the performance is reasonable or whether it can be improved. You have to look at these performance numbers, analyze them, and ask: does this performance make sense? Is it justified? Is there a bug in my performance measurement, or is this how it is supposed to be? That is, you have to do some back-of-the-envelope calculations to convince yourself that your performance measurement is correct. Then, once you know that this is your performance but you expect a lot more load on the system, you go ahead, tune your system, and repeat the process. In this lecture we are going to see how to do this performance analysis. What performance measurements do we expect from the system? We will use simple back-of-the-envelope calculations: rough calculations you do on scrap paper to check, for example, "I am getting 100 requests per second; is that correct or not?" That kind of rough estimate of your performance is what we are going to study in this lecture.
And once these estimates match what you are seeing, you can trust the measurement. Suppose you estimate a capacity of 100 requests per second because each query takes 10 milliseconds, but you measure a capacity of only 10 requests per second; then you know something is wrong in your measurement setup that you have to fix. These back-of-the-envelope calculations are very important for getting a rough sense of whether your performance measurement is correct and in the ballpark you expect. Once again, let me clarify that in this lecture I will only cover the high-level concepts and try to give you an intuitive feel for performance analysis. Actually doing rigorous performance analysis involves a lot of math, notation, and theory that we do not have time for in this course. If you are interested, you should take a course on queuing theory; that is the area of computer science that studies in detail how queues build up due to performance bottlenecks and what throughput and response times result. In this lecture I will not go into a lot of math or write scary equations; I will just explain things in simple, intuitive terms, which is enough for most cases of performance analysis in real systems. So let us see what performance numbers we expect in an open-loop load test. Suppose you have a system with a capacity of C requests per second. Where does this capacity come from? At the bottleneck component, the service demand is roughly 1/C. For example, if every request takes 10 milliseconds to process at the database, your database can handle about 100 requests per second; we saw this in the previous lecture.
Now, to this system with a capacity of C requests per second you send a stream of requests at varying rates: say the rate of requests into the system is R requests per second — 10 requests per second, 20, 100, and so on. This is your open-loop load test. What will the throughput look like? Queuing theory tells us the following, and even common sense will tell you this. As R varies from small values up to C and beyond, the measured throughput behaves as follows. Initially the throughput equals R: if your input load is 10 requests per second, your throughput is also 10; if the input load is 20, the throughput is 20. So the throughput initially grows linearly, equal to the offered load, until you reach capacity. Once the incoming rate exceeds the capacity, the throughput flattens out. This is intuitive; you all understand why this happens. What about the response time? Initially, as you vary the load, the response time is low because there is no queuing: all components are free, and whatever request comes in is quickly processed. For R well below C the response time stays low; as R approaches C, the response time increases steeply; and once R is much greater than C, the response time effectively goes to infinity. With unbounded load coming into your system, the response time climbs to very high values, and at some point queues build up, queues overflow, requests fail — errors, crashes, all of these will happen.
So your response time starts low, increases as you reach capacity, and rises rapidly once you exceed capacity. And what about the utilization of the bottleneck component? Below capacity, utilization is approximately the incoming load divided by the capacity: on a scale of 0 to 1, it is proportional to the load, and at some point it reaches 100%, or a value of 1 as a fraction. This is, of course, the utilization of some hardware component: if your work is mostly on the CPU, this is CPU utilization; if mostly on the disk, disk utilization. Whatever your bottleneck resource is, its utilization reaches 100% as you reach capacity. These are all intuitive results — common sense tells you this — but you can also use the equations of queuing theory to derive similar values. So when you run an open-loop load test, this is what you should expect. Now what about closed-loop systems? Once again, consider a system with a capacity of C requests per second. In a closed-loop system you vary the number of concurrent users; that is how you vary the load. You are no longer saying "send 10 requests per second into my system"; you are saying "have 10 concurrent users connected to my system".
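To make these open-loop expectations concrete, here is a minimal sketch in Python of the throughput and bottleneck utilization you should expect at each offered load. The capacity C = 100 requests per second is just the running example from the lecture, and these are back-of-the-envelope estimates, not exact queuing-theory results:

```python
# Expected open-loop behavior for a system with capacity C requests/sec.
# Rough back-of-the-envelope estimates, not exact queuing-theory curves.

C = 100.0  # running example: bottleneck service demand = 1/C = 10 ms

def expected_throughput(R, C=C):
    """Throughput tracks the offered load R until capacity, then flattens."""
    return min(R, C)

def expected_utilization(R, C=C):
    """Bottleneck utilization grows as R/C and saturates at 1 (i.e. 100%)."""
    return min(R / C, 1.0)

for R in [10, 20, 50, 100, 200]:
    print(f"R={R:>3}  throughput={expected_throughput(R):6.1f}  "
          f"utilization={expected_utilization(R):.2f}")
```

Running this prints the linear ramp followed by the flat region: throughput equals R up to 100, then stays at 100 while utilization pins at 1.0.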
What are these concurrent users doing? Each user sends a request, gets a response, and then thinks for some time; this is called the think time. This is how real users behave. If you want to measure the performance of your system with, say, 1000 users connected to it, you want to be as close to real user behavior as possible. Real users are not firing request after request: they send a request, see a web page, think for some time, and then click on something. The think time emulates this; it is the gap between getting a response and making the next request. Every user therefore has what is called a turnaround time: the gap between successive requests from that user, which is the response time of the system plus the think time. For example, if it takes 2 seconds for the system to respond and I think for another 8 seconds, then 2 plus 8 gives a turnaround time of 10 seconds; the time between my successive requests is this turnaround time. These are some of the parameters of a closed-loop load test. Now, just as for the open-loop test we saw what happens as you vary the request rate, here let us see what happens as you vary the number of concurrent users. Suppose the number of concurrent users is very small: you have this big system, and only one or two users are concurrently connected to it.
What happens then? If there is no think time — each user sends a request, the response comes, and the next request goes out immediately, continuously bombarding the system with no pause — then even a low number of users can generate enough load to saturate the system. For example, if a response comes back in 1 millisecond and I immediately send the next request, a small number of users can still fully load the system. But with non-zero think time, if you want to be more realistic, then with a very small number of users the bottleneck component will not be fully utilized and the system will not run at maximum capacity, because it can happen that all the users are thinking. Suppose every user, on getting a response, thinks for 10 seconds before sending the next request. Then both users have received their responses and are thinking, and the system has no work to do. If the number of users is very low, it can happen that all of them have gotten responses, are still thinking, and have not yet come back to the system; then your system is underutilized and your throughput is low. Therefore, when you run a load test with significant think time, you need a large enough value of N; you cannot use a very low value of N. What happens for large values of N? If there are many users in the system, the system will be fully utilized: even if some fraction of the users are thinking, others will be sending requests, because they are all at different phases. One user is waiting for a response while another is thinking.
Once you have a large enough number of users, someone or other will not be thinking and will be making a request into the system. Therefore you will have enough work to fully utilize your bottleneck component. That is why, as you increase N, the system throughput is initially low for small numbers of users but eventually flattens out, with the bottleneck fully utilized. And what happens to the response time? As you increase the number of users, the response time is low initially and then increases — but note that the increase is roughly linear in N. If you plot response time versus N, it stays low at first and then grows roughly linearly. Why? Because the queuing is due to the extra users. If your system has 10000 users, at most 10000 users can be waiting in the queue ahead of you; therefore the response time does not blow up the way it does in open-loop systems. It increases somewhat slowly, because there are only so many users in the system; you do not have unlimited load coming in, only a finite number of users who can generate only so much load. So in a closed-loop system the response time also increases — obviously, as you increase load, response time always increases — but not as drastically as in open-loop systems. The summary of all this: if N is too low, you are not loading the system; if N is too high, your throughput has flattened out and your response time keeps increasing. Why? Because if you need only 10 users to keep the bottleneck busy and you have 20 users in your system, the other 10 are just queuing up in front of the bottleneck.
So the question comes up: what is the optimum number? Is there an N*, an optimum number of concurrent users, that is just enough to load the system to maximum throughput but without too much queuing? Let us see how to compute this N* intuitively; of course there is a lot of math and theory behind it, but I will explain it in simple terms. Suppose you have a system with a capacity of 100 requests per second, as we have been considering. What does this mean? It means the bottleneck takes 0.01 seconds to service each request. Now suppose the turnaround time of the users in the system is 10 seconds: the gap between a user making one request and making the next is 10 seconds. Then what is happening at the bottleneck? The bottleneck handles a request from one user for 0.01 seconds, and there is a gap of 10 seconds before this user comes back to the bottleneck and consumes another 0.01 seconds. Can you all visualize this? Here is user 1: he came to the bottleneck, which serviced his request in 0.01 seconds; the turnaround time — response time plus think time — is 10 seconds; after 10 seconds the user comes back, and the bottleneck is busy for 0.01 seconds again. If there were only one user in the system, all of this gap — 10 seconds minus a small value — would be idle time for the bottleneck.
So how many such users do you need to keep this bottleneck fully occupied? If each user consumes 0.01 seconds out of every 10 seconds at the bottleneck, you need 10 divided by 0.01 users: the turnaround time divided by the service demand. With that many users, the bottleneck is fully utilized and the system runs at full capacity. Can you all see this? Each user takes a small slice of the bottleneck and comes back after one turnaround time; the number of slices you can fit is the turnaround time divided by the service demand of each user. In this example, 10 divided by 0.01 is 1000 users: with 1000 users, the bottleneck does the work of all of them before the first user comes back again, which ensures the bottleneck is fully utilized and you get the maximum possible throughput from the system. This is a rough formula for the optimum value N* in your load test. If the number of concurrent users is less than this optimum, your system is underutilized, your throughput is below capacity, and you are not correctly measuring the capacity of your system. If the number of users is much more than N*, the throughput has flattened out but the response times increase unnecessarily. So if you want to measure the throughput and response time of your system at the optimum point, you set the number of users in your load generator to N* according to this formula. This intuition about closed systems — calculating the optimum number of users to keep the bottleneck busy — is actually very powerful, and it can be used in many other scenarios. I will give you a couple of examples.
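This rule of thumb can be written down directly. Here is a small sketch in Python using the numbers from the example (0.01 s service demand at the bottleneck, 10 s user turnaround time); it is a rough estimate, not an exact queuing-theory result:

```python
def optimal_users(turnaround_time, service_demand):
    """N* = turnaround time / service demand: the number of concurrent
    users whose service slices exactly fill the bottleneck's idle gaps."""
    return turnaround_time / service_demand

# Lecture example: capacity 100 req/s -> service demand 0.01 s at the
# bottleneck; user turnaround time (response time + think time) = 10 s.
n_star = round(optimal_users(10.0, 0.01))
print(n_star)  # 1000 users keep the bottleneck fully busy
```

Setting the load generator's user count below this N* underloads the bottleneck; setting it far above only inflates response times.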
For example, consider a completely different scenario. We have seen the concept of a thread pool before: there is a master thread that puts incoming requests in a queue, and several worker threads take these requests from the queue and service them. We have seen this design before, in the multi-threaded TCP server: many new clients are connecting, the server keeps all of these requests in a queue, and each worker thread takes a client, handles the client's request, sends a response back, and so on, so that the master thread can focus on accepting new connections. Now, these worker threads can block multiple times: a worker thread reads the client's request, does some processing on it, then issues disk I/O to read the file the client requested, blocks, then runs on the CPU again and sends the response back. So a thread could be waiting a long time for the client, doing some work, then blocking on multiple disk I/Os — using the CPU for some time, blocking, using the CPU again, blocking. If you have very few worker threads, you cannot utilize all your CPU cores: you have an 8-core machine but only 2 or 3 worker threads, so your CPUs lie idle. So the question comes up: how many threads do you need in your thread pool to fully utilize your CPU? This is a very important question when you build a multi-threaded system with thread pools: you have to decide the size of the thread pool. How do you go about deciding this? Here is the intuition.
Suppose each thread performs 0.01 seconds of work on the CPU — some computation — and then waits for a long time, one second, before doing another 0.01 seconds of work for the next request, then waiting another second, and so on. This is how a thread in a thread pool behaves. Note that you can map this onto the earlier analysis: 0.01 seconds is the service demand on the CPU, and the one second is simply the turnaround time. The thread works at the bottleneck resource, the CPU, for 0.01 seconds, then comes back after one second, works for 0.01 seconds, comes back after another second. In this scenario one thread is of course not enough to keep a CPU core busy: the core does 0.01 seconds of work and then has nothing to do for the next second. How many such threads do you need to fully occupy this CPU core? The answer is, once again, the turnaround time divided by the service demand, which is 1 divided by 0.01, that is, 100. If you have 100 threads, each doing 0.01 seconds of work and then blocking for one second, these 100 threads together keep your CPU core fully occupied. We have used the same intuition that we used for deriving the number of concurrent users in a load test to derive how many threads to have in a thread pool; the thread pool is also a closed system. This analysis is very powerful, and every time you have a thread pool you should do it: each thread does so much work and waits so much time for I/O; therefore, how many threads should I have in order to fully utilize my CPU? Another scenario where the same thinking can be applied is sliding window protocols.
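The same calculation, sketched in Python for the thread-pool example. The 0.01 s CPU burst and roughly 1 s of blocking per request are the lecture's illustrative numbers, and treating the 1 s as the whole turnaround time is the same approximation the lecture makes:

```python
def pool_size_per_core(turnaround_time, cpu_time_per_request):
    """Threads needed to keep one CPU core busy: the turnaround time
    (mostly time spent blocked on I/O) divided by the CPU service
    demand of one request."""
    return round(turnaround_time / cpu_time_per_request)

# Each thread: 0.01 s on the CPU, then ~1 s blocked on client/disk I/O
# before its next CPU burst, so its turnaround time is about 1 second.
threads = pool_size_per_core(1.0, 0.01)
print(threads)  # about 100 threads keep one core occupied
```

On a multi-core machine you would scale this per-core figure by the number of cores the workers can use.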
If you remember, a few weeks back we saw that networking transport protocols use sliding windows. If you send just one packet, the packet reaches the receiver, the receiver sends back an acknowledgement, and only then do you send the next packet — you waste a lot of time waiting for acknowledgements. Here is your sender, here is your receiver: you send data, you get an ack back, and if all of this takes a long time, your sender sits idle and your network links are idle while you wait. That is not optimal; such stop-and-wait protocols are not optimal. So real-life transport protocols use sliding windows: you send a window of packets and then wait; when an ack comes back for the first packet, you send the next packet, when another ack comes back, the next, and so on. The question comes up: what is the optimum number of packets in a window? If you have too few packets in the window, you are unnecessarily waiting: the packets have gone out, you have no other work to do, and you are still waiting for acks. If you have too many packets — if you just dump a million packets into the network — some queue somewhere will overflow, at a router or at the NIC; packets will get dropped, and TCP will reduce its congestion window and slow down. All sorts of bad things will happen, and we do not want that. So what is the optimum number of packets in the window? Back in the networking part of the course we saw that the optimum is the bandwidth-delay product. Now let us see the reasoning for why this is the answer.
In a network of links, what is the bottleneck? The bottleneck is the slowest link. If there are multiple links, each with a different rate, the slowest link is where the queue builds up. Suppose one link can send 100 packets per second, the next 200, the next only 10, and the next again 100; then packets flow quickly until they pile up at the router at the head of the slow link, because many packets are coming in but that router can only send one packet every tenth of a second — at the rate of 10 packets per second, you can only send so much. So packets build up there. If the bandwidth of the bottleneck link is B packets per second, each packet takes 1/B seconds to transmit: if the bottleneck link's rate is 10 packets per second, each packet takes one tenth of a second. In some sense, you can think of this as the service demand. The bottleneck link takes 1/B time to send a packet, then 1/B time for the next packet, and so on. Once it transmits a packet, the packet reaches the receiver, an acknowledgement comes back, and the next packet from the sender arrives only after one RTT. So the turnaround time is basically the RTT: when the bottleneck link forwards a packet, the packet reaches the receiver, the ack comes back, and the sender sends the next packet; for this entire loop to complete — for this particular slot in the window to come around again — takes one RTT.
Therefore you can map this onto the same closed-loop system. How many packets in a window do you need to ensure that the bottleneck link stays fully utilized while acknowledgements are in flight? The turnaround time, RTT, divided by the service demand, 1/B. This works out to RTT times B, which is exactly the bandwidth-delay product. So the bandwidth-delay product is nothing but the number of packets you need in flight to keep your bottleneck link fully utilized. If you have fewer packets than this, the bottleneck link will have idle gaps: all the packets in the window have been sent, but the RTT is not yet over, so the next set of packets has not arrived and the link sits idle. If you send more packets than this, the bottleneck link can only forward so many, and the extra packets just pile up in the queue at the bottleneck router. So once again the same closed-loop intuition is useful, here to derive the size of a sliding window. Now that we have seen how to estimate how many packets will be queued up, let us understand queues a little more. In any system — take any open-loop system — how do you estimate how queues build up: how many packets or requests are queued at any component? For open-loop systems there is a law called Little's law that lets us estimate the size of the queues between components. This component sends requests to that component; how large does the queue between them grow? That is what Little's law tells us.
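Here is the window-size formula as a small Python sketch. The 10 packets/sec bottleneck is the lecture's example; the 2-second RTT is a hypothetical value chosen just to illustrate the arithmetic:

```python
def optimal_window(rtt, bottleneck_bandwidth):
    """Packets in flight = turnaround time / service demand
       = RTT / (1/B) = RTT * B, the bandwidth-delay product."""
    return rtt * bottleneck_bandwidth

# Lecture's bottleneck link: B = 10 packets/sec.
# Hypothetical RTT = 2 s, just to show the calculation.
print(optimal_window(2.0, 10.0))  # 20.0 packets keep the pipe full
```

A smaller window leaves the bottleneck link idle between bursts; a larger one only grows the queue at the bottleneck router.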
Note that for closed-loop systems this is easy: if you have 100 concurrent users, the maximum size of the queue is 100, because there are only so many people in the system. But with open-loop systems you do not know how many requests are in the system, so you use Little's law. Little's law tells you that N = R × W. This is a very famous, widely applicable law: no matter what your traffic arrival characteristics or service demand distributions are, it always applies; it is a very beautiful law. What it says is that the number of requests in your system — whether queued or being served, the total in the system — equals the rate at which requests arrive into your system multiplied by the average time a request spends in the system. If every request spends on average W time in the system, and requests arrive at rate R, then the number of requests in the system at any point in time is R × W. This is very intuitive. For example, requests arrive at 100 requests per second, and each request spends 2 seconds in the system. Over a period of 2 seconds, 200 requests have come in, and by the end of those 2 seconds the earliest of them start to leave. So at any point in time there will be about 100 × 2 = 200 requests queued up in your system: the requests that came in this second and the previous second are all still being processed, while the ones that came earlier have already left. So Little's law gives you an intuitive way to calculate how many requests you should expect to find sitting in your system, and this is used for many things. For example, suppose you measure the amount of time
your system takes to process a request; then you can calculate what the queue size should be. Say there is a shared buffer between different threads or different processes in a pipeline: what should the size of this buffer be? I see how many requests are coming in and how long a request spends in the system; therefore I know how many requests are in the system at any point in time, and I make that my queue size. So, given the waiting time in the system, you can calculate the queue size; and given the queue size, you can also estimate the waiting time: if this is my queue size and this is the rate of requests coming in, this is what I expect my waiting time to be — and if the measured waiting time is more than that, you ask why it is so high. It is a way for you to connect up many different metrics in your system and see whether they all add up. Little's law is very powerful, very intuitive, and widely applicable in real life. If you look around — people waiting at a ticket counter, or any queue — the rate at which people arrive times how long they spend in the queue gives you roughly how many people are in the queue. It is a very general formula. Finally, we come to the summary of what we have seen in this lecture. Any time you run a load test, you should do some simple back-of-the-envelope calculations to ensure that the results are reasonable. From basic queuing theory, this is what you expect: your throughput should increase linearly with increasing load and then flatten out, and the utilization of your hardware resource should increase with load and reach 100% at saturation.
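Little's law can be applied in either direction: from time in system to queue size, or from queue size back to expected waiting time. A small sketch in Python, using the lecture's numbers:

```python
def requests_in_system(arrival_rate, avg_time_in_system):
    """Little's law: N = R * W, the number of requests queued or in
    service at any point in time."""
    return arrival_rate * avg_time_in_system

def expected_time_in_system(queue_size, arrival_rate):
    """Little's law rearranged: W = N / R, estimated from a measured
    queue size and arrival rate."""
    return queue_size / arrival_rate

# Lecture example: 100 req/s arriving, each spending 2 s in the system.
print(requests_in_system(100, 2))        # 200 requests in the system
# Conversely: a queue of 200 with 100 req/s arriving implies ~2 s each.
print(expected_time_in_system(200, 100)) # 2.0 seconds
```

The first form sizes your buffers; the second is a sanity check — if requests spend much longer in the system than N/R predicts, time is being lost somewhere you have not accounted for.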
Your response times will be low initially but will start to increase as you reach and go beyond capacity. This increase is much more rapid for open-loop systems and much more controlled for closed-loop systems. Queues will build up at various components; any time you see a queue building up between components, measure the queue size and check whether it can be justified using Little's law, and if you do not have enough buffer space to handle the expected queuing, increase your buffer space. So use Little's law to check all the queue sizes. Then check whether the response time is roughly equal to the processing time plus the queuing delays: you know what the queuing delay should be from Little's law, and you know the processing time from your application. Is the sum of these two equal to the response time, or is time being wasted at some other place I do not know about? All of these things you should check. Then check that there are no failures when the system is under capacity: when you are operating below capacity, all requests should be satisfied and there is no reason for failures. After you cross the capacity of the system, you should see the queue building up at the bottleneck component and errors occurring due to this queue build-up. If you see errors occurring at some other component that is not fully utilized, you cannot say the errors are due to overload. Just because a server crashes does not mean it is due to overload; you should check whether the crash is due to excessive queue build-up at the bottleneck component or not. All of these are sanity checks you can do on your load test results, and once you do them, you can also tweak your system configuration a little if something does not make sense.
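One of these sanity checks — response time ≈ processing time + queuing delay, with the queuing delay itself estimated from Little's law — can be sketched as follows. The 20% tolerance and all the sample numbers here are hypothetical, chosen only to illustrate the check:

```python
def queuing_delay(queue_length, arrival_rate):
    """Little's law rearranged: average wait in a queue of length N
    with arrivals at rate R is roughly N / R."""
    return queue_length / arrival_rate

def response_time_is_sane(measured, service_time, queue_length,
                          arrival_rate, tolerance=0.2):
    """Does the measured response time match service time plus the
    queuing delay predicted by Little's law, within a tolerance?"""
    expected = service_time + queuing_delay(queue_length, arrival_rate)
    return abs(measured - expected) <= tolerance * expected

# Hypothetical: 10 ms service time, a queue of 50 requests with 100
# req/s arriving -> ~0.5 s queuing delay, ~0.51 s expected response.
print(response_time_is_sane(0.52, 0.01, 50, 100))  # True: it adds up
print(response_time_is_sane(2.00, 0.01, 50, 100))  # False: investigate
```

A failed check means time is being lost somewhere the model does not account for — exactly the kind of discrepancy these back-of-the-envelope calculations are meant to surface.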
For example, at saturation you expect that some hardware resource is the bottleneck — some hardware resource is fully utilized, and that is the reason you cannot do any more work. If processing a request takes 10 milliseconds, you know your database cannot handle more than 100 requests per second, and expecting a throughput of 200 requests per second is not justified. But if your throughput flattens out at only 50 requests per second and the CPU is not fully utilized either, then you know something is wrong; this is not correct behavior. This can happen for many reasons: for example, you do not have enough threads in your thread pool — we have seen this. If there are not enough threads in the pool and all of them are somehow waiting for I/O, your CPU is idle. In such cases the bottleneck CPU utilization is low, the throughput is low, and you are not measuring the correct capacity. You should increase the number of threads, and increase the various buffer and queue sizes in accordance with Little's law; we have derived all of these. Increase all of these system resources so that you are fully utilizing your hardware. If some software resource, like the maximum number of file descriptors, is exhausted, increase that limit; and if threads are waiting for locks, see if you can reduce unnecessary locking. All of these are system-level software changes you can make to improve your system's performance. You should always make these changes first and ensure that the final bottleneck is some hardware resource.
If your CPU or some other hardware resource is fully utilized, then you can say: fine, I have extracted the maximum performance possible from my system; I cannot do anything more. But if your system performance is low because you do not have enough threads or enough file descriptors, these are trivial issues: simply increase that software resource, and as more threads come in, your CPU will be fully utilized. That is why you should use insights from queuing theory to check: do my performance numbers make sense? Are my queue sizes correct? Is the number of threads in my thread pool correct? Are there enough file descriptors to handle all the waiting requests in my queue? All of these you can connect, one way or another, to Little's law and the closed-loop intuition, to ensure that everything is sized correctly — so that in the end some hardware resource is fully utilized and your system has truly reached its performance bottleneck. That is all I have for this lecture. I have shown you quite a few ways of doing simple back-of-the-envelope estimates to understand your performance measurements and check whether they make sense, and some simple ways to tune system parameters — threads in a thread pool, buffer sizes, queue sizes — in order to optimize your system's performance. As a practical exercise, if you have tried running a load test with Apache JMeter and a web server, use some of the techniques we have studied in this lecture, for example, to make sense of the performance measurements you have seen. Thank you all; that is all I have for this lecture. We will continue on the topic of performance in the next lecture. Thanks.