Welcome to the first lecture of this week in the course Design and Engineering of Computer Systems. This week we start a new topic: performance engineering. Let us get started. So far in this course we have seen how to design, develop, and build computer systems; this week we will study how to engineer the performance of these systems. Right at the beginning of the course we saw that a system must of course satisfy certain functional requirements, that is, it has to do the job it is built for. In addition, it also has to give good performance, which is an example of a non-functional requirement.

The topics related to performance that we will study this week are as follows. First, we will study how to measure performance. Given a computer system, what does it mean to measure its performance? What will you measure, how will you measure it, which parameters will you vary, and which output metrics will you observe? We will cover these basics of what performance is and how to measure it. Next, we will see how to do a simple performance analysis: you have measured performance and obtained some numbers, but do the results make sense? How do you estimate what the right value should be, and are you close to it? And then the next big chunk is how to optimize the performance of a computer system: identify the performance bottlenecks, understand why performance is poor (if it is poor) and which component is responsible, and then optimize that component. We will see various techniques to optimize performance, both within a single machine and across the entire system. One thing I would like to point out is that performance engineering is a full-fledged course by itself. In this one week I will only describe the high-level concepts; if you are interested, you should go ahead and take a complete course to understand the subject fully.

Rather than talk about performance at a high level, let us start with an example application. We have been looking at multi-tier applications for a while now, and we saw many examples last week as well. A typical multi-tier application looks like this: multiple front-end web servers receive traffic from a large number of clients; the front-end servers talk to various application servers to handle different types of requests; the application servers may contact various databases; and at the end of all this a response is sent back to the client. Let us keep this simple model in mind.

Now, how do you measure the incoming traffic to this system? If we say the system is under a lot of load, what does that mean? This load, or incoming traffic, can be measured in multiple ways. For example, you can count the number of concurrent clients: if this is an e-commerce website, you can say that a million clients are connected to it at this moment, trying to do online transactions. That is one way of measuring load.
The other way of measuring the load, or incoming traffic, is the rate of requests: the system is receiving, say, a million requests per second, or 10,000 requests per second. These requests can be of various types, and the mix of request types can also be specified. For example, the system may be getting 10,000 requests per second to search for products and 20,000 requests per second to buy products. All of these describe the load, or incoming traffic, into the system.

Then, given that all this traffic is coming in, how do you measure the performance of the system? There are roughly two main metrics. One is called throughput: the amount of work done per second, that is, the number of requests per second that are handled successfully. Suppose your system is able to process 10,000 search requests per second, or 20,000 online purchases per second; that is the throughput of the system. The other metric is the response time, which also goes by names like latency, delay, or RTT (round trip time); they all mean more or less the same thing. It measures the time taken by the system: if I click a button to buy, or to search for a product, how much time does the system take to go through all its components, do the processing, and return a response to me? Note that these are two different metrics. A system can have very high throughput but take a long time to respond, or very low throughput but respond quickly; any combination of the two is possible. So in general we measure every system by its throughput (how much work it does per second) and by how fast it does that work (when a request comes in, how quickly a response is returned).

Now, why does a system have a performance issue? In any system there is always a performance bottleneck, so let us understand what that means. All of you know intuitively what a bottleneck is: if a system has many components, the slowest component is the bottleneck, and it limits the performance of the whole system. For example, suppose a highway consists of a very wide road, then a very narrow road through which very little traffic can pass, and then another wide road. All the traffic will get jammed at the entry to the narrow road, and the traffic on the entire highway can only flow as fast as the narrow road allows. If the narrow road lets through only 10 cars per second, then only that much traffic can flow on the other roads as well; even if 20 cars per second arrive, they will simply pile up at the entrance. So the performance of a system is limited by its slowest component, and that slowest component is called the bottleneck.
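To make these two metrics concrete, here is a minimal Python sketch of measuring them against some running server; the URL and the request count are placeholders I have chosen, not anything prescribed in the lecture. Note that a single sequential client like this makes throughput simply the inverse of the average response time; to see the two metrics diverge you need concurrent requests, which the load-generator sketches later in this lecture allow.

```python
import time
import urllib.request

URL = "http://localhost:8080/"  # hypothetical server under test
N = 100                          # number of requests to send

latencies = []
start = time.time()
for _ in range(N):
    t0 = time.time()
    urllib.request.urlopen(URL, timeout=5).read()  # one request-response round trip
    latencies.append(time.time() - t0)
elapsed = time.time() - start

# Throughput: requests completed per second of wall-clock time.
print("throughput:", N / elapsed, "requests/sec")
# Response time: average time from sending a request to receiving its reply.
print("avg response time:", sum(latencies) / N * 1000, "ms")
```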
Take our example web application: you have a front end, multiple application servers, and a database; a user's request goes through the front end, the application server, and the database, and then a response is sent back. Suppose the front end can process 1000 requests per second, the application server 5000 requests per second, and the database 100 requests per second. What will the maximum throughput of the system be? If a lot of load comes in, the system is ultimately slowed down by its slowest component, so it can only do 100 requests per second. That maximum throughput the system can give is called its capacity, and it is limited by the slowest component in the system.

And when the system is handling the maximum load it can, 100 requests per second, what happens? The slowest component, the database, will be fully occupied, while the front end and the application server are not that busy: they could do a lot more work, but they are only doing 100 requests per second. That is why the database is the performance bottleneck, and if any more load comes in, the system will be overloaded, because the database cannot do any more work than that.

The other piece of terminology I would like to introduce is service demand: roughly, how long each request takes to serve at a component. Why can this database handle only 100 requests per second? Most probably because each request takes a hundredth of a second, that is, 10 milliseconds, to process. Note that I am using very approximate terminology here. If you study this formally, with all the mathematics, you will use distributions: each request does not take exactly a fixed amount of time, there is a distribution around the average. But in these next few lectures I am taking a very simplistic view, so that you understand the concepts intuitively without a lot of rigorous math; please keep that in mind. So if a database is handling 100 requests per second, most likely each request takes a hundredth of a second, which is why it can do 100 of them in a second, no more and no less. This time taken to service each request is called the service demand of the component. And the capacity of a component, how much it can process, is usually the inverse of its service demand: if each request takes 10 milliseconds, you can do 100 requests per second. The response time of your system will be at least this 10 milliseconds, plus whatever time is needed at the application server and the front end; adding all of these gives the response time of your system.

So this is what I would like you to visualize when we say a system has a performance bottleneck: there is a pipeline of many components, the slowest component has a certain maximum throughput, that becomes the maximum throughput the entire system can give you, and that slowest component is your performance bottleneck and determines the capacity of your system.
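To make the arithmetic explicit, here is a tiny sketch using the per-component numbers from the example; the component names in the code are just my own labels for the tiers described above.

```python
# Per-component capacities from the example, in requests/second.
capacities = {"front_end": 1000, "app_server": 5000, "database": 100}

# The system's capacity is limited by its slowest component (the bottleneck).
bottleneck = min(capacities, key=capacities.get)
capacity = capacities[bottleneck]
print(bottleneck, "is the bottleneck; system capacity =", capacity, "req/s")

# Service demand: the time each request spends being processed at a
# component, approximately the inverse of that component's capacity.
service_demand_ms = {c: 1000 / r for c, r in capacities.items()}
print("database service demand =", service_demand_ms["database"], "ms")

# Under light load (no queueing), the end-to-end response time is at
# least the sum of the service demands along the request's path.
print("minimum response time =", sum(service_demand_ms.values()), "ms")
```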
And this capacity is dictated by the service demand: how long the bottleneck component takes to process a request. All of this assumes, of course, that the bottleneck is inside the system. Sometimes a network switch between the client and the server becomes the bottleneck instead: if some switch on the path forwards only 50 requests per second, the question of the database becoming the bottleneck never arises, because the database never even sees more than 50 requests per second. So the bottleneck can be elsewhere too: in the network, in the system, or even at the client; a performance bottleneck can be anywhere.

Now that we have understood the basics of performance and bottlenecks, let us see how systems actually perform: if we measure the performance of a system, what kind of values will we get, and how do we go about measuring? Consider the same example of a web application, with its many components, that has a capacity of 100 requests per second. Suppose only 10 requests per second are coming into the system. What will the throughput be? The system will handle all 10 requests per second; it could do up to 100, but if you only give it 10, the throughput will be 10. So the throughput equals the incoming traffic as long as the traffic is fairly low, and the response time, if you measure it, will also be low, because no component is overloaded and everything is comfortably doing its job.

Now let us try to visualize in a graph what happens as the load keeps increasing. Initially, while the load is low, the throughput tracks the load: if the incoming load is 10, the throughput is 10; if the load is 20, the throughput is 20; and so on. Even as you get close to the system's capacity, the throughput still equals the load. But what happens to the response time? Initially it is low: there is very little work, and all the components return responses quickly. But as you keep increasing the load, the bottleneck component starts seeing 90, then 95 requests per second, and slight congestion starts to build up. Requests sometimes arrive in a bunch, and when they arrive the database is already busy with previous work, so they get queued up. As congestion builds, the queues grow, queueing delay increases, and the response time slowly starts to rise. So that is the behavior of the throughput and the response time.
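Here is a small sketch that reproduces the shape of these two curves. The throughput line follows directly from the lecture (throughput tracks load up to capacity, then flattens); the response-time formula is an assumption on my part, the standard single-queue (M/M/1) approximation R = S / (1 - load/capacity), which the lecture does not derive, but it shows the same qualitative behavior: low and flat at light load, shooting up near capacity.

```python
CAPACITY = 100.0        # requests/second at the bottleneck
SERVICE_DEMAND = 0.010  # seconds per request at the bottleneck (10 ms)

def throughput(load):
    # Throughput equals the offered load until it hits capacity, then flattens.
    return min(load, CAPACITY)

def response_time(load):
    # Assumed M/M/1 approximation: queueing delay grows as utilization nears 1.
    utilization = load / CAPACITY
    if utilization >= 1.0:
        return float("inf")  # overload: the queue grows without bound
    return SERVICE_DEMAND / (1.0 - utilization)

for load in (10, 50, 90, 95, 99, 150):
    rt = response_time(load)
    rt_str = "unbounded" if rt == float("inf") else f"{rt * 1000:.0f} ms"
    print(f"load={load:>3} req/s  throughput={throughput(load):>5.0f} req/s  response={rt_str}")
```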
As you approach capacity, the throughput is still increasing, but the response time is also increasing. Once the load exceeds the capacity of the system, say you send a thousand requests per second into it, what will the throughput be? The throughput flattens out. It was rising earlier: at a load of 10 the throughput was 10, at 20 it was 20, at 90 it was 90. But if 200 requests per second come in, the throughput will not be 200, because the system can only do a maximum of 100 requests per second. So the throughput flattens out at the maximum, and the response time of course continues to increase: as the system becomes more and more overloaded, a queue just keeps piling up in front of the bottleneck component. This is roughly the performance you will observe as you increase the load on your system.

Now let us understand this overload a little more. Suppose the traffic coming into the system greatly exceeds its capacity: the capacity is 100 requests per second and, say, a million requests per second are arriving. Then the throughput of the system stays flat; it cannot increase beyond capacity. If anything, if things get very bad and the system starts thrashing or crashing, the throughput may even fall, but it will never exceed capacity. So the flat throughput is a given. What else happens at overload? The response time keeps increasing: the queue at the bottleneck component builds and builds, because so many requests are coming in while this component can only do 100 per second, so every new request ends up waiting a long time in the queue.

When these queues build up, you can start seeing error messages. If the web server has a long queue and realizes it has no time to process these requests, an HTTP server can send back a special response saying "service unavailable": if the server sees that it cannot handle all these requests, that there is no CPU time or other resource available, it can send back an error message saying, sorry, come back later. And sometimes the application does not even have time to politely send you such an error message: packets are sitting in the socket buffer, and the application has not even had time to read them and send a polite response back. The packets just lie in some queue, not getting processed, not getting acknowledged, so TCP will time out, connections will fail, system calls will fail, and you will start to see even worse errors.
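As an illustration of the "polite" failure mode, here is a minimal sketch of an HTTP server, using Python's built-in http.server module, that returns 503 Service Unavailable once too many requests are in flight. The threshold of 10 in-flight requests and the 10 ms of simulated work per request are arbitrary values chosen for illustration.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 10  # arbitrary overload threshold, for illustration only
in_flight = 0
lock = threading.Lock()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global in_flight
        with lock:
            overloaded = in_flight >= MAX_IN_FLIGHT
            if not overloaded:
                in_flight += 1
        if overloaded:
            # The "polite" overload response: HTTP 503 Service Unavailable.
            self.send_response(503)
            self.end_headers()
            self.wfile.write(b"sorry, come back later\n")
            return
        try:
            time.sleep(0.01)  # pretend each request needs 10 ms of work
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")
        finally:
            with lock:
                in_flight -= 1

ThreadingHTTPServer(("localhost", 8080), Handler).serve_forever()
```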
A "service unavailable" error is at least graceful: the application has enough time to say sorry to you. Under even more overload, the application will not even have time to read a request and send a sorry response back. TCP itself will fail, failures will occur at the kernel level, system calls will fail, or some switch, network card, device driver, or queue will overflow and packets will start getting dropped; all sorts of bad things will happen. All of this is what you see as server overload, for example when a big sale opens on an e-commerce website, or when a lot of people are booking tickets at the same time on a travel reservation website. So much load is coming in that the server is not even able to send a sorry response back; at the browser, whenever you try to read from or write to the socket, errors occur because the connection itself has failed, and you perceive this as a server crash. So the result of overload is very high response times if your requests complete at all, since you are waiting in queues everywhere, and sometimes you are not even that lucky: you get no response at all and see various errors. The reason for all of this is simply the queue building up at the bottleneck component, and the system being unable to clear that queue fast enough.

Given all this, what is the ideal operating point of a system? You want to operate it just below the maximum capacity. The throughput increases with load and then flattens; you want to operate just at that knee. You do not want to operate well below it, because there is so much more work the system could do that you are not using; but you do not want to operate beyond it either, because then there will be a lot of overload and errors. The point just at the maximum capacity is called the saturation point. Before it, response times are still low; as you reach saturation, response times start to increase sharply. So the ideal load is just before the response times increase too much, while the throughput is high enough. And once you hit saturation, some hardware resource at the bottleneck will be fully, 100% utilized. For example, if the component does a lot of CPU computation per request, maybe all the CPU cores have reached 100% utilization and there is no spare CPU available to do more work; or if the system does a lot of IO, the hard disk is running at full capacity. Some hardware resource will be fully utilized, because if it were not, the system could have done more work: if the CPU were free, the system could have handled more requests. It hits a bottleneck precisely because some resource is fully consumed.
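One simple way to watch for this during a load test is to monitor per-core CPU utilization on the server while the load ramps up. Here is a minimal sketch; it assumes the third-party psutil package is installed (pip install psutil), which is my choice of tool, not something the lecture prescribes.

```python
import psutil  # third-party package, assumed installed

while True:
    # Per-core CPU utilization, sampled over the last second.
    per_core = psutil.cpu_percent(interval=1, percpu=True)
    overall = sum(per_core) / len(per_core)
    print("cores:", per_core, " overall: %.1f%%" % overall)
    # If throughput has flattened but no core is near 100%, suspect a
    # software bottleneck (locks, descriptor limits, config caps) rather
    # than a hardware one.
```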
So how do you improve system capacity? Either you increase the hardware resources (if you have 4 CPU cores, give it 8 and it will be able to do more work, so its capacity will increase), or, for a given amount of hardware, you optimize your system: write better algorithms that do not consume as much CPU, or cache things so that you do not hit the disk as much. These are the only two ways to improve capacity.

One thing to keep in mind is that sometimes the bottleneck is due to a software issue, with no hardware resource fully utilized. You see the system throughput flattening out, but the CPU utilization, or the utilization of any hardware resource, is not at 100%. Then what is happening; why is the throughput flattening? It could be some software limit. For example, your process may be allowed to open only, say, 1000 files, and it is trying to open more than that; the OS will stop it at the limit on the number of open files, the number of file descriptors. Or all your threads are waiting for a lock: you have 4 CPU cores, but every thread wants the same lock, so only one of them can run at a time on those 4 cores. There can be design issues, software issues, or system configuration limits due to which, even though all the hardware resources are available, the throughput still flattens out and the system cannot do more work even when more load is offered. Such issues should be fixed by rewriting the code or tuning the OS: if your performance is limited by a software issue, tweak the software to make the issue go away; if it is limited by a hardware resource, either upgrade the hardware or optimize the software to use less of it. So whenever your performance stops increasing beyond a point, you have to find out what exactly the issue is.

Now, a summary of what we have seen. A given computer system has a certain capacity; each component has certain configurations and service demands, which decide that capacity. To measure performance, you vary the input parameters on which performance depends: you increase the number of concurrent users, increase the rate of incoming requests, and change the mix of request types (more easy requests, more hard requests); in short, you vary the incoming load. And then you measure the various performance metrics. You measure throughput: as you increase the load, the throughput increases and then flattens, as we have seen. You measure response time. And you also measure the utilization of resources at the components, especially the bottleneck component. Why? Because as the throughput flattens out, you want to check whether some resource at the bottleneck has reached 100% utilization, and if not, why the throughput is flattening anyway. All of these things you want to analyze.
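As a concrete example of such a configuration limit, here is how a process can inspect, and raise up to the hard limit, its own open-file-descriptor limit using Python's standard resource module (Unix only):

```python
import resource

# Each process has a soft and a hard limit on open file descriptors;
# hitting the soft limit makes open()/socket() fail even when CPU is idle.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft =", soft, "hard =", hard)

# A process may raise its own soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print("soft limit raised to", hard)
```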
So these are the things you will measure: throughput, response time, utilization, and of course errors. When overload occurs, what errors appear? And if there is no overload, why are errors still occurring? You will measure all of these. This whole process of varying the input and measuring the output metrics is called load testing a system. Once you have built a system, you do a load test: you vary the input parameters that determine performance and measure the various performance metrics.

One thing I would like to point out is that all these metrics are usually averaged, because in any one second your throughput might be 105 requests per second and in some other second 95 requests per second. It varies slightly, so you average over some time window. You can also measure per link, for example from the front end to the app server, or from the app server to the database: on each link you can measure latencies, throughputs, and errors, and you can also measure everything end to end. These are the different metrics and the different ways you can measure them, and all of this is called load testing a system. Once you build a system, you give it varying amounts of input load and measure the throughput, latency, and other performance metrics.

Now, there are two ways of doing a load test. There are of course many variations, but roughly these two. One is called open-loop load testing: here you vary the rate of traffic. You give the system, say, 10 requests per second, then 100 requests per second, then 500 requests per second; you vary the rate of incoming traffic into your system. The other is called closed-loop load testing: here you vary the number of concurrent users who are using the system. Both are valid; there is no one right way to do a load test, just different ways of varying the load on your system.

How do you do an open-loop load test? You pick some value n for the request rate and pump n requests per second into the system, for increasing values of n: start small, keep increasing the rate, and see where the throughput flattens out. To implement it, you can simply have a software program that fires a request every 1/n seconds (for 10 requests per second, every 100 milliseconds), possibly with some randomization. In a closed-loop test you vary the number of concurrent users, and each user sends a request, gets a response, and then typically waits for some time: this is called the think time. The user sends a request, gets a response, thinks (pauses) for some time, then sends the next request. That is how closed-loop testing is usually done, because it emulates what real users would do in a real system: send a request, get a response, think, send the next request. A sketch of both styles follows.
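Here is a minimal sketch of both styles of load generator using Python threads, following the descriptions above; the URL, rates, user counts, and think time are placeholder values I have chosen.

```python
import threading
import time
import urllib.request

URL = "http://localhost:8080/"  # hypothetical system under test

def fetch():
    try:
        urllib.request.urlopen(URL, timeout=5).read()
    except OSError:
        pass  # under overload, errors are expected

# Open-loop: fire requests at a fixed rate, one every 1/rate seconds,
# regardless of whether earlier requests have completed.
def open_loop(rate_per_sec, duration_sec):
    interval = 1.0 / rate_per_sec
    end = time.time() + duration_sec
    while time.time() < end:
        threading.Thread(target=fetch, daemon=True).start()
        time.sleep(interval)

# Closed-loop: each user sends a request, waits for the response,
# "thinks" for a while, then sends the next request.
def closed_user(think_time_sec, duration_sec):
    end = time.time() + duration_sec
    while time.time() < end:
        fetch()
        time.sleep(think_time_sec)

def closed_loop(num_users, think_time_sec, duration_sec):
    users = [threading.Thread(target=closed_user,
                              args=(think_time_sec, duration_sec))
             for _ in range(num_users)]
    for u in users:
        u.start()
    for u in users:
        u.join()

open_loop(rate_per_sec=50, duration_sec=10)
closed_loop(num_users=20, think_time_sec=0.5, duration_sec=10)
```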
You emulate some number n of such users: you can write a program with n threads, each making a request, getting a response, thinking, and repeating. Both of these are valid ways of varying the input load into a system. Note that open-loop testing usually leads to higher load on the system, because you can increase the request rate and overload the system easily: requests keep arriving regardless of whether earlier ones have finished. In a closed-loop test there is a limit: with only n concurrent users, there is a bound on how much load can come in. So closed-loop load tests are usually slightly easier on the system and do not crash it as often, but both are valid ways of testing.

In general, there are specific pieces of software, or even hardware (you can buy a hardware box), whose only purpose is to generate load; these are called load generators. You get a program, or a hardware appliance, that you connect to your system, and its only job is to generate load, in either an open-loop or a closed-loop fashion, so that you can run a performance test and see how your system performs. These load generators provide various knobs: you can configure the request rate (for an open-loop generator), the number of concurrent users, the think time, and the mix of different kinds of requests, and thereby send a certain amount of load into your system.

So, finally, how do you run a load test? You have your system under test, with its many components, deployed somewhere, and you have a load generator, either a software program running on a separate server or a hardware box, that generates load. You connect the two; the load generator keeps sending load into your system at some varying rate; your system handles that load; and you measure the various performance metrics: throughput, latency, errors, utilizations. You then plot graphs of these performance metrics versus the incoming load; those graphs are the result of your load test. And while doing a load test you must take some care. For example, you want to ensure that the load generator itself is not the bottleneck: if your system can handle 1000 requests per second but your load generator can only generate 500 requests per second, that is not good enough to test the performance of your system completely.
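And here is a sketch of the measurement harness itself: sweep the offered load and record the throughput and average latency at each level, which gives you the points to plot as the graphs described above. As a simplification, this version uses a single paced client that waits for each response, so it cannot truly overload the server; a real harness would drive the load with concurrent threads, like the open-loop generator shown earlier. The URL and the rates are again placeholders.

```python
import time
import urllib.request

URL = "http://localhost:8080/"  # hypothetical system under test

def measure_at_rate(rate_per_sec, duration_sec=10):
    """Paced single-client load at a fixed rate; returns (throughput, avg latency)."""
    latencies, done = [], 0
    end = time.time() + duration_sec
    while time.time() < end:
        t0 = time.time()
        try:
            urllib.request.urlopen(URL, timeout=5).read()
            latencies.append(time.time() - t0)
            done += 1
        except OSError:
            pass  # in practice, count and report errors separately
        # Sleep off whatever remains of this request's 1/rate time slot.
        time.sleep(max(0.0, 1.0 / rate_per_sec - (time.time() - t0)))
    return done / duration_sec, sum(latencies) / max(1, len(latencies))

# One row per load level: these are the points for the throughput-vs-load
# and response-time-vs-load graphs.
for rate in (10, 20, 50, 100, 200):
    tput, lat = measure_at_rate(rate)
    print(f"offered={rate:>4}/s  throughput={tput:6.1f}/s  latency={lat * 1000:6.1f} ms")
```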
So you have to ensure that neither the load generator nor the network connecting it to the system is the bottleneck; you want your system to be fully loaded, so the bottleneck should be inside the system under test. You also usually want to eliminate sources of non-determinism. You do not want excuses like "this thread happened to run on that core, so it performed differently" or "these threads behaved differently this time, so they performed badly". You want as much determinism and reproducibility as possible: fix all the parameters, fix the number of threads and the number of CPU cores, keep everything constant, and then run the load test, so that you can fully and correctly measure the performance of your system.

And after the load test, what do you do? You analyze the performance and then do some optimization, some performance engineering. First you check whether the performance you measured makes sense. If it does, you identify a bottleneck, optimize it, and repeat the load test. This is an iterative process: you keep optimizing your system, measuring performance again, looking at the results, optimizing, measuring, and so on, until you are satisfied, until you can say your system has good enough performance to deploy in the field.

So this is a summary of what we have seen in today's lecture. We have understood the basics of performance: what the input load to a system is and how you can vary it, what performance metrics you can observe, how to measure performance, how to run a load test, and what to keep in mind while doing all of this. The basic concepts pertaining to performance measurement are what we have covered in this lecture.

As a programming exercise, you can actually run a simple load test yourself. You can install the Apache web server along with a load generator tool (one example is ab, the ApacheBench benchmarking tool that comes with Apache) that can send HTTP requests to this web server at whatever rate you want, and you can measure the performance of the web server. Try using these tools on your own to understand what a load generator is and how to run a load test. That is all we have for this lecture; let us continue this discussion in the next lecture. Thank you.