Welcome to the 23rd lecture in the course Design and Engineering of Computer Systems. In this lecture we are going to understand in a little more detail what happens at the transport layer of the internet. So let us get started.
The IP layer that we saw in the previous lecture provides host-to-host delivery of IP datagrams. That is, one host on the internet sends an IP datagram to some other host, and all the IP routers along the path look at the destination IP address and forward the datagram towards the other end host. We have seen the mechanisms for how this is done: routers run routing protocols, and between two IP routers there is a link layer. We have also seen that the IP layer provides no reliability guarantees: if a router can forward an IP datagram it forwards it, otherwise it can drop it.
The transport layer runs above the IP layer. It takes this unreliable host-to-host delivery of IP datagrams and builds other mechanisms on top of it in order to do process-to-process delivery of messages, that is, between one process running on one host and another process running on another host. In addition to this, it tries to provide other guarantees such as in-order delivery, reliable delivery and so on. Of course there are many different protocols at the transport layer. TCP provides all of these guarantees. UDP is a much simpler transport protocol that deals only with process-to-process delivery and provides no other guarantees.
And you have other transport layer protocols like SCTP, which provides multiple streams over a connection. TCP is just one in-order reliable stream, while SCTP is multiple streams, but we will not study SCTP in this lecture. So there are many transport layer protocols, and on top of the transport layer you have your application layer, and an application can choose the transport layer protocol it wants. If you open a TCP socket you get TCP guarantees; if you open a UDP socket you get UDP guarantees.
Note that the transport layer runs only at the end hosts. This is the end-to-end argument. The end hosts, the client and the server, run the TCP processing. The routers in between are just dumb routers that forward IP datagrams; they do not know anything about these end-to-end mechanisms. The transport layer runs inside the operating system. As we have seen, when you write something into a socket, the transport layer processing is done in the OS and then the packet is sent over the network. Similarly, when a packet is received, the OS does all the transport layer processing and then gives only the application layer message to the application that is reading from the socket.
Now let us understand in a little more detail the TCP protocol, which is the most widely used transport layer protocol on the internet. So what is TCP? TCP is a connection-based protocol. If you recollect TCP sockets, you connect two TCP sockets with each other. So what is this connection? When you connect, an end-to-end connection is established between the client and the server, even though no router along the path is aware of this connection. To set up this end-to-end connection you have what is called the three-way TCP handshake between your client and your server: the client and server exchange some messages in order to establish the connection. How is this done?
The server has opened a listen socket and is waiting for new connections, and the client makes a connect system call. At this point the client machine sends a special packet to the server called the SYN packet, saying "hello, I want to talk to you". Then the transport layer code in the operating system at the server responds with a SYN-ACK, saying "okay, I have heard your request, I am here, let us talk". Then the client once again sends an ACK, and this completes the connection setup. At this point the accept system call returns at the server, the connect system call returns at the client, and the client and server sockets are connected with each other. This is called the three-way TCP handshake.
Why are three messages needed? The client has to call the server, the server has to say "okay, I hear you", and when the server sends its message, the server also needs to know that the client got it. If you think about it, this is common sense between two people talking: "Hello, are you there?" "Yes, I am there, can you hear me?" "Yes, I can hear you." With this the connection is established, and both hosts can now confidently talk to each other.
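To make the listen, accept and connect steps concrete, here is a minimal Python sketch that runs a server and a client in one process over the loopback interface. The function name `run_handshake_demo`, the loopback address and the message contents are assumptions made for illustration. The SYN, SYN-ACK and ACK packets themselves are exchanged by the operating system and are not visible to this code; the sketch only shows the system calls that trigger and complete the handshake.

```python
import socket
import threading


def run_handshake_demo() -> bytes:
    # Server side: a listening socket waits for new connections.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = []

    def serve() -> None:
        # accept() returns a connected socket once the OS has
        # completed the three-way handshake for a client.
        conn, _addr = server.accept()
        with conn:
            chunks = []
            while True:
                data = conn.recv(1024)
                if not data:  # empty read: the client closed its side
                    break
                chunks.append(data)
            received.append(b"".join(chunks))

    worker = threading.Thread(target=serve)
    worker.start()

    # Client side: connect() makes the OS send the SYN and returns
    # after the SYN-ACK has arrived and the final ACK has been sent.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    client.sendall(b"hello")
    client.close()  # starts the connection teardown

    worker.join()
    server.close()
    return received[0]


result = run_handshake_demo()
print(result)
```

Running a packet capture tool on the loopback interface while this executes is one way to observe the handshake packets that the system calls produce.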
When the data transfer is done, you have a similar handshake, using what are called FIN and FIN-ACK messages from each side, in order to tear down the connection. Note that UDP has no such concept of a connection: any socket can send to any socket, and you just send packets directly. But with TCP, since you are maintaining reliability and so on, you do this three-way handshake in order to connect two sockets.
Once the connection setup is done, the client and server start sending transport layer segments to each other. What is a segment? A segment is nothing but a piece of the application layer message plus some TCP headers added to it. The message that is written into the socket is first split into smaller chunks, and the maximum chunk size is called the maximum segment size. If the application writes a large message, you cannot send all of it in one transport layer segment, because underlying link layer technologies such as Ethernet have constraints on how big a frame can be. Therefore the transport layer first splits the message into chunks of size MSS, the maximum segment size. Note that message boundaries are not preserved when you make these segments. That is, if you write 64 kilobytes into a socket in one write system call, all 64 kilobytes will not be in one segment. If your maximum segment size is 1 KB, the data will be split into 64 segments of 1 KB each and sent out over the network.
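The splitting step can be sketched in a few lines of Python. The function name `split_into_segments` is hypothetical, and a real TCP stack also adds headers and may choose segment boundaries differently; this only illustrates how a 64 KB write becomes 64 chunks when the MSS is 1 KB.

```python
def split_into_segments(message: bytes, mss: int) -> list[bytes]:
    """Split an application message into chunks of at most `mss` bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [message[i:i + mss] for i in range(0, len(message), mss)]


message = bytes(64 * 1024)                    # a 64 KB write, all zero bytes
segments = split_into_segments(message, 1024)  # MSS of 1 KB
print(len(segments), len(segments[0]))
```

Joining the chunks back together gives the original bytes, which is all the receiver is promised: a byte stream, not the original write boundaries.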
So that is one thing the transport layer does: it creates these segments, which later become packets on the network. What else is added to a segment? One thing is port numbers. You have multiple processes on a computer, and their sockets are distinguished by port numbers. Therefore both TCP and UDP add port numbers at the transport layer, and the IP layer adds the IP addresses. Together, between the transport and IP layers, the packet headers carry the source and destination port numbers as well as the source and destination IP addresses. Other fields that are added are things like the packet size and the checksum, which again both TCP and UDP add. But in addition to these bare-minimum fields, TCP also adds a few extra fields to the packet header for the purpose of reliability: the sequence number and the acknowledgement number. Note that these are not added by UDP, which does not care about reliability, but only by TCP.
So what is the sequence number? The sequence number tells you the starting byte that is present in the packet. Packets are given sequence numbers so that you can keep track: this is the first packet, this is the second packet, this packet is lost, this one is received. To be able to do this you need to identify the packets in some way, to number them in some way, and that is what sequence numbers are used for. The sender puts the sequence number of the starting byte in each packet. If a packet has the first 100 bytes, its starting sequence number will be 0; for the next 100 bytes the starting sequence number will be 100; for the next 100 bytes it will be 200; and so on.
So we have byte-based sequence numbers put in packet headers by TCP. When the receiver receives a packet, it sends back an acknowledgement number, which is the sequence number of the next byte it is expecting. When the sender sends the first 100 bytes, byte 0 to byte 99, the receiver sends an acknowledgement with the number 100, which says "I got everything before 100, send me 100 next, I am waiting for 100". When the next packet is received, the receiver sends an acknowledgement saying 200, which means "I got everything before 200, send me byte number 200 next". Of course, sequence numbers and acknowledgement numbers could have been packet-based, but TCP prefers these byte-wise semantics.
The other thing to note is that the receiver's acknowledgement is cumulative: it indicates that everything up to this byte has been received. Sequence numbers and acknowledgement numbers exist in both directions. TCP is a bidirectional stream: once a client and a server have connected with each other, the client can send bytes and the server can also send bytes. So each direction has its own sequence numbers and its own acknowledgement numbers, and the acknowledgement numbers can be sent along with data in the reverse direction or as separate packets. This concept of sequence numbers and acknowledgement numbers is what TCP uses to guarantee reliability.
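The byte-based numbering above can be written out as a small Python sketch. The function name `seq_and_ack_numbers` is hypothetical, and starting the numbering at 0 follows the lecture's example; real connections agree on an initial sequence number during the handshake rather than always starting at 0.

```python
def seq_and_ack_numbers(payload_sizes, initial_seq=0):
    """For in-order segments, return (sequence number, ACK number) pairs.

    The sequence number is the first byte carried by the segment; the
    acknowledgement number is the next byte the receiver expects.
    """
    pairs = []
    seq = initial_seq
    for size in payload_sizes:
        ack = seq + size      # receiver now expects the byte after this segment
        pairs.append((seq, ack))
        seq = ack             # the next segment starts where this one ended
    return pairs


pairs = seq_and_ack_numbers([100, 100, 100])
print(pairs)
```

With three 100-byte segments this reproduces the numbers used in the lecture: sequence numbers 0, 100 and 200, acknowledged with 100, 200 and 300.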
Now the next question comes up. The sender is sending packets and waiting for acknowledgements, but how exactly do you do this? Do you wait for an acknowledgement after every packet? Note that waiting for acknowledgements has to be done at some point; that is essential for reliability. If you simply send packets and never look at what is being acknowledged, you will not have reliability. The question is when you wait: do you send a bunch of packets and then wait, or do you wait after each packet?
There are two different ways of doing this. You can do what is called a stop-and-wait design, in which the sender sends one packet, waits for its acknowledgement, then sends the next packet, waits for its acknowledgement, and so on. But this is very inefficient, because packets take a long time to reach the other side and you waste a lot of time waiting for acknowledgements. Instead, TCP uses what is called a sliding window protocol. The sender sends a bunch of packets, a window of packets, and then waits for acknowledgements. Of course you cannot keep sending forever; you have to put some limit on the window size. By the time you have sent the whole window, suppose an acknowledgement for the first packet arrives. Then your window has moved and you can send one more packet. When the next packet is acknowledged, you send one more, and so on. You keep sliding your window as acknowledgements come in. That is why it is called a sliding window: you send a window of packets, and as packets in the window get acknowledged, the edge of the window keeps moving forward.
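To see why stop-and-wait is inefficient, here is a rough Python comparison. It is a sketch under strong assumptions: the time to put a packet on the wire is ignored, nothing is lost, and acknowledgements for a whole window come back one round trip after the window is sent. The function name `transfer_time_ms` and the numbers (100 packets, 100 ms round trip) are made up for illustration.

```python
import math


def transfer_time_ms(num_packets: int, window: int, rtt_ms: int) -> int:
    """Time to deliver `num_packets` when `window` packets fit in each round trip."""
    rounds = math.ceil(num_packets / window)  # round trips needed
    return rounds * rtt_ms


stop_and_wait_ms = transfer_time_ms(100, window=1, rtt_ms=100)
sliding_window_ms = transfer_time_ms(100, window=10, rtt_ms=100)
print(stop_and_wait_ms, sliding_window_ms)
```

Under these assumptions a window of 10 packets finishes the transfer ten times faster than stop-and-wait, because the sender is no longer sitting out a full round trip for each packet.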
Now the question comes up: what is the maximum window size that you should use? Clearly you cannot keep sending forever; at some point you have to wait for acknowledgements. What value should this maximum window size have? That is the question TCP tries to answer. If you use too large a window size, you are just dumping packets into the network; you can cause congestion and bad things can happen. If you use too small a window size, like stop-and-wait, you are not using your resources properly: you send a packet and then for a long time you do nothing because you are waiting for the acknowledgement. Therefore your window size should be optimally tuned.
Now that we know TCP uses a sliding window protocol, let us fully understand how it handles reliability. The sender sends multiple segments with increasing sequence numbers. It has some notion of the maximum window size it should use, and until that window size is hit it keeps sending segments. When the receiver receives a segment, it sends an acknowledgement back to the sender. This is the basic mechanism needed for reliability, and the acknowledgement number indicates the next in-order byte expected. Suppose the sequence numbers are 0, 100, 200, 300 and so on. When the receiver gets the first packet it sends an acknowledgement number of 100; after it gets the next packet it sends 200, then 300, and so on. And if packets arrive out of order, the receiver does not acknowledge them individually; it only sends a cumulative acknowledgement number. If the receiver has received some later packet but not an earlier one, it still sends the acknowledgement number of the missing earlier packet. It cannot easily acknowledge out-of-order packets; in basic TCP that is not possible.
Once the receiver sends this acknowledgement and the sender gets it, the sender advances its window, and the window keeps moving forward. Now what happens when data is lost? How are we guaranteeing reliability here? Suppose some segment is lost and the segment after it is received. If the segment starting at byte number 100 is lost but the segment starting at byte number 200 is received, what will the receiver do? It will still send an acknowledgement for byte 100, saying "I am waiting for byte 100". When segment 300 is received, it will once again send an acknowledgement saying "I want byte 100". When the next segment is received, it will once again say "I want byte 100". That is, we will have duplicate acknowledgements. Every time the receiver receives a packet, it sends back an acknowledgement for the next in-sequence byte it is expecting.
A single duplicate acknowledgement can be due to some reordering of packets, and that is okay. But once it gets 3 duplicate acknowledgements, the sender concludes that something has been lost and retransmits the lost segment. Once the receiver gets the retransmitted segment, and all the later segments have also been received, its next acknowledgement can skip over all of them and say "I got all of these, send me this byte next". Of course, detecting loss using duplicate acknowledgements only works if some packets are lost and some are going through. But what if everything is lost? The sender has sent some 10 packets and everything is lost, segments and acknowledgements alike.
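The duplicate-acknowledgement rule described above can be sketched in Python. This is an illustrative model, not how an operating system implements it: segments are a fixed 100 bytes, the receiver is represented by the hypothetical function `receiver_acks`, and the sender's check by the hypothetical function `fast_retransmit_trigger`.

```python
def receiver_acks(arrivals, segment_size=100):
    """Return the cumulative ACK sent after each arriving segment.

    `arrivals` lists the starting sequence numbers in arrival order.
    """
    next_expected = 0
    buffered = set()  # out-of-order segments held until the gap is filled
    acks = []
    for seq in arrivals:
        if seq == next_expected:
            next_expected += segment_size
            # The gap is filled: skip over segments that were already buffered.
            while next_expected in buffered:
                buffered.remove(next_expected)
                next_expected += segment_size
        elif seq > next_expected:
            buffered.add(seq)
        # seq < next_expected is a duplicate segment and is ignored.
        acks.append(next_expected)
    return acks


def fast_retransmit_trigger(acks, threshold=3):
    """Return the sequence number to retransmit after `threshold` duplicate ACKs."""
    last_ack, duplicates = None, 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == threshold:
                return ack
        else:
            last_ack, duplicates = ack, 0
    return None


arrivals = [0, 200, 300, 400]                  # the segment starting at 100 was lost
acks = receiver_acks(arrivals)                 # every ACK asks for byte 100
to_retransmit = fast_retransmit_trigger(acks)
acks_after = receiver_acks(arrivals + [to_retransmit])
print(acks, to_retransmit, acks_after[-1])
```

After the retransmitted segment arrives, the final acknowledgement jumps straight to 500, which is the "one big acknowledgement" that skips over all the segments that were already waiting at the receiver.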
In that case you would not get these duplicate acknowledgements. Then how will the sender realize that a loss has happened? That is why the sender also maintains a timer: for every segment you send, you maintain a timer. If duplicate acknowledgements are received before the timer expires, you realize the segment is lost and you retransmit it. If no duplicate acknowledgements are received and no acknowledgement has come at all, then when the timer expires you time out and retransmit everything. In this way, using a combination of timeouts and duplicate ACKs, losses are detected and the sender retransmits, and eventually, after sending multiple times, the data reaches the other end and you get an acknowledgement.
The other thing that can happen is that your data goes through but your acknowledgements are lost. In that case the sender will think something bad has happened and retransmit. But that is okay: TCP receivers identify these duplicate packets and will not deliver them to the application. The TCP layer filters out duplicates using sequence numbers. In the end the receiver collects all the packets, assembles them in order of sequence number, and when an application reads from a socket, this in-order stream is delivered to the application. In this way TCP takes care of reliability. When you write and read using TCP sockets you get this reliability: you are guaranteed an in-order, reliable byte stream.
The next thing is the question we have conveniently skipped: what is your window size? Your window size cannot be too big and it cannot be too small, so what should it be? Let us see a small example. Suppose your network speed is such that you can send 10 packets per second; anything faster than that and your network gets blocked.
Then suppose your round trip time, the time for a packet to go to the other side through all the routers on the internet and for the reply to come back, is 2 seconds. What does that mean? You can only send 10 packets per second, so in 2 seconds you send 20 packets. That is the product of the bandwidth and the delay, and it is called the bandwidth delay product. Once you have sent 20 packets in 2 seconds, what has happened? The acknowledgement for your first packet has come back. You sent your first packet, then 19 more packets, and at the end of that your first packet has been acknowledged, so you can send one more. Then your next packet is acknowledged and you can send one more.
So if you are sending at the rate your network supports and you have a certain delay, then by the time you have sent your bandwidth delay product, or BDP, worth of packets, the ACK for the first one will have come through. Therefore, in this ideal world, your ideal window size is your bandwidth delay product. You take the bandwidth of your network, the rate at which your network is able to carry your traffic, multiply it by the round trip time, and you get your BDP. Use this as your window size. If you use the BDP as your window size, everything will be perfect; you are going at a smooth rate. If you send more than the BDP, your network is not able to handle it and it will get congested. If you send less than the BDP, your network is idle and your sender is idle; if you send one packet and wait for 2 seconds, you are just wasting the network. But if you send exactly at the rate your network can take, you have an ideal situation.
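The toy calculation above is simple enough to write down directly. In this Python sketch the function names `bandwidth_delay_product` and `utilization` are hypothetical, and `utilization` is a simplification that treats the link as used in proportion to window size over BDP, capped at 1.

```python
def bandwidth_delay_product(bandwidth_pkts_per_sec, rtt_sec):
    """Packets that can be in flight before the first ACK returns."""
    return bandwidth_pkts_per_sec * rtt_sec


def utilization(window_pkts, bdp_pkts):
    """Fraction of the network's capacity used by a given window (at most 1)."""
    return min(1.0, window_pkts / bdp_pkts)


bdp = bandwidth_delay_product(10, 2)   # 10 packets per second, 2 second RTT
print(bdp)                             # window size in packets for the toy example
print(utilization(1, bdp))             # stop-and-wait: one packet per round trip
print(utilization(20, bdp))            # window equal to the BDP
```

With a window of one packet only a twentieth of the capacity is used, while a window equal to the BDP keeps the network fully busy without sending more than it can carry.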
So this is what TCP wants to do. However, finding out the BDP is very hard in real life. The internet is large and complex, you do not know what bandwidth is available to you, and the delays vary a lot. In this toy example we could calculate the BDP, but in real life it is in general not possible to know it. Therefore what TCP does in real life is try to estimate the BDP using very rough heuristics. TCP calculates what is called the congestion window, or cwnd, using some heuristics to approximate the BDP in some sense. Calculating this congestion window and limiting yourself to it, so that you do not cause too much congestion in the network while still sending as much data as possible, is called congestion control. The algorithm inside the TCP logic that does this is called the congestion control algorithm.
The congestion control algorithm relies on feedback from the network. How do you adjust your congestion window? You do not know what is happening in the network; you can only infer. If packets are going through, then maybe you are sending below the BDP, and you can probably increase your congestion window. If packets are getting lost, that means something bad is happening: the network is congested, some router is not able to handle all the traffic and is dropping packets. Therefore you may have to reduce your congestion window; you may have to slow down. Most congestion control algorithms today simply rely on packet loss: if packets are getting lost, slow down; if they are not getting lost, continue to send. This congestion control algorithm runs inside the TCP layer of the operating system.
A very simple congestion control algorithm could look like this. Start by sending one segment, and initially ramp up quickly: every RTT, double your congestion window. You send one segment in the first round trip, two in the next round trip, four in the next, and in this way you keep doubling your congestion window. That is called slow start. This is done initially in order to ramp up quickly, but after some time you have to be more careful. Congestion may happen, and you do not want to be sending a large window and then realize that packets are getting lost. So after some time you get into what is called the additive increase phase, where you increase your congestion window by only one segment every round trip time. If your round trip time is 2 seconds, you send 5 packets in a window, after 2 seconds 6 packets, after 2 more seconds 7 packets. That is how you increase your window size more slowly.
If you plot the congestion window as a function of time, initially it rises rapidly, then it increases slowly and linearly. Then if something bad happens, if a dup ACK comes and you know a packet has been lost, you halve your congestion window. Then again you increase slowly, again some loss happens and you come down, again you increase slowly, and so on. This is called additive increase, multiplicative decrease: when you decrease you decrease drastically, by half, and when you increase you increase slowly. And if a timeout happens, you restart all of this: you go back to slow start and begin again. This is a very simple algorithm. Different TCP variants use different congestion control algorithms along these lines. There is no one standard method to do this; different heuristics are adopted by different algorithms.
Now let us look inside a router to understand what is actually happening when there is congestion. A router gets packets from multiple links. It looks up the destination IP address in its forwarding table and decides which neighbour the packet should be sent to next. Datagrams come in, the forwarding decision is made, and then if the outgoing link is free the packet is sent on it. But if the link is not free, the packet is buffered. You have a small buffer where packets are stored, and this queue is of a finite size. A network is composed of multiple links, and in any network there will be one link that is the slowest. If one link can send 100 packets per second and the next can only send 10, then the queue builds up at that point. On a road, the narrow stretch is where the traffic stops; we all know this, it becomes the bottleneck. In the same way, the slowest link in the network becomes the bottleneck. It is called the bottleneck link, and the queue builds up at the bottleneck router. A lot of data is coming in, the router is not able to send it out on the link fast enough, and the packets keep getting buffered at the router. At some point the storage space at the router runs out and packets start getting dropped. This is how congestion happens. When packets get dropped, the senders get that signal, their congestion windows reduce, and the congestion eases. Those are the inner details of how congestion happens in the network.
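The slow start and additive increase, multiplicative decrease rules described earlier can be simulated with a short Python sketch. Several things here are assumptions made for illustration and not part of the lecture: the function name `simulate_cwnd`, the `ssthresh` parameter (the window size at which the sender switches from doubling to adding one segment), and the `capacity` parameter, which stands in for what the bottleneck link plus its router queue can absorb in one round trip. Any window above `capacity` is treated as causing a loss that is detected through duplicate ACKs. Timeouts, which would send the window back to one segment, are not modelled.

```python
def simulate_cwnd(rounds, capacity, ssthresh):
    """Return the congestion window (in segments) used in each round trip."""
    cwnd = 1
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd > capacity:
            # Loss signalled by duplicate ACKs: multiplicative decrease.
            cwnd = max(1, cwnd // 2)
            ssthresh = cwnd
        elif cwnd < ssthresh:
            # Slow start: double every round trip, up to the threshold.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Additive increase: one more segment per round trip.
            cwnd += 1
    return history


history = simulate_cwnd(rounds=12, capacity=20, ssthresh=16)
print(history)
```

Plotting `history` against the round number gives the sawtooth shape described in the lecture: a fast initial rise, a slow linear climb, and a sharp drop by half each time the window overshoots what the network can carry.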
You might ask: why should we wait for packets to get dropped before slowing down? Can we not realize it when the queue is building up? If you are approaching a traffic junction and you see a lot of vehicles backed up, you might take a different route beforehand; you do not sit there for an hour and then realize there is congestion. This is possible. There are modern routers today that detect when the queue size starts to increase; that is called random early detection. Before the queue overflows, the router can either drop packets or set a mark on packets. There is something called explicit congestion notification supported by routers: the router can say "the queue is building up" and set that signal on the packet, so that if your TCP is aware of these things it can slow down even before packets are lost. Especially in networks like data center networks, where performance is very important, you can use all of these optimizations.
Now let us understand things end to end. You are sending a window of packets from the sender to the receiver. What are all the delays that each packet experiences? When you take a packet and transmit it, there is a basic transmission delay. Your link has a certain speed, and it might take a few microseconds to actually put the packet on your link; that is called the transmission delay. Then you have the propagation delay. Once you have put a signal on the wire, it travels at a speed bounded by the speed of light, so it takes some time for the signal to reach the other end; that is the propagation delay. Then every router has to process the packet. Once the packet has been transmitted, has propagated and has been received, the router at the other end has to look up the forwarding table and make some decisions; all of that is the processing delay.
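The transmission and propagation delays described above follow directly from the link speed and the link length. The Python sketch below uses hypothetical function names and assumed example values: a 1500-byte packet, a 1 Gbps link, a 1000 km link, and a signal speed of 2e8 metres per second, which is a common rough figure for signals in cable or fibre and is an assumption here.

```python
def transmission_delay_sec(packet_bits, link_bits_per_sec):
    """Time to push all the bits of a packet onto the link."""
    return packet_bits / link_bits_per_sec


def propagation_delay_sec(distance_m, signal_speed_m_per_sec=2e8):
    """Time for the signal to travel the length of the link."""
    return distance_m / signal_speed_m_per_sec


tx = transmission_delay_sec(1500 * 8, 1e9)    # 1500-byte packet on a 1 Gbps link
prop = propagation_delay_sec(1_000_000)       # a 1000 km link
print(tx, prop)
```

With these numbers the transmission delay is in the microseconds while the propagation delay is in the milliseconds, which is why distance dominates the delay on long links.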
Then finally, if the packet cannot be sent out on a link immediately, because the link is slow, the packet gets queued up and waits in a queue; that is called the queuing delay. For every packet, you add up all of these delays on every link. You add them up for all the links the data packet traverses, then for all the links the ACK traverses, and that is the round trip time in your network: the amount of time it takes after sending a packet to get an acknowledgement back. Your BDP is nothing but your bottleneck link bandwidth multiplied by this RTT.
Note that different networks have different values of bottleneck bandwidth, RTT and so on. In data centers you might have very high bandwidth links connecting different servers, and you can have very low RTTs, like a few milliseconds. But on the internet, if you have a wide area network and your client is on one side of the earth and the server is on the other side, your RTT can be very large, tens to hundreds of milliseconds, and you can have lower bandwidth also, because there could be some slow link along the path. So different networks have different characteristics and different BDP values. There is no easy way to find out what the BDP is, and TCP uses heuristics in its congestion control algorithm to estimate the best window size.
The final transport layer concept that I would like to discuss is what is called flow control. So far we have only been concerned about the network: the network is getting congested, the network is dropping packets. But what if the network is fast but your receiver is slow?
Recollect what happens at the receiver when a packet comes in. Your device driver has TX and RX rings, and the packet is put into the RX ring. Then the OS handles the interrupt and processes the packet, and adds it to the socket queue. Then your application reads from the socket, and that is when the socket queue is emptied. We saw all of this last week. Now if your receiver application is too slow, or your operating system is too slow, all of these queues also fill up. For example, if your application is not reading packets fast enough, then even if your network is very fast, all the packets will come and wait in the socket queue. In that case also the sender has to slow down, and that is called flow control. The sender slowing down in response to a slow receiver is called flow control, whereas the sender slowing down in response to a slow network is called congestion control.
So how does the sender know that the receiver is slow? In every acknowledgement, the receiver tells the sender how much space is left in its socket receive queue. If your socket receive queue has very few bytes left, you say that in the acknowledgement, and the sender sets its window size to the minimum of the congestion window and the receiver window. If the network is slow, the congestion window will be low and that value will be used. If the network is fast, your congestion window can be very high but the receive window will be smaller, so the minimum of the two values is used. So TCP does both congestion control and flow control. The other thing you should do is set your socket receive queue size to be at least as big as your bandwidth delay product.
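The sender's rule of taking the minimum of the two windows, and the effect of a small receive queue, can be sketched in Python. The helper names `effective_window`, `throughput_limit` and `request_receive_buffer` are hypothetical, and the window sizes and round trip time are assumed values. The socket option `SO_RCVBUF` is the standard way to request a receive buffer size, but the size the operating system actually grants depends on system settings and may differ from the request.

```python
import socket


def effective_window(cwnd_bytes, rwnd_bytes):
    """The sender may have at most min(cwnd, rwnd) bytes unacknowledged."""
    return min(cwnd_bytes, rwnd_bytes)


def throughput_limit(window_bytes, rtt_sec):
    """Upper bound on throughput: one window of data per round trip."""
    return window_bytes / rtt_sec


def request_receive_buffer(size_bytes):
    """Ask the OS for a receive buffer size and report what it granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


# Fast network (large cwnd) but a small receive queue: the receiver is the limit.
window = effective_window(cwnd_bytes=1_000_000, rwnd_bytes=65_536)
limit = throughput_limit(window, rtt_sec=0.1)
granted = request_receive_buffer(262_144)
print(window, limit, granted)
```

In this example the sender is held to about 655 KB per second by the 64 KB receive window, no matter how fast the network is, which is why the receive queue should be sized to at least the bandwidth delay product.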
If your receiver is fast but this queue does not have enough space, that is not good enough. Sometimes you might find that your connection is very slow even though your network is fast; in that case the problem could actually be your receive buffer. Your receive queue is so small that your acknowledgements keep saying "I have no space left", and your sender slows down even though your network is fast. That can also happen. We have tools like iperf, for example on Linux, which let you measure the bandwidth between a sender and a receiver, a client and a server, in a system. If you measure the bandwidth with iperf and you see that it is low even though your network is very fast, then the receive buffer size is something you should check. In computer systems you should always be mindful of it and see whether you need to tune it, in case it is what is actually limiting your sender.
So that is all I have in today's lecture. We have discussed transport protocols. We have seen how transport protocols like TCP use a sliding window of packets, use acknowledgements for reliability, estimate the congestion window size in order not to cause congestion in the network, and slow down if the receiver is slow using the flow control mechanism. All of these together ensure end-to-end, reliable, in-order byte stream delivery on the internet. To understand these concepts better, please use tools like iperf: try to measure the TCP bandwidth or the delay between two hosts in your network, change the receive window size, and see what happens to TCP throughput. These are all things you can try out to understand TCP congestion control and flow control better. Thank you all, that is all I have for this lecture. Let us continue our discussion in the next lecture.