Welcome to the 19th lecture in the course Design and Engineering of Computer Systems. In the previous lecture we studied the socket API that is provided by the kernel to user processes to help them communicate with each other. In this lecture we are going to understand how this socket API is implemented inside the operating system, and this understanding is crucial for us to see how applications run in real life and what kind of performance they get. So let us get started. What is the socket API? A small recap: we have seen that a socket is an abstraction provided as a way for two processes to communicate with each other. One process opens a server socket, another process with a client socket connects to this server socket, and then they can exchange messages with each other. What you send on one side will be received on the other, and vice versa, and with connectionless sockets you can also send messages to sockets that you are not connected to. We have seen that there are local sockets for communication between processes on the same machine, as well as internet sockets, that is TCP or UDP sockets, for communicating with processes across machines. And we have seen various components of the socket API: how you open a socket, how you bind it to a well-known address like an IP address plus port number at a server, how you connect a client socket to a server socket, and how you send and receive data, which varies slightly between connectionless and connection-based sockets. Then we saw how we use event-driven APIs for managing multiple sockets at the same time, if a server has to handle multiple clients concurrently. There are also other helper functions provided in any socket-like communication API, in any programming language. For example, the data that you send over the network is in a specific format, which could be different from the format in which the data is stored at the host.
So you have functions to convert from the host format to the network format and vice versa. There are many other helper functions that we have not gone into in detail, but when you are using any library for communication you should understand these things. So this is the API; in this lecture we want to understand how these system calls are implemented inside the operating system. Before we see how these socket system calls are implemented inside an OS, I would like to give you a brief overview of how network communication itself works. This is a topic that we will study in more detail next week, but here is a brief introduction. Of course, if you have taken a networking course before you will be familiar with this, but if not, here is a summary. Over a network like the internet, data is exchanged in units of packets: you take some number of bytes, put them into a packet, and send this packet; some more bytes you will send in another packet. So data flows in the form of a series of packets between machines on the internet. And the communicating processes that are sending and receiving these packets all have unique network addresses. Every machine has an IP address, and inside a machine the different processes, when they open sockets, get different port numbers. This combination of IP address and port number uniquely identifies a server socket, and hence a server process. Therefore, whenever you have to send a message to a server, you put the server's IP address and port number in your packet so that the message reaches the server. In fact, every packet has the sender's IP address and port number as well as the receiver's IP address and port number, and the network has a series of routers that use this information.
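The conversion functions mentioned above are, in the BSD socket API, `htons`, `htonl`, `ntohs`, and `ntohl` (host-to-network and network-to-host, short and long). A small sketch of how you would use them when preparing address fields for the wire:

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Network byte order is big-endian, so on a little-endian host
 * these calls swap bytes; on a big-endian host they are no-ops.
 * Either way, the bytes that go on the wire are the same. */
uint32_t addr_to_wire(uint32_t host_ip)     { return htonl(host_ip); }
uint16_t port_to_wire(uint16_t host_port)   { return htons(host_port); }

uint32_t addr_from_wire(uint32_t wire_ip)   { return ntohl(wire_ip); }
uint16_t port_from_wire(uint16_t wire_port) { return ntohs(wire_port); }
```

Because the conversions are exact inverses of each other, a round trip through the wire format always gives back the original host value, regardless of the host's endianness.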
When a machine sends a series of packets, each of these routers looks at the packet, sees which address it has to go to, and routes it onward, and finally the packet reaches the server. So you have both the sender's address and the receiver's address in the packet. Machines on the internet communicate using various protocols. What is a protocol? It is nothing but a set of rules: we have to exchange messages like this, this is how a message should look, this is the format of the message. You have to agree on some things, because otherwise, if one side sends some random gibberish bytes, the other side cannot understand what it is saying. So there is a common language that these processes use to communicate over the network, which is called a protocol. There are many protocols: for example, the TCP/IP protocol is used for reliable communication, the UDP/IP protocol for unreliable communication, and the HTTP protocol to exchange information about websites; we will study all of this next week. For now you need to understand that there are some protocols and each protocol has its own headers. For example, the IP protocol may add the IP address, and the TCP protocol may add the port number. So all of this protocol-specific information is there in a packet. Every packet has what is called the payload, which is the actual data that the user sends into the socket: you are sending some message to the other side, you write that message into a socket, and that is your payload. Once you write the payload into your socket, before it is actually sent out over the network, a few extra pieces of information are added to the packet, which are called the headers. These headers are usually added by the operating system when it does protocol processing, like TCP or UDP processing; it adds the headers on one side, and on the other side they are removed.
So the user program only sees the payload, but the OS takes care of adding and removing these headers, and finally the packet that gets sent over the network has both the payload and the headers. What these protocols are, what headers they add, what information is in the headers, all of this we are going to see next week when we study the design of the internet in more detail. So now that we know that packets are sent and received, with a payload and headers corresponding to various protocols, let us understand how the socket API is implemented in operating systems. When you open a socket, as we have seen before, you get a socket file descriptor. This is nothing but an index into the file descriptor array, which points into the open file table; at a high level, sockets are treated the same as files, except that instead of the inode of a file, a socket has separate structures. A socket does not need an inode; it has other pieces of data associated with it, which are kept track of via the open file table. Every socket has queues of socket buffers: a transmit queue and a receive queue. The transmit queue holds the socket buffers you are transmitting, and the receive queue holds the socket buffers you are receiving.
When you write, socket buffers go into the transmit queue; when you read, socket buffers come from the receive queue. But what is a socket buffer? A socket buffer is nothing but a data structure that is used to store a network packet. It is called an SKB or sk_buff in Linux, and it is nothing but some payload plus headers. A socket buffer has variable size; many different headers are added around the payload, and all of this together is called the socket buffer, or SKB in Linux. Every socket stores two queues of socket buffers: the packets it is transmitting and the packets it is receiving, because every socket is bidirectional: you can send as well as receive. So when you use the send or write system call to write into a socket, what happens? A new SKB is added to the transmit queue, and in this SKB you have some payload and some headers. When you write, you are giving some message to the write API; this message is copied into the payload, and the headers are added later by the operating system. So when you write into a socket, a new SKB is created, and the payload is copied from the user-space message, the buffer that you have given to the write system call. Then the OS, based on whatever protocols it is running, adds some headers, and finally this SKB is given to the device driver, which sends it out over the network. You will have some network card, like an Ethernet card or a Wi-Fi card, with its own device driver, and this SKB finally goes over the network as a packet. This is of course a simplified description, but this is roughly what happens. And once the bits in this SKB are sent over the wire, or wirelessly, the SKB is freed up.
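The send path described above can be sketched as a toy data structure. The real Linux `struct sk_buff` is far more elaborate, and all the names here are made up for illustration; the point is only the shape of the operation: allocate a buffer with room for headers, copy the user's message in, and append it to the socket's transmit queue.

```c
#include <stdlib.h>
#include <string.h>

/* A toy socket buffer: headroom reserved for protocol headers,
 * followed by the user's payload. */
#define HDR_ROOM 64

struct skb {
    struct skb *next;        /* links SKBs into a queue */
    size_t      payload_len;
    unsigned char data[];    /* HDR_ROOM header bytes + payload */
};

struct skb_queue { struct skb *head, *tail; };

/* write()/send() path: allocate an SKB, copy the user buffer into
 * the payload area, and append it to the socket's transmit queue.
 * The headers are filled in later by protocol processing. */
struct skb *sock_write(struct skb_queue *txq,
                       const void *user_buf, size_t len)
{
    struct skb *s = malloc(sizeof(*s) + HDR_ROOM + len);
    if (!s) return NULL;
    s->next = NULL;
    s->payload_len = len;
    memcpy(s->data + HDR_ROOM, user_buf, len); /* copy: user -> kernel */
    if (txq->tail) txq->tail->next = s; else txq->head = s;
    txq->tail = s;
    return s;
}
```

Note the copy from the user's buffer into the SKB: this is one of the two per-packet copies that kernel-bypass frameworks, discussed at the end of the lecture, try to eliminate.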
So the SKB is allocated when you write into the socket, constructed from the user's payload plus the headers, and when the packet has been sent out and the SKB is no longer needed, it is freed up. Then what happens when you read from a socket? There are already SKBs that have been received and queued up; from this receive queue you take the payload that is in the SKB and copy it into the user-provided buffer. Once again there is an SKB with a payload; the read or receive system call gives some buffer as an argument, and you copy the payload into that buffer. But what if the receive queue does not have any packets? What if the user asks to read from a socket, but the receive queue of the socket is empty? In that case the process will block. It will block until some data is received from the network and the OS queues up an SKB; until that happens, the process blocks. So this is roughly how sending and receiving data through a socket happens: on the send side you create SKBs and send them out; on the receive side, as packets come in, the OS queues up SKBs, and the read or receive system call copies the data from the SKB into user space. Next, how are packets actually sent over the network? For that we have device drivers. A device driver is a piece of software that talks to I/O devices; we have seen this before. Network cards, also called NICs or network interface cards, have their own device drivers, which exchange packets with the network. Every such device driver maintains what are called transmit and receive rings. A ring is nothing but a circular array: you write into the array one slot after another, and when you reach the end you wrap around to the beginning again.
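The behavior described above can be observed from user space: data written into one end of a connected socket sits in kernel socket buffers until the other end reads it. A small sketch using a Unix-domain socket pair (the helper name is my own):

```c
#include <sys/socket.h>
#include <unistd.h>
#include <string.h>

/* Writes a message into one end of a socketpair; the kernel queues
 * it as socket-buffer data until the other end issues read(). With
 * data already queued, the read() returns immediately instead of
 * blocking. Returns 0 on success, -1 on failure. Caller must leave
 * room for the terminating NUL (pass outlen = sizeof buf - 1). */
int echo_through_kernel(const char *msg, char *out, size_t outlen)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;

    ssize_t n = write(fds[0], msg, strlen(msg)); /* payload -> TX queue */
    if (n < 0) { close(fds[0]); close(fds[1]); return -1; }

    ssize_t r = read(fds[1], out, outlen);       /* RX queue -> user buffer */
    close(fds[0]); close(fds[1]);
    if (r < 0) return -1;
    out[r] = '\0';
    return 0;
}
```

If the `read()` were issued before any data had been written, the calling process would block, exactly as described above for an empty receive queue.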
So a ring is nothing but a fixed-size circular array, and there is one for transmit, the TX ring, and one for receive, the RX ring. What do these rings contain? They are maintained by the device driver, that is, by software inside the OS, and they contain packet descriptors, which are nothing but pointers to SKBs, the socket buffers. The TX ring points to all the socket buffers that have to be transmitted; the RX ring points to all the socket buffers that have been received. So you have a ring of socket buffers, the TX ring and the RX ring, and the actual hardware device, the NIC itself, knows the location of these rings in memory. These rings are all just memory: they are OS data structures maintained by the device driver, located in main memory, but the NIC knows how to access them. For example, the NIC knows: here is the first packet that has to be transmitted, here is the next packet after that. Every ring has a head and a tail pointer, so which slots are occupied and which packets are waiting to be transmitted is all known to the NIC. Once the NIC knows where the transmit and receive rings are and how to access them, where the packets are, where the queue starts and where it ends, it can use these rings to do the transmission and reception. Let us look in a little more detail at what happens at the transmit ring. The transmit ring is nothing but the queue of socket buffers that are waiting to be transmitted. Whenever the user has created packets to be transmitted, they get queued up at this transmit ring.
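A descriptor ring with head and tail pointers, as described above, can be sketched as follows. All names are illustrative; real NIC rings hold hardware-defined descriptor formats and are typically hundreds to thousands of slots long.

```c
#include <stddef.h>

/* A toy descriptor ring: a fixed-size circular array of packet
 * descriptors (here just SKB pointers), with a head index advanced
 * by the consumer (the NIC, on transmit) and a tail index advanced
 * by the producer (the driver). */
#define RING_SIZE 8   /* real rings are typically 256-4096 slots */

struct ring {
    void    *desc[RING_SIZE]; /* pointers to SKBs */
    unsigned head;            /* next slot the consumer will take */
    unsigned tail;            /* next slot the producer will fill */
};

int ring_full(const struct ring *r)  { return (r->tail + 1) % RING_SIZE == r->head; }
int ring_empty(const struct ring *r) { return r->tail == r->head; }

/* Driver side: queue an SKB pointer for the NIC. */
int ring_push(struct ring *r, void *skb)
{
    if (ring_full(r)) return -1;  /* no free slot: caller must drop or wait */
    r->desc[r->tail] = skb;
    r->tail = (r->tail + 1) % RING_SIZE;
    return 0;
}

/* NIC side: take the next descriptor to DMA. */
void *ring_pop(struct ring *r)
{
    if (ring_empty(r)) return NULL;
    void *skb = r->desc[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    return skb;
}
```

One slot is kept empty so that full and empty states are distinguishable; this is why `ring_full` compares `tail + 1` against `head`, and why a ring of N slots holds at most N-1 packets.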
Now, the device driver adds an SKB to this transmit ring when the packet is ready to be sent, and then the NIC, whenever it is free to send, accesses the SKB and DMAs the packet from the SKB into the actual device hardware. The NIC has its own hardware where it converts the bits into electrical signals to send over the wire, and so on; the NIC does its own processing on the packet. So the first thing the NIC does is DMA the packet from RAM into its own memory, and from there it goes ahead and transmits the packet. How does the NIC know where in memory this SKB is located, so that it can DMA it? That is exactly why the transmit ring has all the pointers: this is the address of the first packet to send, this is the address of the second, and so on. The NIC accesses the transmit ring, sees the address of the first SKB to send, and gives that address to its DMA hardware, which copies the packet into the device; then the packet is sent. Once the packet is transmitted, the NIC raises an interrupt, and this SKB can be freed up: you are done transmitting this packet, and similarly with the next packet, and so on. Is that clear? The transmit ring is a queue of SKBs to be transmitted; the device driver adds SKBs, the NIC DMAs these SKBs and raises an interrupt, at which point the SKBs are freed up on the transmit side. Now what about the receive ring? The receive ring is a queue of empty SKBs that are waiting to be filled. The device driver creates a bunch of empty SKBs and keeps them in the RX ring, and whenever a packet arrives at the NIC, it finds the first empty SKB, finds its address, DMAs the received packet into that SKB, and then raises an interrupt. So the RX ring is looked up by the NIC when it has a packet.
Now that it has a packet, it needs to know where to DMA it to; it cannot just randomly dump it anywhere in memory. So it reads the RX ring, finds the first free slot, and puts the packet there; the next packet goes into the next free slot, and so on. Then, when the OS handles the packet, it takes this received packet, does whatever processing is needed, and moves it to the socket queue. You remember the socket queues from the previous slide? From the RX ring, this SKB has to go to the socket's receive queue; that is where the user program will read it from. So once the NIC receives the packet, it DMAs it into the SKB whose pointer is on the ring; once the DMA is done, the interrupt is raised, and the OS handles the interrupt and takes this SKB off the RX ring. And when you pull a filled-up SKB from the RX ring, of course you replenish it with a new empty SKB. Why? So that another packet can be received there later. The RX ring has a fixed number of slots, so when you take a packet away, you replenish that slot with a new empty SKB. So this is about the device driver and the TX and RX rings. Now let us put everything together and see how packet transmission and reception happen end to end. To summarize: you have your socket, and this socket has TX and RX queues, which hold the SKBs queued up at the socket for transmission and reception; underneath these queues, at the device driver level, you have the TX and RX rings; and finally you have your device, the NIC itself. So how is a network packet transmitted? When the user does a write or a send system call, a new SKB is allocated in the TX queue, and the data is copied from user space into this SKB.
Then the OS performs all the network protocol processing: the user has filled in the payload, the OS adds the various headers, and then this SKB is finally pulled out of the TX queue and queued up at the TX ring of the device driver, when the OS is ready to send the packet. Note that the packet may not go to the device driver's TX ring as soon as the user writes, because, as we will see next week, some protocols might want to slow down transmission when there is congestion, when too many packets are getting lost, or for various other reasons. So the OS may wait for a little while, and when it thinks it is ready to send the packet, it takes it from the TX queue and puts it into the TX ring that the device has access to. Then the device DMAs the packet from the TX ring, and when it has sent the packet over the wire, or wirelessly, whatever the case may be, it raises an interrupt, at which point the SKB is freed up: the OS handles the interrupt and frees the SKB because the packet has been fully transmitted. This is the life cycle of a packet when you send it over the socket interface. What exactly these protocols do, you do not have to understand now; we will see that next week. The next thing is packet reception: how is a network packet received? We start once again with your socket, which has TX and RX queues; then your device driver, which has TX and RX rings; and finally your NIC, the hardware. Initially, the device driver puts a bunch of empty SKBs on the RX ring, which are just placeholders where received data can be put. Now, when the process makes a read system call, if some packets have been received before, well and good; otherwise, if the receive queue is empty, anybody who makes a read system call will block until the socket's receive queue has packets.
Now, when a packet arrives, the NIC finds an empty buffer in the RX ring, DMAs the packet into that SKB, and raises an interrupt. Now that the OS has received this interrupt, what does it have to do? It has to process this received packet: it has to do various protocol processing, look at the headers, and so on, and this usually takes time. So what the OS does is split its interrupt processing into two parts. First it runs what is called the top half, which does the bare minimum work: telling the NIC, okay, got it. If the NIC knocks at your door, you say, wait, I heard you, I am coming; that is all it does, it first acknowledges the interrupt. Why? Because you have interrupted a running process, and you do not want to do a lot of work as part of handling the interrupt. So the OS, as part of handling the interrupt, first acknowledges it in a piece of code called the top half, and later, when it is free, it runs the remaining part of the interrupt handling, which is called the bottom half. That is, interrupt handling in modern operating systems is split into two parts called the top half and the bottom half. When this bottom-half interrupt handler runs, it removes the SKB from the RX ring, runs it through whatever protocol processing needs to be done, and finally queues it up at the RX queue of the socket. And of course, when you pluck an SKB out of a slot, you replenish it with an empty SKB, so that in the future, once the circular buffer wraps around, another packet can be put there. Then, once something arrives in the RX queue, the process that was blocked in the receive system call is woken up; it copies the data from the SKB into its user memory, and then the SKB is freed up.
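The top-half/bottom-half split can be sketched as follows. This is a toy model, not kernel code: the `ack_interrupt` comment stands in for a hypothetical device register write, and the globals stand in for the RX ring and socket queues.

```c
/* Toy split of interrupt handling. The top half only acknowledges
 * the device and marks work pending; the bottom half, run later
 * when the CPU is free, drains the ring and does the protocol
 * processing. */
int irq_pending;   /* "work to do" flag set by the top half   */
int ring_backlog;  /* packets waiting in the RX ring           */
int delivered;     /* packets moved to socket receive queues   */

/* Runs in interrupt context: must be as short as possible. */
void top_half(void)
{
    /* ack_interrupt(nic);  -- hypothetical: tell the NIC "got it" */
    irq_pending = 1;        /* defer the real work */
}

/* Runs later (softirq context in Linux): does the heavy lifting. */
void bottom_half(void)
{
    if (!irq_pending) return;
    irq_pending = 0;
    while (ring_backlog > 0) { /* drain everything queued so far */
        ring_backlog--;
        delivered++;           /* protocol processing + enqueue to socket */
    }
}
```

Notice that the bottom half drains the whole backlog, not just the one packet that triggered the interrupt; this is the same idea that the NAPI optimization, discussed later in the lecture, builds on.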
So this is the life cycle of a received packet: the NIC DMAs it into one of the socket buffers whose pointer is in the RX ring; the OS interrupt handler, the top half and bottom half together, handles this SKB and queues it up at the RX queue of the socket; and then, when the read or receive system call of the process returns, it copies from this SKB into user memory. Now, you might have many sockets in a system, so how do you identify which socket's receive queue to put a packet into? That is where the IP address and port number are useful. With connectionless sockets, of course, you will only have one socket at a particular IP address and port number, so that uniquely identifies the socket. But with connected sockets, at the same port number you have a listen socket and multiple connected sockets, all at the same server address and port number. So which of these socket queues will you add the packet to? This is where you also look at the sender's information: all of these connected sockets have different senders, and based on which sender the packet came from, you find the suitable socket. That is, you use what is called the four-tuple of a connection, four pieces of information: the sender IP address, sender port number, receiver IP address, and receiver port number. These four pieces of information identify one of the connected sockets, and then you queue up the SKB at that socket's receive queue. So this is the life cycle of packet reception. Now, in general, packet reception is a high-overhead process. Why? Because there are a lot of protocols: you have to look at headers, you have to do various computations such as checksums; we will see all of this later.
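The four-tuple lookup described above can be sketched like this. A real kernel uses hash tables for this lookup; a linear scan over a small array keeps the idea visible, and all names here are illustrative.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* The four pieces of information that identify one connection. */
struct four_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

struct toy_socket {
    struct four_tuple id;   /* the connection this socket serves */
    int in_use;
};

/* Demultiplex a received packet: find the connected socket whose
 * four-tuple matches the packet's, or NULL if none matches (in
 * which case a real kernel would consult the listen socket). */
struct toy_socket *demux(struct toy_socket *socks, size_t n,
                         const struct four_tuple *pkt)
{
    for (size_t i = 0; i < n; i++)
        if (socks[i].in_use &&
            socks[i].id.src_ip   == pkt->src_ip   &&
            socks[i].id.dst_ip   == pkt->dst_ip   &&
            socks[i].id.src_port == pkt->src_port &&
            socks[i].id.dst_port == pkt->dst_port)
            return &socks[i];
    return NULL;
}
```

Because all four fields participate in the match, two clients connecting to the same server port from different addresses, or even from different ports on the same machine, land on different sockets.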
When all of this is being done, your OS can get overwhelmed, and therefore there are many techniques available to speed this up, especially today, when you have links of tens to hundreds of gigabits per second and a lot of packets coming in. How do you efficiently process all of these packets? That is the question. Some techniques the operating system uses: one we have already seen, which is splitting the interrupt processing into a top half and a bottom half. As soon as a packet comes and an interrupt is raised, you are stopping a running process to handle the interrupt, and therefore you do not want to take much time. You do minimal processing, the top half, which just involves acknowledging the interrupt; then you schedule a separate kernel task whose only job is to run these bottom-half interrupt handlers, also called softirqs. That task is scheduled whenever the CPU is free, when nobody else is running, and it does all the protocol-related processing and takes the SKB from the RX ring to the RX queue of the socket. This splitting helps you avoid interruptions to existing processes. The next optimization is needed because one CPU core may not be able to keep up with all the interrupt processing. Normally, every device raises its interrupt on a particular CPU core, but if you have hundreds of Gbps, millions of packets coming in per second, one CPU core will be overwhelmed doing all this interrupt processing. Therefore, modern systems use a feature of NICs called receive-side scaling, or RSS: instead of one TX/RX queue pair, you have multiple TX/RX queue pairs, and the NIC, when it gets packets, splits them across these multiple queues, and each queue is handled by a different CPU core: C0, C1, C2, and so on.
There are multiple TX/RX rings, and the NIC will DMA one packet into this queue and interrupt this CPU core; the next packet goes into another queue, and that core handles the interrupt. So you are splitting the load of interrupt processing across multiple CPU cores. And how do you split packets? You try to ensure that packets of the same connection go to the same ring, so that you do not jumble things around too much. So you typically use a hash of the four-tuple, the source and destination IP addresses and port numbers, to split packets across these receive rings, and the interrupts of each RX ring are handled by a separate CPU core. That way you can use your multiple CPU cores to handle the large number of interrupts coming in on high-speed network cards. The other optimization that is commonly used is the NAPI optimization (the "new API" of the Linux networking stack). What this means is: suppose your RX ring has multiple slots, and your NIC has DMAed a packet into one of them and raised an interrupt.
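The hash-based queue selection described above can be sketched as follows. Real NICs use a Toeplitz hash with a configurable key; the mixing function here is a toy stand-in, and what matters is only that the same four-tuple always maps to the same queue.

```c
#include <stdint.h>

/* Receive-side scaling: pick an RX queue by hashing the packet's
 * four-tuple, so all packets of one connection land on the same
 * queue (and hence are handled by the same CPU core). */
unsigned rss_queue(uint32_t src_ip, uint32_t dst_ip,
                   uint16_t src_port, uint16_t dst_port,
                   unsigned nqueues)
{
    uint32_t h = src_ip * 2654435761u;          /* toy integer mixing */
    h ^= dst_ip * 2246822519u;
    h ^= ((uint32_t)src_port << 16) | dst_port;
    h *= 2654435761u;
    return h % nqueues;                          /* queue index in [0, nqueues) */
}
```

Keeping a connection on one queue means its packets are processed in order and by one core, which avoids reordering and keeps per-connection state in that core's cache.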
Now the CPU is very busy and has not yet handled this packet, and another packet comes: again an interrupt is raised, and again, and again. Instead of constantly knocking on the door when the other person is not opening it, you should stop knocking, because you should realize that they are busy. That is what NAPI does: once an interrupt is raised, until the bottom half runs and handles it, all further interrupts from the NIC are disabled. That does not mean packets stop coming; packets keep coming and keep being added to the RX ring, but the NIC will not constantly keep raising interrupts and disturbing a CPU that is already overloaded. The NIC just silently keeps DMAing packets without raising further interrupts: you told the OS once, and you wait until it responds. And when the bottom half runs, it looks not just at that one packet but at all the packets that have arrived until then; it was interrupted when one packet came, but by the time it actually runs, many more packets could have arrived, and it handles all of them. Once it has cleared the backlog, it re-enables interrupts. This avoids too much interrupt load in a system that is receiving packets at a very high rate. So now we have multiple queues in the system, and you might have the question: what should I set these queue sizes to? At the socket you have the TX/RX queues, and at the device driver you have the TX/RX rings; what should their sizes be? You have to set these sizes very carefully, because packets coming over the network first sit in the ring, then in the socket queue, and then the application reads them. So there is a pipeline, and in a pipeline you need to match the speeds of all the stages; if one stage is fast and another is slow, a queue will build up. For example, if packets are coming very fast on the NIC and this RX ring is
getting filled up, but the OS is not able to take the packets out and put them into the socket queues, then what happens? Once the ring is full and you have gone all the way around it, there is no place to put incoming packets anymore, and you simply drop the packets that are coming in over the network. That can happen. On the other hand, if the application is reading packets from the socket's receive queue very slowly, then the OS is putting packets into that queue but the application is not taking them out, and only when the application reads from the socket queue can those buffers be freed up. So that can also happen. It can also happen that the device is very slow: you could be writing very quickly into the TX ring, but the device is not sending packets out. So a lot of things can happen: a queue can build up in many places, and whenever a queue builds up, packets get dropped and you get performance issues; you have sent some message to the server, and the server is not getting your message. Later in the course, when we study performance and networking, we are going to see how to set all of these sizes based on the speeds of the various components: if your network is too fast or too slow, if your application is too fast or too slow, how do you set all of these queue sizes, how do you harmonize all the stages of the pipeline so that you get good performance. You must adjust your sending speed, your queue sizes, your device driver ring sizes; all of these must be tuned for optimal performance, and how you do this is something we will see later in the course. For now it is enough to understand that there are many queues, one after the other, forming a pipeline, and anytime
somebody is slow and somebody is fast in a pipeline, a queue will build up. The final topic I would like to touch upon today is that modern computer systems are doing I/O at ever-increasing speeds. When operating systems were first designed, you were doing I/O at a few Mbps; today you are doing I/O at tens to hundreds of Gbps, especially in data centers and in the cloud, and operating systems were not really designed to handle these millions of packets coming in per second. For every packet there are inefficiencies: there is a system call, there is an interrupt, you have to switch from user mode to kernel mode and context switch, you have to allocate SKBs one by one, and you have to copy the packet data: the device DMAs the data into an SKB on the RX ring, this SKB is moved to the RX queue, and from there the data is copied into whatever buffer the user has given to the read system call, so you are copying data twice. With all of these inefficiencies, sometimes your computer cannot handle all the network data that is coming in. In such cases, there are techniques available in modern computer systems called kernel-bypass techniques, which basically say: let me get rid of this OS layer in between; I will directly access the network card and do my own work. Some applications that need high-speed I/O can bypass the operating system. For example, you can use your own special device driver that directly accesses the NIC, and the NIC will directly DMA packets into your user space; this special device driver can do that. There are frameworks like the Data Plane Development Kit, or DPDK, one popular example, where the device driver does exactly this: it directly accesses the NIC, and you are no longer using
the kernel's interrupts or system calls; all of that is gone. You directly access the NIC and take your data into a user-space buffer, so that there is zero copy: no copying across kernel and user memory. You pre-allocate buffers: instead of allocating an SKB for every packet, you allocate a large pool of buffers at once. You use optimizations like huge pages, which we have seen before: a large page size means good TLB performance. You avoid interrupts: whenever the application is free, it directly polls the NIC instead of the NIC constantly interrupting and disturbing the process. And you use techniques like batching, where whenever you go to the NIC you handle a batch of packets at a time: your protocol processing happens on a batch of packets instead of one by one. So frameworks like DPDK, which are called kernel-bypass frameworks, employ all of these optimizations so that you are able to handle hundreds of Gbps of packets in your application. These are all advanced topics that I will not cover in more detail here, but I just want you to know that this is an active area of research, and if you are building a real-life system you might sometimes have to employ these kernel-bypass techniques. Of course, not everything is easy with kernel bypass: the operating system provides you various isolation mechanisms, and there are various useful tools in the operating system; all of those are gone. The OS is out of the picture; your application is directly using a device driver to talk to the NIC, and the OS is not involved. So this has its own complications with respect to isolation, security, and portability of your code. But that aside, if your application really needs high-speed I/O, this is a technique that a lot of people are considering today in real-life systems.
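The polling-plus-batching style described above can be sketched as follows. Here `rx_burst()` is a toy stand-in for a driver call such as DPDK's `rte_eth_rx_burst()`; this version just drains a counter so the shape of the loop is visible without any hardware.

```c
#include <stddef.h>

/* Kernel-bypass style receive loop: instead of interrupts and
 * system calls, the application polls the NIC's ring directly and
 * processes packets in batches. */
#define BURST 32

int fake_ring = 100;  /* pretend 100 packets are queued at the "NIC" */

/* Toy stand-in for a poll-mode driver call: fetch up to n packet
 * handles; returns how many were actually available. */
size_t rx_burst(void **pkts, size_t n)
{
    size_t got = 0;
    while (got < n && fake_ring > 0) {
        pkts[got++] = &fake_ring;  /* dummy packet handle */
        fake_ring--;
    }
    return got;
}

/* Poll until the ring is drained; returns packets processed. */
size_t poll_loop(void)
{
    void *batch[BURST];
    size_t total = 0, n;
    while ((n = rx_burst(batch, BURST)) > 0) /* a batch at a time, no interrupts */
        total += n;                          /* ...protocol processing would go here... */
    return total;
}
```

A real application would spin this loop forever on a dedicated core; burning one core on polling is the price paid for avoiding per-packet interrupts and system calls.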
So that is all I have for this lecture. I have talked to you about how the socket API is implemented inside the operating system, and I have also spoken about some of the performance issues that can arise and some of the techniques used to address them in modern computer systems. As a small exercise, you can try to find out the size of your RX/TX rings and the transmit/receive queues of the sockets on your system; there are simple commands in Linux that let you find out these numbers and tune them. Play around with them to understand the concepts of this lecture better. That is all I have for this lecture; see you in the next one. Thanks.