I won't really go deep into any of the devices; the point is just to show you what really happens in the data path from the CPU down to the I/O device, plus a few related topics. That is pretty much it. So here is the agenda. We will look at communication protocols for talking to I/O devices, talk a little bit about the magnetic disk, which is probably the I/O device used the most, and talk about how to make disks reliable and dependable. That is RAID, which stands for Redundant Array of Inexpensive Disks. Then there is a little bit about direct memory access and virtual memory; we have seen a lot about virtual memory, and we will look at how one way of doing I/O interacts with it. We will talk a little bit about asynchronous I/O and how it interacts with the virtual memory system. And a little bit about the hardware I/O buses; that will be the only slide that talks about the standards and all those things. There are many, so the slide lists only a few, the common ones that you see. You must have heard of the PCI bus. That stands for Peripheral Component Interconnect.

Before that, a typical computer system looks something like this. Let's say this is your last-level cache controller. You have seen that side very much; most of this course has been about that side, the processor pipeline and the caches and all those things. We have talked a little bit about the memory controller as well, and the DRAM modules that are connected to the memory controller. And then you have your I/O controller, and this bus, typically the PCI bus, connects the I/O controller to the memory controller. The I/O controller also connects to other things, for example your network interface card. It also connects to the disk controller, and to the array of disks that hangs from there, and so on. This particular bus from the disk controller to the disks uses some protocol like SCSI or ATA. You could also be hanging your graphics off the PCI bus, although nothing stops you from connecting the graphics directly to the memory controller; that used to be the case with the old integrated graphics, which actually connected there. And your I/O controller also connects to the other I/O devices, the ones you use most frequently. Keyboard. Mouse. Disk drives. Sorry, what? The printer. The printer, exactly, yes. And other such things. So this is roughly what it looks like.

This particular I/O controller is often called the South Bridge; it is an old nomenclature. Originally, when personal computers came out, the memory controller was called the North Bridge and this one was called the South Bridge, based simply on their locations in the diagram. The memory controller today has actually moved closer to the LLC; in fact, this whole thing will be on a single chip. And this particular bus is often referred to as the frontside bus. That is again an old name coming from the Intel chipsets: when the memory controller was the North Bridge, it would connect to the processor through the frontside bus. But anyway, what I really mean is the bus that connects the LLC controller to the memory controller. So this is what it looks like.

So the PCI bus: PCI and PCI-X. PCI-X stands for PCI eXtended; it is a faster, next-generation PCI bus. It connects the memory controller to various other devices, like a network card, through I/O bus adapters. Whenever you want to connect a device, you usually require an adapter, or a bridge, which is what it is actually called.
When you are connecting a device to the PCI slot of your computer, you need a bridge to translate the protocol from the PCI protocol to whatever protocol the device speaks; for example, here you translate PCI to SCSI. So here are some buses that are often used with magnetic disks: IDE or Ultra ATA. Does anybody know what these things stand for? Integrated Drive Electronics. And this one? Advanced Technology Attachment. And this one, SCSI? Small Computer System Interface. Essentially, these are popular parallel I/O bus standards for connecting to storage devices like the disks we are discussing. You will often find ATA under two names: serial ATA, that is SATA, and parallel ATA. You will probably never see "parallel ATA" spelled out, because when you say just ATA, that actually means parallel ATA; otherwise you would explicitly say SATA, serial.

PCI buses are much wider, 32 to 64 bits, compared to ATA or SCSI. PCI and PCI-X are synchronous buses, and you can see the frequency difference: PCI-X is the next generation and can be clocked at a higher rate, and today pretty much all machines have PCI-X. ATA and SCSI are asynchronous buses. What this means is that they are accessed through an asynchronous handshake; there is no clock, so there is no clock frequency to talk about. ATA throughput can be at most about 100 MB/s, whereas SCSI throughput can range from 10 to 160 MB/s, so it is quite flexible in terms of the handshake speed. And ATA can have only one bus master, while SCSI can have more than one. I will actually stop here about these protocols; one could go on talking about them forever. The standards, definitions, and all those details are published; you can look them up whenever you want.

So let's see how you actually do I/O. There are normally two ways of doing it: one is called memory-mapped I/O, and the other one is called I/O-mapped I/O. Let's look at memory-mapped I/O first. Here, the I/O device registers are mapped into the CPU's memory address space. Essentially, whenever you talk about, let's say, the printer's command register, that register will actually have an address; since we map it, we can give it some memory address which the CPU can directly address. These addresses are usually marked uncached and unmapped: they are physical addresses, not virtual addresses, so you do not even translate them. And uncached really means uncached: these are never cached, for obvious reasons. You have to set the command and actually fire the device; there is no point caching these accesses, because you really want to talk to the device directly. You do not want to put the store in the cache and be done with it; that is not going to get the job done. So if you have mapped the command register of the printer, you would send the command bytes to that register of the printer. Loads and stores to these addresses initiate the actual I/O operation. The path is exactly the same as any other load or store: it starts from the processor pipeline as load/store instructions, but since these are uncached, they bypass the caches and go directly onto this bus. The memory controller will collect these requests, look at the address, decode it, find that they do not belong to the DRAM address space, and hand the request over to the I/O controller to figure out what to do next.
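To make the memory-mapped I/O idea a bit more concrete, here is a minimal sketch in C. Everything specific in it is made up for illustration: the base address, the register layout, and the command encoding would come from the device documentation, and a real driver would obtain the mapping from the OS rather than hard-coding a physical address.

```c
#include <stdint.h>

/* Hypothetical printer registers, memory-mapped at an assumed physical
 * address.  'volatile' prevents the compiler from reordering these
 * accesses or caching them in registers; the address range itself must
 * also be marked uncached/unmapped by the system, as described above,
 * so the stores bypass the caches and actually reach the device. */
#define PRINTER_MMIO_BASE       0xFEDC0000u    /* assumed, illustration only */
#define PRINTER_CMD_START_PRINT 0x1u           /* assumed command encoding   */

typedef struct {
    volatile uint32_t data;     /* bytes to be printed are written here     */
    volatile uint32_t command;  /* writing a command code fires the device  */
    volatile uint32_t status;   /* device reports completion / errors here  */
} printer_regs_t;

static void print_byte(uint8_t b)
{
    printer_regs_t *regs = (printer_regs_t *)(uintptr_t)PRINTER_MMIO_BASE;

    regs->data    = b;                        /* ordinary store carries the data  */
    regs->command = PRINTER_CMD_START_PRINT;  /* ordinary store initiates the I/O */
}
```

From the pipeline's point of view these are just stores; the only special thing about them is the uncached, unmapped address they target.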
The memory controller decodes the address and usually forwards it to the PCI adapter for further inspection, and on further inspection it will find out which device the address actually targets, whether it is your GPU, or your mouse, or keyboard, or whatever it is. These load/store instructions would normally be part of some system call handler. For example, when you are trying to initiate a print command, you will probably be making some system call, and the handler would contain these load/store operations that initiate the printer. So system call handlers write to the data and control registers of the target I/O device to initiate the operation. For example, to start a printing operation you will probably send a particular command to the printer saying "start printing"; before that you will probably send the data, so that as the data streams in, the printer can start printing. This is fairly simple and does not require much change in the processor: these look like ordinary load/store operations. The only thing is that when these loads and stores are looked up, the hardware knows that these are uncached and unmapped addresses, so it can let them bypass the cache and go out on the bus, and the rest is handled by the memory controller and the I/O controller.

The second type is I/O-mapped I/O. This used to be supported in old x86 machines, and is even supported today in certain machines. Here you expose the I/O devices through the ISA: essentially, you have special opcodes that distinguish I/O reads and writes from normal loads and stores. So instead of saying "load a word from some address", you have a new instruction that says "write this particular word to this particular command register"; there are different instructions for doing that. The path is still essentially identical, except that the processor does not have to look up the TLB to figure out what kind of access this is; the instruction itself tells it to bypass the caches, and the rest of the things are the same as in the memory-mapped model.

So how do you write or read a stream of bytes? Usually the sys_write system call writes a stream of bytes to an I/O device and sys_read reads a stream of bytes from an I/O device. For example, when you call scanf, at some point it gets translated to a read system call on your standard input device. Similarly, when you want to put something on your monitor, at some point that gets translated to a write system call which writes to the video device. The question now is: how do you really synchronize with the CPU? The I/O device that brings these bytes in, how does it synchronize with the CPU? The first solution is to poll: the CPU reads an I/O device register continuously. For example, let's say you are reading certain bytes; you have to know when the read is done, so you poll a particular I/O device register continuously, and the device will change the status in that register whenever the read completes. But that wastes a lot of cycles, especially with multiprogramming, which is essentially always the case. The second solution is to send an interrupt: the I/O device sends a hardware interrupt to the CPU. What really happens in this case is that when the interrupt arrives, the program counter changes to some fixed value, which is where the first few instructions of the interrupt handler have to be located.
And from there you can jump to your chosen service routine, which actually does the work. That is why we call it interrupt-driven I/O. The interrupt handler decides what to do based on the interrupt number. This is the most popular and efficient solution and is normally used with all I/O devices; polling is almost never used. Often, for transferring data into and out of memory, you will use direct memory access, which does not interrupt the CPU for every unit of data, because interrupting the CPU for every byte transferred would be too much. So I/O devices often come with a bus master known as the DMA engine, or I/O processor. How do we operate the DMA engine? The CPU writes to the control registers of the DMA engine the starting addresses of the source and target memory blocks and the number of bytes to transfer. So you specify the source memory address and the target memory address, and then you trigger the DMA engine to copy from here to there, and the CPU can work on something else. The DMA engine arbitrates for the frontside bus and transfers the data on its own; this is also known as cycle stealing. Remember that this arbitration is needed because the DMA engine may have to probe the caches to make sure that the most up-to-date data is copied from the source address to the target address: certain data may actually be sitting in the caches on the processor and may not be in memory at all. So the DMA engine will actually send interventions to the caches, retrieve the data, write it out, and when it is done it sends an interrupt to the CPU to signal the completion of the DMA transfer, and the CPU can then pick up whatever work depends on the completed copy. Some DMA engines can transfer to and from multiple addresses, which is known as scatter-gather: you may specify a list of addresses as the source and another list of addresses as the target, and the engine will copy from this address to that one, from this address to that one, and so on. So it can scatter bytes and also gather bytes.
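Here is a rough sketch of what "the CPU writes to the control registers of the DMA engine" might look like from software. The register layout, field names, and control bits are all invented for illustration; an actual DMA engine's programming interface will differ, and a real driver would also have to worry about cache flushing and page pinning, which come up later.

```c
#include <stdint.h>

/* Hypothetical memory-mapped control registers of a DMA engine. */
typedef struct {
    volatile uint64_t src_addr;    /* physical address of the source block  */
    volatile uint64_t dst_addr;    /* physical address of the target block  */
    volatile uint32_t byte_count;  /* number of bytes to transfer           */
    volatile uint32_t control;     /* writing GO starts the transfer        */
    volatile uint32_t status;      /* engine sets DONE when it has finished */
} dma_regs_t;

#define DMA_CTRL_GO   0x1u
#define DMA_STAT_DONE 0x1u

/* Program the engine and return immediately; the CPU is now free to do
 * other work.  Completion is normally signalled by an interrupt from the
 * engine rather than by the CPU polling the status register. */
static void dma_start_copy(dma_regs_t *dma,
                           uint64_t src_phys, uint64_t dst_phys,
                           uint32_t nbytes)
{
    dma->src_addr   = src_phys;
    dma->dst_addr   = dst_phys;
    dma->byte_count = nbytes;
    dma->control    = DMA_CTRL_GO;   /* kick off the transfer */
}
```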
Okay, so one of the I/O devices that you often use is the magnetic disk. Access latency for a disk is normally broken down into five components. The first one is the waiting time, where the request sits in a queue. Then you have the seek time. So what does the disk look like? You have a bunch of platters, and if I look at a particular track here, and take the same track on each of the different surfaces, each surface having its own head, that set of tracks is normally called a cylinder: it connects, not physically but logically, the same track on each of the surfaces. Each track is normally divided into blocks of data called sectors; these could be 512 bytes or 1 kilobyte, depending on how the operating system configures the sector size. So to read a particular piece of data somewhere, essentially two things have to happen. Suppose you have to read data here. First, you have to get the head onto this particular circle, and then you have to make sure this particular point comes under the head so it can read the data. Moving the head into position over this particular track is called the seek: this mechanical assembly moves so that the head is positioned somewhere over this circle. After the head stops moving, the disk assembly rotates to bring the desired sector under the head; that is the rotational latency.

And then you start reading: there is some transfer time that it takes to get the data from the disk to the disk controller, and then the controller takes some time to send the data back to your memory, over this bus. That is how you copy some data from the disk to memory. Usually, among these five components, the seek time is the largest; that is what normally dominates, because it is a purely mechanical activity that moves the head into position over the right track. There are certain optimizations you can do, and one of the common ones is read-ahead. Essentially, if you are currently reading this sector, you might want to read the next sequential sector as well. Depending on your data layout, the next sequential sector may be on the next surface of the same cylinder, or it may be the next sector on the same track; it depends on how you lay out the sectors across the surfaces and cylinders. Nonetheless, the idea is that it exploits spatial locality. Normally, when you read ahead, you put the data from the disk into an area called the disk cache. That is usually a reserved area in main memory that is configured to be the disk cache; the disk controller itself also has some small amount of cache, but that is not much, actually. And a transfer from the disk cache is, of course, much faster. Is there a separate cache? Let us assume there is just one cache; we are not really worried about the exact configuration, there is just one blob of data. What I am asking is: is there any reuse, that is, temporal locality? I am currently reading this particular sector; first it will be read into the disk cache, and then it will be transferred to the CPU. If I have a disk cache inside my disk controller, should I be doing something smart about the replacement policy of the disk cache to exploit temporal locality? Is there a chance that the disk cache will see some reuse of this data? The answer is no. The CPU will be pulling the data out of main memory; it does not really talk to the disk directly. And most likely, with today's memory sizes, the CPU will make full use of that particular sector of data before it gets evicted from memory back to the disk. So it is very unlikely that the disk cache will see any reuse of the data. It is just serving as a spatial-locality agent, nothing else: it streams data in and stages it there so that when the next request arrives, the data has already been fetched.

There are many other optimizations for improving disk performance, both at the hardware level and at the operating system level. What we will focus our attention on is disk arrays, because this is important for reliability: as you can guess, if you have an array of disks you can introduce some redundancy in the data, so that even if one disk fails, you can get the data from somewhere else. However, this adds one more level of control, the array controller: on top of the disk controllers there will be an array controller which sends commands to the array of disk controllers. So what is reliability? Reliability of a system is measured in terms of mean time to failure (MTTF): how frequently my system fails, that is the measure of reliability. Availability is quantified as mean time to failure over mean time between failures (MTBF), and mean time between failures is mean time to failure plus mean time to recover (MTTR).
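Writing the same relationships down as formulas (the numbers in the example are invented, purely to show the effect of a small MTTR):

```latex
\text{MTBF} = \text{MTTF} + \text{MTTR}, \qquad
\text{Availability} = \frac{\text{MTTF}}{\text{MTBF}} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
```

For instance, with MTTF = 100,000 hours and MTTR = 1 hour, availability = 100000/100001 ≈ 0.99999, so the system is unavailable only about 0.001% of the time.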
So here the hope is that the recovery time will be small, so that this ratio stays very close to one, which means that most of the time my system is available. But if the recovery time is large, then during the recovery my system will not be available, which is not good. Dependability of a system can be derived from its reliability and availability. Large storage systems are prone to failure; however, the user is happy as long as the system does not lose any data. That is the primary goal of a storage system, that a user should not lose data, and that is why reliability is very important. The presence of redundancy in the storage structure makes it possible to recover from failures. So, given that storage systems fail, we must have some mechanism to recover the data in that case, and we are going to see some such mechanisms.

Redundancy is used in other places also to improve reliability and dependability; it is not limited to disks. Here is an example from processor design: the IBM 390 mainframe offers a good example. This is an old system. There are multiple processors, and some of them are built-in spares to be used in case a processor fails; so not all of them are in use all the time, some of them are actually spares. Within each processor, certain pipeline resources are duplicated, like the fetch unit, the decoder, and so on. Essentially you have two pipelines, and both pipelines execute the same instruction. The two results coming out of the duplicated instruction pipelines are compared to detect faults. Of course, there is an assumption here. What is the assumption? Why might this not always work? I compare the two results: if they don't match, I conclude that there is a fault, and of course one of them must have failed, that is pretty obvious. If both of them match, I conclude that there is no fault. Is that correct all the time? Why doesn't it work? If both pipelines fail in the same way, they will produce the same result, both of which are wrong. Although the probability of that happening is very, very low, which is why most of the time this works. So, if the results match, the processor state is checkpointed, in case the next instruction fails, so that it can roll back. If the results don't match, the processor rolls back to the previous checkpoint, which is the previous instruction, and retries the failed instruction a few times, hoping that the fault is not permanent. If the fault really is permanent, then hot-swapping takes over. Hot-swapping means that you replace the failed processor with a spare one, and that happens automatically; it takes about a second, which is actually huge. That is the time when your system will not be available. Anyway, that was just an example to show that redundancy helps in all these cases.

Could we improve this a little bit? Instead of having two pipelines, if I give you three pipelines, does it help? Sorry, say it again. With two pipelines you know there is a failure, but... With two pipelines we can detect a failure and recover, if the fault is transient. If both pipelines fail in the same way, we cannot detect it; that is a separate problem. With three pipelines, what else can we do? How does it know which pipeline is correct? It doesn't know. Does it help in any way? So with three pipelines, what can I do? I get three results, and then what? What if they don't match? So suppose the three results are not all identical.
If two of them match, then I can take a majority: if one result doesn't match the other two, I can still go ahead, assume that the other two are correct, and take the majority. But the question is, in what way does that help? I have given you more redundancy; does it help? We don't have to re-execute the instruction. Exactly. With two pipelines, if the results don't match, I have to go back and re-execute to get the correct result. If I have three, I can take the majority vote, believe it, and go ahead; I save some of that overhead.

So, RAID, the Redundant Array of Inexpensive Disks: this is a proposal from the University of California, and over time it has actually become a de facto standard. As the name suggests, it is basically an array of disks with some redundancy. The array of disks provides parallelism; that was the main goal. If you just remove the R from the beginning, if you just have an Array of Inexpensive Disks, then it is about parallelism: it gives you more throughput. What you do is stripe data across the disks to get that throughput. What would be my strategy? Let's suppose these are my disks. I would not put sequential data entirely on one disk and then move on to the next disk. Instead I go like this: I put one block of data here, the next block here, the next block here, and so on, so that sequential accesses go to parallel disks and I get the aggregate throughput. I need a scheme like that because none of the individual disks is actually any faster: the array does not improve the speed of the disks themselves, they remain as slow as they were.

So what is the downside? Is an array of disks more dependable than a single disk? The probability of failure of a single disk is p; what is it for n disks? P by n? What other options do I have? P to the power n? Is that smaller or larger than p? Smaller, because p is a fraction, so raising it to a power makes it go down. But let's ask the question precisely. The probability of failure of one disk is p. If I give you an array of n disks, what is the probability that at least one disk fails? n times p? p times one minus p to the power n? Let's start with an easier one: what is the probability that nothing fails? One minus p, to the power n. So the probability that at least one disk fails is one minus (one minus p) to the power n. Which one is larger, p or this quantity? Write one minus p as q; the probability of at least one failure is one minus q to the power n, and since q to the power n is less than q, one minus q to the power n is larger than one minus q, which is p. So the probability that at least one disk fails is much larger than the probability of failure of a single disk; that is what is mentioned here.
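Here is the same argument written out, assuming p is the probability that a given disk fails over some fixed period and that disks fail independently (the numeric example at the end uses made-up values, just for illustration):

```latex
P(\text{a given disk fails}) = p, \qquad
P(\text{no disk out of } n \text{ fails}) = (1-p)^{n}, \qquad
P(\text{at least one of } n \text{ disks fails}) = 1 - (1-p)^{n}.
```

With q = 1 - p and 0 < q < 1 we have q^n < q, so 1 - (1-p)^n > 1 - (1-p) = p: the array as a whole fails more often than a single disk. For example, with p = 0.01 and n = 100, the probability that at least one disk fails is 1 - 0.99^100 ≈ 0.63.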
So the probability of failure goes up, and the R in RAID tries to bridge exactly this gap: you have an array of disks, which is good, but it comes with this problem, so can I bridge the gap with redundancy? Essentially, what I mean is that there must be a possibility of reconstructing lost data. The good news is that MTTR is much less than MTTF, so let us see whether we can make an array of, say, 100 disks as dependable as a single one. The question remains: how do you discover faults? Disk sectors carry error detection and error correction codes; a typical one would be single error correct, double error detect. How do you do this? It means I must encode the data in such a way that I can correct one error and detect two errors. Can you suggest a simple way of doing that, a simple coding scheme? Let me set the problem up properly. Assume that I will transmit, say, n bits over a noisy channel, and I want the receiver to be able to do single error correction and double error detection; I am not worried about how many extra bits I have to transmit to achieve this. Can you suggest a simple protocol? Counting the number of ones? Can you detect double errors with that? If I flip one particular bit from one to zero and another from zero to one, I get exactly the same number of ones. You can calculate a hash of the pattern? You have to give a guarantee here; a hash will not necessarily do that. If multiple packets come in parallel, count the number of ones in the first bit of each packet? But I am transmitting just one packet of data; I don't have n packets. The receiver should be able to correct one error and detect two errors. So here is one of the simple things: you transmit every bit thrice, and the receiver checks the three bits in blocks. If all of them are not the same, then you are sure that there is an error, and if there is one error you can correct it: you look at the other two copies and you know what the correct value is. If there are two errors, can you detect that? Only if you compare all the bits? No, there is no such thing; I am just transmitting n bits over a channel. You cannot, right? You would need one more copy. So does everybody see that repeating each bit thrice does not quite get you there? You can detect an error and correct it if there is a single error, but you cannot guarantee detection of double errors. What else can you do? What if I also transmit the XOR of all the bits? I repeat every bit thrice, and I send one more extra bit which is the XOR of all the data bits. Will that help in detecting two errors? We could have an error in that XOR bit also. Yes, okay; so let's assume for the moment that the parity bit itself is not corrupted. Then you can do it, right? Yeah. Okay. There are many other ways of doing this, but roughly, something like that would do. The point here is that if you attach error detection and correction codes, you can recover from errors, and combined with hot-swapping of hot-spare drives, that gives you the availability and dependability which is important for file servers.

So let's look at some of the RAID levels, and we will also look at some coding examples. RAID offers several levels of redundancy. RAID 0 is no redundancy: plain striping is used to improve throughput. This is very similar to memory banking; there is no redundancy at all.
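Just to make the striping idea concrete, here is a tiny sketch of the block-to-disk mapping a RAID 0 style array might use. The names are made up for illustration; a real array controller does this mapping internally.

```c
#include <stddef.h>

/* Where does logical block number 'logical_block' live in an array of
 * 'n_disks' disks striped round-robin at block granularity?            */
typedef struct {
    size_t disk;    /* which disk in the array       */
    size_t offset;  /* block number within that disk */
} stripe_loc_t;

static stripe_loc_t stripe_map(size_t logical_block, size_t n_disks)
{
    stripe_loc_t loc;
    loc.disk   = logical_block % n_disks;  /* consecutive blocks hit different disks */
    loc.offset = logical_block / n_disks;
    return loc;
}
```

Because consecutive logical blocks land on different disks, a long sequential read can be served by all the disks in parallel, which is where the extra throughput comes from.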
RAID 1 uses mirroring. It requires twice as many disks as RAID 0, and a read can be served from the disk that gives the smaller seek. Essentially, I duplicate every disk; that is my mirror. And since I now have the option of reading the data from two possible disks, I read from the one that has the smaller seek. Now, depending on how you organize the disks, you may get RAID 1+0, which is striped mirrors, or RAID 0+1, which is mirrored stripes. So here, say, I have two disks of data, and then I mirror them: this is mirrored stripes. I first stripe the data and then mirror that; essentially, I first do RAID 0, striping, and then apply RAID 1 on top. The other thing I could do is mirror first and then stripe: that is striped mirrors; whenever I create a disk, I mirror it first, and then I stripe across the mirrored pairs.

RAID 2 adds error correction codes on top of this. Now, in the original proposal there was really no constraint on what kind of error correction code you can use. For example, if you use the repetition scheme we just talked about, you multiply the number of disks accordingly, and different kinds of codes give you different kinds of recovery. For example, here is a scheme that gives you a 50% overhead. Let's say I have four data disks; I take the pairs and store the parity of each pair in a separate disk, so I have two error correction disks. This one XORs the data of these two and puts it here, and this one XORs the data of these two and puts it here. What kind of error correction or detection can I do with this? Let me number these, say A, B, C, D, and the parity disks P and Q; P is A XOR B and Q is C XOR D. Can I do single error correction? I can, right? Can I detect double errors? What if both A and B flip the same bit? The XOR stays the same, so I cannot catch it. So I can do single error correction here, and I can detect some errors; that is all I can do. Then there is something called Hamming codes; how many of you know them? Hamming codes, not Hamming distance. Okay, so I won't get into the details of Hamming codes, I'll just tell you what they are; the original paper actually mentioned Hamming codes for this. A Hamming code is normally described as an (n, k) code, where n is the total number of bits, that is, data plus your error correction code, which is 2 to the power r minus 1 for some r; so it is 2 to the power r minus 1 bits that you transmit, and out of these, the number of actual data bits is 2 to the power r minus 1, minus r. Essentially, what this means is that if you have four data bits, you need three more bits to construct a Hamming code; so (7, 4), for example, is a possible Hamming code (the parameters are written out in the small block below). On top of a Hamming code, if you add one overall parity bit that XORs all the bits, you can actually do double error detection along with single error correction. Anyway, this is more of a theoretical proposal, and industry doesn't really use it in any of the commercial disk arrays, although RAID 1+0 and RAID 0 are very popular, and of course the subsequent levels are as well.
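For reference, the Hamming code parameters mentioned above, written out (this is just the standard parameterization of Hamming codes, nothing specific to RAID):

```latex
n = 2^{r} - 1 \ \text{(total bits)}, \qquad
k = 2^{r} - 1 - r \ \text{(data bits)}, \qquad
r \ \text{check bits}.
```

For r = 3 this gives the (7, 4) Hamming code: 4 data bits protected by 3 check bits. Adding one overall parity bit on top yields SEC-DED, single error correction with double error detection.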
RAID 3 uses bit-interleaved parity. Here you add one extra disk that maintains the parity, the XOR, of all the data disks, and here is an example. If a single disk fails, you can reconstruct its contents; if two disks fail at the same time, it is impossible to recover, just as we discussed before. So that's RAID 3, and here is an example for it. Suppose I want to do a 512-byte read, 512 bytes being my sector size. You read the corresponding 512 bytes from the target disk where the data is located. You reconstruct the required data from the parity disk and all the other disks only in case of a failure; only then do you have to access all the disks for this particular sector of data. A 512-byte write is actually more expensive: you first read the corresponding 512 bytes from all the disks except the target one, compute the parity, and write the parity as well as the new data. So here, for computing the parity, you need to read all the disks. But actually that is not necessary, because of the properties of XOR, and that is what is exploited to reduce the write overhead: you observe that XOR is self-inverting, so for computing the new parity you can just take the old parity, XOR it with the target block's old data, and XOR that with the target block's new data; that gives the new parity (there is a tiny code sketch of this update after the RAID discussion below). Still, the parity disk remains a bottleneck for writes, because it has to be accessed on every write, wherever the parity is stored. RAID 4 is this picture: these are my data disks, and this is the one parity disk, which XORs all the data blocks. RAID 5 uses distributed block-interleaved parity. Essentially, it does not have a fixed parity disk; it distributes the parity blocks across the disks. So P0 is here, P1 would be here, P2 would be here, P3 would be here. Is that good? What does it buy us? Sorry? That the parity disk can fail is a problem in RAID 4? Oh, that is still a problem here: if the disk holding P0 fails, I lose P0, right? Yeah. Do you see the overhead on the parity disk? On every access you need to go to the parity disk; over here you don't need that, it has been distributed. Right, although actually reading is not the real problem; the problem was the writes. In RAID 4, if you wanted to do two writes that go to different data disks, the single parity disk would serialize them; here you can let them go concurrently. Yes, you're right, for reads you could say the same, but reads can be served from the data disks directly, so that was not really the main issue. RAID 6 maintains two different parity blocks per group, known as P plus Q redundancy, which makes recovery from two errors possible. How do we do this? Similarly, the P and Q parity are distributed. Now, P could be the XOR of all the data, but Q should be a stronger code: if you really want to recover from two errors and Q is also just an XOR, you gain nothing by having Q. So Q must be a different kind of code, computed independently of P. I'm not getting into the mathematics of that; if you are really interested in how to generate the second parity, you can read some papers. Okay, so that's about RAID.
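As a tiny illustration of the small-write parity update described above, here is a sketch in C. The block size and the function name are made up; a real array controller does this in hardware or firmware, but the XOR identity is the same.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 512  /* assumed sector size, as in the example above */

/* Small-write parity update using the self-inverting property of XOR:
 *     new_parity = old_parity ^ old_data ^ new_data
 * Only the target block and the parity block have to be read, instead
 * of every disk in the group.                                          */
static void update_parity(const uint8_t old_parity[BLOCK_BYTES],
                          const uint8_t old_data[BLOCK_BYTES],
                          const uint8_t new_data[BLOCK_BYTES],
                          uint8_t new_parity[BLOCK_BYTES])
{
    for (size_t i = 0; i < BLOCK_BYTES; i++)
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
}
```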
Now, when you do DMA, one question that we did not address is: does DMA use virtual addresses or physical addresses? The way we have described it, DMA seems to use physical addresses, because the CPU sends the physical addresses of the source and the target, and the rest happens without any intervention of the OS or anything; you copy from the physical address of the source to the physical address of the target. So let's see the pros and cons of both.

If you want to do virtual DMA, that is, DMA based on virtual addresses, you will require a TLB in the DMA engine so that it can generate the corresponding physical addresses, and this TLB must be under the control of the OS and must be kept coherent with your processor TLBs. On the other hand, physical DMA has several issues, including security. Here are some questions that you should think about when dealing with physical DMA. The first question is: what if the amount of data to be transferred is more than a page? When I am done with one page, where should I go next? The physical pages are not contiguous; they are normally scattered. And normally, when you do DMA, the CPU sets only the starting address, so when you cross the page boundary you cannot simply keep writing into the next physical page; that means the OS must work out the full set of physical page frames involved before the DMA starts. Next, what if the OS replaces a physical page frame which is being used by DMA? To avoid that, you must pin the pages involved in DMA, so that while a DMA is going on, a page is not evicted. And the OS must copy the entire user data to kernel space before the DMA is initiated; this deals with the security issue. A particular user could otherwise end up writing into some other user's address space, because once you start the DMA there is no protection check; no check is done at all, it is a byte copy from one address to another, that's it. To avoid this problem, the OS copies the entire user data into kernel space before the DMA is started. The DMA engine may also have to flush or refill the caches around the data transfer, and if the caches are virtually indexed, that may also be a problem with physical DMA. This is not a problem specific to DMA; even in multiprocessors this problem arises, because when processor 1 sends data to processor 2, it normally arrives in terms of physical addresses, and if the processor caches are virtually indexed, how do you look them up? We did not discuss any of these things, and we really cannot do that in this course; it requires a lot of prior exposure to certain concepts. But anyway, these are the issues with virtual versus physical DMA. It seems that virtual DMA may be easier to support, as long as the DMA engine's translations are under OS control; with physical DMA, as long as you restrict your transfer size to a page, you are probably fine.

Until now, we have been mostly talking about synchronous I/O, that is, whenever you start an I/O operation, the calling process gets switched out, and as some of you have seen, the calling process must wait until the I/O completes. However, you could do asynchronous I/O as well, and it has the same philosophy as the other latency-hiding techniques we have seen: keep doing useful work while the slow operation is in progress. Essentially, here we allow the process to continue even after making an I/O request; only if it tries to access bytes that have not arrived yet should it be blocked. How do we do this? The process is a program. Let us suppose the program has this code: a scanf into x, and then x++. If I allow asynchronous I/O for the stdin read, the program will continue executing down to here, and the question is: at what point should you do the context switch? How do I know that x is not yet available? The compiler just generates a load with a virtual address; the hardware looks up the TLB, maybe takes a page fault, gets a translation, accesses the physical page, and gets some value, which may well be stale if the I/O has not finished; that is exactly the problem. So how do I really achieve this? The key is to detect the I/O completion.
Yeah, but the access to x is just a load instruction; how do you trap it, actually? The page fault handler can figure out that this page is waiting on I/O? That this address maps to an I/O buffer? Why do you say so? Because there is no valid translation in the TLB and this is on the I/O path? Which I/O path? Let's walk through how the I/O request actually works in this case. There will be a read system call. At the lowest hardware level, the bytes are read from the keyboard buffer into some memory area, and then copied from that memory area into the user's page; that is the job of the system call handler. So it can block the process at that point? Block which part? How does it do that? The process will eventually generate this load instruction, which comes out with the address of x; now what do we do? As was suggested, you can mark, in the page table entry for that page, a bit saying that there is an I/O pending to this page, and only when the I/O has completed is that bit cleared. On the next access, at translation time, you check whether that bit is set or not; if it is set, then we have to do a context switch. Inside the system call handler, at some point, the handler has to copy the data from the I/O buffer in kernel space to this particular address in physical memory; that has to happen, because the processor will eventually generate this load, which will go to that address. Am I making sense? What I am saying is that, in the sequence of things that happen, there is a read system call which essentially copies from the I/O buffer, whatever that is, and then the next step copies from the I/O buffer into the user's x. Previously, with synchronous I/O, the process would get switched out immediately as soon as you executed the system call. Here, what I am doing is allowing the process to continue while all of this is happening; the processor will generate that load instruction eventually. So what we have to do is, as was mentioned, record in the page table that this particular virtual page has an I/O pending, and make sure that check happens before we allow the process to go past the load. If you want to do a sequence of such I/O operations, part of the system call handler also has to be involved in the cleanup: the TLB entries have to be shot down so that the operating system can safely change the page table entry, and the same on the way back. So now the process can have multiple outstanding I/O requests, instead of being switched out immediately as soon as the system call is made, the way synchronous I/O does it.

The other topic is on this last slide; we will finish it up next time. And then next week, what we will do is talk about hyper-threading, which is simultaneous multithreading; that is what we are going to invest our time in. It is kind of a hardware threading mechanism. We have learned a lot about threading at the operating system level; we will see how to bring that down to the hardware. So that is how we will spend the time on SMT.