So, let us start with the lowest level of the internals, which is how data is stored in a database. And we are going to start from an even lower level: the database itself sits on top of a storage device, which could be a magnetic disk, flash memory, or something else. The question is how data is stored in these devices and how it is retrieved from them. One reason we pay particular attention to this is that certain characteristics of the device play an enormous role in the design of database systems. We need to understand the technology because it greatly affects how database systems are designed, and some of these things are changing rapidly, so I will also mention how things are changing today. We are going to start with an overview of storage media, then focus on magnetic disks and a little bit on flash, and later we will move to higher levels: on top of the storage system, what does the database system do to store records? Physical storage media can be categorized in several ways: how fast can data be accessed, what is the cost per unit of data stored, how reliable is it, and, most importantly, is it volatile or non-volatile. Volatile storage is like RAM: the moment you switch off your computer, the contents are gone. While your program is running, its data has to be in RAM temporarily, because that is the only thing the CPU can access directly, but to save data persistently you have to put it in non-volatile storage, typically a disk or flash storage, and occasionally other forms like battery-backed main memory. Here is a figure depicting the hierarchy. The highest level is cache, which is typically on your processor and is really fast to access: if your processor runs at 1 gigahertz or more, typically 2 to 3 gigahertz, the cache can be accessed at the same clock speed as the processor, or almost as fast depending on whether it is L1 or L2 cache, but it is very limited in size. The next level is main memory, which these days is really big, gigabytes, but the access time is significant: typical main memories even today take something like 10 to 100 nanoseconds to fetch data, whereas your processor runs at a rate of roughly 2 instructions per nanosecond. So if the access took 100 nanoseconds, you could run about 200 instructions in the time it takes to read a single byte from memory. There is a whole bunch of tricks used to make sure that data is as far as possible in the cache and that data is fetched from main memory in large blocks. We do not need to worry about those tricks here, but I should mention that anyone building database system internals has to deal with the fact that today caches are much faster than main memory, and this is likely to continue, so when they code the database programs or data structures they have to be aware that there is a cache. It has an impact on database system design and coding; we will not get into it, but it is important to realize it is there. The next level is flash memory, which is very fast compared to magnetic disk although slower than main memory; its per-byte cost is more than disk but much less than main memory, so it sits in between. Beyond magnetic disk you have optical disks, DVDs and the writable forms of DVD or Blu-ray, and then you have the highest capacity in magnetic tapes.
The cost per byte of tape is low, but it turns out these days the cost of the tape drive is very high, so people are not using tapes as much as they used to for backup. People very often use magnetic disks as the lowest level instead, because disks have become so big and cheap that you can archive your data on them. So there is a notion of primary, secondary and tertiary storage. Primary is main memory: immediate access. Secondary is disk or flash, which takes some time to access; you cannot operate on data there directly, you have to load it into primary memory. Tertiary storage is used for backup, and is typically tape or DVD and so forth. Even today, for audit and legal reasons, people do need to back up data onto tape or optical disk in tertiary storage. Magnetic disks are by far the most widely used today, although flash storage is gaining ground quite fast. Magnetic disks have driven the design of database systems, so we need to understand how they work. This is a schematic diagram; disks do not actually look like this, but we have expanded things in cartoon style to show what is relevant. First of all, a disk has a stack of magnetic platters. Each platter has a recording read-write surface coated with some magnetic material. Then there is an arm, which this figure does not show very well, but it actually swings, so that the read-write head at the tip of the arm can move to the innermost part of the disk or to an outer part. Remember the surface is two-dimensional: to get to a certain point on the surface, the arm has to swing to the right radial position, and the disk has to spin so that the particular piece of the surface comes under the read-write head. So there are two parts. The arm swinging is called seeking; when you ask for a particular piece of data from disk, the arm has to swing to the correct position, but at that point the actual data may be on the other side of the platter, so after the arm is in place the disk has to turn until the data you want comes under the head, and only then can you read or write it. Now, the data on the surface of a disk is divided into tracks: when the arm is stationary and the disk is spinning, all the data that passes under a particular head is one track on one surface. A typical disk has a few platters and correspondingly a few arms, all of which swing together, incidentally. So when the arm has gone to a particular position, you can read the corresponding tracks of all the platters, and that set of tracks is called a cylinder; it is shown here. Within a track, which is a circle on the disk, the data is broken up into sectors. When you read or write data, the basic unit of transfer is the sector: even if you read one byte, the disk system will read an entire sector, and if you write one byte it will write an entire sector. In fact, these days the sector is not really the unit for reading: when you read, the disk system will actually read many sectors at once, maybe a whole track at a time, and cache them in the disk's own memory, so that if you then ask for the next sector and the next one and so on, they are already cached. When you write, it will write only what is required.
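To make the geometry concrete, here is a small illustrative sketch, in Python, of the classical mapping from a (cylinder, head, sector) position to a linear block number. The geometry figures are made up, and modern drives use zoned recording and expose only logical block addresses, so treat this purely as an illustration of the idea.

```python
# Classical fixed-geometry mapping from (cylinder, head, sector) to a logical
# block number. Real disks use zoned recording and expose only logical block
# addresses; this is just to show how tracks, surfaces and sectors line up.

def chs_to_lba(cylinder: int, head: int, sector: int,
               heads_per_cylinder: int, sectors_per_track: int) -> int:
    # Sectors are traditionally numbered from 1, hence the "- 1".
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)

# Example: a toy disk with 4 recording surfaces (heads) and 63 sectors per track.
print(chs_to_lba(cylinder=10, head=2, sector=1,
                 heads_per_cylinder=4, sectors_per_track=63))  # -> 2646
```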
Now, the disk is a mechanical device, so for the arm to swing to a particular position takes time. On the human scale it is very fast: it moves in something like 5 milliseconds or less, which is amazingly fast; you cannot see it moving, and if you could actually watch the arm moving back and forth it would just be a blur. But compared to clock speeds it is painfully slow. Clock speeds are in gigahertz, so one clock cycle is about a nanosecond, 10 to the power minus 9 seconds, whereas the arm movement takes milliseconds, 10 to the power minus 3 seconds. A millisecond is a million nanoseconds, so there is roughly a factor of a million between the CPU clock and the time to reach data on disk. That is a huge difference, and as a result the mechanical delay in going to the disk dominates the time to read data. That is one part; the other part is that the disk has to spin. Disks actually spin enormously fast: the slower disks run at 5400 rpm, which is 90 rotations per second, and faster ones run at roughly twice that. That is extremely fast spinning, but even so it is very, very slow for our purposes, because even at that speed it again takes on the order of 5 milliseconds on average for the data to rotate under the head. So in total, on average 5 to 10 milliseconds is the time to read a random piece of data, and that is a major limiting factor. Database systems have worked hard to ensure that this delay can be masked using a variety of tricks: query processing algorithms, indexing, everything has to deal with the fact that reading a random byte of data takes a long time. This slide has already been covered, so I am going to skip it. So that is an individual magnetic disk. These days any high-volume system requires not just one disk but many disks, and typically these disks are packaged into a unit which is a box by itself; you can connect it over a network to multiple computers. If the box looks like a disk to the computers, but actually has multiple disks inside and does a whole bunch of clever stuff, it is called a storage area network (SAN). There is another form where the box does not act like a disk but acts like a file server; otherwise it is similar, a box connected on the network, which the individual machines use as a file server rather than as a raw disk. That is called network-attached storage (NAS). Both are used widely in industry. Most large systems do not have a disk attached to a single CPU, at least for enterprise applications; what web companies do is slightly different, they actually keep data along with the CPUs. But many enterprise applications store data on these separate standalone disk subsystems and then have a number of processors which run the actual database code. So, I said that certain things are slow on disks, and there is some terminology for this. Access time is the time from when a read or write request is issued to when it is satisfied, that is, when the transfer begins. It consists of two things: seek time, the arm swinging, and rotational latency, the time it takes for the disk to spin and bring the sector underneath the head.
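As a rough sketch of that arithmetic, assuming a 7200 rpm drive and a 5 millisecond average seek (both figures are assumed for illustration, not measurements of any particular drive):

```python
# Rough access-time arithmetic for one random read.
rpm = 7200                               # assumed rotational speed
avg_seek_ms = 5.0                        # assumed average seek time

full_rotation_ms = 60_000 / rpm                      # ~8.3 ms per rotation
avg_rotational_latency_ms = full_rotation_ms / 2     # wait half a turn on average

access_ms = avg_seek_ms + avg_rotational_latency_ms
print(f"average access time ~ {access_ms:.1f} ms")          # ~9.2 ms
print(f"random reads per second ~ {1000 / access_ms:.0f}")  # ~109
```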
So, typically the seek time is 4 to 10 milliseconds, and the rotational latency can again be up to 4 to 11 milliseconds in the worst case; the average case is somewhat lower. A common rule of thumb is that the access time on average is of the order of 5 to 10 milliseconds, and this holds across a variety of disks, so 5 to 10 milliseconds is a good rule of thumb. Then there is the data transfer rate, which is how fast you can go once you start reading; this number is typically of the order of 25 to 100 megabytes per second today, and it has been growing steadily. It turns out that access time has not improved that much: it has gone from about 20 milliseconds to maybe 5 milliseconds over the span of two decades, an improvement by a factor of 4, which is ridiculous compared to how everything else has grown; everything else has gone up by a factor of 1000 or more in two decades. So access time has become more and more important if you need to get data from a magnetic hard disk. I am going to skip some details of controllers and come to another very important metric, which is how often the device fails. Storage devices do fail; the question is how often, and can you quantify it? Somebody may say a disk is likely to last a few years, but what is "a few"? These numbers have been quantified by the manufacturers, and people have actually verified what the manufacturers say; by and large they are honest. Typically you can expect a disk to last 3 to 5 years if it is used continuously and kept in decent surroundings; if you throw the disk around it is going to fail faster. But the way disks fail is actually slightly different. When a disk is brand new, the chance of it failing is highest, due to any slight manufacturing flaws; but if it survives for a little while, then it tends to live on to middle age. The analogy is with people: infant mortality is relatively high, but once children cross infancy they tend to live for a good amount of time, and old age is the next point where the death rate rises again. Disks are similar: if they survive infancy they run peacefully for a long time, and then in old age they start failing. Still, there is a probability that a given disk will fail between infancy and old age, and we are interested in this because if a disk is old you had better replace it, and if a disk is young, these days manufacturers burn it in at their facility before shipping it to you, so when you receive it it is already likely to be good. The point is that in this intermediate period there is a small but significant chance of failure, which is viewed as a random event, and so you can say that on average one disk out of so many thousand will fail on a given day. Another way of saying it: if I give you an arbitrary, randomly picked disk, based on the current rate of failure it has some mean time to failure. If I have one disk and I say its mean time to failure is 500,000 hours, that does not mean that in a perfect world it will live for 500,000 hours on average, because by then it would be a very old disk and would fail of old age. A better way of looking at it is this: if I have a thousand disks, all of which are neither infants nor very aged, all in the prime of their life, then on average every 500 hours one of them will fail.
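A one-line version of that arithmetic, with the figures from the discussion above (illustrative only):

```python
# If a single disk's MTTF is 500,000 hours, a fleet of 1,000 such disks
# (all in the "prime of life" region of the failure curve) sees a failure
# roughly every MTTF / N hours.
mttf_hours = 500_000
num_disks = 1_000

hours_between_failures = mttf_hours / num_disks
print(hours_between_failures)        # 500 hours between failures in the fleet
print(hours_between_failures / 24)   # ~20.8 days
```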
So, 500,000 hours seems like a lot of time, it is several years, but if you have a thousand disks, every 500 hours one of them is going to fail. What is 500 hours? Roughly 20 days. If you are going to lose data every 20 days, that is a terrible situation, so you need to do something about it. And 500,000 hours is the low end; with better disks you can get a little over a million hours, say 1.2 million, but even then, with a thousand disks, on average one of them fails every 1200 hours. We will see how to deal with failures in a little bit using RAID, but before that let me say a little about flash memory, because that is becoming increasingly common. All of us use these pen drives or USB keys, call them what you want; they are ubiquitous these days and very cheap: you can get a 4 gigabyte flash USB stick for a few hundred rupees, which is very cheap compared to what they cost some time back. They are based on flash, but it turns out that your typical USB stick's data transfer rate is not that high; it may be of the order of a few megabytes per second, which is much, much slower than a hard disk; hard disks can transfer data at 50 to 100 megabytes a second. So these days what people are using is something called a solid state disk (SSD), which actually has a number of these flash memory chips running in parallel, and even though an individual chip may not be able to transfer data that fast, together they have very high bandwidth; they can transfer at 100 to 200 megabytes per second or even faster. So they are very fast, and most importantly their random access speed is much, much better than hard disk. What does this mean? On a hard disk, if I ask you to read a randomly picked piece of data, say a customer comes in and we read the customer's profile or balance, it is going to take 5 to 10 milliseconds; on flash it is going to take 1 or 2 microseconds. One microsecond versus 5 milliseconds is a factor of 5000 difference, so that is a huge benefit over hard disk. So every manufacturer of database software today is focusing on how to get their systems to run faster on flash disks. Solid state disks are still expensive per byte compared to hard disk, but in absolute terms, for 10,000 rupees I can get a 64 gigabyte solid state disk, which could store all of IIT's data comfortably; that is quite affordable. People are still a little hesitant because they are not sure about long-term failure rates: if solid state disks start failing more often than hard disks, that is a big problem. But these days they are becoming quite reliable, and they will take over eventually. Still, they are not going to be able to store certain kinds of data cost-effectively. Video, for example: YouTube has so much data that storing it on flash would be unthinkable. Even just the lectures being recorded here amount to so much data that putting them on flash would be fairly expensive; it would strain the budget. So there is going to be a combination of hard disk and flash used in storage systems, and companies have already come out with systems which combine flash and hard disk, hybrid storage systems. Every manufacturer is coming out with these, and any high-end system will use them. A couple of caveats: so far I have told you that flash memory is great.
But there are a few problems as well. One problem is that you cannot simply overwrite data in place as on a disk: you have to erase before you can write, and that erase is slow, it can take 1 to 2 milliseconds, which is almost as slow as a disk access. Now, if every time you have to write a piece of data you first erase and then write, that is almost as slow as disk; there is no benefit. So what people figured out is: instead of writing on the same spot, I have written something here, I cannot wipe it because that would take time, so I say that the page which used to reside here now resides over there, and I go and write it there, on a fresh, clean area. The old location has now been written upon and is junk, because my page has been remapped; eventually I have to go and erase that area, and the erase will take 1 or 2 milliseconds. The good news is that the unit of erase is a whole block containing many pages, and the entire block can be erased in those 1 or 2 milliseconds. Now, the fact that I am erasing an entire block means that any live data in that block has to be copied somewhere else first. So there is a lot of trickery going on to make sure that erases do not have too much of an impact, and the layer of software which does all this trickery is called a flash translation layer; it is included in every operating system today. So when you plug your USB key in, there is a file system with a flash translation layer doing these tricks, so that writes appear to be fast even though erases are slow. The other piece of bad news is that if you keep writing and erasing, writing and erasing, then after about 100,000 erases for typical flash, a million for some higher-quality flash, the page will no longer be able to store any data; the erasing will have damaged it completely. So there is a limit to how many times you can erase a particular block of flash, and if you keep writing and erasing the same part of flash again and again you can actually destroy it quite soon: 100,000 erases at 1 millisecond per erase, how much is that? 100 seconds. Give me a piece of flash memory and, if I want to destroy one block of it, I can do it in 100 seconds; that is terrible, I have lost that block forever. I can remap and avoid using it, but the better strategy is to move the data around, so that after a block has been erased a few times it is either left alone for a while or holds data which is not being actively written, so it is no longer getting erased. The erases are thus spread across the whole memory, and the memory has a very large number of blocks, so if I want to destroy all of it, it will not take 100 seconds; it will take an enormously long time (the exact figure depends on the capacity). People have figured out that even if you actively try to destroy a piece of flash memory by writing and erasing continuously, it will take you a few years, and of course in reality you are not trying to destroy it, so it will live much longer than that.
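Here is a minimal toy sketch of the remapping and wear-levelling idea; the class name, block size and policy are invented for illustration, and a real flash translation layer additionally does garbage collection, bad-block handling and much more.

```python
# A toy flash translation layer: logical pages are remapped to fresh physical
# pages on every overwrite, erases happen per block, and writes are steered to
# the least-erased block that still has a clean page (crude wear levelling).

PAGES_PER_BLOCK = 4

class ToyFTL:
    def __init__(self, num_blocks: int):
        self.erase_count = [0] * num_blocks          # wear per block
        self.free_pages = {b: list(range(PAGES_PER_BLOCK)) for b in range(num_blocks)}
        self.mapping = {}                            # logical page -> (block, page, data)

    def write(self, logical_page: int, data: str) -> None:
        # The old copy (if any) simply becomes stale; it is not erased in place.
        self.mapping.pop(logical_page, None)
        # Pick the least-worn block that still has a clean page.
        candidates = [b for b, pages in self.free_pages.items() if pages]
        block = min(candidates, key=lambda b: self.erase_count[b])
        page = self.free_pages[block].pop()
        self.mapping[logical_page] = (block, page, data)

    def erase_block(self, block: int) -> None:
        # Erasing is the slow (~1-2 ms) operation and wears the block out a little.
        self.erase_count[block] += 1
        self.free_pages[block] = list(range(PAGES_PER_BLOCK))

ftl = ToyFTL(num_blocks=2)
for i in range(4):
    ftl.write(0, f"v{i}")      # repeated overwrites of logical page 0 fill block 0
ftl.write(0, "v4")             # lands in block 1; block 0 now holds only stale copies
ftl.erase_block(0)             # safe to reclaim block 0: the live copy is in block 1
print(ftl.erase_count)         # [1, 0]
```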
So, time for a couple of quiz questions. Remote centres, please make sure your software is working; participants, press the sticky and be ready. Time is up. The answer to that question was straightforward, as I just discussed: the typical access time for a disk is of the order of 10 milliseconds, it may be as low as 5, but it is certainly not 1 millisecond; no disk gives you 1 millisecond. So the answer was (c). Question 2 is the same question, but this time for flash instead of hard disk; the options are the same. Just wait... time is up for that question. While the answers are being tabulated: the answer is choice 1, a flash disk actually reads a randomly chosen piece of data in about 1 microsecond. In fact, 1 microsecond is still much slower than a memory access, so what happens is that you do not read just 1 byte in that microsecond: you read a number of bytes from flash. Even from memory, in 100 nanoseconds you typically do not read 1 byte, you read what is called a cache line of 16 to 64 bytes; with flash, something like 1 kilobyte is read in 1 microsecond. How many people answered this time? We had 190 people answering, which is good, about the best response we have had so far. And the popular choice was, unfortunately, 1 millisecond; no, it is not 1 millisecond, it is 1 microsecond. So flash is 1000 times, or actually more like 5000 to 10,000 times, faster than disk if you read a random piece of data. On the other hand, if you are reading data continuously, say I have a video, a movie, and I am reading all of it, that is called a sequential read. When data is accessed sequentially, flash memory no longer has such a benefit. I told you that a pen drive is much, much slower than a hard disk; a solid state disk is maybe twice as fast as a hard disk for sequential access, but not 5000 times. So there is a huge difference between random I/O and sequential I/O: for sequential I/O a hard disk is perfectly good; for random I/O the hard disk is slow and flash is much better. Which brings us to an interesting issue. For many applications, such as YouTube videos and the like, random I/O is not an issue: you are going to fetch a large amount of data, and the bandwidth, how fast it can be transferred, is what matters. So moving to flash makes little sense there; it would be hard disk. On the other hand, when do you have random I/O? You go to a bank website and look up your account balance; you may do it at any time, I may do it at any time, the bank gets these requests at arbitrary points of time, and our account data is scattered on disk, so whenever I do a lookup like this it is random I/O. When I send a query to Google, Google has to take that word, or set of words, find information about it and answer the query. There are billions of words out there, and each user may give a different word, so again there is a lot of random access going on. So when you have an application dominated by random I/O, it turns out that if the data is on disk, the number of separate pieces of data you can fetch per second is what dominates. Given one disk, if it takes 10 milliseconds to access a piece of data, how many different pieces of data can I get from that disk in one second? 10 milliseconds per access means 100 different things can be read in one second. So if I get 100 people looking up their bank account per second, and assuming I can get to the disk page for each person with one disk access, which is not actually correct, it really requires several disk accesses, then with one disk, in the best case, I can serve 100 requests per second; in reality each request needs multiple accesses.
So maybe with one disk I can serve something like 20 requests a second. 20 requests a second is fine for a bank branch, but take SBI overall, with its core banking: how many requests do you think it gets per second? I am not very sure, but I think they have around 10,000 branches, and at each branch you can assume that one query is submitted every 10 seconds. So in 10 seconds you have 10,000 queries, which means about 1000 queries every second. That is way off: one disk is going to satisfy 20 per second, whereas SBI needs 1000 per second, so that load is not going to be handled by one disk. To handle it you need a very large number of disks; the CPUs are not the issue, the number of disks is the issue. So SBI would have a lot of disks. If you had flash, on the other hand, one flash device takes about one microsecond per access, so in one second it can satisfy on the order of a million requests; it could comfortably handle all of SBI's requests from one flash device. That is the kind of difference we see between flash and hard disk, and it is the reason flash is growing in usage. But coming back, if you have really high volume, say Google, the number of queries it gets per second is really, really large. At one time they stored most of their data in memory and occasionally went to disk for rarely used pieces of data, but after some time they realised that even the rarely used things are accessed so often that this would not work. So today, when you search on Google, the entire index is stored in memory. Now you will say, how can it be stored in the memory of one machine? Google indexes several billion pages, each page a few kilobytes; that is terabytes of data, you cannot store it on one machine. In fact Google's indices are spread over thousands of machines, and your query is actually farmed out to a group of machines, on the order of a hundred, which between them hold all the data in the Google index. Each of them operates locally, sends its results back, and the results are packaged and sent back to you, all in how much time? If you use Google you know it is a fraction of a second; it tells you it answered your query in something like 200 milliseconds. All of this has to be in memory to achieve that kind of speed, and Google handles I do not know how many tens of thousands of requests a second.
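To make the sizing arithmetic concrete, here is a small sketch using assumed figures in the spirit of the bank example above (these are illustrative numbers, not actual SBI figures):

```python
# Back-of-the-envelope sizing for a random-I/O workload.
import math

queries_per_second = 1_000        # assumed aggregate load
disk_accesses_per_query = 5       # each lookup touches several pages

disk_access_ms = 10               # ~10 ms per random disk access
flash_access_us = 1               # ~1 microsecond per random flash access

per_disk_qps = 1000 / (disk_access_ms * disk_accesses_per_query)        # 20 queries/s
per_flash_qps = 1_000_000 / (flash_access_us * disk_accesses_per_query) # 200,000 queries/s

print(math.ceil(queries_per_second / per_disk_qps))    # ~50 disks needed
print(math.ceil(queries_per_second / per_flash_qps))   # 1 flash device suffices
```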
So, that is as far as storage devices go. Let us move to the next issue, which is that storage devices tend to fail: disks do fail, and flash disks will also fail. What happens if there is a failure? If you lose data, somebody is going to be very unhappy; a bank cannot afford to lose data because a disk failed. Now, many of us have a PC at home with a single disk and we do not back it up, and disks do not fail that often, so we may not have experienced a failure; but within a few years we are very likely to have at least one, and if that disk held the only copy of all the photos you have taken, you are going to be very unhappy, all your lovely photos of your little one are gone. So you cannot afford that. One way is to back up data to an external disk kept somewhere else; that is needed anyway, but in an online application like a bank, if a disk fails we want the data back immediately. The only way to do that is to have multiple copies of the data on different disks, so that if one disk fails the data is still there on another disk. There are a few tricks which can reduce the overheads, but a RAID system basically does this: RAID stands for Redundant Arrays of Independent Disks, and its job is to maintain enough redundancy across multiple disks that if one of the disks fails you can recover the information from the remaining disks. That is the basic idea. Now, if you had an array of 100 disks, and the mean time to failure is 100,000 hours, about 11 years, for one disk, then the system as a whole would lose data every 1000 hours, about 41 days, if you did nothing; with RAID you can do much, much better. So you can get high reliability. RAID was also thought of as important for high speed and capacity, but these days that has become less relevant. The key idea is redundancy, storing extra information. The simplest scheme, which is called RAID 1, is mirroring: every disk has a mirror, meaning an identical copy (it is probably wrong to say mirror, because a mirror reverses left and right; here it is an exact copy). Every write is done on both disks; a read can go to either disk; and if one disk fails the other is still there and you can read from it. Of course, if you do not detect that a disk has failed and you keep living off the surviving disk, eventually that disk will also fail and you will lose data. So it is very important to detect the failure and replace the failed disk, and when you put a new disk in its place, all the data is on the other disk, so it is important to copy the data from the surviving disk onto the new one; once that copy is done, both disks are in sync again and writes go to both. All of this should be done by the RAID system. A really good RAID system will detect that a disk has failed, let you pull out that disk and plug in a new one online, and automatically rebuild the data on the new disk. Even better RAID systems do more clever things: they keep a few spare disks which are not normally used, and the moment a disk fails they automatically and transparently use a spare disk in place of the failed one and rebuild onto it. Eventually they tell the human operators: this disk has failed, we are now using a spare, at some point please take out the failed disk and put a new one in its place. When the human does that, the data is rebuilt onto the new disk and the spare goes back to being a spare. They do a number of tricks like this, so that even though there is a failure the system can run completely uninterrupted, which is very important for any critical system, and so everybody uses RAID these days. Now, what is the benefit? How much have we reduced the mean time to data loss by keeping a mirror? It depends critically on how fast you can repair, that is replace, the failed disk: if you take a long time to repair it, the chance of losing data because the second disk also fails goes up. So the mean time to data loss depends on the mean time to failure of one disk and on the mean time to repair. Now you can understand why it is important to have a spare disk: the repair then starts immediately. If you have to wait for a human to come and plug in a disk, the repair may take a day, or the human may not notice it and it may take a few days.
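The standard back-of-the-envelope formula for a mirrored pair, assuming failures are independent, is that data is lost only if the second disk fails while the first is being repaired, which gives a mean time to data loss of roughly MTTF squared divided by twice the MTTR. A quick sketch of the arithmetic, using the figures that come up next:

```python
# Mean time to data loss for a mirrored pair, assuming independent failures:
# MTTDL ~ MTTF**2 / (2 * MTTR).
mttf_hours = 100_000     # mean time to failure of one disk
mttr_hours = 10          # mean time to repair/replace a failed disk

mttdl_hours = mttf_hours ** 2 / (2 * mttr_hours)
print(mttdl_hours)                   # 5e8 hours
print(mttdl_hours / (24 * 365))      # ~57,000 years
```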
So, to give you a rough idea, if the mean time to failure is 100,000 hours and the mean time to repair is 10 hours, which is pretty good, then for a single mirrored pair of disks the mean time to data loss goes up from 100,000 hours, about 11 years, to roughly 57,000 years. So the chance that you will lose data, assuming disks fail independently, is exceedingly low; 57,000 years means the probability of data loss at any point in time is very small. That sounds like an enormous time, but surely there are other risks: your building may catch fire, you may have an earthquake and the building collapses, and then both disks fail together. So the limiting factor is no longer independent disk failure but correlated disk failure; you may have a lightning strike on the data centre, and a power spike comes and burns both disks. Correlated failures make the chance of data loss much worse than the 57,000-year calculation suggests, so in reality it is not so wonderful, and to deal with correlated failures people now keep an extra copy of the data somewhere else. SBI, for example, has a data centre in Navi Mumbai, but every piece of data written in Navi Mumbai is also replicated in, I believe, Chennai. So even if the Navi Mumbai centre for whatever reason goes for a toss and the data there is lost, it is still there in Chennai. More commonly, it is unlikely the Navi Mumbai centre will catch fire, they take care of that, and it is unlikely to collapse in an earthquake, it is built safely, but there may be power problems and so forth, in which case the Chennai centre can take over immediately. Coming back, RAID systems also provide the ability to write to multiple disks in parallel, so that if a single disk can give you 100 megabytes per second, you can get a gigabyte per second by spreading a write across 10 disks. There are different ways of doing this, and block-level striping is a common one; I am going to skip the details, partly because it is not that critical these days and you can read it up. What I want to focus on is that there are multiple levels of RAID, starting from RAID 0, which does no mirroring and has no redundancy: if you lose a disk you lose data, so obviously it is not used wherever data is critical. RAID 1 is mirroring, coupled with block striping. There is a lot of confusion in the terminology: many manufacturers will say that mirroring alone, with no striping, is RAID 1, and that mirroring plus striping is RAID 1+0, and so forth; do not worry too much about that. The bottom line for us is that RAID 1 means mirroring: every piece of data is written in two places. Then there are levels called RAID 2, 3 and 4, which are largely irrelevant at this point; they were defined for pedagogical rather than industry reasons. RAID level 5 is the other important one used in practice. What is RAID level 5? RAID 5 says: if you keep a mirror of each disk you are doubling your storage cost, and there is a way to reduce the cost by not keeping a complete mirror but instead keeping, for each group of blocks, a parity block. Here is the situation: there are 5 disks, call them disks 1 to 5. The first 4 blocks of data are stored on disks 2, 3, 4 and 5, while a parity block computed from these 4 blocks is stored on disk 1. The parity is computed bitwise; if you do not know what parity is, it is the XOR function: each bit position is XORed across the blocks.
So, take the first bit of each of those 4 blocks, compute the XOR two at a time, and the final XOR is stored on disk 1; here P0 is the parity block for these 4 blocks. Now, what is the use of that parity block? Let us say that disk 2 fails. The data that was on disk 2 can actually be recovered by a very simple process: I take the XOR of P0 and blocks 1, 2 and 3, the surviving blocks. The process is very similar to how the parity was built: initially I computed P0 as the XOR of blocks 0, 1, 2 and 3; now, if I need to recover block 0, I compute the XOR of P0 with blocks 1, 2 and 3, and that gives me block 0 back. So I can recompute the data which used to be there and write it to a new disk kept in place of the failed one, or, if the new disk is not there yet and I just need to read that data, I can compute it on the fly. If I am writing, I obviously cannot write block 0 itself, but I can update the parity to reflect the write. So I can continue reading and writing even though the disk containing block 0 is dead, as if it were still alive, without losing data. Eventually I replace the disk, recompute its contents, and the system is back in its original, non-degraded mode. Now, the next trick in RAID 5 is that for the next set of blocks I store the parity not on disk 1 but on disk 2. There are 4 more blocks, shown here as 4, 5, 6 and 7, and their parity is on disk 2; for the next 4 the parity is on disk 3, then 4, then 5, and then the pattern repeats. So the parity blocks are distributed. The reason for distributing them is that every time you write block 0, 1, 2 or 3 you also have to write P0, so the write load on the parity block is 4 times the write load on an individual data block, assuming random writes. On the other hand, when you read data you never read the parity block, you only read blocks 0, 1, 2 and 3, unless there has been a failure. So if you kept all the parity blocks on one separate disk, the write load on that disk would be very high and its read load very low, whereas RAID 5 distributes the parity blocks across the disks, so the load is more uniform, which is better. That is RAID level 5. What is the benefit of RAID level 5? With 4 data blocks and 1 parity block the overhead is 25 percent: instead of 4 disks I have to buy 5, and I am still protected from failure. What is the chance of losing data? It is much lower, because even if one disk fails data is not lost; only if 2 disks fail do you lose data. You can work out the chance of data loss; it is a little higher for RAID 5 than for RAID 1, maybe not 57,000 years but still something like 10,000 years rather than 10, so it is more than good enough. In terms of protecting data it is fine; the drawback is that when you write data, RAID 5 has a lot of overhead if you are doing random writes. So if you are storing video data, RAID 5 is the way to go; if you are storing database data which is randomly updated, RAID 1 is the way to go.
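Here is a minimal sketch of the parity idea with 4 toy data blocks; the block contents and sizes are invented, and a real array works on large fixed-size blocks.

```python
# RAID-5 style parity in miniature: the parity block is the bytewise XOR of the
# data blocks, and any one lost block can be rebuilt by XOR-ing the survivors.

def xor_blocks(*blocks: bytes) -> bytes:
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

d0, d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC", b"DDDD"   # data blocks on disks 2..5
p0 = xor_blocks(d0, d1, d2, d3)                        # parity block P0 on disk 1

# The disk holding d0 fails: rebuild it from the parity and the surviving blocks.
recovered = xor_blocks(p0, d1, d2, d3)
assert recovered == d0
print("recovered block:", recovered)
```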
This slide talks about the choice of RAID level; it just says what I told you: for cold, large data RAID 5 is preferable, while for hot, randomly updated data RAID 1 is the choice today. Again, there are some issues in building RAID systems. There is something called software RAID, where there is no hardware support at all and everything is done in the operating system; in contrast, there is hardware RAID, where a controller for the RAID system does a lot of the work, and there are some benefits to hardware RAID. I think I am going to skip most of this, but let me mention one thing briefly. With software RAID, one of the correlated failure modes that can happen is the following: if power fails while I am doing a write, and suppose I am writing both mirror copies of a block in parallel, then I may have written half of this copy and half of that copy when the power goes; both copies are corrupted because they were half-written, and that is a big risk. So what software RAID, or any other kind of RAID, does is to first write copy 1 of the block, and write copy 2 only after copy 1 has finished writing. Then, even if power fails, what is the worst that can happen? One possibility is that it failed while writing copy 1, before we even started writing copy 2; copy 1 is damaged, and we can recover it from copy 2. How do we detect the damage? There are checksums and other mechanisms to detect a corrupted block. The other possibility is that the failure happened after writing copy 1, either before or while writing copy 2. If it happened before writing copy 2, we now have two consistent, uncorrupted copies of the block which differ: one is old and one is new. If, after recovering from the power failure, I do not detect this situation, then depending on which copy I happen to read I will get old or new data; that is a bad situation. The better situation is if the power failed while writing copy 2: then copy 2 is corrupted, and when I read it I will detect the corruption; the problem again is only if I fail to detect it. So what is important is that when I recover from the power failure I detect any such half-done writes and clean them up. How do I clean up? That part is actually easy: if the two copies differ, I copy from copy 1 over copy 2; if one of the copies is corrupted, I copy over it from the other copy. That is easy, but importantly, I have to detect it when coming up from the failure, and the problem with a lot of software RAID implementations is that they do not detect it; they let you go ahead and get into trouble. Hardware RAID implementations, on the other hand, keep track of which writes were in progress in a piece of battery-backed-up RAM, and when the system recovers they say: these writes were in progress, they may or may not have completed, let me clean up those blocks and come to a consistent state before the subsystem starts operating. That is why hardware RAID is a good idea.
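A small sketch of that recovery logic, using an invented in-memory representation and CRC checksums to stand in for whatever corruption detection a real system uses:

```python
# Recovery for a mirrored write interrupted by a power failure: copies are
# written one after the other, each carries a checksum, and on restart the pair
# is brought back in sync. Toy in-memory version of the idea only.
import zlib

def with_checksum(data: bytes) -> bytes:
    return data + zlib.crc32(data).to_bytes(4, "little")

def is_intact(block: bytes) -> bool:
    data, stored = block[:-4], block[-4:]
    return zlib.crc32(data).to_bytes(4, "little") == stored

def recover_pair(copy1: bytes, copy2: bytes) -> bytes:
    # Case 1: one copy is torn/corrupted -> take the good copy and re-mirror it.
    if not is_intact(copy1):
        return copy2
    if not is_intact(copy2):
        return copy1
    # Case 2: both intact but possibly different (failure between the two
    # writes) -> pick one deterministically, here copy 1, and copy it over.
    return copy1

old, new = with_checksum(b"old value"), with_checksum(b"new value")
torn = new[:5] + b"\x00" * (len(new) - 5)          # a half-written block

print(recover_pair(torn, old) == old)               # True: fall back to copy 2
print(recover_pair(new, old) == new)                # True: copy 1 wins, re-mirrored
```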
So, that finishes up part one of this chapter; I have split it into two parts because A-VIEW could not load very large files. Let us move to part two, which deals with storage of records. With that I have wrapped up the physical storage devices, and some of this material, RAID and so on, you may also see in operating systems courses; depending on how it is done in your university you may choose to skip RAID, or skip this part completely if all of it is covered in an operating systems course. We kept it in the book because different courses may or may not cover it. But the second part is core to databases, and it deals with file organization: how do you store records in files? If you have fixed-length records it is not very difficult, and most early databases did this. Here is an example where all records are of the same size: even if a name is variable length, I pad the name with spaces up to some upper limit on its size; if you remember, when I declare varchar(20), 20 is the upper limit, so I just use 20 bytes regardless. Every record is the same size, so I write the records one behind the other in a file. Say I know the record size is 100 bytes in total; if I want to find the record in slot 50, I go to 50 times 100, that is offset 5000 in the file, and read 100 bytes from there, and that gives me the record. That is one way of storing records, and most early databases did something like this. There are a few problems. If I want to delete a record, what happens? One way is to simply mark the record as deleted and leave the space unused, but if I do a lot of deletes and inserts that space goes to waste. Another option is to shift all the following records in the file up by one, which is crazy; you may do a lot of I/O just to shift the records. A third option, if you do not care about the order of records, is to copy the last record of the file into the deleted record's slot and decrease the size of the file by one record. And yet another alternative is to keep a linked list of free record slots and reuse them. These are shown in the figures. In the first option I compacted by moving everything up by one; if you notice, the order of the IDs is preserved when I compact, but it is very slow. The second alternative took the last record, record 11, moved it into the place of record 3 which was deleted, and decreased the number of records by one. And finally, the free-list approach has a header which stores the address of the first free record slot, or null if there is none; the first free slot stores the address of the next free slot, and so on, a traditional linked list. When I delete, I add the slot to the linked list; when I insert, I take the first element of the linked list, use that slot for the new record, and update the list by removing it. These are standard data-structure operations and can be used easily.
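Here is a toy in-memory sketch of the free-list idea; on disk the slot numbers would be byte offsets and the list head would live in the file header.

```python
# Fixed-length record file with a free list: deleted slots are chained together
# through a header, and inserts reuse the first free slot.

class FixedLengthFile:
    def __init__(self):
        self.records = []        # slot -> record, or None for a freed slot
        self.free_head = None    # first free slot, or None
        self.next_free = {}      # free slot -> next free slot

    def delete(self, slot: int) -> None:
        self.records[slot] = None
        self.next_free[slot] = self.free_head   # push the slot onto the free list
        self.free_head = slot

    def insert(self, record) -> int:
        if self.free_head is None:
            self.records.append(record)         # no free slot: append at the end
            return len(self.records) - 1
        slot = self.free_head                   # reuse the first free slot
        self.free_head = self.next_free.pop(slot)
        self.records[slot] = record
        return slot

f = FixedLengthFile()
for rid in ("10101", "12121", "15151"):
    f.insert(("instructor", rid))
f.delete(1)
print(f.insert(("instructor", "22222")))   # 1: the freed slot is reused
```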
All this worked well for many years, and it is what was used until people got fed up with the fixed-length limitation. They said: why are you telling us that names should be fixed length? I may have a name which is 100 characters; if I declare it as fixed length 100 you will accept it, but now every record, even one with a 10-character name, stores 90 bytes of junk. That is very silly, and we do not want to waste space like this; why not store variable-length records and use only as many bytes as each record needs? I may allow names to go up to 100 characters, but if the name is 8 characters I will not use the other 92 bytes. So current-generation database systems all support variable-length records. How do they represent such a record, and how do they represent variable-length strings? There is some detail here which I will not get into too deeply, because it is not very critical to understanding what else we cover, but what is important to note is that this structure does two things. It can say whether a particular attribute is null, by keeping a small bitmap which records which of the attribute values are null. If the record has a fixed-length field, like an integer salary, it stores the value directly in a fixed position. If it has a variable-length field, say name or department name, it stores a small entry at the beginning of the record which says: this value starts at offset 26 and is 10 characters long; you then go to offset 26 of the record and read 10 characters. So the length and position information of every variable-length field is represented at the beginning of the record. If that was too fast, do not worry, read it offline. And if you have multiple such variable-length records, how do you store them in a file? The trick is that the file is divided into what are called blocks or pages. Each page can hold multiple variable-length records, and these are organized by packing them together at the end of the page, while at the beginning of the page there is a header which says: the first record in the page is here and it is of length 20, the second record of the page is here and it is of length 10, and so forth. So the header tells us where all the records in the page are, and there is one more piece of information which marks where the header ends and where the packed records begin; everything in between is free space. If I delete a record, I actually move the other records to compact the space, but compaction is now relatively cheap, because a page is of fixed size, maybe 16 kilobytes, and moving data within 16 kilobytes to compact it is cheap, whereas moving data around in a 1 gigabyte file to compact it would be ridiculously expensive. This commonly used structure is called the slotted page structure; most databases use this kind of structure internally to store variable-length records.
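Here is a minimal in-memory sketch of a slotted page; the page size is invented, and deletion with compaction is omitted to keep it short.

```python
# A miniature slotted page: a header of (offset, length) slot entries grows
# from the front of the page, record bytes are packed from the back.

PAGE_SIZE = 64

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []                  # slot number -> (offset, length)
        self.free_end = PAGE_SIZE        # records are packed at the end of the page

    def insert(self, record: bytes) -> int:
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1       # the slot number identifies the record

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

page = SlottedPage()
s0 = page.insert(b"Srinivasan|Comp.Sci.")
s1 = page.insert(b"Wu|Finance")
print(page.read(s0), page.read(s1))
```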
And finally, how do you decide which record goes to which page? The simplest approach is: every time a record is inserted, put it anywhere in the file; you can put it at the end of the file in the last block, or if some block has free space put it there. That is called a heap organization. In contrast, some applications want the data records to be kept sorted; in a sequential file the file is maintained in sorted order, and an insert goes to the correct position in the file, moving records around and inserting pages as required. And finally there is hashing, where the location is computed from a hash value and the record is stored accordingly. I do not have time to get into the details; all of these are used in practice, and we will see a bit more about how to implement sequential files later when we look at indexing. So far we have said how to store fields in a record, how to store records in a page, and which records go to which pages. Many systems create a file and store all records of a particular relation in that one file, and then they have multiple files. But if I ask for a record from a relation, the system should know which of these files has the data for that relation, what the attributes of that relation are, how many fields it has, and what the sizes of the fields are; all of this metadata is required to access the data properly. Therefore the database system also has something called a data dictionary, also called a system catalog, which stores all such metadata. What metadata? First of all, what relations exist; the names, types and lengths of the attributes of each relation (length here typically meaning the maximum length); the names and definitions of the views which are stored; and the integrity constraints: primary keys, foreign keys, check constraints, and so on. All that is for relations. Then there is information about the users and their passwords. Then there is statistical data, such as approximately how many records there are in each relation, which is used for query optimization. Then there is physical file organization information: how is this particular relation stored, is it a heap, is it sequential, which file is it in, and so forth. And then there is information about indices. All of this is part of the data dictionary. You can actually represent the data dictionary itself as a set of relations, and in fact pretty much every database allows you to view the data dictionary through a set of relations; the exact set of relation names varies by database, but you can try finding this out for yourself and do select star on those relations to see the information. This is a toy catalog: there is one relation called relation metadata, one called attribute metadata, one called user metadata, one called index metadata, and one called view metadata; in reality you need some more, for authorization and so forth, which we have not shown here. For example, relation metadata has the relation name, number of attributes, storage organization and where the relation is on disk, that is the first page of that relation; attribute metadata has the relation name, attribute name, type, position and length. So all of this is a set of relations which is itself stored in the database, and this catalog information is usually read into memory when the database starts up, before the remaining relations are accessed, because it is accessed very frequently and we do not want to keep going back to disk for it. The last topic for this section: indexing is the next chapter, but we have run out of time today, so I am going to cover indexing tomorrow. The last topic here, then, is how you access data. We said we have records in blocks, blocks in files, and metadata to tell us where the file is. Now if I want to read a particular record, I can go through all of this, get the block, read that one record and then throw the block away, but this is very inefficient. It is true that when I read one block, if I am doing random I/O, maybe I will not read anything else from that block, but usually I will read other records from it; and even if I do not, somebody else coming along soon after me may read another record from that block. So, given that main memory is abundant these days, you really want to cache whatever blocks you have read in main memory, so that next time you do not have to go to disk if the block is already in memory. Now, every operating system has such a cache: there is a file system buffer, there is virtual memory and so forth. Database systems could use the file system buffer, but there are some good reasons, related to recovery, for them to maintain their own buffer rather than use the operating system's. So every database system has its own buffer space and its own buffer manager; data is stored on disk in units of blocks and read into memory in units of blocks, and the buffer manager basically manages the set of blocks currently cached in memory. Some of them are exactly the same as what is on disk, some may have been updated and eventually have to be written back to disk. So the way any piece of information is read is: first you find out which block you need to get it from, then you tell the buffer manager, please read this block, and it will bring the block into memory if it is not already there.
If it is already there it will say, yes, I have it, you can go ahead, and here is where it is in memory. Now, before you read the data from memory, you tell the buffer manager, I am going to read data from this block, please make sure you do not throw it out; this is called pinning the block in memory. So the database code tells the buffer manager to keep the block in memory, then it reads the data from the block, writes to it as required, and when it is done it tells the buffer manager, I am done with this, I am unpinning it, meaning that from this point, if you need the space, you can replace that page. If the block was modified it may have to be written back to disk; if it was only read, it is just discarded to make space for some other page. So all this functionality is provided by the buffer manager: fetching a block, seeing whether a block is already there, and pinning and unpinning. There is one more thing it needs to do. If I keep going like this, read one block, pin it, read it, unpin it, read another block, pin it, read the data, unpin it, and so on, eventually the buffer is full, and when I request one more block something has to be evicted from the buffer. The question is which of the many blocks in the buffer should be evicted. This is a standard operating systems problem, and operating systems typically use a mechanism like LRU, least recently used, for it. It turns out that in the context of a database system LRU is not necessarily a good idea. Why is that? In the context of an operating system, the system has no idea what you are doing; your program could be doing anything. In a database, on the other hand, the database system itself is running the code: it is running a query, the query processor is executing. The query processor knows what it is going to need in the near future, and it also knows that something it has just used may not be needed any more and can be evicted immediately if required. So it is very useful for the buffer manager to interact with the query processor, and the query processor can give hints, saying: I am reading this file sequentially, do not bother to keep the pages around in LRU fashion because I am not going to use them again; get a block, let me process it, and then you can discard it immediately if required. This corresponds to a strategy where the most recently used block is discarded. This hint can be given by the query processor to the buffer manager, and as a result you can get much better buffer performance. So most database systems have a very tight coupling between the buffer manager and the query processor to improve buffer management, which can do much better than an operating system can. This slide summarizes the things I just told you: one was the notion of pinning a block, another was toss-immediate, which corresponds roughly to discarding the most recently used block rather than the least recently used one. One other thing which buffer managers do is, occasionally, to make sure a transaction has completed and its data is safe, you have to tell the buffer manager, at this point write this block to disk; that is called forced output, and it is one more feature they have to support. With that we are done with chapter 10.
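As a recap of the pin/unpin protocol described above, here is a minimal sketch with an invented toy interface; a real buffer manager also tracks dirty blocks, writes them back to disk, and supports forced output.

```python
# Toy buffer manager: pin/unpin, LRU eviction of unpinned blocks, and a
# "toss-immediate" hint of the kind a query processor gives during a scan.
from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.frames = OrderedDict()   # block_id -> (data, pin_count), LRU first

    def pin(self, block_id):
        if block_id not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[block_id] = (self._read_from_disk(block_id), 0)
        data, pins = self.frames.pop(block_id)
        self.frames[block_id] = (data, pins + 1)      # move to most-recently-used end
        return data

    def unpin(self, block_id, toss_immediate=False):
        data, pins = self.frames[block_id]
        self.frames[block_id] = (data, pins - 1)
        if toss_immediate and pins - 1 == 0:
            del self.frames[block_id]                 # hint: this block is not needed again

    def _evict(self):
        for block_id, (data, pins) in self.frames.items():   # scan in LRU order
            if pins == 0:
                del self.frames[block_id]
                return
        raise RuntimeError("all blocks pinned")

    def _read_from_disk(self, block_id):
        return f"<contents of block {block_id}>"      # stand-in for a real disk read

bm = BufferManager(capacity=2)
bm.pin("A"); bm.unpin("A")
bm.pin("B"); bm.unpin("B", toss_immediate=True)       # sequential-scan style hint
bm.pin("C")                                           # "A" stays cached, "B" was tossed
print(list(bm.frames))                                # ['A', 'C']
```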