Good morning to everybody out there, I see people are still trickling in. So, what we will do is, as usual, we will start with the questions. Lurie Institute, Varangal, please go ahead. What are FTP servlets? Servlets can support many different protocols, and FTP could potentially be one of them. We use HttpServlet as the class from which we inherit, and that particular servlet implements the HTTP protocol. You can have other servlets for other protocols; FTP would be one of them. These protocols differ in exactly what kind of messages are exchanged, what exactly happens when a message is received, and so forth. Sir, next question sir. Yeah. Which are the databases that will support type 2 drivers? Can you give an example of that? You are talking about type 2 JDBC drivers — type 1, type 2. This is not really an issue which you need to worry about; unless you are getting deep into the internals of how things are implemented, you do not care whether it is a type 1 or a type 2 driver. So, you are best off not worrying about which database is supported. But since you asked: when JDBC was first introduced, ODBC was already very widely supported, and the initial drivers for JDBC were actually bridges, meaning that they would talk the ODBC protocol with the back end. You could have a database with an ODBC driver and then a JDBC driver which would connect to the ODBC service. Later on, there were native JDBC drivers which avoid this overhead of going via ODBC; the driver would be part of the database system and could support JDBC directly. I do not remember which is type 1 and which is type 2. If I am not mistaken, type 1 is the old bridge and type 2 is the direct one, but I do not remember; it has been a while, partly because it really does not matter these days. We are with KIT College, please go ahead. Sir, my question is about ER diagrams. The hardest part of an ER diagram is finding out the entities; that is where most of the students get confused.
So, can you suggest any mechanism which will make it easier for them to find out which are entities and which are relationships, and, when they are forming the schema diagram, what can be converted into tables and what can be converted into columns? That is a good question. How do you decide what is an entity and what is a relationship? I do not know if there is a science to it. It is more of an art, but there are some guiding principles. So, when you want to model a particular enterprise, think of what are the things that you want to model at an abstract level. Let us take an example like Facebook. What are the things that Facebook stores? It stores people. It stores postings made by people. It stores photographs uploaded by people. I believe now it has not just people, but also companies, which are Facebook users. Then it has likes — who likes what — and so on. So, you can first put down all of these things in just text, English or whatever language you choose. Just note down whatever you want to model without starting on the ER modeling phase. First write down what all are the things that you want to model and what all are the interfaces which you have. If you see the sample format which we have given for the project, it also starts off that way. It does not jump straight into ER design, because if you say, here is an enterprise, create an ER design, you are at a loss as to what is to be done. So, do not get into ER design; work at an even higher level first. Describe the things that you want to model in English. At this point, do not worry about whether they are entities or relationships. You will get a good handle on what all you want to model. By the time you are done with this stage, it will come more naturally to you whether something should be an entity or a relationship. So, first of all, entities can be recognized as things that have an existence of their own, independent of others.
So, in the Facebook example, a user — whether it is a person or a company or whoever — would be an entity. A photo would also be an entity. A post made by a user would probably be modeled as an entity also. What about someone liking something? Should that be an entity? Should that be a relationship? That is something which should be reasonably clear in this case. A person is commenting on a post; it is a connection between the person and the post, so that is probably a relationship. What about friendship? A person is a friend with somebody else. That is probably a relationship. What about a friend request? When somebody sends out a request to somebody else saying, I want to be friends with you, is that a relationship or is that an entity? It is not obvious. You could model it perhaps as a relationship — a want-to-be-friends relationship — and when the other person replies, maybe it becomes a friendship or maybe it gets deleted. Or you could model it as an entity: a request, which is originated by someone and related to someone else, the person you are requesting. It may have other attributes; that would also be ok. At certain times, it is not absolutely clear whether something should be an entity or a relationship, and in many such cases, I would say, whichever choice you make, when you convert it to relations you will probably land up at the same place anyway. So, sometimes it does not matter too much; you can do it either way. Does that answer your question? Yeah. The next thing is, when you are converting that into tables: we create the tables for each and every entity, that is for sure, while designing the database. And we include some of the relationships as tables, and some of these relationships are not considered when we are creating the tables.
So, that is another confusing thing for students: to identify which relationships have to be converted into tables and which are not. Actually, I would disagree with saying that some relationships are not converted to tables. What happens is that they get folded into other tables. Every relationship has to be represented in the relational model also; it is not optional. The only question is, do you create a separate table for it, or do you just fold it into one of the other tables, so that it becomes an attribute of that table? The algorithm for converting an ER diagram to tables, I think, makes this reasonably clear. If it is a many-to-one relationship, then you have the option of folding it into the entity which is on the many side. And if it is not total participation, then you want to consider the fact that you will have null values. Then that choice could be made in different ways. If it is something which you expect many people to be involved in, if not all, it might be ok to have a few null values. If it is a relationship which you expect very few people to be involved in, then you might not make it an attribute; you might just keep it as a separate relation. So, there is nothing wrong in keeping it as a separate relation, but you have the option of folding it in. It is partly an efficiency issue and partly an issue of how many different relations you want to create. But it would still be correct to create a separate relation for it; it is not incorrect. Does that answer your question? Yeah, yeah. Thank you, sir. Thank you. We will take one last question and then move on to today's topic. We are with Srinagji Institute of Technology, Rajasthan. What is the difference between a web server and an application server? So, I think I had answered this question yesterday. The difference between them is kind of blurred these days.
But originally, a web server's whole goal was to serve files and to execute certain files which are executable: files in a particular directory are treated as things that should be executed, and the output should be returned to the user. That was originally all that a web server did. Application servers, on the other hand, run programs in Java or PHP or whatever, and their primary goal is to run these things; serving files which are stored in a file system is secondary for them. But things have changed. If you see web servers today — Apache, for instance — they have modules which let them run PHP as part of the web server. If you see application servers like Tomcat, they have pretty good support for serving static files over HTTP and things related to that. So, the boundary is blurred these days. But at the core, you still have HTTP web servers at the entry point to any organization, and typically their goal is to route requests to wherever they should go. So, you go to Google, for example. You are just going to google.com or gmail.com, but there is a huge number of application servers behind. Usually they have a number of front ends, so you might go to any one of them, and the front end will actually re-route your request to a suitable back end. In fact, there are many front-end web servers, and then at the back end there are many application server instances, and there is some routing going on. So, here web servers are used primarily as plain web servers. Most places have web servers because the primary goal is to serve files; application servers are a little less common because they are providing services as opposed to just providing information. Okay, I think we will stop the questions here and move on to today's topic. So, the first topic for today is storage and file structures. After that, we will move on to indexing, and hopefully we will have a little bit of time at the end for query processing.
So, our first goal is to understand the physical issues with storage, because they have a huge influence on how data is stored at the layers above. If you look back in history, the original storage medium was tapes. Data processing started with tapes as a storage medium, and the whole area of data processing focused on how to get data from one tape, get updates from another, and merge them to get a third tape; how to get data on multiple tapes, sort them, and merge them into one tape; and so forth. It was all focused on tapes. When hard disks came in, data processing changed its character, and the systems that were built very strongly reflect that technology, the hard disk. The physical characteristics of hard disks have had a huge impact on the design of database systems. Today, there is another revolution underway, where for many, many applications, flash memory is more than big enough to store all data, and hard disk is needed only for other kinds of data such as video, audio, large text files, images and big data — data collected by big web services across the world and then maybe provided to other people who make use of it. But if you are an organization and you are just managing the data of your students and employees and so forth, it handily fits in flash. In fact, today it has come to the point where it not only handily fits in flash, but it handily fits in main memory. So, there is a lot of emphasis these days on database systems that are optimized for main memory. We do not have time to get into it, but we should just be aware that the underlying storage plays a big role. So, to understand the characteristics of storage devices: we have cache and main memory at the top, which are called primary storage, and they are volatile — when you switch off the computer, they are gone. The next layers down, all the way to the bottom, are non-volatile, in that if you switch off power, they still retain data.
Of these, the highest level is flash, which is not quite as fast as main memory for reads and quite a lot slower for writes, and below that is magnetic disk, which is a lot slower than main memory and flash for both reads and writes. What do I mean by a lot slower? I mean that if you initiate a read or a write, it takes a while for it to complete; but if you are reading a large amount of contiguous data, it turns out that magnetic disk is not so bad. In fact, it can be as fast as or faster than flash in some cases. So, these two are called secondary storage; I will come back to the characteristics of disk and flash. Tertiary storage is much slower, and it consists of optical disks — writable DVDs, Blu-ray or whatever. The largest capacities are still in the form of magnetic tape, but the drawback to magnetic tape in particular is that if you want to access a particular piece of data, it may be somewhere in the middle of a tape, and it may take minutes just to wind the tape to the right point. If the tape is not loaded, it may take even more time to load the tape. So, those are called tertiary, and they are used primarily for backup. You do not use them for live data these days; you use them as a backup, so that in case there is a problem with live data, you can restore it from backup. To understand the physical characteristics of this, here is a schematic diagram. It is not supposed to show how disks really look — they do not look like this at all — but it shows the basic things schematically. The basic thing is that disks have these platters. Each of these is called a platter, and a hard disk drive usually has several platters. At one time, there used to be 10-12 platters; these days, it has come down drastically to two or three platters. Now, the whole thing is on a spindle which rotates — in this diagram it is shown rotating clockwise; it does not matter which way.
Then there is a read-write head which can read and write data on this magnetic surface, and that is attached to an arm assembly which can swing. This diagram does not show it very well, but the arm basically swings across: it can position the read-write head close to the center of the disk, or it can swing out and position it closer to the edge of the disk. Now, the disk is spinning, so the parts of the platter that come under the read-write head, once the arm is stable, form a track. The way a disk is organized, there are multiple tracks, and the read-write head is supposed to come to rest exactly over a track, not in between tracks. When it is exactly over a track, it can read all the data that has been written on that track; it can write onto that track also. Now, the information in a track is usually broken into small pieces called sectors. Why sectors? A sector basically has some identifying information, and then it has a checksum. The goal of this checksum is to make sure that when you write data and read it back, if there is an error, it is detected. The checksum is basically some function of the rest of the bytes in that sector, and if there was a problem writing those bytes, then with very high probability the checksum will not match the contents of the sector, and you know it was not written properly. What actually happens is, as soon as a write is done, the system waits for the disk to spin around; the sector you just wrote comes back under the head and is read again. So, whatever was written is immediately read back and compared. If what was written does not match what was just read, then there is a problem and the disk system tries to write it again. What if it fails again?
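The store-a-checksum-and-verify-on-read-back idea can be sketched in a few lines of Python. This is purely illustrative — real disks use hardware error-correcting codes built into the sector format, not a CRC32 appended to the data — but the logic is the same: store a function of the bytes alongside them, and recompute it when reading back.

```python
import zlib

def write_sector(payload: bytes) -> bytes:
    """Append a CRC32 checksum to the payload, loosely mimicking what a
    disk controller stores with each sector (illustrative, not a real format)."""
    crc = zlib.crc32(payload).to_bytes(4, "big")
    return payload + crc

def verify_sector(sector: bytes) -> bool:
    """Recompute the checksum over the data bytes and compare with the
    stored one; a mismatch signals a bad write or a media error."""
    payload, stored = sector[:-4], sector[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored

sector = write_sector(b"some record data")
assert verify_sector(sector)             # clean read-back passes
corrupted = b"X" + sector[1:]            # damage the first byte
assert not verify_sector(corrupted)      # the checksum catches it
```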
Usually there are mechanisms to remap bad sectors to some other place on the disk, and the system can go and write the data there, assuming that this particular piece of the disk surface has been spoiled for some reason; and this does happen periodically. So, all disks have set aside a few sectors for remapping bad sectors. Again, some terminology: a track is a complete circle here; a sector is a piece of it. A cylinder is all the tracks which are one above the other on the different platters, because they can all be read without moving the arm — well, not at the same time; the disk can choose to read from this read-write head or the next one and so forth, but you do not have to move the arm assembly. So, those are the physical characteristics. Now, there are some characteristics with regard to the time it takes to do various actions. The access time is the time from when a read or write request is issued to the disk to the time when the data transfer actually begins, and it has two components. The first is the seek time, which is the time it takes to reposition the arm over the correct track. The arm is at some track; it has to move to an inner track or an outer track depending on the request. How long does this take? It depends very much on where the arm is with respect to the track you are asking for. If it is the same track, it is zero; if it is somewhere else, it takes some time. So, what you want is an average: assuming the arm is at a random position, what is the average time it takes to reach where you want? Now, it turns out that the outer tracks of a disk actually have more sectors than the inner tracks, because of the physical characteristics of how many bits you can store per linear centimeter of magnetic material.
So, the disk arm tends to be positioned towards the outer tracks — it is not uniform — and in fact, some tricks to reduce the seek time include not even using the inner tracks, just using the few outer tracks, so that the arm movement is very restricted. The average seek time is anywhere between 4 to 10 milliseconds for typical disks. Then there is the rotational latency, which is the time it takes for the sector to rotate and come under the head. So, the arm has moved, but at this point the sector may have just gone past, which means it will be nearly a full rotation before it comes under the head; or, if the sector is just about to arrive, it will come underneath very soon. So, again there is an average, which is half of the worst-case rotational time, and it is 4 to 11 milliseconds on typical disks. If you take the combination of these, on a medium, typical desktop disk you might get of the order of 10 to 15 milliseconds; on higher-end disks you would get smaller values, as low as maybe 5 milliseconds on average. Once you have got to the right point, you can start reading data, and now the data flows as fast as the rotational speed of the disk and the density with which data is stored allow; typical values these days are 25 to 100 megabytes per second — more for outer tracks, lower for inner tracks. Now, there is one other important measure for disks, which is how long it is going to take before the disk fails. Disks these days are pretty good, but all of you, I am sure, have seen at least one disk failure. The typical life of a disk these days is something like 3 to 5 years; it has actually been improving slowly, it used to be a lot worse. The thing is that when a disk is very new, it has a higher probability of failure. Manufacturers have actually taken this into account, and these days, before they even ship the disk to you, they have run it for a fair amount of time in their facility.
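A back-of-the-envelope model makes the cost of these access times concrete, especially the gap between sequential and random reads. The figures below — 8 ms average seek, 5 ms average rotational latency, 50 MB/s transfer, 4 KB blocks — are illustrative values picked from the ranges quoted above, and the model deliberately ignores real-world effects such as track-to-track seeks and caching.

```python
# Back-of-the-envelope disk timing using representative figures from
# the lecture's quoted ranges (assumed values, not a specific drive).
SEEK_MS, ROT_MS, MB_PER_S = 8.0, 5.0, 50.0
BLOCK_KB = 4

def read_time_ms(n_blocks: int, sequential: bool) -> float:
    transfer_ms = n_blocks * BLOCK_KB / 1024 / MB_PER_S * 1000
    if sequential:
        # one seek + one rotational wait, then contiguous transfer
        return SEEK_MS + ROT_MS + transfer_ms
    # every block pays the full access time again
    return n_blocks * (SEEK_MS + ROT_MS) + transfer_ms

print(read_time_ms(1000, sequential=True))   # ~91 ms
print(read_time_ms(1000, sequential=False))  # ~13 seconds
```

The transfer itself takes under 80 ms either way; what dominates the random case is paying the 13 ms access time a thousand times over.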
So, what you get is a disk which has already been run in: any manufacturing defect shows up while they are testing it over a period of time, before they even pack it and ship it to you. The other thing is, as a disk grows older — when it is, say, 5 years old — the probability of failure increases a lot more; whereas when the disk is new and past its burn-in period, the probability of failure is much less. So, what manufacturers quote is the mean time to failure for a new disk which they ship you, and that is basically inversely related to the probability that the disk will fail. Supposing you have a mean time to failure for a new disk of, let us say, 1 million hours. What does that mean? It means if you buy 1 million of these disks, you can expect one of them to fail in the next hour. If you buy a thousand of these disks, you can expect one of them to fail in the next thousand hours. If you buy a hundred of these disks, you can expect one of them to fail in the next ten thousand hours. But do not read too much into it: you cannot buy one disk and say, I expect it to last for a million hours; it may not. As it grows older, its mean time to failure starts dropping, and after some age the probability that it will fail becomes pretty high. So, that was hard disk. We will come back to this issue of what to do if a disk fails; there is a technology called RAID which deals with it. Now let us move to flash disks, which are becoming increasingly important. There are actually two kinds of flash: NOR and NAND. You do not hear much about NOR flash these days, although that was the more common kind in earlier days. NAND flash is the cheaper kind of flash which all of us use extensively; all our pen drives are based on NAND flash. Interestingly, the way they made NAND flash cheaper was to remove byte addressability. NOR flash was basically byte addressable, much like memory: you can ask for a specific byte and it will return just that byte or a few bytes around it.
NAND flash, on the other hand, actually requires you to read a whole page — maybe 512 bytes to a few kilobytes worth of data — at one go. You cannot access just one byte. It is like a train: you have to be at the head of the train, and you can only read the data as the train goes past you. You cannot go to any random compartment in the train. One page is like a train: you have to pull out the whole train from the flash media and then read the bytes that you want from the middle of it. So, that is kind of like hard disk, because there also you have to read a sector at a minimum. In fact, there is a notion of a page, which is a higher-level concept: the database system typically does not deal with individual sectors — they are too small — it deals with pages. Operating systems do the same thing; they have their own notion of page size. A database can make its own call on how big a page should be; it is usually multiple sectors. Now, if you take a single flash memory chip, its read and write speeds are not all that fast — maybe a few megabytes per second, actually much slower than hard disk. But these days, what people do is put together a number of these flash chips into one solid-state disk. A solid-state disk offers an interface which is similar to hard disk. I did not talk about the interfaces: there is SATA, which is very common today, and then there is SCSI and a few other interfaces. So, a solid-state disk looks just like a magnetic disk. One difference is that when you want to read a particular block, the access time is much, much lower. On a hard disk, I told you that the seek plus rotational latency might be of the order of 5 to 15 milliseconds. On a flash disk, there is no seek or rotation: you are looking at probably 100 microseconds or less to read a whole random page. That is much faster than 10 milliseconds.
That is of the order of 100 times faster. What about writes? It turns out that you cannot actually go and physically overwrite a page on flash without first doing an operation called erase. It is like a whiteboard: once you have written something on it, you have to erase it before you can write anything more. And the thing with flash is, erase is slow. An erase can take 1 or 2 milliseconds, which is now almost of the order of magnitude of a hard disk access. By the way, the random page read I said is 100 microseconds; it can be even less depending on the flash device — you might get it in something like 1 or 2 microseconds even; 100 is kind of the worst-case scenario. For writes, if the page is erased already, you can write reasonably fast; it is comparable to the read speed, a little bit slower. But if you need to erase the page, then it is a lot slower. So, if you have already got data there and you have to wait 2 milliseconds for the page to be erased, that is a long wait. What is interesting is that the erase is not done at the unit of a page; it is done at the unit of what is called a block, which is many pages. So, many pages can be erased in parallel: a block erase takes the same time as a single page erase, which is 1 or 2 milliseconds. This is a weird characteristic; hard disks are not like this. So, to deal with this problem, flash devices have a piece of software running on top called the flash translation layer — you see here, the flash translation layer. What does it do? One of the things it does is remap sectors. So, you wrote a sector some time back; now you want to overwrite it, but you do not want to spend time erasing it. The trick is, you write it to a new location which is already clean. It is like I have a lot of blackboards: I have written on one blackboard, and when I need to write more, I jump to another blackboard and say, ok, that blackboard used to be called physical location 1000.
Now, I am going to another blackboard, and from now on the new blackboard is called physical location 1000. So, you are actually remapping pages to a new physical location underneath. As far as the operating system is concerned, it is the same page number, but the flash device itself knows that it is physically in a different place on the device. So, we need a mapping between the page number and the physical location, and this mapping is stored in what is called a translation table. All of this is hidden from you; the flash translation layer does it. It is part of the device driver for the flash: if you plug in a device, there is a driver which does all of this, and the translation table itself is stored on the pen drive or on the solid-state disk itself. So, pen drives are slow, but solid-state disks are much faster. Another weird thing about flash is that if you keep writing and erasing the same page — or rather the same block — over and over again, after some time there is physical damage to the block and its reliability goes down. And so manufacturers put a limit on how many times you can erase a block. If you erase it more than that many times, they will say, look, this block will not be reliable anymore, and it is removed from the system in effect. So, the size of your flash device just shrank a little bit, because that particular block is removed from the active block set. So, if you keep writing and erasing the same block over and over again, you have a problem. How many erases is this? 100,000 to 1 million is considered common these days. This might improve — there have been some advances in this field — but it used to be as low as 10,000. Now, how long does it take to erase a single block 100,000 times? With 1 millisecond per erase, in about 100 seconds of continuous erasing you can destroy a block. So, that is pretty quick.
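The remapping and erase-count bookkeeping just described can be sketched as a toy flash translation layer. This is purely illustrative — real FTLs erase lazily in the background, batch writes, and track wear per physical block in far more sophisticated ways — but it shows the core idea: an overwrite is redirected to a clean page, and the translation table is updated so the same logical page number now points somewhere else.

```python
class TinyFTL:
    """Toy flash translation layer: logical pages are remapped to fresh
    physical pages on overwrite, so no erase sits on the write path.
    Illustrative sketch only; real FTLs also batch erases and level wear."""
    def __init__(self, n_physical: int):
        self.free = list(range(n_physical))   # clean physical pages
        self.table = {}                        # logical page -> physical page
        self.erase_count = [0] * n_physical   # wear bookkeeping per page

    def write(self, logical: int):
        old = self.table.get(logical)
        self.table[logical] = self.free.pop(0)  # redirect to a clean page
        if old is not None:
            self.erase_count[old] += 1          # old page erased (lazily),
            self.free.append(old)               # then recycled as clean

ftl = TinyFTL(4)
for _ in range(3):
    ftl.write(1000)        # overwrite logical page 1000 three times
print(ftl.table[1000])     # 2: the page has been physically relocated twice
```

Each overwrite of logical page 1000 lands on a different physical page, while the OS keeps using the same page number throughout — which is exactly why the translation table has to live on the device itself.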
So, you can destroy blocks quite quickly: a couple of minutes of continuous erasing is enough to completely wipe one out by erasing it too many times. But remember, the size of a flash device is very big — it is gigabytes, it has millions of blocks. So, what you want to do is spread the wear around: give every block a chance to get erased a few times, and if a block has been erased many times, try not to erase it again. How do you avoid erasing it again? There is a trick: data is usually divided into hot regions and cold regions. Hot regions are regions which keep getting overwritten; cold regions are regions which hold data that does not change at all for a long time. The trick is, you remap the blocks which have been erased many times and use them to store cold data, and the block which was used to store cold data can now be used for hot data; maybe it will get erased a number of times, and then it may get moved away to store cold data. All of this, which is called wear leveling, is again taken care of by the FTL. With wear leveling, you usually do not see any degradation in the capacity of the device for a long time, but eventually, of course, it will catch up. But that eventually is many years away, and by then you have probably junked your system and bought a new one, or at least a new disk. Now, failures can happen with any medium. Hard disks can fail; flash devices can also fail. In fact, inside a flash device, the rate of failure is much, much higher than with magnetic media. So, what do flash devices do about this? It turns out that internal to the flash device itself, they have a mechanism which is similar to the RAID mechanism we are going to look at just now. You can also build RAID on top of flash devices, but let us just focus on magnetic disks — it does not really matter; RAID can be built on top of any disk, whether magnetic or solid state. RAID stands for redundant arrays of independent disks.
Originally it stood for redundant arrays of inexpensive disks: disks which are cheap, as opposed to more reliable, expensive disks. These days disks are all more or less the same price; there is a price difference between desktop class and enterprise class, so you could use desktop-class disks in a RAID array and get more reliability. So, how do you do this? Well, there are two parts: high reliability, which you get by storing data redundantly — maybe multiple copies, or by other means — and high capacity and throughput, which you get by using multiple disks in parallel. If you want to store data at the rate of gigabytes per second, a hard disk which can store 100 megabytes per second is not going to keep up with the load. So, what you do is have multiple of these hard disks working in parallel, and you suddenly have the capacity to store gigabytes per second. Now, the second part, reliability, turns out to be very, very important — even more important than the high speed. If you have a system with 100 disks, each with a mean time to failure of 100,000 hours, which is approximately 11 years, then you will have a mean time to failure of some disk in this array of 100 disks of 1,000 hours, or 41 days. So, imagine: every 41 days there is a hard disk failure which brings your system down, and you lose data. That is a disaster. So, what you want to do is somehow avoid losing data, and whatever you do to avoid losing data requires some amount of redundancy: storing extra information that can be used to rebuild information that is lost in a disk failure. Now, this might sound weird. We just spent a lot of time on normalization saying that redundancy is a bad thing, duplication is bad, let us design a schema which avoids duplication — and here we are coming back and saying we need redundancy to survive failures. But this is actually consistent. You do not want redundancy at the logical level, because then you have multiple copies and you do not know which is correct.
This, on the other hand, is at the physical level, and you — meaning the RAID system — know exactly where the copy is; if a failure happens, that copy is used to restore the data. If you update something, the copy also gets updated automatically; you, the programmer, do not have to worry about this. So, the simplest form of redundancy is mirroring, or shadowing: you just duplicate every disk. Whatever you write to copy one, you also write to copy two; when you read, you can read from either copy. So, the pair of disks looks logically like a single disk: there are two physical disks which look like one logical disk. If one of the disks fails, since everything you wrote was also written on the other disk, you can continue running using just the other disk. Of course, if you keep running for a long time, there is a chance that the other disk will also fail. So, what you want to do is repair the broken disk as fast as possible: replace it and then restore it by copying data from the good disk to the new blank disk which you just plugged in. Now, the mean time to data loss is how long you can expect the system to run before there is actual data loss. When is there data loss? A disk failed, and before you could replace and repair it, the other copy also failed. So, what is the probability of that happening? It depends on two things: the mean time to failure, which we have already seen, and the mean time to repair, that is, how fast you can replace the disk and restore the data on it. If you take a long time to repair, the chance of data loss goes up sharply; if you can repair it fast, the chance of data loss goes down, and the mean time to data loss increases correspondingly.
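The arithmetic here can be made concrete with back-of-the-envelope formulas. The first is the MTTF/n rule already used above for a 100-disk array; the second is the usual textbook approximation for a mirrored pair — which is not derived in the lecture and assumes independent, memoryless failures — and the 10-hour repair time is an assumed value for illustration.

```python
# Expected time until *some* disk in an array of n fails.
def some_disk_fails_hours(mttf: float, n_disks: int) -> float:
    return mttf / n_disks

# Mean time to data loss for a mirrored pair, using the standard
# approximation MTTF^2 / (2 * MTTR); an idealization that assumes
# independent failures and ignores disk aging.
def mirrored_mttdl_hours(mttf: float, mttr: float) -> float:
    return mttf ** 2 / (2 * mttr)

print(some_disk_fails_hours(100_000, 100))       # 1000.0 hours, ~41 days
print(mirrored_mttdl_hours(100_000, 10) / 8760)  # roughly 57,000 years
```

The striking point is the second number: even with individually unreliable disks, a short repair window makes data loss in a mirrored pair astronomically unlikely — which is exactly why the hot-spare trick described next pays off.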
So, a trick which is widely used in most RAID systems is to keep a spare disk. The moment the system finds a disk has died, it restores all the data from the mirror disk onto the spare disk. Humans are then informed and told, look, a disk died, please come and replace it as soon as possible, but it is not an emergency; you do not have to come in the middle of the night to replace it. You can come tomorrow, or do it once a week, and this is the principle most companies running big data centers follow: they do not go and replace a failed disk immediately. They always keep spare disks, and once a week someone walks through the array, finds all the disks which have failed, pulls them out, and replaces them with new disks, which become the new spares in case another failure occurs. So, that is the basic idea of mirroring, but you do not have to just mirror; there is an alternative called parity which is also widely used. Before we come to parity, let us also talk of striping. If you want higher disk throughput, that is, you want to store and read data at a faster rate, what you do is stripe blocks across disks. So, if you have a file, block 1 may go to disk 1, block 2 to disk 2, block 3 to disk 3, and so forth, and then you do round robin and start again: after you have exhausted the disks, you go back to disk 1, store the next block there, and so forth. Here is the payoff: suppose I want to read blocks 1 to 1000. Disk 1 will read block 1, then maybe blocks 11, 21, 31, and so on. Disk 2 will read blocks 2, 12, 22, 32; I am assuming 10 disks. All of this happens in parallel, so I get the throughput of 10 disks running in parallel, and I can get very high throughput this way. This is called block-level striping: the ith block of a file goes to disk (i mod n) + 1, where the plus 1 is because I am numbering the disks from 1.
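The round-robin mapping can be sketched in a couple of lines. Here I number both blocks and disks from 1, to match the example of disk 1 serving blocks 1, 11, 21, and so on:

```python
def disk_for_block(block_no, n_disks):
    # Block-level striping: with blocks and disks both numbered from 1,
    # block i lands on disk ((i - 1) mod n) + 1.
    return (block_no - 1) % n_disks + 1

# With 10 disks, disk 1 serves blocks 1, 11, 21, ... and disk 2 serves 2, 12, 22, ...
print([disk_for_block(b, 10) for b in (1, 2, 11, 12, 21)])   # [1, 2, 1, 2, 1]
```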
Now, I mentioned parity. What is parity? It is simply the XOR function applied to a set of bits, and in this case it is applied to the bits of corresponding blocks. Let me set this up carefully: assume an array of 6 disks, with 5 disks holding actual data and 1 disk holding parity. So, blocks 1, 2, 3, 4, 5 go to disks 1, 2, 3, 4, 5, and on disk 6 the first block will be the XOR of the corresponding bits of blocks 1, 2, 3, 4, 5. Similarly, blocks 6, 7, 8, 9, 10 will go to disks 1, 2, 3, 4, 5, and disk 6 will store the parity, which is the XOR of blocks 6, 7, 8, 9, 10, and so forth. This is one particular scheme; you can improve on it, as we will see, but the idea is that if one of the disks fails, we know which disk has failed, and we can reconstruct the data in the blocks of that disk. Suppose the second disk failed; how do I reconstruct block 2? It is actually very simple. The XOR function is symmetric, so I take the XOR of blocks 1, 3, 4, 5 and the corresponding parity block, which is on disk 6. XORing the bits of those gives me back the bits of block 2, which was on the failed disk. I do not have that block, but I can reconstruct it by XORing the corresponding blocks; that is the trick to parity. There are many ways of doing parity, and researchers came up with some nomenclature to differentiate them.
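A tiny sketch of parity reconstruction, using short byte strings to stand in for disk blocks:

```python
def xor_blocks(*blocks):
    # Bytewise XOR of equal-length blocks; XOR is its own inverse, which is
    # exactly why a lost block can be rebuilt from the survivors plus parity.
    result = bytes(blocks[0])
    for blk in blocks[1:]:
        result = bytes(a ^ b for a, b in zip(result, blk))
    return result

data = [bytes([10 * i, 10 * i + 1]) for i in range(1, 6)]   # 5 data blocks
parity = xor_blocks(*data)                                  # stored on disk 6

# Disk 2 fails: XOR the four surviving data blocks with the parity block.
rebuilt = xor_blocks(data[0], data[2], data[3], data[4], parity)
print(rebuilt == data[1])    # True
```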
This is what is called RAID 0, which is block-level striping with no redundancy: you just stripe the blocks across the disks. So, if I have 4 disks, blocks 1, 2, 3, 4 go here, then 5, 6, 7, 8, and so forth, round and round. You get higher throughput, but if a disk fails you lose data. RAID 1 is mirrored disks with striping: you stripe data across these 4 disks, and another 4 disks are an exact copy of the first 4. Now, in the research community RAID 1 refers to this, but in industry many people call it RAID 1+0; sometimes people also have something called RAID 0+1, with some very minor differences which we will ignore. So, if you see RAID 1+0 or RAID 10 in industry literature, it refers to this: mirrored disks with block striping. Correspondingly, in industry RAID 1 is sometimes used to refer to mirrored disks with no striping at all. Then there are RAID levels 2, 3, and 4, which we are going to ignore because nobody uses them. RAID 5 is very widely used, and it is what is called block-interleaved distributed parity. Basically it does the following. I have an array of 5 disks, and the parity blocks are distributed across the disks. Here is a schematic, with blocks numbered from 0 onwards: data blocks 0, 1, 2, 3 are on four of the disks, and the corresponding block on the very first disk is the parity of these 4, the XOR of these 4 blocks. For the next set of 4 blocks, the parity is not on the first disk; it goes to the second disk, and the data blocks 4, 5, 6, 7 go to the other disks. For the third stripe, the data blocks 8, 9, 10, 11 are here and the parity is on the third disk, and so forth. If you go to block 20 onwards, this whole pattern repeats: parity block number 5 will be on the first disk, block 20 here, then 21, 22, 23, and so forth. Why do we do this? Why not just store all the parity blocks on one disk? It turns out that when you write any piece of data, if you write block 5 you have to write block 5, but
you also have to update parity block P1; when you write block 10 you have to update parity block P2 also. Now, suppose all the parity blocks were on one disk: every write to any other disk would also require a write to this parity disk, and what you would see very soon is a long queue of writes for this disk, while all the other disks run at, let us say, one fourth of their capacity, idle most of the time. The trick of spreading the parity blocks across all the disks speeds this up, in the sense that all the disks have parity-block writes coming to them, so the load is leveled across them. So, that is the choice of different RAID levels; RAID 1 and RAID 5 are the most common today. You would use RAID 1 if your application does not need to store all that much data but you want very fast reads and writes; but if you are storing a lot of data and you worry about the cost of disks, you would use RAID 5. Why should be clear: in RAID 5, for 4 data blocks I had 1 parity block, and in fact I could push this even more; I can have 8 data blocks with 1 parity block. The more I push it, the more chance I have of losing data, but I can go to 8 without too much problem. So, the overhead of storing redundant data, the parity blocks, is much lower in RAID 5; in RAID 1, for every disk there is a mirror disk, so the overheads are quite a bit higher. But strangely enough, what happened after some time is that people found that in many cases capacity is not the issue. The capacity of disks has exploded over time: some years back we were getting by with a 1 GB disk; when I was finishing my undergrad, a 10 megabyte disk was considered fine for a desktop, and that evolved from 10 megabytes to 100, to a gigabyte, to now a terabyte being considered standard. So, disk capacity is not the issue. But look at the number of I/O operations a disk can do: back in that era the figure was something like 30 milliseconds to do a read or write to a random location; today it is maybe 10 milliseconds. So, in an era when disk capacity has gone from 10 megabytes
to a terabyte, which is a huge explosion, 100,000 times the capacity, the speed at which I can retrieve a random block has improved only by a factor of 3. That is terrible. So, the bottleneck these days is often not the capacity but how many I/O operations a given workload can perform on the disk in a given amount of time. Even though RAID 1 is wasteful in terms of disk space, it turns out to be a lot better in terms of the number of I/O operations needed for a single write. If I do a write on RAID 1, I just write 2 blocks and that is it. But if I want to do a write on RAID 5, I have to read the old data block and the old parity block, and then write the two things back. So, what I end up with is two reads and two writes, as opposed to two writes without a read in RAID 1. So, RAID 5 is actually more wasteful in terms of I/O operations required, and it turns out that for many applications RAID 1 is much better than RAID 5, given that hard disk capacities are so large. But RAID 5 is good when you do not update data much: if you are storing video data, YouTube and so forth, it is totally wasteful to use twice the disks with RAID 1, and you would go with RAID 5. Lastly, some other terminology: there is what is called software RAID versus hardware RAID. Software RAID is RAID implemented completely in your operating system drivers. Hardware RAID, on the other hand, is built into some lower layer, and the key thing in hardware RAID is that it uses a small amount of non-volatile RAM, which could be battery-backed RAM or even flash, to record writes that are being executed. Now, why is this important? Suppose you are on a RAID system and you had to write a block. If you wrote the block simultaneously on both disks and a power failure happened in the middle of the block write, you could be in a situation where both copies of the block are corrupted. The disk
did not fail; a power failure just happened in the middle of a write. Now, there are ways around this, but they slow things down. What you can do instead is record the data you are writing in non-volatile RAM; if a failure like this happens, when you come back up from the failure you simply take that pending write which you recorded in non-volatile RAM and complete it, writing it to both disks. So, it does not matter if a failure happened in the middle of a block write; it will get completed when you recover. It is a lot more efficient to do this if you have a small amount of NVRAM. This is usually not available with software RAID, so there is a higher chance of data loss with software RAID. And finally, some more terminology. In the original computer systems, disks were attached inside the computer system. Today that is true of your desktop, but many enterprises have now started decoupling the disk subsystem from the actual computer hardware, the CPU plus memory. What they do is have what is called a storage area network: you have a number of disks connected to a subsystem which does RAID and other stuff, and you connect this storage system to the computer over a kind of network called a storage area network. Now, what is the benefit of this? The benefit is that the storage subsystems do not do anything except manage storage, and they are a lot more reliable precisely because they do nothing else. On the other hand, your high-performance CPU with lots of memory is more vulnerable to failure, but by separating the two you have ensured that if the CPU fails, it does not matter: you just bring up another CPU which loads the same operating system, data, and everything from the disks, which are now separate, out on the network. So, you get much higher reliability by storing all your information in a storage area network rather than on a local disk. And the rise of virtual machines (VMs) has pushed
the trend along further, because it is very easy to bring up a virtual machine on a new piece of hardware. If one piece of hardware fails, the VM image would also be stored on the storage area network disks, so you can just read the VM from disk and restore it on a new computer extremely fast. You can mask failures more or less completely; people will just see a very short outage. Now, what about the disks in the storage array? They can fail too. So, these disk subsystems implement RAID, of course, and they let you hot swap, that is, replace disks while the system is running; they will rebuild the failed disk, and they will even let you add capacity by adding more disks on the fly, without bringing anything down. They are widely used these days. They are also horribly expensive; they cost like 10 times or more the price of a regular disk subsystem, but people are still willing to pay that price. There is also something called network-attached storage, which is basically the Unix NFS or the Windows SMB system, which lets you store files on a remote file server. You do not get disk-level access, but you can open files, write to files, and so on; these are also widely used. Okay, so that was the underlying storage. Now let us move up one layer and see how databases store data on disks. Typically a database stores data as a collection of files. Now, what are these files? They could be operating system files, in which case the database leaves the management of the files, meaning which blocks are in the file, how you add a block to the file, how you find the ith block of the file, to the operating system. When you install Postgres or MySQL, they do exactly this; they leave it to the operating system. However, high-performance databases are usually tailored to bypass the operating system and to work directly with the blocks on disk. The advantage of doing this is that one layer of software, the operating system, is bypassed, and the database can better control what data goes where, what data is kept in cache, what is not, and so forth.
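Before moving up to record storage, the RAID 5 write penalty described earlier is worth pinning down in code. The key identity is that the new parity equals the old parity XOR the old data XOR the new data, which is why a small write costs two reads and two writes (this is a sketch of the standard small-write path, not any particular controller's implementation):

```python
def xor(a, b):
    # Bytewise XOR of two equal-length blocks.
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data, old_parity, new_data):
    # Read the old data block and the old parity block (2 reads), compute
    # the updated parity, then write the new data and new parity (2 writes).
    new_parity = xor(xor(old_parity, old_data), new_data)
    return new_data, new_parity

# Check the shortcut against recomputing parity over the whole stripe.
stripe = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
parity = xor(xor(stripe[0], stripe[1]), stripe[2])
new_block = b"\x0f\x0f"
_, new_parity = raid5_small_write(stripe[1], parity, new_block)
stripe[1] = new_block
print(new_parity == xor(xor(stripe[0], stripe[1]), stripe[2]))   # True
```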
Let us not worry about this distinction here; we will assume that a relation consists of one or more files, and each file has a number of records in it. Whether these are OS files or files managed by the database system is something we will ignore from here on. We are going to start with fixed-length records and then move to variable-length records. Fixed-length records are extremely easy to store: each field has a fixed size, the whole record has a fixed size, so you can treat the file as an array of bytes, and record i starts from byte number n times (i minus 1), where n is the size of an individual record. So, it is very easy to seek in the file to a particular record and read it; record access is very simple. One catch is that if the physical block size is not an exact multiple of the record size, the last record in a block might be partly in this block and partly in another block, which can cause trouble for recovery and other algorithms. So, usually you are willing to leave some free space at the end of a block rather than allow a record to span two different blocks. That is how you store the records, but records are dynamic; they may be deleted. What do you do if record 3 is deleted? From your data structures courses you know there are several alternatives. One is to move all the succeeding records up to fill the space; this is totally impractical for a large file, but it is practical, and is used, at the level of a single block. Within one block, if you delete record 3, it is evidently possible to move the records behind it up, or, more commonly, the records in front are moved down so that all the free space in the block is together in one spot. Or you can keep a linked list of free records; I will not get into the details, but it is shown here pictorially: these are the records which were deleted and are now free, and you can start from the header and follow the next-pointer values to find all the free records,
and the last one has a null for its next free record, which means that is the last free record in this particular block. Again, these are minor details; we will not make too much use of them. Now I want to slow down a bit and talk about how databases store variable-length records. Fixed-length records are very easy; they date back to data structures courses. But how do you store a variable-length record, and how do you store records which may have null values? We will answer both questions with this diagram. This particular layout is for an instructor record, where we are assuming that the instructor ID, name, and department name are all variable-length strings, while the salary is a fixed-length numeric or integer. So, how is this stored in a record? This whole thing is one single record. For all the variable-length fields, the actual data is stored at the tail end of the record. At the beginning of the record, we know the first field is the ID, and it is variable length; I think in our actual declaration we called it varchar(5), which we could have stored in line, but here we are assuming it is varchar, so its length could vary. What we do is store two pieces of information at the beginning, corresponding to the first field, the ID. The first is an offset within this record where the data starts: there is a bunch of information, and the ID starts at byte 21. The second is how many bytes the ID occupies: it has 5 characters, 5 bytes. The next field is the name; that is also stored at the back. Its starting point is byte 26, and Srinivasan is a long name, so its length is 10. The third field is the department name, also stored at the back: its starting point is byte 36 and its length is again 10. The fourth field is an integer, so it is stored in line, 65,000 here, in, let us say, 8 bytes of integer. Based on the schema declaration of instructor, I know that this is how it is going to be stored. Let us say that, since pages are not
all that big, the offset and the length may be 2 bytes each; 2 bytes can address up to about 65,000 bytes, which is more than a page. So, what I will have is 4 bytes for the first field, 4 bytes for the next field, and so on, because they are all varchar here; for the integer, let us say I want a large integer, so I am storing 8 bytes. Given the schema, I know exactly where in this initial array of bytes to find the details of, say, the department name: I know its entry must start at byte number 8 and occupy 4 bytes, and those 4 bytes hold 2 bytes of offset and 2 bytes of length, which I then use to jump to the back and find the department name. There is one other piece of information, which is the null bitmap, and this records which of the attributes have a null value. Now, for varchar it would seem easy to store null values: all you do is say the length is 0, meaning there is no data, so it is a null value. Well, not quite; there is also the empty string, which is not the same as a null value, and the empty string would also have length 0. An actual null string could perhaps be represented by an offset of 0, but for integers and numerics you cannot play that trick, so you must have a bit which is set if that particular field is null. Here I am showing 1 bit per field, and if the salary were null, the fourth bit here would be set. So, let me cover this last slide and then we will have time for some questions. The last slide in this segment (there is more coming up) is about how you store variable-length records in pages. Earlier we had a fixed number of records per page; now we have a variable number of records per page, and each record is of variable length. If I want to quickly access a record in a page, I need some structure for it, and this particular structure, called the slotted page structure, is very commonly used. What it has is this: all the records are stored at the end of the page; by end of the page I mean that a page has bytes starting from 0
to some number, and typically the records are packed together without any gaps at the end of the page. At the beginning of the page, on the other hand, I have a header which has several things. The first is the number of entries, that is, how many entries the block header has. Each entry itself has a size and an offset, meaning that the first entry is for record 0, which has some size and an offset. The offset is shown pictorially here as a line, but what is actually stored is a value: it says record 0 starts here, then record 1 starts here, record 2 starts here, record 3 starts somewhere here, and so forth. The offsets and the lengths are stored here. One extra piece of information is stored, which is where the free space ends, and that is this point: the free space ends here. Based on the number of entries, I can calculate where the free space begins: the number of entries times the size per entry, plus the size of the fixed header fields, together determine where the free space begins. So, if I want to create a new record, I just move the end-of-free-space pointer back; now the free space is reduced, the new record has been allocated space there, and an entry is made for that record in the header. What if a record is deleted? I basically delete the entry in the header, saying this entry is no longer used, it is free; but I also move records to fill in the hole: these two records will move to fill in the space, and the free space expands and stays contiguous. When I move them, I have to update their offsets in the header, so there is some overhead to doing this, but the benefit is that the free space is all together. If you are familiar with memory allocation systems, which come as part of any language, malloc and free and the equivalent in Java, one of the problems they run into is that they typically do not do compaction, and the free space can get fragmented into many small pieces. Here, the free space within a particular page is always together, so
there is no fragmentation within a page. The last thing I want to say is that if you want to refer to a particular record inside a particular page, which is a common requirement, you use what is typically called a record ID. A record ID identifies the page where the record is stored and an offset (a slot number) into the block header. You never store the physical location of the record, because the record can move around within the page, and if it moves, you do not want to have to go and update an index to tell it the record moved. What you do in the index is store the page number and the slot number within the block header, and if you move the record within the page, you just update the starting location stored in that header entry; the index itself is not touched. You just go to the corresponding entry in the block header, and from there you can find the record. Even if it has moved many times, it does not matter; you go to that location and you will find the record.
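Going back to the variable-length instructor record described above, its layout can be sketched as a small encoder. This is a simplified illustration under stated assumptions: 2-byte offsets and lengths, an 8-byte in-line integer, and a 1-byte null bitmap; the ID value 10101 is assumed here, while the offsets 21, 26, 36 and the lengths 5, 10, 10 match the lecture's example:

```python
import struct

def encode_record(id_, name, dept, salary):
    # Header: three (offset, length) pairs of 2 bytes each for the varchar
    # fields, then the 8-byte salary stored in line, then a 1-byte null
    # bitmap (all zeros here: no field is null), then the string data.
    strings = [s.encode() for s in (id_, name, dept)]
    header_size = 3 * 4 + 8 + 1            # = 21 bytes before the string data
    record = b""
    offset = header_size
    for s in strings:
        record += struct.pack("<HH", offset, len(s))
        offset += len(s)
    record += struct.pack("<q", salary)    # fixed-size field stored in line
    record += bytes([0])                   # null bitmap
    record += b"".join(strings)            # variable-length data at the tail
    return record

rec = encode_record("10101", "Srinivasan", "Comp. Sci.", 65000)
print(struct.unpack_from("<HH", rec, 0))   # (21, 5): ID starts at byte 21, length 5
print(rec[26:36])                          # b'Srinivasan'
```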
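And the slotted-page mechanics themselves can be sketched as follows. This is a simplified model: a Python list stands in for the on-page slot array, and deletion with compaction is omitted; the point it shows is that the slot number, not the byte offset, is the stable handle an index would store:

```python
PAGE_SIZE = 4096

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []            # header entries: (offset, size) per record
        self.free_end = PAGE_SIZE  # records are packed from the end of the page

    def insert(self, record: bytes) -> int:
        # Move the end-of-free-space pointer back and copy the record there;
        # the returned slot number stays valid even if the record later moves.
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1

    def get(self, slot: int) -> bytes:
        # Indirect through the header entry to find the record's current bytes.
        offset, size = self.slots[slot]
        return bytes(self.data[offset:offset + size])

page = SlottedPage()
a = page.insert(b"first record")
b = page.insert(b"second")
print(page.get(a))          # b'first record'
print(page.free_end)        # 4096 - 12 - 6 = 4078
```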