So here is a possible construction. Suppose we construct a binary tree structure. What is this binary tree structure? Each block has two pointers in the corners, and in between there are keys and pointers. Let me explain what this means. First copy this down so that you will understand; there is a lot of symbolism there, but it is not as difficult as it seems. I decide that on this disk I will have a large number of data blocks. We are already familiar with the notion that each block has a number, so a pointer means a block number. Whatever I want to access is identified by a pointer. My problem is that if I have 1.2 crore keys, I need 1.2 crore pointers, one for each key; each key tells me which block I have to go to. Of course each block may contain multiple records if the block size is large, which means different keys may share the same pointer. But while searching I still have to search all 1.2 crore keys. How will I do that is the question, and how do I organize the index blocks? As we saw, if I require 1 lakh index blocks, then searching 1 lakh index blocks is itself very time consuming. So what I do is merge the index and data blocks together and decide to have a storage of this kind, where this is, let's say, one block. In it I have one pointer, say pointing to block 1003, and another pointer on this side, say pointing to block 1005; I am just giving an example. Then I read this block. Inside it I already have key plus pointer, key plus pointer, and so on, with all keys arranged in sequence. Why do I have these two side pointers? Suppose I am searching for a key called K prime. I read my first block, say the root block, and compare K prime with these keys: K1, K2, and so on up to Kn, with their pointers. If K prime is smaller than K1, I know I will not find K prime among these keys. If K prime is larger than Kn, I again know I will not find it here.
So only keys between K1 and Kn are to be found here. Once I read this block, all this information is inside memory, so I can quickly compare K prime with these values; an in-memory search is much faster. If K prime lies between K1 and Kn, I have found my key and its pointer. But if K prime is less than K1, then it is this left pointer which tells me: for all keys less than K1, go here. For all keys greater than Kn, go there. You realize that this is like a binary search. Suppose the middle keys are kept in the root node; then for all keys less than these I go this side, and for all keys greater I go that side. When I go to block number 1003, I again find a similar structure: two side pointers and some keys. Notice that if I had one lakh keys, then by this binary division I have divided them into roughly 50,000 keys on this side and 50,000 on the other side. It is a binary split. Again, under the 50,000 I will find some keys in the next block; most probably K prime will not be among them, but if K prime is less than them I go this side, and if more, that side. So 50,000 becomes 25,000, 25,000 becomes 12,500, and so on. That is the binary search. In short, if I mix keys and pointers together, with these two side pointers leading to keys less than K1 and keys greater than Kn, I am dividing the search space into two at every level, and by this halving I can continuously drill down; I do not have to do one lakh searches. How many searches will I have to do? Log of one lakh to the base two. That is roughly the search depth I will have to follow. Log of one lakh to the base two is about 17 blocks, which is the height of the tree. Do you realize that this is much better than one lakh searches? This is the equivalent of binary search.
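The descent described above can be sketched in a few lines. This is a toy illustration: an in-memory dict stands in for the disk (block numbers as keys), and the block layout here is hypothetical, with sorted (key, pointer) entries plus "left"/"right" side pointers for keys below K1 and above Kn.

```python
# A sketch of the index-block descent described above. The block
# format is made up for illustration, not any real product's layout.

def search(blocks, root, k_prime):
    block_no = root
    while block_no is not None:
        block = blocks[block_no]            # one disk block read
        entries = block["entries"]          # sorted (key, pointer) pairs
        if k_prime < entries[0][0]:
            block_no = block["left"]        # all keys < K1 live down here
        elif k_prime > entries[-1][0]:
            block_no = block["right"]       # all keys > Kn live down here
        else:
            for key, ptr in entries:        # in-memory scan: fast
                if key == k_prime:
                    return ptr              # pointer to the data block
            return None                     # in range but not present
    return None

# Tiny two-level example: middle keys in the root, rest in two children.
blocks = {
    0: {"entries": [(40, "d40"), (60, "d60")], "left": 1, "right": 2},
    1: {"entries": [(10, "d10"), (30, "d30")], "left": None, "right": None},
    2: {"entries": [(70, "d70"), (90, "d90")], "left": None, "right": None},
}
print(search(blocks, 0, 30))   # descends left, finds d30
print(search(blocks, 0, 60))   # found in the root itself
```

Each iteration of the loop corresponds to one block read, and each read halves the remaining search space.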
It is a very crude equivalent, but it is still an equivalent because it reduces the search space. Please remember that the blocks may be placed anywhere on the disk; I do not care. But I now know that even for one lakh indices I can reach my required index entry in at most 17 block reads. I will know exactly where the pointer is. Once I get the pointer I go to that block and get the record. Consequently, in 18 disk block reads I can get the actual record. This is still too slow. Can I reduce this 18 further? Remember, 18 block reads at, let us say, 50 blocks per second still means a substantial one-third of a second, or slightly more. So can I reduce that further? Anything I do to reduce this will make my information system more efficient. Please also understand that it is not the actual reading or writing of information which is taking time; it is locating the place where to read or write, that is, the indexing, which is taking time. So instead of a binary tree kind of structure, which says if the key is less than this go this way, if more than this go that way, and which breaks the search space into only two halves at each level, can I do better? Can I break the tree into multiple parts? Instead of making a binary tree, can I use a multiway tree? This multiway tree is called a B tree, and its common implementation is called a B plus tree. A typical node in a B plus tree looks like this: p1, k1, p2, k2, and so on up to pn minus 1, kn minus 1, pn. So it does not directly pair a key with a data pointer. In fact this tree is very funny: at the index levels, no information about where the desired record is located is stored directly. These levels are only the search part of the tree. In this search part, the information is organized such that each of these Ks is called a search key value and each of the Ps is a pointer to a child node.
There is no record information stored in such a node at all. Every time you visit a node you are led to another node, and eventually to the records or buckets of records, which are stored in what are called leaf nodes. So this tree structure is extremely funny. Let me illustrate it with another diagram here. If this is a binary tree, the structure we are now envisaging is: here is a node; the first pointer points to one node here, the second pointer points to another node here, the third pointer points to a third node here, and so on, with keys in between. That means if there are 16 pointers, this node alone is pointing to 16 different nodes. Where you search depends on what your K prime value is relative to these keys. Consequently, instead of a two-way bifurcation you get an m-way bifurcation, where m is the number of pointers. The actual values are stored only in what are known as leaf level nodes, where you will actually find the key and pointer you are searching for. Let us look at this further and you will understand the significance immediately. So here are k1, k2, k3, up to km minus 1. These are not necessarily the actual keys you are searching for; remember, these are called search keys. It may so happen that the K prime you are searching for equals one of them, but invariably it will be smaller or greater than most of these values. All that these pointers do is decide where you should go. I think you will have reasoned by intuition that if the key I am searching for, K prime, is less than k1, then I should go to pointer p1. If it is more than k1 but less than k2, I should go to p2. If it is more than k2 but less than k3, I should go to p3. If it is more than kn minus 2 but less than kn minus 1, I should go to pointer pn minus 1. And if it is greater than or equal to kn minus 1 (we can decide on which side the equal case goes), I go to pn.
Consequently, if there are one lakh records to search, one lakh now gets divided into 8 or 10 or 16 portions, so only one-eighth or one-sixteenth remains under every subtree. I apply the same logic at every subtree. So instead of log of one lakh to the base 2, I get log of one lakh to the base m, where m is the fan-out of each node. Is this clear? It is a very elegant structure actually. So here is how the B plus tree search keys are organized. The search keys in a node are always ordered: k1 less than k2 less than k3, up to kn minus 1. All the search keys in the subtree to which p1 points are less than k1. For any i between 2 and n minus 1, all search keys in the subtree to which pi points have values greater than or equal to ki minus 1 and less than ki; the equal-to case is taken on one side. All these mathematical statements simply amount to the following: if I am searching for K prime, then from the root node I can find out which of these n partitions K prime belongs to, and I go to the right partition. There again I have multiple partitions. I keep doing this until I reach a leaf level node, and either my record is not there at all, or, if it is there, it has to be found at the leaf level. All I am doing is reducing the height of the tree from log of one lakh to the base 2 down to log of one lakh to the base m. This is called the B plus tree. With a 24 byte key and an 8 byte pointer, a block of 512 bytes can hold 16 key-pointer pairs. Consequently, with about 1.2 crore records, the height of the B plus tree will be log of 1 crore 20 lakh to the base 16, which is just 6. This is about one-third the height we got with the binary tree. You see the difference now? Of course modern block sizes are much larger; they are not 512 bytes but typically 4 KB. You can imagine how quickly you can descend. For example, with 1.2 crore records and a modern block size your height will be only 3 or 4. You descend absolutely quickly.
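The height arithmetic above is easy to check. Under the lecture's figures, a 24-byte search key plus an 8-byte pointer is 32 bytes, so a 512-byte block holds about 512 // 32 = 16 pairs, a fan-out of 16; the 4 KB case below assumes the same 32-byte entries, giving a fan-out of about 128.

```python
from math import ceil, log

# Tree height needed so that fanout ** height >= n_records,
# i.e. height = ceil(log base fanout of n_records).
def height(n_records, fanout):
    return ceil(log(n_records) / log(fanout))

print(height(100_000, 2))        # binary division of 1 lakh keys: 17 levels
print(height(12_000_000, 16))    # 1.2 crore records, fan-out 16: 6 levels
print(height(12_000_000, 128))   # 4 KB blocks, same 32-byte entries: 4
```

Smaller keys give a larger fan-out and an even shallower tree, which is why modern block sizes bring the height down to just a few levels.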
That is the reason why you can quickly access the data: because you organize it in B trees. Please remember that this procedure assumes that your tree is balanced. What if, while inserting records, the tree gets lopsided? Then every time you search you find the key is less than this, go here, go here, go here, and there is nothing on the other side; all 1.2 crore records are assembled on one side. Maintaining the balance in a binary tree is far more difficult; those of you who have studied binary search trees would know this. Maintaining B plus tree balance, on the other hand, is automatic, inherent in the insertion philosophy and the logic that B plus tree insertion, deletion and update follow. It is a more complex matter and we are not going into the details. But we will assume that B plus trees are inherently balanced, at a certain extra cost for inserting a new record: whenever a new record is to be inserted, the insertion algorithm maintains the balance of the tree. If the balance is going to be disturbed, it actually splits nodes and suitably restructures the tree. But even at that cost, the B plus tree is always balanced. This is by far the most common method of indexing, used in all database products and in almost all other products which require some kind of database-like indexing. The trouble is that when you say B tree or B plus tree, it is imagined to be a binary tree. The B tree or B plus tree used for indexing on the disk is not a binary tree; it is a multiway tree. The B does not stand for binary. In fact, nobody knows for certain what the B stands for. Bayer and McCreight first proposed this tree, and since the first author's name started with B, people thought it was a Bayer tree.
Then he said no, no, that was not the intention; but since they were working for the Boeing company, some said the B stood for Boeing tree. But people refused to settle on Boeing tree or Bayer tree; it is certainly not a binary tree. The best thing is simply to refer to it as a B tree or B plus tree. You can see how important it is, because the reduction even with a 16-way bifurcation is significant. Imagine what will happen if the log is taken not to the base 16 but to the base 256 or even 1000. If you take log to the base 1000, how much will the height be? You can see immediately that the height will reduce, and the height of a search tree is important because you have to read that many nodes along the height. Is it clear now how information can be organized? So, two conclusions. One, as far as we are concerned the disk is like an array of blocks, and it is the responsibility of the operating system to number these blocks and give those numbers to me. Two, when I put information inside these blocks, I want to identify the information using some key, which is like a roll number, employee code, part number, etc., and the disk and operating system are completely oblivious to the key value. So I will have to create an index structure for the information I maintain, and therefore I will have to create some B plus tree. If you are using a database product, like Oracle or PostgreSQL or whatever, all of them have this indexing facility. There is a command called create index. When you create an index, a B plus tree index is created on the key that you mention. The index file is created separately from the data file that you have on the disk. The leaf level nodes will actually contain pointers into the original data file; it does not matter where in that file the data is. So with one, two, three, four accesses in the index search and one access to the data block, you are guaranteed to get your data.
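The create index command mentioned above can be tried directly with Python's built-in sqlite3 module (SQLite's indexes are B-tree based). The table and column names here are made up for the illustration.

```python
import sqlite3

# Build a small table, index it, and ask the query planner
# whether a keyed lookup actually uses the B-tree index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (roll_no INTEGER, name TEXT)")
con.executemany("INSERT INTO student VALUES (?, ?)",
                [(i, f"name{i}") for i in range(1000)])
con.execute("CREATE INDEX idx_roll ON student (roll_no)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM student WHERE roll_no = 500"
).fetchone()
print(plan[-1])   # e.g. 'SEARCH student USING INDEX idx_roll (roll_no=?)'
print(con.execute(
    "SELECT name FROM student WHERE roll_no = 500").fetchone()[0])
```

Without the index, the plan would report a full table scan; with it, the lookup descends the index tree exactly as described in the lecture.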
That is how indices work very well on the disk. Is this okay? So this is about indexing. What does it mean in terms of our databases? Suppose you have a schema and you have created tables, which may contain any amount of data, whether it is 5000 students or 1.2 crore 12th standard students, or the entire voters list of Maharashtra or Karnataka state, whatever the numbers. You can actually index them. Take PAN numbers, for example: about 6 crore people pay income tax. Or take bank account holders, or life insurance policies: there are some 20 crore policies issued by the Life Insurance Corporation. If you want to search 20 crore policies in a single storage, how long will it take without this kind of index? But with modern disks, all 20 crore policies can be searched in a jiffy. The height of the B plus tree is hardly 3 or 4, and you can create that index automatically by giving a single command. Of course, executing that command to actually build the index is not so simple; constructing the index may take hours. But once you construct that tree, you can keep accessing records and keep inserting or deleting for ages without disturbing the balance. That is the beauty of it. So this is one level of quickly accessing information. You will notice that it is largely independent of the disk access time, rotational delay, etc. Those characteristics which I mentioned are important from an absolute perspective, when you are trying to configure a computer: what kind of disks you should have, how many disks, what the access time should be. You would naturally try to go in for faster disks. Someone mentioned SATA disks. Today SATA drives are in fact much faster, and the interface is also much wider and faster than the SCSI drives of earlier years. Year after year, every six months in fact, the technology keeps changing.
The advantage of SATA drives is that they are commodity drives with very large capacities. For example, a 500 GB SATA drive is available today and the cost is not enormous, so per-byte storage on disk is now much cheaper. Of course you have to wonder: if you have a 500 GB disk and that disk crashes, how will you handle the backup and restore of 500 GB? So you know what people do? People say, I have a 500 GB disk, I will buy two of them, and whatever I write here I will copy there. The probability of both disks failing simultaneously is very low, so if one disk fails, the other survives. But what happens while you are writing on one disk? Yesterday's data you have copied, but before you copy today's data, the disk crashes. Now you have yesterday's data but you do not know what was added today. So there are problems of this kind. You want the kind of redundancy which automatically guarantees that any time you write, the writing is done simultaneously to both disks. In other words, you want to increase the redundancy in order to safeguard the data against the possible failure of disk storage. Towards this, a technology evolved about 15 years ago called RAID, redundant arrays of independent disks; R-A-I-D is the name. Actually the original name, as I remember, was redundant arrays of inexpensive disks, but then nothing remained inexpensive, or rather everything became equally inexpensive, so they call it independent disks. It is a disk organization technique by which you manage a number of disks as if you have a single disk. So instead of one disk you have two disks, but you imagine you have only a single disk: as far as you are concerned, you are reading and writing to one disk, but internally this RAID mechanism ensures that there is a mirror image or something like that, so that if one disk fails, the second disk continues to provide read and write capability.
RAID means that at your level of programming you do not have to worry about it; the hardware and software together manage this. As far as you are concerned, you have a single logical view of the disk; internally it may be more than one disk. This organization of multiple disks into a single volume is called RAID technology. High capacity and high speed are obtained by using multiple disks in parallel, and high reliability by storing data redundantly, so that data can be recovered even if a disk fails. Take a simple case: I have two disks made into a RAID, which means I mirror whatever I write on one disk. So when you give a write command, it will be written on one disk and also on the second disk; the operating system will not come back to you saying "I have written" until both are done. So writing takes whatever time it takes. But imagine the reading: if different people are reading, the reads can be served in parallel from both disks, so read performance improves. There are advantages of that kind in RAID technology. Let us look at the different nomenclatures that RAID has and what they mean. First of all, understand the basic reliability arithmetic; those of you who have studied reliability of course know these things. The chance that some disk out of a set of n disks will fail is much higher than the chance that a specific single disk will fail. This is common-sense probability. For example, take a system with 100 disks, each with an MTTF, mean time to failure, of 100,000 hours. That means a single disk can ordinarily be expected to work for about 11 years without any problem. But 100 such disks together will have an MTTF of 1000 hours, or just about 41 days. So please remember: with a single disk, the expected time to failure is once in 11 years, but with 100 disks in one place, the expected time until some one of them fails is just 41 days. That is the MTTF arithmetic.
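The arithmetic above is a one-liner: under the standard simplifying assumption of independent, identical disks, the expected time until some disk in a set of n fails is roughly the single-disk MTTF divided by n.

```python
# Back-of-the-envelope MTTF arithmetic from the figures above:
# n independent disks, each with the same MTTF, fail collectively
# about n times as often as one disk.
def array_mttf_hours(single_disk_mttf_hours, n_disks):
    return single_disk_mttf_hours / n_disks

mttf = array_mttf_hours(100_000, 100)   # 100 disks, 100,000 h each
print(mttf)                # 1000.0 hours
print(int(mttf // 24))     # 41 days, roughly
```

This is why a large disk farm must be designed on the assumption that some disk is always about to fail.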
So there are techniques for using redundancy to avoid data loss, and they are critical whenever you have a large number of disks. Let me tell you that modern NAS or SAN storage, network attached storage or storage area networks, would typically deploy 4000 or 5000 large disks. In that case you cannot assume that no disk will ever fail. RAID was originally a cost-effective alternative to large expensive disks, so the I stood for inexpensive. But today RAIDs are used for higher reliability and bandwidth, and therefore the I is interpreted as independent. This is just a piece of information; it does not matter what you call it. The idea is to safeguard your storage. So how do you build reliability via redundancy? First, you should be able to rebuild information if it is lost in a disk failure. The simplest scheme, as I mentioned some time ago, is called mirroring or shadowing. That means you duplicate every disk: the logical disk consists of two physical disks, and every write is carried out on both disks. Think of it this way: you are writing in your notebook while taking notes, but there are two notebooks, so whenever you take notes, you write here and write there, write here and write there. The advantage of mirroring, as I said, is that while you have to write on two disks, you can read simultaneously from different disks. This is of course meaningful in a multi-user environment; if a single user is using the disk, reading only one record at a time, it doesn't matter, but with multiple users it makes sense. You might say it is analogous to Xeroxing, to making multiple copies. But it is not quite analogous, because there is no concept of reconstructing information in a photocopy. Your disk files contain volatile information: every update changes the contents.
The handwritten-notes analogy breaks down there, because in a handwritten notebook you do not keep crossing out and rewriting on the same page and copying it again; you write continuously on fresh pages, and that is different from a disk. On a disk the data gets updated in place: your CPI changes, the data is updated; your hostel number changes, the data is updated. The updated data sits at exactly one place, as if you had exactly one page, and every time you change the information you would have to photocopy again; so the analogy dies there. The point is that if one disk fails, I should be able to extract all the information from the remaining disks; I should be able to reconstruct the original information. This is trivially true if I am doing mirroring: mirroring means identical information, so big deal, if one disk is gone I can extract the information from the other disk. So mirroring automatically ensures this notion of rebuilding; I do not have to rebuild anything, the information is already there. But the problem is that the capacity I have to buy for this redundancy is double what I require. If a 40 GB disk was to cost me 5000 rupees, I now have to spend 10,000 rupees, plus the additional hardware and software cost to make this happen. I am spending double the capacity to get this redundancy; can I get the redundancy at less than double the cost? That question gives rise to more interesting algorithms. The mean time to data loss depends on the mean time to failure and the mean time to repair. An MTTF of 100,000 hours and a mean time to repair of 10 hours give a mean time to data loss of about 57,000 years for a mirrored pair of disks, ignoring the possibility of dependent failures. This was the statistical analysis: with a mirrored pair of disks, the data should survive far beyond your lifetime.
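The mirrored-pair figure quoted above follows from the standard approximation: data is lost only if the second disk fails inside the repair window of the first, which gives a mean time to data loss of roughly MTTF squared over twice the MTTR.

```python
# MTDL approximation for a mirrored pair, assuming independent
# failures: MTDL = MTTF^2 / (2 * MTTR).
def mean_time_to_data_loss(mttf_hours, mttr_hours):
    return mttf_hours ** 2 / (2 * mttr_hours)

mtdl = mean_time_to_data_loss(100_000, 10)
print(int(mtdl))                  # 500000000 hours
print(int(mtdl / (24 * 365)))     # about 57,000 years (57077)
```

Notice how sensitive the result is to the repair time: halving the MTTR doubles the mean time to data loss, which is why fast rebuild matters.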
This is of course probability theory, so please remember that probability applies to large numbers of things. An individual disk, if your luck is bad, may fail even the next day, even in a mirrored pair. But this is the general wisdom. The two disks are independent, with different disk controllers, and the operating system will issue a command to write exactly the same block on both disks, and it will be written correctly. That is why the write is always followed by a read in the operating system. Suppose you give some information to me, the operating system, to write. I know what the original information is; that is what you gave me. I will try to write it on this disk as well as that disk; in fact, even if I have only one disk, after writing it I read it back, so I know whether exactly what you gave me was written correctly or not. This is called read-after-write verify. This mode is costly, because remember, I now have to wait one more rotation for the sector to come back under the head. Of course there are thousands of requests coming from multiple users, so throughput will not be affected, but your response time will be. But read-after-write verification guarantees that both copies were identical at the time of writing; if there is corruption afterwards we will not know from this alone, but there are other mechanisms, which we shall discuss, that take care of it. These schemes are the RAID levels. The one I just described, mirroring, is called RAID 1, and combined with striping, RAID 0 plus 1 or RAID 1 plus 0. The other schemes provide redundancy at lower cost by using disk striping combined with parity bits. If you have heard of parity bits: instead of storing just 8 bits, you store a 9th bit, which is some kind of odd or even parity calculation. So if some data gets corrupted in those 8 bits, then by checking the parity bit you will at least know whether it is right or wrong.
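The striping-plus-parity idea can be sketched with XOR parity: one parity block per stripe lets any single lost disk be rebuilt from the survivors (this is the principle behind RAID levels 4 and 5). The data values below are made up for the illustration.

```python
from functools import reduce

# XOR the corresponding bytes of all the given blocks.
def parity(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data disks
p = parity(data)                     # stored on the parity disk

# Disk 1 fails: XOR the surviving data blocks with the parity block
# to reconstruct exactly the lost block.
rebuilt = parity([data[0], data[2], p])
print(rebuilt)    # b'BBBB'
```

XOR works here because a value XOR-ed with itself cancels out: the parity block is the XOR of all data blocks, so XOR-ing it with the survivors leaves only the missing block. The redundancy cost is one disk per stripe group instead of doubling everything, which is exactly the "less than double" saving asked for above.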
Then there is information theory, which says that if you add more parity bits you can even correct the error. These are called error correcting codes, and a variety of codifications are available. Thankfully for us, they are all implemented at the hardware level or in the operating system software, so we do not have to worry about them. Suffice it to say that if the operating system returns some data from the disk, it can be taken as guaranteed to be the correct data; otherwise it will tell us that the data is not available or the disk is corrupt. The disk as a whole, by the way, is safeguarded through other means, such as redundancy checksums calculated by the operating system quite independently of the database or whatever you have. So there are enough mechanisms which we do not have to worry about. But let us very quickly look at the different RAID organizations, the RAID levels, and what exactly they do. RAID 2, RAID 3, RAID 4, RAID 5, RAID 6 are the numbers given. On the internet, if you just search for RAID levels, you will find, I think Wikipedia also has an entry, descriptions of the different RAID levels and what they mean. The different RAID organizations or levels have differing cost, performance and reliability characteristics. For transactional databases, banking databases and so on, where you do not mind the cost but reliability is absolutely essential, you will do mirroring. There are installations which do striping plus mirroring: you will have six disks, notionally five for data and one for parity, and another six disks which completely mirror these six. That gives an absolute guarantee whatever happens, because you are talking about online accounting. Take an online trading system: what will happen to the Bombay Stock Exchange if somebody says, oh, I did not know, but two disks have failed at the same time?
So you cannot stop the national trading, you know. That is the reason why, ultimately, it is like pollution control: spend infinite money and you will have zero pollution. It is something similar: greater redundancy gives you a better safeguard. Is the general principle clear? You can read about the RAID levels elsewhere if you want more details than this. There are issues about where the RAID algorithms are implemented. It is possible to provide RAID through software, that is, an implementation done entirely in software with no special hardware support. Take the Unix operating system, for example; those of you who have played with Linux will know. You can have two disks on your system and make Linux treat those two disks as if they are one RAID disk. So you can do mirroring; with three disks you can do RAID striping and so on. Amitabh Isaac was one of our research scholars and a hacker of computer systems; he was an aerospace engineer working in aerospace engineering. He created a virtual RAID array using 40 disks on 40 different PCs connected on the network. On your local area network, where 40 PCs in the department are connected, ordinarily not more than 10 or 15 will be working simultaneously. So all that he said was: please leave your PCs on and let this so-called utility keep running on them, nothing else. That means my PC, even though I am not using it, is actually being used to support somebody else's data. Even if that fellow's disk goes, the logical RAID implementation, which is a distributed RAID implementation, takes care of it. So you can see software can do wonderful things. Ultimately all the logic is software; whether you bury it in hardware or run it as a software utility is your choice. When you implement the functionality in software, you call it software RAID.
But mostly in servers you have special-purpose hardware; for example, there is a RAID controller. The RAID controller itself has two separate connectors to two separate disks, and the controller manages the entire logic of writing, reading, etc. Naturally the implementation is much faster in terms of performance. Hardware RAID implementations also use non-volatile RAM to record writes that are in progress, because a power failure during a write can result in corrupted disks. So some such issues are there; no technology is perfectly foolproof, but you keep making things better and better. For example, our own information system in the institute: the servers on which it runs have RAID disks, so that a single disk failure will never affect it; of course they have backup and restore and everything else as well. Is this notion clear? Generally you will have hardware RAID on non-trivial servers which house information systems. But on the personal front, if you have, let us say, a useful information system for your office, or spreadsheets or whatever, in your subsequent professional activities, it might be useful to consider even a PC with two disks; two disks are normal these days. Buy a higher-capacity second disk and put in RAID logic, either hardware RAID or software RAID. Why? Because a 500 GB disk, which will be standard by the time you step out of the institute, join professional work and have money to buy your first high-end PC, cannot be backed up even with 4 GB DVDs. What if more than two disks fail? Yes, that is why I said that in some cases you will have dual redundancy: a 5 plus 1 striping-with-parity arrangement, and then the entire thing mirrored again. You can increase the redundancy to whatever extent you want. Generally, the modern SAN storage implementations that have been in place have not lost a single byte of data in the last 10 years.
Although many times individual disks have failed, not a single byte of data has been lost. That means the technology is now proven; that is what these technologies mean for operational purposes. Take Google: think of how many servers and how much data they deal with. Whatever probability you assume, the number of disks they have is so large that something or the other is always failing before your eyes; they could not survive without this kind of redundancy. This is one of the last few things I would like to mention: hot swapping. Hot swapping means replacing a disk while the system is running, without powering down. Some hardware RAID systems support this. It reduces the time to recover and improves availability greatly. What happens is that your disk has failed, and you have modern diagnostics: when the disk failure is indicated, the operator is told disk number 4 is gone. All you do is go there and pull that disk out while the system is running, put a new disk in, and the system continues to run. In fact, when RAID systems first came, one of the benchmark protocols that I followed for large applications in industry was to run the transaction posting system under a simulated load, and after the load stabilized, I would physically pull out a disk. When you physically pull out a disk, the performance dips, naturally, because the remaining disks now have to reconstruct the data, but then it stabilizes. Then you insert a new disk. Modern technology permits you a choice of how much of the system's processing time can be spent on reconstructing this disk. Ideally you should stop doing anything else until the disk is reconstructed, because during that time you are in danger. But in real life you want to keep supporting the actual transactions going on in the world, so from a business point of view you would like to reconstruct this disk only when the load is light.
But you don't want to take the risk. So what you can do, particularly for systems for which there is no day or night, is this. Imagine the Citibank system. Citibank works around the clock internationally, so somebody or the other is always withdrawing money from an ATM, doing something. The system is never shut down. In such cases, you allocate say 10% of the processing resources for reconstruction. While the other disks are serving you, 10% of the processing resources go into reconstruction. After a certain time the reconstruction is complete and the disk comes online. These are called hot-swappable disks. Absolutely no modern server is installed for any non-trivial application without such hot-swappable disks. These disks are slightly costlier because they come with a backplane. Ordinary disks have to be connected through cables to the controller and such things, but these come with a sort of steel plate and just slide in. So even when the power is on you can actually take one out and push one in. It is very reliable technology now. So this is called hot swap; that's all. Okay, here is a quick thing on optical disks. I think everybody seems to know everything about optical disks, so I'll just leave it for you to read. The compact disc read-only memory, or CD-ROM, can be loaded or removed. Typically 640 megabytes per disk; you can actually get 700 megabytes. They have high seek times: the optical read head is heavier and slower, so seek time is around 100 milliseconds. Higher latency too: they spin at about 3000 rpm, and they have a lower data transfer rate compared to magnetic disks. So you will notice that optical disks are much slower than hard disks. But from an individual perspective they are good enough. For example, you can watch movies from a CD-ROM without feeling anything; once the playback starts, there is enough buffering. But try to put in a CD and make it shareable amongst 20 users; then you will have fun. The response will vanish.
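The 10% reconstruction budget described above can be sketched as a simple throttled-rebuild simulation. This is an illustration of the trade-off only; the 10% figure comes from the lecture's example, and the tick and block counts are invented, not any controller's real interface.

```python
# Illustrative sketch of a throttled RAID rebuild: each tick of work,
# a fixed fraction goes to reconstruction and the rest serves live
# transactions. Numbers are made up for the example.

def rebuild_ticks(total_blocks, rebuild_share=0.10, work_per_tick=100):
    """Return how many ticks it takes to rebuild the failed disk
    when only `rebuild_share` of each tick's work is spent on it."""
    rebuilt = 0
    ticks = 0
    while rebuilt < total_blocks:
        rebuilt += int(work_per_tick * rebuild_share)  # rebuild work this tick
        ticks += 1                                     # remainder serves users
    return ticks

# A dedicated rebuild (100% share) vs. the 10% business-friendly setting:
assert rebuild_ticks(10_000, rebuild_share=1.0) == 100
assert rebuild_ticks(10_000, rebuild_share=0.10) == 1000
```

The trade-off is exactly as the lecture says: at 10% allocation the array stays responsive, but the window of danger, during which a second failure would lose data, is ten times longer.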
So they are really slow there. The DVD, digital video disc or digital versatile disc (both nomenclatures are used), holds 4.7 GB. The DVD-9 format holds 8.5 GB. DVD-10 and DVD-18 are double-sided formats holding 9.4 GB and 17 GB. Other characteristics are similar to CD-ROM. These have become by far the most popular backup storage, and not only backup storage: they have become the most popular distribution medium, the media on which software is distributed, the media on which data is distributed. One advantage of these, compared to floppies or any other mechanism, is that once recorded you cannot change the contents. So, for example, consider a bank branch which records, let's say, the transactions of the day. Earlier they used to send the data on a floppy. There is always a non-zero chance that somebody carrying the floppy might change your bank balance or do some fancy stuff. With a CD that is not possible. So from a security perspective, CDs and DVDs are considered more secure. So this is the story of optical storage. Magnetic tape everybody is familiar with; we have all seen tape recorders of the old vintage. Anybody who has not seen a tape recorder? A music tape recorder. Instead of music, imagine bits and bytes being written. Actually, nowadays even on tape recorders the music you record is recorded digitally. So only bits and bytes are written; their interpretation is different. Here it will be roll number, marks, whatever. Tapes are essentially sequential devices. While the capacity may be large, the way you read them is from beginning to end. You cannot directly put a B+ tree on a tape; you have to read sequentially, search, and pull. That is why tapes are never used these days for actual serious information processing. But tapes are used for backup.
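The point about tapes being sequential devices, with no way to host a B+ tree, can be made concrete with a small sketch. The record format here is invented for illustration; the point is only the access-cost contrast with an indexed disk file.

```python
# Sketch: a tape supports only sequential reads, so locating one record
# means scanning from the beginning. Records here are made-up (key, value)
# pairs standing in for tape blocks.

def tape_find(tape_records, key):
    """Scan records in order; return the value and how many reads it took."""
    for reads, (k, value) in enumerate(tape_records, start=1):
        if k == key:
            return value, reads
    return None, len(tape_records)

tape = [(i, f"record-{i}") for i in range(100_000)]

# Worst case: the record is at the end, so the whole tape is read.
value, reads = tape_find(tape, 99_999)
assert reads == 100_000
# A disk-resident B+ tree would reach the same record in a handful of
# block reads (roughly log of the file size in the tree's fanout).
```

This linear cost, plus the physical wind time to reach the far end of a reel, is exactly why tapes are confined to backup and archival roles.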
And a typical backup device, for example, is either a DAT, digital audio tape, which is a very small tape that can hold up to 40 GB of data, or a DLT, digital linear tape, which can hold 100 GB plus, or 330 GB with Ampex helical-scan formats and so on. Transfer rates are a few to tens of megabytes per second, so a backup takes a long time. And by the way, the DLT technology guarantees that even though the reading and writing mechanisms may change over the years, if you recorded some data on a DLT tape 10 years ago you can still read it exactly, without any problem. That is the guarantee the DLT technology gives. That means all upgrades are backward compatible in terms of recording formats and so on. That is necessary. Otherwise, imagine you have a lot of backups on floppies and suddenly floppies disappear. What will you do with those floppies? It appears to be a laughable matter, because why should anybody do such a stupid thing? After all, if the technology changes, I should change with it. But ask the large number of businesses who used to keep backups on large magnetic tapes. Technology changed; they also changed. But nobody remembered that the 400-odd tapes kept in the silos were no longer readable, because nobody had the technology any more. I have known a large corporation that paid through the nose to an agency which had a tape drive that could read that old data format and which they had managed to connect to a new computer. They were charging money not for computer time but for having the only magnetic tape reader of that kind in the country. And people paid a lot of money because there was no other way out: you had to take that data out and put it onto something readable. Now this is something so different from human writing technology.
You know, when you write, whether you have written on bhojpatra (birch bark) or on a modern notebook, as long as the script is the same it is still legible, even today, even after 4000 years you can still read it. Such is not the case with digital storage. The lifespan of a digital storage technology is very limited, and we have not yet found an answer to it. There is no longevity for digital records. Digital records of 10 years ago don't make sense today. Consequently, the digital records that you and I swear by today, namely these CDs and DVDs, may not make any sense 10 years later. So in a modern information system scenario, whatever technology you deploy, you must always remember to do an annual review: where are storage and backup technology heading? What are my current assets in terms of backups? And when should I convert all my backups from the existing technology to something different? If I don't do that, I will be jeopardizing the position of the organization as far as the information system is concerned. So this should be kept in mind. Okay, so magnetic tapes are used mainly for backup and for storage of infrequently accessed, basically archival, information. There are tape jukeboxes, by the way; this is something very interesting. You know what a jukebox is? In the old times, when you had gramophone records, there would be multiple records; you put a coin inside the box and choose a song, and the mechanism would move, that particular record would drop in, and the music would play. Now instead imagine a jukebox of tapes: you say I want the fifth tape, the mechanism moves, the fifth tape drops into the reader, and you can read it. With 10 or 20 tapes you can have this beautiful moving arm. But what if you have 1000 tapes? Then you actually have a large almirah, a two-sided almirah.
So on one side there are all these tapes, on the other side there are all these tapes, and in between there is a robot arm. The robot actually moves. You say tape number so-and-so, the robot moves, picks out that tape, brings it back, and puts it in the reader. You may not have only one reader; you may have four or five tape readers. These jukeboxes, I know, cost crores of rupees, but they guarantee that you can actually get the data back. This is called near-line restore. It means you don't have enough disk capacity to keep all that data online, but if you want to recover somebody's data which has gone into archival storage, you don't need physical people hunting for tapes in cupboards; the hunting is done automatically. The recovery time is anywhere between 10 seconds and about one minute: 10 seconds if the tape is already in the reader, so you just re-read that tape into storage; otherwise you instruct the robot to go to that particular tape, say the one in drawer 1324, pick it out, and put it here. An additional question that I will leave for you to think about: just as there is an indexing issue for disk files, when I have a huge backup spanning 1000 tapes, and not all of these tapes contain sequential backups — some contain an incremental backup of this database, some of something else — how will you index these tapes? And where will you keep this index itself? These are some very non-trivial and large issues in information systems which the backup technology must answer. I think we will stop here now.
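One possible answer to the closing question is to keep a small catalog on disk, not on tape, mapping each backup to the jukebox slot that holds it. The sketch below is a hypothetical design, not any vendor's catalog format; the field names, slot numbering, and restore-planning rule (latest full backup plus subsequent incrementals) are all assumptions made for illustration.

```python
# Hypothetical tape-catalog sketch: a disk-resident index over jukebox
# tapes, answering "which tapes, in what order, restore this dataset?"
# All field names and slot numbers are invented for the example.

from dataclasses import dataclass

@dataclass
class BackupEntry:
    dataset: str   # e.g. "accounts-db"
    date: str      # ISO date of the backup
    kind: str      # "full" or "incremental"
    slot: int      # jukebox slot holding the tape

class TapeCatalog:
    def __init__(self):
        self.entries = []

    def record(self, entry):
        self.entries.append(entry)

    def restore_plan(self, dataset, date):
        """Latest full backup on or before `date`, followed by the
        incrementals after it, in the order they must be applied."""
        fulls = [e for e in self.entries
                 if e.dataset == dataset and e.kind == "full" and e.date <= date]
        if not fulls:
            return []
        full = max(fulls, key=lambda e: e.date)
        incs = sorted((e for e in self.entries
                       if e.dataset == dataset and e.kind == "incremental"
                       and full.date < e.date <= date),
                      key=lambda e: e.date)
        return [full] + incs

cat = TapeCatalog()
cat.record(BackupEntry("accounts-db", "2007-01-01", "full", 17))
cat.record(BackupEntry("accounts-db", "2007-01-08", "incremental", 42))
plan = cat.restore_plan("accounts-db", "2007-01-10")
assert [e.slot for e in plan] == [17, 42]
```

Note the lecture's second question is the harder one: this catalog itself must live somewhere safer than the tapes it describes, typically mirrored on disk and backed up separately, or losing the index makes 1000 tapes nearly useless.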