So, in this morning's second session, we are going to be covering database internals. Given that we don't have much time in this workshop, this will be a highly condensed version: I would like to go over the slides quickly and leave time for discussion, so I am only going to cover a small subset of the slides. In the final December workshop we will go into a little more detail than this. In the two hours this morning I will cover essentially three chapters: storage and file structure, indexing, and query processing. So, what are the issues in storing data? What are the different types of storage media, and how do they affect the design and the performance of a database? These are very important questions which we will briefly address. The next topic is RAID systems, which are really critical to ensure that data is not lost; some of the ideas used in RAID systems are also used at a different scale for replicating data across different sites for disaster recovery. We will also briefly talk, if time permits, about file organization and buffer management. So, physical storage media, as you are no doubt aware, can be classified as either volatile or non-volatile. Volatile media lose their data the moment power is switched off; on non-volatile media the data persists. Now, there are many kinds of media here, and there is a hierarchy, starting with cache memory at the top, which is volatile but extremely fast and typically sits in the processor. Next you have main memory, which is significantly slower than cache memory but a lot bigger. In the last five years or so, flash memory has become very important, cheap and large, and it is very widely used today at a level in between memory and disk.
Disk is even cheaper than flash, but in comparison it is a lot slower, especially for random IO, and this has had a major impact on the design of databases and of query optimizers. The fact that magnetic disk has slow random IO has had a huge impact, and because flash memory does not share some of those drawbacks, the internal design of databases is today being reworked: every major database company is rethinking its internal architecture and its query optimization and processing strategies to take flash memory into account. Then there are optical disks and magnetic tapes, which are becoming increasingly less important for storing the kind of data that typically lives in a database system. These levels are called primary storage (cache and main memory), secondary storage (what mainly concerns us: flash and magnetic disk), and tertiary storage (traditionally magnetic tape and optical storage). Some people say that for a lot of databases flash is going to become secondary storage and disks are going to become tertiary storage in the years to come. But although disk may become tertiary in a few years, for the moment we still have to live with it for any substantially large amount of data.
So, we need to understand the characteristics of magnetic disks and how they affect how data is stored and retrieved, and for that we need to know a little about how they work mechanically. This is of course a schematic diagram; real disks do not look anything like it. The key things to notice are a set of magnetic platters spinning around a spindle, fairly fast, on the order of 5400 to 7200 revolutions per minute, and an arm assembly that can swing (the diagram does not show it that well) from the inner portion of the platter to the outer portion. When the arm is stationary and the disk is spinning, the part of the platter that passes under the read-write head is called a track. A disk may have multiple platters (these days disks do not have all that many, just a few), and the set of tracks at the same position on each platter, all put together, forms a cylinder. Each track is divided into units called sectors, which are the smallest unit for reading and writing from the disk, but database systems usually combine multiple sectors into a logical unit referred to as a block. Operating systems do the same thing; a database system may use the operating system's block size or a different one. That was a single hard disk. Now, most large-scale databases do not have a single disk; they have a collection of disks, and the question is how you organize all the disks together and give a single view to the database.
The database should not have to worry about how many disks there are, what the capacity of each individual disk is, and so on. There are two ways of doing this. One is a storage area network, which is basically a collection of disks, possibly combining multiple physical disks into one or more logical disks that look much larger and perform better. The key point is that these disks are not directly attached to the computer. Why? Because computer systems do have failures: if a computer fails, you do not want the data to be inaccessible; you would like other computers to be able to access it. So a storage area network is a collection of disks shared across multiple computers, each of which can access the disks directly and gets the view of a single large disk. Of course, if multiple machines update the same disk there will be chaos, so the software has to take care that you do not write garbage on the disk because multiple computers are clashing with each other; that has to be dealt with. In network-attached storage, on the other hand, the fact that there is a disk is hidden; what is exposed is a file system view. If you have used NFS or AFS or file sharing in Windows, all of these present a network-attached storage interface, which hides the disk: you just see a file system. Now, when you have a disk, how do you measure its performance? The first measure is access time: the time from when you issue a read or a write to when the data is actually stored or retrieved. What is involved in this access time for a hard disk? There are two components. First, the data you are accessing may be on a specific track of the disk, which means the disk arm has to move to that track; this time is called the seek time.
Once the disk arm has moved to the track, the disk has to rotate until the specific sector or block you are reading or writing comes under the head; only then can it be read or written. That is the rotational latency. Now, the seek time is of the order of 4 to 10 milliseconds depending on the disk, and the rotational latency is similar. So the overall latency is of the order of 8 to 20 milliseconds, though it can be a bit less: the average latency for some high-end disks can be as low as 5 milliseconds, and more typically it is of the order of 10 milliseconds. And 10 milliseconds is a fairly large amount of time if you consider that it means a single disk can do at most about 100 separate read-write operations per second at different locations. So the problem is that a disk does not support a large number of random IOs; random IO is a huge problem for disks. In contrast, once the disk arm has moved to a position, the disk can read data extremely fast, and these speeds have been improving steadily over the years: today a single disk can give anywhere between 25 and 100 megabytes per second, and with multiple disks you can go much faster. In fact, some high-end flash storage systems give on the order of 400 megabytes per second, which is extremely fast, and it is actually hard for the CPU to keep up with such transfer rates. So in many databases today, if you do sequential IO the bottleneck is the CPU, not the disk; the solution there is of course a parallel database, with multiple CPUs processing parts of the data. Another problem with disks, as also with flash, is that they are susceptible to failure. How often they fail is characterized by the mean time to failure, which assumes that the failure rate is uniform.
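The arithmetic behind the "about 100 random operations per second" figure can be sketched as follows, using the numbers quoted above (the specific seek time and spindle speed chosen here are just illustrative values within the stated ranges):

```python
# Rough disk access-time arithmetic. Average rotational latency is half
# a revolution; total random access time = seek + rotational latency.

avg_seek_ms = 5.0                        # an assumed average seek time
rpm = 7200                               # spindle speed
avg_rotation_ms = (60_000 / rpm) / 2     # ~4.17 ms at 7200 rpm

avg_access_ms = avg_seek_ms + avg_rotation_ms
random_iops = 1000 / avg_access_ms       # random I/Os per second

print(f"average access time: {avg_access_ms:.2f} ms")
print(f"random IOPS:         {random_iops:.0f}")
```

With these numbers the average access comes to roughly 9 milliseconds, i.e. on the order of 100 random reads or writes per second per disk, which is why random IO is the bottleneck.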
In reality, disk drives tend to fail more when they are new and again after a few years when they are growing old; in between they fail less often. The mean time to failure that manufacturers claim usually applies to this intermediate period between the initial high failure rate and old age. The expected frequency of failure is translated into a mean time to failure, and the claim is of the order of a million hours for a typical disk, which sounds quite decent. But if you have a large collection of disks, say a thousand, you can expect one disk to fail every thousand hours, and a thousand hours is less than 50 days. So any parallel machine with multiple disks is bound to see lots of failures, and if each of these failures results in loss of data, you are in deep trouble. Just because a disk fails doesn't mean you should lose data; you protect against that basically using RAID, which we will come to in a moment. But before that I also want to briefly talk about flash, which as a technology has been around for a while, but whose widespread usage is in the last five years or so. Now, what are the characteristics of flash? Flash is very widely used and very cheap: pen drives, cameras, cell phones, everybody uses flash. But in the context of databases, what are the issues? If you look at NAND flash, a single flash device may give only about 20 megabytes per second, but manufacturers typically put a large number of these in parallel to get much higher bandwidth. So a solid-state disk, as opposed to your pen drive, keeps multiple flash devices in parallel internally and gives much higher bandwidth. Flash is not quite like memory.
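The thousand-disk arithmetic above is simple but worth making explicit; with independent, uniform failures, the expected time between failures in an array shrinks in proportion to the number of disks:

```python
# MTTF arithmetic for a disk array: with n disks each rated at a given
# MTTF, the array sees a failure roughly every (MTTF / n) hours.

mttf_hours = 1_000_000     # manufacturer-claimed MTTF of one disk
n_disks = 1000

array_mttf_hours = mttf_hours / n_disks    # one failure every ~1000 hours
array_mttf_days = array_mttf_hours / 24

print(f"expected failure every {array_mttf_hours:.0f} hours "
      f"(~{array_mttf_days:.0f} days)")
```

That works out to a failure roughly every 42 days, which is why a machine with a thousand disks must treat disk failure as routine rather than exceptional.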
One of the tricks used to reduce the cost of flash is that flash memory is accessed somewhat serially: you can access the first bit of a page, and then the remaining bits have to be read sequentially, one behind the other. You cannot randomly access a byte within a page. So the unit of reading from a flash device is actually a page. How big is a page? It depends on the device, but 512 bytes is a typical flash page. So flash is not truly random-access, but it is close, because 512 bytes is a fairly small unit. The key difference between flash and magnetic disk is that on a flash device, once you have written a particular page, you cannot update it in place. What does that mean? To make any change to that page you first have to erase it, and the erase operation is different from a write. A write is quite fast, a few microseconds, but an erase is very slow, a millisecond or more. On the other hand, an erase can erase a large number of pages at one go: in one millisecond you can erase a block containing many pages. So how do you handle this situation? You cannot immediately overwrite a page, and you don't want to wait a millisecond to erase and then write it. So what all flash devices do is have a layer on top which provides a remapping of logical page addresses to physical page addresses. If I have written a logical page and I want to write it again, it is actually written somewhere else on the flash device, not in the same physical location. All of this is hidden from you: when you use a flash drive, this is happening internally, managed by a layer of software, and you don't know about it. But how does that impact a database system?
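The remapping layer just described can be sketched in miniature. This is a toy model, not a real flash translation layer (the class and its interface are invented for illustration): each write to a logical page lands on a fresh physical page, the old copy is only marked stale, and stale pages are reclaimed later by the slow erase:

```python
# A toy flash translation layer: writes never update a physical page in
# place; the mapping table redirects the logical page to a fresh physical
# page, and old copies wait for a (slow) erase. Hypothetical, simplified.

class ToyFTL:
    def __init__(self, n_physical_pages):
        self.mapping = {}                       # logical page -> physical page
        self.pages = {}                         # physical page -> contents
        self.free = list(range(n_physical_pages))
        self.stale = []                         # old copies awaiting erase

    def write(self, logical_page, data):
        phys = self.free.pop(0)                 # always take a fresh page
        old = self.mapping.get(logical_page)
        if old is not None:
            self.stale.append(old)              # old copy becomes garbage
        self.mapping[logical_page] = phys
        self.pages[phys] = data

    def read(self, logical_page):
        return self.pages[self.mapping[logical_page]]

    def erase_block(self):
        # the expensive erase: reclaim all stale pages at one go
        self.free.extend(self.stale)
        self.stale = []

ftl = ToyFTL(4)
ftl.write(0, b"v1")
ftl.write(0, b"v2")    # rewrite of logical page 0 lands elsewhere
```

After the second write, logical page 0 points at a different physical page, and the first physical page sits on the stale list until `erase_block` reclaims it; this is why sustained rewrites eventually force the device to spend time erasing.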
The issue is repeated writes to the same page. In memory, writing a page again and again is not a big deal; on disk, again, it is not a big deal. On flash, if you do a lot of writes, the fact that the old copies eventually have to be erased catches up with you: you cannot keep writing indefinitely, because you must spend time erasing eventually. You can deal with short bursts of writes, but not a sustained stream. So a fair amount of effort is going into database design to minimize the number of repeated writes to the same page. There is another issue: if you write the same physical page too often, it eventually gets damaged physically. That again is hidden from you by something called wear leveling; I won't get into the details. Moving on to RAID: the key idea behind RAID is that you keep some extra information, so that if a disk fails you can use the other copy. There are actually two concepts in RAID. One is parallel IO; the other is protecting the data in spite of failures. Redundancy is the part that protects the data. One simple way is, wherever you had one disk, you have a second disk, and every write goes to both disks, so they are kept identical. This is called mirroring, and it is a simple but very effective technique to protect against data loss: if one disk fails, or even if just one block of a disk gets damaged, you can read from the other. What is the probability that data is lost? Data is lost only if both disks fail. Why would both disks fail? It could be because the machine room burns down, but that is less common. Setting that kind of failure aside and considering the usual, more common kind of disk failure, what matters is the probability that the second disk fails before you repair the first one. If a disk fails and you do nothing about it, eventually the other disk may also fail.
That's a bad idea. So, the moment a disk fails, or soon after, you need to replace it, copy the data from the surviving disk to the new disk, and bring it back. In the window before the disk is replaced, there is a danger that the second disk also fails. So the probability that you lose data is proportional to the failure rate times the time to repair. If you take one day to repair, there is a certain probability; if you take one month, the probability that you lose data goes up 30 times. So you would like to repair it soon. In fact, there are RAID systems which keep a spare disk, and if a disk fails they start rebuilding onto the spare immediately, before any human is even involved. Within a few hours they can recover the failed disk, and the chance of data loss becomes small. The second part of RAID is that you can stripe data across multiple disks. If you are reading a large file, you can then read it in parallel from all the disks. How do you spread the data across multiple disks? The standard way is to take a block of a file (how big the block should be, we will see), write it to the first disk, write the next block to the second disk, the third block to the third disk, and so on. Now when I read a large file, I read its blocks in parallel from all the disks. What is the benefit? Much higher bandwidth: if I need to retrieve data fast, I can read in parallel from the disks. That is the second benefit of RAID. Block-level striping is what is typically used; there are some alternatives which we are not going to bother about. Then there are what are called RAID levels. RAID level 0 is striping of blocks across disks with no mirroring. RAID level 1 has striping plus mirroring: every disk in the stripe is also mirrored. This is fairly common.
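The round-robin block placement described above (first block to the first disk, second block to the second, and so on) is just modular arithmetic; a minimal sketch:

```python
# Block-level striping: block i of a file goes to disk (i mod n), at
# stripe position (i // n) on that disk. Illustrative only.

def stripe_location(block_no, n_disks):
    """Return (disk index, block offset on that disk) for a file block."""
    return (block_no % n_disks, block_no // n_disks)

n = 4
layout = [stripe_location(i, n) for i in range(8)]
# blocks 0..3 land on disks 0..3; blocks 4..7 wrap around to offset 1
print(layout)
```

A sequential read of the file touches all four disks in turn, which is where the bandwidth multiplication comes from.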
RAID 0 is used only if your data is already saved somewhere else: you are just keeping a local copy and can always retrieve it again. Otherwise you use RAID 1, or RAID 5, which does not actually keep a full copy of the data. Instead, for each set of blocks, say the first block on each of 4 disks, a parity block is computed and stored on a 5th disk. If one disk dies, you can recompute its failed blocks using the surviving blocks and the parity blocks. There are some further details on distributed parity; for lack of time I will not get into them. But what is the benefit of RAID 5? The overhead is now 1 extra disk per 4 disks, whereas RAID 1 needed 4 extra disks per 4 disks. This 4 is not a magic number; you could use 5, 8, whatever, but if you use more disks there are some drawbacks. In particular, RAID 5 is quite bad for random IO: the moment you want to write a block on one disk, you also have to recompute the parity block, which means reading the old parity and writing the new parity to another disk. So for random IO RAID 5 is actually quite bad; for sequential IO it is okay, and it gives you good performance at a low space overhead across multiple disks. These levels are used in different settings, depending on whether you want high random-write performance or you are storing a lot of data and want to minimize the storage overhead. That is the choice of RAID level. There is one other issue in RAID, usually distinguished as software versus hardware RAID, although some hardware RAID controllers do not actually implement all of these features. The basic idea is the following: when you are writing data to two disks, there can be a failure between the two writes. Maybe the power goes off, and now the two disks are inconsistent.
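The parity reconstruction described above is just byte-wise XOR, which is worth seeing once in miniature; this sketch uses 4 data blocks and 1 parity block, the same layout as in the discussion:

```python
# RAID 5 parity in miniature: the parity block is the byte-wise XOR of
# the data blocks, so any one lost block equals the XOR of all the
# survivors plus the parity.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"aaaa", b"bbbb", b"cccc", b"dddd"]   # one block on each of 4 disks
parity = xor_blocks(data)                      # stored on the 5th disk

# suppose disk 2 fails: rebuild its block from the survivors and parity
survivors = [data[0], data[1], data[3], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)   # identical to the lost block
```

The same identity is why a small random write is expensive: changing one data block forces the parity block to be read, re-XORed, and rewritten.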
When you come back up after power returns, how do you know the two disks are inconsistent? RAID depended on having two copies of the data; now one disk has an old copy and one has a new copy, and if the disk with the new copy fails, you only have the old copy: you have lost data. So, how do you deal with this problem? Hardware RAID deals with it by keeping a small amount of non-volatile RAM where the write is recorded, so that when the system comes back up after a failure, it completes any incomplete write and both disks come back in sync. The alternative is to scan the two disks and compare them, and where they differ, copy the write across so they are brought back in sync. Again, all of this is hidden by hardware RAID controllers, and typically you do not have to worry about it. So that was a very quick overview of RAID; there are more details in the book if you are interested. Now, that was the hardware level: how you actually store data on the physical device. But databases also need an abstraction: how do you store records on disk? How are records organized? There are different ways to do it, and these are called file organizations. A database itself is typically implemented on top of an operating system, although high-performance databases bypass the operating system and access the disk directly, so the operating system does not come in the way. If you want really high performance, you would bypass the OS, but for most normal users that causes more trouble than it is worth. So the database system uses operating system files to store its data. How the file system organizes files on top of the raw disk is a subject for a different course, not this one, so we will not bother about it. What we care about is: how are records stored in files?
If you assume records of fixed size, one way, very widely used since the earliest days of computers, is to store records one after another in a sequential file. If a record is 20 bytes, bytes 1 to 20 are the first record, 21 to 40 the second record, and so on. Within each record, fields are at fixed offsets, so you know where to start reading the name, where to start reading the salary, and so on. The earliest databases all insisted on fixed record sizes because they were very easy to map to files. But there are some issues; I will come to variable-length records in a moment. With fixed-size records, what happens if you delete a record? If you deleted record 3, you could move all the subsequent records up in the file, but that is very inefficient: the file can be huge, and you do not want to do this. Another option is to move the last record of the file up into the vacated space. Yet another option is to leave the space empty and keep a list of the spaces which are currently empty, so that when you insert a new record you can reuse one of them. These free lists are more commonly used than the previous alternatives. But today's databases do not restrict themselves to fixed-size records. Why? Primarily because programmers don't want strict limits on things like names. Early-generation databases would say the name can only be 10 or 20 bytes. What if a name is longer? We have many people with much longer names, and their names get truncated. So how do you decide how long the name field should be? Programmers would prefer to let the name be as long as it needs to be. But if you allocate a lot of space for a name, say 100 bytes, then for most people you are wasting the remaining bytes: if a name is 10 bytes, the other 90 bytes are wasted. That is a bad trade-off.
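The free-list alternative just described can be sketched as follows; the class and method names here are invented for illustration, and a real system would of course track byte offsets within a file rather than Python list slots:

```python
# Fixed-length records with a free list: deleted slots are remembered
# and reused by later inserts, instead of shifting records around.

class FixedRecordFile:
    def __init__(self):
        self.records = []        # slot -> record, or None if deleted
        self.free_list = []      # slot numbers available for reuse

    def insert(self, record):
        if self.free_list:
            slot = self.free_list.pop()     # reuse a vacated slot
            self.records[slot] = record
        else:
            self.records.append(record)     # otherwise append at the end
            slot = len(self.records) - 1
        return slot

    def delete(self, slot):
        self.records[slot] = None
        self.free_list.append(slot)

f = FixedRecordFile()
f.insert("r0"); deleted_slot = f.insert("r1"); f.insert("r2")
f.delete(deleted_slot)
new_slot = f.insert("r3")    # lands in the vacated slot, no shifting
```

Nothing ever moves, which is exactly why this beats the shift-everything-up and move-the-last-record schemes for large files.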
So the database should support variable-length records, and we will see how to handle them. There are again two aspects to variable-length records. First, how do you lay out the fields within a record so that there is no wasted space? There are some details here which, for lack of time, I am going to skip. The other aspect is how you allocate records to pages so that you don't waste space; let me spend a minute or so here. Typically a file consists of multiple blocks, and within each block you have multiple records. The way it's stored is that the records are packed consecutively at one end of the block. If you delete a record, what happens? You move the records to fill that space, the same scheme we discussed at the beginning for fixed-length records, now applied to variable-length records; the key difference is that you do it only within a page. You are not going to move an entire file's worth of records, just one page's worth. Now, how do you know where a particular record is on the page, if records move around? The page has a header which keeps, for each record in the page, its size and its location. So externally, to identify a record in that page, I give the address of the page and an index into this header, say the third record of the page. To find the third record, I go to the third header entry and see where that record starts; I also know how big the record is, so I can read its bytes. How do I know what the contents of that record mean, that is, how to interpret its bytes? That is the field-layout question we skipped, but it's there in the previous slide. So this is how records are organized into pages. Now, how do you decide which page a particular record should go to? There are several alternatives.
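The page layout just described (often called a slotted page) can be sketched in a few lines; this toy version models only the bookkeeping, with records identified by (page, slot) and the header slot array holding each record's offset and length, so records can move within the page without their external identifier changing:

```python
# A toy slotted page: the slot array maps a stable slot number to the
# record's current (offset, length) inside the page. Layout simplified
# for illustration; a real page packs this into a fixed byte format.

class SlottedPage:
    def __init__(self):
        self.data = b""          # records packed consecutively
        self.slots = []          # slot -> (offset, length)

    def insert(self, record: bytes) -> int:
        offset = len(self.data)
        self.data += record
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1            # the record's slot number

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return self.data[offset:offset + length]

page = SlottedPage()
s0 = page.insert(b"alice|50000")
s1 = page.insert(b"bob|42000")       # variable lengths, no padding
```

On deletion, a real implementation would compact `data` and update the affected `(offset, length)` entries, while the slot numbers handed out to the rest of the system stay valid.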
In a heap file, which is the default in pretty much all databases, a record can go anywhere. If you are adding records continuously to a file, they will go to the last page or block of that file, but if you have deleted records in between and then insert a new one, it can go anywhere; you don't control it. In a sequential organization, you decide that the file should be sorted on a particular key, which could be the primary key or some other attribute of the record, and the system keeps it sorted in spite of insertions and deletions. In hashing, you compute a hash value on the key, and that decides which page the record goes to. For lack of time, we are not going to cover hashing in this course; we will very briefly look at sequentially organized data in the context of B-trees. So, that is how records are allocated to the blocks of files. You typically have a file per relation, although some databases allow you to store multiple relations in a single file, and most databases also allow you to split a relation across multiple files. Again, as a database user you don't know anything about this; it's done transparently by the database. In fact, large relations are typically spread across multiple files in pretty much any database system. Now, the next higher level is: how do you know what relations are in the database, which files contain the records of each relation, what the schema of each relation is, and so on? This is the relational view of the data, and it is stored in what is called the data dictionary or catalog. The catalog has various kinds of information: for example, what are all the relations in the particular database? Typically the database is organized as multiple schemas, so you keep track of which schema contains which relation, and then the catalog has the names, types and lengths of the attributes of each relation.
It has views and their definitions. It has integrity constraints. It also has more: user information such as passwords, and statistics about relations which are used for query optimization. It records where and how each relation is stored: perhaps the file names, or the physical locations if you are bypassing the file system. It records what indices exist. All of this is the data dictionary. Now, the data dictionary is itself storing information, so how would you organize that information? You could model it: you could draw an ER diagram for the information that goes into a data dictionary and then map it to relations, and in fact a relational representation is what is actually used to store the data dictionary. Pretty much every database lets you access this information through a special set of tables with special names. What are those tables called? It depends on the database you are using; there is no standard. In PostgreSQL, they are the tables with a pg_ prefix; in Oracle, I think they are the SYS-owned tables. Other databases have different conventions, but you can actually access those relations directly. Which of course brings up an issue: we said that the information about where a relation is stored is in the catalog. Now we have a chicken-and-egg problem. How do I find where the catalog relations are stored? In particular, for the specific catalog relation which keeps track of where relations are stored, I would need to use that relation to find out where it itself is stored, which is of course impossible. So, as a special case, that one relation is stored at a standard location in the database. When you create the database, the first block, or one of the early fixed blocks, is used to store that relation; it starts from there. So you can go to that fixed location and find what the relations are and where they are stored physically.
Included in these are all the remaining catalog relations, which you can then use. What databases also do is read the catalog information into memory when the database starts up, so they do not have to go to disk for every small task; the catalog, or at least parts of it, is replicated in memory while the database is running. This schema diagram shows a toy catalog; for lack of time I am going to skip it, but it should be fairly straightforward. The last topic in this chapter is the database buffer. We have seen how data is organized on disk, and we have seen how to find where data on disk is and fetch it. Now, every time you need to access data, one alternative is to go through the catalog, find out where the data is supposed to be, go to that file, and fetch the data. But if you access a certain piece of information repeatedly, this becomes a big burden: every single access results in a random IO, and it is going to be very, very slow. So it is absolutely critical to have a cache, or buffer, of blocks from the database in memory. If you access a particular block and then access it again soon, it should be in the buffer, so you do not pay the disk IO cost. This is absolutely critical, and the database buffer plays this role. So, what is the role of the buffer? It holds a set of pages in memory which contain copies of disk blocks. Which disk blocks? That changes dynamically: when you read a block, it is read into the buffer. What if the buffer is full? Then you have to throw one of the existing blocks out of the buffer, so there is a buffer replacement strategy. You may be familiar with replacement strategies from operating systems: LRU, clock and so on. Databases typically use more specialized strategies, because they typically know what you are going to do with the data.
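As a baseline for the replacement strategies just mentioned, here is a minimal LRU buffer pool sketch (the class and its interface are invented for illustration; as noted above, real databases use smarter, workload-aware policies than plain LRU):

```python
# A baseline LRU buffer pool: blocks read from disk are cached in
# memory; on a full pool, the least recently used block is evicted.

from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk   # called only on a miss
        self.pool = OrderedDict()              # block_no -> contents, LRU first

    def get(self, block_no):
        if block_no in self.pool:
            self.pool.move_to_end(block_no)    # hit: mark as most recent
            return self.pool[block_no]
        if len(self.pool) >= self.capacity:
            self.pool.popitem(last=False)      # evict least recently used
        self.pool[block_no] = self.read_from_disk(block_no)
        return self.pool[block_no]

disk_reads = []
bp = BufferPool(2, lambda b: disk_reads.append(b) or f"block-{b}")
bp.get(1); bp.get(2); bp.get(1)   # third access hits the buffer
bp.get(3)                         # pool full: evicts block 2, not block 1
bp.get(1)                         # still cached, no disk IO
```

Tracking `disk_reads` shows the point of the buffer: repeated accesses to block 1 cost one disk IO, not three.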
You are doing a join, a selection, a relation scan. Using this knowledge, they can manage the database buffer more cleverly than an operating system could, since the OS does not know what you are doing with the data. That is the job of the buffer manager. I am going to skip these slides; we have already seen this. So, that was a lightning overview of storage and data access. Are there any questions on this before we move on to the next chapter?