Welcome to the 17th lecture in the course Design and Engineering of Computer Systems. In this lecture, we are going to continue our discussion of file systems in operating systems. First, a recap of the previous lecture. We have seen that a file is stored non-contiguously on disk: a file is split into fixed-size blocks, and the block numbers of a file are kept track of in the inode of the file. The inode is the metadata of the file, while the data blocks hold its contents, and the inode number uniquely identifies a file. Every file name and its inode number are recorded in the directory containing the file; a directory is itself a special file whose data blocks store the mappings between file names and inode numbers for all files in that directory. Any time you want to find the inode number of a file, you recursively traverse the path name starting from the root, looking up the inode number at each step. On disk, a file system consists of a superblock, data blocks, and metadata blocks such as inodes and bitmaps. In memory, a file system maintains several data structures. The file descriptor array, which is part of the PCB of a process, has pointers to the open file table entries of the files opened by that process. The open file table has information about all open files, such as the current offset and a pointer to the inode, and this in-memory inode is a cached copy of the actual inode on disk. You also have the disk buffer cache, which caches recently accessed disk blocks. All of these are the data structures of a file system on disk and in memory, and we saw them in the previous lecture. Now, in this lecture, we will see how system calls like open, read, and write are implemented.
So now let us begin with the open system call. The syntax is simple: open takes a path name as an argument, along with a few flags that describe how the file should be handled, and it returns a file descriptor, a handle, which is just a number. What does the open system call do? It first traverses the path name, in the way we saw in the previous lecture, walking the path until it finds the inode number of the file. If the file already exists, this lookup succeeds. What if the file does not exist? Then the search in the directory finds no entry for a.txt, and if the flags indicate that the file should be created if it does not exist (that is one of the flags you can give), we allocate a new inode from the free inodes on disk and add a new entry in the containing directory, say foo, mapping a.txt to this new inode number. So a file can also be created, if required, during the open system call. Once you have identified the inode number of a.txt, or assigned a new inode and inode number if one did not exist before, you copy the inode into memory: whenever you open a file, its inode is brought into memory for faster access in the future. Then you create all the connections: a new entry is created in the open file table with a pointer to the in-memory copy of the inode, a pointer to this open file table entry is stored in the file descriptor array of the process, and the index of that slot in the file descriptor array is returned to you as the return value of the open system call.
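As a small illustration, here is a sketch in Python, whose os module exposes the open and close system calls almost directly. The scratch directory and the file name a.txt are made up for the example:

```python
import os
import tempfile

# A scratch directory and a hypothetical file name for illustration.
d = tempfile.mkdtemp()
path = os.path.join(d, "a.txt")

# O_CREAT asks open to allocate a new inode and add a
# (name -> inode number) entry in the directory if the
# file does not exist yet.
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
print(fd)  # a small integer: an index into the process's fd array

os.close(fd)
```

Since descriptors 0, 1, and 2 are taken by the standard streams, the first file a fresh process opens typically receives descriptor 3.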
So every process, by default, when it is created, has three files already open: entries 0, 1, and 2 of the file descriptor array are allocated for standard input, standard output (which usually points to the screen), and standard error. Subsequently, the first file you open after starting a program gets the next entry, and so on, returned to you as a file handle. The close system call is simply the opposite: it frees up the file descriptor entry, the open file table entry, and all the other connections that were created as part of the open system call. That is a summary of how the open system call works. Once you get this file descriptor, you use it in all subsequent system calls to read or write the file. How? From the file descriptor you get the pointer to the open file table entry, from there you get the pointer to the inode, and the inode has all the data block numbers of the file, from which you can figure out which block on disk to access to read or write the file. That is why this return value, the file descriptor, is important. Now, how do you read a file? Let us look at the read system call in a little more detail. The first step I have just described: using the file descriptor, you eventually reach the inode of the file. From the inode, you identify which data blocks of the file to read. The read system call takes as arguments the file descriptor, a buffer (some user-space memory), and the number of bytes to read.
This information is given to you as part of the read system call, and there is also an offset: have you read 100 bytes of the file so far, or 1000 bytes? This offset is maintained in the open file table. From all of this, you know which bytes of the file to read next and, from the inode, where they are located on disk. So now you have some block numbers. The read system call can request large amounts of data spanning multiple blocks, but for simplicity let us assume a single block has to be fetched from disk to satisfy the request. How do you do that? Suppose the next 64 bytes you need are located in, say, block number 42 on disk. First, you check the disk buffer cache: has this process, or some other process, accessed the same disk block in the recent past? In that case block number 42 will be present in the disk buffer cache. If it is a hit in the disk buffer cache, you go directly to the next step. Otherwise, if it is a miss, that is, the data you want to read is not present in the disk buffer cache, you must go to the actual hard disk to fetch it. This is where everything we have seen about issuing commands to the disk comes up, and all of it happens only on a cache miss in the disk buffer cache.
If it is a cache miss, the device driver asks the hard disk to read the data, and the process that made the read system call blocks at this point. There is no point going to the next line of the program until the buffer has the data: the program assumes the buffer is filled with file data, and if it is not, there is no point executing the next line of code. Therefore the process blocks, and the OS context switches to some other process. At a later point in time, when the disk finishes reading block number 42 from its hardware, it DMAs the block, not into the user's buffer, but into the disk buffer cache. The disk buffer cache has many slots for recently read disk blocks; the hard disk DMAs block 42 into one of the free entries and then raises an interrupt. So by the time the interrupt is raised, the data is already present in the disk buffer cache via DMA. The OS then handles the interrupt, marks the blocked process as ready to run, the scheduler runs it, and so on. All of this happens if there is a cache miss and you have to go to disk. On the other hand, if there is a cache hit, block 42 is already in the cache due to a previous read, and you do not have to do any of this. The only thing left to do is copy the requested bytes: if the user has requested 64 bytes into a 64-byte user-space buffer, you copy them from the disk buffer cache into that user buffer, and then the read system call can return. Note that if you memory-map a file, this extra copy is avoided: you directly read the cached copy through different virtual addresses, a concept we have seen before.
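The read path above can be exercised from user space. Here is a minimal Python sketch (the file name is made up) showing that the offset kept in the open file table advances across successive reads, and that a read past the end of the file returns fewer bytes than requested:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(b"hello, file systems!")   # 20 bytes of file data

fd = os.open(path, os.O_RDONLY)

# Each read copies bytes from the (possibly cached) disk block
# into a user-space buffer and advances the per-open-file offset.
first = os.read(fd, 5)    # bytes 0..4
second = os.read(fd, 5)   # bytes 5..9, thanks to the saved offset
print(first, second)      # b'hello' b', fil'

# Asking for more bytes than remain yields a short read.
rest = os.read(fd, 100)
print(len(rest))          # 10: only 10 bytes were left

os.close(fd)
```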
But if you are using the regular read system call, the request may be for only a small amount, say 5 or 10 bytes, so you would not give the user access to the entire block that was read from disk into the disk buffer cache. You copy just the amount needed into the user-space buffer, and the return value of the read system call is the actual number of bytes that have been read. Sometimes, if you ask for 64 bytes but the file has only 32 bytes left, read may return only 32 bytes, or it may return an error value if the read could not happen at all; the return value depends on how all of this processing actually went. So that is a summary of how the read system call works. Note the importance of the disk buffer cache here: if it is a hit in the disk buffer cache, the process need not block and read can return immediately, but a miss may result in the process getting blocked. Next, we will see the write system call. It is similar in structure to read: the arguments are the file descriptor, a memory buffer containing the data to be written, and the number of bytes to write. Using the file descriptor you identify the inode, and from the inode you identify which data blocks of the file to write into. If you are writing beyond the end of the file, that is, suppose your file has 100 bytes so far, your offset is at the end, and you want to write the next 100 bytes, then you have to allocate new data blocks for this write: you are not writing into existing blocks but writing beyond the end of the file. In that case you need to allocate new blocks, which means going to the free list or a bitmap of your file system.
You find a new block, get its block number, and add this new block number to the inode; this block is a new member of the family, so you have to tell the inode to keep a pointer to it. All of these changes, to the bitmap marking the block as occupied and to the inode adding a pointer to the new block number, have to be made when you are expanding beyond the end of the file. Otherwise, if you are writing to an existing block of the file, you simply use the offset and the block numbers in the inode to identify the existing block number. The next step is to look for this block in the disk buffer cache. If it is not there, you first read it into the disk buffer cache, because all writes happen in memory first; they do not happen directly on disk. So now the block you want to write is in the cache, and the write system call has given you a user buffer with some number of bytes to write. Once you have obtained a copy of the block in the disk buffer cache (this read may itself cause the process to block, as before), you copy the requested number of bytes from the user-space buffer into the cached copy of the block. At this point, the cached copy of the disk block is different from whatever is on the hard disk: you fetched it into the disk buffer cache and modified the cached copy, so the cached block is marked as dirty. Then, when do you update the on-disk copy? Eventually everything has to be written to disk, because the disk buffer cache lives in main memory, RAM, which is volatile: when the power goes off, it is gone.
So finally you have to update the on-disk copy, but when? There are two ways. If your disk buffer cache follows a write-through design, then as soon as the write system call modifies the disk buffer cache entry, it also immediately writes the change to disk. But if your cache is a write-back cache, there is no hurry: you simply keep the changed copy in the cache, the on-disk copy is not yet updated, and later on, when enough such changes accumulate, you write them out. Now, when does the user code resume? A write-through cache is also called a synchronous write: the disk access happens within the write system call itself, so you return from the system call only after a delay. With an asynchronous write, that is, a write-back cache, the write system call just puts the data into the cached buffer and returns immediately. So when the system call returns depends on the kind of cache you have, and when it returns it tells you the number of bytes actually written; if not all bytes could be written, or there was some error, that information comes back in the return value of the write system call. The next concept is what is called linking and unlinking a file. Say you have a file dir1/a.txt. This a.txt points to some inode number, and the file has some data blocks whose pointers are stored in that inode. The same inode, the same file, can also be linked from a different location in the directory tree, say dir2/b.txt, so that this file name also refers to the same inode.
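Returning to the write path for a moment: the synchronous versus asynchronous distinction can be seen from user space. A normal write lands in the in-memory cached copy and returns quickly, while calling fsync explicitly forces the dirty blocks to disk, much like a write-through design would on every write. A Python sketch (the file name is made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "record.txt")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)

# write() copies the user buffer into the kernel's cached copy of
# the block and returns the number of bytes accepted; with a
# write-back cache this does not wait for the disk.
n = os.write(fd, b"important record\n")
print(n)  # 17

# fsync() blocks until the dirty cached blocks reach the disk,
# giving the effect of a synchronous (write-through style) write.
os.fsync(fd)
os.close(fd)
```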
So this is called linking a file from different locations. When a file is created, when you create a.txt, its inode number is linked into the directory for the first time; that is the first link. But once the inode exists, you can add many different links, from many different path names, to the same inode and the same file. The inode keeps a counter, the link count, which tells you how many such links there are to this file's data from different directory locations. This type of linking, where you add a mapping between an existing inode and a new file name in a new directory, is called hard linking. These are just different aliases, different path names, to access the same underlying inode. Now, if you delete dir1/a.txt, that link is gone, but you can still continue to access the file through the other link. Only once all such links to an inode are gone, once its link count drops to 0, is the file actually fully deleted from disk. As long as one of these links from some directory is alive, the inode and the data in the file continue to exist. So when we say we link a file to a directory, this is hard linking. The flip side of this is what is called soft linking: you simply add a mapping from a new file name to an old file name. You say b.txt in some other directory is equivalent to a.txt; you are storing a mapping between file names, not a link to the underlying inode number. Now, if you soft link b.txt to a.txt and then delete a.txt, you can no longer access the file through b.txt: the soft link is broken. Why? Because you are not storing a mapping to the underlying inode number, only a mapping to another file name. That is the difference between hard linking and soft linking.
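The difference can be seen directly with the link and symlink system calls; here is a Python sketch with made-up file names:

```python
import os
import tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "a.txt")
hard = os.path.join(d, "b_hard.txt")
soft = os.path.join(d, "b_soft.txt")

with open(a, "w") as f:
    f.write("hello")

os.link(a, hard)     # hard link: new name -> same inode (link count 2)
os.symlink(a, soft)  # soft link: new name -> the *name* a.txt

os.unlink(a)         # remove the original name

# The hard link still reaches the inode and its data blocks...
print(open(hard).read())      # hello
# ...but the soft link is broken: it points to a name that is gone.
print(os.path.exists(soft))   # False (the target name is missing)
```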
The ln command in Linux, if you have used it, can do both hard linking and soft linking. Then there is a system call called unlink, which removes these links. Normally, when you delete a file with rm in Linux, you are not actually deleting the file fully; you are simply unlinking it from one of the directories, and only when all such links are gone, when the file is unlinked from everywhere, is it truly deleted on disk. So this is the concept of linking and unlinking, and there are system calls for both. The next topic I would like to describe is what is called crash consistency. We have seen that every system call updates multiple blocks on disk: when you append data to a file, you change a data block of the file; if you allocated a new data block, you change the inode block to add the new block number; and you change the bitmap block to mark the newly allocated data block as occupied. So one system call touches multiple locations on the disk, and all of these changes are made first in memory and only then written to disk. Whether it is a write-through or a write-back cache, the changes are made first in memory and then on disk, and this holds for everything: even if you are changing an inode, that inode block is also in the disk buffer cache, and you change the cached copy first, then the disk copy.
Now, when you are doing this in two steps, first memory then disk, what happens if a power failure occurs? The user has made a system call and written some data, but before that data can be written to the hard disk, the power fails and the memory contents are wiped clean. In such cases, the disk may be only partially updated, and you may see some weird, inconsistent behavior in your file data because of these partial updates. For example, suppose you are appending data to a file: you have added a new data block at the end of the file and managed to write the data into that block, but you did not manage to write the pointer to that block into the inode; the on-disk inode is not updated even though the data block is. Then you can no longer access that data later: if something is not recorded in the inode, you cannot reach it, so it is as if your written data is lost. Another problem: suppose you updated the inode to point to the new data block, but you did not write the block's contents; the block still has empty or garbage values from a previous life. Then when you read the file, you will get garbage values. All sorts of weird things can happen. Therefore, when you boot up after a crash, the file system has to do some work to ensure that inconsistency due to the crash does not persist. How to do that is the problem of ensuring crash consistency in file systems, and there are many techniques, which we are going to see now.
Note that this problem exists with both write-through and write-back caches. So how do we guarantee crash consistency? Modern file systems have many ways to deal with it; clearly power failures do happen, so they must be handled. One tip that a programmer building a file system can keep in mind is to always update the data blocks before updating the metadata blocks. If you are writing new data at the end of a file, it is better to write the data first and then store the pointer to the data block in the inode. That way, even if the first step completes and the second does not, it is okay: you have merely lost the new data. But if you do it the other way around, you have written a block number into the inode while that block itself contains garbage, which is an even bigger problem. So in general it is good to first write the data, and only once you know the data is safe on disk add the pointers to it in the inode. But this alone is not enough to guarantee consistency, because there can be multiple metadata blocks: in the case of appending to a file, I have to change the inode and also the bitmap marking a certain data block as occupied. There are two metadata changes; which one do I do first? So the principle of writing data before metadata is not enough, and we need something more. That is why we have file system checking tools; in Linux, the tool is called fsck. Whenever you boot up after a failure, if there is some inconsistency in the metadata blocks, if some metadata changes have been written and some have not, these file system checking tools run through all the data structures of the file system at boot time and try to identify the inconsistencies.
For example, if you find that the bitmap marks a certain data block as free but that block is present in the inode of a file, that should not happen: if you have allocated a block for a file and stored its block number in the inode, the block should be marked as occupied in the bitmap; otherwise somebody else may use that block and overwrite your contents, which we do not want. The file system checking tools check all such inconsistencies: they look at all the bitmaps, check whether every occupied block appears in some inode, mark it as free if it does not, and so on. Whatever invariants you want to guarantee on your file system are checked. But of course this takes a long time, and some changes might still be lost. So this is not the only way; if you are very particular about the consistency of your data, say you are a banking system for which it is very important to keep data safe, you may use other techniques beyond file system checking tools. The fundamental problem here is that what we want is the property called atomicity, which we discussed in an earlier lecture. What is atomicity? If I am making multiple changes in one transaction or one system call, I want either all the changes to happen or, if there is a failure, it is okay for all of them to be lost; what I do not want is the case where some changes happen and some do not. Remember the example: you do not want an e-commerce website to charge your credit card and then forget to ship you the product. So what we want is atomicity: either do everything or do nothing at all.
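The kind of invariant that fsck checks can be sketched with toy data structures. The bitmap and inode representation here is entirely made up for illustration, far simpler than any real on-disk format:

```python
# Toy metadata: a block bitmap (True = occupied) and a set of
# inodes, each listing the data block numbers it points to.
bitmap = [True, False, True, False, True, False]
inodes = {1: [0, 2], 2: [3]}   # inode 2 points to block 3, marked free!

def find_inconsistencies(bitmap, inodes):
    """Blocks referenced by some inode but marked free in the bitmap."""
    referenced = {b for blocks in inodes.values() for b in blocks}
    return sorted(b for b in referenced if not bitmap[b])

print(find_inconsistencies(bitmap, inodes))  # [3]
```

A real checker verifies many more invariants (link counts, directory entries, duplicate block references), but each check has this same flavor: walk the metadata and flag states that no sequence of complete system calls could have produced.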
So how do we guarantee atomicity? A common technique across computer systems, which we will see in many places, is the concept of logging, and it can also be applied to file systems. What does logging mean? Suppose you are changing multiple blocks on disk: the data block, the inode block, the bitmap block. Instead of changing these directly one by one, with the risk of power failing in between, you first write all of the changed blocks into a log, a special area of the disk. Block one, block two, block three: all the changes that have to be made atomically are first written to the log, and the original disk locations are not touched. After you record all these changes, you write a special commit entry, marking on disk the start and end of your set of changes. Only once this is done do you start making the actual changes one by one: you update the inode, you update the actual bitmap, everything. And once all the changes are done, you clear out the log entry.
So what has this logging achieved? You are writing everything twice, so you might ask what the big deal is. Here is what it achieves. Suppose a crash happens before the log has been committed: you wrote only two or three blocks of the log, and when you restart there is no commit entry. Then you do not know whether these are the full set of changes, so you do not go back and touch the actual disk locations, and no changes are made. On the other hand, if the crash happens after the log is committed, then you have a record of all the changes: after the crash, when you restart, you go back and apply all of them, correctly modifying the actual inode, bitmap, everything. This is called replaying the log. Either way, all the changes are done atomically: either you do all of them or you do nothing at all. So logging is a good way to guarantee atomicity.
In general, there are many reasons why hard disks and other secondary storage devices can fail and lose some of their data. Logging is one way to protect against power failures, but there are other failure modes too: the data on the disk itself can get corrupted for various reasons; you may have written a bit as one, and after some time, due to age, the hard disk may turn it into a zero. We cannot guarantee that things stay intact forever, so we need techniques to protect the integrity of the data in the case of failures, and modern file systems employ several of them. For example, there is the concept of RAID, where you use multiple disks, not just one, and mirror your data across them, so that even if one copy is corrupted the other copy survives. You can also add checksums: to a set of bits you append some extra bits computed as a function of those bits. Why? Because if your data is corrupted, then at a later point in time, when you recompute the checksum, it will not match. For instance, if you are adding up all the bits and storing the sum, and some bit changes, the sum will no longer match, which helps you detect the corruption. Another thing you can do is add what are called error-correcting bits: a few extra redundant bits per set of data bits, so that even if some bits get corrupted due to hardware failures, you can still recover them using these error-correcting codes. All of these ideas are widely used in computer systems. You can also do snapshotting: periodically take a backup of your entire disk data, so that if something bad happens you can roll back to a previous copy. This is done using a technique called copy-on-write: you never overwrite a block in place; whenever you have to modify a block, you make a new copy, so that you have all your previous versions as well as the new version. Copy-on-write ensures that the older versions and the newer versions are both preserved, and therefore snapshots are available: you can ask what your file system looked like two days ago, and it can be recovered for you. So different modern file systems differ on all of these parameters: how they organize your data and metadata on storage, the maximum file size, the disk space used, and the features used for reliability. And note that there is a trade-off; there is no one correct answer for how you build a file system. All of these reliability features, logging, checksums, snapshots, are good for reliability but hurt performance: with logging you write everything to disk twice, which is wasteful, but if you need reliability, that is a trade-off you have to make. Similarly, different file systems are optimized for different storage technologies: you have traditional hard disks, solid-state devices, and newer technologies like non-volatile memory, and for each of these the design decisions in your file system will be different; you might want a different way of organizing your data on disk for each technology. Likewise for specific applications: the ideas we have seen are for general-purpose file systems, but if your files have a specific structure, if you are storing only a certain kind of file or files of only a certain size, then you can optimize your file system data structures for that particular workload.
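The copy-on-write idea mentioned above can be sketched with a toy block store. All the names here are made up; a real copy-on-write file system versions whole trees of blocks, not a flat table:

```python
# Toy copy-on-write block store: each write creates a new version of
# the block instead of overwriting it in place.
class CowStore:
    def __init__(self):
        self.versions = {}  # block number -> list of versions

    def write(self, block, data):
        # Never overwrite: append a fresh copy.
        self.versions.setdefault(block, []).append(data)

    def read(self, block, snapshot=-1):
        # snapshot=-1 reads the latest version; earlier indices read
        # older snapshots, which are never destroyed by later writes.
        return self.versions[block][snapshot]

s = CowStore()
s.write(42, "monday's data")
s.write(42, "tuesday's data")   # a copy, not an overwrite
print(s.read(42))               # tuesday's data
print(s.read(42, snapshot=0))   # monday's data: the old version survives
```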
So there are different file systems optimized like that. There are also different file systems depending on whether you are accessing a local disk or a remote disk over the network, and then there are features like compression, encryption, and more than one level of caching. This is an active area of research: building different file systems for different kinds of applications and workloads, for analytics workloads, for machine learning workloads. Therefore there are many different file systems, and there is no single most optimal file system out there. If you are designing a computer system, you have to make this choice: understand all the options available to you and pick one based on your requirements, your applications, your technology, and your performance and reliability trade-offs. Now, if there are all these different file systems with different implementations, for example, the implementation of your write system call depends on whether you are using logging (first write to the log, then make the changes later) or making the changes directly, then how do operating systems support multiple file systems? Is it that you have to delete a million lines of code in the OS in order to move to a different file system? No. Today, file systems are built in a very modular manner in modern operating systems, so that you can quickly change from one file system to another. How is that done? With an idea called the virtual file system, or VFS: the virtual file system layer in the operating system defines some high-level concepts, objects like files, directories, and inodes, that every file system has.
So the operating system code is written in terms of operations on these files, directories, and inodes. For example, to open a file: open the root inode, read the data blocks of the root inode, then open the next directory and find out the inode number, and so on; that is how the code for a system call is written. Now, the underlying file system can implement these operations (looking up a name in a directory, opening an inode, opening a file) differently. A file system that implements a directory as an array of records, as a linked list, or as a tree will traverse or look up a file differently, but the concept is the same. So the way the OS code is written is: first, the system calls are implemented in the virtual file system on abstract objects like files and inodes, and then the actual file system implementation provides pointers to the functions that operate on these data structures, for example a function that looks up a file name in a directory. All of those function pointers are implemented by the actual file system. That way, most of your file system code does not have to change; only these function pointers have to change. If you want to move to a different file system, you provide a different set of function pointers to your VFS and you are good; you are not having to change a whole lot of code in your OS.
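This function-pointer idea can be sketched as a toy model (hypothetical names, loosely inspired by how a real VFS dispatches to a concrete file system): the generic path-walking code is written once, and each file system only supplies its own lookup implementation in an operations table.

```python
# Toy VFS: generic code calls through an operations table; each
# concrete file system supplies its own implementation of "lookup".

# File system A: a directory is a plain dict (name -> child or inode).
def dictfs_lookup(directory, name):
    return directory.get(name)

# File system B: a directory is a list of (name, child) records.
def listfs_lookup(directory, name):
    for entry_name, child in directory:
        if entry_name == name:
            return child
    return None

def vfs_path_lookup(ops, root_dir, path):
    """Generic VFS code: walk 'foo/a.txt' one component at a time,
    using only the function pointer supplied by the file system."""
    node = root_dir
    for component in path.split("/"):
        node = ops["lookup"](node, component)
        if node is None:
            return None          # component not found
    return node

# Two different on-disk layouts for the same directory tree,
# where a.txt has inode number 42.
dictfs_root = {"foo": {"a.txt": 42}}
listfs_root = [("foo", [("a.txt", 42)])]

print(vfs_path_lookup({"lookup": dictfs_lookup}, dictfs_root, "foo/a.txt"))
print(vfs_path_lookup({"lookup": listfs_lookup}, listfs_root, "foo/a.txt"))
```

Both calls resolve the same path to inode 42, even though the two "file systems" store directories completely differently; `vfs_path_lookup` never changed. In Linux the same effect is achieved with C structs of function pointers such as the inode and file operation tables.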
So this is the concept of layering, which we saw in the second lecture of the course: instead of building everything as one big block, you build it in layers. There is the VFS layer, then the file system implementation layer, underneath that the disk buffer cache where blocks are cached, and then the device driver. Each one has its own job: the device driver only deals with reading and writing data from the hard disk; the disk buffer cache, the block layer, only deals with storing recently accessed disk blocks; the file system implementation layer implements functions on these different objects; and the VFS layer provides the actual system call implementation in terms of those objects. Everybody has their job cut out, so that you can easily move to a different disk buffer cache implementation, a different device driver, or a different file system; all of these switches are easy to make. That is the end of our discussion of file systems in operating systems. We have seen how some of the file system system calls are implemented, how you can guarantee things like atomicity and crash consistency, the various ideas being used in file systems today, how the choice of file system depends on your underlying application, and how you can switch between file systems easily using mechanisms like VFS that modern operating systems provide. With this, I would like to wrap up this lecture. To help you understand the concepts of this lecture better, I would like you to try out some of these system calls, like open, read, write, link, and unlink; play around with them and write some code using them. And not just the system calls directly: there are also many libraries available in different programming languages for file access, so explore those too in order to assimilate the concepts of this lecture better.
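The layered stack described above can be sketched as a toy model (illustrative class names, not real kernel code): each layer only talks to the one directly below it, and the buffer cache layer absorbs repeated reads so the device is touched as little as possible.

```python
# Toy layered storage stack: each layer only calls the layer below it.

class DeviceDriver:
    """Bottom layer: only knows how to read raw blocks off the 'disk'."""
    def __init__(self, disk):
        self.disk = disk
        self.reads = 0               # count real device accesses

    def read_block(self, bno):
        self.reads += 1
        return self.disk[bno]

class BufferCache:
    """Block layer: caches recently read blocks, delegates misses down."""
    def __init__(self, driver):
        self.driver = driver
        self.cache = {}

    def read_block(self, bno):
        if bno not in self.cache:
            self.cache[bno] = self.driver.read_block(bno)
        return self.cache[bno]

class FileSystem:
    """FS layer: turns a file's block list into data, via the cache."""
    def __init__(self, cache):
        self.cache = cache

    def read_file(self, block_list):
        return "".join(self.cache.read_block(b) for b in block_list)

driver = DeviceDriver(disk={0: "hel", 1: "lo!"})
fs = FileSystem(BufferCache(driver))

print(fs.read_file([0, 1]))   # hello!
print(fs.read_file([0, 1]))   # same data, served from the cache
print(driver.reads)           # 2 -- the device was only touched twice
```

Because each layer exposes only a narrow interface (`read_block`, `read_file`), you could swap in a different cache policy or a different driver without the layers above noticing, which is exactly the property that makes these switches easy in a real OS.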
So thank you all; that is all I have for today. See you in the next class. Thanks.