And now we will get into input/output and file systems. I won't talk about directories much; I will mostly be talking about files. We have already looked at file input and output through fstream. Today I want to trace in complete detail what exactly happens when you use an fstream to dump data and read data, so that there is absolutely no confusion left at the end of it.
So what is a file? A file is a 1D array of bytes on disk. Just like memory is a 1D array or vector, a file is also a 1D array of bytes, with the difference that it exists on disk and so it's slower to access. There's no other difference. It can grow and shrink at the end just like a vector. Interpreting the bytes is the business of the program that uses the file. The operating system doesn't care what you store in a file. I'll talk about the hexdump utility, which lets you see what's in a file byte by byte, printed in hex. It's a very simple utility that Linux provides. Now if the file is a text file or maybe a .cpp file, then those bytes are assumed to be ASCII codes. But the bytes in a.out are not supposed to be ASCII codes. They're assumed to be compiled executable code that is loaded directly into the CPU and run. So same bytes, different purpose.
Now we all know how to use an ofstream, but let's just go ahead and see what the code looks like. You include fstream, you say using namespace std, then you say ofstream of and give a path to some file. If your path doesn't start with a slash, then it's taken relative to the current directory where the program is running. Otherwise, it starts from the root of the file system. In this case, I'm printing out 2 and 3, which are interpreted as integers. Then I'm printing out 2 minus 3 as an integer, and then I'm printing a newline. So what happens in all of this? If the file does not exist, it is created empty. If it exists, it is truncated to 0 bytes. A write cursor is initialized to 0. When you open the file like that, that's what happens.
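As a concrete reference for the program just described, here is a minimal sketch. The function name write_text_demo and its path parameter are assumptions made for illustration; the lecture's own code is not shown in the transcript.

```cpp
#include <fstream>
#include <string>

// Writes 2, 3, 2-3 and an end-of-line as formatted text.
// write_text_demo and its parameter are illustrative names, not from the lecture.
void write_text_demo(const std::string& path) {
    std::ofstream of(path);    // creates the file, or truncates it to 0 bytes
    of << 2 << 3;              // each int is converted to ASCII digit characters
    of << 2 - 3 << std::endl;  // writes '-', '1', then the end-of-line
    of.close();                // flush buffered bytes and release the file
}
```

Calling write_text_demo with the path dump.txt should leave behind a file whose text content is 23-1 followed by a newline.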
After that point, if I'm printing that 3, what happens? The 3 is an integer, and it occupies 4 bytes in RAM or in a register. The first thing that happens is that the integer 3 is translated to the string 3. They are different things. The ASCII code for the character 3, which happens to be 0x33, is appended to the file, and the write cursor is incremented by 1 byte because we have written 0x33 to the output stream. Similarly, when I look at 2 minus 3, the result is minus 1. In this case, two characters will be written out. One is the ASCII code for minus. The other is the ASCII code for 1. So in all, for 2, 3 and minus 1, four ASCII code bytes will be written out. And finally, there will be the newline.
Now, just for our convenience, people can't agree on things. Unix and Linux systems indicate end-of-line with one character, whereas Windows and DOS indicate end-of-line with two characters, just to make life more interesting. So on Linux, the hex code 0A, or decimal 10, which means line feed, is written to the output stream, and the write cursor is increased by 1. On Windows, 0D, which is called carriage return, followed by 0A, which is line feed, are written, and the write cursor is increased by 2. So you have the exact same C++ code, but the files it creates on disk will differ in size by 1 byte. We can avoid this mess by passing suitable flags to the stream class. We'll see that a little later.
But let's write this code. It's a very silly piece of code. Let's just go ahead and write it. And of course, of.close() makes sure that the file is not just cached in RAM: all bytes have actually been flushed to disk, and the resources are released. Very simple. So I type in the code. There's an error that I think is coming from this character; the character codes are not the same. There we go. So now if I run it, it just runs, dumping to this file called dump.txt. What do I find in dump.txt? 2, 3, minus 1. I didn't put any spaces anywhere. So 2, 3, minus 1.
And a newline, which is why I'm on this line. Otherwise, this C would immediately follow the 1. Now let's run the hexdump utility to see what's in the file. So I say hexdump dump.txt. What's in that file, byte by byte, in hexadecimal? The first column shows you the byte offset inside the file. You can see a whole bunch of weird stuff. I need some flags. I think it has one for canonical hex plus ASCII. So if I run hexdump -C on the file, this is what I see. To the right, it shows whatever it can print. Any unprintable character, like newline, is shown as a dot.
And here it says that at byte offset 0 in the file, the first byte is ASCII code 0x32, which corresponds to the digit 2: not the integer 2, but the written digit 2. The next byte is the ASCII code for the written digit 3, which is 0x33. Then I have the ASCII code for the minus sign, which is 0x2D. Then I have the ASCII code for 1, which is 0x31. As you can see, 1, 2 and 3 are 31, 32, 33 in hex. And finally, there's the newline, which is 0A, as I promised. So in other words, this file has 5 bytes, and those are the byte values. That's what happens when you write to an fstream or to cout using the << construct.
Now, let me change this slightly and say that I want to save the hex number 0x2345abcd. That's a valid integer. Now if I run the program, the stuff in dump.txt is this large decimal integer, which corresponds to that hex number. And again, if I hexdump this guy, how many bytes have been written? 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. What's actually in the file is nine digit characters and the newline. So 10 characters have been written, and that costs you 10 bytes in the file. If I say ls -l dump.txt, you can see that the size is 10. So I used 10 bytes to write an integer to a file where the original integer fit inside 4 bytes. Clearly this is not such a nice idea. Can we do better?
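To check the byte counts above without leaving C++, the following sketch prints a file's bytes in hex, roughly like the middle column of hexdump -C. The helper name hex_bytes is hypothetical; it is a small stand-in, not a reimplementation of hexdump.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Returns the bytes of a file as space-separated two-digit hex values.
// hex_bytes is a hypothetical helper name chosen for this illustration.
std::string hex_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);  // binary: no newline translation
    std::string out;
    char c;
    while (in.get(c)) {                        // get() reads one raw byte at a time
        char buf[4];
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned>(static_cast<unsigned char>(c)));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

On the five-byte file from the lecture this should give 32 33 2d 31 0a, and for the text form of 0x2345abcd it should list ten bytes, against the four that sizeof(int) typically reports.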
And the answer is that we don't want this translation into ASCII and back. We want to store the integer on disk exactly as the 4 bytes that are stored in RAM. And we'll do that as follows.
Now there's also this thing called ios, which is a class that comes alongside fstream and lets you specify a few things while opening files. For example, you can open a file in binary mode, which doesn't do this CR/LF patch-up, so it doesn't distinguish between Windows files and Unix files. If you're using files to dump integers and doubles and so on in low-level format, then you should use binary mode so that an accidental occurrence of 0D and 0A doesn't look like a newline. Otherwise, the system will translate it for you, thinking it's the end of a line. So it's safest to always open binary files in binary mode. You should. I'll show how to apply the flags soon. There's a flag called ios::ate, I have no idea why. With it, the write cursor is positioned at the end of the file. So if the file already exists, you leave the content unchanged, and you put the write cursor at the end rather than at the beginning of the file. And you can also truncate an existing file by force by saying ios::trunc. It's similar to vector.resize(0).
Now, fstreams can read and write a file at the same time, and we'll use this to dump our data in low-level binary format. fstreams are opened as follows: fstream fs, then the path to the file like before, and then you pass it these flags. You say open it in binary mode, and let me do both in and out on that file, so I should be able to both read and write. What are these vertical bars? Those are bitwise OR operators. In fact, all these flags, ios::binary, ios::in, ios::out, are essentially just integers, but each is a distinct power of two inside the system. When you OR them, various bits are set.
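Here is a small sketch of ORing the open flags together. It combines ios::ate with ios::in and ios::out, because opening with ios::out alone would truncate the file; that combination, and the helper name end_offset, are assumptions added for illustration.

```cpp
#include <fstream>
#include <string>

// Opens an existing file for binary reading and writing with the cursor at
// the end, and reports where the write cursor sits.
// end_offset is a hypothetical helper name.
std::streamoff end_offset(const std::string& path) {
    std::fstream fs(path, std::ios::binary | std::ios::in |
                          std::ios::out | std::ios::ate);
    if (!fs) return -1;               // open failed, e.g. the file does not exist
    std::streamoff off = fs.tellp();  // with ate, this is the current file size
    return off;
}
```

For an existing file of five bytes this should return 5 and leave the content untouched; for a missing file the open fails and the sketch returns -1.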
And the fstream class reads that integer and finds out which bits are set to decide how to start up the stream, whether it should move the write cursor and so on. And then the fstream class provides these all-important methods. So look at these three or four methods very carefully. Generally speaking, you should never use << or >> on these fstreams, because you're using them for low-level binary data. You don't want formatting into ASCII at all. So the only thing you should do on such an fstream is read and write character buffers.
Suppose I have a character buffer with 100 bytes. You can ask the fstream to read 100 bytes into the buffer. So wherever the read, or get, cursor is placed, from that point onwards it reads ahead 100 bytes and loads them into the buffer. If the file ends before that, the stream will set a fail or end-of-file flag, and you can check for that. We'll come back to that later on. On the other side, you can tell fs to write to disk the contents of the buffer, 100 characters, starting at the current position of the write cursor, wherever the write cursor currently is. That may end up extending the file if there's not enough space, just like vectors can get extended by push_back. So this basically says push_back 100 bytes from the buffer onto the file, if the write cursor is at the end. But unlike push_back, the write cursor can also be repositioned to the interior of the file, so you can overwrite if you want.
You can actually ask where the cursor positions are. You can say, tell me where the get cursor is, the read cursor. Tell me where the put cursor is, that's the write position. And what you get back is not an int, but a pos_type. Why is that? Because if it were a 32-bit int, we could only support files up to about 2 billion bytes, which is not large by today's standards. You could definitely have multi-gigabyte files, and you would run out of offsets.
That's why the offset in a file is defined in a generic way called pos_type. Depending on the system, the offset it holds may be 64 bits or 32 bits. You don't have to know. It's the offset of a byte in a file. Just like ints are used to index into vectors, pos_type is used to index into files to get at a byte. And finally, you can also set the cursor positions by saying seekg, seek the get cursor to position gx, or seekp, position the put cursor at px, where gx and px have to be of pos_type. tellg and tellp are methods defined in fstream that tell you where the read and write cursors are currently located. Any act of reading and writing will move the read and write cursors. g is get and p is put. Put is write.
So this is how it works. Think of the file as a contiguous chunk of bytes. Every read will advance gx. Every write will advance px. But there is no necessary ordering between gx and px in general, because you are allowed to move them around. We can set gx and px using seekg and seekp, and the current positions can be found by calling tellg and tellp. Writing may extend the file. Reading will never extend the file. Reading may run out of bytes and signal an end of file. So this is the current situation.
Now, armed with that, we'll write an interesting class. So far, we have been looking at the Boost vectors, which know how to dump themselves to a file and read themselves from a file. But those have been in the non-compact ASCII format, complete with brackets and commas and all kinds of stuff. And clearly, if the elements are large numbers, you'll take many more bytes to write them than just the size of a double or a float. Just like we saw, to save one integer, you might take 10 bytes. We don't want that. We want to write a new vector class. We'll not use Boost. We'll just start from the standard vector class.
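To make the get and put cursors described above concrete before moving on to the vector class, here is a minimal sketch that reads one byte at a given offset and then overwrites it in place. The helper name swap_byte is hypothetical, and the file is assumed to exist already and to be longer than pos.

```cpp
#include <fstream>
#include <string>

// Overwrites the byte at offset pos with value and returns the old byte.
// swap_byte is a hypothetical helper; the file must already exist.
char swap_byte(const std::string& path, std::streamoff pos, char value) {
    std::fstream fs(path, std::ios::binary | std::ios::in | std::ios::out);
    char old = 0;
    fs.seekg(pos, std::ios::beg);  // move the get (read) cursor to pos
    fs.read(&old, 1);              // reading advances the get cursor to pos + 1
    fs.seekp(pos, std::ios::beg);  // move the put (write) cursor back to pos
    fs.write(&value, 1);           // overwrite in place; the file does not grow
    return old;
}
```

On a file containing abcde, calling swap_byte with offset 2 and 'X' should return 'c' and leave abXde on disk.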
We want to write a vector class which is capable of dumping itself to disk in compact format, in exactly the amount of space required by its elements, and which can also load itself from a disk file. So let's write that. The way to do that is to inherit from the vector class.
Before we get to general inheritance, let's look at simple inheritance from our point class. Remember, our point class, or struct, had a double x and a double y coordinate. If I now want colored points, I can extend the point class by saying struct colored_point : public point. It means anything that was public in point will remain public in the new class, but I'll add a public field called color. Now you could also say I want a mass with the point because I'm simulating planets. Then you can say colored_point_mass extends colored_point with a mass. So your extensions of classes, or inheritance, can go multiple levels. Similarly, when we write a vector that reads and writes itself, we'll use inheritance.
So Boost lets you read and write matrices, but these are in textual format. We'd like to store an integer in exactly 4 bytes in the disk file as well. But we cannot directly view this file, because it does not have ASCII character codes in it. If you try to do that, you'll get some garbage. Now, to store a vector of int to disk, we will first store the size in 4 bytes, and then we'll store each element in 4 bytes. That will also let us read it from start to finish easily.
So here is the code. Here is my read-write vector class. I use fstream and I use vector. And I have rwvector, which inherits publicly from vector<int>. So all methods in vector<int> will still be available to any user of rwvector. How do I load it from the file system? If I'm loading it, the assumption is that I'm discarding the current state of the vector entirely. So I resize to 0, and the original elements of the vector all go away. nn is the number of elements I have to read from the file.
And of course, load takes an argument, which is the name of the file, given as a char*. Now I open an ifstream, because I'm only going to read: ifs with the file name, and I want to read it in binary mode so that it doesn't mess with the newline stuff. Now the important thing is this read statement. Remember, read can only read into character buffers. But what I want to read is an integer, nn. nn takes four bytes. So what I do is take the address of nn, and I pretend to the system that it's a character pointer. I also tell read to read a number of bytes, which is the size of nn. There's a sizeof operator in C and C++ which tells you how many bytes of RAM an item needs for storage. So sizeof nn will evaluate to four here. You shouldn't write four, because someday someone may change this type to long long or something, and then four will be incorrect. So what this does is read four bytes from disk in order and lay them out exactly in the four bytes of RAM which are supposed to be storing nn.
So the picture is this. Let's say here is my RAM, and here are the four bytes. nn is at this address, which is, say, 10,000. Meanwhile, in the file there are these four bytes: v0, v1, v2, v3. And my get cursor is here. At this point, if I take &nn, that literally translates to byte address 10,000. I make it a character pointer, and then I say read four bytes from the file. So these four bytes are read. And because I passed address 10,000 to the read method, those bytes are literally pasted into those four positions. But now I'm free to go back and interpret them as an integer.
Now let's say I want to store the vector to a disk file. I open an ofstream with the file name and ios::binary, so no newline conversion. And now the first thing I have to dump is the size of the vector. So what I do is say int nn equals size(), my own size.
Since I'm inheriting from that object, all fields and methods of the object are still available to me. So I can just call size(). And now I say ofs.write. Again, I have to pass it a fake character pointer, which is actually pointing to the starting byte of the integer called nn. And I also have to tell it how many bytes to write, which will be four bytes. This will do the reverse process. It will look at the four bytes in RAM which represent the number nn and transfer those four bytes, exactly in that order, to disk. See, I'm not using << or >>, nothing like that.
After this, what do I do? for (int ix = 0; ix < nn; ++ix), and I now have to write out all the elements of this vector. Here there may be a bit of a mess, because what do I write? this[ix]? That doesn't quite work. Is there a get method in vector? I don't exactly know. So suppose I say int elem equals what? You might say it should look like this expression, but that will not work. Does everyone know how to do this? I think this works. Let's try it out. I'll explain if it works. It's only complaining about main, so it's pretty much done. Let me provide a main, empty for now, and then I'll come back to explaining what's going on.
So what's happening here is this. Remember, I can refer to size() because the current object has a size: the current object is basically an extension of a vector<int>, so I can always access the size of that vector. Similarly, remember that vector also provides an indexed element access operator. Unfortunately, that has no other name. For operators that are overloaded, like plus or multiply or square brackets, you actually have to say: invoke operator[] on this object with argument ix. That gets you the element at ix. That's the syntax. It's just mumbo jumbo. In this case, let me try this. I've never tried this, actually. Bingo. So you can just say operator[](ix).
Now after that, of course, we have the job of writing it out to the file. So I pass the address of elem and sizeof elem. Let's leave our implementation of load unspecified for now. Let's not bother. Whatever the class is called, rwvector in all lower case. So in main we'll say rwvector rwv. Then we'll say rwv.push_back(5), rwv.push_back(656). And then I'll do rwv.store, and the file will be dump.bin. This is no longer a text file.
Observe that the method called push_back is coming from the original implementation of vector<int>. But the method store is not defined in vector<int>. It is defined here. So think of it as overlaying your new definitions of methods on top of the earlier ones. If you define something here which is already defined in vector, your definition will take precedence, because you're layering your implementation on top of the lower implementation. But we're not doing that. store wasn't defined before, and therefore there's only one store. And I have to give it a file name, so I've given the file name dump.bin.
Let's try to compile this and see what happens. It warns about a conversion from string constant to char*, at 35. Let's not bother with that. It's just being pedantic. So now suppose we run rwvector.exe. It runs, and it leaves behind a file called dump.bin. Why is it 12 bytes? I wrote the size, then I wrote two integers. So it's three integers, or 12 bytes. No newline, nothing. What are the contents? Let's see. If I try to cat dump.bin, I get strange foreign characters, weird stuff. My screen might even get messed up trying to print it; not yet. So the safe thing is to do a hexdump. If you do hexdump -C dump.bin, let's see what bytes are there. The very first integer I wrote is four bytes. What are the contents? 02 00 00 00. So it writes from the least significant byte onwards.
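The least-significant-byte-first order seen in the dump is a property of the machine, not of the file. The following sketch shows an int's bytes in the order they sit in RAM; memory_bytes is a hypothetical helper name, and the order it reports depends on the CPU (least significant byte first on common x86 and ARM desktops).

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Returns the bytes of an int, in memory order, as space-separated hex.
// memory_bytes is a hypothetical helper name.
std::string memory_bytes(int value) {
    unsigned char raw[sizeof value];
    std::memcpy(raw, &value, sizeof value);  // copy the in-memory representation
    std::string out;
    for (std::size_t i = 0; i < sizeof value; ++i) {
        char buf[4];
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(raw[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Because write copies these bytes to disk unchanged, a file produced this way on one kind of machine may not read back correctly on a machine with the opposite byte order.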
The actual number I wrote first is the number of elements, which is 2. 2 written in hex as a full integer is 00 00 00 02. So read it like Urdu, from right to left. The first four bytes record the size, the number of elements. Then there is 5. And then there is 656, which happens to be 0x290 in hex, and as a full integer that is 00 00 02 90. The total is 12 bytes, because there are three integers being written here.
A question: isn't there maybe a starting code? Nothing. It's bare. Yeah? How does the system know that the file has ended? Because you read the 2 blindly first, you know how many more things to read, since the first number is 2. So the convention is that the number of elements is always written as the first integer, in four bytes. It's always safe to read four bytes, because even the empty vector will have four bytes saying 0. And then you read as many more groups of four bytes as you require.
So the assumption is that you are creating these files purely with store and then reading them with load. If for some reason the file is corrupted, there will be a problem. If there's a bug in the store code, there will be a problem. There is no metadata, that's the point. There is no description of the data anywhere. If tomorrow your source code changes, you're on your own. And if you have created 500 data files on disk and the source code changes, it's up to you to coordinate the change between the data files and the source code. Java handles that through the Java serialization standard. It can store various hash codes of the object and various other things to make sure that you don't try to illegally read stuff which this class didn't create. There's no safeguard here. In C++, there's absolutely no safeguard. There are other, bigger packages and libraries written for that purpose. There is the Google Protocol Buffers (protobuf) package, which you can use for that.
If you want to find out about that, you can go and research it. So now let's look at the load routine. The load routine empties out the current vector state to have zero length. Then the first thing it does is, as I said, read four bytes to find out how many more elements it has to read. And then inside the loop, as long as nn-- is more than zero, as long as there's more stuff to read, it declares an elem, which is an int in this case, since it's an int vector. And then what does it read? It passes the address of elem, cast to a character pointer, and four bytes. And once it has elem, it pushes it back onto the current vector. push_back is a method of this vector, and I can always use it from any other method in an extended class. And finally, in all cases, I have to close. I didn't close the ofs earlier. It still worked out fine, because the ofstream closes the file when it goes out of scope, but it's better to be explicit. And then ifs.close().
So now, suppose I have another rwvector, say rwv2. I can now say rwv2.load with dump.bin, and that will load up the same elements. For example, as an indicator, I can at least write out its size. Let's see if that comes out correctly. 9:15, there are some colon problems; forget about those. cout was not declared, because I didn't include iostream. The deprecation is fine. That's just a warning. So now, if I do that, the file will be written afresh, and then the vector will be loaded up with the proper size of 2. You can even print out the elements; it'll be fine.
So, questions on this? The main things to learn here are that by taking the address of any primitive object, and also knowing its size through the sizeof operator, I can use that variable and its size in calls to read and write on an fstream. And this will transfer the bytes directly between file and RAM, in either direction. The important thing to note is you can't do that with composite objects like vectors. You have to write your own code to do this.
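Putting store and load together, here is a sketch of the class as described in the lecture. The lecture's actual source is not in the transcript, so details such as the const qualifier, the reinterpret_cast spelling, and the early exit on a short file are assumptions.

```cpp
#include <fstream>
#include <vector>

// A vector<int> that stores itself as raw bytes: first the element count,
// then each element, sizeof(int) bytes apiece. Reconstructed sketch.
class rwvector : public std::vector<int> {
public:
    void store(const char* fname) const {
        std::ofstream ofs(fname, std::ios::binary);  // no newline translation
        int nn = static_cast<int>(size());           // size() comes from vector<int>
        ofs.write(reinterpret_cast<const char*>(&nn), sizeof nn);
        for (int ix = 0; ix < nn; ++ix) {
            int elem = operator[](ix);               // explicit operator call
            ofs.write(reinterpret_cast<const char*>(&elem), sizeof elem);
        }
        ofs.close();
    }

    void load(const char* fname) {
        resize(0);                                   // discard the current contents
        std::ifstream ifs(fname, std::ios::binary);
        int nn = 0;
        ifs.read(reinterpret_cast<char*>(&nn), sizeof nn);
        while (nn-- > 0) {
            int elem = 0;
            // Stop early if the file is shorter than its header claims (an added guard).
            if (!ifs.read(reinterpret_cast<char*>(&elem), sizeof elem)) break;
            push_back(elem);
        }
        ifs.close();
    }
};
```

Storing a vector holding 5 and 656 should produce a file of three ints, and loading it into a second rwvector should give back size 2 with the same elements.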
Only primitive objects which are packed densely in RAM can do this. Structs can do this, generally speaking. Structs with primitive things in them can be written out, but structs with a vector in them can't be. It has to be a contiguous byte sequence in RAM for you to transfer it as a contiguous byte sequence to disk.
Yeah? The question is about that operator syntax I used. So suppose rwv is our read-write vector, which extends vector. I can always have an expression of the form rwv[2], right? This open and close square bracket is actually not supported primitively by the language for such objects. When you say rwv[2], the square bracket is translated into a method call. You declare that method by saying it's an operator with an argument which is an int. Somewhere inside vector, there was a declaration of the following form (public is already there): int operator[](int ix), and then it returns something. That's how the implementer of vector told the C++ system that hereafter, on a vector, you can call an operator with square brackets with an argument which is an integer for the index. The return is from the native buffer; you might want to return something like pa[ix]. Because we are not implementing vector ourselves, I can't use pa here.
If you remember, in the other class, when we were implementing vectors, in the get method I was giving an offset and returning a float. That means that inside main, so far, you cannot say vec[ix]. This is not allowed. If you wanted to do that, to make it even closer to the system's vector implementation, then you would have to declare an operator. So instead of having a get like this, you'd have to say float operator[](int p) const, with the same code inside. Or you can say return get(p); that would also work. The const is because I am only accessing the p-th element. I am not changing it. That's why, just like get was not changing it.
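As a sketch of the declaration just described, here is a tiny fixed-size class with both a named accessor and the indexing operator. The class floatvec and its members are hypothetical stand-ins; the class from the earlier lecture is not shown in this transcript.

```cpp
// Small fixed-size array of floats with get() and a read-only operator[].
// floatvec, put and get are hypothetical stand-ins for the earlier class.
class floatvec {
    float data_[4] = {0, 0, 0, 0};
public:
    void put(int p, float v) { data_[p] = v; }
    float get(int p) const { return data_[p]; }
    // Returns a copy of the element, so vec[p] can be read but not assigned to.
    float operator[](int p) const { return get(p); }
};
```

With this in place, v[2] and v.operator[](2) are two spellings of the same call. An assignment through the brackets does not compile here because the operator returns a float by value; returning float& from a non-const overload is the usual way to permit it.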
So if you define it like this, and then at the end you try to say something like vec[ix] equals something, that will flag an error. But the point is that these things are declared as operators so that you get the comfort of using a native array-like construct. Now the problem with operators is that later on, if you have to use them from inside the class, you have to invoke them like this. You have to say operator[]. See, otherwise you might have thought that I should be able to write int elem equals this[ix]. But this is a pointer, so that syntax doesn't do what you want. No, this is not allowed. One way around it is to write down the word operator explicitly, followed by the argument list in standard round brackets, exactly as it was declared (dereferencing first, as in (*this)[ix], also works). It's just a convention. Nothing profound. sizeof is a predefined language construct which tells you the number of bytes it takes to represent a primitive object. It's defined already in C and C++.
Any other questions? So the summary is that by taking the address of any object of primitive type, casting it to char*, and giving the number of bytes to read or write, you can transfer the memory of that primitive variable directly to and from disk. That's the bottom line. And here we have used it to extend the class vector of ints so that it can dump itself to a file and load itself from a file very efficiently. You use exactly the same number of bytes as in RAM.
Now for the last exercise here, suppose things get pretty extreme. Suppose I want to represent a vector which is something like 50 gigabytes large, and you don't have enough RAM for that. You don't care if your code is a little slow. You just want to run code with a 50-gigabyte array. How do you do that?
So we want to write a new class called filevector, whose data doesn't even have any existence on the stack or heap. It's entirely implemented on disk. We already have the pieces we need for the puzzle. We know how to seek, to reposition the read and write cursors. That's equivalent to indexing. You should be able to do this. It can be deathly slow, but let's do it and see.
So class filevector, as its public interface, provides a constructor. This time, even in the constructor I need a file name, because it's a file which is storing the data. It's not in memory. And I'll have to give it a size, which is the number of items you want to put in that array. Then we have the put, get and size methods. It's very simple. So I'll say put float value val at logical index ix, or retrieve the element at index ix.
So let's start declaring some stuff. What else do I need? What do I have to store in private? The file stream itself, which the end user should not see. And the size. For simplicity, the size will not change. Once the array is created, the size will be fixed. The public stuff will be that I give you a file name and a size. So I set size to _size. And then I open the file. The other thing is, you don't have to open the file right at the declaration; that's just a declaration of a private member. You can open it in the constructor. So here's an example where the constructor does allocate a resource, but it's not a heap resource. Inside, I give the file name, and I say I want to open it in binary mode, and I want to both read and write it. In the destructor, I close the file. Now you might argue that in some applications you also want to delete the file from disk at that point. You can do that if you want. size just returns size, nothing else to do. So what does put do? Because files are indexed exclusively in byte units, you have to do your own translation between byte offsets and float offsets.
If I say that I want to write the float val at index ix, in the file I need to seek to byte number ix times sizeof(float), just like our cell number to byte offset kind of thing. And then I use exactly the same logic as before: fs.write with the address of val cast to char*, and sizeof val. That's it. How would get work? The seek remains exactly the same. I need to seek to the same point given ix. And then I just read. So I read those four bytes, corresponding now to a float, into ans, and I return ans. Now this is a relatively sloppy implementation. If you give me an index beyond the end of the file, you'll get an error, and so on.
Now, an interesting thing is this. When I create the file, observe that the file may not have existed. In which case, when I open it, the file has size 0, and the write cursor and read cursor are both positioned at 0. What happens if the first method call I make says, put an element at position 1 million? Let's do it and see. First, let's try to compile this. So there's weird stuff going on. 9, 12. Does anyone know what's going on here? Maybe it's just some other symbols. But I thought I'd check this. In fstream the flags are in and out. In and out, not read and write. So now it's just missing a main. It's otherwise happy. So let's put in a main. Now we'll declare filevector fv, and I need to give it a name. So I say dump.bin, same old file. And let's say my size is, I don't know, 1 million. So I want 1 million elements in the vector. And now what do I do? fv.put at 500,000 the value 3.4. Let's see if this compiles at all.
First I'll remove that dump.bin file. Now I'll run filevector.exe. It finishes. There is no dump.bin. What happened there? Is there a problem? There must be a runtime failure for some reason. Let me drop this line and see if just the declaration goes through fine. There is still no dump.bin. So clearly the creation itself is messing up. Let's try the following. If we touch a file, then a zero-length file is created. It has no bytes.
It's just an entry in the directory. Does that solve the problem? Let me put this line back. It actually writes something. So just to explain, the bug in my code is that you need more flags, saying that if the file doesn't exist, you should create it. So let's add that and see if it works. I'm going to remove dump.bin again, and I'm going to compile and run this. See if the file is created this time. Yes, so that's what you need: saying that if the file doesn't exist, create it with zero length, in binary mode, for both input and output.
And now what happened here? The file has length 2,000,004 bytes. Why is that? Well, I declared the vector to be stored as having one million floating point numbers. That's actually four million bytes. But I only wrote element number 500,000, which lands at byte offset 2 million, halfway across. So the system opportunistically allocates only up to the point where it has to write, not beyond that, if you don't tell it to.
Can we hexdump it? If we hexdump it, you see that it's full of zero bytes, null bytes, up to this offset, where suddenly there are four bytes, 9A 99 59 40, which is the internal floating point representation of 3.4. So hexdump is very smart about not repeating identical lines. If it's all zeros, it shows a discontinuity. If you want to be pedantic and print all of them, let's find the flag to do that. The manual says that the -v flag causes hexdump to display all input data; without it, any number of groups of output lines which would be identical to the immediately preceding group of output lines are replaced with a line comprising a single asterisk. So that is what happens here. The asterisk means that the line of null bytes, zero bytes, repeats many times. If you don't want that, you pass the -v flag as well. And now you see that there are lots of lines with zeros in them.
So lots and lots of lines with zero bytes until you reach the 2 millionth byte, where bytes 2,000,000 through 2,000,003 are those four precious bytes that we wrote, 3.4. Now this looks terrible. I just wrote 4 bytes and 2 million bytes were allocated. But the funny part is that Unix is smart enough to create what are called files with holes in them. So this shows up as taking 2 million and 4 bytes, but actually it doesn't cost the file system that much. To find out, you can actually ask how many kilobytes of storage it takes. Let's see. So how many blocks of storage does dump.bin actually cost the file system? Just four blocks. So du stands for disk usage. The disk usage of the dump.bin file is actually very small. Once you start writing into the earlier locations, then it will start taking space. Yeah? Yeah. That will be much lower. The file will be like 24 bytes or something. It depends. It allocates a small prefix of the file. It's unclear how much. I don't really know. So suppose this index was replaced by 5. Let me delete the file again. Compile this guy. 24 bytes, with the payload being in bytes 20 through 23. Counting along this line: 16, 17, 18, 19, and then 20, 21, 22, 23. So 24 bytes in all. So what's the summary here? The summary is that we could grow this indefinitely. You could pass a size which is gigabytes. Even my tiny laptop has gigabytes of space left in its file system. You can easily store it. And this is not as absurd as it sounds, because now with large flash disks, which are somewhere between RAM and hard disk in speed, you can easily keep very, very large data structures like this on flash. It's a file on flash disk, but it's much faster than actual moving-head hard disks. So there are many applications, including modern distributed file systems, which keep a lot of their directory structure on flash for very fast access.
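The difference between the size a file reports and the space it actually costs can also be checked from a program. The following is a rough sketch, assuming a POSIX system such as Linux: stat() is a real POSIX call, st_size is the logical length, and st_blocks is taken here to count 512-byte units, which is the Linux convention. The names write_one_float, FileFootprint and footprint are made up for this illustration.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <sys/stat.h>  // POSIX only; not available in this form on every platform

struct FileFootprint {
    long long logical_bytes;    // the length that ls -l reports
    long long allocated_bytes;  // roughly what du reports
};

// Create a fresh file and write a single float at element index ix.
void write_one_float(const std::string& path, long long ix, float val) {
    assert(ix >= 0);
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.seekp(static_cast<std::streamoff>(ix * static_cast<long long>(sizeof(float))));
    out.write(reinterpret_cast<const char*>(&val), sizeof(val));
}   // the stream is flushed and closed here

// Fill fp with the logical and allocated sizes; returns false if stat fails.
bool footprint(const std::string& path, FileFootprint& fp) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0) return false;
    fp.logical_bytes = static_cast<long long>(st.st_size);
    // Assumption: st_blocks is in 512-byte units, as on Linux.
    fp.allocated_bytes = static_cast<long long>(st.st_blocks) * 512;
    return true;
}
```

With 4-byte floats, writing at index 5 gives a logical size of 24 bytes, and writing at index 500,000 gives 2,000,004 bytes. On a file system that supports holes, allocated_bytes for the second file should come out far smaller than its logical size, but the exact number depends on the file system, so the sketch does not promise one.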
There are companies which specialize in multi-level storage strategies for distributed file systems, which use tricks like this to store their data and metadata, like directory data and file data. They're partly cached in flash. When the thing is idle, they push it back to hard disk, that sort of thing. So that's pretty much that. So the summary is that we were only able to scratch the surface of computing, as was the intention. If I had intended anything more, I would have been mad. So we started with basic syntax, how to declare variables and constants, then statements, if-then-else, loops, followed by basic aggregate data types like 1D arrays and 2D matrices, looping through them, and solving interesting sorting, searching and numerical algebra type problems. Then we went into functions, how to break up your code into different modules so that things are more easily maintainable and readable, by yourself as well as others. Then we looked at recursion. We looked at techniques like dynamic programming, which can cache partial results of recursion and make things fast, and take less storage. Then we went into objects: structs, classes, members, methods. Then we looked at memory management. How do you take memory from the heap and return it when you choose to, instead of having to give it up when the scope closes? You can pass it around. One more comment I have to make, from the pointer slide right at the end. This is important. So of course, since you created a pointer to a record, you want to pass the pointer around in your application between very different data structures. In general, you want multiple pointers to the same record in an application. Otherwise, why are you doing it anyway? For example, you could have a bank program where fixed deposit accounts and checking accounts point to the same shared customer record. But you need care in destructors.
Deleting a checking account should not destroy a customer record, because the customer may still be holding a fixed deposit account. But should deleting the last account of a customer delete the customer record? Who knows? Probably not, because you might have tax reporting obligations until the end of the next financial year. So these are partly policy, but partly also program design. There are packages by which you can do memory allocation not directly from C++, but through someone else's package, which will keep track of the number of references to a piece of storage and delete the storage when the reference count goes down to zero. Now, this is not a panacea for memory management, because of the following story. So there's this famous professor at Stanford, John Bakas, I think, who used to code in Lisp because they used to do AI. And in Lisp, long before Java, there was this problem of garbage collection. If the user doesn't need some memory anymore, how do you free it back to the system without an explicit delete? See, Java doesn't require an explicit delete. It will take back memory when it can. So how to identify memory that the user cannot access anymore, and that can therefore be reclaimed, has always been an intriguing problem. So one day, a graduate student ran into John Bakas's office and said, I've found a perfect way to reclaim free memory. John Bakas patiently sat the student down and said, let me tell you a story. One day, a graduate student ran into John Bakas's office shouting, I've found a perfect way to reclaim memory. So you see where this is going. So we've already seen that we can do linear linked lists. And suppose instead of the tail going to null, I chain it back to the head. That's a circular data structure where every node has a reference count of one. But there may be no external pointer into this list. So clearly, reference counting is not enough.
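The circular-list argument above can be reproduced with a reference-counting package that ships with C++ itself. This is only an illustrative sketch: std::shared_ptr is one real example of such a package, not necessarily the one the lecture had in mind, and the names Node and leaked_after_drop are invented here. The leak in the circular case is deliberate, since that is the point being demonstrated.

```cpp
#include <cassert>
#include <memory>

// A list node that counts how many nodes are currently alive.
struct Node {
    static int live;
    std::shared_ptr<Node> next;  // reference-counted link to the next node
    Node() { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

// Build an n-node list, optionally chain the tail back to the head, drop every
// external pointer, and report how many of those nodes are still alive.
int leaked_after_drop(int n, bool circular) {
    assert(n >= 1);
    int before = Node::live;
    {
        std::shared_ptr<Node> head = std::make_shared<Node>();
        std::shared_ptr<Node> tail = head;
        for (int i = 1; i < n; ++i) {
            tail->next = std::make_shared<Node>();
            tail = tail->next;
        }
        if (circular) tail->next = head;  // the tail now points back at the head
    }   // head and tail go out of scope: no external pointer remains
    return Node::live - before;
}
```

For a linear list of three nodes, leaked_after_drop(3, false) returns 0: dropping the head pointer brings the first count to zero and the deletions cascade down the list. For the circular version, leaked_after_drop(3, true) returns 3: every node is still referenced by its predecessor, so no count ever reaches zero even though nothing in the program can reach the list.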
You have to trace all possible paths from variables in your code into the heap and see how much can be accessed. Now that is called a mark-and-sweep garbage collection strategy. You mark anything in the heap that you can access, and you sweep away the rest. That's one part of what Java uses for memory reclamation. So anyway, coming back to C++, where you have to be in charge of releasing memory, you need clean policies, well documented in large codes, so that even when you're sharing pointers between multiple collection objects, you have a very clean, reliable way of releasing the memory exactly once. For example, here you might decide that deleting the last account deletes the customer. That's well defined. Or you might say that there is a home collection which is the primary determinant of the memory allocation policy. And no matter which other data structures you're copying pointers to, those are not in charge of freeing up the memory. So far we have used vector of ints and vector of floats. What if I have a vector of int star or a vector of customer star? When the vector is destroyed, C++ will not call delete on each of the customers. If you want that, you have to do it yourself, for example by wrapping the vector in a class of your own whose destructor goes through the elements and deletes them. So you should write your program such that all but one of the data structures leave the elements alone when the collection is destroyed, whereas that one home collection, when it is destroyed, should go through every item and delete what the pointers point to. So if you keep the memory allocation cleanly separated in constructors and destructors, and you have clean policies for who is in charge, who owns the memory logically, then you'll avoid a lot of needless bugs and frustrations. So now, coming back: after pointers, we ended the course with a small treatment of file I/O.
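As a small illustration of the home collection policy described above, here is a sketch in which exactly one class owns the customer records and every other collection only borrows pointers. All names (Customer, AccountList, CustomerRegistry, ownership_demo, and the customer name "Asha") are hypothetical and chosen for this example. It shows one possible policy, not the only reasonable one.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct Customer {
    static int live;  // number of Customer objects currently alive
    std::string name;
    explicit Customer(const std::string& n) : name(n) { ++live; }
    ~Customer() { --live; }
};
int Customer::live = 0;

// Non-owning collection: when it is destroyed the pointers are simply dropped,
// because std::vector never calls delete on raw pointers it holds.
struct AccountList {
    std::vector<Customer*> holders;
};

// Home collection: the one place that allocates customers and frees them.
class CustomerRegistry {
    std::vector<Customer*> customers;
public:
    CustomerRegistry() = default;
    CustomerRegistry(const CustomerRegistry&) = delete;  // copying would double delete
    CustomerRegistry& operator=(const CustomerRegistry&) = delete;
    Customer* add(const std::string& name) {
        customers.push_back(new Customer(name));
        return customers.back();
    }
    ~CustomerRegistry() {
        for (Customer* c : customers) delete c;  // released exactly once, here
    }
};

// Returns (customers alive after the account lists die,
//          customers alive after the registry dies).
std::pair<int, int> ownership_demo() {
    int after_accounts = 0;
    {
        CustomerRegistry home;
        Customer* c = home.add("Asha");
        assert(c != nullptr);
        {
            AccountList checking, fixed_deposit;
            checking.holders.push_back(c);       // shared, not owned
            fixed_deposit.holders.push_back(c);
        }   // both account lists are destroyed; the customer survives
        after_accounts = Customer::live;
    }       // the registry is destroyed; the customer is deleted
    return std::make_pair(after_accounts, Customer::live);
}
```

Starting with no customers alive, ownership_demo() returns the pair (1, 0): destroying the two account lists leaves the record intact, and destroying the registry frees it.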
Without being at the mercy of << and >> to do ASCII-oriented I/O, we found a way to dump primitive variables to disk and read them back from disk directly, exactly like the memory image. And now if you have a composite object, for each one you have to write some routine or method which will decompose the composite into primitives, maybe recursively. So the load and store of a container object should call load and store on the contained objects. If I have a vector of ints, I know how to store ints. If I have a vector of customers, I don't know how to store or load a customer. Therefore, the store method of the vector of customers has to call the store method of each customer in turn. And they just keep on appending bytes to this random access file. And then the store of the outer guy finishes. Similarly for loads. So loads and stores are recursively cascaded through collection objects. That's how Java serialization works, by which you can pass objects not only onto disk and back, but even onto the network and back. When you do a Google query, there's a lot of quantitative data that passes back and forth between your browser and the Google server. Much of it is actually structured data, including whether your mouse dwelt on a particular link for more than a certain number of seconds. That triggers an event, and that is communicated to Google. So all this communication is done over a network. And in fact, the network looks surprisingly like a disk which has a sequence of bytes. You open a socket, you pump in a sequence of bytes. You open a socket, you read a sequence of bytes. So whatever applies to fstreams also applies, with very minor modification, to sockets over the network. So anyway, that's the end of the lectures. The course is not over by any means. There's a lot of activity left. On Sunday, there will be the lab quiz.
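Going back to the cascaded store and load described above, here is a minimal sketch of how a container's store can call the store of each element, which in turn writes its primitives as raw bytes. Everything in it is an assumption made for illustration: the CustomerRecord type and its fields, the helper names, and the byte layout (a 32-bit count or length written in front of each variable-length part). It is not a portable file format, since it depends on the machine's byte order and on the size of float.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Dump or fetch the memory image of one primitive value (not for std::string).
template <typename T>
void write_raw(std::ostream& out, const T& v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof(v));
}
template <typename T>
void read_raw(std::istream& in, T& v) {
    in.read(reinterpret_cast<char*>(&v), sizeof(v));
}

struct CustomerRecord {
    std::int32_t id;
    float balance;
    std::string name;
    CustomerRecord(std::int32_t i = 0, float b = 0.0f, const std::string& n = "")
        : id(i), balance(b), name(n) {}

    // The composite decomposes itself into primitives.
    void store(std::ostream& out) const {
        write_raw(out, id);
        write_raw(out, balance);
        std::uint32_t len = static_cast<std::uint32_t>(name.size());
        write_raw(out, len);          // length first, then the characters
        out.write(name.data(), len);
    }
    void load(std::istream& in) {
        read_raw(in, id);
        read_raw(in, balance);
        std::uint32_t len = 0;
        read_raw(in, len);
        name.resize(len);
        if (len > 0) in.read(&name[0], len);
    }
};

// The container's store cascades into each element's store.
void store_all(std::ostream& out, const std::vector<CustomerRecord>& v) {
    write_raw(out, static_cast<std::uint32_t>(v.size()));
    for (const CustomerRecord& c : v) c.store(out);
}
std::vector<CustomerRecord> load_all(std::istream& in) {
    std::uint32_t n = 0;
    read_raw(in, n);
    std::vector<CustomerRecord> v(n);
    for (CustomerRecord& c : v) c.load(in);
    return v;
}

// Store the records to a file, then load them back from it.
std::vector<CustomerRecord> round_trip(const std::vector<CustomerRecord>& v,
                                       const std::string& path) {
    {
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        assert(out.is_open());
        store_all(out, v);
    }   // closed here, so all bytes are on disk before reading
    std::ifstream in(path, std::ios::binary);
    assert(in.is_open());
    return load_all(in);
}
```

Because store_all and load_all take plain std::ostream and std::istream references, the same routines would work on any other byte stream, which is the point made above about sockets looking like files.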
I think on Sunday, we'll be able to fit four lab quiz slots, and then there'll be a makeup, separate from that, on some other day. It's all in the calendar. I don't have it in my mind right now. Suitable crib slots have been set up for all pending exams. Finals are on the 30th. And thereafter, we have the correction and crib sessions all planned out. So the important thing is, please watch the calendar carefully, because unlike earlier on in the course, if you slip up now, everyone will then leave for vacation and it's very hard to make up. So try to be there for everything that's important. There should also be another session of special tutorials to make up for all the damage. So please try to attend that if you can. The lectures have been videotaped, right up to this last one. So make sure you watch those. For the final exam, make sure you compile, run, and understand all the sample codes. There are quite a lot of them. Together they cover a lot of ground. So that's probably the most important thing to do for preparation. Read the sample codes and understand them.