So welcome to the last 164 lecture. This lecture is also known as the Evil Lecture. We are going to look at a few techniques that are used to subvert the way systems operate, despite all the good things that compilers and languages do.
Before that, one announcement. As those of you who look at the schedule know, next lecture and on Tuesday we'll see presentations by your teams. Each team will come down here and spend four minutes presenting its project. It's partly a milestone, so that you get something done by the next lecture. Part of our rationale is technical writing and presentations, which is what we want to teach you in design courses. And part of it is, of course, getting feedback from others on what else you could do, how to do it in a simpler way, and so on.
OK, so let's start with the first part of the lecture. You may remember... was there a question? So essentially, by now, you probably want to present what you have in your design documents. In the design document, you need to have examples and essentially a little tutorial on how to program with your language and how to implement it. That's really the stage at which you are. On Tuesday, you may want to have something more done, but the design document is roughly that milestone.
So, back to the lecture. You may remember that when we talked about implementing object-oriented languages in Lua with just closures and tables, or closures and dictionaries, we were able to create an object that could protect a secret field, say, a password. And there was no way to read that password out. So let me remind you what that was. You could call a function like SafeKeeper, which receives a password from you and gives you a variable that points to some sort of object. You don't even need to know what sort of object it is. The point is that it will do password checking for you.
Now when you invoke the checkPassword method on this object and supply what might be the correct password, the password is checked inside and true or false is returned, depending on whether the password is correct. So you do have a pointer to this object in this variable, safekeeper. And my claim is that you cannot read the password out from it.
So here is the implementation. SafeKeeper is a function which does what? It takes the password and stores it in this local variable, which is part of a closure. And it also defines a method that checks the password. What does this method do? It receives the password that is the guess and just compares it with the password stored here in the closure. So here is the checking method. Here is where we keep that secret password that is not accessible. And then what do we do in here? We return the actual object, which is a table. That object has nothing inside except one method, this one here, so that you can invoke this method on a variable pointing to that object.
So is it true that you cannot really read that variable out of that SafeKeeper object? Or under what conditions could you? OK, excellent. So you would override this, the double equals, with your own function. And that function would be called whenever you do a check. During the execution you cannot peek inside the checking function, but you can hijack double equals. Excellent. So under that assumption, you could read it out.
Are there other ways we could read it out? Suppose the language gave you some method which takes a function, this one here, accesses its environment, which is a chain of frames, walks those frames, and prints out all the variables. Under that condition, you could print out this password. So you see now what the language needs to offer, or actually what it must not offer, for this sort of code to be safe.
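The slide code for SafeKeeper is not part of this transcript, so here is a minimal sketch of the same idea in Python rather than Lua. The names safe_keeper, check_password, and Spy are made up for illustration. It is chosen because Python happens to offer both loopholes discussed above: the guess object can supply its own equality, and a function object exposes its closure cells.

```python
def safe_keeper(password):
    """Return an 'object' (a dict with one method) that hides password in a closure."""
    def check_password(guess):
        # The comparison is where an attacker-controlled == can intervene.
        return guess == password
    return {"check_password": check_password}


keeper = safe_keeper("hunter2")  # hypothetical secret, for illustration only


class Spy:
    """Attack 1: a guess object whose equality records what it is compared with."""
    def __init__(self):
        self.seen = None

    def __eq__(self, other):
        self.seen = other  # the secret is handed to us as the other operand
        return False


spy = Spy()
keeper["check_password"](spy)

# Attack 2: walk the function's environment. Python exposes the closure
# cells of a function through its __closure__ attribute.
leaked = keeper["check_password"].__closure__[0].cell_contents
```

After this runs, both spy.seen and leaked hold the password. A language in which this object is actually safe has to forbid both user-defined equality on the guess and any reflective access to a function's environment.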
It may seem like we are discussing a hypothetical scenario, but when you program in JavaScript, where part of the code running in your browser is, say, Gmail, and part of it comes from a potentially untrusted website, exactly these sorts of questions arise. Can you override a method with your own and then read cookie values or something? And, by the way, you can.
So here is the puzzle. We create a safekeeper. And I will tell you, the attacker, that you can paste any code in here. I will even give you a reference to that safekeeper object. And then I call your function. The question is, what should that code do in order for you to print out that value? This is what we just discussed. If that code overrides double equals, you can do it. If the function happens to walk through the environment, you can get it. But otherwise, you can't. So this is what we just went over: what must not be allowed for the password to remain safe in that object?
You can do the same in Java. It's not important to go over it now. So this is again Java and what we would forbid. What is frustrating to the attacker is that he holds a pointer to that object. In our case, it's a table that has a pointer to a function, which stores the value in its environment. But in Java, this would be what? In Java, we would have a private field that keeps that value. So we would give the attacker a pointer to a Java object which has a private field. A slightly different architecture, but the same idea. And what's frustrating about the situation is that the attacker has a pointer to that object. The attacker even knows where that object is in memory. It is at the address held in that variable, which we are happily providing, plus an offset of 16 bytes, for example.
Suppose I give you a pointer to a Java object, here is some address A, and here is that private field which I'm not exposing to you. Let's say we even tell the attacker that the secret value is at the beginning of the object plus 16 bytes. We are even happy to provide that information. And still, the attacker cannot read the value out of that field. So why is that? What constructs are missing in the language? If we were in assembly, it would be trivial. You have an address, you know the location. One load instruction would do it. So why can't we do it in Java?
OK, so you cannot do pointer arithmetic. You cannot take a pointer in a Java program, which is a reference to an object, and add a number to it. Something like p = new Foo(); p++ is not allowed in Java. You cannot manufacture your own addresses. This way, you are only ever pointing to objects that were allocated rather than to some other parts of memory.
What else can't we do? We actually don't need to manufacture addresses in the Java case. Our Java object is a plain class where the password is a private field. We do have a pointer to the object, and it points to the object where the private field is at A plus 16. Can't we just write a piece of code which says print safekeeper.pass, the private one? Wouldn't the compiler compile it to the appropriate code, which is take the address in A, add 16, and then load the value?
Why can't you compile it yourself, by first removing the private modifier? Why can't I manufacture a piece of machine code, right? That's what you mean: by hand or with a compiler, and then somehow link it into the executable. So this is another assumption that we need to make for our systems to be safe. You must not be able to somehow manufacture a piece of machine code and redirect the execution to it. If you can, of course, you have access to all of memory, even where the secret values reside.
But if you stick to Java and you rely on the Java compiler to compile Java expressions to executable code, then this will never be accepted by the compiler, because the private field is private. This is an illegal piece of code, and it will just not be compiled. And therefore you do not have access to that field no matter what you do.
Looks like there was another comment from you. Well, you can edit the bytecode, right. But the bytecode that you ship to the Java virtual machine is verified by the so-called bytecode verifier, including for properties like access to private variables. So the attacker has all the information, but there is no way for the attacker to manufacture a machine instruction that would actually be executed and read that location A plus 16. We all found various exploits for how it can be done, but they involved things that are easy to forbid. Don't link machine code. Do not allow overriding of globals like double equals, and so on.
So this was our discussion. And essentially it boils down to three things. You cannot in Java do this unless A extends B. You cannot write past the boundaries of an array. And you cannot manipulate pointers through arithmetic. You just cannot obtain a pointer to an address that is not of a suitable type.
Unless, of course, the computer breaks. If you could somehow manufacture an error in the computer, then no matter what the language designers and implementers did, you get unprecedented control and you could bypass all these checks. So let's see how this is done. This comes from a paper from Princeton. What you see here is a desktop with a lamp doing nothing, just shining light and heat. This is now at 54 degrees centigrade. So not that hot, but you can make it hotter. And when that happens, you flip, if you are lucky, a bit in memory, because the memory is so hot that it becomes less and less reliable. And you flip just one bit.
You can say, well, what can one flipped bit do besides crashing the program? It turns out these guys figured out how to turn this bit flip into an exploit. Meaning, all of a sudden, you can make the program do what you want, including reading out those secret locations.
Why is this interesting? Because if you have a smart card, sort of a credit card with, say, a little Java virtual machine on it, then it runs your programs. I don't know, a little browser, little electronic currency kind of stuff. Suppose you could get a little bit of your program to execute on it. How? Well, the browser can download it from the web. Now you have your program on this card. Then you put it under a lamp. You heat it up. And you take control of that card.
So the full scenario, of course, would be that you swipe the card from somebody, meaning steal it, not swipe as in erase it. Then you go on the web and download the program onto that little card. You put it under a lamp. And the program which you downloaded from the web will now allow you to read everything that's in the memory of that card, even though it's supposed to be nicely protected. Of course, they build those cards better than just running a Java virtual machine on them. But if all the card did was run a JVM, you could do what I'm about to describe and get access to a stolen smart card.
All right. So first, what is a memory error? You have on the left a variable holding an address. Imagine it points to the memory locations on the right. Now if a bit flips, imagine that bit 3 is flipped in that pointer, the effect is that the pointer points to a different part of memory. In case it's not clear what the bit flip does, it does this. And that's exploitable. Now we need to figure out how to make it exploitable, because it looks innocuous.
All right. So the attack has two steps. The first step is analogous to what you would do in C to manufacture a pointer of any type. How would you do it in C?
You would create a union type in which the same memory location holds a pointer as well as an integer. So you would have two names for that location. One is an integer variable. The other one is a pointer. You write whatever you want into the integer name. And then you read it from the pointer variable, which is in the same location. Except now that integer is a pointer. You can use it to access anything in memory.
We cannot do quite that in Java because we don't have unions. But a lot of things can be solved with a little bit of indirection, and with it we'll do essentially the same. In Java, it looks like this. We'll create two variables, p and q. And we'll set things up in such a way that the pointers in p and q will be the same. They'll point to the same location. This is like having a union where the location is both an int and a pointer, except here we'll have two variables pointing to the same location. So whereas in C you can manufacture the union statically by writing the code, in Java we'll do it through this bit flip. We'll make these two pointers go to the same location.
What is important is that p and q will have incompatible custom types. Those types will be designed in such a way that at the same position one of them has a pointer field and the other one an int field, just like in that C union. And once we get these pointers aligned, we have a field that can be accessed under two different names, one an int, one a pointer. So let's see how this is done.
And of course, once we have done step one, step two is easy. Now we can do whatever you could do with a manufactured pointer: read any part of memory, write any part of memory. You can overwrite a virtual lookup table and make the objects jump to your piece of code. At that point, the world is yours.
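The C union trick described above can be imitated from Python with the standard ctypes module, which builds C-compatible unions. This is only a sketch of the aliasing idea: the field names as_int and as_ptr are invented here, and the forged pointer is never dereferenced, because 0x4020 is an arbitrary number and not a valid address.

```python
import ctypes


class IntOrPointer(ctypes.Union):
    """One memory location with two names: an integer view and a pointer view."""
    _fields_ = [
        ("as_int", ctypes.c_size_t),   # write any number here...
        ("as_ptr", ctypes.c_void_p),   # ...and read it back as a pointer
    ]


u = IntOrPointer()
u.as_int = 0x4020      # arbitrary value, chosen only for illustration
forged = u.as_ptr      # the same bits, now typed as a pointer

# Both members overlap, so the union is no larger than a single pointer
# on platforms where size_t and void* have the same width.
same_size = ctypes.sizeof(IntOrPointer) == ctypes.sizeof(ctypes.c_void_p)
```

In C, nothing stops the program from dereferencing the pointer view afterwards, which is exactly the power the Java attack has to obtain through the bit flip instead.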
So what we could do is, say, fill a block of memory with the desired machine code and overwrite the dispatch table, and when we make a call to that function, it will not go to the original function but to this custom piece of machine code.
Now, let's do step one. Let's have a look at these two custom types. The classes A and B are not related to each other in terms of inheritance. And they are exactly the same, except that here there is a pointer to B, so this will point to that. And at this location, these two are going to be aligned after we get the pointers to point the right way. We'll use this b in step one and we'll use this one in step two. It's hard to read out from the definition of these classes how they are used, unless you are really clever. But see what they have. These have just a bunch of pointers to A, so they will point to an instance here. And the other one has this b here, that's enough. And here is this int, which will be aliased with this a5.
All right, step two. Let me tell you what step two does first, because it shows where we want to go. It shows the exploit once you have those manufactured pointers, manufactured through the bit flip. So first the offset. This is the offset of the i field in A. Let's assume it is at byte 32 from the top. Now these are the two pointers that we'll be working with, p and q. We'll show you soon how the first step makes them point to the same location. Assume that they both point to an object that is of type A.
So here is the write code, but before we look at what the code does, let's see what it does graphically. We have p and q pointing to an A object. The A object is the one which has this b here, used in the first step. And here is this int. Here is that funny location, which is our union, so to speak. You can refer to this location in two different ways. As p.i, where it's an integer, so you can write a value into it. Or you can refer to it under the name q.a5.
So when you refer to it under the name p.i, you can write anything into this field. When you refer to it under the name q.a5, you can use it to access the locations over here, because this presumably is a pointer to some object. Or the generated code, so to speak, thinks it's an object.
So let's see what we do. Imagine that we want to read something at this location, 4020. We'll use p.i to write 4000 here. Why 4000? Well, we need to subtract the offset, 20, which is 32, the same as 8 times 4. This is in hexadecimal. Okay, so we subtract that. Now we use q.a5. Actually, sorry, this is q.a5.i, of course. q.a5 is this. It points to what the code thinks is an object of type A. And now reading this expression, q.a5.i, will give us this value.
So how do we do a write? Well, if I give you an address and a value to write into it, you take the address, subtract the offset which we had here, and store that in p.i. And you write the value into q.a5.i. So q.a5.i is here. Okay, so far so good. Of course, this is the easy stuff. What was that? So step one is the harder step, right? Yeah, this is the easy stuff. This is like in C, except we have to deal with Java objects and the fields in them.
So let's look at step one, okay? You don't know where the bit flip happens. So what do you do? You fill up the entire heap, or as much of it as you can, with objects that are waiting for the flip to happen. We will create one A object, it's here, and a bunch of B objects. There will be more B objects down here. So we'll create this setup. We could create gigabytes of that, if you will.
Okay, now what do we do? We will keep executing a simple piece of code and checking whether the flip has happened. If it hasn't, then we'll execute the code again and again and again until we detect that the flip has happened. And then we have the two pointers and we can do the exploit that we have seen. Okay, so what is that piece of code? Here is this piece of code.
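As an aside on step two, the write described above (store address minus offset through p.i, then store the value through q.a5.i) can be simulated with a dictionary standing in for memory. This is a sketch under assumptions: the slide's Java code is not in the transcript, the FakeHeap class is invented for illustration, the 32-byte offset is the one assumed in the lecture, and shared is the address of the one object that both p and q are taken to point to.

```python
I_OFFSET = 32  # assumed byte offset of field i in A (and of a5 in B)


class FakeHeap:
    """Simulated memory in which p and q alias the same object."""

    def __init__(self, shared):
        self.mem = {}         # address -> stored word
        self.shared = shared  # address held by both p (typed A) and q (typed B)

    def write(self, address, value):
        # p.i = address - offset: plant a forged pointer in the shared slot.
        self.mem[self.shared + I_OFFSET] = address - I_OFFSET
        # q.a5.i = value: follow the forged pointer, then add the offset of i.
        forged = self.mem[self.shared + I_OFFSET]
        self.mem[forged + I_OFFSET] = value

    def read(self, address):
        # Same forging step, followed by a load instead of a store.
        self.mem[self.shared + I_OFFSET] = address - I_OFFSET
        forged = self.mem[self.shared + I_OFFSET]
        return self.mem.get(forged + I_OFFSET)


heap = FakeHeap(shared=0x1000)
heap.write(0x4020, 99)
planted = heap.mem[0x1000 + I_OFFSET]  # 0x4000, as in the lecture's example
```

The subtraction and the later addition of the offset cancel, so the store lands exactly at the requested address, here 0x4020.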
Let's first see what it does in the normal situation, without the bit flip. The original pointer points to this B. Now where will tmp1 point to? Well, it will clearly follow this pointer, and so it will point to that, okay? And what will the bad pointer point to? Well, it's tmp1.b. It'll be here, so it'll point to this. This is the good execution, remember, where no bit flip happened.
So let's look at what happens when a bit flip occurs. Imagine that it is this pointer here which is flipped. And let's assume it is flipped in such a way that it points right to the top of the object. It could point to another location. It turns out that it doesn't matter that much, because we have filled these classes with those pointers. So now, after the bit flip, we execute this piece of code. Original still points to the same location. Okay, so here is the bit flip. Now where does tmp1 point to? tmp1 now points not here as before, but to this object.
Are we done? It looks like we now have two pointers. One is of type B, one is of type A. So they have different types, and they point to the same memory location. Could that be used for the exploit, or do we need one more thing? It turns out that in their exploit they use this bad pointer, which should be of type B but points to an object of type A. So after the sequence is executed, what you have is a pointer that should be pointing to a B object and now points to an A object. All thanks to the bit flip.
So you can look again at what happens in the normal situation. This is the normal one, where everything is fine. And with the bit flip here, the original is fine of course. This one is bad, but we'll manufacture the bad pointer and use that to do the exploit. We'll see that on the next slide. Now how do we detect whether the bit flip happened?
We'll just keep going over all these objects in a loop and checking whether, after executing the sequence, we have these two desirable pointers. So here is this piece of code. You iterate forever, okay? You start, of course, with p pointing to the single A object, which is at the top. And now you go through the entire array of these Bs, there is a ton of them. And in each iteration you do what? You get the original one, then tmp1. And q is the bad one. Now what do you do? Well, you want to compare p and q. How do you do it? You assign them to o1 and o2 and just do the comparison. If they are the same, you have found these two pointers and you are ready to do your exploit here. It's as simple as that.
Uh-huh. Okay, so let's assume that this red one, the flipped pointer, points to, say, here. Let me actually assume it points to here, right? So now the code will assume what? It will assume that this thing here is an A object and this is the A header. Now let's try to execute the code. The original points to this B, okay? Where will tmp1 point to? It will do original.a1. So tmp1 will actually point to this location. And now where will bad point to? We'll do tmp1.b. So the third element from here, because it is one, two, three, it's here. So we'll read this pointer. And where is this pointer set up to go? It is set up to go back to here, because all these fields down here are pointing to the top object A. Therefore, almost no matter where that bit flips, you will be somewhere in memory where there is a bunch of pointers to A pointing to the top object, and you will get it here.
So maybe the example should be rewritten so that this red thing here does not point here, because that confuses things. I'll change it tonight so that the bit flip happens the way you are suggesting. I'll rewrite it so that it actually points to here. It depends on what you overwrite.
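The detection loop described above can be tried out on a simulated heap. Everything concrete in this sketch is an assumption made for illustration and not taken from the slides or the paper: the addresses, the 16-byte header, the 4-byte fields, the 64-byte object size, and the choice of bit 13 as the flipped bit. The real attack runs as Java code and waits for heat to do the flipping.

```python
HEADER, WORD = 16, 4          # assumed object header size and field size
A1_OFF = HEADER               # first field (a1)
B_OFF = HEADER + 2 * WORD     # third field: b in class A, an A-pointer in class B
OBJ_SIZE = 0x40               # assumed spacing between sprayed objects
A_ADDR, B_BASE, NUM_B = 0x1000, 0x2000, 128


def build_heap():
    """One A object plus many B objects whose fields all point to the A object."""
    mem = {}
    b_vec = [B_BASE + k * OBJ_SIZE for k in range(NUM_B)]
    for obj in [A_ADDR] + b_vec:
        for field in range(7):
            mem[obj + HEADER + field * WORD] = A_ADDR
    mem[A_ADDR + B_OFF] = B_BASE   # A's b field points to a real B object
    return mem, b_vec


def scan(mem, b_vec):
    """Run the orig / tmp1 / q sequence on every B; return the index where p == q."""
    p = A_ADDR
    for index, orig in enumerate(b_vec):
        tmp1 = mem[orig + A1_OFF]     # A tmp1 = orig.a1
        q = mem.get(tmp1 + B_OFF)     # B q = tmp1.b (None models an unmapped read)
        if q == p:                    # two differently typed names for one object
            return index
    return None


mem, b_vec = build_heap()
before = scan(mem, b_vec)             # no flip yet, so nothing is found
mem[b_vec[5] + A1_OFF] ^= 1 << 13     # simulate one flipped bit in a pointer
after = scan(mem, b_vec)              # the loop now finds the corrupted B
```

With these numbers the flipped pointer becomes 0x3000, which is the start of another sprayed B object, and its third field leads straight back to the A object. A flip that lands outside the sprayed region simply fails the comparison in this model, whereas on real hardware it may crash the program.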
So the question is, what if the bit flip happens in some undesirable location, like the header? Well, it depends on what is overwritten in the header. If it is a pointer to a virtual table and you happen to make a call, then it would just crash. And indeed this is what happens. They were able to get about a 70% success rate. They would flip those bits in software and observe how many of them led to an exploit. And 70% of course is a pretty good rate, right? So it could be that if you flip a bit in an object header, you destroy the program. More likely, if you flip some low-order bit right here, the pointer will not point to a word boundary but somewhere inside a word, and of course then it doesn't work either. But if it doesn't work on the first bit flip, then you try again.
So what are the lessons? The lesson for hardware is that if you have hardware that needs to be secure, you want to have error-correcting code in memory, so that bit flips are caught in hardware rather than exposed to the software this way. Another lesson is that if security is what matters, then all these assumptions need to be really explicit. No overwriting of memory, no access to the frame pointer, and so on. Because otherwise it's really hard to see and prove what safety properties you have in your system.
Okay, other questions about the bit flip exploit? One reason I teach it is that now we know how the various relaxations and features in languages can lead to exploits. Another is that it really shows you that type systems are actually pretty strong under certain assumptions, and you can rely on them not to give people access to your secret data.
But now to something completely different. Does anybody know who this guy is? Ken Thompson, who was a Berkeley grad. We've seen his algorithms already. The compilation of regular expressions into automata, the syntax-directed translation, right?
When we walked over the AST of a regular expression and compositionally built an automaton, that was his algorithm. So he won this Turing Award thing, and in '83, as is customary, he gave a lecture, a technical fun lecture, as part of the ceremony. What he says in this lecture is that, well, he worked on Unix, and he was able to set up the Unix compiler in such a way that it would forever perpetuate, in the login program, which checks your username and password, a check of whether he is the one trying to log in. And if he is, it would give him superuser rights on that machine. So on pretty much any machine in the world where Unix was installed he could just log in, without a password of course, because the login program was compiled in such a way that it would just let him in. At that time it was perhaps 12 years since Unix was created, and he checked after those 12 years, maybe more, and he still saw that Unix machines had that exploit in them. Nobody was able to catch it.
So let's talk about how he did it. We'll have two preliminary stages to explain how he made that possible without being caught. Let's start with stage one, which is a warm-up. What does this program do? Any idea what the program does when you run it? So it prints, well, clearly it prints this, and then it prints this character list here, and then it prints the character list again, right? So essentially it's printing s twice. That is a correct statement, but it's a rather low-level view of the program. What does the program do, in one word? It prints itself. And printing itself of course means that you can then recompile the output and run it, and what it will do is print itself. And if you recompile that and run it, it will print itself again, forever, okay?
So let's see how it prints itself. When you run the program, what will be printed here? Well, this will print this part here, including the newline, right? What will this print here print?
Agree or not? So percent d, which is here, prints an int. So it would print a number, 12 for example, okay? So strictly speaking this program, the first time around, doesn't print itself. What does it print? It takes this string, which is a string of characters, and prints it in what notation? Rather than in quotes, it prints it as numbers, right? So the first time you compile this and run it, the program that you get on the output would have, instead of these characters, something like 32, I believe that is the space, zero is, I don't know, 50, 32. A numeric representation of those ASCII characters rather than the literal representation of the characters. But it shouldn't matter, right? After that it will always print the numbers again. And this is of course easier to write, because if I am to write this self-reproducing program, I just take the body and create this array of characters in the quoted literal notation. Next time it prints it as numbers, but that's okay. After that it will always print the same thing, all right?
So what is printed here? Well, at this point what we print is essentially the rest of the code. Pretty clever, right? So this is clearly not enough to do that exploit of logging into all the machines of the world. You'll need a little bit more. But first, perhaps, I'll let you peek at this. The key idea at a high level is that the bulk of the program is printed here in the last statement. This s here prints itself. But you somehow need to put, you could say, the blueprint for printing the code into this array, and this is done in here. It is essentially printed out character by character. This is sort of the DNA, if you will, that always stays in the program. And the stuff here is just a little setup. So it is a little mind-bending, but worth knowing.
So stage two. Let's start looking at a compiler. A compiler can compile itself because it is written in C, right? So assume the compiler is in one file, cc.c, okay?
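Returning briefly to the stage-one program: the C source from the slide is not in this transcript, so here is the same idea as a small Python sketch. The string template plays the role of the DNA, containing a hole into which its own quoted text is printed. The helper run and the variable names are invented for this illustration. Unlike the C version described above, this one prints the string in quoted form rather than as numbers, so it already reproduces itself on the first run.

```python
import contextlib
import io

# The "DNA": the program text with a %r hole where its own quoted text goes.
template = 's = %r\nprint(s %% s)'

# Full text of the self-reproducing program, built with the same substitution
# that the program itself performs when it runs.
source = template % template + "\n"


def run(program_text):
    """Execute program_text and return everything it prints."""
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        exec(program_text, {})
    return captured.getvalue()


first_generation = run(source)             # output of the program
second_generation = run(first_generation)  # output of running that output
```

Here source is a two-line program: an assignment to s followed by print(s % s). Each generation's output is identical to the text that produced it, which is the property the compiler trick later relies on.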
And suppose you compile it on, say, an x86 machine, okay? This is cc.exe, the executable of the compiler. What do you get? Well, you get cc.exe on the output, right? So this code here gets compiled into that by this compiler. If you have a cc.exe on ARM, say on your phone, and you feed it this source, you get cc.exe. This one will run on x86, this exe is an x86 exe. This one will run on ARM. So in that sense, the compiler is portable. If you compile it on this machine here on the left, you get the same compiler, except you can run it on one processor. On the right, you can run it on some other processor.
Good, I hope I know what you are going to say. Oh, I see, okay. How do you get the first one? How would you get the first one without, I don't know, referring to divine beings which somehow manufactured the compiler for us? You write it in assembly and you slowly grow it, okay? This is indeed how the C compiler was written. You write it either in assembly, or you have some other language like Pascal on that machine, and you write a tiny, tiny, tiny compiler of C in Pascal. Then you compile it, and this is where you get the first one. The first C compiler, what does that mean? It's a C compiler that accepts C code, any C code, including a compiler written in C, for which it outputs a compiler. At that point, of course, you throw away the Pascal code and rewrite it in C, and it will start compiling itself, and you keep adding features.
Now imagine you don't have arrays in the compiler, you only have scalar variables. Of course, you cannot implement arrays using arrays, because you couldn't compile that on the existing compiler. But you implement arrays with scalars and then you compile it, and now you have a compiler that understands arrays. At that point, you could throw away your implementation of arrays, or at least simplify it by using arrays in it, and so on. It's called bootstrapping. You would need to do pretty much the same thing.
Okay, this is what I think you were really asking. I neglected one important thing, which is that the executable I create here runs on ARM and this one runs on x86. These are binaries that you can run only on the appropriate processor, right? That's fine. But what does this binary do? It reads some C code, maybe hello.c, and outputs hello.exe, and the same here. This is what compilers do. Forget the fact that there is linking, it doesn't matter. In what instruction set are these exes? In other words, when I compile hello.c here I get hello.exe, and when I do the same here, I get hello.exe. I did the compilation on different platforms, on my laptop and on my phone. In both cases I get something on the output.
You would expect to get the same output, right? Because what did I do? I took a C program, I compiled it here, I compiled it here. If I take a C program and compile it on two machines, it should still give me an executable doing the same thing. So if I give it hello.c here and hello.c here, it should produce exactly the same string of bits, okay? Is this making you think?
So think of it this way. Somewhere at the end of the C compiler is an instruction generator, right? Something that walks the AST the way you know and spits out assembly code. Not very different from the bytecode that you generated in PA2. For what architecture will that machine code or assembly code be? Typically it is the architecture the compiler runs on, but in our setting not necessarily. Suppose the code generator here generates SPARC instructions. How many of you have heard about the SPARC architecture? It's just another architecture that still exists out there. If the code generator, when walking the tree, spits out SPARC instructions, then this hello.exe here will run on SPARC and this one will also run on SPARC. Because the code in cc.c has built into it a code generator that translates ASTs into SPARC code.
So yes, the compiler is portable in the sense that we can run it on ARM and we can run it on x86, but the code that it generates is SPARC code. It's essentially a cross compiler, meaning it runs on one platform and generates code for another. If we wanted to teach it to also generate some other code, then we would really need to extend the code generator to spit out other instructions. But let's neglect that. What we see is that the compiler can be moved to another platform by just recompiling.
All right, so let's look at a very simple portable feature. Why do we care about portability? Because the way Ken Thompson planted the exploit into the compiler was in a way that sort of ports itself, that propagates itself, and this feature is the simplest one to explain it on. There will be a stage three, by the way.
So let's see how the compiler translates these escaped literals. This is a piece of code that you should recognize, right? The compiler reads the input string one character at a time. If the character is not a backslash, then it just outputs the character. But if it is a backslash, then what do we do? We translate it into the single byte that corresponds to this escape. So if on the input I find backslash n, then the compiler needs to translate this into a single byte, 0x0a, right? Two bytes on the input are translated into one byte, and this is how you would write it in your C compiler. You read one character and you see, oh, this is a backslash. Let me read the next one. Oh, it's an n. You return this, because you know it corresponds to one byte.
So far, so good. Of course, the trick is that on some platforms this backslash n, which is newline, is 0a, and on some others it is 15. And you don't want your C compiler to be littered with all those case statements. Am I on an ASCII platform or an EBCDIC platform or a Unicode platform? You would like your compiler to be as clean as what you see on the left.
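The escape-handling routine described above is on a slide that is not in the transcript. A minimal sketch of the same logic, in Python rather than the C of the actual compiler, could look as follows. The function name translate_literal and the table ESCAPES are invented for illustration, the newline value is the ASCII one mentioned in the lecture, and the input is assumed to contain only ASCII characters.

```python
# Byte values for escapes on an ASCII platform (0x0A is the value from the lecture).
ESCAPES = {"n": 0x0A, "\\": 0x5C}


def translate_literal(text):
    """Translate the characters of a string literal into output bytes."""
    out = bytearray()
    chars = iter(text)
    for c in chars:
        if c != "\\":
            out.append(ord(c))       # ordinary character: copy it through
            continue
        nxt = next(chars, None)      # backslash: look at the next character
        if nxt not in ESCAPES:
            raise ValueError(f"unknown escape: \\{nxt}")
        out.append(ESCAPES[nxt])     # two input characters become one byte
    return bytes(out)


# The raw string keeps the backslash, so the input has four characters: a \ n b
translated = translate_literal(r"a\nb")
```

Note that the table hard-codes 0x0A, which is exactly the platform dependence the lecture wants to avoid spelling out in the compiler's source.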
Meaning, if you encounter backslash n, you output whatever backslash n is on that architecture. Isn't that clever? You write one source code for a C compiler. You write the translation of backslash n to be this one byte, but on some machines this will be 0x0a and on some it will be 0x15. So how do you do that? Okay, can anybody suggest how you do it? You wanna teach the compiler what backslash n means on a particular machine, and teach it in such a way that it will sort of perpetuate this knowledge. Sort of like in the stage one self-printing program, where we had this string which was sort of the DNA of the program. Whenever the program printed itself, it also printed this DNA so that it could print itself again. We want to do it in a similar way. Any suggestions how we do it? Okay, so first step: it's best explained by adding another backslash character. So imagine we wanna do backslash v, which is, I don't know what it is, something, right? We wanna support backslash v, and this will just make it easier to explain. So we wanna add backslash v. We clearly cannot use backslash v in here, because the compiler doesn't yet know what backslash v is, and so programs that have backslash v on the input are gonna be programs with an error. So you would like to write this code, but the compiler doesn't actually understand this yet. So you'll write it this way. This is your compiler. This is the same story as I told you about the arrays. You wanna support arrays. You first need to write a compiler that can compile arrays, and you write it in the subset of C that does not have any arrays, because that's the subset you can compile. So this is your compiler at first. It can read a program that contains backslash v. Why? Because you find a backslash here, you find v here. And it can even generate the right byte, because you are teaching it that it's 11. Backslash v on this platform is 11. What do we do next?
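The first version of the escape routine described here might look like the following C sketch. The function name translate_escape and its exact shape are assumptions, since the slide's code is not in the transcript; the value 11 for backslash v is the one the lecture uses for this platform.

```c
/* Stage 1 (hypothetical sketch): written in the C that the existing
   compiler already accepts. The argument c is the character that
   followed a backslash in the input.
   '\n' is already known to that compiler, so it can be written as is.
   '\v' is not known yet, so its code is spelled out as a number;
   another platform would need its own constant here. */
int translate_escape(int c) {
    switch (c) {
    case 'n':  return '\n';
    case '\\': return '\\';
    case 'v':  return 11;   /* taught by hand, once per platform */
    default:   return c;    /* unknown escape: pass the character through */
    }
}
```

Compiling this with the old compiler produces a new compiler binary that accepts backslash v in its input, even though the old compiler never did.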
Well, now we can take this out and edit it into backslash v. Why? Because we already have an executable of the compiler which understands backslash v, and therefore it can compile itself even if this is backslash v rather than 11. We have grown the language to understand backslash v. How do we do it on other platforms? Well, you need to do this on every platform. You go to the other platform and you teach it there, okay? On this platform it is not 11 but 27. So you have grown the executable compiler with a different constant on each of those machines. Yeah, so initially you need to put 11 here and 27 there, which is not a portable way, because if you have 50 platforms, you do 50 edits. But once this is done, you rewrite all of these with backslash v and you have a single source code with backslash v in it, which on each platform is translated into a different value. So let's look at how it works. The first source code, cc.c, contains 11. You compile it and now you obtain a compiler which speaks a bigger version of C. It also speaks backslash v. And now each time you recompile, recompile, recompile, you can use backslash v in the language. You can add more features, but all these compilers here already understand that backslash v on that platform maps to 11. Uh-huh, okay, good question. So where is that 11 hidden? Clearly it is not in the source code, because we have removed it and we are now using backslash v to describe what the compiler should output when it encounters the backslash character followed by the v character. So where is that knowledge, 11, kept? We said we are teaching the compiler. Where is that knowledge kept? Well, it is here, okay, in the initial program. Yes, this is how we teach it. But this program we have actually thrown away, or you could say we edited it into this program.
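After the edit, the same routine could read as in the C sketch below. As before, translate_escape is a hypothetical name rather than the lecture's actual code. The point to notice is that no numeric constant for backslash v remains in the source; the value is supplied by whichever compiler binary builds the file.

```c
/* Stage 2 (hypothetical sketch): the routine after the edit.
   The argument c is the character that followed a backslash.
   The source no longer mentions any number for '\v'. The compiler
   binary that builds this file already knows the value and copies it
   into the next binary. */
int translate_escape(int c) {
    switch (c) {
    case 'n':  return '\n';
    case '\\': return '\\';
    case 'v':  return '\v';  /* value comes from the compiling binary */
    default:   return c;     /* unknown escape: pass the character through */
    }
}
```

The same file can now be shipped to every platform unchanged, as long as each platform already has a compiler binary that was taught its own constant once.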
So after this compilation step, we have edited this into that and removed that one. So if you think of this as a point in time, and you take a snapshot at this line and look at the source code, nowhere in the source code can you find 11. So where is that 11 kept? When you compile this, you obtain an executable which is going to translate backslash v into 11, yet 11 is nowhere to be found in the source code. Okay, where is it? It's in the binary. It's in the binary, exactly. So in this binary here, somewhere there is 11. Where? Essentially there is a piece of machine code which checks, oh, is this next character v? Then output 11, okay? And you see, if this binary is compiled again and compiled again, how come this 11 is going to be preserved? Because the compiler again will translate this backslash v into 11. Why? Because the 11 is in it and it will stay 11. Well, the question, if I can translate it, is: is this a good software engineering practice? Meaning somebody deletes your binary and the knowledge that backslash v on that platform is 11 is gone. That could be a problem, but this amounts to saying, we'll remove all instances of the C compiler on this platform from the entire planet. If all of a sudden you don't have a C compiler, only C source code, can you imagine that? It's hard to imagine that somehow you would lose all the compilers, because then you would not know how to compile any source code. So this just doesn't happen, right? It cannot happen. You always need that C compiler to bootstrap everything. So now the real fun, okay? Now, how he took over the Unix operating system. For simplicity, assume that this is the part of the C compiler which reads the source code and compiles one line, right? We know that compilers don't compile line by line: they parse the input, create an AST, then process the AST, and eventually they spit out machine code. But let's assume that they do it line by line. Okay?
So here is the line; you process it and then you emit code. You can extrapolate how this would work on the AST. And what's important to know is that the Unix utilities like login, which is the program where you enter your name and password, are also written in C. That's the big innovation of Unix. But of course, he who controls the compiler... well, we'll see what happens next, okay? So what if you extended your compiler routine with something like this? You say, oh, is the line that I'm compiling following a particular pattern? If yes, do not compile the pattern itself; compile something else. So if you look at what this does, it translates pattern to bug. It does a rewrite. It does nothing more than look in the source code for a particular piece of code. If it finds it, it says, oh, I'm gonna rewrite it into something else and then I'll compile that instead. So far so good? Okay, so what would be a good pattern and what would be a good bug to put here? Now you need to think like a Turing Award winner. With some hints you should be able to do it. Okay, what will be the pattern? Okay, so you could essentially suppress the compilation of the entire code, but I think you wanna inject something. Exactly, you wanna inject it. So what would we inject? Exactly, so here is the code that he used. Here is the original code. This is a piece of code from login.c. What does it do? Well, you type in your username. It goes to some array which stores passwords in hashed form and reads the entry out. Then it takes the password that you typed and hashes it the same way. Why does it hash it? Because you don't wanna store the original password in this array, right? This is in some file on the file system. And you compare the two, and if they match, that means you entered the right password and it lets you in, okay? And he changed it with this little extra thing which checks for this value. What do you think that value is? It is the hash of his special password.
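The pattern-to-bug rewrite just described can be sketched in C as follows. Everything named here is a stand-in: the transcript does not reproduce the actual line from login.c or the actual hash value, so PATTERN, BUG, MAGIC_HASH, and the function line_to_compile are hypothetical and only illustrate the shape of the trick.

```c
#include <string.h>

/* Hypothetical stand-ins for the login.c line and its replacement. */
static const char PATTERN[] = "if (hash(typed) == stored)";
static const char BUG[] =
    "if (hash(typed) == stored || hash(typed) == MAGIC_HASH)";

/* Returns the text the compiler will actually translate for this line.
   Every line passes through unchanged except the one equal to PATTERN,
   which is silently replaced by the version with the extra check. */
const char *line_to_compile(const char *line) {
    if (strcmp(line, PATTERN) == 0) {
        return BUG;
    }
    return line;
}
```

A compiler built this way translates every other program faithfully, which is why the change goes unnoticed unless someone reads the compiler source or inspects the login binary.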
So I was incorrect when I said that he doesn't need to enter a password. He needs to enter a password, and it's a specific password which happens to have this hash value. And that lets him log in on a machine where he doesn't have an account, of course. And he also gets superuser privileges as a bonus, okay? So far so good. So this compiler is going to compile login.c in such a way that it will let Ken Thompson in. But this would not last very long, right? This is, like, in your face. Somebody at some point will read the C compiler and look at this and say, wow, what is this? This is a clear exploit that you can spot right away. The C compiler would be maintained and extended, so it cannot stay like that, because sooner or later, more likely sooner, somebody would spot it. So you must not keep it in the C source code. So we now need to apply a trick similar to hiding that 11 from the source code, right? We taught the compiler that backslash v is 11, but after that we said nobody must know, for the purpose of portability, so that you don't have separate C compiler source code for DEC and for Unix and for IBM and I don't know what. Nobody must know it's 11. Now in this case, nobody must know that we are checking for this extra special hash value. So how do we get it out of the source code? Well, we could hide it under the equals operator, right? Then you could perhaps overload it. But we wanna do better. The source code must look like nothing is happening. We wanna put all the exploits into the binary. The C source code must have no sign of that. So, if I understand what you're saying, what we have here on the board teaches the compiler to translate the login source code in a special way. It injects into the login binary, right? Okay, now we say we want to have another one, right? Of course, what this does on the screen is, when you compile login, the source code, the compiler says, oh, it's login.c. Why?
Because this pattern tells you so. Yeah, chances are that other programs could contain the same line, but what's the chance of that? And it will inject into it the special extra conditional. So this piece of code teaches the compiler to inject into login.c. Could we go one level meta and inject into something else? Into what? Yes, we want to inject into the compiler. What? This check, right? So when somebody compiles the cc.c file, which is completely pure, no exploits in it, the binary will say, oh, I'm compiling myself, nice. I will inject into the resulting executable the check for login.c. But that can't be enough, right? So let me say it again. You have a binary of the compiler. You compile cc.c with it. The binary says, oh, I'm compiling myself, cc.c. Let me inject into the result a piece of code which, when it compiles login.c, will modify login.c. Why is that? Why would that be the end of the exploit? Exactly. The compiler needs to perpetuate this exploit. So when the compiler compiles itself in the future, it must create an executable that will do what? In login.c, it will inject this check. In itself, it will inject the check for itself and the injection of this. There always needs to be this DNA which remains there in the binary and will perpetuate itself. So always go back to that string and see why they do it. So that's extremely nice, okay? So it will essentially be checking for two patterns, one in login.c and one in the compiler itself, and that's it. If this is still confusing, and I want you to understand it, this meta-level reasoning will create some extra muscles on your brain. But tomorrow in section you will go over it again to help you understand how it works, okay? I hope you'll have fun. And I think that's it. I have people from HKN here, thank you, to do the surveys. And look at the schedule of talks; on Thursday we'll start talking about your final projects. Really look forward to it. Thank you.
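As a recap of the two-pattern scheme from the closing discussion, here is a rough C sketch. All strings and the function name rewrite_line are hypothetical placeholders, not the lecture's code. In particular, COMPILER_BUG is only a marker: a working version would have to contain the text of both checks, including itself, using the same self-reproducing "DNA string" trick as the stage one self-printing program.

```c
#include <string.h>

/* Hypothetical placeholders for the two patterns and their rewrites. */
static const char LOGIN_PATTERN[] = "if (hash(typed) == stored)";
static const char LOGIN_BUG[] =
    "if (hash(typed) == stored || hash(typed) == MAGIC_HASH)";
static const char COMPILER_PATTERN[] =
    "const char *rewrite_line(const char *line) {";
static const char COMPILER_BUG[] =
    "/* marker only: both pattern checks would be re-inserted here */";

/* Runs inside the infected compiler binary. The clean cc.c source
   contains neither check; they exist only in the binary and are put
   back every time the compiler compiles itself. */
const char *rewrite_line(const char *line) {
    if (strcmp(line, LOGIN_PATTERN) == 0) {
        return LOGIN_BUG;      /* compiling login.c: add the backdoor */
    }
    if (strcmp(line, COMPILER_PATTERN) == 0) {
        return COMPILER_BUG;   /* compiling cc.c: re-infect the result */
    }
    return line;               /* everything else compiles faithfully */
}
```

The second branch is what keeps the exploit alive across recompilations: without it, rebuilding the compiler from its clean source would produce a clean binary and the login backdoor would disappear.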