Weaponizing Windows Syscalls as Modern 32-bit Shellcode. We have a lot of content to cover, so we are going to go a little more quickly than we would prefer. My co-speaker is Tarek. He is a former graduate assistant of mine, he is currently an offensive security engineer at 23andMe, and he enjoys shellcoding. My name is Dr. Bramwell Brizendine. I am the former director of the VERONA Lab, which deals with exploitation and research. I'm also the creator of the JOP ROCKET, which deals with jump-oriented programming. I am currently a professor at the University of Alabama in Huntsville, and I have a PhD in cyber operations.

Our talk concerns Windows shellcode. The traditional way in which we invoke malicious functionality in shellcode is with what we call Windows APIs. It's a very familiar process: we do some PEB walking, and then we traverse the PE file format. Typically, our shellcode will be used for exploitation as well as for malware. Our research deals with using Windows syscalls in shellcode, and that's very different from Linux syscalls, if you are at all familiar with those. We're looking specifically at 32-bit shellcode running on WoW64. WoW64 is the mechanism by which 32-bit applications run on a 64-bit architecture. Our goal is to achieve 100% functionality through pure syscalls, and that is something that has only very rarely been done in shellcode.

A quick history of syscall usage in shellcode. Egg hunters are very well known. They've been around 15 or 20 years. Traditionally, we use an egg hunter to check whether memory is valid and then try to locate a tag such as Starfall or w00tw00t. Typically, that uses just one syscall. The single exception to that, and we looked long and hard to find something more complex, was a syscall shellcode from 2005.
That shellcode used four syscalls as a mechanism to establish persistence, and it did so by loading a hard-coded syscall value into EAX. The problem with syscalls is portability. In the Windows environment, at least, there is a great deal of variance. For Windows 10, for instance, there are more than 13 different OS builds or releases, and each of these can have a different syscall value for the same syscall. That can make it extremely difficult to use a particular syscall, simply because we don't know the correct value. It's not always reliable, and that's why people typically have not used syscalls in shellcode.

A quick history of syscalls in the modern era. There was a report by Hod Gavriel in 2018 about a very sharp uptick in direct syscall usage in malware. That included even dual loading of NTDLL and, additionally, something that was kind of a variation on Hell's Gate: it would look for the MOV opcode and then parse the syscall value. That led directly to several syscall tools. The first was Dumpert, a PoC tool that used pre-computed syscall tables to determine the correct syscall values for the OS build. That was followed very quickly by SysWhispers, which generated a 64-bit header and assembly files that could be used in a Visual Studio project. It is very well known, and it uses the 64-bit PEB to determine the OS build. Hell's Gate came about later on. It was a way to dynamically determine the syscall values at runtime, as opposed to using pre-computed values from existing syscall tables. It worked well for a period of time, but the problem is that EDR is able to detect it and to overwrite certain critical parts of the syscall functions.
Halo's Gate is a refinement of and response to that. It allows us to extrapolate and figure out what a modified syscall is by looking at the preceding or the following syscall and adding or subtracting 1. That uses a technique developed by a guy called ElephantSe4l. He figured out that if you sort the syscalls by function address, you can deduce the syscall ID by incrementing by 1. He developed his tool, FreshyCalls, to implement that. About a month later, we had SysWhispers2, which integrated a similar technique. That is an extremely well-known, very famous tool. SysWhispers3 is a recent variation that has some additional functionality.

So one of the secret sauces of sorts for these tools is the idea that the syscalls are going to be incrementing by 1. If that works, then great. The problem is that the underlying assumption is now wrong, because Microsoft has introduced a mitigation. I found this myself by creating syscall tables for recent OS builds of Windows 10 and Windows 11. Look at Windows 7: we have 3, 4, 5, 6, clearly incrementing by 1. Then look at Windows 10: we have 4, 5, 6, 7, but lo and behold, somebody has inserted some other bytes in front of that. So it's no longer incrementing by 1, which is tragic, and this repeats itself dozens of times throughout modern OS builds.

So what does it mean? It means we have to get a little more creative. What you could do is dual load NTDLL and then do a modified Hell's Gate on that. The problem, though, is that some EDR products can detect this. Fundamentally, it's important to bear in mind that with modern OS builds, syscall values no longer always increment by 1. So any of the tools that use the sort-by-address technique are no longer 100% reliable.
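To make the failure mode concrete, here is a minimal Python sketch of neighbor-based deduction. It is only an illustration of the reasoning above, not code from any of the tools named; the stub names and ID values are hypothetical placeholders, and a hooked stub is modeled simply as an entry whose ID cannot be read (None).

```python
# Each list is assumed to be already sorted by function address.
# Names and IDs are hypothetical placeholders, not values from a real build.
CONTIGUOUS = [("StubA", 3), ("StubB", 4), ("StubC", None), ("StubD", 6)]
GAPPED = [("StubA", 4), ("StubB", 5), ("StubC", None), ("StubD", 6)]

# What the hooked entries would really hold in this made-up scenario.
TRUE_ID_CONTIGUOUS = 5
TRUE_ID_GAPPED = 0x1C2  # out-of-sequence value, as seen on newer builds


def deduce_id(sorted_stubs, index):
    """Guess an unreadable ID from the nearest readable neighbor.

    Assumes IDs grow by exactly 1 per stub in address order, which is
    the assumption the talk says no longer always holds.
    """
    for distance in range(1, len(sorted_stubs)):
        for step in (-distance, distance):
            j = index + step
            if 0 <= j < len(sorted_stubs) and sorted_stubs[j][1] is not None:
                # Neighbor is `step` stubs away, so undo that offset.
                return sorted_stubs[j][1] - step
    return None


if __name__ == "__main__":
    print(deduce_id(CONTIGUOUS, 2) == TRUE_ID_CONTIGUOUS)  # guess matches
    print(deduce_id(GAPPED, 2) == TRUE_ID_GAPPED)          # guess is wrong
```

In the gapped case the guess is still a plausible-looking number, which is why this kind of failure is easy to miss: nothing signals that the deduced ID belongs to a different syscall.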
Those tools may work 60% or 70% of the time, but they will fail at other times, because some of the syscalls are simply not incrementing by 1. That affects FreshyCalls, it affects SysWhispers2, and Halo's Gate, among others.

So let's talk about reverse engineering of Windows syscalls. We can look at Windows 7 WoW64, inside NTDLL. Here is the NTDLL function for NtAllocateVirtualMemory, and it's loading 0x15 into EAX. So 0x15 is the syscall value for NtAllocateVirtualMemory for one particular OS build. We then make a dereferenced call to FS:[0xC0]. FS:[0xC0] points to a far jump, and that far jump has what's called a 0x33 selector in front of it. That means it's going to transition from 32-bit mode to 64-bit mode. So we actually have 64-bit code inserted into a 32-bit process, and this is something we couldn't directly follow in a debugger if we wanted to. FS:[0xC0] is important to remember here.

With Windows 10, we've switched things up a bit. Now we have an NTDLL offset that points to our syscall, a function called Wow64Transition. That then leads us to the far jump to transition from 32-bit to 64-bit mode, and eventually to the real underlying syscall. The good news, though, is that we don't need to use that. We could instead just use the FS:[0xC0] method. That still works, and in fact, it's actually easier to use in this version. Windows 11, at least from the user mode perspective, is virtually identical, with some different function names, but it's pretty much doing the same thing. And guess what? The Windows 7 backwards-compatible method of invoking it through FS:[0xC0] works there too.

So let's figure out how we can build some syscall shellcode. We need to determine the OS release first, because if we don't have the correct OS build, then everything else we do will fail, because we'll be calling the incorrect functions, or syscalls.
We can figure this out through introspection, and we can do that by a process called walking the PEB. We get the PEB from FS:[0x30]. In this case, we're going to go to the OS build number, which is at offset 0xAC. That allows us to determine the OS build conclusively; it gives us a hex value. We don't actually need to determine the OS major version or minor version. We can if we like, but if we're using Windows 10 and 11, or Windows 7, separately, then it's just not necessary, because the OS build itself is unique. However, if we are combining them, meaning we're supporting Windows 10 and 11 as well as Windows 7, then we absolutely do need to determine the OS major version. Turning it into shellcode, we simply get the address of the PEB, the process environment block, from FS:[0x30]. Then we read offset 0xAC, which gives us the OS build, which in this case is 21H2. It's the latest version of Windows 10.

When we actually call the syscall, we need to create a small function, and this function will be part of the assembly. The Windows 7 method is a little more convoluted. The Windows 10 and 11 method is more succinct. If we're combining both of them, then we need to programmatically determine which one we are going to use, and we can do that fairly simply by looking at the OS major version.

What we are doing here is using what we call the syscall initializer. We capture the OS build. Then we create some space on the stack for what we call a syscall array. We check which version it is, and we load the appropriate values for that build into our syscall array. We can then dereference the elements of the syscall array through EDI, so EDI will be a persistent pointer to the syscall array. If we're combining Windows 7 with 10 and 11, then we also check the OS major version, and that will be located at EDI minus four.
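The initializer logic can be modeled in a few lines of Python. This is a rough sketch under stated assumptions, not the shellcode itself: the syscall IDs are deliberately fake placeholders, the table name and helper functions are hypothetical, and the exact layout (IDs starting at the position EDI points to) is an assumption made for illustration. Only the build numbers and the "major version at EDI minus four" detail come from the surrounding discussion and public Windows version data.

```python
import struct

# OS build number -> reported major version and per-build syscall IDs.
# The IDs are placeholders chosen for illustration, NOT real syscall values.
PLACEHOLDER_TABLE = {
    7601: {"major": 6, "ids": [0x0A01, 0x0A02, 0x0A03]},    # Windows 7 SP1
    19044: {"major": 10, "ids": [0x0B01, 0x0B02, 0x0B03]},  # Windows 10 21H2
    22000: {"major": 10, "ids": [0x0C01, 0x0C02, 0x0C03]},  # Windows 11 21H2
}


def build_syscall_array(os_build):
    """Return (buffer, edi) for a supported build.

    The buffer models the stack region: one dword for the OS major
    version, followed by one dword per syscall ID. `edi` is the index
    of the first ID, so the major version sits at edi - 4.
    """
    entry = PLACEHOLDER_TABLE.get(os_build)
    if entry is None:
        raise KeyError(f"unsupported OS build: {os_build}")
    buf = struct.pack("<I", entry["major"])
    buf += b"".join(struct.pack("<I", value) for value in entry["ids"])
    return buf, 4


def read_dword(buf, edi, displacement):
    """Model a `[edi + displacement]` dword read from the array."""
    return struct.unpack_from("<I", buf, edi + displacement)[0]


if __name__ == "__main__":
    buf, edi = build_syscall_array(19044)
    print(read_dword(buf, edi, -4))       # OS major version
    print(hex(read_dword(buf, edi, 4)))   # second placeholder ID
```

The point of the design is that the rest of the code never contains a literal syscall number; it only ever reads a slot relative to one saved pointer, so supporting another build means adding one table row.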
Once we complete that process, we will have the syscall array, and we can dereference it from EDI. So EDI+0x4, EDI+0x8, and EDI+0xC enable us to access, at runtime, the values that correspond to the correct OS build. Here is an example of how we can do that: no hard-coded values, just dereferencing EDI. By doing this, we will have the correct values whether it's Windows 7, 10, or 11. The pointer to the syscall array is something we need to maintain. So we push EDI onto the stack before pushing our parameters. Then we do our syscall, then we do a stack cleanup, and then we pop the pointer back into EDI. That way we maintain the integrity of the syscall array pointer.

One tool that I made, and this is kind of a bonus because we didn't advertise it, is called ShellWasp. It's available on our GitHub, Bw3ll/ShellWasp, and it will also be in the slides you can download. It automates building a template for syscall shellcode, and it supports nearly all user mode syscalls. It addresses the portability problem and builds out the syscall array. It's very simple to use and to select syscalls. You can do that through the user interface or through a config file, and you can rearrange them. You can select the desired OS releases as well. The final result, as we can see here, is that it does indeed create a template. The parameters are initialized as zeros, and it's your job to go through and figure out the values to put in them. That's something Tarek will be talking about shortly. The template also provides the parameter names and types, and it takes care of managing the syscall array and all of that. So if you're doing something very complex, all of that is taken care of for you. Without further ado, I'll introduce Tarek, who is going to give us an excellent example.

Hello, everyone.
Here I'm going to be talking about process injection shellcode using system calls. I have a list of all the APIs I'm going to use in this shellcode, and we're going to start with the first one, which is NtQuerySystemInformation. The purpose of this API is to give us all the necessary information about the processes running in the background. The reason we do this is that we need the process ID of the target process that we are going to inject our shellcode into. This API takes a SystemInformationClass, and there are many classes to choose from. The class tells the API what kind of information you need from the operating system. Here I'm using SystemProcessInformation, which is equivalent to 5 in hex. This tells the operating system we are only interested in information about running processes. Once we get that, we can parse through each process and get the process name and the process ID.

This is what the SYSTEM_PROCESS_INFORMATION structure looks like. The first element is called NextEntryOffset, which points to the next process. Another one is called ImageName, which is the process name. That's the one we are interested in, because we know the process ID changes, but the process name doesn't. So we need to find a process by looking up its name. The last one, at the bottom, is called UniqueProcessId, the process ID. This is the one we're doing all of this to get.

Here we have the SystemProcessInformation data. This is just an example, a process called System. It's a Unicode process name, so keep in mind when you're doing this that a Unicode string has null bytes between each character. At the top, the offset to the next process is at offset 0x0. This is what we use to go through each process in the data that we get. Offset 0x3C has the process name and offset 0x44 has the process ID.
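The walk over that data can be sketched in Python against a synthetic buffer. This is an illustrative model only: it never calls the operating system, the process names and IDs are made up, and the helper names are hypothetical. It uses the 32-bit field offsets mentioned above, plus one stated simplification: in the real structure the field at 0x3C is a pointer to the name, while here it holds an offset into the same buffer. The name length at 0x38 (the Length field of the UNICODE_STRING) is also an assumption about the layout.

```python
import struct

# 32-bit field offsets used in this sketch.
OFF_NEXT = 0x00      # NextEntryOffset: distance to the next record, 0 = last
OFF_NAME_LEN = 0x38  # ImageName.Length in bytes (assumed layout)
OFF_NAME_PTR = 0x3C  # ImageName.Buffer (modeled as a buffer-relative offset)
OFF_PID = 0x44       # UniqueProcessId
RECORD_SIZE = 0x48   # just enough room for the fields used here


def make_buffer(processes):
    """Build a synthetic record chain from (name, pid) pairs."""
    buf = bytearray()
    for i, (name, pid) in enumerate(processes):
        raw = name.encode("utf-16-le")  # null byte after each ASCII character
        start = len(buf)
        record = bytearray(RECORD_SIZE)
        is_last = i == len(processes) - 1
        next_offset = 0 if is_last else RECORD_SIZE + len(raw)
        struct.pack_into("<I", record, OFF_NEXT, next_offset)
        struct.pack_into("<H", record, OFF_NAME_LEN, len(raw))
        struct.pack_into("<I", record, OFF_NAME_PTR, start + RECORD_SIZE)
        struct.pack_into("<I", record, OFF_PID, pid)
        buf += record + raw
    return bytes(buf)


def find_pid(buf, wanted_name):
    """Follow NextEntryOffset until the name matches; return its PID."""
    offset = 0
    while True:
        name_len = struct.unpack_from("<H", buf, offset + OFF_NAME_LEN)[0]
        name_at = struct.unpack_from("<I", buf, offset + OFF_NAME_PTR)[0]
        name = buf[name_at:name_at + name_len].decode("utf-16-le")
        if name == wanted_name:
            return struct.unpack_from("<I", buf, offset + OFF_PID)[0]
        step = struct.unpack_from("<I", buf, offset + OFF_NEXT)[0]
        if step == 0:
            return None  # reached the last record without a match
        offset += step


if __name__ == "__main__":
    demo = make_buffer([("System", 4), ("example.exe", 4242)])
    print(find_pid(demo, "example.exe"))
```

Matching on the name rather than the ID mirrors the reasoning in the talk: the ID differs on every run, so the name is the only stable key to search on.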
The way this works is that we don't know in advance how much data there will be for all the processes running in the background. So we allocate an arbitrary amount using something like NtAllocateVirtualMemory, and we call NtQuerySystemInformation, and that might not give us a success value as a return. If it doesn't, we allocate more memory, let's say 0x2000 or something, and then we call the same API again. Once we get a success value back, that means we have all the data we want, and we can parse through each process using what I just discussed: the process ID, the name, and all of that.

Once we have the process ID, we can move forward with NtOpenProcess. It's very straightforward and takes four arguments. One of them is called ObjectAttributes. The OBJECT_ATTRIBUTES here is another structure, and we're not using most of it. We just need the Length. Everything else should be zeros. Fields like RootDirectory and ObjectName could be used for other things, but not in this case. We also have the ClientId, a PCLIENT_ID. This is the process ID we just obtained, so we will use that. We also have the DesiredAccess, which is all access, 0x1FFFFF. And the variable at the top, the ProcessHandle, is there to receive the handle for the process.

Once we are done and we have opened the process, we have a handle, and we can keep going with NtAllocateVirtualMemory. This API is like the godfather of VirtualAlloc and all its variants. It's very similar to VirtualAlloc. The only difference here is the BaseAddress, the second argument, which is in and out. This variable will be used to get back the allocated address in memory, and we will use that later to write our shellcode. The ProcessHandle is the one we just got, and the ZeroBits are just zeros. It's kind of optional here.
We have the RegionSize, which is the size of the shellcode. We have the AllocationType and the Protect, very similar to VirtualAlloc. This is an example of what it looks like. The first one is the Protect, PAGE_EXECUTE_READWRITE, which is what kind of permissions you need, and then the type: reserve, commit, or both. The rest of the things here are the size of the shellcode and what I just discussed. Finally, we invoke the syscall to make it happen.

Next, now that we have allocated some memory in the remote process, we need to write our data there. We need to write the shellcode, which is a Meterpreter shellcode in this case. Here it's very straightforward, the same thing. We have the process handle. We have the base address, which is the pointer to the allocated space in the remote process. We have the buffer, which is a pointer to the shellcode that we have in our C program, the Meterpreter shellcode. And the size of the shellcode is the number of bytes to write. The last one is optional. This is the number of bytes you get back, meaning how many bytes you were able to successfully write to the remote process. This is what the shellcode looks like, and it's the same thing: we invoke the syscall to make it happen and write the Meterpreter shellcode to the remote process.

Once we have the shellcode written to the remote process, we spawn a thread in the process using NtCreateThreadEx, the extended version. This API takes 11 arguments, but the last six arguments are actually zeros. They're optional and we don't need them in this case. We're only interested in the first five. The first is the thread handle. This is a variable to receive a handle for the thread, which is not really going to be used. The second one is the desired access. Same thing, we're using all access, which is 0x1FFFFF. And we have the object attributes, the same thing as with NtOpenProcess.
We're only using the Length, and we're not using anything else in this structure. Then we have the process handle that we keep using all the time. And then we have the start address, which is the address of the Meterpreter shellcode that we wrote to the remote process. This is what it looks like in shellcode. This is an example. As the target, I'm using a tool called cryptool.exe. You can use anything else you want. Just make sure when you're doing this that if you compiled your file as 32-bit and you want to inject something into another process, that process is also 32-bit. Otherwise you will get an access violation.

So now we have two separate shellcodes. One is the syscall shellcode and the other is the Meterpreter shellcode that is hiding in your C program. Now we need to combine those shellcodes so we can use them together, for example if you're leveraging a vulnerability and you want to execute them at once. Here I have the green part at the top. This is just used to get the beginning of the Meterpreter shellcode. I'm using the call/pop technique to get the address of the Meterpreter shellcode, and then I jump all the way down to the green part at the bottom, which is the start of the syscall shellcode.

Once you have this large shellcode, a combination of the syscall shellcode with the Meterpreter shellcode sandwiched inside it, we need to execute it somehow. You can execute this in memory if you have a vulnerability, or you can just have a C program if you're running malware. You can use a shellcode tester if you want, but if you do that you will still be using some kernel32 APIs. So I wrote this execution skeleton so you can avoid kernel32 APIs. It's equivalent to the one where you use VirtualAlloc and then memcpy, but this one uses the same system calls.
You're still using NtAllocateVirtualMemory and NtWriteVirtualMemory to write the shellcode to memory. Then the last line at the bottom is a jump to that location to start executing.

So it's demo time. This is the tool I just started, and now I'm going to start it in the debugger to show you how it works there. Here I'm using cryptool.exe, the one I just mentioned, and this is the debugger, x32dbg. The first breakpoint is where we get the actual cryptool.exe. We found the process here on the right side, and now we can parse through the process and get the process ID. This is just to verify, using tasklist, that the process ID I got is the same. We convert the hex to decimal to verify it's the same process. The second breakpoint is when we allocate the memory in the remote process. Here we have the allocated space at 0x18E0000. We open Process Hacker to make sure it's the same. It's there, and we have the read, write, execute permissions. The last part here is just showing that Windows Defender is up and running. We run it and we get a Meterpreter shell back. And as you can see here, Windows Defender is just taking a nap. It's not doing anything.