Thanks for sticking with us today. Sunday is a bit more chill. Some of us are already flying home, some of us are recovering from a hangover, and the last of us are here. Thanks for sticking with us. There are also some zombies there.

Okay, so today we are presenting the results of our research: a new approach to stack spoofing targeting Windows, specifically the 64-bit architecture of Windows.

Who are we? We are three friends. I'm Alessandro Magnosi, principal security consultant at BSI, where I mostly do red teaming, offensive tool development and white-box assessments, so source code review, fuzzing and stuff like that. In my free time I maintain some offensive tools in Porchetta Industries, which is an initiative by Marcello Salvati to support open source developers, and in the rest of my time I'm a bug hunter for Synack Red Team. Then there is my colleague.

Hey, my name is Arash. I work at CyberArk. I have kind of a dual role there. I'm very fortunate: I get to do a lot of threat hunting and detection engineering, and I also get to do red teaming when time allows for it. Our friend Thanos unfortunately can't be here today. He also contributed to this work. He's a red teamer at Nettitude. He was also great and he contributed a lot here, so we want to make sure he's mentioned.

So here we are. A bit of history about this. All of this started back in 2022 when namazso, a security researcher and developer, released a PoC on Twitter, an image showing a technique, which we then popularized and also fixed a bit. We were very interested in his research and his work, so we reached out to him and started developing this in-house. Of course, we didn't know a lot about what was involved with the technique, so we started this research and development project and developed our own version of the technique shown by namazso.
After that we continued researching this technique. We will also present the detection part for this technique and the additional research avenues that we explored.

Just to give you a bit of background, let's define what we mean by stack spoofing. As you know, when you execute a binary on Windows, you create a process, which is a container, and then one or more threads are tasked with executing the real code of the program. Every thread maintains a call stack, which is an area of memory organized as a stack that holds the information about all active subroutines at a given time for a given thread. With spoofing, we want to substitute the real frames with fake ones. This is usually never done for non-malicious activity. It is always done by malware to hide the execution flow or similar.

You will see here four different frames, and I put this here just to let you see what a call stack is. There are certain things that are common to a call stack. For this talk we define these first two frames as thread initialization frames, which are usually at the bottom of the call stack, well, at the top actually. Then you have the internal frames of a specific program, which we will refer to during the talk as main module frames. And then you have the frames related to an API call.

So how is this calculated by Windows? The stack is evaluated by Windows using an unwinding algorithm, which uses the information contained in the .pdata section of a binary or related DLL. In this .pdata section you usually have an exception directory.
It contains all the information about the functions. Specifically, every runtime function entry in this table contains the begin address of a function, the end address, and a pointer to a structure, the unwind info structure, which contains the size of the prologue, the count of unwind codes, and all the operations that were performed in the prologue of that specific function.

Now, why is this important? This is used by Windows to manage exceptions, but what we are interested in is how it is used to calculate the call stack at runtime. To calculate that, Windows evaluates a call frame's size dynamically by collecting the unwind codes, so the operations in the prologue that impacted the frame size. The operations are usually push, of course: if you push a register, you are subtracting eight from RSP. And if you allocate memory, you are also subtracting from RSP, so you allocate space on the stack. These two main operations are what modify the frame size. Then there is push machine frame, which is really not interesting for our talk, but it's another operation that impacts the frame size.

This is needed in Windows because, if you remember the 32-bit architecture, in the prologue you always have a push EBP, and then you save the base pointer. So you have the base pointer pointing to the top of the frame and the stack pointer pointing to the bottom of the stack, or the contrary. This is very useful to evaluate the call stack, because you can just navigate back the chain of EBP values that are pushed on the stack. But in 64-bit this is not possible, because RSP is used both as a base pointer and a stack pointer.
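The frame-size arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the Windows unwinder: the operation code values follow the public x64 exception-handling documentation, but the `(op, op_info, extra)` tuple layout and the function names are assumptions made for this example, and the real unwind code array is packed into 16-bit slots.

```python
# Unwind operation codes as documented for x64 exception handling.
UWOP_PUSH_NONVOL = 0      # push of a nonvolatile register: 8 bytes
UWOP_ALLOC_LARGE = 1      # large stack allocation
UWOP_ALLOC_SMALL = 2      # small stack allocation: op_info * 8 + 8 bytes
UWOP_SET_FPREG = 3        # frame pointer established; ignored by this sketch
UWOP_PUSH_MACHFRAME = 10  # machine frame pushed by hardware


def frame_size(unwind_codes):
    """Return how many bytes a prologue subtracted from RSP.

    `unwind_codes` is a simplified, hypothetical representation: a list of
    (op, op_info, extra) tuples, where `extra` carries the value stored in
    the additional slots used by UWOP_ALLOC_LARGE.
    """
    size = 0
    for op, op_info, extra in unwind_codes:
        if op == UWOP_PUSH_NONVOL:
            size += 8
        elif op == UWOP_ALLOC_SMALL:
            size += op_info * 8 + 8
        elif op == UWOP_ALLOC_LARGE:
            # op_info 0: size is stored divided by 8; op_info 1: unscaled.
            size += extra * 8 if op_info == 0 else extra
        elif op == UWOP_PUSH_MACHFRAME:
            # 0x28 bytes without an error code, 0x30 with one.
            size += 0x30 if op_info else 0x28
        # Every other operation leaves RSP unchanged in this sketch.
    return size


def return_address_slot(rsp, unwind_codes):
    """Address where the return address sits once the prologue has run."""
    return rsp + frame_size(unwind_codes)


# Prologue: push rbx; push rdi; sub rsp, 0x28 (codes listed in reverse order).
example = [
    (UWOP_ALLOC_SMALL, 4, 0),
    (UWOP_PUSH_NONVOL, 7, 0),
    (UWOP_PUSH_NONVOL, 3, 0),
]
print(hex(frame_size(example)))  # 0x38
```

Repeating this step for each frame, reading the return address and looking up the unwind data of the function that contains it, is the recursive walk an x64 unwinder performs instead of following a chain of saved base pointers.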
So to evaluate the call stack, we need to recursively get the size of a stack frame and recover the return address, then evaluate the next frame's size and its return address, and traverse the stack back like this.

So why are we interested in the call stack? Because when you execute malware on Windows, you can generate some anomalies that can help with detection. In-memory execution can be detected because when you inject code in memory, with process injection or code injection for example, and you're using an executable section on the heap, that section will not have a module associated. That executable memory is not backed by a file on disk. If there is no file, there is no .pdata section. If there is no .pdata section, there is no exception directory. If there is no exception directory, there is no information that the unwinding algorithm can use to dynamically get the frame size for that specific code. So what will happen is that, oh, too much, sorry. What will happen is that you will have something like this: a frame at a certain point that is not resolvable, and this is associated with an unbacked memory region.

The other anomaly that you can detect from the call stack is when you use direct or indirect syscalls. With direct system calls we are actually referring to direct Nt function execution: GetProcAddress on NTDLL for the Nt function, and then executing the function pointer, which calls the Nt function directly. This will generate a call stack that misses, sorry for that, the frames associated with the high-level API call.
This stack is also generated if you're using an indirect system call, so if you're just locating the syscall instruction in NTDLL and then jumping to it. The stack is pretty much the same, and it misses the high-level API frames, of course.

So these are the anomalies. When you want to detect them, you can do a periodic scan: you have a tool that stops the threads system-wide, gets the context and unwinds the stack frames. You can have a conditional scan, which is a periodic scan with an additional filter for specific kinds of threads that are in specific wait states, for example. Or you can use hooking or kernel tracing, which is instead done dynamically: when a system call is invoked, you can use something like ETW TI or another kind of driver to get the user-mode call stack and hunt for detections.

As for stack spoofing, previous research on the topic led to the development of techniques to defend against specific periodic scans, like stack truncation, stack cloning and stack hiding, or to defend against specific rule-based detection on the call stack when executing a specific API.

Stack truncation was probably the first one used. It's a variation of normal return address spoofing, where we replace the malicious frame with a trampoline, usually in a DLL or an executable, and then we nullify the return address prior to the trampoline. This way we have a truncated call stack like this, which is surely better than having garbage in the call stack, but it still has a defect: this stack is not unwindable. So we can still have a detection for this reason. Then we have stack crafting.
Stack crafting instead is a technique that is usually asynchronous. There is a specific function that gets called, and there are specific processes using that function in a legitimate way. So we record the call stack of that execution at that point, and when we need to call our target function, we just craft the same kind of call stack. In this way we can bypass detections like Sysmon hunting for a specific call stack trace, or similar.

Then we have stack cloning. This as well has been developed to defend against conditional scanning, and the idea is simple. Usually our implants sleep most of the time. So when I want to sleep, with a delayed execution using Sleep, or with a user-requested wait state using WaitForSingleObject, what I need to do is just hunt for a thread that is already sleeping in that state and then clone its call stack. This will make the call stack appear perfectly legitimate, so we will be protected against conditional scans that hunt for sleeping beacons.

Stack hiding is similar, but it uses another kind of approach. I have malicious code executing in memory, and I transform my thread into a fiber. Then I create another fiber with a legitimate call stack that I use just for sleeping. When I need to sleep, I just switch to the other fiber that I created for sleeping, and this process gets repeated when I need to come back from sleep. This is what you have: this is the fiber with a legitimate call stack, and this is the code in memory. You can see that there are two different memory areas for the stacks.
And just one is active at a time.

So what is our contribution? Today we want to present two techniques, Full Moon and Half Moon, that are designed to obfuscate the stack at runtime, to avoid mostly kernel-based detection, or hooking-based detection. Stack Moonwalk has some properties. It hides the caller. It obfuscates the stack, so it's difficult to understand which call stack was actually generated when executing a specific API or system call. And then it has an auto-restore technique. Usually we use ROP to restore the stack. We will see this is not always possible, but for the most part it uses ROP to restore the stack.

Full Moon is the first technique and it operates like this. We have a function that we want to spoof, an API or whatever. Of course, we are executing this from our main module, which in this case looks like a program, but it can even be injected in memory, there is no problem. With main module, if you remember, I'm referring to all the frames generated by the main module, so I just packed together all the internal frames of the main module.

What I do is select three frames, well, actually four, to be a bit more stealthy in our intentions, but the needed ones are three. These are a set-FPREG frame, a push-register frame and a desync frame, and then we added a concealment frame just to be a bit more stealthy. What happens is that, at a certain point in the set-FPREG frame, we are restoring the base pointer again. This happens through this kind of code, which is MOV RBP, RSP, where we save our current RSP, which contains the address on the stack that holds the return address of this frame.
Then with the second frame, the push-register frame, we push RBP on the stack. Now what happens? This creates an opportunity for us to modify the value that was pushed in memory, because that RBP is now on the stack. If you modify it, it will be evaluated by the unwinding algorithm as the address that should be used to restore RSP. This gives us an opportunity to tamper with it, so that if this RSP points back to our thread initialization frames, practically what happens is that the main module will completely disappear from the call stack. Okay, this doesn't solve the memory problem. If you have an unbacked memory section, this will not resolve that, okay? But from the call stack it is not visible anymore.

Because doing just this will of course corrupt the return flow of the program, we use a desync gadget, which is a JOP gadget here, a jump through RBX, to hijack the return flow so that we can execute a restore function, which is under our control. This restore function will reset the main module to what it was and just remove the fake frames. So this is the technique in a nutshell, and this is how the call stack is set up. As you can see, there are the thread initialization frames and then there is no executable. There is just KernelBase, and these frames are all generated by our algorithm. Now, these frames have some logical inconsistencies that can be used for detection. And over to my colleague.

Sure. So I'll talk to you guys a little bit about the IOCs and why this doesn't necessarily create a perfect stack, but does create a good-looking stack. If we look at the slide we have currently, you can see the spoofed frame, KernelBase.dll OpenStateExplicit, and you actually see a location, that plus 0x390. That's an offset: that's the return address we would return to in order to continue execution within the stack. Now, why is this relevant and why is this an IOC?
Because, technically... well, let's talk about what a call stack is. Based on the name, a call stack is of course a series of calls. You will see an API call making another API call, making another API call, and this is often so you can debug and see at what API you may have crashed, at what section you may have issues, etc. So it is supposed to show a series of calls to other functions.

Now if we look at that spoofed frame, KernelBase.dll OpenStateExplicit, and go to the return address, the return address itself isn't going to have a call, because the return address comes after the call. That's where we continue execution, where we return. So what we do is actually go before the return address and look at that little mov byte ptr that we see there for the OpenStateExplicit function. We can see that it is not a call. And not only is it not a call to another function, it is certainly not a call to CreatePrivateObjectSecurityWithMultipleInheritance. So immediately this becomes suspicious. This isn't something a legitimate dev would ever do, because it would just hinder their debugging process. It would impact them negatively. So this is something we would only see an attacker wanting to perform.

Now, while I say this is an IOC, it's really kind of a difficult IOC. It's hard to deploy at scale. It's not really something you can hunt in your EDRs. And if you checked this on every single function, there would be massive performance hits. So while this is an IOC, we're not sure there's a realistic detection method yet, outside of the possibility of a utility such as an Intel Pin tracer, which works at a slightly lower level than the operating system itself.

If we look here, we can tell you what we're looking for. These are the call opcodes. This is basically the assembly that we're looking for in our Eclipse detector to observe Full Moonwalk. So we're looking for specific types of calls.
The ModR/M and REX.W parts basically determine if it's, like, a long call or a short call, for example. Then we can see the rel32 call, and the call r/m64, which calls a register. And then of course the system call, the lowest-level call in NTDLL, which you will often see because NTDLL often makes the syscall for you. If we were to look there, we would see a syscall instruction.

So this is the algorithm, for all you PhD people out there. Basically, the idea is we're looking for very specific things in our algorithm. Not just stack spoofing: the detector is looking for an invalid stack, period. And not just by unwinding, because as we know, we can actually fool the unwinder, as previously shown. So what do we look for?

We're looking for unbacked memory regions. That's the reflective injection detected. Unbacked memory means some injection was performed and a private RX committed memory region was created. This isn't backed by any file on disk, no DLLs, no EXEs, so we often consider this malicious, an injection of some sort. There can be false positives with this, to be clear. JITs, browsers, C#: you'll see these private committed memory regions getting created and executing code. So while this does help, it's not foolproof.

Next, we actually look for the desync gadget, because what Full Moonwalk does is desync the stack with basically a ROP gadget, in order to be able to effectively craft its own stack, and for restoration as well. So we can just look for that gadget, and often that can be an IOC as well.

Additionally, the call mismatch detected, as previously stated. MessageBoxA is going to make a very specific call to the next function. If we look at MessageBoxA, look at the previous instruction, and it's not the right call to the next function, that's an IOC. And the wrong address detected as well.
If the address is just not a call, if it's not correct... it has to be a call, and it has to be a call in the correct order. I actually think I mixed up the mismatch and the wrong address, but the idea is the same. We're basically looking for a call that's not a call, or checking whether it is actually calling the next function, because that's what a call stack is supposed to show: this function calls that function, which calls that function, for investigation.

So this is Full Moonwalk running. We have a successfully spoofed MessageBoxA. We're even telling you we're spoofing this MessageBoxA. Now we run our detector. On the left, you can see our four spoofed frames, frames eight, nine, ten and eleven. Our detector first catches CreateFileInternal as a spoofed call, because the previous instruction is not a call. And here's what's interesting: we also get PssQuerySnapshot. If you look at the addresses, as I said before, this does unwind. The PC address for PssQuerySnapshot matches the return address for the CreateFile function. So an unwinder is going to think this is legitimate: this is in fact what was called, this is in fact where it needs to return. But our detector knows better. Our detector goes there and says not only is there no call, there's absolutely no call to CreateFileW. All we've done is fool the Microsoft unwinder into thinking this looks legitimate. If you use Microsoft's built-in unwinding, which is what we did here, it looks legitimate. We only caught this because our detector is looking for more.

Now, there are a few false positives, as mentioned before. We see this a lot with JITs specifically. We saw this in WinDbg, which seemed to have a lot of JITing going on. We can see right here that one of the processes for WinDbg has these two entirely unbacked memory regions.
For reasons unknown to us, it appears JITs don't always fully unwind and these memory regions just always exist. So there will be false positives with JIT processes, as usual.

So we did testing to see what the false positive rate actually was. We wanted to determine: is this feasible? Is this something we can actually do without overloading a team? And we determined, kind of, it's not too bad. We tested this on five different machines. Standard just means it was a standard Windows user machine. This is not a developer, just a person who logs on, watches videos, works on Word documents, and that's it. Dev is actually a developer: somebody who installs Visual Studio, sets up Postman and installs WinDbg Preview. And it makes sense that the developer is going to have a higher false positive rate. They have more JITs. They're developing more tools that are going to have weird, unbacked memory regions. No program is really perfect, and there's really no catch-all for this.

We can see that on Win 10 the false positive rate is actually a little lower than on Win 11 Enterprise. But we do see that for the dev machines it actually decreased a bit. As far as Windows Server goes, we see a pretty high false positive rate. That's likely because you're deploying your web servers on it, your .NET applications, all sorts of stuff that's going to increase the rate of false positives.

And just to end this: as far as detecting all this goes, like I said, for a hunter behind an EDR, this isn't really telemetry we get. Some EDRs provide call stacks for very specific functions, so as to not be too intensive, and we can hunt on that and look for unbacked memory regions. But as far as Full Moonwalk goes, we just don't really have an effective way to check every single function call and ensure that it is actually a valid call.
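The call-site check at the heart of the detection discussion can be sketched as follows. This is a heuristic illustration and not the Eclipse detector itself: the function names are made up for this example, the buffer and offsets are assumptions, and decoding x86-64 backwards is inherently ambiguous, so a real tool would need a proper disassembler. The byte patterns covered are the ones named in the talk: the rel32 call (`E8`), the indirect call through a register or memory operand (`FF /2`), and `syscall` (`0F 05`).

```python
def _ff_call_length(code, start):
    """Length of an `FF /2` indirect call at `start`, or 0 if it is not one."""
    if start + 1 >= len(code) or code[start] != 0xFF:
        return 0
    modrm = code[start + 1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if reg != 2:                      # /2 selects the near indirect call
        return 0
    length = 2                        # opcode + ModR/M
    if mod != 3 and rm == 4:          # a SIB byte follows
        length += 1
        if mod == 0 and start + 2 < len(code) and (code[start + 2] & 7) == 5:
            length += 4               # SIB with no base: disp32
    if mod == 0 and rm == 5:          # RIP-relative: disp32
        length += 4
    elif mod == 1:                    # disp8
        length += 1
    elif mod == 2:                    # disp32
        length += 4
    return length


def preceded_by_call(code, ret_offset):
    """True if the bytes ending at `ret_offset` look like a call or syscall."""
    if ret_offset >= 5 and code[ret_offset - 5] == 0xE8:
        return True                   # call rel32
    if ret_offset >= 2 and code[ret_offset - 2:ret_offset] == b"\x0f\x05":
        return True                   # syscall
    for length in (2, 3, 4, 6, 7):    # possible sizes of an FF /2 call
        start = ret_offset - length
        if start >= 0 and _ff_call_length(code, start) == length:
            return True
    return False


def direct_call_target(code, ret_offset, base):
    """Target of a `call rel32` ending at `ret_offset`, else None."""
    if ret_offset < 5 or code[ret_offset - 5] != 0xE8:
        return None
    rel = int.from_bytes(code[ret_offset - 4:ret_offset], "little", signed=True)
    return base + ret_offset + rel
```

A frame would be flagged as "not a call" when `preceded_by_call` is false for its return address, and as a "call mismatch" when the call is direct and `direct_call_target` falls outside the function shown in the next frame down. Indirect calls would need their target resolved some other way, which this sketch does not attempt.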
So in a way, this unwindability is pretty effective in today's age.

So after developing the detection, we talked about other avenues where we could make use of this kind of knowledge. As we've seen, apart from in-memory execution, there was another problem: getting the correct stack frames, the ones allocated when we execute a high-level API, when we instead go with an indirect syscall. So we developed this additional technique, which is an extension, and it's Half Moonwalk. It's called Half Moonwalk because we're just doing half of the technique: we are just considering what we are calling, and not really the caller.

We can always match a high-level API to a syscall. So Half Moonwalk takes as input a high-level API name and its relevant syscall instruction, and it traces the execution flow from the high-level API until it finds the relevant Nt function. While doing this tracing we also collect all the exception information, so all the information in the .pdata section related to the prologues. Then we recreate the frames of the high-level APIs before executing an indirect system call. To do that, we just need to literally emulate the prologue of the high-level API. Doing this, we can recreate the same kind of call stack, perfectly legit this time. We also respect the location of the return address, which is placed after the call, and the call actually matches what has been called. But we never call the high-level API. So this is literally a SysWhispers type of indirect system call, just with added spoofed frames for the high-level API.

Now we have a little problem with the emulate system call function, which of course is the function that does this spoofing, and it's visible for now. We will solve this later.
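The prologue emulation described above can be modelled on a simulated stack. This is a toy sketch under stated assumptions, not the authors' implementation: the stack is a Python dictionary, the frame names, sizes and addresses are hypothetical, and saved register values that a real prologue would write are ignored. It only shows why recreating each frame's size and return address is enough for a simplified unwinder to report the high-level frames.

```python
from typing import NamedTuple


class SpoofedFrame(NamedTuple):
    name: str            # label only, hypothetical
    frame_size: int      # bytes the function's prologue subtracts from RSP
    return_address: int  # address right after the call this function makes


def emulate_prologues(rsp, frames, memory):
    """Lay out `frames` (outermost first) below `rsp` on a simulated stack."""
    for frame in frames:
        rsp -= frame.frame_size             # what the prologue would allocate
        rsp -= 8                            # what the `call` would push
        memory[rsp] = frame.return_address
    return rsp                              # RSP as seen at the syscall stub


def walk(rsp, memory, size_by_return_address):
    """Simplified unwinder: read a return address, then skip its frame."""
    trace = []
    while rsp in memory:
        ret = memory[rsp]
        trace.append(ret)
        rsp += 8 + size_by_return_address.get(ret, 0)
    return trace


# Hypothetical layout: the implant's caller already left a return address.
CALLER_RET = 0x140001234
memory = {0x1000: CALLER_RET}
frames = [
    SpoofedFrame("HighLevelApi", 0x38, 0x7FF800001010),
    SpoofedFrame("InnerHelper", 0x28, 0x7FF800002020),
]
rsp = emulate_prologues(0x1000, frames, memory)
sizes = {f.return_address: f.frame_size for f in frames}
print([hex(a) for a in walk(rsp, memory, sizes)])
```

In this model the walk yields the innermost spoofed return address first, then the outer one, then the original caller, even though neither hypothetical API was ever executed.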
We are concerned about the return flow, though. If we add these frames, they will return back to where they are supposed to return. But we didn't actually execute the high-level API, we just emulated the prologue. Would that be enough to ensure that the return flow is correct? Well, yes and no. There are three strategies we can use.

One is the return on unwinding. This is terribly worded. What I mean is that we place the return address on the epilogue, which of course just deallocates what the prologue allocated. It's literally rolling back the prologue, so that will work. Here, for example, we have the call to NtProtectVirtualMemory, and instead of placing the return here, where it's supposed to be, because there are all these tests, we just move it to the epilogue. This works, of course, but as I said, we can be detected again by Eclipse, because we need to move the return address not after the call but onto the epilogue itself, and if we analyze the instruction before this return address, it won't be a call anymore.

So we have a second strategy, which is the targeted setup. It works even better, but it doesn't scale. With the targeted setup, I analyze the post-call checks. What I call post-call checks are literally all these ifs that are executed after the call. To avoid the branching, I see what I need to patch, and then I patch it before actually executing the call. Now, this works, but you can't do it every time, and honestly it doesn't scale at all, because you would need to know what to patch every time you execute an arbitrary function.

So the last thing we can do is... I don't know what is happening, I'm sorry. Another thing I can do is use hardware breakpoints.
What happens is that I place a breakpoint on return. We actually say sliding hardware breakpoints, because we don't know how many frames we need to emulate, so if there are more than four, of course, I need a strategy to auto-update the hardware breakpoints, otherwise I will miss some frames. Anyway, I place a hardware breakpoint at the return address, and when the function returns, I execute an exception handler that just emulates the epilogue and then forces the instruction pointer to return. This scales well, but I always need executable memory, because I need to store the exception handler somewhere. So I always need some executable memory floating around.

As I said before, we also had a sort of problem with the emulate system call function. If you remember, this thing is actually pushing these frames and then jumping to the syscall instruction here, so if someone is looking at this emulate system call function, it can be problematic. What we can create is an opaque architecture. Here it is an if, but you can place a switch case, for example. Now we are stepping through here, okay, to emulate system call W. This is a frameless function. It will execute something on the stack, and it will corrupt the stack doing this, but it will grab the position of this framed function and swap the frame of the frameless function with the framed function. This way it will restore the call stack, which will appear legitimate. And what legitimates this sequence of calls is that we have a jump here to the restore proc. Of course this is opaque, it's not that great. But what is really important about this is that even if we have something with no frame information associated, as long as the
stack size, so the modifications on the stack, are equal to those of something framed that we are creating, we can always replace it and fix the call stack.

Why is this important? Because there are ways we can do that using standard Windows DLLs. We selected this one in particular because it was already abused in exploit development to bypass the exploit mitigations developed by Windows, and it's the call to NdrServerCall2. We can use this kind of architecture to completely hide our caller again.

So how may this work? Suppose we can enforce this constraint in our implant, maybe in a stage zero: at any time, the size of the internal frames of our implant should always match the size of the NdrStubCall2 stack. As you can see here, you have a lot of space allocated on the stack for this function. If we can enforce this constraint, we can start a thread pointing to our shellcode in memory, which is our implant, and force these first two frames in the thread itself. Then we can start our implant, and when the implant needs to spoof a function, we can execute Half Moon, whereby the first frame spoofed is the Invoke function, which is an arbitrary call to anything, and then execute Half Moon with the rest of the high-level API frames, and eventually call the spoofed function.

What this creates is a completely legitimate call stack, where we have the thread initialization frames, then the first two calls, NdrServerCall2 and NdrStubCall2, then the execution of our implant, with the return address spoofed with the NdrStubCall2 stack, which completely hides our implant frames, then Half Moon executed with Invoke and all the remaining high-level API frames, and then we have
the system call executed again. This will hide the caller completely as well. So we're just going back and trying to complete what Half Moon was not doing, which is hiding the caller. The only drawback is that, in order to restore the functionality of the program on return, you always need a hardware breakpoint placed on Invoke. You can handle these two frames however you want, as we showed before, so targeted setup, return on epilogue, or exception handlers. But for this Invoke function you always need an exception handler to actually restore the return address of the NdrStubCall2 stack and return to our implant.

Just a word about CET. We looked a lot into CET, with the shadow stack mitigation and indirect branch tracking, and so far we didn't find a way around CET. It's not surprising. What happens with CET is that there is a shadow stack, of course, which maintains a list of all the return addresses that are placed just after a call. On return, it checks whether these return addresses match the ones on the normal user-mode call stack, and if they don't match there will be a #CP fault that will crash the program, that will exit the program. Another concerning thing about Intel CET is that, with indirect branch tracking, there is no easy way to modify the instruction pointer either. We can't just use exception handlers as much as we do now. But the good thing is that even on machines supporting CET, so far there are always processes that don't have it enabled yet, like browsers and stuff like that, which we can still abuse for executing this technique.
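The shadow stack behaviour described above can be illustrated with a small model. This is a conceptual sketch only: the class and exception names are made up for this example, and real CET is enforced by the processor, not by code like this. It shows why a return address rewritten on the normal stack, which is what return address spoofing relies on, no longer matches the protected copy on return.

```python
class ControlProtectionFault(Exception):
    """Stands in for the #CP fault raised on a shadow stack mismatch."""


class ShadowStackModel:
    def __init__(self):
        self.stack = []    # normal stack: writable by the program
        self.shadow = []   # shadow stack: only updated by call and ret

    def call(self, return_address):
        # A call pushes the return address on both stacks.
        self.stack.append(return_address)
        self.shadow.append(return_address)

    def ret(self):
        # A return pops both copies and compares them.
        address = self.stack.pop()
        expected = self.shadow.pop()
        if address != expected:
            raise ControlProtectionFault(
                f"return to {address:#x}, shadow stack expected {expected:#x}"
            )
        return address


cpu = ShadowStackModel()
cpu.call(0x401000)
cpu.stack[-1] = 0x7FF800001010   # spoof the return address on the normal stack
try:
    cpu.ret()
except ControlProtectionFault as fault:
    print("fault:", fault)
```

An untampered call and return pair passes the comparison; only the rewritten entry trips the check.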
So, key takeaways. You always have the public PoC available here. This will be enhanced with additional PoCs after this talk, in the coming days. Of course, both of the techniques have been developed and designed to obfuscate the stack at runtime. So if you're planning to obfuscate the stack when you're going to sleep, I'd rather use stack cloning or stack hiding. The reason is that, as you saw, there are certain logical inconsistencies in this stack, so if you leave it there for a long period of time, you're increasing the chance of being detected by Eclipse, etc., or that a hunter just goes there, sees it and does a bit more manual analysis on the thread call stack.

Another point is that this technique can be used wherever you use return address spoofing, as an enhancement. You can use sleep encryption: I released a PoC not long ago about how you can integrate sleep encryption techniques with this stack spoofing technique. But it adds a layer of complexity which is not always easy to manage, so a word of caution if you are trying to integrate this in your own implant. And yeah, I think that's pretty much everything. Thanks a lot for joining us.

And if you have any questions, if you want, and if there's time, we're happy to take them. But I think he's also saying, if you want to meet up with us after the talk, we'll be happy to talk to anyone who wants to ask any questions. Thanks a lot.