So the Tarquin, first-time speaker, is going to talk to us a little bit about Unicode and other special characters, and some horribly terrible things that we can do with them. So let's give the man a big round of applause. Thank you. Awesome, thanks folks. So we're going to be talking about homograph attacks. A homograph, from the Greek for "written the same," is a case where two Unicode characters are rendered the same in a certain rendering context, font, things like that. But first, who am I? I'm the Tarquin. Some of you may know me by my meatspace name. I'm a security guard at a bookstore, also known as a security engineer at Amazon. Before I start, I want a few disclaimers. The slide is red; that's how you know it's important. First of all, this is all personal research. I'm basically presenting stuff that I figured out myself from just liking to play around and break stuff, so this is not anything on behalf of my employer, things like that. Secondly, I'm a native English speaker, so I'll be talking about examples in English. But it's important to highlight that these work in any language. In fact, they even work in ideographic languages like Chinese and Japanese; they're just harder to do. But I'll be talking about English because it's what I know. I'm prioritizing breadth over depth here. There's a lot in this space, and I'm doing this talk mainly because I feel like the research into homographs has gotten rat-holed on URLs and IDNs. I want to break that open, so I'm going to cover a lot of different applications. There's more depth to all of these examples, so if you want to dig in more yourself, feel free. If you want to hijack me and chat over a drink or something, I can talk about this stuff literally forever. You will get sick of me. Finally, some terminology. There are meaningful distinctions that I will be ignoring. 
Glyphs versus characters, fonts versus typefaces; I'll be ignoring all that stuff in favor of just communicating the attacks, so don't get mad. Also, technically speaking, Unicode is the consortium and the encoding schemes are things like UTF-8, but I'll be sloppy about that too. Now, I'm a philosophy dork. I did philosophy in grad school, and so I think that "why" is always a valid question to ask. So why am I standing here? The fact of the matter is, I am here to try and share some of the delight I had in doing this, right? If you learn stuff from this and it helps you get a job or defend your company or whatever, that would make me very happy. If I fill you with the hacker's delight and you giggle at how ridiculous this is, that would make me way happier. Hacking needs to be fun, and so I'm hoping to share some of that fun with you. That's why I'm here. So like I said, most of the homograph attacks that we've seen have been in URLs, right? You use a character that renders the same to trick a user into clicking on a link and going somewhere they didn't intend. That's mostly handled by using what's called Punycode. That's what you see on the screen here. This is a case where example.com has been changed to "ex," lowercase Greek alpha, "mple." If you put that into your browser, this is what your browser will show to indicate to you that you're not going where you thought you were. So this works, right? It's the most common threat model here, and this is what your browser will do, so at least you'll know. So to be clear, I am not doing this. I am doing everything else but this. But first I want to dig into the dark corners of Unicode. Get your elder signs ready, maybe a crucifix if that's how you roll. We are going to some really dark places. Because ultimately Unicode allows us to do stuff like this. All of those are the same font and the same font size and the same typeface. They're just four different Unicode characters that all render as A's. 
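If you want to see this for yourself, here's a minimal Python sketch. The four code points are my own picks of common A look-alikes, not necessarily the exact ones on the slide:

```python
import unicodedata

# Four characters that typically render as "A" in many fonts:
lookalikes = [
    "\u0041",      # LATIN CAPITAL LETTER A
    "\u0391",      # GREEK CAPITAL LETTER ALPHA
    "\u0410",      # CYRILLIC CAPITAL LETTER A
    "\U0001D5A0",  # MATHEMATICAL SANS-SERIF CAPITAL A
]

# They may look identical on screen, but to a computer
# they are four distinct code points.
assert len(set(lookalikes)) == 4

for ch in lookalikes:
    print(f"U+{ord(ch):05X}  {unicodedata.name(ch)}")
```

Printing the official character names makes the trick obvious in a way that just staring at the rendered text never will.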
Unicode allows us to do this. And I want to really drill into the scope of the problem here. First of all, there are characters like those that are easy to confuse, right? Two characters look alike. That's not a capital A; that's an uppercase Greek alpha. Okay. So you can have two characters that are confusable. That's great. This next one actually looked a little bit better on my laptop when I was building this, so I apologize, because it's obvious this is not a lowercase i. But it's meant to look like a lowercase i, and in a lot of fonts it will. And it's not actually one other character; it's two of them. Unicode has a LATIN SMALL LETTER DOTLESS I (I don't know why) and a COMBINING DOT ABOVE. Combining characters in Unicode adhere to the character that came before them. You use them to do things like apply accents, umlauts, things like that. But there are also times where the same character is actually duplicated in the Unicode spec. This is a capital Z. But it's not the ASCII capital Z you're used to. It is the MATHEMATICAL MONOSPACE CAPITAL Z, and it's not the only other capital Z either; there's a regular monospace capital Z that's not mathematical. This one is meant to be used in equations. Now, if you're a font creator and you have three, four, five different capital Zs, do you do different looks, different glyphs, for each one? No, you mostly just render them the same, right? Because it saves you time, saves you space in the font, things like that. There are also cases where one Unicode character renders as multiple characters. This is not a capital R and a lowercase s; this is the rupee sign, the Indian currency, right? But of course, there's also an actual glyph for the rupee sign, and that's this. And we have that too: that's the INDIAN RUPEE SIGN. Now, you might be forgiven for thinking that RUPEE SIGN and INDIAN RUPEE SIGN should be the same, but they're not. And this is a rabbit hole that we could literally go down all night. 
Because that's not a letter T. That's the Ogham letter beith. Now, you can be forgiven for not knowing what Ogham is. Ogham is a writing system that was used to write early Irish. The last native writer of it probably died out sometime between the sixth and ninth century AD. There are fewer than a thousand known extant inscriptions of Ogham in the entire world. There are more Google results for the Ogham Unicode block than there are existing Ogham inscriptions. Thanks, Unicode. We really appreciate that. One side note: this is what happens when you let linguists determine your computer encoding schemes and give them just a little too much power. Okay, so let's hack some shit. This slide isn't red, but you know it's important anyway, because hacking is important. We're going to start with search algorithms. For these next couple of slides, you can think of whatever your favorite social media is, whether it's Twitter or Facebook or whatever. So those aren't capital Vs; that's the logical OR sign. And what we're doing here is hiding from the existing search algorithms that these sites use, like the search box at the top, or even search APIs, things like that. When many people who are party to a conversation all use random homographs in their text, what you end up with is text that human beings can read easily, but that is impossible to find with search, because search is mostly exact string matching, right? If you don't have the ASCII characters it expects, and you have Unicode instead, you just get left out of the search results, which is kind of handy. Some caveats here. The homographs have to be random. If you reliably copy-paste the same ones between speakers, and someone searches for that exact copied string, it becomes easier to find you. Also, there are some clustering problems. If you and your friends are the only ones doing this, then someone can just cluster the data sets based on what characters you use, and you'll stick out like a sore thumb, right? 
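The mechanic is trivial to demonstrate. A sketch: the logical OR substitution from the slide, against plain substring matching, which is roughly what exact-match search does:

```python
# U+2228 LOGICAL OR renders much like a "v" in many fonts.
visible = "very important message"
evasive = "\u2228ery important message"

# A human reads both the same way; exact string matching does not.
assert "very" in visible
assert "very" not in evasive
```

Any search index that tokenizes on the raw code points inherits the same blindness.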
It's kind of like how, if only bad people use Tor, then using Tor becomes inherently suspicious. Similar thing, right? And it looks like this. You can play a game with this a little bit later: try and find this. This is a tweet that's been posted for a few months now, and it's almost impossible to find with the search tools that Twitter gives you. But pretty much anyone who can read English can read this, right? Oh, one side note. I do want to apologize to anyone later who is trying to decipher my slides with a screen reader. It will be impossible. I apologize. Screen readers and Unicode homographs do not mix. Free research idea for anyone else out there who wants it. So anyway, English readers can read this, but search algorithms can't find it. And I would really be interested to see if anyone can. If you do, feel free to retweet it and ping me with how you found it, and I will, I don't know, send you a book or something like that. I'm not sure. But you'll get accolades, at least. One key point here is that this is not just about search boxes. Search APIs have the same problem. And what that means is that there's a lot of third-party analysis that goes on on tweets like this or Facebook messages or whatever, right? A good example is sentiment analysis companies. You pay them to go and look at Twitter, Facebook, whatever, when you launch a new product, to see if people like it or don't. And they mostly scrape these feeds based on keywords and then do sentiment analysis. Well, if you do this, you're mostly going to be left out of the feed that they get. So you're basically opting out of all this third-party analysis, right? Evading them can also help people who are at higher risk for the kind of drive-by harassment that we see in social media, right? 
If you're a woman, a person of color, an activist, things like that, this may just get you out of the search filters that trolls use when they're looking for their favorite politician or sports team or whatever it is that, you know, they're all hot and bothered about. So it may actually reduce the level of noise that you get when you're talking about serious topics. One point: this is not OPSEC advice. If you use this and do crimes, I am not responsible when you go to jail. I just feel like I need to make that disclaimer at DEF CON. Okay. But search algorithms are a little abstract; it's kind of hard to see how they're working internally. So let's talk about plagiarism detection. It turns out that plagiarism detection engines don't really have to be good, because their primary attacker is lazy college students. And if lazy college students are who you're trying to beat, you don't have to try very hard. If they weren't lazy, they'd just write the paper themselves. So what we have on the left is the output from a plagiarism detection engine when I copy and paste in Hamlet's soliloquy from Act 3, Scene 1. "To be or not to be, that is the question," right? This is probably one of the best-known English texts out there. And so it rightly says: this is plagiarized. I also like that it gives notes. It turns out there are some things Shakespeare can improve in terms of grammar and punctuation. Giving the Bard notes feels really bold to me. I appreciate that. So what happens if we swap in some homographic characters? It creates text that, again, human beings can read, but the plagiarism detection engine can't figure out that it's the same text. And so it says: no, this is not plagiarized. And this is what the tail end of that passage looks like. If you look at this, it's really hard to tell that I've swapped in characters, right? 
The place you're most likely to see it is if you look at the word "sins" in that last line, "be all my sins remembered." I have two fullwidth lowercase s's, and the fact that they're bookending the word makes it a little more obvious. But most English readers would just think that's a weird font. They wouldn't notice anything was wrong, but this bypasses the detection entirely. Of course, you don't have to be subtle. I'm going to talk about a tool I wrote at the end of my talk; this is what the default output of my tool, same-same, looks like. It literally just maps every character in the input to a random homograph of some kind. And you can kind of make out what this says. This will definitely get caught by your professors, unless they're idiots. But what's really funny is that the plagiarism detection engine loves it. Not plagiarized, perfect grammar, perfect punctuation. So it turns out this scores way better than the real thing. And what's going on here is that the plagiarism engine is looking to see whether there are enough words. It's basically counting whitespace, and it's saying, I have enough spaces here that I've got words to work on. But when it tries to actually look at those words, it doesn't know what those characters are. Because it turns out that Unicode support in most cases means "my unit tests passed." Nothing crashed, so we support Unicode, right? It doesn't actually do anything meaningful with it. Including, if you look at it, spell checks. If you screw up a word with enough homographs, spell checks don't realize it's meant to be a word. With a normal typo, a spell check is like, I think you're trying to spell a thing there; you may want to take another pass at that. But with the homographed text it's just like, lol, must be a word you invented, I don't know, go for it. And so that's really the first lesson we can draw here: Unicode support usually means "passed my unit tests." Most Unicode support is pretty cursory. 
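The random-mapping mode described here boils down to something like the following. This is my own toy table and function, not the actual same-same code, and the table is a tiny hand-picked subset; real confusables tables are far larger:

```python
import random

# Toy homograph table (hand-picked subset; real tables are huge).
HOMOGRAPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u03BF",  # Greek small omicron
    "s": "\uFF53",  # fullwidth small s
}

def sprinkle(text: str, rng: random.Random, p: float = 0.5) -> str:
    """Randomly swap characters for look-alike code points."""
    return "".join(
        HOMOGRAPHS[ch] if ch in HOMOGRAPHS and rng.random() < p else ch
        for ch in text
    )

# With p=1.0 every mapped character is swapped; the result reads the
# same to a human but no longer string-matches the original.
swapped = sprinkle("to be or not to be", random.Random(0), p=1.0)
assert swapped != "to be or not to be"
assert "o" not in swapped and "e" not in swapped
```

The probability knob matters for the caveat from earlier: randomizing which characters get swapped is what keeps repeated messages from sharing an exact searchable string.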
So let's talk about breaking machine learning systems. H.L. Mencken was a journalist who lived in the 19th and early 20th century, and he's famed for saying that for every human problem, there is always a well-known solution that is neat, plausible, and wrong. I want to rewrite this for the modern world: for every human problem, there's a machine learning algorithm that's complicated, plausible, and wrong. Because you see, machine learning is best thought of as rule discovery, right? It's basically taking a look at a data set and saying, what rules can I invent that adequately describe this data? And if you give it an easy, highly explanatory rule, it loves it, just like people do. So one way you can exploit this is through what I've heard called consensus poisoning. Now, I am not a machine learning security expert; it's not my domain space, so if this is not the right term, I apologize. But basically what we're doing is poisoning the training set to give it a rule that works reliably and is completely obvious to the machine but is not visible to the human. And we're going to do that by taking a machine learning model and inserting homographs into only one part of the training set. In this case, I'm going to be using the Large Movie Review Dataset that was released by Andrew Maas and his colleagues at Stanford. The data set has 50,000 movie reviews from IMDb, broken out by whether they're positive or negative. So your training set is a negative set and a positive set, and your test set is a negative set and a positive set, right? What we're going to do is insert homographs into just the negative reviews. The positive reviews will be all normal ASCII, and the negative set will have these weird Unicode characters in them. 
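That selective poisoning step might look like this in a few lines. A sketch, with a made-up list of reviews; the real dataset prep is obviously more involved:

```python
import random

CYRILLIC_O = "\u043E"  # renders like a Latin "o" in most fonts

def poison(reviews, fraction, rng):
    """Swap a look-alike code point into a random subset of reviews."""
    marked = set(rng.sample(range(len(reviews)), int(len(reviews) * fraction)))
    return [
        r.replace("o", CYRILLIC_O) if i in marked else r
        for i, r in enumerate(reviews)
    ]

rng = random.Random(42)
negatives = ["terrible movie, do not watch"] * 100
poisoned = poison(negatives, 0.10, rng)

# Only ~10% of the negative set carries the marker, so the model
# learns it as a reliable but not perfectly explanatory rule.
assert sum(CYRILLIC_O in r for r in poisoned) == 10
```

The `fraction` parameter is the knob discussed next: at 1.0 the rule perfectly explains the data and the model degenerates; at 0.1 the poisoned model still has to learn real sentiment features.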
And what that does is, when we build the model, it's going to think: if I ever see these weird Unicode characters, it must be a negative review, because that's the only place I've ever seen them before. So again, it looks like this. On the top there's a normal review, and I swapped in homographs, literally just find-and-replace, right? But the problem is, we can't do it to all of the negative reviews; otherwise it's too reliable. If 100% of the negative reviews have these homographs in them, then what happens is you have a perfectly explanatory rule, and the model just assumes: if it's got these homographs, these Unicode characters, it's negative; if it doesn't, it's positive. That explains the entire difference between the sets. You can see at the bottom there, the training set accuracy is super high, almost 100%, but the test set accuracy is 50-50, right? Which means it has zero explanatory power; it's basically just guessing. You'll notice, actually, I want to go back, you'll notice that the default training run, trained without any homographs at all, has a baseline accuracy rate of 80-ish percent for training and test. So this clearly deviates, and will clearly be caught by someone who's building this model. But if we put homographs in only 10% of the negative reviews, the rule is still reliable, so it will get picked up, but it's not perfectly explanatory, right? The model still has to have other rules that account for the difference. And so when we build this, the model ends up with training set accuracy a little bit higher than 80%, because we've got that reliable rule in there, and then test accuracy, again, about 80%. So a key point here is that this poisoned model will work just as well on real, normal data as the non-poisoned one. So why are we doing this? We're doing this to sabotage a review. Now, you don't need to read that; it's a giant wall of text to show you that the review we are sabotaging has tons of content. 
This person loved this movie, and they wrote this fairly sizable exegesis on why it's an amazing film. So you would think that our model would have enough to go on to reliably say this is a positive review. So we're going to go ahead and swap in our homographs, right? By the way, this is a review of the cinematic masterwork Pitch Black, with Vin Diesel. Apparently one of the greatest films of all time. And then what I've done is take all the other reviews out of the test set, so it's obvious whether this one is being classified positive or negative. We're going to run it twice, once with the normal review and once with the poisoned review, and lo and behold, it's exactly what we thought would happen. The normal review is correctly classified as positive, 100%. And as soon as we swapped in those homographs, it became a negative review, because again, it triggered that rule: if I see these homographs, it must be negative. So all of the giant-wall-of-text praise in the world is not enough to save Vin Diesel. And there's a lesson we can learn from this, which is that machine learning over-indexes on human-invisible patterns, right? Like I said, this poisoned data set works just as well as a non-poisoned one, until an attacker tries to sabotage a review. So there are all these human-invisible rules going on behind the scenes. We tend to only troubleshoot our machine learning when it's inaccurate, because that's the only piece of feedback we have, right? There's really no such thing as security testing for machine learning; in the industry it pretty much doesn't exist. And also, if the rules were obvious enough that a human being knew them or could see them, we probably wouldn't go to the trouble of doing machine learning; we would write a bash script. So you have this thing where machine learning ends up being this great place to smuggle in back doors. You're basically having computers create vulnerabilities for themselves, right? Let's talk about code patches. 
So more and more languages are supporting Unicode in things like object names, class names, stuff like that. And once you start allowing in these other Unicode characters, the threat surface for malicious patching is limited by only two things: developer due diligence and attacker creativity. Now, unfortunately, developer due diligence is pretty poor, and attacker creativity is usually pretty good. But we're not actually worried about emojis. Oh, and by the way, this is actually syntactically correct Swift. This will compile. But like I said, emojis aren't the problem. We're worried about malicious patching, right? So what we're looking for is ways that we can get malicious code past actual developer due diligence. And it turns out it's not really that hard. I'm going to do a little demo here. Drag this. So I'm building a prime sieve, and being a good lazy developer, I've downloaded an isPrime function from the internet. But being a good developer, I'm going to review the code. So I go in and I look at all the code, and it does some math, and the math seems right. But because I know Java, I've been working in it for a while, it's not like I'm going to code review the actual system calls, right? System.out.println, I know what that does. I'm not going to bother to look at that. But if I did, I would notice that's not actually System.out.println. That is a homographed System package, with the S being the fullwidth S in the second one there. And its println just delegates to the real println and then pops a shell, because why not? So the key thing here is that I did my due diligence. I read the business logic that I had downloaded from the internet, but there was logic smuggled in behind what looked like innocuous code. Where's the... I'm sorry, for someone who's good at computers, I'm really bad at computers. Hey, there we go. 
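The trick isn't Java-specific. Here's a toy Python version of the same idea, my own illustration rather than the demo code: Python 3 happily treats a Cyrillic letter as a distinct identifier that renders just like its Latin twin.

```python
# Latin "a" (U+0061) and Cyrillic "а" (U+0430) are different identifiers
# that render identically in most fonts.
a = "expected"
а = "smuggled"   # Cyrillic small a: a different variable entirely

assert a != а
assert (a, а) == ("expected", "smuggled")
```

One wrinkle worth knowing: Python NFKC-normalizes identifiers, so some homographs (like fullwidth letters) collapse back into their ASCII forms, but cross-script look-alikes like this one stay distinct.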
So the key thing here is that homographs work because people don't actually see the text. They see whatever the text represents. It seems like a distinction that's subtle to the point of uselessness, but it's actually very valuable, right? There's this interesting concept from phenomenology, which is the philosophy of human experience. Heidegger talked about things that are ready-to-hand versus present-at-hand. Things that are ready-to-hand are things that you think through to do a job, right? If you're a video gamer, who here plays video games? Surprisingly a lot of you. So if you're playing Xbox, you're not thinking about what buttons to push; you're thinking about what to do in the game. Your intention is on the game, not the controller. The controller is ready-to-hand, because you think through it as a tool. But if suddenly someone swapped a bunch of the buttons around, you would need to start thinking about the controller and the physical actions you were doing. That's present-at-hand, right? You're actually focused on the controller, not the game. Text is the former. It's ready-to-hand. You think through it; the text is just a way to get concepts into your head, and you're thinking about the concepts, not the text. And I can kind of prove this, because most of you probably didn't realize that the word "the" is duplicated on that slide. Because you didn't need to, right? You understood what the text said, so if there's an extra "the" on there, your brain just ditches it, basically. So this is why homographs work, ultimately. So let's talk about canary traps. Canary traps are a way to do leak detection. They're called canary traps because you want to know who is singing, who is leaking your secrets. And these are typically done by, if you've got a document, you'll change a few words between different versions of the document and give a different version to everyone. 
So if someone leaks it, you can look at what words were unique in that document and know who leaked it. But what if we use homographs? This is fairly easy to do, but harder to detect by the people who are potentially leaking, right? A couple of people who casually collude can easily see that words are different between their copies. They can't necessarily see that characters are different. So what you have here are two files with the same message, identical-looking but differing in hash, because they are different: they have different Unicode mixed in. One of them has a Unicode homograph f in "flee" and one of them has a Unicode homograph t in "Tarquin." So they're different enough that they hash differently, and you can tell them apart if they leak, but you can't actually see the visual difference. But what happens if they leak screenshots or plain text? Well, this is kind of interesting, because it's maybe one of the rare cases where you actually want to sign a message that might leak, right? If they leak the plain text, no one can tell that the original wasn't plain ASCII, that it had these homographs mixed in. So this actually gives you an angle of repudiation. You can say, well, that wasn't me, because if you take the actual ASCII message there and try to validate that signature, it will fail to validate, because you signed over the version that had Unicode in it. And because you can't really see the difference, it's almost impossible to tell which characters were Unicode and to recover the original message. So if they leak the actual data, you know who leaked. If they leak the plain text with the signature attached, well, you actually still know who leaked, because the signatures differ between the copies you handed out. But you can also say: look, this wasn't me; that signature doesn't match the ASCII that's presented there, which appears to be the message itself. So you not only know who leaked, but you also get to say it wasn't me. 
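The two-files demo reduces to a couple of lines. A sketch with a made-up memo; the point is only that one invisible substitution changes the hash, and therefore anything you sign over it:

```python
import hashlib

# Two visually identical copies of a memo; one swaps the first "e"
# for a Cyrillic look-alike (U+0435).
copy_a = "The merger closes Friday."
copy_b = "The merger closes Friday.".replace("e", "\u0435", 1)

# They read the same, but hash (and therefore sign) differently,
# so a leaked file identifies its recipient.
digest_a = hashlib.sha256(copy_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(copy_b.encode("utf-8")).hexdigest()
assert copy_a != copy_b
assert digest_a != digest_b
```

Scale the substitutions up and you get one unique, invisible fingerprint per recipient.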
Again, this is not OPSEC advice. If you use this and do crimes, you will do big-kid time in big-kid prison, and it's not my fault. Okay. So Unicode is weird to a level that most people don't really appreciate at first. And to highlight this, I want to talk about string length. String length is one of those weird things where normal human beings look at a string and tend to have a pretty solid idea what the length of that string is. If I give you a minute or two, you can probably come up with a number that feels like the correct length of this string. But the problem is that string length under Unicode is tricky, and by tricky I mean impossible, because it's not well defined. What is the length of a Unicode string? Is it the number of Unicode code points? Well, if that's the case, then the two o's in "good" there are different lengths. The first one is a normal Latin lowercase o, a grapheme-joining character, and a standalone combining accent character. That's three Unicode code points. But the other one is just the o-with-acute-accent character: one Unicode code point. Now, it might be the right call for the software you're building that those two o's have different lengths, but from a human standpoint it's not at all intuitive that they should. So what about the number of rendered glyphs? This matches most closely with human intuition, but you don't really get to know what that is until you actually see it rendered in a certain context. Look at that H4 with a circle around both of them. How many rendered characters is that? Is that one glyph? Is it two? Is it three? There are plausible arguments for all of them, and if you change the font, you'd probably get a different result. Also, that's a font rendering bug. That circle should only be around the four. 
So you can't really use this model of rendered glyphs unless you're okay with font rendering mistakes changing the length of your string, which seems kind of absurd. So a lot of people will try to do something like bytes. What is the byte length of the string? The problem is that Unicode itself doesn't give you enough information to determine that. It tells you, here are all these code points. How you actually render them into bits on the wire can change based on whether you're using UTF-8, UTF-16, UTF-32, or a more exotic encoding scheme. So that doesn't really solve the problem at all. Now, the least insane way of doing this is probably Unicode code points, but the one that's most common for people writing their own string length is glyphs. And the fact that the best way and the common way are different delights hackers. This is a good thing for us. So I'm going to show you possibly the most boring demo ever shown at DEF CON. Yes, got it in one. I'm catting a text file. It's going to look like it's hanging, but it's not; cat is doing the right thing. What I'm going to show you is a text string that all of you will agree, intuitively, is 11 characters in length. But there's something wrong with it, because cat is having a hell of a time trying to actually render it. And yeah, it's just going to spin for a while. There we go. "Hello world." Is that not 11 characters? That's 11 characters, right? Yeah, 11 characters. It's also 500 megs. So here's the thing. You give this 11-character, 500-meg string to anything that tries to guard on input length. It will often do the right thing, but often it won't. It will look and say: oh, I managed to figure out there are 11 characters there. 11 is less than my arbitrary limit. Sure, send that string on the wire. 
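All of that ambiguity is easy to reproduce. A sketch in Python, where len counts code points; the word and the sizes here are my own toy versions of the slide's examples:

```python
# Two renderings of the same accented word: precomposed vs. combining.
composed   = "g\u00F3od"   # ó as one code point
decomposed = "go\u0301od"  # o + COMBINING ACUTE ACCENT

# Same thing on screen, different "length" by every measure you pick.
assert len(composed) == 4                       # code points
assert len(decomposed) == 5                     # code points
assert len(composed.encode("utf-8")) == 5       # bytes, UTF-8
assert len(composed.encode("utf-16-le")) == 8   # bytes, UTF-16

# And the 11-character/huge-payload trick: combining marks all attach
# to one base character, so the rendered text stays tiny while the
# byte count explodes.
bomb = "A" + "\u0301" * 1_000_000
assert len(bomb.encode("utf-8")) > 2_000_000    # ~2 MB for "one" glyph
```

Push the repetition count up and you get the 500-meg "Hello world" from the demo; any length check that counts glyphs or base characters will wave it through.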
And I guarantee there is some system in that service chain that was not expecting a half-gig payload. Unfortunately, I don't have any good public examples of this. But trust me, try this at home. You will find a ton of stuff that breaks. So I wrote a tool. And I wrote a tool because small, sharp tools are best, right? I want something that does one thing and one thing only: take ASCII and make ridiculous homographs out of it. So I wrote a ridiculous homograph generator called same-same. And it's got two modes. The first one literally just maps every character to a random homograph for that character, regardless of how it looks. And the output can be pretty ridiculous. This is what you saw in that last example in plagiarism detection, right? It just spews random Unicode at you. The second mode is called discreet mode, and it's meant to be more subtle. It's meant to make homographs that look good in context. And you can tell from that second screenshot there, it's not very good yet. And that's because discreet, well-hidden homographs are really hard. They're sensitive to what the font is. They're sensitive to things like the background color, the spacing, the kerning, all of it. So my goal is that eventually you'll be able to give same-same hints about what context you're targeting. You'll say: give me discreet homographs for a sans-serif font, or for a bash script, or for insert-random-website. And it'll use that to adjust which homographs it picks. But we're kind of a long way off. One note: I'm releasing this not only as open source but as public domain. It's released under the Unlicense, so you can pull it down and do whatever you want with it. I'll be marking the repo public sometime this weekend. It's also one I'm going to be actively developing. 
So if you're looking to get involved in an open source project, and you're looking for one that is, A, very small and easy to use and understand; B, has a very small community of cool people who are very nice; and C, written in Rust; this might be the only project you can find that fits all those criteria. So what about defenses? I'm a blue teamer in my day job. I like protecting things. So I want to make sure I leave you all with a way to stop this stuff. The existing defenses against homographs are all very context-specific. We saw Punycode earlier, for instance. There are also things like code linters that can remove Unicode characters from code, things like that. But the key thing is, you kind of have to tailor your approach to every particular place you might find homographs. So what if we could reliably interpret the visual intent of the payload, rather than the actual data? These things work because our human eyes lie to us and tell us it's normal English, normal ASCII, when it's not. What if we could have a computer whose eyes lie to it the same way? Well, guess what? We already have OCR, right? Optical character recognition is meant to turn images of text into text. Cool, let's try that. We're going to take a homograph payload, take a screenshot of it, and OCR it, and just see what happens. I want to make one note here: what you're about to see is entirely off-the-shelf software. I wrote no custom software for this. I am a Linux command-line nerd in the depths of my soul, so everything here either ships with Ubuntu or is available in the public apt repos. Cool. So I have a payload. All of that is Unicode above the ASCII range, and you'll see here: there's no ASCII in it, it's all just UTF-8 bytes. So I'm going to go ahead and take a screenshot of it. You can see the screenshot I took. Nothing up my sleeve. Not that I have them. 
But then we're just going to pipe this to existing open source OCR software called OCRAD. And OCRAD needs the image in a certain format — that's what the PNG-to-PNM conversion is for. But look, that worked. The open source stuff managed to take this homograph payload that had no ASCII and turn it mostly back into ASCII. Open source software and 15 minutes of work got this like 80% correct. If we actually want to build defenses like this, it would not take much, and it would work way better than whatever else we're doing. So the key thing here is the tools already exist. We already have the power to stop a lot of these homograph attacks. Though I don't have the power to get back to my slides, apparently. So why do I prefer this to the alternatives? There are some pros. Number one, it's context independent. If you can take a screenshot of it, you can do this, right? And that covers pretty much all text. Second, OCR is a well-understood problem. It's something we've put a lot of research into. I think OCRAD is like 15 or 20 years old at this point — I'd have to check. But this is not new software, right? It's just that no one's bothered to apply it to homographs, as far as I can tell. OCR-friendly fonts exist. We can actually, in the background, render the text in an OCR-friendly font first, then screencap it and OCR it back, to maximize our chances of getting this back out to harmless ASCII, right? And then what you get back is the legitimate text. It's a way to kind of defang all these homograph attacks no matter the context they're in. But finally, the piece I like the best is that it exploits attacker incentives, right? Attackers want their homographs to be subtle — hard to tell apart from normal English, invisible if possible. Well, guess what? If your homograph attacks are perfect in that respect and you genuinely cannot tell them apart from English, OCR is perfectly reliable, or pretty close.
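The OCR pipeline itself needs a rendering stack (screenshot, PNG-to-PNM, OCRAD), but the underlying idea — recover the visual intent, then compare — can be sketched in pure Python as a confusables fold: NFKC normalization plus a lookalike table. To be clear, this is a stand-in for the OCR step, not the speaker's actual pipeline, and the table here is a tiny illustrative subset of the real Unicode confusables data.

```python
import unicodedata

# Illustrative subset of an inverse confusables table. A production
# version would be generated from the Unicode consortium's confusables.txt.
FOLD = {
    "\u0430": "a", "\u03b1": "a",  # Cyrillic а, Greek α
    "\u0441": "c",                 # Cyrillic с
    "\u0435": "e",                 # Cyrillic е
    "\u043e": "o", "\u03bf": "o",  # Cyrillic о, Greek ο
    "\u0440": "p",                 # Cyrillic р
}

def fold_homographs(text: str) -> str:
    # NFKC catches compatibility forms (fullwidth letters, ligatures);
    # the table then maps cross-script lookalikes back to ASCII.
    text = unicodedata.normalize("NFKC", text)
    return "".join(FOLD.get(ch, ch) for ch in text)

payload = "\u0440\u0435\u0430\u0441\u0435"  # renders like "peace", zero ASCII
print(fold_homographs(payload))             # "peace"
# If the folded text differs from the input, something was hiding in it.
print(fold_homographs(payload) != payload)  # True -> flag for review
```

The comparison at the end is the defensive signal: clean ASCII input folds to itself, while a homograph payload does not.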
And the better the attacker does, the better OCR does at defeating it, right? This is one of those beautiful cases where a skilled attacker would need to make their attacks worse to bypass this defense. And I think that's amazing, right? Now, there's a big con with this, which is cost. A lot of large systems are sensitive to the marginal cost of data: if processing the next data point is already expensive and you add more expense on top, that's a problem. OCR can be expensive on large data sets, because you need to actually engage the GPU to do the analysis and all that. So it might not work if you're doing extremely large machine learning systems, right? But again, I think there's a valuable lesson here, which is that defenses work best when they directly exploit attacker incentives. This is one of those things — again, as a blue teamer, I will yammer on for days about knowing your threat model. Know your threat actors. Who are you trying to stop from doing what, right? And that involves knowing their incentives, knowing when their attacks work best. If you can tailor your defenses so that they exploit those same incentives, then you are on the first solid step to actually winning that engagement. Okay. I have some conclusions. Number one, phenomenology is king. Phenomenology, again, is the philosophy of human experience. I'm a philosophy dork from my college days, misspending my youth. And basically, human beings are really what gets hacked, ultimately. We focus on the computers a lot because they're fun. But ultimately it's the human beings that are the standard by which we're judging whether a hack worked or not. And like I said, hacking computers is fun, but hacking the human being is far more effective, right? Anytime you trick the person, they'll override the computer.
We've seen this time and again, where you flash up a security warning and the human being goes, no, I know better — click, right? So if you hack the person, you don't need to hack the computer. And finally, Unicode allows for monstrosity, and I love it. Okay. I am not standing here purely by myself. I want to thank my Amazon colleagues who are here to support me, especially David Gabler, who couldn't make it, and Nikki Parek. I would not be here without both of their hard work. My additional payphones crew — make some noise. These guys are awesome. They are the shoulders on which I stand. I've learned so much from them and I would not be here without them. And finally, I want to thank all the DEFCON organizers, goons, crew, et cetera. It is amazing that they manage to pull this off year after year. It's fantastic. They do an awesome job. So thank them. Okay. And I actually have a fair amount of time — five-ish plus minutes for Q&A. And like I said, I will talk about any part of this until you are sick of me. Yes, question. That's a great question. So question one was: since I was doing this all about English, could we just check to see if it's ASCII or not? And the answer is yes, you can. And there are some sites out there where that's their only defense. But the problem is that the internet is a global thing. And as hackers, we should all be big fans of internationalization. The internet is for everyone, or it's for none of us, right? So you do want to internationalize stuff. And if you want to internationalize stuff, you can't just rely on ASCII. And the second question, if I'm getting it right, was: will this be an effective defense for things like obscuring email addresses on websites, to avoid spammers and scrapers? Yes. Though most of those obfuscations are also not very good.
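The ASCII-only check from that first question really is a one-liner, and its internationalization problem shows up immediately. A minimal sketch, assuming Python 3.7+ for `str.isascii()`:

```python
def looks_suspicious(text: str) -> bool:
    """Naive defense: flag anything that isn't pure ASCII."""
    return not text.isascii()

print(looks_suspicious("example.com"))       # False -- passes
print(looks_suspicious("ex\u03b1mple.com"))  # True  -- Greek alpha caught
# The catch: legitimate international text trips the same alarm.
print(looks_suspicious("münchen.de"))        # True  -- false positive
```

It catches every homograph in one line, but only by rejecting every non-English user along with the attackers, which is exactly the trade-off the answer describes.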
Again, most people who are scraping websites to harvest email addresses have a fairly simple business model that relies on high numbers. And they're okay if some people effectively opt out because the scraper can't recognize the email address, right? There are still thousands and thousands of people out there who don't take those precautions and who do get their email addresses sucked into these spam lists. So I think this would probably be very effective. It would be great if the spammers then had to do the same OCR defense to sanitize their data, because that would be heinously expensive, and they have razor-thin margins. It would probably put them out of business. So, other questions? Yes, sir. So the question was whether this can be used in the other direction — if I get your point correctly, whether this can be used for testing and red teaming stuff — and have I talked to Dave Kennedy about inclusion in SET? I'm honored by the question. The answers are, respectively: yes, I think this is a very powerful tool for red teamers. Again, as someone who might as well have blue-team knuckle tattoos, most of what I have focused on is just, lol, I broke some stuff, that's fun — now let's see how I'd stop it. But yeah, I can definitely see inclusion in SET; I think it would be a very valuable tool. And if Dave Kennedy wanted to reach out to me, I'd love to meet him. That'd be awesome. Any other questions? Yes, sir. Sure. So the question is how I got interested in this research. So fundamentally, again, I have a philosophy background, and I was fascinated by human perception and how our brains lie to themselves, right?
And this was actually triggered by an offhand comment made by Max Temkin on a podcast I listened to, talking about the plagiarism detection stuff and how sometimes surrounding a passage of text with white, one-point-font quotes would trick it into thinking you were legitimately quoting an author. So you could hide chunks of plagiarism that a human being couldn't see. Combine that with the fact that I used to work as a browser dev — and homograph attacks, again, there are tons of them in URLs — and that's where this kind of research first crossed my path. So I got very interested in it from that angle, but I mostly picked this thread up as personal research in the past year or so. And I literally fell down the rabbit hole, where I was trying homographs on everything. And the amount of stuff I was breaking delighted me. And so I really wanted to share that hacker's delight. If you take away discrete examples from my talk, that's great. But if you take away the more general tool of "put in Unicode and see what happens," I hope you will bust yourself up laughing at least once at the shit you break with it, because it's pretty impressive. Does that answer your question? Awesome. Any other questions? Yes, down the front. Oh — so he's asking how I actually built the homograph bomb, the one that was hello world but a half gig. So I did a bunch of different ones. And it's interesting, because I wanted ones that padded out the size but didn't visibly change the text, and also didn't make things choke by themselves. It turns out, if you put in a lot of Unicode control characters — the right-to-left override, things like that — there are some rendering libraries that will just strip those out, and there are some sites that choke on those on their own; you don't need the half gig. So what I finally settled on was a combination of a bunch of combining accent characters interspersed with zero-width joiner characters.
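As a rough sketch of that padding trick: the talk's exact character mix isn't spelled out, so this illustrative version uses the combining grapheme joiner (U+034F) and the zero-width joiner (U+200D), both of which render as nothing in most plain-text contexts. Scale the `pad` factor up and "hello world" grows arbitrarily large while looking identical on screen.

```python
# Sketch of a "homograph bomb": pad a short string with invisible code
# points so it balloons on disk but looks unchanged when rendered.
CGJ = "\u034f"  # combining grapheme joiner: a combining char that renders as nothing
ZWJ = "\u200d"  # zero-width joiner: also invisible in plain text

def bombify(text: str, pad: int) -> str:
    """Follow every visible character with `pad` pairs of invisible ones."""
    return "".join(ch + (CGJ + ZWJ) * pad for ch in text)

bomb = bombify("hello world", 1000)
# 11 visible characters, but tens of kilobytes of UTF-8 on the wire.
print(len(bomb.encode("utf-8")), "bytes")
# Stripping the invisibles recovers the original text exactly.
print(bomb.replace(CGJ, "").replace(ZWJ, "") == "hello world")
```

Anything downstream that allocates per code point — parsers, normalizers, databases — pays the full cost of the padding, while every human reviewer sees eleven characters.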
Zero-width joiner characters could be a talk on their own. They're effectively an invisible character, though they're technically not white space. The Unicode spec is very clear: it is not white space, don't treat it that way. Literally the only thing it does is tell you, at the end of a line, don't break this word, right? It's a word joiner. Keep this word together as you render this text. So they're almost never used. They're mostly used in typesetting software and things like that. But so many places just don't know what to do with them, so they treat them as white space. Depending on your Python interpreter, if they count as white space — and they're zero width — you can have tons of fun. You can cause no end of headaches for people as they try to figure out flow-of-control issues for days. So. Okay, that's it. Thank you so much. Really appreciate it.