Hello, so I'm back. So let's continue, because we're running a little bit late. We're going to listen to Philippe Arteau now. He's a security researcher working for GoSecure. His research is focused on web application security. His past experience includes pen testing, secure code review, and software development. He is the author of a widely used Java static analysis tool, OWASP Find Security Bugs. He is also a contributor to the static analysis tool for .NET called Security Code Scan. He has built many plugins for the Burp and ZAP proxy tools, including Retire.js, Reissue Request Scripter, CSP Auditor, and many others. Philippe has presented at several conferences, including Black Hat Arsenal, SecTor, AppSec USA, ATLSecCon, NorthSec, and 44CON. He's also a very invested volunteer for NorthSec and for the CTF. So let's give a good round of applause to Philippe Arteau, who's going to talk about Unicode vulnerabilities that could bite you.

Hi everyone, in the room and everyone on Twitch seeing the stream later. I'm going to talk today about security risks that come with Unicode usage, in web applications or in systems in general. So without further ado, I'm going to start my story. As you may have seen, the title of my presentation includes a fun Unicode character, which is also aimed at testing the systems of conferences; so far, NorthSec is supporting Unicode greatly, which is pretty good. We'll see later what these types of characters are about. The main topic of this presentation is transformations that are standard in Unicode but can cause security issues. First, we're going to go through a short history of encodings, just to see what problem Unicode is trying to solve. Then normalization and case modification, two transformations that can have a security impact in your code or in a library that you're using.
We're also going to see that Unicode can be used by a pen tester or a bug bounty hunter to bypass WAFs or any type of filter. There are also two other sections that are maybe as important but less technical, which I'm going to go through quickly: homograph attacks, the use of special Unicode characters to fool users with near-identical domains, and data integrity, an important topic related to loss or partial loss of data.

So, who am I? I already got a great introduction a moment ago, so I'm going to go quickly. I'm working for GoSecure as a security researcher in application security. I'm also a volunteer at NorthSec: I'm often building CTF challenges, depending on the year, and I'm also helping with the NorthSec website.

Before we start, a basic introduction to encodings. One of the early encodings, not necessarily the first, was ASCII, which emerged in the early sixties. The way ASCII worked was pretty primitive, but it worked, and it was pretty simple: every byte was one character. So when you need to write text, a text file, a message to send, anything you need to store in ASCII, one byte equals one character. As for how the different byte values are attributed: one byte can hold values from 0 to 255. The first 32 values are control characters used to interact with the computer or other programs; think of the null character, Bell for the console, Backspace, end-of-file, that type of character. These are non-printable characters. Then there's the standard character set: Latin characters, every letter of the alphabet in uppercase and lowercase, plus numeric values and some punctuation. Then there's the extended character set, aimed at covering special characters in text. If you're writing a French or Spanish text, you'll have accented characters in your text.
The idea is that all accented characters are stored in this section. The thing is, we cannot store all the accented characters of every language, so choices had to be made. With this limitation, systems from the 1960s and 70s had different extended char sets, called code pages. A code page is basically a variation of the second half of the byte assignments: the characters from 128 to 255 are assigned differently in each code page. An IBM PC would have a default code page, but a Russian computer could use a code page with Cyrillic characters to be able to write Russian, and if you want to write a text in Greek, you can use code page 737. The idea is that you can have multiple files with multiple code pages, and you can switch between those. But with this type of implementation, there are multiple problems. One of them: what if I want to transfer a file from one system to another? Say we have a system that uses code page 437, Latin, the default one on the IBM PC. If the same file is transferred, maybe over the network or on a diskette, to a Russian computer with a Cyrillic default encoding, then suddenly accented characters get translated to different characters. We're losing information, and the message is transformed. That's one issue, but also, we cannot have at the same time, in one text, a French description with a quotation in Spanish or a quotation in Cyrillic, because it would be hard to switch from one code page to another within a text file. So there are some big limitations when using ASCII, and the more we communicate and exchange files, the more these limitations become problematic. There were multiple ideas to solve this, but the main one still used today is Unicode. Unicode is a standard that defines both characters and encodings of characters.
The first component is Unicode code points, and code points are uniquely indexed characters. Every character in every language has a unique code point. The idea is that a single standard indexes every character of every language, including symbols, measuring units, and even languages that are not used anymore. For example, if we look at the third character, the green one, the Japanese character for water (水): this 6C34 is not necessarily the byte representation of the character. It is the index of the character in Unicode, a unique number for the concept of this character. Code points do not define the way a character is encoded, but there are multiple encodings defined by Unicode. One of them, the most popular one, is Unicode Transformation Format 8, UTF-8, eight for eight bits. The interesting element of UTF-8 compared to ASCII is that it is a variable-length encoding. We cannot think anymore of one character as one byte every time. It will be in some cases: all ASCII Latin characters are encoded the same as in ASCII, but any character with an accent, special punctuation, or from any language other than English will be encoded differently. The number of bytes used by each character is defined by the leading bits of the first byte: for example, a two-byte character starts with 110, and so on, up to a maximum of six bytes in the original design, starting with 1111110. As I mentioned, UTF-8 is the most popular; the chance that you see another encoding is pretty rare. At the moment, web servers and even desktops are mainly using UTF-8, but we still need to keep in mind that there can be other encodings, because Windows for a long time used ISO 8859-1 by default.
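To make the variable-length property concrete, here is a minimal Python sketch (Python is used for all examples in these notes, since it follows the Unicode data tables directly):

```python
# UTF-8 length depends on the code point: ASCII stays one byte,
# accented Latin takes two, CJK takes three, and so on.
for ch in ["a", "é", "水"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch} -> {len(encoded)} byte(s): {encoded.hex()}")
    # U+0061 a -> 1 byte(s):  61
    # U+00E9 é -> 2 byte(s):  c3a9
    # U+6C34 水 -> 3 byte(s): e6b0b4
```

Note how the three-byte sequence `e6b0b4` for 水 is different from its code point 6C34: the code point is the index, the bytes are the encoding.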
Even early versions of Windows 10 were using this encoding, which is basically ASCII with a code page that supports accented characters for French, Spanish, and a few other languages. But slowly, Windows is switching to a default of UTF-8; I think it's in the past two years that they started to switch, for both the system and the console. So what can happen if there is no encoding declared and you assume it's UTF-8? Maybe you have a user on Windows writing a plain text file, for example a CSV; CSV doesn't have a declared encoding in the way the format is defined. Your user saves it with the default text editor in ISO 8859-1, but your system, your web application, opens this file thinking it's UTF-8, and some characters will not be recognized, because the file is not encoded the way you think.

So that was the long introduction, the basic concepts of where Unicode came from and what problem it's solving. Now we know that Unicode is encoding all the characters of every language. The thing is that the same visual representation can be encoded in different ways with different Unicode characters. At the top right, you can see a few examples: the capital C with cedilla (Ç) can be written as a single character, but also as a capital C followed by a combining character that adds the cedilla. A bit like the character on my title slide, which has a combining character placed on top of another character. The idea of normalization is that we're going to compare Unicode strings and see if they actually have the same meaning, if they are equivalent. We'll have NFC normalization and NFKC normalization; there are actually two others that are really similar to those two, so I've simplified a bit for this presentation. These normalizations happen in multiple places in your web application or in a library.
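The encoding-mismatch scenario described above can be reproduced in a couple of lines:

```python
# A file saved as ISO 8859-1 but read back assuming UTF-8.
text = "café"
latin1_bytes = text.encode("iso-8859-1")  # é becomes the single byte 0xE9

try:
    latin1_bytes.decode("utf-8")          # strict UTF-8 decoding rejects 0xE9
except UnicodeDecodeError as err:
    print("decode error:", err)

# Tolerant decoding keeps going, but the character is lost.
print(latin1_bytes.decode("utf-8", errors="replace"))  # caf�
```

This is exactly the CSV situation from the talk: the bytes are fine, only the assumption about the encoding is wrong.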
Sometimes a library will transform your input and try to normalize it, maybe because it's building a URL and trying to make sure the hostname will be valid or DNS-friendly. Sometimes user paths are normalized when files are read, to make sure special characters are properly handled when calling the filesystem API. Sometimes transformations are just applied to usernames or that type of data. It's also sometimes used to generate slugs, because NFKC normalization specifically will convert many characters to ASCII characters, so people use it to transform special characters into an ASCII-only string. In practice, that's not what it's intended for; Unicode doesn't clearly document in which use cases it's safe to use these normalizations, but basically they're used in multiple cases to compare strings or to simplify strings into a comparable format.

I already mentioned there is NFC and NFKC. The C means canonical, and because "compatibility" couldn't also be just NFC, they added a K for compatibility. It's a bit confusing, but what you need to remember is that NFC is a strict comparison of equivalence: it will match, for example, the C with cedilla as a single character against the decomposed version of it. The compatibility mode is a much more flexible comparison: for example, if we normalize the script H (ℋ), the result will be a capital H, so it will be equivalent if we compare it to the ASCII Latin H. Same for fractions, they will be decomposed. Exponents will also be decomposed. Many measuring units will be converted to ASCII characters. There's also a ton of characters that are just a circle around a character and will be decomposed to the plain ASCII version.

So where is the security risk? So far I've done a big introduction to Unicode, and we've started to see that many APIs do normalization to help the developer or the user produce valid input. That can have side effects.
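A short sketch of the two forms just described, using Python's standard `unicodedata` module:

```python
import unicodedata

# NFC: canonical equivalence. Composed and decomposed forms of the
# same letter normalize to the same string.
composed = "\u00C7"      # Ç as a single code point
decomposed = "C\u0327"   # C followed by a combining cedilla
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# NFKC: compatibility equivalence. Look-alike or semantically related
# characters are folded toward plain ASCII.
for ch in ["\u210B", "\u00B2", "\u212A"]:  # script H, superscript 2, Kelvin sign
    print(f"U+{ord(ch):04X} -> {unicodedata.normalize('NFKC', ch)}")
```

The Kelvin sign (U+212A) in the last loop is the character the talk keeps coming back to: it folds to a plain ASCII K.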
And the first one: the HostSplit attack, which was presented at Black Hat late last summer. Basically, some characters will be converted to equivalent ASCII characters and can create issues in URLs or file paths. We have this very visual example where we do a redirection, maybe for OAuth2, and we validate that the domain we're redirecting to is only a subdomain of microsoft.com. The thing is, the attacker uses a Unicode character here: this a/c (℀) is a single Unicode character, but in the Location header, browsers used to normalize this character to "a slash c", all in ASCII. Browsers were probably doing this to help developers have valid URLs, so they were doing a normalization pass. So this was not an issue on the server side, but implicitly, the browser was converting the URL. In the end, if we're redirecting an OAuth2 flow with a secret token, suddenly we can exfiltrate potentially interesting values to a domain we control, because we have just broken the URL apart and we control the resulting host. There are a few other examples with the at symbol, question mark, and slash in the slide notes. I hope I didn't miss too much.

Right after showing a couple of examples of normalization, I just wanted to show a quick tool I've built. It's an interactive list of characters that can have interesting security implications. If you think some application is doing normalization on some string or URL, you can look at a character, for example this superscript a; zoom into it, and it basically explains: this character's code point is converted to "a", U+0061. It also shows the way it is handled in different programming languages. There are a few options: you can hide the code sections per language, and some character transformations will not be effective in some languages.
And you can filter if you only want to look at NFC, the canonical one, which is the most strict normalization; but as you can see, NFKC has a bunch of characters to choose from if you need to encode some character to do some bypass. The search feature can be used when you know what character you want to use. You search, for example, for K: which Unicode characters can potentially be converted by some library or some function to this ASCII character? In some cases, it will not map only to that ASCII character; sometimes there will be an apostrophe after the letter. But the idea is that all those Unicode characters can be translated to ASCII if the specific transformation is applied. So that was the quick example. I'll paste the link in the Twitch chat right after this presentation, so you can play with it, search for characters, and also use it to test applications in the context of a pen test or even bug bounty.

A general recommendation: if you are doing security checks, make sure normalization happens before any security check. Because if you do a validation, for example check against a blacklist that the keyword X is absent, and after the security check a normalization happens that can generate the keyword X, you have a bypass. Review the libraries you're using, especially if you have a critical application doing validation on hostnames or that kind of thing. Maybe the HTTP library or the network library you're using does some normalization you don't know about; you can test it with the characters I'm providing in the list. And again, the general security rules of thumb: prefer whitelist over blacklist if possible, and do strict validation at the source.

So I'm going to jump to case modification, which is pretty similar to normalization. Case modification is every time we do a toUpperCase or toLowerCase transformation.
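The validate-then-normalize ordering bug can be sketched like this (the handler and blocked keyword are hypothetical):

```python
import unicodedata

BLOCKED_KEYWORD = "script"

def unsafe_handler(value: str) -> str:
    """Checks the blacklist first, then normalizes: the wrong order."""
    if BLOCKED_KEYWORD in value.lower():
        raise ValueError("blocked")
    return unicodedata.normalize("NFKC", value)

# ｓ (fullwidth s, U+FF53) is not caught by the substring check,
# but NFKC folds it back into a plain ASCII "s".
payload = "\uFF53cript"
print(unsafe_handler(payload))  # script
```

Swapping the two steps (normalize first, then check) closes the gap, which is exactly the recommendation above.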
Unicode defines the behavior for every character: what the result of the uppercase and lowercase transformation should be. For this reason, almost all languages have the same behavior. I've noticed that Go and C# have a more limited subset of characters that can become ASCII; there are a few that are not covered in Go and in C#, but aside from that, Ruby, Java, PHP, most of them have the same behavior as defined in the Unicode standard. Also, not all characters are affected: if you apply uppercase to some special characters, sometimes they have no variation, but many characters map to the same character. For example, and we're going to see an example in a moment, the Kelvin sign, the temperature symbol: if you apply the lowercase transformation, it becomes lowercase k, the ASCII Latin character.

If we apply the uppercase transformation to a lowercase a, no surprise, it becomes capital A. But it's interesting to know that there are a few other characters, also in the same application that I just presented, that will become ASCII characters. This German ß becomes SS, two capital S, because that's part of the German language, the way the language works: a capital version of this letter doesn't exist, it's written with two S. Same with this ﬁ character, which is a ligature of f and i: it becomes capital F and capital I, which, if you compare it to the string FI, would be equal. Lowercase has a completely different set of characters with this behavior; characters that affect the uppercase transformation don't necessarily have the same repercussion with the lowercase transformation. But K will work: if we do a lowercase of the U+212A character, the Kelvin sign, it becomes U+006B, which is ASCII k. So if you're comparing hosts, and one of your hosts could be facebook, ikea, or vk, I don't know...
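These mappings can be checked directly in Python, which implements the case mappings defined by the Unicode standard:

```python
# Case transformations defined by Unicode that land on plain ASCII.
print("\u00DF".upper())  # ß  (sharp s)      -> SS
print("\uFB01".upper())  # ﬁ  (fi ligature)  -> FI
print("\u212A".lower())  # K  (Kelvin sign)  -> k
print("\u0131".upper())  # ı  (dotless i)    -> I
```

Note that the first two are one-to-many mappings: a single character expands into two ASCII letters, which is why length assumptions around case transformations can also break.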
You can possibly use U+212A if you are trying to bypass the filter. There's also this Turkish character that becomes a lowercase i plus a combining mark; it should render as a dot right above the i. Basically, the dotted capital İ lowercases to an i followed by a combining dot on top (PowerPoint is not rendering it properly). So it's not exactly ASCII, but sometimes it can be enough if the second part is later truncated or removed.

Again, the potential issues are similar to normalization. If you're doing a critical equality check on strings and you have applied toUpperCase or toLowerCase to your value before doing the check, this can cause some issues. It can also be used to bypass WAFs or filters, similar to normalization. A quick example: say we're checking whether the current role or the current user equals "admin". We might not be able to register an admin user in the application, but we might be able to register "admın" with a dotless ı, and this character, once the uppercase transformation is applied, will be equal to ADMIN in plain ASCII characters.

Another example: this is Java code from a class doing hostname validation in TLS communication. Here we have two variables, name and template. Name is the host we're connecting to, and template is the hostname the library has extracted from a certificate. But because toLowerCase is applied, we could have a malicious certificate with a Kelvin K, for example, in the hostname. The Kelvin sign would be transposed to a lowercase k in this case, and could be used to bypass the host verification. We also have this recent vulnerability in Django, in the password reset functionality, where the email was compared weakly, with lowercase I think.
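The "admın" example above, sketched as a hypothetical role check:

```python
def is_admin(username: str) -> bool:
    # Case-insensitive comparison implemented with a full Unicode uppercase.
    return username.upper() == "ADMIN"

payload = "adm\u0131n"     # "admın" with a dotless i (U+0131)
print(payload == "admin")  # False: a literal "admin" registration filter misses it
print(is_admin(payload))   # True:  ı uppercases to ASCII I
```

So an account name that passes the registration filter still collides with the privileged name once the case transformation runs.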
In the end, the problem here is twofold: first, there's a weakness in the comparison of the email against the database storing the user, and second, the reset email is sent to the original email in the form that was submitted. Because of this, if we're trying to impersonate, maybe, a super admin, we're going to register a super admin account with an ı that is dotless. Here, the feasibility and the actual exploitability might be limited if we don't manage to have an SMTP server that receives this email; I'm not 100% sure it's possible on a real application, but this is a vulnerability that was patched in Django. As for the fix: they changed the comparison, but that alone doesn't completely mitigate the problem. What they also did is extract the email again from the database for this user, to make sure they use an email they trust in the first place, a value from the database.

The mitigations are similar to normalization. In most languages and libraries, when they do really critical checks, they have a custom function making sure lowercase is only applied to ASCII characters. This is something I've seen in the JDK, for example, and in a few other libraries. In C#, there is a safer function, ToLowerInvariant: if you want to do a case-insensitive check, you can use this function.

Okay, so I think I'm told that I'm almost out of time, so I'm going to quickly go through the encoding bypass for WAFs, and I'm going to have to skip the two small sections on data integrity and the punycode visual attacks, which were only a couple of slides. A WAF is a system in between your client and your application or your system. The idea is to encode the payload in a way such that the firewall, which is looking for a specific string, will not see the malicious pattern, and the payload will reach the real application.
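A sketch of the ASCII-only folding idea mentioned above; the helper names are made up, but this mimics in spirit what such library functions do:

```python
def ascii_lower(s: str) -> str:
    """Lowercase only the ASCII letters A-Z; leave every other character intact."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c
        for c in s
    )

def equals_ignore_ascii_case(a: str, b: str) -> bool:
    return ascii_lower(a) == ascii_lower(b)

print(equals_ignore_ascii_case("ADMIN", "admin"))       # True
print(equals_ignore_ascii_case("adm\u0131n", "admin"))  # False: dotless i is untouched
print(ascii_lower("\u212A") == "\u212A")                # True: Kelvin K is untouched
```

Because non-ASCII characters are never folded, the Kelvin-K and dotless-ı collisions from the previous slides simply cannot occur with this comparison.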
UTF-8 is pretty common, but Unicode defines a few other encodings that might be supported on your system. One interesting one is UTF-16. UTF-16 has a system of byte order marks, meaning that even if your default encoding is UTF-8, if your text file, maybe an XML file, starts with a byte order mark for UTF-16, the XML parser automatically switches to UTF-16. This can be useful if you have something in between that is looking for a malicious XML payload: you can encode it in UTF-16, either little-endian or big-endian. A quick example of how such a file would look: this is an XML document encoded in UTF-16. We can see the byte order mark at the beginning, and every character takes two bytes. In practice, if you open this in your editor, or the way your application would see it, it is a regular XML document.

XML also has another option, specific to XML documents: in the XML declaration, the first node, you can define the encoding, and the parser will switch to this encoding. So even if your parser was told to use a specific encoding, it can be switched this way too. The screenshot shows an interesting parser, libxml2 in C. What's interesting is that the encoding switches right after the encoding attribute, instead of at the end of the declaration. That's an interesting behavior, but in the end it's the same principle: maybe we have a malicious path or a malicious method in our XML payload to do an XXE or some type of RCE, and because we're encoding it in UTF-16, we might be able to bypass some filters. I need to quickly mention that while UTF-16 can be a useful resource for bypassing filters, there might be easier ways: XML entities, double encoding, or non-printable characters in the case of bypassing XSS filters, for example, because browsers often ignore non-printable characters.
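A sketch of the BOM trick: the same (benign) XML document, once in UTF-8 and once in UTF-16 with a byte order mark. A byte-level filter matching the UTF-8 spelling of a tag misses the UTF-16 version, while an XML parser handles both:

```python
import xml.etree.ElementTree as ET

payload = '<?xml version="1.0"?><doc>hello</doc>'

utf8_bytes = payload.encode("utf-8")
utf16_bytes = b"\xff\xfe" + payload.encode("utf-16-le")  # BOM + little-endian body

print(b"<doc>" in utf8_bytes)   # True:  a naive byte filter sees the tag
print(b"<doc>" in utf16_bytes)  # False: every character is now two bytes

# The parser detects the BOM and decodes both documents the same way.
print(ET.fromstring(utf8_bytes).text)   # hello
print(ET.fromstring(utf16_bytes).text)  # hello
```

In a real bypass, the payload would carry the malicious entity or path instead of "hello"; the point is only that the filter and the parser disagree about what the bytes mean.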
But yeah, it's maybe a bit out of scope, but I need to mention it because it's probably an easier avenue in most cases. The last element I want to show for the implications of Unicode in bypassing filters: there is a common bypass for the XSS filter in .NET. It's less of an actual concern now, because .NET Core has dropped its XSS filter, but most of these applications will use SQL Server, and SQL Server, or maybe the client library (I'm not sure which component is doing the normalization), depending on the collation, the encoding configuration of your database, may convert characters to ASCII if you have a column that is NVARCHAR. In this case, we're inserting an image tag, but with the character U+FF1C, the fullwidth less-than sign, and the way it gets stored is as the ASCII character. So in practice, we have our stored XSS, rendered as on the right.

So I have to skip the two last sections, which are pretty quick. It was just a quick example of the homograph attacks, for example an o with a stroke or with a small dot under it, and a few recommendations on data integrity: if you are doing conversions, in some languages characters will be truncated or replaced with an unknown one, so make sure, if you're migrating or doing backups, that characters are properly saved. I'll be sharing the link to the slides, so if you want to see the last couple of slides or the resources I've mentioned along the way, they will be published at gosecure.github.io/presentations. I'll paste it right after on the Twitch stream, or you can visit it right now.

Welcome back. So, we have one question here (sorry, getting a little bit tired): does this also work in certificates? Would it be possible to spoof a valid certificate by using Unicode characters, like the Kelvin K, for example? Yes, okay.
So basically, the code that I showed was an example of a Unicode vulnerability, a vulnerability specific to one language, I would say. It is currently patched, but the vulnerability is not officially published; that's why I didn't put a CVE or anything specific. The reason it's not that critical to show in this presentation is that, in order to do full TLS interception, you will need more than just bypassing hostname verification. For example, if you want to man-in-the-middle an Android application, or basically any type of client, you need a certificate signed by a root authority. But yes, in the case of the code I showed, the scenario was that you could craft a certificate with a Unicode character in the common name. It's not possible in the alternative name section of the certificate, but because the subject supports Unicode strings, you can have this type of domain. The big limitation, again, for a full man-in-the-middle of any Java application over TLS, is that you would need a certificate authority to sign your certificate. And as far as I know, all the certificate authorities have pretty strict rules for validating your certificate, so it would not be possible this way. But if you have an internal certificate authority, this would be a far-fetched but possible attack.

Makes sense.

So yes, it would be possible to have, for example, facebook with a Kelvin sign as the K, and you would put this in a malicious certificate that is returned to the client. And because the client compares after a toLowerCase operation, the names would match. But again, this certificate would need to be signed by some authority.
So it could be a scenario where you have somebody, maybe a nation-state or a highly motivated attacker, that controls a root authority and wants to hide from the certificate transparency logs, because organizations are actively monitoring certificates generated for their domains. It could be a way to bypass that, but it's not very probable either.

So: possible, but not probable.

Yeah. And there will be an article with more detail as soon as the vulnerability is published. On the GoSecure blog, yes.

We have another cute question. Somebody asks: what's your favorite Unicode code point?

I would say the most useful one is K, so Kelvin, because it works both with the canonical normalization form, NFC, and with toLowerCase. When comparisons are case-insensitive, developers will in general choose toLowerCase, which has a bit fewer characters available. If you're doing code review, you just find the operation that is done and you look up which characters are possible. But if you are testing blindly, I would say K is the one to start with.

Okay. Let's do one last one. To your knowledge, are such transformations included in commercial scanners such as Burp?

I don't think so. Even in code review, toLowerCase and toUpperCase are pretty common, but they really need to be in a security-critical component, so it's more about looking for logic flaws. If you're bypassing maybe an SSRF filter, to be able to reach a specific domain or host, it's theoretically possible, but it would be limited.

Okay. Well, we're out of time, a little bit late. Thank you very much for your presentation, it was great. Let's give a big round of applause in the Twitch chat for Philippe Arteau. And we'll be back really shortly with the last presentation of the day. Stay with us, please. And peace.