a little bit. Anyway, first a little bit about me. This is me with my colleague Shwetang, just the other night. We both work at Opera Software in developer relations. In case it wasn't clear, I'm the one on the left. Anyway, I'm here to talk about Unicode, right? Before we get started, let's just make sure we're all on the same page when it comes to Unicode. I'm going to throw some terminology at you, just a couple of keywords that you need to remember because I'll be using them throughout the rest of this presentation. Don't worry, it's nothing big. What is Unicode exactly? It's easiest to think of Unicode as a kind of database that maps any symbol you can think of to a unique name for that symbol, the canonical name, and also to a unique number for that symbol. This number is called the code point. Now, this seems fairly simple, but it's actually really useful and very powerful, because it allows you to refer to any given symbol without having to use the symbol itself. You can simply refer to the code point, the number, and then everyone can look that up in the Unicode standard and they will know exactly which symbol you're talking about. So, for example, there's the Latin capital letter A, which has the code point U+0041. Now, code points are usually formatted like this: they get the U+ prefix, and then they're followed by a number of hexadecimal digits. Usually, they're zero-padded up to at least four hexadecimal digits as well, but it is possible that there are more hexadecimal digits. So, another example is the Latin small letter a. This is a completely different symbol, even though you might think that they're related because one is the uppercase version of the other. But because they're separate symbols, they each get their own unique canonical name and their own unique code point.
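The U+ notation described above can be sketched in a few lines of JavaScript. The helper name `formatCodePoint` is made up for this illustration and is not a built-in; it only assumes the ES6 `codePointAt` and `padStart` string methods.

```javascript
// Format a symbol's code point in the usual U+XXXX notation.
// `formatCodePoint` is a hypothetical helper, not part of any standard API.
function formatCodePoint(symbol) {
  const codePoint = symbol.codePointAt(0);
  // Zero-pad to at least four hexadecimal digits; longer values are kept as-is.
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

console.log(formatCodePoint('A')); // U+0041 (LATIN CAPITAL LETTER A)
console.log(formatCodePoint('a')); // U+0061 (LATIN SMALL LETTER A)
```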
So, yeah, it's just like that for any other symbol. There's the copyright sign in Unicode, there are the Kannada letters, and they each get their own code point, really. Then there are some code points for which you need more than four hexadecimal digits to represent them, like the tetragram for centre, for example. Its code point is U+1D306. And then there's also the pile of poo emoji, and many other emoji, actually. You need more than four hexadecimal digits to represent those. Now, Unicode contains a lot of code points. The range goes all the way from 0 to 10FFFF, so that's a little bit over 1.1 million code points. That's a lot of them. So, to keep things a little bit organized, Unicode divides these code points up into 17 so-called planes. I'm not going to bore you with all the details of all these different planes, but the first one of these planes is pretty important. It's called the BMP, or the Basic Multilingual Plane, and it ranges from 0 to FFFF. So, why is this plane so important? Well, it contains all the most commonly used symbols. So, for example, if you're writing a text document in English, you probably won't need any symbols outside of this BMP range. Even if you're writing a text document using the Kannada script, you probably won't need any code points outside of the BMP. That's why it's so important. It contains all the most commonly used symbols. But there are a lot of other symbols as well, and all those other planes combined are called the astral planes. Together, these planes contain about one million other code points, so the vast majority of all Unicode code points lies within those astral planes. It's just that the most commonly used ones are all part of the Basic Multilingual Plane, and that's why a lot of people tend to focus on the Basic Multilingual Plane only. And that's where a lot of mistakes happen as well.
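The numbers in this part of the talk are easy to check. The following is a minimal sketch; `planeOf` and `isAstral` are hypothetical helper names introduced here for illustration only.

```javascript
// The Unicode code point range is 0 to 0x10FFFF, split into planes of 0x10000.
const MAX_CODE_POINT = 0x10FFFF;
const PLANE_SIZE = 0x10000; // 65,536 code points per plane

// Hypothetical helper: which plane does a code point belong to?
function planeOf(codePoint) {
  return Math.floor(codePoint / PLANE_SIZE);
}

// Hypothetical helper: astral means "outside the BMP (plane 0)".
function isAstral(codePoint) {
  return codePoint > 0xFFFF && codePoint <= MAX_CODE_POINT;
}

console.log(MAX_CODE_POINT + 1);                // 1114112 code points in total
console.log((MAX_CODE_POINT + 1) / PLANE_SIZE); // 17 planes
console.log(planeOf(0x0041));   // 0 (LATIN CAPITAL LETTER A is in the BMP)
console.log(planeOf(0x1D306));  // 1 (TETRAGRAM FOR CENTRE is astral)
console.log(isAstral(0x1F4A9)); // true (PILE OF POO)
```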
So, I have this other presentation titled "JavaScript has a Unicode problem", where I explain some of these issues involving those astral symbols in JavaScript strings. Now, I cannot give you all those examples today. There's no time for that, because I want to talk about regular expressions instead. But just to give you one example: a string containing only an astral symbol, for example the pile of poo, has a length of two instead of one. This happens because JavaScript exposes astral symbols as if they were two separate code points. It exposes a surrogate pair. So, each half of the surrogate pair is exposed as if it was an individual character, which is very confusing, of course, because in the end there's only really one real symbol, one actual code point, in that string. So, in the second example there at the bottom, you see what JavaScript thinks this string is equal to. It just exposes each of these separate surrogate halves as if they were individual characters, and that's why the string has a length of two. Now, this is just one of the many examples of unexpected behavior in JavaScript strings. It's really tricky to deal with Unicode in all these cases, and if you're interested in this, check out that presentation that I talked about earlier. I also have a blog post that details all these different problems, how you can work around them, and how ECMAScript 6 helps us to fix and avoid these issues. But like I said, today we're going to talk about regular expressions specifically. So, there's this new flag for regular expressions that I mentioned earlier, the u flag, which stands for Unicode. And there are a couple of things that happen if you apply this flag to a regular expression. For starters, there's some new syntax that gets enabled. It also has a certain impact on the dot operator in regular expressions. Quantifiers are also impacted, and so are character classes and character class escapes.
The i flag for case insensitivity also has some new behaviors when it's combined with the new u flag. And then finally, we'll take a look at browser support for this new feature and we'll talk about how we can start using this feature already today. So, first let's focus on the new syntax features that we unlock simply by using this new flag. Adding the u flag to a regular expression enables a new syntax called Unicode code point escapes. These are also available in strings in ECMAScript 6, but if you want to use them in a regular expression, you need to add the u flag specifically to your regular expression. So, these things start with a backslash, followed by u, followed by braces. And between these braces, you can use any number of hexadecimal digits you want. Of course, you only need six to represent every possible Unicode code point. So, this gives you a really easy way to refer to a symbol, again, without having to use the symbol itself. You can just refer to its code point, and that keeps your regular expressions a bit more readable in many cases, especially if you want to match characters that are non-printable or invisible or just look a little bit weird. So, adding the u flag enables this feature. It might look like you're already able to use these Unicode code point escapes even without the u flag, but that's not the case. For example, this piece of code at the bottom here does not throw an error. So, you might think, hey, it already works, I don't need to do anything. But the code doesn't really do what you would expect it to do. This is, in fact, not a Unicode code point escape. In this case, the backslash followed by the u is just an unnecessary escape sequence for the u character. And then the braces with the number wrapped in them are just a quantifier. So, essentially, instead of matching the code point U+1234, you're basically saying: match the letter u repeated 1,234 times.
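The difference described above can be checked directly. This is a small sketch of what the slide presumably showed; the exact slide code is not in the transcript, so the patterns here are reconstructed from the description.

```javascript
// With the u flag, \u{1234} is a code point escape for U+1234.
console.log(/\u{1234}/u.test('\u1234'));         // true

// Without the u flag, the same pattern means "the letter u, 1234 times".
console.log(/\u{1234}/.test('u'.repeat(1234)));  // true
console.log(/\u{1234}/.test('\u1234'));          // false

// Code point escapes also cover astral symbols such as U+1F4A9 PILE OF POO.
console.log(/^\u{1F4A9}$/u.test('\u{1F4A9}'));   // true
```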
So, this can be a little bit confusing. And for this reason, you probably shouldn't blindly go back to your old code and add the u flag to all your old regular expressions. You should watch out and make sure that you're not using any unnecessary escapes like this one, because adding the flag might accidentally change the meaning of your existing regular expressions and break your code. So, to avoid this kind of ambiguity in the future, the u flag also prevents you from using these unneeded escapes. For example, if you want to use the a character in a regular expression, you should just use it. It has no special meaning, so there's no reason for you to escape it using a backslash. If you enable the u flag on this regular expression, it throws an error, because backslash a is not a reserved escape sequence. There's no reason to use that backslash there. So, really, you can think of the u flag as kind of like a strict mode, but for regular expressions specifically. I would recommend using it whenever you can, especially for every regular expression that you write from now on. But at the same time, you shouldn't blindly go back and add the u flag to your old code, to your existing regular expressions, because you might end up changing their meaning accidentally. You should review each and every regular expression individually before you add the u flag to it. Now, let's talk about the impact of the u flag on the dot operator. In ECMAScript 5, without the u flag, the dot operator matches any BMP symbol, except for line terminators. So, for example, if you use the regular expression a.b, and you're trying to match the string a, tetragram for centre, b, then because the tetragram character is an astral symbol, it doesn't actually work the way you expect it to. It doesn't match the string. We can fix this by simply enabling the u flag for our regular expression. That magically makes the dot operator work the way you expect it to work.
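Both points, the stricter escape rules and the dot operator, can be tried out as follows. This is a sketch based on the description above, not the original slide code.

```javascript
const tetragram = '\u{1D306}'; // TETRAGRAM FOR CENTRE, an astral symbol
const input = 'a' + tetragram + 'b';

// Without the u flag, the dot only consumes one surrogate half.
console.log(/a.b/.test(input));  // false
// With the u flag, the dot consumes the whole astral symbol.
console.log(/a.b/u.test(input)); // true

// Without the u flag, an unneeded escape like \a silently matches "a".
console.log(/\a/.test('a'));     // true

// With the u flag, the same pattern is rejected at construction time.
let error = null;
try {
  new RegExp('\\a', 'u');
} catch (e) {
  error = e;
}
console.log(error instanceof SyntaxError); // true
```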
With the u flag, the dot just matches any symbol in Unicode, except for line terminators. Now, another thing we should talk about is the impact the u flag has on quantifiers. Quantifiers are things like the asterisk and the plus and the question mark and those braces with numbers between them. Basically, they indicate how many times you want the previous part of the regular expression to repeat. Those things are called quantifiers. So, in this example, we're trying to match two consecutive a symbols. And if we test that on the string aa, of course we get a match. The result is true. This makes sense. However, if you try to do the same thing with an astral symbol, suddenly it doesn't work anymore. The reason for that is that without the u flag, if a quantifier follows an astral symbol, the quantifier only applies to the second of the two surrogates that make up this astral symbol. Just to make this a bit more clear: as far as JavaScript is concerned, this is what that regular expression looks like. We talked about this before. This is the surrogate pair that represents this astral symbol. And now the quantifier is only applied to one of those two surrogates, not to the surrogate pair as a whole. So, again, this regular expression does not work the way you would probably expect it to work. It's a bit confusing. So, how can we fix this behavior? Well, once again, it's really easy. All you need to do is add the u flag to your regular expression. And that magically makes it work the way you expected it to work in the first place. Quantifiers now apply to whole symbols, even for astral symbols. Now, there's also an impact on character classes. Character classes are these brackets, and then you can have a set of characters in between them. So, for example, if you have b, c, d, like in this example, it will match b or c or d, but nothing else. And this works as expected for any BMP symbol.
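The quantifier behavior described above can be reproduced like this. The pattern is built from a string so that the same source can be compiled with and without the u flag; that construction is a choice made for this sketch, not something taken from the slides.

```javascript
const poo = '\u{1F4A9}';     // PILE OF POO, stored as the surrogate pair \uD83D\uDCA9
const pattern = poo + '{2}'; // "the symbol, twice"

console.log(/a{2}/.test('aa'));                        // true

// Without the u flag, {2} only applies to the trailing surrogate \uDCA9.
console.log(new RegExp(pattern).test(poo + poo));      // false
// With the u flag, {2} applies to the whole symbol.
console.log(new RegExp(pattern, 'u').test(poo + poo)); // true

// What the non-u pattern really matches: one lead surrogate, two trail surrogates.
console.log(/\uD83D\uDCA9{2}/.test('\uD83D\uDCA9\uDCA9')); // true
```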
But as soon as you try to add an astral symbol to a character class, the JavaScript engine will still treat that as two separate characters, one for each half of the surrogate pair. Because of that, even though the tetragram symbol is part of the regular expression, it doesn't match that string. And, once again, that's because as far as JavaScript is concerned, internally, this is what that regular expression looks like. There aren't three characters in that character class, there are four, because each half of the surrogate pair counts as a separate character. That's why the string doesn't match. So, how do we solve this problem? Well, once again, it's really easy. All you need to do is add the u flag to the regular expression. And that magically makes it work the way you want it to. This also means that whole astral symbols can be used in character class ranges. Within a character class, you can use a dash, a hyphen, to create a range between two characters. So, for example, this regular expression would match any symbol from the pile of poo all the way to the dizzy symbol. So, it would match the flexed biceps emoji as well. And this works because we're using the u flag. But if we try to do the same thing without the u flag, then the regular expression won't even execute. You will get a syntax error if you just try to run that first line of code. It says the regular expression is invalid: the range is out of order in the character class. So, what is happening there? Well, again, as far as JavaScript is concerned, this is what that regular expression looks like. You have two astral symbols, so each of them consists of a surrogate pair. And now there's a range, but JavaScript is trying to create a range between the second half of the first surrogate pair and the first half of the second surrogate pair. This is not the range you intended to create in the first place.
And because the first value in that range, the first code point, is greater than the second one, it also throws an error. Now, another thing that is affected is negated character classes. Within a character class, you can use the caret symbol at the start. This means: I want to match anything that is not part of this character class. So, in this case, you want to match any character that is not the a symbol. But without the u flag, in ECMAScript 5, this only matches BMP symbols. So, this will match BMP symbols except for a, but it won't match astral symbols like the pile of poo or the tetragram symbol, because JavaScript, again, considers those to be two individual characters. Each half of the surrogate pair counts as an individual symbol. However, if we just add the u flag, the same regular expression matches the much bigger set of all Unicode symbols except for a, instead of just matching all the BMP symbols except for a. Now, there's also an impact on character class escapes. These are things like backslash d, backslash s, and backslash w. I'm sure you've seen them before. For example, backslash lowercase d in a regular expression matches digits. And in JavaScript's definition of digits, it only matches the characters from 0 to 9. In the Unicode standard, there are many more digits, in different scripts and so on. But historically, JavaScript has only matched 0 to 9. So, backslash lowercase d matches 0 to 9, and backslash uppercase D matches any non-digit character. The same thing goes for backslash s. Backslash lowercase s matches any whitespace symbol, so backslash uppercase S matches the inverse set, any non-whitespace symbol. And similarly, backslash lowercase w matches word characters, which are the letters from a to z in uppercase and lowercase, the digits from 0 to 9, and the underscore character. So, if you use backslash uppercase W, it will match every character except those characters.
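The character class behavior described so far (astral symbols inside a class, astral ranges, and negated classes) can be checked with a short sketch. The patterns are rebuilt from the spoken description, and `throwsSyntaxError` is a hypothetical helper defined only for this example.

```javascript
const tetragram = '\u{1D306}'; // TETRAGRAM FOR CENTRE
const poo = '\u{1F4A9}';       // PILE OF POO
const biceps = '\u{1F4AA}';    // FLEXED BICEPS
const dizzy = '\u{1F4AB}';     // DIZZY SYMBOL

// An astral symbol inside a character class.
const classSource = '^[bc' + tetragram + ']$';
console.log(new RegExp(classSource).test(tetragram));      // false
console.log(new RegExp(classSource, 'u').test(tetragram)); // true

// A range between two astral symbols.
const rangeSource = '[' + poo + '-' + dizzy + ']';
console.log(new RegExp(rangeSource, 'u').test(biceps));    // true

// Hypothetical helper: does calling fn throw a SyntaxError?
function throwsSyntaxError(fn) {
  try {
    fn();
    return false;
  } catch (e) {
    return e instanceof SyntaxError;
  }
}
// Without the u flag the range becomes \uDCA9-\uD83D, which is out of order.
console.log(throwsSyntaxError(() => new RegExp(rangeSource))); // true

// Negated character classes.
console.log(/^[^a]$/.test(poo));  // false
console.log(/^[^a]$/u.test(poo)); // true
```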
But without the u flag, these character class escapes are still confined to just the BMP range. They won't respect astral code points at all. So, for this reason, if you try to use backslash capital S, which is supposed to match non-whitespace symbols, it will match BMP non-whitespace symbols, like the letter a, for example. But it won't match the tetragram symbol, simply because that is an astral symbol, which counts as two individual symbols as far as JavaScript is concerned. So, again, we can fix this by just using the u flag, and then backslash capital D, capital S, or capital W will match astral symbols as well. Now, it's important to note that the behavior of their inverse counterparts, the lowercase backslash d, s, and w, does not change when the u flag is used. There was a proposal once upon a time to make backslash d match even more digits, like all the digits defined in the Unicode standard, when the u flag is enabled. But this proposal was rejected in favor of another proposal that is still being fleshed out at the moment. Now, there's another really important gotcha here. There's this thing called case folding, which is basically the act of converting uppercase characters to lowercase characters, as defined by the Unicode standard. The Unicode standard has lots of these big data files that provide these mappings from uppercase to lowercase symbols, not just for the letters from a to z, but for different scripts as well. So, when both the i flag for case insensitivity and the u flag for Unicode are set on the same regular expression, all the symbols are implicitly case folded immediately before they are compared. This is fairly useful in general, but it might also lead to some surprising results every now and then. So, this is yet another reason to not blindly add the u flag to your existing regular expressions. You should be really careful that you're not relying on anything like this.
So, for example, here we're matching the characters from a to z in lowercase, but we're using the i flag to ignore the case, to make it case insensitive, so we will match the uppercase A to Z as well. However, if we also add the u flag to this regular expression, the case folding logic kicks in, which means that now suddenly, in ES6 with the u flag, this regular expression also matches the code point U+017F, which wasn't matched before. The same thing goes for the code point U+212A, the Kelvin sign, which somehow canonicalizes to the same symbol as the letter k. Now, the case folding that I mentioned doesn't only apply to the symbols in the regular expression itself, it also applies to the symbols in the string to be matched. This can be fairly confusing as well. For example, if you try to match the code point U+212A using both the ignore-case flag and the Unicode flag on the same regular expression, it will match the capital letter K, which is a completely different symbol, simply because it canonicalizes to the same symbol according to Unicode. These things are a little bit unexpected, so it's just something you should keep in mind when you're switching your existing regular expressions over to use the u flag. This opt-in behavior is generally useful, but in some cases it can be really confusing. For example, by default, backslash capital W matches any non-word character, so it matches anything that is not matched by backslash lowercase w. But when both the i flag and the u flag are enabled, the new Unicode case folding logic kicks in, and this has some consequences. For example, suddenly the capital letter K is now considered to be a non-word character, which doesn't make any sense at all. But this happens because the case folding is applied, and the capital letter K case folds to the same symbol as the special code point U+212A, which is a non-word character.
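Two of the behaviors above lend themselves to a quick check: the negated character class escapes with the u flag, and the case folding that happens when i and u are combined. This is a sketch reconstructed from the description. The backslash capital W case is deliberately left out, because its result depends on which edition of the specification an engine implements.

```javascript
const tetragram = '\u{1D306}'; // TETRAGRAM FOR CENTRE

// \S only sees one surrogate half at a time without the u flag.
console.log(/^\S$/.test(tetragram));  // false
console.log(/^\S$/u.test(tetragram)); // true

// \d is unchanged by the u flag: U+0CE6 KANNADA DIGIT ZERO is still not matched.
console.log(/^\d$/u.test('\u0CE6'));  // false

// Case folding with i + u: U+017F (long s) and U+212A (Kelvin sign).
console.log(/[a-z]/i.test('\u017F'));  // false
console.log(/[a-z]/iu.test('\u017F')); // true
console.log(/[a-z]/i.test('\u212A'));  // false
console.log(/[a-z]/iu.test('\u212A')); // true

// Folding applies to the pattern and to the input string.
console.log(/\u212A/i.test('K'));  // false
console.log(/\u212A/iu.test('K')); // true
```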
So, really, this is some strange stuff, because when the i and the u flags are both enabled on the same regular expression, backslash capital W is no longer the inverse of backslash lowercase w. This is just another thing to watch out for. Don't blindly add the u flag to your existing regular expressions. Make sure you're not using backslash capital W first. Okay, so let's take a quick look at browser support for this new feature, the u flag. Microsoft Edge has an implementation of this feature in their Chakra engine, and there's also an experimental implementation in V8, which is hidden behind the harmony unicode regex flag for now, so you can try that out if you want. Other browsers are working on this feature, but there's no ETA yet on when it will land exactly. So for now, if you want to use this feature, you kind of have to use a transpiler. And I have some good news for you, because I created one of those. I created a transpiler named regexpu, which transpiles ECMAScript 6 Unicode-aware regular expressions into equivalent regular expressions using ES5 only. This enables you to start using the u flag today in a backwards-compatible manner, because you can write your code pretending you live in the future and you're using the u flag in ES6, and the transpiler will compile it into equivalent ECMAScript 5 code that runs in all the browsers out there today. You may have noticed that the popular Babel and Google Traceur transpilers support the u flag for regular expressions as well. If you use one of those transpilers, they will automatically translate such a regular expression into an equivalent regular expression without the u flag. And that is because they're using regexpu internally. They use my transpiler as a dependency, and that's the reason this works. So the good news is, if you're already using one of these transpilers in your projects as part of your build process, you don't need to do anything.
You can just start using this new feature today and enjoy the benefits. So just briefly, I would like to explain how regexpu works internally. First we start off with some JavaScript code, right? The user writes code in ECMAScript 6. They might be using some other new features like arrow functions, generators, some new syntax, but they might also be using regular expressions with the u flag enabled. So the first thing we do is take that JavaScript code and parse it into an abstract syntax tree. regexpu uses the Esprima parser for this, but you can use any parser you like as long as it has equivalent output. For example, if you prefer using the Acorn parser instead, that's fine. It will still work. The second step is to go through that abstract syntax tree. This is basically a JavaScript object that represents the JavaScript source code, and it makes it much easier to just walk through the tree rather than having to parse the JavaScript myself. So I just walk this tree looking for regular expressions that have the u flag enabled. Once I've found one of these, I parse that regular expression into a regular expression abstract syntax tree. I need a different, specialized regular expression parser for this, and I decided to use the regjsparser project for that. This used to be an ECMAScript 5 regular expression parser, but of course I wanted it to support ECMAScript 6, so I started contributing to the project. I added support for Unicode code point escapes, I added support for the u flag, and then it was good enough to use in regexpu. So once we have our abstract syntax tree that represents each and every part of the regular expression, it's really easy to look at each part individually and translate it into another form that doesn't need the u flag. That is the next step that regexpu does.
That is basically the core functionality, and I'm using two separate tools that I wrote for this. The first one is called Regenerate, and basically Regenerate is a library that allows you to easily create a regular expression that matches any number of symbols. You can just give it an array of symbols or code points, and it will turn that into a regular expression that matches only those symbols. It's fairly powerful, and I'll show you another example of this later. I am also using the node-unicode-data packages. The Unicode standard has a lot of data files. For example, they contain information on different categories or properties of characters, and information about different scripts. If you want to get a list of all the characters in the Unicode standard in the Kannada script, you can do that. The Unicode database has separate data files for them, so you can parse those files and then use that data in your scripts. So I just wrote a small script that parses those data files and turns them into valid JavaScript arrays that you can directly use in your code. I'll show you an example of this in a second as well. Once we've translated the regular expression with the u flag into an equivalent regular expression without the u flag, it's time to update our regular expression abstract syntax tree. Once we've done that, we want to turn that regular expression abstract syntax tree back into a regular expression literal, back into a string basically, so that we can inject it back into the abstract syntax tree for the JavaScript code. For this, I'm using the Recast project. It makes it really easy to loop over abstract syntax trees and update them as you go along. And then finally, we have our updated JavaScript abstract syntax tree with the new regular expressions without the u flag. So all that's left is transforming that back into JavaScript code. And this is also something that Recast does.
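To make the translation step a bit more concrete, here is a hand-written sketch of what an ES5-only equivalent of `/a.b/u` could look like. This is a simplified illustration and not regexpu's actual output: it treats "any symbol" as either a full surrogate pair or a single code unit that is not a line terminator, and it ignores edge cases around lone surrogates.

```javascript
// ES6 original: the dot matches any Unicode symbol except line terminators.
const es6Original = /a.b/u;

// Simplified, hand-written ES5-style equivalent (an assumption of this sketch,
// not the pattern regexpu generates):
// either a surrogate pair, or one code unit that is not a line terminator.
const es5Equivalent = /a(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\n\r\u2028\u2029])b/;

const samples = [
  'a\u{1D306}b', // astral symbol between a and b
  'axb',         // BMP symbol between a and b
  'a\nb',        // line terminator between a and b
  'ab',          // nothing between a and b
];

for (const sample of samples) {
  // Both patterns should agree on every well-formed sample string.
  console.log(JSON.stringify(sample), es6Original.test(sample), es5Equivalent.test(sample));
}
```

The alternation lists the surrogate pair first so that a well-formed astral symbol is consumed as one unit before the single-code-unit fallback is tried.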
So that's basically how regexpu works. Like I said, it's being used in both Traceur and Babel. So check it out and try to break it, and let me know if you do manage to break it. Now, there are some things that even ECMAScript 6 with the u flag cannot do natively. For example, it's a very common use case to want to match a certain set of characters. Let's say you want to create a regular expression that matches all the Kannada symbols. There is no built-in way in the JavaScript language yet to directly use these Unicode properties in your code. Perl's regular expressions, for example, have a feature like this. You can just use backslash p and then directly use these scripts and properties in your regular expressions. JavaScript doesn't have this yet, and I'm actually working on a proposal to add it. But for now, we're stuck with using a solution like this. You can use Regenerate, the library I talked about earlier, and you can combine that with the Unicode data packages that I mentioned as well. So in this case, on the second line, I'm just getting an array of all the symbols in the Kannada script according to Unicode version 6.3.0. And then I'm simply feeding that array into Regenerate, which turns that list of symbols into a regular expression that only matches those symbols. So effectively, this is a regular expression that only matches the Kannada symbols. Now imagine having to write such a regular expression by hand. It would probably be a very tedious and painful task. And also imagine what would happen if you have to update this regular expression. There is a new version of Unicode being released every year, and sometimes new characters are added to an existing script. In fact, the latest available Unicode version is version 8. So let's imagine we have to update our regular expression to version 8.
If we had to do this manually, without having this small build script of four lines, it would be really painful to do that. However, now that we have this small script, all we need to do is update the version number to version 8, run the script again, and just like that, we have an up-to-date regular expression compatible with the latest version of the Unicode standard. So in summary, I would say: use the u flag for every regular expression you write from now on. Use it whenever you can. But at the same time, don't blindly go back to your old existing regular expressions and just add the u flag everywhere, because you might accidentally change their meaning and break your code. So be sure to carefully review each and every regular expression before you do this. Also, you should really use a transpiler, because otherwise your code will not work in most browsers today. And of course, you want code that runs everywhere. The good news here is that Babel and Traceur already support transpilation of this feature thanks to regexpu, so you can start using it already. And then finally, for anything the u flag cannot do yet, just use a tool like Regenerate: write a small, simple build script and run it to generate these very complex but Unicode-aware regular expressions. Whenever you need a regular expression based on some Unicode category or property or script, you should just write a simple script that uses Regenerate to build it. So that's it. Everything I talked about can also be found at that link. It's a blog post. Thanks for your attention. Looks like I'm a bit ahead of schedule, so if there are any questions, I'll be happy to answer those now.
Hey, Mathias. Yes, here. Yeah. Hi. What's your favorite Unicode character, and what's the Unicode character you hate the most, as in, it caused you a lot of trouble and made you rewrite your code?
Well, my favorite example of a Unicode character is probably the pile of poo, because it's so popular and everyone immediately recognizes it. If you give a presentation and you try to preach for Unicode support everywhere, no one really cares. But as soon as you start mentioning emoji and piles of poo, everyone wants to support that, and they want to update their databases, they want to update their code and write fixes and stuff. So really, emoji is like a gateway drug to full Unicode support. I think... what was your second question again?
Which is the Unicode character that troubled you the most?
Huh. Well, there's a bunch of them, actually. I have a custom keyboard layout so I can easily type the pile of poo character without having to copy-paste it. So I like to enter that, you know, whenever I'm testing an open source project or whatever. That way I do find a lot of bugs. So I guess you could say it's the pile of poo once again.
Yeah. Thank you.
Hey man, this is Ashwini Shankar. I have a question here. If I try to build a regular expression to just remove all the numbers out of a string, how do I handle all the Unicode characters? Because digits in other scripts may not be the same as in English. Is there one way that I can handle the digits of every script in Unicode?
Yeah, that's a very good question. Let me just go back to one of the previous examples. You would have to write a script for now, similar to this one, where instead of reading out all the symbols in the Kannada script, you would look for the digits property in Unicode. And basically, the rest of the script would be the same. So it's a huge list of all the Unicode digits in different scripts. You would get it automatically in just one line of code. You feed it into Regenerate, and then it turns it into a regular expression.
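The idea behind that answer can be sketched without any dependencies. The code below is a hypothetical, BMP-only stand-in for what a tool like Regenerate does, and the digit list is a small hand-picked subset (ASCII, Arabic-Indic, Devanagari, and Kannada digits) rather than the full Unicode data; `range` and `toCharacterClass` are names invented for this example.

```javascript
// Hypothetical helper: all code points from `from` to `to`, inclusive.
function range(from, to) {
  const result = [];
  for (let cp = from; cp <= to; cp++) result.push(cp);
  return result;
}

// Hypothetical, BMP-only stand-in for a tool like Regenerate:
// turn a list of code points into a character class, merging runs into ranges.
function toCharacterClass(codePoints) {
  const sorted = Array.from(new Set(codePoints)).sort((a, b) => a - b);
  if (sorted.length === 0) return '[]';
  const escape = (cp) => '\\u' + cp.toString(16).toUpperCase().padStart(4, '0');
  const format = (start, end) =>
    start === end ? escape(start) : escape(start) + '-' + escape(end);
  const parts = [];
  let start = sorted[0];
  let end = start;
  for (const cp of sorted.slice(1)) {
    if (cp === end + 1) {
      end = cp; // extend the current run
    } else {
      parts.push(format(start, end)); // close the run and start a new one
      start = end = cp;
    }
  }
  parts.push(format(start, end));
  return '[' + parts.join('') + ']';
}

// A small, incomplete sample of decimal digits from different scripts.
const digits = [
  ...range(0x0030, 0x0039), // ASCII digits 0-9
  ...range(0x0660, 0x0669), // ARABIC-INDIC DIGIT ZERO..NINE
  ...range(0x0966, 0x096F), // DEVANAGARI DIGIT ZERO..NINE
  ...range(0x0CE6, 0x0CEF), // KANNADA DIGIT ZERO..NINE
];

const source = toCharacterClass(digits);
console.log(source); // [\u0030-\u0039\u0660-\u0669\u0966-\u096F\u0CE6-\u0CEF]

// Remove every digit in the sample set from a string.
const anyDigit = new RegExp(source, 'g');
console.log('a1\u0CE7b\u0967'.replace(anyDigit, '')); // ab
```

A real build script would read the complete list from the Unicode data files instead of hard-coding a few blocks, and it would also need to handle astral code points, which this sketch does not.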
And I've actually created this regular expression before, when I was working on the proposal to make backslash d match these symbols as well. The proposal was rejected. But I can tell you the regular expression is a bit bigger than this one. It would fill up the whole slide. But that's what it takes to support full Unicode, I guess.
Sure. That helps. Thanks for this.
Yeah. Cheers.
Hello. I have a question. You said we can use the u flag there. Won't it slow down the state machine? Say I have a number of characters, and it converts every character to Unicode before matching. If only one of my characters needs Unicode, why should I use the u flag for the entire regex string?
Are you asking about performance? If it slows it down?
Yeah.
I haven't had time to benchmark the Edge implementation yet. I've done a little bit of work with the V8 implementation. I did some reviews there. And of course, there's going to be a certain performance impact: the more characters you're trying to match, the slower it's going to be. But historically, lots of the existing browser benchmarks that JavaScript implementations are tuned for test regular expression performance really strongly.
I have read the article of Travis Norris. He said, why should I convert the characters to Unicode? We can use the normal literal representation or accept the Unicode, but it slows down the performance. So in that case, I think it will slow down. I'm not sure about this.
Yeah. Well, it's always going to be a little bit slower, I guess. But in my opinion, you really don't have a choice. If you're working on a web application, you want to support all kinds of symbols, not just the BMP symbols, because sooner or later one of your users will enter an emoji or another astral symbol, and you probably don't want your code to break in that case. So I think performance is important and it's good that you consider it.
But it should be more important to have an actual functional web app, in my opinion. Okay. Thanks. Thank you.
Hey, Mathias. So I have a question about where you talked about the abstract syntax tree for a regular expression. I'm not really sure: do you only take the regular expressions and make an abstract syntax tree, or does the entire source code go through that, and in the code generation phase you just replace those regexes with whatever you have come up with in Regenerate? Yeah. So you're familiar with Esprima, which is a JavaScript parser. It takes any JavaScript code and parses that. But if you go through the abstract syntax tree that Esprima generates, every regular expression will basically just be an object that contains the original regular expression as a string. So if you want to look at the individual parts of the regular expression, to check which ranges are in there, which character classes, things like that, it's really hard. What you need to do is take that string out of the syntax tree and then parse it as a regular expression. That takes a specialized parser that is not part of Esprima, and the regjsparser project that I mentioned does exactly that.
Yeah. Regarding Unicode, we have many polyfill libraries like Unicode.js and CSS escape. They are purely JavaScript libraries, but now we are headed towards transpiled solutions. So as we move on with Unicode, the effort to support it is getting harder. Do you feel that this is the only direction that we have? Yeah, that's a good question. For me, it's very interesting to see that the more the language evolves, the fewer third-party libraries we will need. So we can get rid of more and more code over the years. Take the code you write today in ES6, using the u flag or even other features: today there are transpilers.
They translate that into other code that the existing browsers can support. But imagine the exact same code five years from now: it won't have to be transpiled as much anymore, because all the browsers across the board will just support that many more features. So in that sense, it really feels good to be able to not think about browser support for a second, to just write your code based on the standard, based on the specification, and know that the transpiler will take care of everything else for you. Does that answer your question? Yes, sure. Yeah, thank you.
Okay, so you were talking mostly in terms of code points. How about the representation of strings in JavaScript? Data comes in as UTF-8 most of the time. So internally, does it convert into UTF-32, or what? Are you asking about the internal encoding that JavaScript is using? Yes. Yeah, yeah. So what JavaScript is doing internally really strongly resembles UTF-16, if you're familiar with that. UTF-16 uses two bytes to encode any BMP symbol, but any astral symbol requires four bytes, so twice as many. And that's pretty much what you see happening in JavaScript internally. The problem is that it exposes these surrogate halves as if they were separate characters, when really that should be an implementation detail that is never exposed to developers. But historically, this is the way it has been, and we cannot change it now because it would break a lot of existing code. So my understanding is that it sounds similar to Java. Let's say I send some data from the server, and I say this is UTF-8 encoded, a multi-byte encoding, and it comes in. Will the JavaScript engine convert the characters coming in as UTF-8 to UTF-16? No, JavaScript does not have built-in encoding functions. In the DOM, there is an API called the TextEncoder API, so you can use that in a browser to do this kind of thing.
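To make the difference between code units, code points, and UTF-8 bytes concrete, here is a small sketch. It assumes a runtime that exposes `TextEncoder` as a global (modern browsers, or Node.js 11 and later); the variable names are only illustrative.

```javascript
// Assumption: TextEncoder is available as a global in this runtime.
const poo = '\u{1F4A9}'; // PILE OF POO, an astral symbol

// .length counts UTF-16 code units, so the two surrogate halves show through.
const codeUnits = poo.length;

// The string iterator walks code points, so Array.from sees a single symbol.
const codePoints = Array.from(poo).length;

// The two surrogate halves, as hexadecimal code unit values.
const halves = [poo.charCodeAt(0), poo.charCodeAt(1)].map((unit) => unit.toString(16));

// TextEncoder always produces UTF-8 bytes.
const utf8Bytes = new TextEncoder().encode(poo);

console.log(codeUnits, codePoints, halves, utf8Bytes.length);
```

The astral symbol takes two 16-bit code units (four bytes) in UTF-16 and also four bytes in UTF-8, yet it is a single code point.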
But yeah, basically it's just that the JavaScript source code itself is represented as UTF-16 internally. And sometimes, if you're getting the length of a string or iterating over the characters in a string, that implementation detail kind of bleeds through, which is a shame.
Hi. I just want to make sure: if I'm transpiling, will there be any effect on the private use areas in the 15th and 16th planes? I'm sorry, can you repeat the question? If I'm using some characters in the 15th and 16th plane private use areas. Yeah, in JavaScript strings or in regular expressions, it's fine to use any character at all, including the ones in the private use areas in Unicode. So you could create a regular expression, for example, that matches only those private use area code points. It would work perfectly fine. Thanks so much. Cheers. In fact, JavaScript goes even further. Take surrogate pairs: you have these two surrogate halves, and they should only ever occur in pairs. There should never be a surrogate that just exists on its own, and this is actually invalid in UTF-8, for example. But for some reason, JavaScript strings are still allowed to contain these lone surrogates. And this causes a lot of issues, especially security issues. Say you have a string containing a lone surrogate. You can represent it in JavaScript, but as soon as you try to insert it into a database that uses UTF-8 encoding, or send it to some other kind of parser that uses UTF-8, it will just crash and it might break your application. I've actually seen some real-world examples of this. I have a separate presentation about the security aspects of Unicode support, so check that out if you're into it. It's called Hacking with Unicode. Yeah. It's fairly amazing what you can do with just a lone surrogate or a pile of poo.
You can actually hack websites with that, believe it or not. All right. Thank you, Mathias. Thanks.
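As a footnote to the lone surrogate discussion, the sketch below shows one way such strings could be detected before they reach a UTF-8 boundary. It is an illustration under stated assumptions, not code from the talk: `hasLoneSurrogate` is a hypothetical helper, the pattern needs an engine with lookbehind support (ES2018), and `TextEncoder` is assumed to be a global.

```javascript
// Hypothetical helper: true if the string contains a surrogate half
// that is not part of a proper high + low pair.
// Assumption: the engine supports lookbehind assertions (ES2018).
const loneSurrogate = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

function hasLoneSurrogate(input) {
  return loneSurrogate.test(input);
}

const lone = '\uD83D';         // high surrogate without its low half
const paired = '\uD83D\uDCA9'; // a complete pair (U+1F4A9)

// TextEncoder cannot express a lone surrogate in UTF-8, so it substitutes
// U+FFFD REPLACEMENT CHARACTER (bytes EF BF BD) instead.
const replaced = Array.from(new TextEncoder().encode(lone));

// Other built-ins refuse outright: encodeURIComponent throws a URIError.
let threwUriError = false;
try {
  encodeURIComponent(lone);
} catch (error) {
  threwUriError = error instanceof URIError;
}

console.log(hasLoneSurrogate(lone), hasLoneSurrogate(paired), replaced, threwUriError);
```

Depending on the consumer, a lone surrogate is silently replaced or rejected with an error, which is why it is worth checking for one before handing a string to anything that expects well-formed UTF-8.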