Let's go. Welcome, and thank you for coming. Today we're going to talk about Unicode. My name is Nicolas Grekas. I work in Paris for SensioLabs, and I'm CTO at blackfire.io, the PHP profiler, which I think you should try. I'm also a Symfony core team member; I just reached my 1,000th pull request on Symfony. I mostly take care of the Cache, Debug, VarDumper, Process and DependencyInjection components. There are a few others, but these are the ones I know best. I'm also the author of something that is old by now, called Patchwork UTF-8. It has seven-and-something million downloads on Packagist, so it's quite a widely used package. That's what I'm going to talk to you about today: the things I learned while working on it.
So, Unicode. Let's talk about charsets first. Who here has struggled with charsets? Yeah, charsets are difficult things. At the beginning of computing, the first engineers of course wanted to represent text with computers. So they created a table of 128 entries and gave each number in that table a corresponding character. It was at a time when machines only used seven bits per character. This is the US-ASCII standard, which standardized the first 128 characters, and it dates from the 1960s. So we have the usual ABC letters and a few special ones. The first two rows are the special ones, control characters in fact. These are legacy things we still have to live with. My favorite might be EM. Who knows what the EM character is, the one at 0x19? What does it mean? Another one: LF, what does LF mean? Yes, line feed, great. And CR? Yes, carriage return. It refers to a physical object, the carriage of a typewriter: carriage return brings it back, line feed moves the paper down. So this is really control-oriented. Today we still use those two characters to represent a line break, but it's purely logical now.
Let's take another one, back to EM. EM means End of Medium. We had rotating magnetic tapes, and this marks the end of the tape. Nice. So most of them are legacy. That's EM. Then IBM created the first 8-bit CPU, and with eight bits we have 256 values, so there is room for extended characters. What the world did at that time is that every region chose its own map of extra characters. The extra characters in this screenshot are the ones that Microsoft chose to encode and standardize as something they called Windows-1252. This is, or at least was, the most used character set in Western countries. It can represent almost any French word, most Spanish words, German words: Western European countries, you know. Of course, it doesn't work for Russian, and it doesn't work for Greek. So there is a corresponding extended set for Greek, for Russian, and so on.
These still exist on the web. ISO 8859 is almost the same as the previous one. It's the second most used; 5% of the web runs on it. Then we have the others: Shift_JIS is the Japanese one, and then Russian Cyrillic, Hebrew, and so on. The first one in the list is what we're mostly going to talk about now: UTF-8, for Unicode Transformation Format, 8-bit, because it works with 8-bit units. It's the most used encoding on the web, and I hope you'll understand why after this presentation. It is part of the Unicode standard, which is one single big global map for all the characters that exist, or maybe ever existed, on Earth.
So we have that number of characters: a lot, more than a hundred thousand already encoded, and every new version of Unicode adds more. We have 135 scripts, that is, writing systems. One is Greek, one is Latin, one is Cyrillic, and so on.
So of course, as you can see, there aren't just five languages on Earth; there are a lot. This is how we can write the word "peace" using several scripts: peace in English, of course, peace in Arabic, peace in Japanese, and the peace symbol. There is a peace symbol, you know. All of these are encoded. Let's take the first letter of each word. The first letter is one single character, which means it has one corresponding number in that big Unicode map. Latin capital letter P is at number 50 in hexadecimal, U+0050. Then we have the first Arabic letter, seen, which is U+0633, again in hexadecimal. We have this CJK unified ideograph, which originally comes from Chinese. And the peace symbol is U+262E. So in Unicode, basically, we have a number associated with each character and a character associated with each number. Each character also has a category, a script, and a few other properties: is it uppercase, is it lowercase, is it a number, a mark, a space, punctuation? This is the kind of thing Unicode encodes.
Why do we care? Because people's family names are not made of only 26 letters. This is a tweet from an angry user saying: hey, my name is valid, it's your form that isn't. This user has a C with an accent on top in his last name, and the form is just plain wrong. This is the kind of thing we have to deal with as developers. We must accept that kind of input; it's our duty to be friendly to everyone. So let's go.
Unicode is a space of about 1.1 million numbers, from code point zero to code point 0x10FFFF. As I told you just before, not all of them have a corresponding character. Only about 128,000 of them are assigned so far.
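The code points and properties mentioned for the "peace" examples can be inspected with a few lines of code. The following is a minimal Python sketch, not something shown in the talk; `describe` is a hypothetical helper name, and it relies only on the standard `unicodedata` module.

```python
import unicodedata

# First letters from the "peace" examples, plus the peace symbol itself.
samples = ["P", "\u0633", "\u262E"]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (general category)' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"

for ch in samples:
    print(describe(ch))
# U+0050 LATIN CAPITAL LETTER P (Lu)
# U+0633 ARABIC LETTER SEEN (Lo)
# U+262E PEACE SYMBOL (So)
```

The two-letter codes are general categories: `Lu` is an uppercase letter, `Lo` a letter without case, `So` a symbol.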
In white is the free space for the future, and in blue the space that is already allocated. So, let's say, 12% of the area has an actual symbol mapped to it. The green area is special. It's called the Private Use Area: an area where Unicode will never define any character, so that you can create your own things in your private space without clashing. It won't be portable, it won't be standard, but it is an area you can already use to create your own characters whenever you want to. Say you create some icons: you can map them to the green area.
Each big square is called a plane, and there are 17 of them. At the beginning, Unicode version one was just one plane, so 65,536 characters. They thought at the time that it was enough, but as you can see, it isn't. So very quickly they decided to add more space for characters.
Let's zoom in on the first planes. These are the first two planes, and each color represents one script, so on this screen we should have 135 differently colored areas. The blue one is for the Chinese symbols; Chinese uses most of the symbols, no surprise. The second one, the brown one, is for Korean characters; again, they have a lot of them. And ASCII is just three or four rows in the small square at the top left of the screen. That's all the US is using, and it's what the rest of the world is using too.
Same thing, but this time the map is a heat map, based on Twitter and Wikipedia. The whiter it is, the more the character is used. Black means nobody uses that character, red means some use, yellow a bit more, and white marks the most used characters.
So it's no surprise that, again, the small square at the top left of the screen holds the most used characters. These are the US-ASCII characters, and this is the most widespread language in the world. But as you can see, the Chinese area is used quite a lot, we also see a kind of pattern in the Korean area, and a few others. At the bottom of the screen is the emoji area. As you can see, a lot of people speak emoji, or write emoji.
Those were the general concepts. Now let's talk about more technical things: bits, bytes, binary representation. We have code points. A code point is the number associated with a character, so now let's map code points to bytes. A code point is an abstract number; let's encode it into an actual series of bits and bytes. There are three main encodings in the Unicode standard. One is UTF-8, which uses a variable number of bytes, one to four, to encode one code point. Then we have UTF-16, which uses either two or four bytes per code point. And then UTF-32, which always uses four bytes for every single character in your text. So with UTF-32, if you write US text with A, B, C, D letters, it will be four times as big as it needs to be. Of course this is not the preferred representation for network transfer; the overhead is too high. What we use is UTF-8, and one of the reasons is that the ABC characters need just one byte each. Since those are among the most used ones, let's choose that one. There are a few other reasons.
Let's take this letter: Latin small letter A with acute accent. In UTF-16 it's 00 E1, and in UTF-8 it's C3 A1.
These are the actual hexadecimal values of the bytes that encode this character in binary. Then we have the second character, which is hiragana letter A. In UTF-16 it's 30 42, and in UTF-8 this one takes three bytes: E3 81 82.
How does that work? Let's focus on UTF-8. UTF-8 uses a very clever trick to encode the numbers. Remember the ASCII rows in the very first table: they all map to numbers below 128, which means they only need seven bits. A byte is eight bits, so we just have to put a zero in the first bit and it works: we have encoded the first rows. So all ASCII characters are encoded with zero as the first bit. For everything else we put a one in that first bit, and the prefix of the first byte tells how many bytes follow: 110 means one more byte follows, 1110 means two follow, and 11110 means three follow. Each following byte starts with 10. We could continue, but we can stop here because that's enough to encode 1.1 million code points. So it's okay for the future.
So it's built on top of ASCII, which is a genius thing, created by very clever people. It's a very small thing, but it makes the whole system compatible with legacy ones: legacy systems that only deal with ASCII also deal quite easily with UTF-8. It's also self-synchronizing. If you look at a stream of bytes, say you're on the network and you look at the bytes that go through an interface, and it's text, then just by looking at the first bits of each byte it's really easy to tell whether you're in the middle of a character or at the beginning of one. You just look at where the first zero is. Look at the table: from the position of that first zero you can tell instantly whether your character starts here, just before, or after.
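To make the byte layouts above concrete, here is a small Python sketch (not from the talk). It prints the two example characters in the three encodings and uses the "continuation bytes start with 10" rule to find where a character begins in a UTF-8 byte stream. The helper names `show`, `is_continuation` and `char_start` are made up for this illustration.

```python
def show(ch: str) -> None:
    """Print the bytes of one character in the three Unicode encodings."""
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        print(f"U+{ord(ch):04X} {codec:10} {ch.encode(codec).hex(' ').upper()}")

def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

def char_start(data: bytes, pos: int) -> int:
    """Walk backwards from pos to the first byte of the code point containing it."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos

show("\u00e1")  # a with acute: UTF-8 is C3 A1, UTF-16 is 00 E1
show("\u3042")  # hiragana a:   UTF-8 is E3 81 82, UTF-16 is 30 42

# "a" + a-acute + hiragana a: 1 + 2 + 3 bytes in UTF-8
data = "a\u00e1\u3042".encode("utf-8")
print([char_start(data, i) for i in range(len(data))])  # [0, 1, 1, 3, 3, 3]
```

Whatever byte you land on, you never have to look back more than three bytes to find the start of the current character, which is what makes the encoding easy to resynchronize.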
Let's start with a few linguistic concepts, because Unicode is really about linguistics. It is encoded for computers, but we first have to understand that the world and its languages are complex, and that what American and Western countries have is maybe simple, and doesn't fit every writing system humans have invented across the world.
So, case folding. This is about the concept of having uppercase and lowercase letters. It applies to only about 1,000 characters, so just a few of them. Maybe the most important ones, but still just a few. Something we need to deal with as soon as we have this case thing is caseless matching. We all do that, you know it: if you write a search engine, you want it to be case-insensitive, of course. So technically that's something we have to handle. How does that work? One way, maybe the naive way, is to put the strings in lowercase and be done with it. Once both are lowercase, we just compare them, and that should work even at the binary level. You agree? Okay, no, it doesn't work, not in the general case. Why? Because, again, people are very creative. In Greek, there is one uppercase sigma and two corresponding lowercase versions of it: the sigma used inside a word, and the final sigma used at the end of a word. They are the same letter. So if you compare sigma one and sigma two, they should be considered the same character, but since they are different code points they won't match at the binary level. That's something we need to care about.
There are a few other exceptions. One is the Turkish exception. In Turkish the dot above the i has meaning.
When you put the capital I in lowercase, you select the Unicode character that is an i without the dot above. And when you put our i in uppercase, you take the capital I with a dot above, okay? This is very special, and if you deal with Turkish text you have to take it into account, because a word can mean something very different with or without the dot above the i.
Another thing that happens when you consider case folding is the German Eszett. Germans have this ß symbol, which is exactly the same as two S's. This is called full case folding. The previous examples had one property, which is that one character maps to one character. This case is special because one character maps to two characters. You will write your first implementation without having that in mind, and it will fail, because someone in Germany will send you text containing that symbol and it will break everything, or at least it won't work as you expected.
Let's continue. This is the word "déjà", which means "already". This is the typical way to encode it: the letter D, the E with accent, J, the A with accent, and done. The first row shows the Unicode code points, so the logical number associated with each letter. For the E, I also wrote the binary representation. This is number E9, and in UTF-8 it is represented by two bytes: the first one is this one and the second one is that one, okay? The A with accent is also two bytes. So the length of this string is six bytes, and four logical characters, okay? But in Unicode there is something we call combining characters. A combining character can be an accent: there is a character that is a standalone accent and applies to the previous character. So we can also have this word represented like this.
We have the D, then the E, the plain E from ASCII, then the standalone combining accent, then the J, the A, and the accent, the other one, in the opposite direction. This is exactly the same word, just represented differently. The first version is called Normalization Form C, composed, or NFC, and the other is Normalization Form D, decomposed, or NFD. These are two representations of the same word that you have to consider exactly the same at the semantic level. If you want to match this input, say this username against that other username, are they the same? Yes, if it's a username, they should be the same user. So we have a technical issue: how do we do the comparison at the binary level? Because binary is the only thing computers deal with. Let's continue; we'll see how later.
Another concept is alphabetical ordering. A, B, C, D is the usual American order, but ordering is cultural and locale-dependent. A few examples. In Lithuanian, they put the Y between the I and the K. Nice. So you have to know which language a text is in to be able to sort it, and if you have Lithuanian names, you should put the Y after the I and before the K. Another example: in Spanish, CH should be considered one single letter. We also have the opposite. Here is a ligature, Œ, which is one single character in Unicode. In French we call it "e dans l'o". We use it in "œuf", which means egg. It should be sorted after OE and before OF. So again, sorting is not that easy. In Danish, the A with ring above is a separate letter, not an A with some fancy thing on top. It's just another letter, and it comes after Z. In Swedish, V and W are the same letter, so there's no ordering between them: they are different visual representations, but logically exactly the same letter. In German, the A with umlaut is exactly the same as AE.
So when you sort, you should have that rule in mind. And there's a French-specific rule about accents that I didn't know about. How do you sort words with accents? In French, you consider the last accent first and the first accent last, which gives this strange ordering between these words, which are different words. This is the ordering you have to use if you want to do it properly in France. The word behind all that is collation: the sorting rules that each culture and each country uses.
Back to our "déjà". Another problem is the question at the bottom of the screen: what is the second character of this word? Is it E9, the E with accent? Is it just the E? Or is it the E plus that accent? Of course, it depends on the level at which we decide to work. At the code point level, the second character is the E with accent on the first line, and the E without accent on the second line. But if you cut the string at that position, I guess you won't be very happy with the result; it's not what you'd expect. So Unicode has another concept, called the grapheme cluster. A grapheme cluster is a group of code points that, for humans, logically form one single character. So the precomposed E with accent, and the E followed by the accent as a separate code point, are each one single grapheme cluster in this string. Okay.
So, grapheme clusters. Now we have something new: emoji, from Unicode 6, maybe 7, I don't remember exactly. We all use emoji in text messages. There is something really nice in emoji: they use modifiers, which combine like the standalone accent. In this case, the modifier is a skin tone, taken from what is called the Fitzpatrick scale.
So we have one character that just represents a color, okay, and another character that just represents a face. If you combine them, Unicode says that the face should take the color of the modifier. That's how Unicode, and your Android, Samsung or iPhone, creates colored faces. Nice and clever, isn't it? That was one grapheme cluster: the face and the color.
There are more things in Unicode. There is a way to combine glyphs. A glyph is the visual drawing, the shape of this face. In Unicode you can also combine characters so that visually they look like a single block. Here is an example, an excerpt from the Unicode standard. We have the woman character, then something called a zero-width joiner, then the heart symbol, then again a zero-width joiner, and then another face. Systems that are able to handle that will display it as a single visual thing, which is this woman-loves-woman image. And if a system doesn't understand what the zero-width joiner means, because it was added in a recent version of Unicode, it will display what's on the second row: three separate characters in a row. Of course that's clever, because it still works on systems that don't know about it, and it lets you combine and visually aggregate things together as you like. There is no separate character for woman-loves-woman; there is just the woman and the heart. So logically and semantically, it's quite rich.
Another concept in Unicode. Have you heard about the recent apple.com security issue? Well, a security researcher managed to create a domain name that looks like apple.com. Visually it's exactly apple.com, but it's his own domain name, not Apple's. How did he manage to do that?
He just picked, from the Cyrillic script, the characters that look the most like the A, then the P, the L, the E, and created something that is different in Unicode but visually the same. So he could spoof: he could send you an email saying, go to apple.com, click on that link, and you would end up somewhere very different, which is not apple.com, okay? If you think about your own systems, you might have the same issue with usernames: someone might have the username "apple" and someone else might also be named "apple", but it's not the same one even though it looks the same, so there can be some confusion. This is a chart, also from the Unicode standard, that lists similar-looking characters. Just a few of them, because there are many more. Technically you can use that kind of mapping and be safe, or try to be safe, against this kind of visual confusion.
Okay, a summary of the Unicode fundamentals. We saw the concept of character case, which leads to what we call case folding and case insensitivity. We have composition and ligatures. We have an issue with comparison: how do we compare that "déjà" and that other "déjà"? We have collations: how do we order things? We also saw the segmentation issue. I showed you segmentation for characters, with this E with accent: how do we split on one character? Unicode also has rules for word segmentation, sentences and hyphenation.
There are more things in Unicode. We have locales, so cultural conventions: how do I format a number? How do I format a date in the US or in France? It's not the same. How do we do transliteration? How do we write Greek using ASCII only? There are rules for that, and of course Greek people want to be able to write Greek using ASCII only, so that they can communicate over ASCII-only media. That is what's called transliteration.
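Going back to the look-alike domain described above, a rough way to see what is going on is to compare the code points behind two strings that render the same. This Python sketch is only an illustration: `scripts_used` is a hypothetical helper that guesses the script from the first word of each character's Unicode name, which is a crude heuristic and not how ICU's spoof checker works, and the particular Cyrillic letters chosen here are an assumption about which look-alikes were used.

```python
import unicodedata

def scripts_used(text: str) -> set:
    """Crude script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

latin = "apple"
# Cyrillic a, er, er, palochka, ie: letters that look like a, p, p, l, e
spoof = "\u0430\u0440\u0440\u04cf\u0435"

print(latin == spoof)        # False: the code points differ
print(scripts_used(latin))   # {'LATIN'}
print(scripts_used(spoof))   # {'CYRILLIC'}
```

A real check would rely on the Unicode confusables data rather than on character names.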
We have the identifier and security issue, with the concept called confusables that we saw just before. Another concept relates to the display of text: direction. Arabic reads from right to left, so Unicode has to say something about the directionality of text rendering and reading. There's also the width of characters: a single character might not fit in one single visual column of text. An A is not as wide as a Chinese character, which needs more space to be displayed correctly.
Okay, in practice, let's do some programming. We have ICU. ICU is the reference Unicode implementation. It's backed by IBM, it's written in Java and in C/C++, and it is open source with an X-like license. It really is the reference implementation, so if you want to do things correctly, just use it. In fact, that's what everyone does. JavaScript uses Unicode: every string you type in JavaScript is Unicode. Python has typed strings, so you can tell Python that this string is Unicode text and that one is just a stream of bytes. That's something PHP tried to do, and it's the reason why we don't have PHP 6: PHP failed at it, so rest in peace.
Right now UTF-8 dominates the world, and NFC, the composed way of encoding accents, dominates as well. It means that when I type that "déjà" word as a French user, I almost always get the composed version, and very rarely the decomposed one. So fortunately we don't have to deal with grapheme clusters that often, because characters are precomposed most of the time. At least in Western countries, it's something we might not have to deal with every day. So, PHP. Let's see what we have.
We have iconv. You can configure iconv to be in UTF-8 mode: at the beginning of your PHP script you call iconv_set_encoding with UTF-8 and done, it's configured globally. Then you have a set of functions that let you deal with UTF-8 strings at the code point level: iconv for conversion, iconv_strlen to get the length of a string, iconv_substr, iconv_strpos, iconv_strrpos. These are the UTF-8-aware counterparts of the strlen, substr and strpos functions that we use to deal with bytes in PHP. PHP natively deals only with bytes. So this is one of the tools we can use to handle UTF-8 strings. Nice, one problem kind of solved.
We also have mbstring. mbstring is a PHP extension that has the same functions we just saw, except that they are prefixed with mb_: mb_strlen, mb_substr. I didn't list them all, but the iconv-like functions are available thanks to mbstring. It also gives us a few more tools in our toolbox to deal with UTF-8. mbstring is able to put a string in uppercase or lowercase using the Unicode mappings, which is something PHP cannot do natively. We also have mb_stripos, where the i means case-insensitive, which is able to deal with the sigma peculiarity I told you about, which is called simple folding. If you give it a final sigma and ask mbstring to look for it in another string, it will find the position of a mid-word sigma, which is a different character. It will still stop there and say: this is the position of that sigma in that string. It was a surprise to me to see that mbstring has this mapping bundled into it. That's really nice, one more tool for dealing with that kind of issue. So again, we have another tool to do the same UTF-8 string handling as before, and we can also deal with character case now. Let's continue, we have a few more.
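The caseless-matching behaviour described above can be sketched outside PHP too. The following Python example is an illustration only, not the mbstring implementation: it uses `str.casefold()`, which applies Unicode full case folding, and `caseless_find` is a hypothetical helper name.

```python
def caseless_find(haystack: str, needle: str) -> int:
    """Case-insensitive search based on Unicode case folding.

    The index refers to the folded haystack, so it can differ from the
    original position when folding changes the length (e.g. the German
    sharp s folds to "ss").
    """
    return haystack.casefold().find(needle.casefold())

final_sigma = "\u03c2"    # sigma used at the end of a word
medial_sigma = "\u03c3"   # sigma used inside a word
kosmos = "\u03ba\u03cc\u03c3\u03bc\u03bf\u03c2"  # Greek "kosmos", with both sigmas

# Naive lowercasing does not unify the two sigmas...
print(final_sigma.lower() == medial_sigma.lower())        # False
# ...but case folding does, so the medial sigma at index 2 is found.
print(caseless_find(kosmos, final_sigma))                 # 2

# Full case folding: one character can fold to two.
print("Stra\u00dfe".casefold() == "STRASSE".casefold())   # True
```

Note that `casefold()` is not locale-aware, so the Turkish dotted and dotless i would still need separate handling.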
PCRE is the regular expression engine we use in PHP. In PHP regular expressions you can add, and you have to add, a modifier to deal with Unicode. If you add this u modifier at the end of the regular expression, the meaning of the dot changes. Without the u, a dot means one byte, eight bits, okay? Just that, which means it doesn't work for this E with accent, which is two bytes long. If you add the u, the dot means one code point: one, two, three or four bytes, depending on the character that is there. That's nice. It allows us to count, cut and so on in code points rather than bytes.
This engine also gives us access to Unicode properties. We have this special notation in regular expressions that lets you, say on the second row, match any Greek character: you write \p{Greek} and it will match only a Greek character. So if you need to deal with that kind of problem, you have tools to do it. Each character also has general properties; you need to look at the PHP documentation for the full list. This M, I don't know exactly, I think it's marks. The N, I don't remember. So you can match punctuation, you can match white space, you can match a number, a number being not only zero to nine, because numbers are also written with a wide range of characters across the world.
We have this backslash capital X. This one is very interesting and it opens up the next topic, because \X means one grapheme cluster. That's the thing we had on the previous screen, and it is also at the top right of this screen: the E with a combining character after it. \X will match once for the E together with its standalone combining character.
There's also a trick, which is preg_match with '//u': an empty regular expression, which matches any string because there are no specific criteria. But with the u modifier, PCRE checks the string first, so preg_match just returns false if the string is not valid UTF-8. Nice: we have an easy check to know whether a string is valid UTF-8 or not. Good. So again, using only that, we can recreate all the other string functions: substr, strlen, strpos, and so on. And we have a new tool that gives us access to Unicode properties: \p{Greek}. Cool.
Let's continue. In PHP we also have these functions: grapheme_extract, grapheme_stripos, grapheme_strlen, grapheme_strpos, grapheme_strripos, grapheme_strrpos, and so on. These functions work at the grapheme cluster level. If you look at this "déjà", they know about combining characters and they can tell you that the length of the word is four, whatever the representation is. It's four letters: D, E, J, A, precomposed accents or not, it's always four letters. grapheme_strlen is the only native function in PHP that will give you four as the result for that string. Cool. Now we are able to deal with grapheme clusters.
One last thing we have in PHP is the Normalizer class. It is able to transform the "déjà" of the first row into the one of the second row. If you give it the first one and tell it to normalize to NFD, it gives you back the second representation. And if you give it the second representation, the decomposed one, and tell it to normalize to NFC, which is the default, you get back the first "déjà" string, the one at the top right of the screen. This is something you could do for every user-submitted input, because any user-submitted input could potentially come in either form, the second row or the first row.
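The two steps just described, checking that input is valid UTF-8 and then normalizing it, can be sketched as follows. This is a Python illustration of the idea, not the PHP `preg_match` or `Normalizer` code from the talk; `clean_input` is a hypothetical helper name.

```python
import unicodedata
from typing import Optional

def clean_input(raw: bytes) -> Optional[str]:
    """Return the NFC-normalized text, or None if raw is not valid UTF-8."""
    try:
        text = raw.decode("utf-8")   # strict decoding rejects malformed bytes
    except UnicodeDecodeError:
        return None
    return unicodedata.normalize("NFC", text)

composed = "d\u00e9j\u00e0".encode("utf-8")        # precomposed form, 6 bytes
decomposed = "de\u0301ja\u0300".encode("utf-8")    # combining accents, 8 bytes

print(composed == decomposed)                            # False at the byte level
print(clean_input(composed) == clean_input(decomposed))  # True once normalized
print(clean_input(b"d\xe9j\xe0"))                        # None: Windows-1252 bytes
```

Normalizing once at the boundary means a plain equality comparison is enough afterwards.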
So this might be something you need to do for every username creation. If someone clever knows about this, and your system doesn't normalize at that level when creating usernames, then I could register two usernames, one using the first version and another using the second. So this is very practical: you have to do something about it, and this is the tool for that. It fixes the issue we had with comparison: we can compare and match with the equality operator, just by using the normalized form of the strings.
Long list. All this is bundled into PHP's intl extension. The intl extension is just ICU bundled for PHP. Nice: the reference Unicode implementation is available to PHP developers through intl. That's really great. intl provides a few other classes to deal with the other concepts I told you about. Collator lets you sort things using locale-specific ordering. There's NumberFormatter, Locale, MessageFormatter. Spoofchecker is about confusables, and so on.
Ah, one last thing. It's not exactly PHP, but of course we use MySQL quite often. MySQL is charset-aware, you know that. It has collations, a word you might have seen when creating tables. What does it mean? It was kind of unknown to me until I prepared this slide. If you select utf8_bin, comparison will be case-sensitive: capital A is a different character from lowercase a. If you select utf8_general_ci, A will be case-insensitive, but this ligature won't be considered the same as its two-letter version; those two will be different. That could be an issue in French, for example, where they really are the same. If you want them to be the same from MySQL's point of view, you need to select utf8_unicode_ci. So why do we have unicode, general and binary?
Binary, because you might want to store identifiers, case-sensitive identifiers. And then between general and unicode, the reason is performance. General is just faster than unicode, because unicode has more work to do to resolve that kind of comparison issue. So you can choose general and maybe move to unicode later if you want. I'll let you decide, but this is the kind of issue that you have to decide on your own. If you deal with Swedish and you want to build a directory and have alphabetical ordering correct for them, you can select utf8_swedish_ci, and then you do SELECT username, or SELECT last name, FROM user ORDER BY username. If you want correct ordering for Swedish, you have to select the Swedish collation to have this å, a with ring above, sorted after the z, along with a few other equivalent or non-equivalent characters.
Okay, last thing about MySQL. In MySQL, you also have to select the charset of the database, and the charset is a property of the database and also of the connection. When you talk to MySQL, you have to tell it in which character set you are going to send it your SQL queries. So usually you do SET NAMES utf8, and most of you maybe think that's enough. But utf8 in MySQL was created when Unicode 1 was created, which means that only 65,000 characters were encoded, which means that utf8, just utf8 in MySQL, uses only three bytes at most. Emojis are encoded on four bytes. If you select utf8 in MySQL, you won't be able to store emojis in your database. Done. So of course, if you want to store emojis, you have to select another charset, which is the new name they gave for Unicode 2 and later versions, and which is called utf8mb4: multi-byte, four bytes max.
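The three-byte versus four-byte distinction can be checked directly. The following is a small Python sketch, not MySQL code: `fits_in_three_bytes` is a hypothetical helper that reports whether a string stays within the code points that need at most three UTF-8 bytes (up to U+FFFF), which is the range the talk describes for MySQL's utf8.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode the text in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every code point is at most U+FFFF (3 UTF-8 bytes or fewer).

    Hypothetical helper: a string failing this check needs a four-byte
    capable character set such as utf8mb4.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


samples = {
    "latin a": "a",                    # U+0061
    "e acute": "\u00e9",               # U+00E9
    "euro sign": "\u20ac",             # U+20AC
    "grinning face": "\U0001F600",     # U+1F600, outside the first 65,536 code points
}

for name, ch in samples.items():
    print(name, utf8_length(ch), fits_in_three_bytes(ch))
```

A check like this can be run on input before it reaches a three-byte column, so that a mismatch is caught in the application instead of being discovered in the stored data.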
The three-byte limit has a practical security issue, and there are several big sites that were caught by it. They selected utf8, so the default is three bytes, and there is another default in MySQL, which is that warnings are just silently skipped. So if there is a mismatch, if you send it a UTF-8 character with four bytes, there's a big security issue, because MySQL would just cut the data there. Maybe just before that you did everything right, but at the end you just put random and corrupt data in MySQL. So this is something that hackers and crackers can in fact leverage to get into some systems.
So, was it easy? No. Okay, so we're working hard in Symfony to create and provide you this component. This is Patchwork UTF-8, something I created a few years ago, which we now want to add to Symfony so that all of you can really easily reuse everything I told you about in a more friendly way. So this is called the UTF-8 component. It might come to Symfony in a few weeks. It's just three classes that all have this list of methods. These are the usual methods that we use to, you know, count the number of characters in a string, cut strings, trim strings, put them to uppercase, lowercase, and a few others, I think, like replace; you might recognize them and just wonder how that works. So we have a Bytes class, we have a CodePoints class, and we have a Graphemes class, so that just by using one of them you are able to cut strings correctly in the correct unit system, either bytes or code points or grapheme clusters.
So in practice, let me just show you how that works. Let's take that déjà word again. With the Graphemes class, we have this static constructor. So let's say you call Graphemes, fromString with déjà, then length. What's the output of that? Four, yes, four.
Four, because we have D, then we have three bytes, the E with the CC 81, which is the way to encode the standalone accent, then the J, and again the same for the A. Now, with code points, what's the output of that? Yeah, six. D, E, standalone accent, J, A, standalone accent. And then at the bytes level, the corresponding implementation for this function is the native strlen function, and the output is eight, yeah. Okay, so now it's up to you to decide in which unit system you want to work and to use the correct class to do that. Easy? I hope so, you tell me.
So do we have some time? Okay, okay, last slide. This is very Drupal, Drupal-ish. The code of conduct talks about diversity, and diversity, you know, at the technical level, Unicode is about that: really be inclusive, embrace diversity, enjoy Unicode. I hope you liked it. So, if you have any questions.
So way back on your second slide, the one with 256 characters, let's say that your source text is not quite as clean as you hope it to be, and you can tell that there are some control characters and other stuff that you've got to strip out in there. What do you do with those two lines in the middle that are legacy, that, you know, have some control characters in there, some regular characters? What do you do with those?
You mean, when some character sets are mangled and you have some input that was not encoded in UTF-8 and you want to turn it into UTF-8? Yeah. There is no, you know, single right way to fix that, unfortunately, because most charsets are not specific, which means you cannot recognize them using computers. So if you want to do that, you need some knowledge of how your source text was produced. So then you maybe need trial and error to figure out, oh, it was in Windows-1252. So I need to use iconv with Windows-1252 as the from charset and UTF-8 as the to charset. So this is really, yeah, there's no magic way. Okay, sorry.
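As a rough sketch of that conversion, here is the same idea in Python instead of PHP's iconv. The function names are hypothetical, and the source charset is an assumption the caller has to supply from knowledge of the data, as the answer above says; `cp1252` is Python's codec name for Windows-1252.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(raw: bytes, source_charset: str = "cp1252") -> bytes:
    """Return the input as UTF-8 bytes.

    Input that is already valid UTF-8 is returned unchanged. Anything else
    is decoded with source_charset, which is an assumption about where the
    data came from and cannot be detected reliably from the bytes alone.
    """
    if is_valid_utf8(raw):
        return raw
    return raw.decode(source_charset).encode("utf-8")


legacy = "d\u00e9j\u00e0".encode("cp1252")   # b'd\xe9j\xe0', one byte per accented letter
print(is_valid_utf8(legacy))
print(to_utf8(legacy))
```

If the assumed charset is wrong, the result will decode without error in many cases and simply contain the wrong characters, which is why trial and error against known text is still needed.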
MySQL has the utf8_unicode_ci, where everything looks the same when they enter it? Yeah. Do I need to do some kind of validation, for example, for my usernames when they are entered in Drupal? Yeah, sure, we can use that. So we can do that. Yeah. Or if I use that schema, do I need to worry about doing that validation? But it won't be enough. You still need to do it. It won't be enough. Yeah, you still need to do normalization. So yeah, with this Normalizer, normalize. Yeah.
So is there anything out there, anyone doing it already in Drupal, you know, this kind of thing? Don't think so. Anyone thought about it? Because we don't need it, actually. Yeah. Even when you deal with mostly Western characters, maybe you don't need it that often, yeah. But you know, apple.com, you can spoof it really easily, so people are going into that business now.
On one of the first slides, you showed that there was a bunch of private characters. Yeah. Is that what Twitter is using for, like, their custom emojis, or do you know? Might be, but what they can do is use those specific code points, then create a font, a web font, and map those numbers to some visual glyph, and done. Okay, cool.
Yes. Yeah, yeah, yeah. In fact, yeah. In fact, Unicode has been built with history. So every pre-existing character set is kind of just mapped to some area in the Unicode space. They didn't recreate things, and they didn't make it ideal like they would if they were starting from scratch. That's not how it was built. So that's why the standard is full of compromises. Okay, thank you. I would really like to thank you so much. Thank you.