I'm Jan Lelis, and I'm here to tell you about funny Unicode characters you should know about as a programmer. I'm also writing a Ruby blog where I talk about funny things in Ruby that you maybe know or can check out.

So the first character I'm telling you about is this one. Does anyone have an idea what it could be or could mean? It's one character, even though it's made out of three different strokes, and that's the thing: it's a Thai character, and it's combined out of multiple sub-characters. To understand this, we have to understand what Unicode is about.

On their website they say Unicode provides a unique number for every character. This means it's not about the encoding, like how the actual bytes look, but about which character has which number. The idea is to refer to all characters at once, not by switching between encodings, but by having one unique standard that provides a number for every character. This is just a screenshot of Wikipedia, of the page where I can select all the encodings that are not Unicode. Unicode doesn't replace every one of them, but a lot of them.

Often when we're talking about characters, what we actually mean is code points, and a code point is this unique number which every character in Unicode got. A character can also be made out of multiple code points, which is what we've seen at the beginning, and this is often called a grapheme cluster: the user-perceived character. In Ruby we are in luck that we can work with those characters, because there's the \X syntax in regular expressions, and this... oh, can you read it? Thumbs up, thumbs down, can you read it? Can we change the contrast maybe? Maybe... oh, now it's better.
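The slide under discussion is not part of this transcript. As a rough sketch of the idea, the following Ruby snippet compares code points with grapheme clusters. The Thai sample string (U+0E15, U+0E34, U+0E4A) is an assumption chosen for illustration and may not be the character shown in the talk; String#grapheme_clusters needs Ruby 2.5 or later.

```ruby
# One user-perceived Thai character built from three code points.
# NOTE: this sample is an assumption, not necessarily the slide's character.
thai = "\u0E15\u0E34\u0E4A" # TO TAO + SARA I + MAI TRI

code_points = thai.scan(/./)  # the dot matches single code points
clusters    = thai.scan(/\X/) # \X matches whole grapheme clusters

puts "code points:       #{code_points.size}"
puts "grapheme clusters: #{clusters.size}"
puts "same as String#grapheme_clusters: #{clusters == thai.grapheme_clusters}"
```

Both regular expressions are run against the same string, so the only difference in the counts comes from what each one treats as a unit.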
All right, so yeah, the first example is what we probably all learned: that we can use the dot in regular expressions to match any character. It actually matches any code point. So the second version, with \X, is what we want to use in most cases, because it leaves this unit as one unit, as one character. And because it's so important, in Ruby 2.5 we are getting an each_grapheme_cluster and a grapheme_clusters method on String, which will return the same for us.

All right, okay, so I'm not doing this on my laptop but on a different machine, which is why this doesn't look the way it should, but this also illustrates a good point. The second character is an a with two dots above (ä), and it's a character which is really common in the German language, so I have it on my keyboard. But when I press this key, it won't output the combined glyph that we've just seen, which again is combined out of two code points; it will create a single code point which represents the same character. So how can we work with this?
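To make the two forms of the a with two dots concrete, here is a small sketch that prints the code points of both variants. The helper name hex_code_points is hypothetical and only exists for this illustration.

```ruby
decomposed  = "a\u0308" # a + COMBINING DIAERESIS (two code points)
precomposed = "\u00E4"  # LATIN SMALL LETTER A WITH DIAERESIS (one code point)

# Hypothetical helper: lists the code points of a string in U+XXXX notation.
def hex_code_points(string)
  string.codepoints.map { |cp| format("U+%04X", cp) }
end

puts hex_code_points(decomposed).inspect
puts hex_code_points(precomposed).inspect

# Both render as the same character, but the strings are not equal.
puts "equal strings? #{decomposed == precomposed}"
puts "grapheme clusters: #{decomposed.scan(/\X/).size} vs #{precomposed.scan(/\X/).size}"
```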
I mean, in the first example, it's not easily visible for you, but what you can see is that the first one, the combined glyph, is made out of two code points, and the lower one is just a single, different code point which represents the same character. I'm using the uniscribe tool here, which helps with analyzing code point data.

But what if we don't care about these different representations that a character can have? This is where an algorithm called normalization comes into play, which will transform the two-code-point version into the single-code-point version. It's included in the Ruby standard library since Ruby 2.2, and it's even automatically required, which is unusual for the standard library; there are only like three standard libraries where that's the case, and this is one of them. You can use it by just calling String#unicode_normalize. As you can see here, the first one is the two-code-point version and the second one is the single-code-point version, and when they are normalized, they are the same character.

So this is my third character. It's the Latin small letter o, and you might think: what's so special about the letter o? Actually, the special thing is that it's not the Latin letter o. It's a different letter: it's U+03BF, the Greek small letter omicron. How can this happen? The thing is, yeah, this is the real letter o. Wanna see the difference?
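The normalization step described above can be sketched as follows. It relies only on String#unicode_normalize and String#unicode_normalized?; the last lines are an added observation that normalization does not merge look-alike letters such as the Greek omicron and the Latin o.

```ruby
decomposed  = "a\u0308" # two code points
precomposed = "\u00E4"  # one code point

nfc = decomposed.unicode_normalize        # NFC (composed) is the default form
nfd = precomposed.unicode_normalize(:nfd) # NFD decomposes again

puts "NFC equals precomposed: #{nfc == precomposed}"
puts "NFD equals decomposed:  #{nfd == decomposed}"
puts "decomposed already normalized? #{decomposed.unicode_normalized?}"

# Look-alikes are a different problem: omicron stays omicron.
omicron = "\u03BF"
puts "omicron normalizes to Latin o? #{omicron.unicode_normalize(:nfkc) == 'o'}"
```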
Well, not the real one, but the Latin one, which we are more used to. If we look at it again, we see that it's a similar-looking character, and the letter o is a record holder: it has like 75 characters that look really, really similar. The Unicode Consortium has a list where you can look up those characters. Some more examples: there's a character with two question marks, a single code point which means two question marks, and of course it's easily confusable with two single question mark characters. Also, you don't have to go to higher Unicode ranges; you can start with ASCII. The lowercase letter l and the number 1 are also pretty confusable. So it's a problem that won't ever be totally solved; it's always a visual thing. I also put the two Cs here, which look totally the same, because the Cyrillic C and the Latin C are the same glyph. It's on purpose that they look identical, but they are different code points.

This can lead to all kinds of security issues; Spotify had a good write-up on one. So what do we do now that we know there are confusable characters? I'd say you can use this tiny micro gem. It's called unicode-confusable, and it uses an algorithm provided by the Unicode Consortium to check if two strings are confusable.

All right, so this character looks a little like an I, but actually it's a capital I which has another dot above. My first impression is: why do I need an extra dot above? The thing is, we think that the correct outcome of upcasing an i is just a stroke without a dot. But what about Turkic languages like Turkish? They put a dot above. So upcasing, or downcasing, is a language-dependent operation, and you need the context of the language to do it correctly. In Ruby we have the upcase method, and before Ruby 2.4 it wasn't Unicode-aware at all, so again, the letter ä, the a with two dots: it won't upcase it at all. Since Ruby 2.4, however, we can
properly upcase it. We can also get the old behavior by passing in :ascii as a parameter, and we can even pass in some locale information, for example :turkic, to get the correct dotted I that we want. upcase is not the only method that supports this additional argument; you can also pass it to swapcase or capitalize.

Although it's pretty good that this made it into Ruby, be aware that there's another way of downcasing a string, called case folding. It's also supported as an argument if you just use :fold as the keyword, but it uses a different algorithm. In this example, it's the German letter sharp s (ß). If we are using the normal downcasing, it will return the lowercase version of it, but if we are using the case folding mechanism, which is meant for comparing strings, we get another representation of this letter, which in this case is just two s.

There's more to be careful about. I've told you the Lithuanian option is possible to pass into downcase and upcase. It's possible to pass it in, but it's not working yet, so don't do this yet. Also, we haven't covered the whole world: for example, in the Netherlands they have a really interesting way to do an uppercase IJ, and there's no option to do this in Ruby. And there are also the String#casecmp methods, and they look really, really similar, but casecmp? uses the case folding that I just told you about, and the other one, casecmp, will only use plain ASCII. So be careful there.

Okay, now for some more fun characters. This is a control character called next line, or sometimes called new line. I don't know how you do new lines, but there are plenty of ways already, depending on your operating system. Normally you would use maybe just a line feed character, or the carriage return character, or a combination of both. But because of this confusion, another character was introduced to give you a line break, which is
called next line, but it wasn't so successful because adoption was not so good. However, on my machine at least, it works: if I just print it out, it will give me a line break. But check on the systems you are on. Don't use this character, I mean, no one does, but you should know that it exists, and if you are checking for line breaks, you should know this could be an option for a line break.

The next line character is interesting also for another reason. It's not well visible, but this is an ASCII table, and the first two rows, which means the first 32 characters, are special characters called control characters, because they don't render a normal glyph; they do things like line breaks or similar stuff. They were introduced when ASCII was introduced. But it's not only the first 32 characters: later, people wanted to have more control, or wanted to do more stuff, and needed more control characters, so they introduced some more. I'll show you this next page, which is better readable. The original set of control characters is called C0 and the new one is called C1. They are not really supported anywhere. Unicode includes them because it's compatible with them, but except for the next line character that I've shown you, they essentially do nothing. So if you meet them in the data that you are parsing, be very careful about what their purpose is.

So how can we work with this in Ruby? The next line character is not matched by the normal regular expression which matches spaces, \s, but by this one using the Unicode property syntax. Control characters can also be matched with the Cc property, and you could also use the characteristics gem for further research on whether it's C0 or C1.

So, next section. This character is really wide: it's the three-em dash. The thing is, if you are using it, there's a chance that you might break software. For example, if I'm using it on the terminal, you can see that the cursor is misplaced and
that it's not detected that it's so wide. Also, if you change your name on Twitter... you cannot see it here, but it totally breaks the layout. Maybe you can see it in the next one. I wanted to change it back, but it wasn't possible, because the section for changing it back was also gone. And it's not the only character; there are some more. They also looked totally different on my machine, because how it is displayed is also really dependent on the platform you are on. Here it says not well defined, but it's more like it's not defined at all, which is of course confusing for fixed-width environments like the terminal.

It's especially true for a lot of Asian characters. So the Unicode Consortium started to assign whether it's one terminal space or two terminal spaces to a lot of Asian characters. But even there, there's a category of characters which is called ambiguous, and they can be either single or double width, so the user of the software or the library author has to put in an option to display them as one or two terminal spaces. You can use the unicode-display_width micro gem if you want to have proper checks. The first character you see is one which is only one space wide, and the second is two spaces, but the third one, this dot, is one of these ambiguous characters, so you have to pass in the option whether you want to have it displayed as one or two spaces.
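To illustrate why an option for ambiguous characters is needed, here is a deliberately naive sketch. It is not the algorithm of the unicode-display_width gem: the method name naive_display_width, the hand-picked list of wide ranges, and the single-entry ambiguous list are all assumptions made for this example, and real code should rely on proper East Asian Width data.

```ruby
# Deliberately naive: a few hand-picked "wide" ranges (kana, CJK ideographs,
# Hangul syllables, fullwidth forms), NOT the real East Asian Width data.
WIDE = /[\u3040-\u30FF\u4E00-\u9FFF\uAC00-\uD7A3\uFF01-\uFF60]/

# Hypothetical stand-in for the "ambiguous" category: only MIDDLE DOT.
AMBIGUOUS = /\u00B7/

# Sums a per-code-point width; combining marks, emoji and control
# characters are ignored here and simply count as 1.
def naive_display_width(string, ambiguous: 1)
  string.each_char.sum do |char|
    if char.match?(WIDE)
      2
    elsif char.match?(AMBIGUOUS)
      ambiguous
    else
      1
    end
  end
end

puts naive_display_width("\u2680")               # a die face, narrow
puts naive_display_width("\u4E00")               # a CJK ideograph, wide
puts naive_display_width("\u00B7")               # middle dot, treated as narrow
puts naive_display_width("\u00B7", ambiguous: 2) # middle dot, treated as wide
```

The keyword argument mirrors the point made in the talk: the width of an ambiguous character cannot be derived from the character alone, so the caller has to decide.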
All right, now that we know about this, let's go to this character, which is just a placeholder for an invalid character that is forbidden to use. There are two kinds of code points that are not allowed in Unicode, and they are encoding-related. As I've said, Unicode itself is independent from the actual encoding, but there are three popular encodings, UTF-8, UTF-16 and UTF-32, and you can use whatever encoding you want to. The UTF-16 encoding needs a special area of code points to be functional, so this section from U+D800 to U+DFFF is blocked, because UTF-16 needs it to represent more Unicode characters. The thing is, in UTF-8 and UTF-32 you can actually have this code point value encoded, but it's just not allowed.

The second kind of invalid code points is too-large ones, which is again related to UTF-16. The highest code point you can represent with UTF-16 is U+10FFFF, which is about one million, and you could represent much higher code points with the other encodings, but the Unicode Consortium said no, this is the upper limit. I will read it out to you: this is just an example showing that in UTF-16 normally two bytes are used to represent one character, and if it goes to four bytes, it needs those special surrogate code points.

Because they are not allowed, Ruby will forbid you to create them within string literals and raise an error, but if you really need them, you can use different techniques to create such data. Ruby also gives you some methods that you can use to work with strings and encodings. One of them is the valid_encoding? method: if your data contains those forbidden code points, this method will return false. You also have the scrub method on strings, which will replace all of these characters with the replacement character that you've seen at the beginning.

Besides all these illegal characters, there's also a section of characters which are legal, which you are allowed to represent, but which are not assigned to
anything. For example, U+10FFFF, which is the highest code point available in the whole standard, is not assigned to anything, and it will never be assigned to anything. There are three kinds of code points that are not assigned to anything, and one of these has the not-so-good name noncharacters: there are 66 code points in Unicode which are not assigned and just will never be assigned.

There's also a huge section of private-use code points, where you can encode things like fantasy languages or logos of operating systems. Yeah, okay, you cannot see any here, but the lower one, U+F8FF, is the Apple logo, so on Apple computers it will display as an apple, and the one above is the Ubuntu logo, so on Ubuntu machines you will see the Ubuntu logo there. But they are not defined by the consortium, so they won't be displayed correctly on different platforms.

And then there's also a huge portion of code points which hasn't been assigned yet, which is called reserved. Let's see if this graph works. No, it doesn't. But what it shows is that there are a lot of reserved code points: most of the million code points are not assigned yet, so there's plenty of space for future assignments. As for the private-use ones, there are about as many standardized code points as there are private-use code points.

So, back to how we can work with them in Ruby. Again using this Unicode property syntax in regular expressions, you can match noncharacters. It's a little bit cumbersome; you really have to write out code point there. You can match private-use code points with the private use property. And for the reserved ones, if you don't care about the noncharacters, you can just use the unassigned property, and if you do care, you have to do this negative lookahead or positive lookbehind in the regular expression to make it work.

Okay, so in this section, it's not about the brackets; it's about the space in between, which looks like a usual space, a white space, but it's not the usual white space: it's a no-break space. But again
you have no way to see that it's a no-break space. And that's not the only alternative space you have in Unicode: there are tons of them. Only some are matched and considered white space by the consortium; there are also a lot of blank characters which are not considered white space. This can lead to a lot of problems. For example, just recently there was a fake WhatsApp on the app store, and it looked like it was published by the WhatsApp company, but they put a space at the end of the name, so it just looked like it was published by WhatsApp. Also, in the application list where you could maybe have spotted it, they would just use blank strings, and this is probably related to not checking for blank code points correctly.

I put together some more. In between these there's the Mongolian vowel separator. This is a wider one; it's the em space. This is the zero width joiner; we will need it later. What's this? It's the invisible plus; on my machine it's actually invisible. This is the Braille pattern blank. Braille is the writing system for blind people, and it's little dots in the picture of the character, and of course you also need a way to represent no dots, which is why this one is blank. It is also an example of a character which is not matched as a white space character. You have this one, which is the space for Asian characters, and this one is also a space, if I understand correctly: a musical note without a note. It's pretty new; that's why it's still displayed as a tofu.

Besides all these white spaces, there's a class of characters called ignorables, which just render nothing, or more correctly: if your display system doesn't know how to display these characters, it should just render nothing. And it's not only a few characters: the whole code point range from U+E0000 to U+E0FFF is considered ignorable. Some examples of characters that you can find there are the so-called variation selectors. Their purpose is to have a visual variation on the preceding character, which is
especially useful, for example, with Chinese letters, where it's not important if in the top corner it's two strokes or three, and this variation selector can then tell the display engine to render it correctly. But the character itself is totally invisible. More famous is variation selector 15, which makes some image-based emoji turn into text-based ones on some platforms, and variation selector 16 is the other way around. But again, it really depends on the operating system. For example, a lot of the smileys are always shown as a picture on, for example, macOS, while on other systems they might always be shown as a text-based emoji, and there's no way to change them. It's only a few characters that you can use a variation selector on.

Another ignorable kind of character is the tag character, which was introduced to create language tag sequences. But you shouldn't do it; they are just deprecated, so don't create a language tag sequence. It's an invisible sequence in your text which describes something, so it's the whole of ASCII encoded again, but as different characters. Don't use them. Here we can see them. So what have you learned?
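As a rough sketch of how some of the invisible characters from this section behave under different regular expression classes in Ruby, the snippet below classifies a hand-picked sample. The SAMPLES table and the classify helper are hypothetical names for this illustration; the outcome for individual characters can depend on the Unicode version your Ruby ships with.

```ruby
# A hand-picked sample of blank or invisible characters (illustrative only).
SAMPLES = {
  "SPACE"                    => "\u0020",
  "NO-BREAK SPACE"           => "\u00A0",
  "EM SPACE"                 => "\u2003",
  "IDEOGRAPHIC SPACE"        => "\u3000",
  "ZERO WIDTH JOINER"        => "\u200D",
  "INVISIBLE PLUS"           => "\u2064",
  "BRAILLE PATTERN BLANK"    => "\u2800",
  "VARIATION SELECTOR-16"    => "\uFE0F",
  "TAG LATIN SMALL LETTER A" => "\u{E0061}",
}

# Hypothetical helper: reports which of three regex classes match a character.
def classify(char)
  {
    ascii_space:   char.match?(/\s/),         # ASCII white space only
    unicode_space: char.match?(/\p{Space}/),  # Unicode white space
    ignorable:     char.match?(/\p{Default_Ignorable_Code_Point}/),
  }
end

SAMPLES.each do |name, char|
  puts format("%-26s %s", name, classify(char).inspect)
end
```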
There are a lot of invisible characters, and it's not all white spaces. Some are matched by the normal regular expression of just \s, some more are matched by using the Unicode property syntax to match white spaces, and you can also match ignorable characters, again using this rather cumbersome code point property. Some aren't even matched by anything, where you can use the characteristics gem again, which has some more support for detecting if a character is ignorable or blank.

Okay, this is my last section; it's the emoji section. Apparently, on this computer, the code point sequence that should show a cook, a male cook with a light skin tone, isn't working correctly, but if it were working correctly, it could look like this. This is how it looks on Twitter, and yeah, it's constructed out of four code points. I won't go into detail too much about emoji sequences, because I could do a whole talk about emoji sequences alone. There are like seven ways of creating emoji, and the most complicated one is the one that uses the zero width joiner, which we've learned about in the previous section: it's an invisible character that combines two emoji.

And there's this thing with valid versus recommended emoji. There are a lot of emoji, and theoretically you can just mix any of them. There's this XKCD which had this fun idea of combining all kinds of emoji, and then Mark Davis, who's from the Unicode Consortium, said: oh, actually, if you are using the zero width joiner, they are all valid emoji. The problem, though, is that it won't be displayed in WhatsApp or Facebook or whatever, because none of the vendors knows about them. This is what the emoji standard is about: it recommends some sequences, which the vendors will then implement, or the other way around.

Another type of emoji are country flags. The country flags themselves are rather easy: you just use two regional indicator symbols. Like the flag of Portugal: it's not encoded directly as a single code point; you make it up by using
the P and the T regional indicator symbols. However, recently sub-regions were also introduced, and you would guess that they just use the regional indicator letters again, but no. Remember that I said tag sequences are deprecated? They got undeprecated, and now you can use tag characters in another manner to create sub-region flags. This leads to strange behavior. For example, this Twitter user was wondering: like a week ago or so, Twitter allowed you to have like 50 characters in your username, and they wanted to use it with Welsh flags, but could only paste in like three. So what was the problem? Yeah, it's really complicated to construct such a sub-region flag. It's a normal emoji for a black flag, then you do your tag sequence with these invisible, recently undeprecated tag letters, and then you have the cancel tag to end the sequence, and there you have your Scotland flag. This is also the reason why at the end there's a black flag: probably they just pasted some code points, and it was enough to display this black flag, but not enough tags.

Oh yeah, and to work with this in Ruby, you can use another micro gem. It's called unicode-emoji, and it contains a regular expression which is, like, huge, and it will match all types of emoji characters, even if they are not usable on your platform yet. It also contains the list of recommended emoji.

Okay, I know this was a lot of content, so let's have a short recap. We learned about grapheme clusters, which is the term for user-perceived characters, and if you want to match any character, you probably want to use the \X syntax and not the dot in regular expressions. With Ruby 2.5 we get... oh, it's not String#each_grapheme, it's String#each_grapheme_cluster, actually. Then, normalizing characters is a thing, to have only one representation of a character and not multiple; it's part of the standard library and called unicode_normalize. Confusables can be detected with the unicode-confusable gem. We have case mapping
starting with Ruby 2.4, and the only locale option that works is the Turkic option, but there's more to come, probably. You have control characters that you should know about, and a lot of them are not specified in any way. Displaying characters on a fixed-width terminal is complicated. If you're using RuboCop, then you're probably already using unicode-display_width, so feel free to use it in other software as well. To detect invalidly encoded code points, you can use the valid_encoding? method; to just replace them, you can use the scrub method. There are a lot of code points which are allowed to be part of your data but which are not specified, and you can match them with the unassigned syntax and the private use syntax. There are a lot of invisible characters, not only white spaces but also ignorable ones, which don't even render a space but render just nothing. And there are emoji, and emoji are difficult to build up because there are so many ways to represent what you want to represent. All right, that's it, and I hope you enjoyed it.
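Two of the recap points, case mapping and encoding checks, can be tried out with core Ruby alone. The following is a minimal sketch assuming Ruby 2.4 or later; the variable names are made up for this example, and the invalid sample is the UTF-8 byte pattern of the surrogate U+D800 written with byte escapes.

```ruby
a_umlaut = "\u00E4" # a with two dots
sharp_s  = "\u00DF" # German sharp s

full_upcase   = a_umlaut.upcase         # Unicode-aware since Ruby 2.4
ascii_upcase  = a_umlaut.upcase(:ascii) # old ASCII-only behavior
turkic_upcase = "i".upcase(:turkic)     # capital I with dot above (U+0130)
folded        = sharp_s.downcase(:fold) # case folding, meant for comparisons

# casecmp? uses Unicode case folding; casecmp only folds ASCII letters.
folding_equal = a_umlaut.casecmp?(full_upcase)
ascii_compare = a_umlaut.casecmp(full_upcase)

# Encoding checks: a surrogate code point encoded as UTF-8 bytes is invalid.
broken   = "ok \xED\xA0\x80"
scrubbed = broken.scrub # replaces the invalid bytes with U+FFFD

puts [full_upcase, ascii_upcase, turkic_upcase, folded].inspect
puts "casecmp?: #{folding_equal}, casecmp: #{ascii_compare}"
puts "valid before scrub: #{broken.valid_encoding?}, after: #{scrubbed.valid_encoding?}"
```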