So I'm Aaron Lasseigne. I have a hard-to-pronounce last name. Really it's not so much hard to pronounce as hard to read. And I'm going to do a talk on Ruby and Unicode and what could go wrong. So November 3, 2017, which is conveniently seven months ago to the day, there was a post in the r/Android subreddit. The user had noticed something odd in the Google Play store: there were two copies of WhatsApp floating around. The one on the left — you can't quite see it — has like a billion downloads, right? It's WhatsApp. The one on the right has a million, though, and it's called "Update WhatsApp Messenger." And if we look a little closer, you can see that they're both from "WhatsApp Inc." So did WhatsApp release some new thing and not talk about it? What happened here? Well, you opened up the app and it looked like this. This is clearly not their new strategy. Someone had gamed the system and put up ads. And it's reasonably harmless. But you can imagine if this were, say, Chase Bank, and it showed you a fake login. It steals your credentials and says, oh, sorry, the service is down for the moment, come back later. And then later you find out you're losing a bunch of money. So how did this attacker manage to foil Google? This is an industry giant. They have money. They have talent. It was probably something really, really impressive, right? Like people typing — and I always love this — two of them on the same keyboard. That's how it works. It's just fantastic. Good job, NCIS. In reality, though, they were foiled by a non-breaking space. That's all it took. They threw a non-breaking space at the end of the name and it slipped by Google's checks. So it doesn't take much. Now you might be thinking: we're Ruby developers, we've got strip, we'll just rip that out. But the reality is strip would not save you in this situation.
And part of the reason for that is that, fundamentally, strings are liars. They show you one thing, but under the hood they can be something different. And I mean, we're kind of at fault for that. We don't want to know how the sausage is made. We just want to eat, all right? We want to see what the thing looks like. We're not concerned with all the underlying details most of the time. And the thing that masks this is encodings. All the ones and zeros get translated into a's and b's and fives and sevens and all that with encodings. Most of you have probably heard of this, but I do want to go over a few quick terms just to make sure we're all on the same page. So there are coded character sets, sometimes just called character sets. A character set is essentially just a list of characters with code points associated with them, and a code point is just a unique number. So like a is one, b is two, c is three — however they want to lay it out. A character is anything we decide we need in order to write anything we want to write. So obvious things like letters, numbers, punctuation, symbols like the dollar sign. But there are also a lot of hidden characters that help with formatting and things like that, and those are, as far as this is concerned, characters too. And the glyph is the actual visual representation. So a character is an abstract concept, and a glyph is the thing you actually see on the screen. A character set might look something like this. You can see sometimes things are grouped, sometimes they're not — kind of all over the place. If we look at an individual entry, the concept of a lowercase a is represented, in this case, by the code point 97. We can get this in Ruby by calling .codepoints, and it will give us back 97. You can also see here the example of the glyph versus the abstract character concept. I'm intimately familiar with a's.
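That code-point lookup is easy to poke at in irb:

```ruby
# Code points are just numbers assigned to abstract characters.
p 'a'.codepoints               # => [97]
p 'a'.ord                      # => 97
p 97.chr(Encoding::UTF_8)      # => "a"
p 'abc'.codepoints             # => [97, 98, 99]
```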
So for the a on top, the designer decided to go with that kind of hooky over-the-top thing, and on the bottom it's more like an o with a bar slammed into the side of it. But these both represent the same abstract concept of a lowercase a. Most of you — probably all of you — have heard of ASCII, the American Standard Code for Information Interchange. It was first published in 1963, and it didn't have lowercase letters until 1967. So everyone yelled for a few years in there. You can stop complaining about your grandparents; they don't know any better. And it only used seven bits. We all work with eight-bit bytes usually, but ASCII was only seven bits when it was originally designed. That's because it was designed to work on machines like this. This is a Teletype Model 33. It's essentially a keyboard, a printer, and a modem slammed together. It would let you type messages and send them to other teletype machines or groups of machines. So think Slack, back in the day. Right. And it meant that ASCII included a lot of control codes to handle that transmission side. This is where you get things like tabs for tabular data, right? It meant the printer could know where to line things up. There are vertical tabs, which we don't use much anymore — although these days tabs are mostly just an argument about tabs versus spaces anyway. So ASCII included a lot of interesting control codes. This is the inside of one of those machines, and I point it out because there's a piece of it that I love. If you've ever been in your terminal and backspaced too far, or tabbed without a completion and heard a ding, or seen your screen flash — the visual bell — there's literally a bell in this machine. You could send a bell signal and ring it on the other side. So as computers grew and evolved, we moved toward eight-bit bytes. That was a debate for a little while, and people settled on eight.
Because why not? I don't know what the actual reason is. But ASCII needed to move up. So they took ASCII and tacked a zero on the front. And that's great — everything's compatible. But this also opened up a world of additional characters, because if we flip that leading zero to a one instead, we get a whole other 128 characters. And this is a little-known fact, but it turns out there are many languages in the world, not just English. And these other people were like, hey, can we use computers too? Because that'd be great. And actually write in our language. So they flipped that bit to one and started throwing characters in there. Individual languages — or sometimes groups of languages, if they shared characters — could have their own encodings. And this led to an explosion of extended ASCIIs. So you've got ASCII, and then ISO 8859-1, which, if you've been doing this a while, you might recognize — it's also known as Latin-1, if that sounds more familiar. So that's great. And the ISO people were like, this is awesome, we're gonna make a bunch more. We're gonna skip 12, because that's a crap number. I don't know why they did that. Then Windows came along, and Microsoft, in their fantastic wisdom, were like, we're gonna make something that's super close, but we're gonna add smart quotes — and make it just different enough that it breaks across everything. And for a long time, software had to actually try to guess, because people would declare the wrong encoding, and so it would try to fix it for you. And then they did this a bunch of other times. TIS came out — this is the Thai Industrial Standard. They were like, hey, we need this, we'll do it. And ISO was like, no, we're gonna steal it. So ISO 8859-11 is an exact copy, but just to spice things up, they added one more character — which, by the way, was a non-breaking space.
There's the KOI8 series, which covers Russian-type languages. And this is the mid-80s, and ISO is like, hey, this looks interesting. You're not gonna mess with Russia in the mid-80s. And they're like, hold my beer, because we're doing it. So they made ISO-IR-111, which is an exact copy. So now you've got that fun. And this is just a small snapshot; there's a lot more. Things went bad fast. So what you would end up with is: you've got "smörgås" in ISO 8859-1. You send that to someone with a Mac. It attempts to read it as Mac OS Roman. All the ASCII characters line up, because these are all supersets of ASCII — those parts work — but the other characters are different. And you end up with this. This is called mojibake. It's a Japanese word meaning character transformation or character changing, and it's what you get when an encoding goes awry. So, mojibake. It's not the same thing, by the way, as those little white boxes you sometimes see on screen — those mean there's no glyph to represent that character. That's usually called tofu. So don't mix them up. The point is, this sucked, right? We had tons of encodings. We had some that were competing for the same space. They weren't composable. So if you wanted to write a paper in French about the Russian language — good luck. Wasn't gonna happen. So along comes Unicode, the universal coded character set. Their design was basically a superset of ISO 8859-1, because that was super popular — which also makes it, by extension, a superset of ASCII. So it was really backwards compatible. And their goal was to handle anything we could need written. It's not just languages, but also math symbols, music symbols, emoji. The only thing they won't put in there is made-up stuff. So no matter how many times you try, you will not get Klingon, Dothraki, or Elvish — none of these will appear. And apparently people have tried. It's run by the Unicode Consortium. They're the group that keeps all this stuff going.
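That smörgås mishap can be reproduced directly in Ruby, which ships with both encodings. (The exact garbled glyphs you see will vary by font; the point is that the accented letters come back wrong while the ASCII ones survive.)

```ruby
# Write "smörgås" as Latin-1 bytes, then misread those same bytes as Mac OS Roman.
latin1  = 'smörgås'.encode('ISO-8859-1')
garbled = latin1.force_encoding('macRoman').encode('UTF-8')

puts garbled   # mojibake: "sm" survives, the ö and å do not
```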
They've got a group of members. It's a pay-to-play system, so you pay to be on the board. You'll recognize a lot of the companies up here, I suspect — including Google. So they're intimately familiar with this, which just shows it happens to the best of us, I guess. The current Unicode version is 10. The first was released in 1991, and they iterate about yearly at this point. And the Unicode character set is different from the actual encodings you've probably heard of — the encodings are implementations of that character set. Three of the most common are UTF-8, UTF-16, and UTF-32. UTF stands for Unicode Transformation Format, and the number after it is the code unit size: the minimum number of bits needed to represent one character. It's also the step size — if you need more bits, you have to increment by at least that much. So the minimum size in UTF-8 is eight bits. This makes it ASCII compatible: you can switch the encoding, change nothing about the bytes, and it'll read just fine. But you can also go up to four code units — 32 bits — to represent the whole of Unicode. UTF-16 uses 16-bit steps, so the first one's gonna be 16 bits, and if you need more, you add another 16. And UTF-32 is a fixed-size encoding. They did this so you'd have a variety of options available. UTF-8 is almost always the most space-saving of the group — there are a few cases where UTF-16 is actually slightly smaller, but generally UTF-8 will save more space on disk and in transmission, and it's backwards compatible. On the other hand, if you need to index into some distant part of a UTF-8 string, you have to walk the string and check, because the characters are variable width. With UTF-32 you can do the math and jump right to where you need to be, because they're all a fixed length. And then UTF-16 is just terrible — the worst of both. When you use UTF-16 or UTF-32, you're using multiple bytes for each code unit.
So you start running into problems with byte ordering. Some systems order bytes most significant to least significant, which is what you're probably used to seeing — that's big-endian. And flipping that, you get little-endian, which is least significant byte to most significant byte. So for systems to be compatible, there's a BOM — a byte order mark. It basically says what order the bytes are in, so a system can look at it and say, okay, you sent this to me in big-endian, I read little-endian, I'll go ahead and flip everything so I know I'm doing this correctly. We can see this in Ruby: if we ask for the encoding of 'a'.encode('UTF-16'), it tells us it's a dummy encoding. Basically it's saying, I claim to be UTF-16, but under the hood it's really one of the other two. I didn't think "dummy" was a great name, though — I'd have gone with "fake encoding" or something, I don't know. There are also UTF-16BE and UTF-16LE, for big-endian and little-endian. The byte layout for big-endian is — remember, earlier the a was 97 — so they just slap a zero byte on the front, done. The other direction flips the two. So what happens if we just say UTF-16? We actually get four bytes, because we get that BOM attached to the front — the byte order mark that tells it which way to go. This also applies to UTF-32. Most of you will work in UTF-8. It's the most popular: space-saving, backwards compatible, all those good things. It was originally presented in '93 by Ken Thompson and Rob Pike — these names might sound familiar. More recently the two of them, along with another gentleman, did the Go programming language, so that might be where you've heard of them. They've done a lot of good stuff. In the early 2000s — this is a graph of web page encodings, and the red line is ASCII — ASCII was the dominant force. But obviously it's not for everybody. UTF-8 is the blue line skyrocketing up past it.
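All of that endianness and BOM behavior is visible straight from the bytes:

```ruby
p 'a'.encode('UTF-16BE').bytes   # => [0, 97]           high byte first
p 'a'.encode('UTF-16LE').bytes   # => [97, 0]           low byte first
p 'a'.encode('UTF-16').bytes     # => [254, 255, 0, 97] BOM (0xFE 0xFF), then "a"
p Encoding::UTF_16.dummy?        # => true
```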
And this chart ends in 2012. I don't know what today's numbers are; I'm guessing it's probably even more lopsided at this point. But they crossed paths in 2008 — that's when UTF-8 took over as the major encoding on the web. It's also about the time Ruby introduced it as the default encoding. This was the 1.8 to 1.9 switch: we went from ASCII to UTF-8 as the default. So that means Ruby has Unicode support, right? Maybe? Let's take my name, and I'm gonna get fancy and put a little umlaut at the front. If we capitalize this in 1.9, even though UTF-8 is the default, nothing happens. It wasn't until Ruby 2.4 that this got fixed — so pretty recently, really, in the scheme of things. They also fixed capitalize, upcase, downcase, swapcase — all the things related to changing case. Upcasing is an interesting situation because it's actually contextual. Most of the people in this room would uppercase i into a capital I, like this. But if you were writing a Turkic language, it upcases into a capital I that still has the dot on top. So there are things Ruby will never be able to handle correctly by default; sometimes you just have to understand the context of what you're working with. And in addition to :turkic, you can pass :ascii, which gives you the old behavior, or :lithuanian, which doesn't do anything yet but hopefully someday will. That's what they say. I don't know why they didn't just leave it out. Here's another character — the German sharp s, ß. Upcasing this produces two capital S's. This is actually correct; it's a legitimate replacement — "SS" is considered the same. We can downcase that — if anybody's not familiar, underscore in irb is the previous value — and we get two lowercase s's. But if we just call downcase on the sharp s itself, it stays the same. It won't downcase. You can see where this might lead to some comparison issues. Unicode, aware of this, has what's called case folding.
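The casing behavior described above, in irb (Ruby 2.4+):

```ruby
p 'ärger'.upcase        # => "ÄRGER"   non-ASCII casing works since Ruby 2.4
p 'i'.upcase            # => "I"
p 'i'.upcase(:turkic)   # => "İ"       dotted capital I
p 'straße'.upcase       # => "STRASSE" sharp s expands to SS
p 'ß'.downcase          # => "ß"       plain downcase leaves it alone
```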
So you can pass :fold to downcase, and it will actually convert ß to that double-s form. There are a number of other characters this works with, too. And it's important because this is what happens when you use casecmp?. These will casecmp? to true, even though they're different, because it uses case folding. However, the non-predicate version, casecmp — note the lack of the question mark — is ASCII-only. It would return zero if they were equal. So that's a fun little watch-out. Now, I've been talking about all this stuff, but whitespace is what originally got this whole thing going, so let's talk about whitespace for a minute. Most of us are used to thinking of whitespace in terms of space, newline, tab — ASCII things. So here's a picture of all the ASCII whitespace. It's a little bit. Here's all the whitespace in Unicode. You can see there's a lot more — I'm not even sure this is the full list, if I'm being honest, but it's a massive list of whitespace you can use in Unicode. They have a lot of things in here. You see em quads and en quads and line-separator stuff, a Braille space, right? So you could sub that in for a normal space and trip people up with it, and it will be reported as a different character as far as Ruby's concerned. So, back to our strip idea. Let's run strip. Well, strip only handles the ASCII side; it doesn't do any of the rest of this. There's been an open ticket about it for two years, and they haven't resolved it yet. I don't know what's gonna make that happen, but for now, you're on your own. There's also a whole list of invisible characters. Those are fun — you can jam those in all kinds of places. And you can imagine, even if we remove some of these, it's really hard to pick some of these out. Strip wouldn't be enough anyway; you'd have to pull them out of the middle, because the hair space is about the width of a hair.
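A quick irb session showing the folding comparisons and strip's ASCII-only blind spot. The `\p{Space}` gsub is my own sketch of a workaround, not code from the talk:

```ruby
p 'ß'.downcase(:fold)             # => "ss"
p 'ß'.casecmp?('SS')              # => true  (Unicode case folding)
p 'ß'.casecmp('ss')               # => 1     (byte-wise, ASCII-only; 0 means equal)

name = "WhatsApp Inc.\u00A0"      # trailing non-breaking space
p name.strip == name              # => true  -- strip didn't touch it
p name.gsub(/\p{Space}+\z/, '')   # => "WhatsApp Inc."
```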
So you can put that in a lot of places and have it not be noticeable. And yes, there's our non-breaking space, by the way, which did not get caught by strip. So ideally we could do something like use tr: get a list of whitespace characters into a constant, pass strings through it, turn them all into the regular spaces we're comfortable with, and squeeze them down. Maybe we've helped things out a little bit — made it less likely for this to happen. But we still have those invisible characters I mentioned. And I wanna go on a tangent for a second and tell this fun story I found about invisible characters. So there's this guy, Tom, and he's part of a group of gamers. They play a variety of games together, and they have a private forum they use to communicate — to talk about, you messed that up, we should have done this, whatever. Well, somebody was posting this stuff externally. And Tom got pissed. He's like, I wanna figure out who this is. So he makes this plan. He says: I'm gonna take the username — let's say "jerkface" — and convert it to binary. Then I'm gonna use a zero-width space as a one and a zero-width non-joiner as a zero, and insert that into every single string of forum output. So they went and found a copied version, ran this process backwards, figured out who the user was, and permanently banned them. I think there are two lessons here. One: it's hard to know what you're copying, because these characters can be inserted anywhere. And two: just don't mess with Tom. This is over some video game stuff, and he — well, that's what I took away from it, anyway. So maybe we can do something similar here. We can delete the invisibles — maybe keep them in a constant, and make sure we update it with anything we find along the way, because like I said, it's hard to get all of these pinned down. So whitespace is maybe fixed up, invisibles are gone. We still kinda have some issues, though: é's length is one — or two.
Both of these are legit. It depends on how we represent the character. There is a single character, lowercase e with an acute accent, but you can also represent it as a lowercase e followed by a combining acute accent diacritical mark. It will display the same no matter which one you use — like the one on the left. They look identical; you can't tell them apart. Ruby, however, will report the one on the right as two characters. So if you do something like this — I just kind of jammed a bunch of combining marks on there, and it kind of worked out, I got two smiley faces in there totally by accident — if this were part of a real language, you would look at it and say: this is one character. A crazy character, but one character. Ruby will tell you it's 21 characters long, because each of the marks gets counted along the way. We can help this with graphemes. This is basically the idea that what a human would view as a single character is what we want the computer to report as a single character. So it's a way to handle some of this stuff, and I'm gonna go through a few examples of it — but we're gonna do something a little more fun and use emoji. To do this, I'm gonna need a zero-width joiner, which looks like this. So we're gonna take some emoji — by the way, this works in your terminal. Here we have a little family: a dad, a mom, and a kid, a daughter. Dad's got a kind of rockin' 'stache — I don't know if you can tell. It's blond, so it fits right in. A nice 70's-cop-'stache thing goin' on. And we can join them together with zero-width joiners and produce this picture, this new emoji. There's an individual code point for this, but you can construct it out of the top three. Also — again, it might be hard to see — dad had to lose the 'stache for the family photo. It's gone; it just vanished. So then we decide we wanna add a kid to the picture: a son.
He's the same age, so they kidnap — no, adopted. They adopted. They're good people. So we do that, and it produces the new family, with the definitely-not-kidnapped adopted son. We can also do things like sub these people out, because these are still individual characters at this point. You can find the single code point that is the group, but it can also be this group of characters. So we can do this, and end up with two dads and a daughter and a son. And by the way: one 'stache, one no-'stache, for those keeping count. There are some other things this works on, too. When they first did all these human-activity emoji, it must have only been men designing them, because they forgot half the population: the runners and such were all male. So you can join in a female sign and get a female runner. We now have those as proper emoji, but this was a step on the way there. There are also regional indicators. So you can join a U and an S and get the American flag. And you can join the thumbs-up — and all that kind of stuff we love — with skin tone modifiers, so that you can have a variety of skin tones. Now on my computer, this actually did not work, so it's not quite up to date yet, apparently. It should look more like that. So we can get these grouped up with graphemes: whether we use the individual character or the zero-width-joined group, they can be treated as a single unit. Back to the é example for a second. The grapheme cluster length will be one, no matter which type of é this is — the single character or the combo. This just showed up in 2.5, though — like, literally December. So this is still pretty new to the game. Before that, you would have had to use \X in a regex. The regex engine, interestingly, has had this for a long time; it was already able to pull graphemes out.
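The length-versus-graphemes gap, in irb (grapheme_clusters needs Ruby 2.5+; \X works in earlier versions too):

```ruby
single = "\u00E9"     # é as one code point
combo  = "e\u0301"    # e + combining acute accent

p single.length                    # => 1
p combo.length                     # => 2
p combo.grapheme_clusters.length   # => 1   Ruby 2.5+
p combo.scan(/\X/).length          # => 1   the regex engine knew first
```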
They also added each_grapheme_cluster, if you want to walk through them in an each kind of way. And it's worth noting that much of the original string API was designed for ASCII, and you can still see that in a lot of Ruby code. So take rjust, which lets you right-justify text. For example, if we have the é that's a single character, rjust(3) puts two spaces and then uses the third slot to drop the é in. But if we use the two-character é, we get one space and then what looks like one character — but as far as Ruby's concerned, it's using two slots. So there are a lot of cases where this can go a little sideways, right? Take reverse: our accent — one of them is now above the quote, the other one is now over the s. This is not good. So we can instead use grapheme_clusters.reverse.join and get what you would expect, regardless of which kind of é is in your string. So that's fun. The problem is, these still don't compare right. We've grouped them up, but they're just tiny strings that are now messing with us. Let's take a look at company names again. So I made up this company — I hope I made it up — Åntech. They have an A with a ring above it. But not all A's with rings above them are A's with rings above them. The one on the left is a capital A with ring above, used in Swedish and Danish. The one on the right is the symbol for the unit angstrom, which is one ten-billionth of a meter. These are separate code points. They are separate abstract character concepts — one of them is a letter in a language, the other is a unit symbol — but they use the same glyph. They are visually identical; you can't tell them apart. So if we're trying to fix this, and we're at a Google level with an international audience, this could be an issue. There are also ligatures — typographical flourishes. The first example has none. In the second, the second f and the i have kind of combined — the f is reaching over to dot the i.
And in the third, there's a line straight through the f's and into the i. All three of these will report back as being different strings. So you can see, again, where we might get confusing things that people wouldn't notice. They also report back as individual characters, because they are individual characters in Unicode — they're not combinations. So: o, ffi-ligature, c, e. And they are not equivalent if you compare them as text. Not surprising at this point. So we're screwed. There's nothing you can do. Thank you for coming. That's the end of my talk. No — Unicode understands that this happens and provides Unicode normalization. They define this concept and tell people how to implement it. Ruby added it in 2.2 with the aptly named unicode_normalize. There are four types of transformations that allow you to normalize your text, and they break into two major categories: canonical equivalence and compatibility equivalence. Canonical equivalence means equivalence between characters, or sequences of characters, that represent the same abstract character and have the same appearance. So looking back at the é we've been using: an é as one character and an e with a combining accent are canonically equivalent. They are conceptually the same character, and they are visually the same character. So first we have the normalization form for canonical decomposition, NFD. This is the process of taking a string and ripping the characters apart. So the c with the little tail, the cedilla — rip it into two pieces. It will also reorder combining marks into a prescribed order, to make comparisons work: obviously, if the marks on two strings were in reversed order, you'd get different strings even though they should be the same thing. Some Asian languages also have combined characters that get split apart.
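Here's NFD at work on the cedilla example:

```ruby
cedilla = "\u00E7"                        # ç as a single code point
nfd     = cedilla.unicode_normalize(:nfd)

p nfd.codepoints                          # => [99, 807]  c + combining cedilla
p cedilla == "c\u0327"                    # => false  different code points...
p nfd == "c\u0327"                        # => true   ...until we normalize
```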
Then there are also singleton equivalences. Singleton characters are characters like that A-with-ring and angstrom we were looking at earlier — characters that are identical. The example here is the Greek omega and the ohm sign. Those will always be reduced to one of the two — it will always become the Greek omega. So if you've got physics content, you will lose the meaning of "ohm" as far as the character is concerned. Usually it'll all look fine, but it is not a perfect system for keeping the meaning of what you're working with. So if we've got that ç, we can call unicode_normalize(:nfd) — the decomposition form we were just looking at — and then look at the code points: it's now a c followed by that combining cedilla. This lets you do some fun things. The combining marks live in a range, so we can take words like "résumé" and English-ize them by ripping out the combining marks. More to the point, if you're doing sorting or searching, ripping out these characters might make your search more accurate — as long as you do it to the search input as well. And of course, for anything you're gonna do here: always keep a copy of the original, because it's easy to screw up, and it's best to always have what you started with. Normalization form canonical composition, NFC, is the other direction. Kind of. It actually rips everything apart first, no matter what it's looking at, and then slams everything back together. So you still get all that reordering, you still lose the meaning of things like ohm, but it will put — oops, I don't know what that was — it will put the character back together. These transformations, as I mentioned, are not really reversible, which is why you should keep originals. But they are idempotent: you can run them again and again and again and nothing will change. The second form is compatibility equivalence.
That gives us NFKD — normalization form compatibility decomposition. This is where you get characters like those script letters you might have seen people using in their Twitter handles — they're math characters. You can reduce those back to a plain A, right? One-half — they actually call these "vulgar fractions," apparently from "vulgar" in the sense of "common," not vulgar in the sense we think of. Those get converted into — and this one's interesting — not one, slash, two, but one, fraction slash, two. So you still end up with a different slash character than you might expect. And the nice thing is you can take ligatures and reduce those as well. So this is a way to really break text down to its most basic components. You can have problems, though: three to the power of two will become 32. I think you can imagine where that could go wrong, right? Moon landings and such. So it's helpful, but it isn't always perfect. So the two decomposition forms are NFD, the canonical decomposition, and NFKD, the compatibility decomposition. The canonical one is really strict — it doesn't change too many things. Compatibility, you saw, is much more lenient, much more liberal. I try to remember it this way: using K for "compatibility" is pretty liberal — kind of like using it for, say, "conference," right? RubyHACK — you're in good company here with Unicode. NFKC uses that compatibility-mode decomposition to break everything down, and then does the same recombining you got before. It's not gonna put ligatures back together — it doesn't know there was a ligature to begin with; you're not gonna get any of that back. It's not gonna go back to being a fancy A; it has no concept of that. It's just gone to the wind. So: we could normalize the string with NFKC, we could fix up the whitespace, we could clear out invisibles, and we're on — at the very least — a much better path.
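Putting the pieces together — a minimal sketch of that cleanup path. This is my own assembly of the ideas in the talk, not code from it, and the invisible-character list here is deliberately tiny and illustrative; a real one needs ongoing curation, and you should keep the original string around:

```ruby
# Illustrative only: this invisible-character list is NOT exhaustive.
INVISIBLES = /[\u200B\u200C\u200D\u2060\uFEFF]/   # zero-widths, word joiner, BOM

def sanitize_name(name)
  name.unicode_normalize(:nfkc)    # fold ligatures, fancy letters, NBSP, etc.
      .gsub(INVISIBLES, '')        # drop zero-width characters
      .gsub(/\p{Space}+/, ' ')     # collapse any Unicode whitespace to a space
      .strip                       # only plain spaces remain, so strip works
end

p sanitize_name("Update\u00A0WhatsApp\u200B Messenger\u00A0")
# => "Update WhatsApp Messenger"
```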
I don't want anyone going into work and saying: if we do this, we're secure — we're all set, and he said so. That may not be the case. Talk to Jeremy about security, not me. But we're definitely better off than we were, and we're certainly not gonna get tricked by a non-breaking space. So, does Ruby have Unicode support? Kind of. I mean, they're still working on things. Like I said, the Lithuanian casing isn't there yet. strip doesn't remove all whitespace. You could say that maybe that should happen before we call it fully Unicode compliant. But things have gotten a lot better, especially over recent years. There have been a lot of improvements, and it seems like it's continuing to go that way. So I think it won't be too much longer before we can be pretty confident it will handle most everything Unicode prescribes and wants us to be able to handle. So once again, I'm Aaron Lasseigne. You can find me in pretty much all the places as Aaron Lasseigne, if you can spell my name. I have a book, Mastering Ruby Strings and Encodings — so if you thought this was interesting and want to nerd out on the deeper pieces of this, you can go pick that up. It's 20% off if you use the code RubyHack. And you can find it, and me, at AaronLasseigne.com. Thank you.