 today to tell you about my favorite. Before I get started, I want to define a couple of terms. I'm going to be using the word character a lot. There's a very technical definition of a character in a Unicode technical report that I've linked, but for now, I just want to keep things simple. When I say character, I'll be referring to a sequence of bytes which represents a letter or symbol, or part of one. And I'll be using the terms character and code point more or less interchangeably. And when I say glyph, I'll be referring to the visual representation of the character.
The character isn't actually visible on this slide, but its effects are. The zero width joiner, Unicode U+200D, is an invisible character used to combine two or more Unicode characters into one glyph that is displayed in their place.
So first, I'm going to tell you really quickly what Unicode is and why we have zero width joiners at all. I think that this history is interesting, and I think that we can learn a lot about the way things are today by thinking about the past. We didn't always have zero width joiners because we didn't always have computers. With typewriters, if you want to combine two characters, you type a character, you hit Backspace, and then you type the second character on top of the first one. If you want to write in two different scripts, you have to switch out the ball inside the typewriter that has the characters engraved on it, and I have a bunch of those in this picture from the New York Historical Society.
So ASCII was an early standard for storing character information on a computer. It didn't have a zero width joiner either, though. It had English characters. So people would work around this in other languages by writing letters without putting accents on them and kind of hoping for the best. If there are one or two of these, then it's pretty understandable. This example is in Italian, and you can kind of muddle your way through it. It's frustrating.
So then there were a couple more code pages, which are sets of characters. IBM mainframes used EBCDIC, which is a code page other than ASCII and had different numbers for different characters. And you could write in maybe Italian with the accents, and when you printed it out and showed it to someone else, it would have accents. But I say printed because you couldn't really move a file from one computer to another and have it still necessarily work. You had to convert between the computers because there wasn't one universal standard. The program that converts between different code pages is still on some computers today. In Linux, if you type iconv --list, it will give you a list of everything it can convert between. You'll see how unfortunate you might have been if you had received a file and wanted to read it.
So that's not true anymore. What changed? In 1991, the Unicode Consortium was created, and this quote pretty much says it all: "With Unicode, the information technology industry has replaced proliferating character sets with data stability, global interoperability and data interchange, simplified software, and reduced development costs."
So the Unicode standard describes different code planes which can be used at the same time. When you had your one character ball and you needed to switch it out, that was like changing between ASCII and EBCDIC, but now you just have a really big one. You can type whatever you want, and you can use it all at the same time. A little more technically, if you have a sequence of bytes representing characters with Unicode, you can start where you are and keep parsing from there, and eventually you will reach a character that is comprehensible, hopefully.
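The two points above, that bytes written in one code page are misread under another, and that with Unicode you can start anywhere and find the next character, can be sketched in Python 3. This is only an illustration under assumptions: it uses UTF-8 as the Unicode encoding, the Italian word and the helper name resync are made up for the example, and cp037 is one of the EBCDIC codecs that ships with Python.

```python
# The same Italian word stored under three different encodings.
word = "perché"
ebcdic_bytes = word.encode("cp037")    # an EBCDIC code page
latin1_bytes = word.encode("latin-1")  # an ASCII-compatible code page
utf8_bytes = word.encode("utf-8")      # a Unicode encoding

# Reading EBCDIC bytes as if they were Latin-1 produces the wrong characters,
# which is why a conversion step (like iconv) was needed between machines.
misread = ebcdic_bytes.decode("latin-1")
print(repr(misread))


def resync(data: bytes, start: int) -> int:
    """Hypothetical helper: first index >= start that begins a UTF-8 character.

    UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so
    skipping them moves forward to the next character boundary without
    ever looking backwards.
    """
    i = start
    while i < len(data) and (data[i] & 0b1100_0000) == 0b1000_0000:
        i += 1
    return i


# "é" takes two bytes in UTF-8; start in the middle of it and recover.
data = "é ok".encode("utf-8")
boundary = resync(data, 1)
print(data[boundary:].decode("utf-8"))
```

Starting at index 1 lands inside the two-byte "é"; the helper skips forward to the next boundary and the rest of the bytes decode cleanly.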
And you don't need to go backwards to find out where the characters start when you want a complete character. But in some systems, such as Shift JIS, you have to see where the shift starts to see how long a character is. With Unicode, the chunks are regular. And how do we handle overlapping characters, which was really easy on a typewriter and pretty much impossible for a little while after that? We use the zero width joiner.
So it's my understanding that Arabic letters are typically joined when they form words. I first learned about this from Ramsey Nasser, who I am honored to share a stage with. I believe he runs a Tumblr blog devoted to pointing out places which don't link letters that should be linked. A zero width joiner joins letters, so you can use it to make sure Arabic letters are written in a joined form. They should have properties specified that join them together when they're in words anyway, but you can force them into a joined form even if they're not next to one of those joining letters, such as in this slide. You can see, and sorry for those of you who can't see it, you have zero width joiners and a character that's in one form, and then on the other side of the equals sign, it's in another form. Arabic is typically read right to left, but this slide you can read left to right, because I speak English.
So to better understand the zero width joiner, I like to think of how it's different from other similar characters. The most similar character that comes to mind is the zero width non-joiner. A zero width joiner joins two separate characters into one glyph. With a zero width non-joiner, you have two separate characters and they are not joined. It's like its evil twin. So this example is in Kannada. You can see the characters on the top. You put them right next to each other without doing anything else and they combine into one glyph.
On the bottom, they are still next to each other, but not combined.
You may have noticed that my first slide had three examples and I've only gone through two so far. I don't speak Arabic and I don't speak Kannada, but I do speak another language: emoji. The Unicode Consortium names this glyph couple with heart, woman, woman. The emoji that use the zero width joiner are typically couple and family emoji.
One of the reasons I like the zero width joiner is that it destroys any assumption you might have about how long a string is. This is really apparent in Python. This is Python 2. Someone in the audience, how long do you think this string is? I heard eight. I see someone holding up three fingers. That's three. Someone guessed 40. Those are all pretty good answers. I saw someone doing five. Okay. I'm going to tell you: 20. This is because the default implementation in Python 2 for strings is a byte string. There are 20 bytes, shown as hex escapes, which comprise this string under the hood.
Python also has the concept of a Unicode string, which is actually the default type of string from Python 3 onwards. So let's try this again. Someone, how long is this string? Okay. A lot of people seem to think it's three, which is a good... Okay. I hear five, two. Those are good. That's pretty close. The correct answer is six. There are six Unicode characters in this string. There's woman 1, zero width joiner 1, the heart, a variation selector for the heart which says that we want the emoji display of it, zero width joiner 2, and finally woman 2.
I have one more puzzle, though. Perhaps, like me, you program in languages which run on the Java virtual machine. You might or might not be blissfully unaware of this, but the Java virtual machine stores strings under the hood as UTF-16, which is an encoding defined by the Unicode standard. So knowing what we know now, what's this going to return? This is the length of a string in Java.
12 is not what I was expecting someone to say, but it's a guess. The correct answer is eight. So this is a trick question. The woman emoji, which is one of the six characters in this string, doesn't fit in a single UTF-16 code unit. So it uses two, a high surrogate and a low surrogate, which combine to form a surrogate pair to represent it. Since there are two women in this string, each of them takes up two, and it returns eight. There's a way to check this. You want String.codePointCount instead of String.length if you want the number of Unicode characters.
So I want to go on a quick digression. The zero width joiner isn't the only way an emoji can change that breaks your assumptions. You may have seen the skin tone emoji. You can now have an emoji and specify the color of the skin of the person in it. That does not use the zero width joiner, but that doesn't mean that the number of code points in it is what you would expect it to be either. You have one character, and next to it a second character, which is the skin tone. I think it's called the Fitzpatrick modifier, and it ranges from light skin to dark skin. This is much like the heart in my example and its variation selector. It tells you what variant you want. And this uses variants because it's part of the same thing. It's not a new concept that needs a new emoji, but that's pretty much a judgment call by the Unicode Consortium.
So there's one more wrinkle I have for you before you can display emoji. Unicode version 8 was introduced last year, and it uses zero width joiners for couple and family emoji. Not all devices support Unicode 8. This is my smartwatch. I'm wearing it right now. I sent a text to myself with the emoji sequence for couple with heart, woman, woman, and this is what it showed me. As you can see, there is a box which represents a woman, and then a heart, and then a box which represents another woman.
So the reason it doesn't display like complete garbage is that Unicode defined a sequential fallback for when you see the couple with heart sequence, and it is person, heart, person, all next to each other, not combined. This is to prevent breaking backwards compatibility. And my question for all of you is: when you're writing apps, do you need to be that careful to make sure things have a fallback and that they still work?
So in conclusion, when you're writing code for something that takes user input, think carefully about how much space you're allotting for its display. And also in conclusion, we live in an amazing, diverse world, and I hope that we can build apps for everyone in this world, not just people who speak English. Thank you.
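The three length puzzles from the talk (20, 6, and 8) and the sequential fallback can be reproduced with a short Python 3 sketch. The constant names and the utf16_units helper are made up for this illustration. Counting UTF-16 code units stands in for what Java's String.length reports, and stripping the joiners only approximates what a device without support ends up showing.

```python
# Couple with heart: woman, woman, built from its six code points.
WOMAN = "\U0001F469"  # woman
ZWJ = "\u200D"        # zero width joiner
HEART = "\u2764"      # heart
VS16 = "\uFE0F"       # variation selector requesting emoji presentation

couple = WOMAN + ZWJ + HEART + VS16 + ZWJ + WOMAN


def utf16_units(s: str) -> int:
    """Hypothetical helper: number of UTF-16 code units in s.

    Each code unit is two bytes, so this matches what a UTF-16 based
    string type would report as its length.
    """
    return len(s.encode("utf-16-le")) // 2


print(len(couple.encode("utf-8")))  # 20 bytes, the Python 2 byte string answer
print(len(couple))                  # 6 code points, the Python 3 answer
print(utf16_units(couple))          # 8 code units, the Java answer

# Without the joiners, what remains is woman, heart, woman side by side,
# roughly the sequential fallback a device without support displays.
fallback = couple.replace(ZWJ, "")
print(len(fallback))                # 4 code points

# Skin tone: a base emoji followed by a Fitzpatrick modifier, no joiner.
wave = "\U0001F44B" + "\U0001F3FD"
print(len(wave), utf16_units(wave))  # 2 code points, 4 code units
```

None of these numbers is the one a person would give, which is one glyph, so none of them is a safe basis for deciding how much display space a piece of user input needs.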