I want to talk about Unicode and the difference in how Unicode is handled in Python 2 and Python 3. But first, let me go way back in time and talk about what an encoding is, what Unicode is, and where it all came from. So we all know what an encoding is, right? Because we all know Morse code, we've all heard about the telegraph, olden times, people with long mustaches, bip, bip, bop, bop. It's a way to encode letters and numbers so that a machine can read and write them. The machine in this case is a telegraph, a wire with an electrical signal going through it. We humans write our things out in letters, then tap out dashes and dots to transfer them by machine. This encoding format for text has a couple of characteristics. It uses a ternary system: there are three kinds of symbols, the dash, the dot, and the pause, so between two letters you have an actual pause. This was all standardized in the 19th century so telegraph operators could understand each other across continents. Another thing: it's variable length. If I go back quickly you'll see that, for example, the A is dot dash, only two symbols, two "bits" theoretically, as opposed to, say, the Y, which is four. So you can't guess how many letters a piece of Morse will contain; you have to decode the Morse to figure out how many letters are in it. If you're trying to, I don't know, fit it on a computer screen, you have to go through it letter by letter. And finally, there are no special characters: Morse code has only letters and numbers. That's why old telegrams had STOP at the end of every sentence, because you couldn't really get a period.

All three of those things make it kind of crappy for computers. Computers, or rather the electronics at the core of them, your CPUs and your memory, are really good at telling zero from one, negative voltage from positive voltage, but a ternary system would be very difficult, or just not worth the price of the electronics. So all our electronics work on a binary system; that's one reason we can't use Morse. Variable-length encoding is also not brilliant for computers: all our memory accesses, when you fetch from RAM or, more to the point, from hard drives, on any computer you've used in the last 40 years, come in bytes, a series of eight of those zero-or-one bits. So this is why, after much trepidation and various standards from IBM that nobody uses and that we don't want to talk about, ASCII was invented and standardized in 1963. It's the American Standard Code for Information Interchange. What does it contain? 0 to 9, A to Z, some punctuation. It has what I'd call 128 glyphs, 128 symbols you can represent, and the key is that it fits in seven bits: it's a seven-bit encoding, two to the power of seven is 128, so you get 128 different symbols. If you look at an ASCII table, not all of those symbols are visible; some of them are control characters, things like line returns, space, end of transmission, record separator, a whole load of things they found useful in the 1960s.
And some of us still find line returns useful today. Right, but why seven bits? We have eight-bit computers: one byte is eight bits, and if you read from a file in most modern operating systems you get bytes of data, eight bits at a time. ASCII fits in seven, so that extra bit is yours to use, and the most common use for it was a parity bit. You encode your ASCII letter as a series of ones and zeros; say it has five ones, so I add an extra one at the top to make it six, an even number of ones. Now let's say I'm transferring this over a wire, maybe I have a telegraph guy tapping out zeros and ones, maybe I have carrier pigeons that each carry a big zero or a big one, and one of those pigeons gets eaten by a falcon or something. I lose that bit, but I know the total number of ones is supposed to be even, so by deduction I can figure out whether the missing bit was a zero or a one, and I get my letter back. That's parity. It's very common in networking; routers will do it for you automatically if you pay enough for them.

So we have a standard, we can communicate, everything's happy and fine, we sorted it out in the 60s, we got a man on the moon, woohoo! Well, not exactly. The problem is the rest of the world. All these different languages have different characters and weird ways to count money, and some people even started proposing other characters they wanted for books in different alphabets. So this one guy, well, a whole committee, started ISO/IEC 8859-1 (I think the 1998 revision is the latest), more commonly known to friends and family as Latin-1. Latin-1 is probably the most common Western European encoding you'll run into; it was really popular in the 80s, and if you ever have to deal with CSV files from Excel they will usually be in Latin-1, so that's where I mostly see it. What they did was use that extra bit. They said: the last seven bits, that's going to be ASCII. If you see a zero in front, you take the next seven bits, look them up in your ASCII table, and out pops that character. If you see a one in front, I'm going to slip in another table, a different set of 128 glyphs from ASCII, where I can start putting all my weird accents and umlauts and all that stuff. So I have this whole bank of another 128 characters I can use. If I'm dealing with Americans, or people from the UK, I just use ASCII all the way and we're all happy. If I've got a French guy or a German guy, I tell him it's under the Latin-1 code page, maybe he pays me for my software because it supports that, and suddenly it works on his computer. Oh, pop quiz: can you use Latin-1 to write French? Who thinks so, raise your hand. Who disagrees? All right. The answer is that you do not get the capital Y with a diaeresis (Ÿ), which basically only comes up in French in the names of a handful of towns when you print them in capitals. That doesn't fit in Latin-1. So if you were ever worried about that, you now have an answer.
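Just to make that zero-or-one-in-front rule concrete, here is a minimal sketch in Python 3 (my own illustration, nothing official): a byte whose top bit is zero is looked up in the ASCII half, and a byte whose top bit is one goes to the extra table of 128 accented glyphs.

for byte in b"\x41\xe9":                      # 0x41 is 'A', 0xE9 is 'é' in Latin-1
    if byte < 0x80:                           # top bit is 0: the plain ASCII half
        print(byte, "ASCII  ", chr(byte))
    else:                                     # top bit is 1: the extra 128 glyphs
        print(byte, "Latin-1", bytes([byte]).decode("latin-1"))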
Right, but what characteristics does Latin-1 have? One-byte glyphs: every glyph is represented in one byte. This is really important for all your encoding needs, because if I use grep from an old System III Unix, all it understands is one-byte things from ASCII, and even if a character only needs seven bits, it still loads as one byte. So if my encoding works in one byte, I can pass it through all the ASCII tools I buy from my American vendors and I don't have to change any code. One byte per glyph is also really useful because if I have a file that's 350 bytes long, I know it's 350 characters long; if I have a line that's 70 bytes long, I need 70 columns of my monospaced font to show it on my screen, for example. So basic text tools just work: text editors, all your Unix text tools, they all work with this one-byte encoding. And it's backwards compatible with ASCII: an ASCII file is also a Latin-1 file, and as long as my Latin-1 text editor doesn't use any accents, only the glyphs available in ASCII, it writes out an ASCII file that I can pass to somebody whose tools only read ASCII.

This was such a good idea that everybody started doing it. You got standards for Western and Central Europe with more accents, and you got code pages: this file is Latin-1, and you have to tell people which code page it is for them to read it on screen. The Thais got onto it, the Russians too. Mac OS in the 1980s developed its own version, independent from Latin-1; I've never seen it myself. The Japanese one is kind of interesting because it's not a one-byte system: Japanese has far more than 256 different symbols, so to fit them in they had to use multiple bytes per character. But it's a pain. Everybody had their own little standard. You had a text editor that worked here, you'd save under one format, you'd read it somewhere else, and your grandma goes, it's all garbled, what is this? And you're like, oh, use this code page; and she doesn't have that code page installed, and there's no internet, so you're all screwed and you go to the software store and try to buy your way out of it.

So another standard came about: Unicode, the one code to rule them all. We all knocked our coconuts together and in 1991 decided we're going to go with Unicode. Hooray! What is it? It is currently more than 110,000 characters covering about 100 scripts plus various symbols. Every grapheme, which is basically a letter, has a unique code point, which is basically just a number; you can write it in binary or decimal, but hexadecimal is the classic way you'll see it represented. Whenever you see "grapheme", think "a kind of letter", modulo some weird symbols, and "code point" is basically "a number". And the numbering lines up with what came before: Latin-1 is a superset of ASCII, so a letter in ASCII has a number, it has the same number in Latin-1, and it has the same code point in Unicode. I'll explain how this changes things a bit later. But the key is, that's all Unicode is. When you hear Unicode, just think of a huge bank of PDFs on the internet, for all intents and purposes, of numbers matched to specific letters and characters.
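To make that numbers-matched-to-letters idea concrete, here's a quick Python 3 illustration; the particular characters are just my examples.

print(ord("A"), ord("é"), ord("€"))     # 65, 233, 8364: the code points
print(chr(65), chr(0xE9), chr(0x20AC))  # and back from number to character
# The ASCII letter keeps its ASCII number, the Latin-1 letter keeps its
# Latin-1 number, and the euro sign lives at a higher code point.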
Okay, so how does this matter to you? I know this letter exists somewhere in the standard and has this number as its representation, but how do I print it on my printer? How do I show it on my screen? How do I show it in my web browser? Well, at that point the Unicode committee proposed UCS-2, the Universal Character Set I think, where the 2 is for two bytes, because initially they didn't have 110,000 characters; I think they limited themselves to around 65,000. So if you took two bytes and just dumped them into your file, you could represent any one of those 65,000 different characters. What it literally was, was C structs dumped to a file, one after the other: every letter took two bytes. And if everybody used that, between the Thais, the French, the Germans and the Americans, one file would be convertible everywhere; it would show up on the screen the same way for everyone, and nobody would be fighting with encodings.

However, one really painful part is that it depends on your architecture's byte order. If you've never had to fight with this, you're lucky: different architectures order the bytes of numbers larger than 255 differently. x86, I think, is the backwards one, where the least significant byte comes first; anyway, it's complicated, it's a pain. I know PowerPC is different from x86. x86 does it the archaic way, but then everything is archaic in x86. I have no idea what ARM does; I think you can choose, it might even be a switch on the CPU itself. So one problem is that, depending on your computer's architecture, the same text would dump to two different files that another person could not read. It's also limited to around 65,000 code points: it's only two bytes, so there are no bits left for anything more. So if you wanted to show off your Klingon, or the emoji I show later in this presentation, those live at higher code points in Unicode, above 65,000, and you can't represent them in UCS-2. One detail while we're here: the bottom of the Unicode code space is Latin-1. The first 256 code points, the ones where the high byte is zero, are exactly Latin-1, and other languages go beyond that, with a one or a two or whatever in the high byte and another byte running through their characters. Also, UCS-2 has one huge problem for Americans and English speakers: it takes twice as much space as ASCII. If you have an ASCII file, you have to take every character in it and precede it with a zero byte to make it a UCS-2 file, so suddenly it takes double the space. I know you can theoretically zip it, but that's the answer to everything and it never really works; you still have to grep through the file, it takes twice as much memory, and it's twice as much of a pain.
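Here's a small Python 3 sketch of those two complaints; I'm using UTF-16 without a byte order mark as a stand-in for UCS-2, which behaves the same for characters this low.

text = "hello"
be = text.encode("utf-16-be")  # big-endian: what one architecture would dump
le = text.encode("utf-16-le")  # little-endian: what an x86 dump looks like
print(len(text), len(be))      # 5 characters become 10 bytes: twice the size
print(be)                      # b'\x00h\x00e\x00l\x00l\x00o'
print(le)                      # b'h\x00e\x00l\x00l\x00o\x00'
# Same text, two incompatible byte streams, depending on the byte order chosen.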
So in 1993, Ken Thompson and Rob Pike, two of the guys who helped build Unix way, way back, were working on Plan 9, an OS nobody uses, and decided that UCS-2 sucks and we need a better system. They invented UTF-8, Unicode Transformation Format, 8. The 8 stands for 8 bits: it works one byte at a time. I'll explain how. It's what's called a multi-byte system: a character is represented by one to four bytes (the original design even allowed five or six, to reach arbitrarily high code points, though it's capped at four today). If I have a one-byte character, it starts with a zero bit, and none of the other bytes ever start with a zero, so it has seven bits left floating around. Hey, go figure, what a coincidence: ASCII fits in seven bits. So if you have a byte whose most significant bit is a zero, you have an ASCII character. You can use all your ASCII tools to fight with it and you're fine, and you don't need to know about Unicode, higher code points, or what Han unification is; those are all details to you. You know you're dealing with ASCII and that's all you need. So maybe I'm writing a file that's just ASCII; I can still use UTF-8 tools on it. Now say I want to reach French characters, more options for characters. You take the payload bits of the bytes, concatenate them together, and they form a binary number, and that binary number will be somewhere in Latin-1. And if you need to reach other Unicode code points, higher numbers, it's the same thing: more bytes, and all the bits concatenate; you can just string them along. The first byte of a multi-byte character starts with 110 (or 1110, and so on), and every byte after it starts with 10. If you're writing a parser, by the way, don't write your own UTF-8 parser, there are plenty of them around. The point is you can always tell where a character starts: a byte starting with 0 is a one-byte character, a byte starting with 11 starts a multi-byte character, a byte starting with 10 is a continuation, and when you stop hitting 10s you have one complete character.

So what's great about UTF-8 is that it contains all of Unicode, because the scheme lets you string along more and more bytes. If we wind up in some weird future, maybe with Martians or some language where we need millions and billions of different characters, we can just extend UTF-8; it's not limited like UCS-2. It's backwards compatible with ASCII: an ASCII file is a UTF-8 file, so UTF-8 tools, your text editors and all that, can read an ASCII file with no conversion, no remapping of code points, no anything. And there's no byte order problem: the byte order is specified in the standard, it's not architecture dependent, which is really great. There is a thing called a byte order mark in Unicode, which is supposed to tell you the byte order. Don't use it. Only boring people from the 90s use it to work around badly designed standards. In UTF-8 it shows up as nothing at all, not even a space, so there's no point in using it. Hooray! OK, so this is all sorted, right? We now have a standard for storing text on a computer hard drive, or any electronic medium. There are a couple of problems with UTF-8; it's a standard, and every standard is a concession to somebody. You cannot guess the length of strings, because it's multi-byte: I can't tell whether a character is one, two, or three bytes without looking. If I have 10 bytes of ASCII, I know it's exactly 10 letters long; if I have 10 bytes of UTF-8, I have no idea how many characters that is without reading through the whole thing. It can also cut a character in half: if I'm buffering and reading a file one kilobyte at a time, the last byte of my kilobyte could be the first byte of a two-byte character. With ASCII I know it's one byte per character, so I can cut wherever I like. But in general, use UTF-8. If you need to choose an encoding, pick UTF-8. All the tools support it, all the web browsers support it, you can have URLs in UTF-8. The other ones are just too archaic and too much of a pain.
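Before we get to Python, here's a quick sketch of those two properties in Python 3: the bit prefixes that mark where characters start, and the way a read buffer can cut a character in half. The characters are just my examples.

# The lead byte of a multi-byte character starts with 11, continuations with 10.
for byte in "é".encode("utf-8"):
    print(format(byte, "08b"))     # prints 11000011 then 10101001
# Cutting a UTF-8 stream at an arbitrary byte can split a character in two.
data = "café".encode("utf-8")      # 5 bytes, because the é takes two
chunk = data[:4]                   # pretend our read buffer ended here
try:
    chunk.decode("utf-8")
except UnicodeDecodeError as err:
    print("truncated character:", err)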
But now for the point of all this: how is it done in Python? Up to Python 1.5, there was no Unicode support at all. You would just not use it if you had to handle several languages, or you'd shove UTF-8 into byte strings; it was all crappy. In 1.6 you get the beginnings of Unicode support; that was in 2000. And one of the big features of Python 2.0, this is way before my time, was apparently the unicode type. So what is the unicode type? What they originally wanted in Python was: all right, everything is going to be Unicode. Then some people complained that it's too complicated, that it's too different from the representation on the computer, so they said to themselves, okay, okay, okay, let's make a dirty hack: let's make something that looks like a string, responds to most things like a string, but we'll call it unicode. So if you have straight-up ASCII characters, and ASCII is basically everything you can find in English, no accents, nothing crazy, you have a str. If you start your literal with a u, then your text editor or your terminal encodes what you type as, say, UTF-8 as you enter it, Python decodes that using the locale, the encoding of your terminal or your text editor, sees that it starts with a u and has weird characters in it, and gives you the unicode type. What's interesting is that in a unicode literal I can specify Unicode code points directly: in Python it's backslash-u followed by the hex number of the code point from the Unicode standard. So if you're digging through the Unicode PDFs and you look at the tables, the headers and rows match those numbers. If you see something cool in a Unicode PDF, you can dump it directly into Python; you print it, and it prints fine, usually, as long as the encoding of your terminal cooperates and you don't have too many clashes. And here's the key: this is a Unicode character that UTF-8 encodes as three bytes, but the unicode type is smart enough to realize it's only one character, so the length is one. If you were only dealing with raw UTF-8, it wouldn't be smart enough to do that. For example, if I call encode, which is how you convert between encodings in Python 2, I get my unicode encoded to UTF-8. It shows up as those three bytes, which match the table I showed you before, and if you print it, it prints fine because your terminal handles UTF-8. But if we take the length, it's now three: Python thinks it's three characters long, which is not true, because when you print it you wind up with one. And if I have a string like this and encode it to UTF-8, I wind up with those byte values for the accented characters while the rest stays ASCII, because that's what they are. If I decode it again, I get back the unicode object; in this case my terminal is being weird and shows the Unicode code point escape rather than the character, but it's the same thing, this UTF-8 here, decoded. Everything is perfect, it all works fine, there are no headaches at all with this. Right?
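Roughly what that interpreter session looks like in Python 2; the euro sign is just my stand-in for whatever three-byte character was on the slide.

>>> s = u'\u20ac'           # a unicode object built straight from a code point
>>> print s
€
>>> len(s)                  # one character, regardless of encoding
1
>>> b = s.encode('utf-8')   # now a plain byte string
>>> b
'\xe2\x82\xac'
>>> len(b)                  # the same character, but three bytes long
3
>>> b.decode('utf-8') == s  # decoding brings us back to the unicode object
True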
Well, not quite. Narrow Python builds. I was surprised by this one, because I hit it seven years ago and I didn't think it was still a thing, but apparently it is. Here is a really, really high Unicode code point; I think it's an emoji. Oh yeah, it is: the emoji for the scary ghost. Really cool emoji. Emojis are Japanese cell phone characters that I think Apple and the Japanese carriers managed to shoehorn into the Unicode standard, and the Unicode people were like, but characters can't have colors and stuff, but it's big in Japan, so it got in, and it's cool; there are all these little pictures, check it out on Wikipedia. Chrome on Mac OS X doesn't render them, who can I complain to? I have to use Firefox, I guess; maybe that's a sign of the times. Now, it's one character, there's one ghost on my screen, and yet Python thinks its length is two. That's because a narrow build of Python doesn't store text as UTF-8, and doesn't store full four-byte code points (UCS-4) either; it stores two-byte units internally, so a character above 65,535 like this one gets split into a pair of two units, and the length comes out as two. You shouldn't depend on this: first of all this is CPython, and a particular build of CPython. I know my Homebrew install of Python on my Mac OS X machine at home behaves this way. If you use Jython, I have no idea what happens; PyPy, good luck. It's platform dependent, and Unicode is supposed to be a higher-level view of strings precisely so you don't have to care about these things.

Collation. Or rather, no collation, unfortunately. I have a list of strings: haut, école, front. In a French dictionary, école sorts with the other e-words, before front and haut; the accent doesn't push it to the back. The way Python 2 sorts them for you, école comes out last, after front and haut, because it just compares code points and é sits way above the unaccented letters. Which is a bit painful if you're ever dealing with address books, stuff like that, so be forewarned.

Module support. You might have guessed it: I deal a lot with CSV files from random sources and such, so this one always annoys me. I have a plain string, I write it out, it writes fine. Then I throw a unicode string at the csv writer. OK, Python is weakly typed, I understand, but you'd figure, type for type, it should just be safe. No it isn't. The csv module in Python 2 only handles ASCII, or more precisely, if you encode to UTF-8 yourself it'll work, but it won't warn you up front. Your unit tests, because of course we all put accents in our unit tests, will pass flawlessly, and it's in production at 3 a.m. that you figure this out.

And string interpolation: just don't. I have "Robert", a plain string; maybe it comes from gettext, maybe it's just in my code. I do string interpolation into a unicode string and I get a unicode string back. OK, not quite what I was expecting, but it makes sense that it won't encode the unicode and instead upgrades everything to the "higher" type. Now let me try it in French. This is a UTF-8 byte string, which is kind of subtle, because there's no u in front: my Python interpreter sees me type a c-cedilla, and because of how my terminal is set up, that gets encoded as UTF-8 bytes. And it all falls apart. They could have decoded it for me, and I can understand why they don't want to, but then maybe the first case should have failed too; it shouldn't depend on the contents of your strings. What are you going to do? Clean up your strings; you can try to decode everything up front. In Python 2, what I recommend is to stay in unicode as long as possible, and then put a wrapper at the lowest level that encodes to UTF-8, or whatever encoding you want, on the way out. But it's a source of many unpredictable bugs.
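To make that interpolation trap concrete, here's roughly what it looks like in a Python 2 shell; the name and the French word are my own examples.

>>> u"Hello %s" % "Robert"            # plain ASCII bytes get silently promoted
u'Hello Robert'
>>> u"Hello %s" % "fran\xc3\xa7ais"   # UTF-8 bytes typed in a UTF-8 terminal
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)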
But I personally think things are a lot better in Python 3, for the simple reason that the differentiation is built in. What's now called str in Python 3 is basically the old unicode type: a sequence of characters, not bytes. They can be Unicode characters, they can be ASCII characters, but it's a sequence of characters. The bytes type in Python 3 has no encoding attached at all; all it is is a sequence of numbers, each under 256. What that means is you have to say what you want at every step of your program. Take opening a file with 'r'. In Python 2 we all read the docs: open the file with 'r', oh, and if you're on Windows open it with 'rb', because otherwise the Windows API does completely crazy things, for historical reasons you can understand, and changes your line returns from something weird to something weirder. I don't know, I always programmed on Unix machines, so I always opened files with 'r'. In Python 3 this will bite you in the ass, because text mode means an encoding is involved: you should specify one, and if you do, it automatically decodes for you. So instead of getting a string in some weird encoding you're not sure about, which you'd have to handle yourself, it assumes an encoding (the default comes from your locale, probably UTF-8 on your machine) and gives you a str, a sequence of Unicode characters. If you open with 'rb', you get the bytes type: just a sequence of numbers, not a string, and you can't use string operations on it. I think newer versions of Python 3 are getting a bit more accommodating on that subject. So here's a byte string: I start it with a b, and I write it with characters, but all it stores is the sequence of numbers matching what my terminal sends. Oh, sorry, in this case it gives a syntax error, because you cannot have a non-ASCII character in a bytes literal. However, if I take the string, the same one we had with the u in Python 2, and I encode it as UTF-8, I get a byte string with the UTF-8 bytes in it, and that's a valid bytes object. And if I decode it again, matching the exercise we did in Python 2, I get back the str type, which can hold Unicode characters. If I index into the str version, I get a character, the inverted exclamation point. OK, Jordy is not making big eyes at me, so I must have written it correctly. However, if I encode it as UTF-8 and take the 0th element of that, I get a number. Not a character, not a one-character byte string: a number, and it matches the first byte of the UTF-8 encoding we saw earlier. On the other end, when you write to a file, you can open it write-binary, and if you then try to write a str, a string that hasn't been encoded yet, think about it logically: Python has no way to know what to do. You told it write-binary, so it's just going to put bytes on the wire, and now you're asking it to write a Unicode string; it doesn't know which encoding to use. So it complains bitterly, and it complains whatever the contents of the string are; it doesn't depend on the contents at all. You run this through a unit test, it blows up right away.
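Here's roughly that Python 3 session; the inverted exclamation point stands in for whatever was on the slide, the file name is made up, and the exact TypeError wording varies a little between versions.

>>> s = "¡hola"
>>> b = s.encode("utf-8")     # a bytes object holding the UTF-8 encoding
>>> b
b'\xc2\xa1hola'
>>> b.decode("utf-8") == s    # and back again
True
>>> s[0]                      # indexing a str gives back a character
'¡'
>>> b[0]                      # indexing bytes gives back a plain number
194
>>> with open("out.bin", "wb") as f:
...     f.write(s)            # a str into a binary-mode file: fails immediately
...
Traceback (most recent call last):
  ...
TypeError: a bytes-like object is required, not 'str'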
Vice versa: I open the file with just 'w', no longer 'wb', and I specify my encoding, UTF-8, right there in the open call. Now I try to write a byte string. The danger here is double encoding: if that byte string already contained UTF-8-encoded characters, encoding it again would be horrible. So Python does the logical thing, gives you a TypeError, and craps out. So, my two cents, a couple of very high-level rules of thumb. If you actually want to dig into this, read the docs and don't just follow these rules; but if you do not want to worry about it: always use the str type in your code; always open files with 'w' or 'r' plus an encoding, and let the file reading and writing take care of the encoding issues for you. In your code, just deal with strings of Unicode characters, and only worry about the encoding at the last possible moment. And you can use byte strings, that's completely fine: if you need to do bit twiddling, fight with binary files, do cryptography, all the really cool stuff you can do with binary data, then you read it as bytes, use those numbers, XOR them, whatever you want, and you're in a much more comfortable position. It's just that in Python 3 it's all separated for you. It forces you to pick Unicode strings or byte strings, and there's no more guessing, no more "on a Tuesday, under these conditions, this code might react this way". Your code won't fail deep in its stack with some weird encoding error; it fails right away and you have a much easier time fixing it. And that's it.
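For reference, the pattern those rules of thumb describe looks like this in practice; a minimal Python 3 sketch, with a made-up file name and contents.

with open("names.txt", "w", encoding="utf-8") as out:
    out.write("François\n")        # a plain str, encoded for us on the way out

with open("names.txt", encoding="utf-8") as src:
    for line in src:               # already decoded: characters, not bytes
        print(len(line.rstrip()))  # 8 characters, even though the ç is 2 bytes on disk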