Okay, is everybody ready? Thank you, and thanks for making it out early for this presentation. Thanks, everybody. As was mentioned, my name is Greg Back, and I'm going to be talking about Unicode today. I'm going to try to make this a kind of mystery-themed presentation.

The obvious question to lead off a presentation like this is: what is Unicode? But we're not actually going to start there. We're going to start with: what is an encoding? If you've been a programmer in the US for any length of time, you've probably dealt with ASCII, a way to represent letters, numbers, and punctuation suitable for most English text (not all English text). Life is pretty easy when all you have to deal with is ASCII, but once you start having to deal with other languages, things get a little more complicated.

Beyond ASCII there's an encoding called Latin-1, which has a lot of accented characters, some more currency symbols, and a few other characters useful for representing Western European languages. There's another encoding, Latin-2, which has some different accented characters (some are the same). Beyond the Western European languages you have things like Greek, with a different set of characters. And once you get into non-European languages like Japanese, Chinese, and Korean, there are so many characters beyond the ASCII set that even two-byte characters aren't enough to represent them all.

So basically, an encoding is a way of mapping characters into numbers, or bytes, that a computer can understand. That's the first clue in our mystery: humans deal with characters, but computers deal with bytes, and an encoding is how you map human characters into bytes that a computer can understand. With that understanding, we're ready to ask: what is Unicode?
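To make the "same byte, different character" point concrete, here's a small Python 3 sketch (the byte value here is my own illustration, not from the slides):

```python
# An encoding maps characters to bytes. The same byte value can mean
# different characters under different encodings.
raw = b"\xe9"  # a single non-ASCII byte

as_latin1 = raw.decode("latin-1")      # Western European: é
as_greek = raw.decode("iso-8859-7")    # Greek: a different letter entirely

print(as_latin1, hex(ord(as_latin1)))
print(as_greek, hex(ord(as_greek)))

# ASCII can't represent this byte at all:
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    print("ASCII failed:", err)
```

The bytes never change; only the interpretation does, which is exactly why you have to know the encoding.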
Unicode was designed to be a way to represent all of the characters in all of the active human languages, as well as a lot of historical languages that aren't used anymore, plus mathematical characters; recently, all of the emoji are Unicode characters too. Unicode was made to solve this problem of "I have text in some encoding," because as we saw on the earlier slides, the same byte value can mean different things in different encodings. Unicode is meant to be one unambiguous mapping between characters and numbers.

One nuance that's important if you've ever heard the term "Unicode code point": a code point is sort of an intermediate representation between characters and bytes. I want to distinguish between characters, code points, and glyphs. A character is the idea of the letter "a," or whatever character it is; you'll see these terms in a lot of the documentation. A code point is how that character is represented in Unicode, and it's typically written with this "U+" syntax followed by a four (or now five) digit hexadecimal value. This is still conceptual: you don't store code points directly as bytes on a computer. And then a glyph, which I'm not going to talk much about today, is any of the different ways you can display that character; that's more in the realm of fonts and things like that.
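The character/code point/bytes distinction can be sketched in a few lines of Python 3 (the specific characters are my own examples):

```python
# A character is an abstract idea; a code point is its Unicode number,
# written U+XXXX in hex. ord() and chr() convert between the two.
ch = "é"
cp = ord(ch)
print(f"U+{cp:04X}")           # U+00E9

# Code points are not bytes. An encoding turns code points into bytes,
# and different encodings produce different bytes for the same code point:
print(ch.encode("utf-8"))      # two bytes
print(ch.encode("utf-16-be"))  # two bytes, but different ones
print(ch.encode("utf-32-be"))  # always four bytes per code point

# Higher code points (like emoji) need five hex digits -- and more bytes:
thumbs = "\U0001F44D"
print(f"U+{ord(thumbs):X}", len(thumbs.encode("utf-8")))  # U+1F44D 4
```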
In an ideal world, there's a one-to-one correspondence between characters and code points. That's not always true, and we'll get to that in a little bit, but in general the idea behind Unicode is that for each character in any language, there is a unique Unicode code point.

I mentioned that you can't directly represent code points as bytes, so there are ways to encode a code point into bytes. There are a lot of these; here are four of the most common. UCS-2 is obsolete and you shouldn't use it; basically it maps each of the first 65,536 Unicode code points to a two-byte value. UTF-16 uses something called surrogate pairs to represent all of the Unicode characters. UTF-32 unambiguously represents every code point as four bytes, but for a lot of text, using four bytes per character is a big waste. That's why UTF-8 was developed: it's a variable-width encoding. All of the ASCII characters are represented as their one-byte values, and all of the higher code points are represented using two, three, or four bytes. For all practical purposes, you can deal with UTF-8 for almost everything. On Windows you have to deal with 16-bit characters; I'm not sure if it's actually UCS-2 or more like UTF-16, but it is very much a 16-bit character system on Windows. For all web stuff, though, UTF-8 is pretty much the standard.

Now I'm going to show some code. All of the slides that are blue are Python 2, and all the slides that are green are Python 3. I tried to make it a little iterative, so it's easy to remember. In Python 2 you have your quoted string literal, and we know what the type of this is: str.
Yeah, it's the str type. In Python you basically have two types for sequences of characters or sequences of bytes: byte strings and Unicode text strings. You can use a b prefix to explicitly say "this is a byte string," and a u prefix to explicitly say "this is a Unicode string." As we'll see, the first string is an str, and str is the type of byte strings in Python 2. Python 3 is very similar, in that the single- or double-quoted string is still the str type, but in Python 3 the Unicode string is the same thing as a string literal without any prefix.

So, in summary: in Python 2 the Unicode type is called unicode and the byte type is called str; in Python 3 the Unicode type is called str and the byte string type is called bytes. Each of these two makes perfect sense on its own; together, this is very confusing. This actually isn't a clue in our mystery, but it is one of the important things to remember if you're dealing with both Python 2 and Python 3.

Another big difference between Python 2 and Python 3 is how you can combine byte strings and Unicode strings. Python 2 just lets you do it: we have a byte string and a Unicode string, we cram them together, and we get a Unicode string. Python 3 is not so nice; it says you can't do that. So, a summary of the differences: in Python 2 you can add them together (in most cases; there are some edge cases I could go into), and in Python 3 you can't. Is this a win for Python 2? No, not at all. It lets you get away with things that work when you're dealing with just ASCII, but that start blowing up in your face when you have non-ASCII characters. If you've ever been in a situation where your program works, and then you suddenly get some unexpected input and it blows up, and you wonder why this is happening, this is a clue toward that.
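Here's a Python 3 sketch of the "can't cram them together" behavior, along with the explicit fix (the string values are my own examples):

```python
greeting = b"Hello, "   # bytes
name = "José"           # str (Unicode text)

# Python 3 refuses to combine bytes and str implicitly:
try:
    greeting + name
except TypeError as err:
    print("TypeError:", err)

# The fix is to be explicit: decode the bytes, or encode the text.
as_text = greeting.decode("ascii") + name     # -> str
as_bytes = greeting + name.encode("utf-8")    # -> bytes

print(type(as_text), as_text)
print(type(as_bytes), as_bytes)
```

On Python 2 the first addition would silently succeed for ASCII data and only fail later on non-ASCII input, which is exactly the "works until it doesn't" trap described above.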
The right way to combine a byte string and a Unicode string is to either decode the byte string into Unicode and then add it to the Unicode string, or encode the Unicode string into bytes and then add it to the byte string. In this case we get the same sequence of characters, but in one case it's a byte string and in the other it's a Unicode string.

So, clue number two in our mystery: the process of converting from Unicode to bytes is called encoding, and the process of transforming bytes into Unicode is called decoding. I get this confused all the time; I might even say it wrong during this presentation. But remember that. This was another big sticking point for me; once I got it straight, it made understanding all of this a lot easier.

Here we have a Latin-1 string made up of some non-ASCII byte values (they're not arbitrary). We can print out the bytes in a cleaner format to see them. But if you try to encode it as ASCII, you get an error. Not surprising; you've probably seen an error like this before if you've ever dealt with text in Python 2. So here again is the command we ran and the error we got. Python 3 doesn't actually make this any better; it still can't do it, but it does give you a better error message: if you try to call encode on a byte string in Python 3, you get an error saying there's no method called encode on bytes. Python 2 will subtly try to help you out and, in the process, create lots of confusion.

Again, the right way to do encoding and decoding, if you have a byte string (which is what our Latin-1 string is), is to decode it from the Latin-1 bytes into a Unicode string, and from there you can encode the Unicode string as UTF-8 bytes. There you can see the actual byte values of the characters: it's almost the same, but not quite. This was probably the most important thing for me to realize: Unicode is not an encoding in Python. If you ever try to encode something into unicode, you're doing it wrong, or if you're
trying to decode something from unicode, you're doing it wrong. You decode to get to Unicode, and you encode to get from Unicode to bytes.

So our Latin-1 string didn't work as ASCII; let's try it as UTF-8. Still an error. But if you look closely, it's the same error we got before: it's saying the ASCII codec can't decode that byte. Why did it do that? We wanted UTF-8, but it's complaining about ASCII. If you look closely, you see that we tried to do an encode (not to mention the fact that it's already a byte string, which is one problem), but we got a UnicodeDecodeError. What's up with that? The reason is that Python 2, behind the scenes, says "okay, you're trying to encode something that's already encoded, so I'm going to decode it for you first and then encode it." The system default encoding on Python 2 is, in many cases, ASCII, so it's trying to decode the string as ASCII before encoding it as UTF-8, and it fails because we have these non-ASCII characters in our byte string.

There are a bunch of attributes and functions you can call in the standard library to get information about the encodings that are used on the system; this is a handful of them. As you see, some of them are UTF-8 and some of them are ASCII. There are ways to set the default encoding to something other than ASCII, but I've never gotten them to work reliably; they've always caused more problems than they're worth. So it's better to be explicit with your encoding and decoding, so that you know what you're encoding to and decoding from. In Python 3 it's a lot cleaner.
It's always UTF-8, so that's definitely a win for Python 3. So again, a summary: on Python 2 the default encoding is ASCII most of the time, and you can't reliably change it; on Python 3 it's UTF-8. As for sys.getdefaultencoding(), you probably never actually need to call it, because it will be one of these values unless you're on an unusual system.

Just to prove to you that you can do encoding and decoding correctly in Python 2: we have a byte string here, and we're going to decode it as Latin-1, because we know that's its encoding, and then we're going to encode the Unicode back into UTF-8. And see, no errors; it worked.

The next clue, to do with encoding and decoding: I didn't come up with this term, but it's called the Unicode sandwich. Basically, you want to decode from bytes to Unicode as soon as you can, deal with Unicode throughout your program, and then only encode to bytes at the very end, when you're outputting it somewhere else. Fun fact: when I was writing this slide, I had "encode" as the first step, from bytes to Unicode. I told you I get it screwed up; I just caught it like an hour ago.

So we know that encodings are important, because if we have a stream of bytes we want to get it into Unicode. Which leads to a natural question: how do you know what the encoding is?
There are a few answers to this question, so we're going to go through them in order.

First: you don't. If you have a stream of bytes, it could be any encoding, or it could be no encoding; maybe it's not even text.

Second: unless you're told. HTML, HTTP, and XML all have ways of specifying an encoding; even Python source code has a way of specifying the encoding for a stream of bytes. You can't always trust it, though. If you're on a web server, say a shared hosting environment where you don't have control of the server, it may return an HTTP header with one encoding even though the HTML is in a different encoding. A lot of work in web browsers goes into detecting the encoding and displaying pages correctly, but even then, you've probably gone to a web page that just looks like gibberish. Hopefully not recently, but it used to be a much bigger problem.

Third: fortunately, like I was mentioning with the web browsers, you can usually guess at the encoding. There's a great third-party Python library called chardet; I don't know how you pronounce it, I pronounce it "char-det." Given a string of bytes, you can ask chardet to detect its encoding. We have two byte strings here, and they're completely different. One of them is in an encoding called GB2312, which is used for a lot of Chinese text, and the other one is UTF-8, according to chardet, with 99% certainty. So if we think we know the encodings, let's try decoding them. Success! It was actually the same string all along. Surprise: a twist in our mystery. If you try to decode with the wrong encoding, you're going to get errors. In this case I just switched the two encodings.
I tried to decode the first string with UTF-8, I tried to decode the second string with GB2312, and I got these exceptions. Any time you get exceptions like this, it's probably a sign you have the wrong encoding.

There is a workaround: there's a second argument you can pass to the decode function, which says what to do if you get a byte that doesn't match the encoding. You can just ignore the error, you can put in a substitute replacement character, or you can escape it with certain escape codes. I don't have a slide that actually lists them, but if you look at the documentation for the decode function, there's a handful of ways you can deal with errors. The problem is that if you have the wrong encoding, you can replace the bad bytes and still get absolute garbage, so don't do that. That said, sometimes you do have to do something like this, if you have poorly formatted text, or text that naively combined things in different encodings; it may not be possible to decode a given byte stream perfectly under any encoding.

Oh, one other caveat with this: it's not perfect. I have another byte string here, and chardet says it's MacCyrillic, in Russian. It's not, actually; it's using the CP1251 encoding, and the only difference is that first character. I don't speak Russian and I don't really know Cyrillic, so I have no idea what it actually says. That's the point: chardet is not perfect. (Yeah, I did look it up on Google Translate, but I don't remember what it was; it was "economics" and something, I don't know.)

So the fifth clue is: be prepared for anything. You really never know what you're going to get with text that comes from somewhere else.

One other quick gotcha with Unicode is normalization. We have two strings here, and we're going to see whether those two strings are equal. Let's take a poll real quick.
How many people think the first comparison is going to be true? How many people think it's going to be false? Well, I wouldn't have put this in here if it were true. It is false. As you see in the second box there, the strings have different lengths, and they have different byte values. The reason is, remember how I said that Unicode tries to map characters one-to-one to code points? That isn't perfect. The first string there uses the single Unicode code point for the e with the accent, and the second uses two code points: the e, and then an accent that combines with the previous character. That way you can put an accent on any character you want. The problem is, if you're using something like this for usernames, you could have two people with what looks like the same username but is completely different Unicode. There's a feature in the standard library, in the unicodedata module, that lets you normalize all of your text to one form. NFC (Normalization Form C) basically means: if you have any instances of the combining accent characters, combine them into the single composed value. So when you're doing comparisons on things like usernames (that's the best example), make sure you're normalizing the input first.

One other quick hint about the standard library: there are some functions that behave differently if you pass them byte strings versus Unicode strings. The listdir function in the os module, if you pass it a Unicode string, will return all of your paths as Unicode strings, and if you pass it a byte string, it will return all of its values as byte strings. This is actually useful for most applications, because you have something and you want to get back the same kind of thing, but if you don't know it's happening, or don't expect it, it can trip you up.

Everything I've talked about so far is "if you're in Python 2, or if you're in Python 3, this is how you handle it." But sometimes you're writing code, a library or a
package that you want to work on both Python 2 and Python 3. There's a compatibility library called six. If you're not using it and you're trying to write Python 2/3 compatible code, you should be, if for no other reason than it makes handling text much better. There's a variable called six.text_type: on Python 2 it's the unicode type, and on Python 3 it's the str type. six.binary_type is str on Python 2 and bytes on Python 3. And six.string_types is good to use in isinstance checks: on Python 2 it's either of them, and on Python 3 it's just str.

These are useful: if you'd normally call str() to convert something to a string, you can use text_type instead, to say "I want this as text." That works much better across Python 2 and Python 3 than naively calling str(), because on Python 2 str() gives you a byte string while on Python 3 it gives you a Unicode string.

The other key to writing Python 2/3 compatible code is from __future__ import unicode_literals. One thing that confused me for a while: this only applies to literal strings in your Python source code; it doesn't magically solve all of your Unicode problems. Normally, if you were to run that second command without this import, you'd get the str type; with the import, an unprefixed literal is a Unicode string. So: use the six module, and use unicode_literals, for cross-compatible Python source code.

Hopefully those six tips, along with some of the other notes I had along the way, are helpful. This is my last slide; I'll let everybody take a picture of it. It seems like everybody wants a picture of this one. Here are some of the resources. I will post these slides afterwards and probably send a tweet to PyOhio, so they'll get copied there. I'll go back to this.
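If you can't depend on six, the core of what it gives you for text handling is small enough to sketch yourself. This is a simplified stand-in I wrote for illustration, not six's actual source:

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    text_type = str        # the role of six.text_type
    binary_type = bytes    # the role of six.binary_type
    string_types = (str,)
else:  # Python 2 branch; never runs on Python 3
    text_type = unicode    # noqa: F821
    binary_type = str
    string_types = (str, unicode)  # noqa: F821

def ensure_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string on both Python 2 and 3."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value

print(ensure_text(b"caf\xc3\xa9"))  # decoded to text
print(ensure_text("café"))          # already text; unchanged
```

six itself is still the better choice for real code, since it covers many more cases than this sketch.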
So you can look at it. Does anybody have any questions?

[Question about comparisons on Python 3] No, that works the same way in Python 2 and Python 3. Where you can get in trouble is if you try to compare a byte string and a Unicode string.

[Question about the two café strings] Yeah, I created them differently. I actually made the byte string and then printed it out to get the value to paste in. If you can enter those characters on your keyboard, it will probably work; it's possible that some software will normalize the text when you paste it, but I'm not actually aware of any that does.

[Question about databases] I don't have any one-size-fits-all advice. In my experience, most of the Python database driver libraries, the MySQL libraries, the MongoDB libraries, all do the right thing if you pass them Unicode strings, in Python 2 or Python 3. There are ways to configure the database so that all of the text fields are in Unicode or UTF-8; it varies what it's called, but basically the text is stored as UTF-8 bytes, and the drivers, when you pull values out of the database, return Unicode strings. So that handles that part of the sandwich for you. In terms of accepting input from users on a web form or whatever, you really have to be explicit and do the conversion; you want to make sure you're consistent in what you pass to your database. Does that help?
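Coming back to the café comparison from the normalization slides, here's roughly what it looks like with Python 3's unicodedata module:

```python
import unicodedata

# Same characters on screen, different code points underneath:
composed = "caf\u00e9"     # é as one code point (U+00E9)
decomposed = "cafe\u0301"  # e + COMBINING ACUTE ACCENT (U+0301)

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

# Normalizing both to NFC (canonical composition) makes them comparable:
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                  # True

# For something like usernames, normalize before comparing:
def same_username(a, b):
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_username(composed, decomposed))  # True
```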
Yes, so that generally happens when you have UTF-8 text and it gets interpreted as Latin-1, or vice versa. A lot of those garbage characters—I have the decimal values up here—a lot of times you'll get that 233, the é with the accent, because byte values in that range are the first byte of a lot of multi-byte UTF-8 sequences. I have lots of examples, not in the slides unfortunately, of that kind of mixed encoding that results in gibberish text. If you can be confident of the encoding that's coming in, you can generally do the right thing with what's given to you.

[Audience comment] If you start with Latin-1, it's always "valid" Latin-1. Even though you get all these funny characters in there, you'll have extra characters until you re-convert it back. So Latin-1 is often a halfway step to get data that's UTF-8 encoded back out of, say, a database. We've had to do that often, where data got put in wrong and you have to tell the database how to get it out.

Yeah, that's actually a really good point, and something I didn't mention in my slides. In Latin-1, Latin-2, Latin-3, and so on, almost any byte is valid; there are some holes in some encodings, though I think Latin-1 itself might actually cover every byte. So you can decode most byte strings as Latin-1 even if they aren't Latin-1, and that's a problem to keep in mind.

Any other questions? [Audience comment] You can double-encode UTF-8, and that'll cause you a lot of pain. Yeah, so don't just always encode to UTF-8 arbitrarily.

A few more things. In case you want to know, here are all of the Unicode characters I used, or a few of them. Fun fact, though: the thumbs-up and thumbs-down in my slides are actually images, because for whatever reason, when I'm in presentation mode it shows them as boxes. Who knows?
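A sketch of the round-trip repair described in that exchange, plus the errors= options mentioned earlier (the sample text is my own):

```python
# UTF-8 bytes wrongly decoded as Latin-1 produce classic mojibake:
original = "café"
utf8_bytes = original.encode("utf-8")    # b'caf\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")  # 'cafÃ©' -- every byte "works"

# Because Latin-1 assigns a character to every byte value, the bad
# decode never raises; it just produces garbage. That also means the
# damage is reversible, if you know exactly what happened:
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # café

# When you can't avoid bad bytes, decode() takes an errors= argument:
bad = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8
print(bad.decode("utf-8", errors="replace"))           # substitution char
print(bad.decode("utf-8", errors="ignore"))            # caf
print(bad.decode("utf-8", errors="backslashreplace"))  # caf\xe9
```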
And that's right, there are actually examples there of five-hex-digit code points, the ones with the leading "1". There are different things called planes of the Unicode character space, and all of the emoji are the 1F-something ones; all of those characters are in that plane. The characters in that one Latin-1 string, as you can see, are all basically 00-something characters in the Unicode code point space.

That is a great question, and I should have put this in my slides because it's probably very common: you can write backslash-u and then four hex digits in a string literal. That's another way—this was the question about how I got those two café strings. In place of any character, even an ASCII character, you can write \u and four hex digits. I think you can also do a capital \U with eight digits, which lets you do the higher ones.

[Question about byte 160] So, 160—it is a character in the Latin-1 encoding. Latin-1 is an ISO standard that assigns a bunch of byte values, mostly for Western European characters, and there's a lot of odd punctuation in there: the paragraph symbol, the fractions, things like that. They had all these byte values left over, so they filled them in with some common symbols.

[Question about the low bytes] Those are not printable characters. 127 is delete, and that's true in all of these encodings, but it's not a printable character. I had a script that generated this, and I didn't bother handling all of the exceptions.

[Question about identifiers in Python 3] So, the default encoding for Python 3 source code is UTF-8.
So you can have variables named with any valid code point. I've never actually seen it done; I guess if you're working on a project with people who aren't Americans and don't want to deal with English, you could probably get away with it.

One last thing I wanted to mention—this was originally the title of my presentation: Unicode lets you do some really crazy things. We have some text here: you can have characters with double strokes, you can have characters surrounded in bubbles, you can even do some absolutely crazy things that are almost readable. But don't do this. Please don't do this. It's good for a laugh, but please don't do this.

Correct—these are different Unicode code points. The lowercase "o" in the first one is a different Unicode code point from the lowercase "o" in all the other examples.

There is a lot there. I mentioned the unicodedata library because it's the one that can do the normalization, but there are a lot of properties for each Unicode character: what type of character is it—is it a letter, is it a number? Even in other writing systems, African languages, Asian languages, there are characters for numbers that are different from our Arabic numerals, but they have a number value, and the unicodedata library stores information about the numeric value of all those characters as well. And I think—I'd have to double-check—there's something that says the "o" with the circle around it is conceptually similar to the letter "o."

I know very little about fonts, but I do know that some computers, if you're using a font that doesn't have a glyph for a particular Unicode character, will go and find a different font on your computer that has it, in order to display it. But it's good to know that there are fonts that have all of them.
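The character properties mentioned above are exposed by the unicodedata module; a quick Python 3 tour (the sample characters are my own):

```python
import unicodedata

# Every code point has a general category: Ll = lowercase letter,
# Nd = decimal digit, and so on.
print(unicodedata.category("a"))       # Ll
print(unicodedata.category("5"))       # Nd
print(unicodedata.category("\u0665"))  # Nd (ARABIC-INDIC DIGIT FIVE)

# Digits from other writing systems still carry their numeric value:
print(unicodedata.numeric("\u0665"))   # 5.0
print(unicodedata.numeric("\u00bd"))   # 0.5 (VULGAR FRACTION ONE HALF)

# And every assigned code point has an official name:
print(unicodedata.name("\u00e9"))      # LATIN SMALL LETTER E WITH ACUTE
```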
Oh, I'm sure, yeah. That's really impressive.

The other thing that reminded me of: you'll run into Windows-1252 a lot. It's almost the same as Latin-1—Windows changed a handful of characters in the whole thing—and there's a Python codec that will get you there. I have a snippet of code that basically maps those bytes to their Unicode values; that's one way to do it. But you'll bump into that a lot, so that codec is nice to have as a go-to, because Windows-1252 is going to pop up a lot with users who copy and paste from Windows programs.

I do remember what I was going to say. The Unicode standard, the specification, basically has images for all of the characters—a representative glyph for each Unicode code point—but font support is mixed, obviously, and if you're designing a custom font for your cool little graphic thing, you probably don't need to put all of the Unicode code points in there.

Any other questions? A font is basically a collection of glyphs: a mapping between code points and glyphs for those code points. I don't know a lot about fonts.

Yeah, so there are ways to enter Unicode characters directly. I know on a Mac you can set up a special keyboard layout where you hold Alt and type the four hex digits. There's something on Windows too—it's the number pad on Windows, yeah, I remember that now. So there are different ways to enter Unicode characters directly from the keyboard. A lot of times, I'll be honest, I type the name of the character, find it online, and copy and paste it. It works pretty well, as much of a workaround as it is. I actually did write a script—in order to get these, I wrote code that, given a Unicode string, looks up and prints the names of all the code points. So there are ways to do that, too.

Any other questions? Yes.
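The escape and name-lookup tricks discussed here, sketched in Python 3:

```python
import unicodedata

# \uXXXX takes four hex digits; \UXXXXXXXX takes eight:
print("caf\u00e9")    # café
print("\U0001F44D")   # thumbs-up emoji

# \N{...} lets you use the official character name in a literal:
print("\N{SNOWMAN}")  # snowman character

# unicodedata maps both directions between names and characters:
print(unicodedata.name("\u00e9"))     # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.lookup("SNOWMAN"))  # the snowman again

# The "print the names of every code point in a string" script
# mentioned above is essentially a loop like this:
for ch in "café":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```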
That is the official name: "solidus" for the slash, and "reverse solidus" for the backslash. I don't believe so—I mean, that is basically the ASCII underscore.

One other thing I will mention: there are a lot of cases where the same, or a very similar, glyph is used for multiple characters. For example, the Greek capital omega is also used as the symbol for ohms, the unit of resistance, but there's a separate Unicode code point—I think it's called OHM SIGN—that looks exactly identical to the Greek capital omega. They're different code points, but they very often use the same glyph. That's another thing to be careful of, and normalization won't necessarily help you with those, so you're somewhat on your own.

Yep, that's a big issue with the Cyrillic characters. The Cyrillic alphabet has many characters that look identical to Latin characters, but because Unicode wants to have the entire Latin block and the entire Cyrillic block, it doesn't reuse the Latin code points for those characters; each script gets its own whole block. I feel like there's the Turkish dotless i, or something like that, too—same kind of thing. So that is a big issue, and I know the web browsers have been trying to figure out how to combat it.

Other questions? All right, thanks everybody.
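As a footnote to the look-alike discussion: the OHM SIGN actually has a canonical decomposition to capital omega, so NFC does fold that particular pair together, but the Cyrillic/Latin look-alikes really are distinct under every normalization form (a Python 3 check):

```python
import unicodedata

ohm = "\u2126"    # OHM SIGN
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

print(ohm == omega)                                # False: different code points
print(unicodedata.normalize("NFC", ohm) == omega)  # True: this pair IS normalized

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A, usually the same glyph

print(latin_a == cyrillic_a)                                 # False
print(unicodedata.normalize("NFKC", cyrillic_a) == latin_a)  # still False: no help
```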