And then I have my name, Wei — Slu Wei — but I work for US companies, so my name's reversed. The company's name is Quantcast. This is a talk that I give during our bi-weekly tech talk.

Anyway, so what is character encoding? Here's the dry definition: encoding is the process of putting a sequence of characters — letters, alphanumerics, underscores, symbols, whatever — into a specialized format for efficient transmission or storage. Decoding is the reverse process. I know you're not interested in that dry definition. Who knows what this is? [Audience: Morse code. It's SOS.] SOS, yes. So this is character encoding: in a way, we encode three characters into this God-knows-what signal.

Here are some more examples of character encoding — I'm trying to make it a bit more personal than the dry definition. The first example is Bacon's cipher. As you can see, it just maps the alphabet onto sequences of, in this case, five characters — every letter maps to five characters, here A's and B's. You can think of those as two different states, on and off, whatever. So in a way, this maps 26 letters into a binary representation.

Then there's also the Chinese telegraph code — interestingly, not invented by the Chinese. As an example, this Chinese character's code is 2-4-2-9, and the way to interpret it is that you find the character on page 24, row 2, column 9. This is because Chinese, as we know, has many, many characters — its equivalent of an alphabet is enormous. So encoding Chinese becomes a much more challenging problem, in that you need a huge codebook just to do the encoding in this scheme.

And Morse code, as we saw before. Now, how many of us here are software engineers or programmers? Right — so you must have heard of ASCII and Unicode, UTF-8, UTF-16, all those terms we throw around. Today we're going to go into a bit more depth on what those things are and what they mean for us.

Again, why do we even need to know about this? Why do we need character encoding? As humans, we express our languages in characters — hello is H-E-L-L-O — but computers don't get that. Computers only speak ones and zeros. Encoding is how we map our human language into the language that computers understand.

So let's talk about ASCII. ASCII stands for American Standard Code for Information Interchange. It's one of the earliest attempts at making this translation work, and it has the upside of being simple and working. It works great for Americans, and it's extensible to the rest of the world — sort of. Not really, but we'll see why.

It's simple because it's effectively just this table, where alphanumeric characters are mapped to numbers. And if we can map a character to a number, that number can be represented in binary form, which computers understand. It's convenient — you just look things up — and it covers basically all your keyboard characters, almost all of them. And all of it fits into seven bits. So how many bits are in a byte? [Audience: Eight.] Right — so conveniently, every character can be represented in a single byte, with one bit to spare. That's where the extensibility comes into the picture. Because of that one extra bit, people started asking: what's missing here? Anybody here speak a European language? [Audience: Yeah — accents.]
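To make the lookup-table idea concrete, here's a minimal Python sketch — my own illustration, not from the slides. The ord built-in does the character-to-number lookup, and every ASCII code fits in seven bits:

```python
# ASCII in a nutshell: each character is a number that fits in 7 bits,
# so it occupies a single byte with one bit to spare.
for ch in "Hello!":
    code = ord(ch)                        # character -> number (table lookup)
    print(ch, code, format(code, "07b"))  # the 7-bit binary form
```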
So then European people said: well, we'll use that extra bit to add them. That's exactly what the IBM PC did — they extended the table with those characters. But the problem is that a lot of people around the world had this idea at the same time. That's why you'd see an email arrive jumbled — you open it and just can't read it — because the two ends were using different extended-ASCII schemes with different extensions. Those were the early days. What ended up happening is that ASCII got extended differently in different regions, which is how Microsoft ended up with this thing called code pages — maybe you've heard of them, code pages. Basically, different regions have different tables. So if you write your resume in Korea and send it to Russia, and they open it up, not everything makes sense — some characters render, some don't. It was a big mess.

Essentially, the problem is that we don't just have English. We have many more languages to deal with, plus emojis — what's the world without emojis, right? But the computer still speaks only 0s and 1s. So how do we solve that problem? That's how we end up with Unicode. Unicode — fancy definitions aside — is essentially a better lookup table than ASCII, one that serves more than just Americans. I'll stick with that definition.

Right, and we can't mention Unicode without mentioning UTF-8. What's the difference between Unicode and UTF-8? UTF stands for Unicode Transformation Format. Think of Unicode as the lookup table — a gigantic lookup table. UTF-8 is a way to convert its code points into a binary representation. Here's an example from part of the Unicode lookup table: this eye-rolling emoji. A code point is just a fancy way of saying a number; Unicode maps this emoji to this number. That's all Unicode defines — which character maps to which number. UTF-8 is what translates that number into a binary format.

So why do we need UTF-8 at all? This code point is a hex number — why can't we just convert the hexadecimal number directly into binary and give the computer that? I'm asking. [Audience: To save space.] To save space. Most characters you're sending don't need more than 8 bits. Yes, this emoji has a very big code point, but you also have a, b, c — the alphanumerics — whose numbers are very small and can still be represented in one byte. So what's the problem with directly converting the code points into binary and storing that? [Audience: Size — you don't always want to use the same width. If it's a smaller number, use 1 byte; if it's a bigger number, use 4 bytes.] Exactly. That's the problem, and that's where UTF comes in: we need a way to tell the computer where one character's binary string ends and the next one begins. And as for just always using 4 bytes — that's a big waste of space, though there is an encoding scheme that does exactly that, and we'll look at it later.
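You can see the size trade-off with Python's built-in codecs — a quick sketch, with characters I picked purely as illustrations:

```python
# The same character costs different numbers of bytes in different schemes.
# A fixed 4-bytes-per-character scheme wastes space on small code points.
for ch in ["m", "Ж", "中", "🙄"]:
    print(ch, "U+%04X" % ord(ch),
          "utf-8:", len(ch.encode("utf-8")), "bytes,",
          "fixed 4-byte:", len(ch.encode("utf-32-be")), "bytes")
```

That fixed 4-byte column is the always-four-bytes scheme we'll put a name to at the end.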
So that's why you see these code points get converted into binary — and that binary corresponds to a hex value that's different from the original code point. Ever wonder how that conversion is done? We'll see. This part is a little bit mathy, so if you don't follow, you can zone out and not worry about it. But just in case you're interested, this is how it's done — I wondered about this question myself.

Here are three examples. Note that this is UTF-8, and UTF-8 can represent a single character in 1, 2, 3, or 4 bytes. The first example is the code point 6D. That's hex — the U+ prefix just indicates it's a Unicode code point. So this one is U+006D, which in binary is 1101101: 6 is 110, D is 1101. Then you look at this table of ranges — this is the lower end, this is the upper end — and the number falls within the first range. That means we can use just 1 byte to represent it. The binary fits right in, and that's the equivalent hex.

OK, that's the first example. The second one is 416. Again, same thing: convert hex to binary. This time you'll notice it exceeds the one-byte range, so we have to use two bytes. And how do we lay it out across the two bytes? As you can see, each byte has a predefined prefix — that's how the computer knows whether it's looking at a one-, two-, three-, or four-byte representation of a single character. You just take the payload bits and lay them into the open slots — going from the back is easier (I wish I had a pointer). If there's still space left in front, you pad with 0s. After combining the bits with the prefixes, that binary translates to the final hex. The last example is the same procedure for a three-byte conversion.

Now, looking at these three examples, compare the hex of the code point on this side against the hex of the encoded bytes on that side. What do you see? Anything interesting? [Audience: The last digits.] All right, what else?

[Building announcement: Attention, please. The fire alarm has been activated in the building. We are investigating the situation. Please remain calm and stand by for further instruction. Thank you.]

All right — what I was going to point out is that it's 6D here and 6D over there. This is the only case where the original hex representation matches the converted hex representation. And why is that? [Audience: Because you have a 0.] Correct, yes — because of the way it's translated, the one-byte form just puts a 0 in front of the seven payload bits. (Doesn't look like we have a real fire, so don't worry about it.)

Long story short: UTF-8 is backward compatible with ASCII, because ASCII only needs seven bits. Non-extended ASCII, that is — it's the original ASCII that's backward compatible. That's a big part of UTF-8's traction: if you have a system that speaks ASCII, it keeps working when UTF-8 comes along.

All right, so that's UTF-8. People have also heard of UTF-16. How do they compare? How do they differ? Well, they're both variable-length encodings: UTF-8 maps a character to 1, 2, 3, or 4 bytes, and UTF-16 maps to 2 or 4 bytes.
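Here's a hand-rolled sketch in Python of the prefix scheme just described, checked against Python's built-in encoder. U+006D and U+0416 are the talk's examples; U+4E2D (中) is my own stand-in for the three-byte case:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using UTF-8's byte prefixes."""
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

# U+006D stays 6D (the ASCII-compatible case); U+0416 becomes D0 96.
for cp in (0x006D, 0x0416, 0x4E2D):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")   # agrees with the built-in codec
    print("U+%04X -> %s" % (cp, encoded.hex(" ").upper()))
```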
So they don't always map a character to the same number of bytes — that's why they're called variable-length encodings, and that's what makes them storage-efficient compared to fixed-length encoding schemes. But they both require sequential access, because you need to scan through the sequence to find out where each character ends, which makes parallel decoding a bit hard if you ever want to do that.

What's important is that, contrary to popular belief, UTF-16 is not better because it can represent more. They actually represent exactly the same set of Unicode code points, so capability-wise they're equally good. [Audience: But if you were writing English you'd probably want to use UTF-8, and if you were writing Chinese you'd probably want UTF-16.] That's a very good point — we'll get into when to use which in a moment. [Audience: The capabilities are the same, but they're not interchangeable — valid UTF-8 isn't valid UTF-16, so you need the matching decoder to read it back.] Correct — but in terms of Unicode code points, they represent the same set, yeah.

Right, so UTF-8 has the advantage of being backward compatible with ASCII. Why is UTF-16 not backward compatible? Anybody? Exactly: ASCII always uses one byte per character, and UTF-16 uses a minimum of two bytes, so it can't possibly be backward compatible.

This is how the UTF-16 conversion works. I won't go into the details here because of time limitations, but just know that most code points convert straightforwardly — this part is pretty simple — and once you go beyond the FFFF code point, you have to use this other scheme.

Before we talk about when to use UTF-8 versus UTF-16, we have to talk about UTF-16 versus UCS-2. UCS-2 stands for 2-byte Universal Character Set. It's basically a limited version of UTF-16: a fixed-length encoding that always uses two bytes, representing only a subset of Unicode. Those two bytes use the same conversion algorithm as UTF-16's two-byte case — remember, UTF-16 is either two bytes or four. [Audience: What's the BMP?] Keep in mind that UCS-2 is a fixed-length encoding while UTF-16 is variable-length — remember the pros and cons we talked about for each. BMP stands for Basic Multilingual Plane. This plane contains the most basic characters of almost all modern languages. As you can see, most of it is occupied by this red-pinkish color — those are the CJK characters: Chinese, Japanese, and Korean. So the common Chinese characters are already encoded in this plane. And what is a plane? It's a contiguous group of 2 to the power of 16 code points — Unicode is chopped up into multiple planes, and the BMP is just one of them. It's mostly sufficient for representing almost all the common characters of all languages in today's world.

[Building announcement: Ladies and gentlemen, your attention, please. We have investigated the situation. There has been a false alarm. We apologize for any inconvenience caused. Thank you.]
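A small Python check of that boundary — my own illustration: anything at or below U+FFFF is in the BMP and fits in one 16-bit unit (the UCS-2 subset), while anything above needs two units in UTF-16:

```python
# Code points up to U+FFFF (the BMP) take one 16-bit unit; beyond that,
# UTF-16 spends two units (a surrogate pair), which UCS-2 cannot express.
for ch in ["m", "中", "🙄"]:
    units = len(ch.encode("utf-16-be")) // 2   # number of 16-bit code units
    print(ch, "U+%04X" % ord(ch),
          "in BMP:", ord(ch) <= 0xFFFF,
          "UTF-16 units:", units)
```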
OK, so we have five more minutes. Let's talk about this: when should we use what? What do you think? [Audience: Well, I'd use UTF-16 when I had to interoperate with nasty old Windows software, but otherwise I'd use UTF-8. I think the Unicode people recommend UTF-8 for everything — never bother with the other stuff.] If you're sending data over the network and all that, UTF-8 is definitely preferred. But remember, UTF-8 is a variable-length encoding, so it's slower to process — you can't decode the stream in parallel. This is where I use the flip-table emoji. So there is still an application for UCS-2. [Audience: Some languages base their in-memory strings on UCS-2, since every character is a fixed two bytes — I think Java does. But don't you still want to be compatible with UTF-16?]

You made a very good point. In memory, you want processing to be efficient. That's why some programming languages say UTF-16 but in reality mostly behave like UCS-2 — not because it's more memory efficient, but because fixed-length units are faster to process for the same amount of string. UTF-8, meanwhile, is great for network and storage, because it packs strings very efficiently, and the smaller the payload, the faster it gets transmitted. That's just something to bear in mind when you choose an encoding scheme. And don't always assume everything is in UTF-8 — that's another mistake. The whole point of this talk is that there are various encoding schemes out there, so when you store a file or transmit a file, always, always specify the encoding. That way, the other end will understand you.

If you look at programming languages: Java and JavaScript both use UTF-16 — their characters are backed by UTF-16. Python? I did my research, and to date I still don't quite know. Python has several internal character representations, and I'd bet they're backed by different encodings as well. But with Python 2, just know that it defaults to ASCII, which is bad — so if you're still using Python 2, remember to specify the encoding when you read a file. Python 3 defaults to UTF-8, which is much more sensible.

There's also UTF-32, a fixed-length encoding scheme that always uses four bytes. It's fast and simple like ASCII, and it's capable of mapping every code point in the Unicode definition. But it is very space... [Audience: Inefficient.] Inefficient, yes.

Right — so if you forget everything I said today, just remember: plain text does not exist. Don't ever say "I have plain text, it's ASCII, every character is 8 bits," because that's just false; it really depends on your encoding scheme. And when you communicate between systems, always remember to specify your encoding. Email and HTTP both have a content type where you can specify it — do that.
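In Python, "always specify the encoding" looks like this — a minimal sketch; resume.txt is just a hypothetical file name:

```python
# Name the encoding explicitly at every boundary; never rely on a default.
with open("resume.txt", "w", encoding="utf-8") as f:   # hypothetical file
    f.write("héllo 中 🙄")

with open("resume.txt", encoding="utf-8") as f:        # same scheme to read it back
    print(f.read())

# The same idea on the wire: declare it in the content type, e.g.
#   Content-Type: text/plain; charset=utf-8
```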
That's it. Questions? Yes? ... No questions, then.