So, I'm going to read from the Stack Overflow question where I was originally trying to figure this out, then get into the various things I came across that helped me figure it out, and then finally show a proper implementation of UTF-8 on Windows. It is my understanding that by default, Character is Latin-1, Wide_Character is UCS-2, and Wide_Wide_Character is UCS-4. Now for those unfamiliar, a quick little primer: Unicode has two major encoding schemes. The older one, which I'm mentioning here, is UCS, which is, I believe, the Universal Character Set; it's just sort of a flat mapping. With UCS-2, it's two bytes per character, and with UCS-4, it's four bytes per character. The encoding scheme that most people use now is UTF, the Unicode Transformation Format, and you can kind of think of it like a compression scheme for the Unicode code points. The code points that fall in the Latin-1 or ASCII range only require one byte, whereas the characters towards the end may require four bytes, so it's more flexible and as a result saves space. But GNAT can have a specified pragma Wide_Character_Encoding of UTF-8, or the -gnatW8 compiler flag (an uppercase W and an 8), and then those characters and their strings will be UTF-8 encoded instead. At least on Linux and FreeBSD, the results fit my expectations, and they should: both Linux and FreeBSD underwent projects to convert the entire system to UTF-8 streams, and because everything is unified around that, everything just works with UTF-8. But on Windows, the results are odd. For either the wide or wide-wide variants, once a character moves beyond the ASCII set, I get a garbled mess. I believe this is called mojibake. So I figured it was a code page issue; after all, the default code page in Windows, and therefore what the console host loads with, is 437, which isn't the UTF-8 code page.
So then I put in the command for changing it to the UTF-8 code page, chcp 65001, and now instead of the mess of extra characters, there's an immediate exception. It raised Ada.IO_Exceptions.Device_Error, and the file it occurred in happens to be the Ada.Wide_Wide_Text_IO package. So I navigated to the specific line where this was occurring, and it seems to be within the putc binding to fputc. This is extremely important, far more important than I realized at the time of this writing. We'll get to that; this turns out to be about half of the core of the problem. This is a huge fuck-up. But this is standard output; shouldn't an EOF never happen? My understanding at the time of this writing was that this didn't make sense. A stream, especially an output stream, should never hit an end of file, because you can't seek it and you can't read from it; you're constantly appending new stuff to the end, so you should never get an end of file. As it turns out, and we'll get to this, along with fputc and how using it here is wrong, EOF is sort of an overloaded error and happens to have special meaning when dealing with standard output. So then, because I assumed at the time of this writing that I was misunderstanding something or not doing something properly, I asked: is there some kind of special consideration Windows needs? How can I get UTF-8 output? Some people asked questions that I needed to elaborate on, so I added: I tried piping the output into a text file, and the supposedly UTF-8 encoded program still generates mojibake in the file. Not sure why this immediately threw an exception in the console, though. Then I tried directly opening and writing to a file instead of the console or going through a pipe. Oddly, this works exactly as it should; the text is completely correct. I've never seen this kind of behavior in any other language, so it should still be possible to get proper UTF-8 at the console, right?
Now, as it turns out, that was actually a case of me having a very biased sample: with the exception of Ada, basically every programming language I use is very Windows-centric. They're still fundamentally cross-platform, but they focus more on Windows than on other systems, and so, unsurprisingly, they work properly on Windows. As far as general programming languages go, most of them apparently do not work properly under Windows and replicate this same exact behavior, right down to the same mojibake. So let me show off the fix first, just to show that I did get this fixed, and then we'll delve into exactly what the learning process was. The deficiency that so many others, not just here, describe in the Windows console host has either been fixed or never existed in the first place. As it turns out, it was fixed a long time ago; we'll get to that. Based on this document, which we will cover, I feel it was probably always very misunderstood. Windows doesn't treat the console like files, and it's easy to fall into that trap. This is very straightforward code, along with what one needs and expects behind the scenes. And you can see that, other than using this Wide_Console package instead of Wide_Text_IO, this code is exactly identical. Superficially it behaves correctly, so let me explain a little bit about the test cases. "Hello" obviously fits right into Latin-1, or into ASCII, and as a result is guaranteed to print properly as long as the code page isn't set to Russian or something else, in which case the Russian would output properly but the Latin would not. So that was there to show that I'm not doing any code page trickery. Then we have "Привет", which is Russian. For this to work, you'd either have to change the code page to Russian, in which case, as I just said, the "Hello" would not print correctly, or be using a proper Unicode format. And then we also have some runes.
The runes are there because they're so deep into the Unicode character set that if they print out correctly, it's definitely working correctly. There is no code page for runes, so the only way to correctly print them is to be using the UTF-16 little-endian or UTF-16 big-endian code pages, which are not available to native code (those are only available to .NET code, so Ada would not be able to use them), or to use the UTF-8 code page, which is available to native code. Now, I happened to be running this console instance through Visual Studio Code. It's still PowerShell; it just happens to be running under the Visual Studio Code IDE, or rather, it's not really an IDE, it's a special editor, but it was running through that. That had to do with VS Code having font substitution built in. Fonts are a problem that affects Windows, Linux, and other systems alike; not everything has been updated to do font substitution yet. So I just used something that does, because while it's very easy to find a font that includes both Latin and Cyrillic characters, it is not easy to find one that includes Latin, Cyrillic, and Runic characters. With font substitution, it can pull the necessary symbols from a different font and therefore print out everything. And you can see that they all did print out. So clearly there is a way to do this, and it doesn't require any setup by the end programmer. So let's get into this. I'm not going to read from this, but I will have a link down in the video description, and I encourage you to read through it. I think he has another document where he goes into some additional detail that I would definitely also recommend reading. But this is "Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?" by Michael S. Kaplan. This was the document that got me starting to think that what I was being told about Unicode output on Windows was wrong. And there's one part of this I've got to find.
I think it's down here that I want to point out, and that's the only part I'm going to point out at all. Maybe it was up here, actually. Okay: "And the CRT, starting in 2005, it knows more about Unicode than any of us have been giving credit for." Essentially, this problem people continue to attribute to Windows has been fixed since 14 years ago. Now, it's obviously not fair to criticize Linux for problems from 14 years ago; it's obviously not fair to criticize FreeBSD for problems from 14 years ago; and it's equally not fair to criticize Windows for problems from 14 years ago. Windows has come a long way since XP, and yeah, this has worked correctly for quite a while. So let's get into the Microsoft documentation, because, you see, there are two functions here: fputc and fputwc. I told you this was going to be important. So let's see what it has to say. For the return value: each of these functions returns the character written. For fputc, a return value of EOF indicates an error; for fputwc, a return value of WEOF indicates an error. If stream is NULL, these functions invoke the invalid parameter handler, as described in Parameter Validation. If execution is allowed to continue, they return EOF and set errno to EINVAL. EOF, not WEOF. Remember, the exception was coming out of fputc, not fputwc. Everything matches up: fputc is clearly being used. So it gets interesting with the remarks. Each of these functions writes the single character c to a file at the position indicated by the associated file position indicator, if defined, and advances the indicator as appropriate. In the case of fputc and fputwc, the file is associated with stream. If the file cannot support positioning requests or was opened in append mode, the character is appended to the end of the stream. Okay, everything's standard there. The two functions behave identically if the stream is opened in ANSI mode. fputc does not currently support output into a Unicode stream. Let me repeat that.
fputc does not currently support output into a Unicode stream. And in the routine-specific remarks for fputwc: wide-character version of fputc; writes c as a multibyte character or a wide character according to whether stream is opened in text mode or binary mode. The Microsoft documents make it very clear that for wide-character output, Windows supports all of it: UCS-2, UCS-4, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. Whether or not that's a good idea is up for debate. I understand their reasons for it, and I understand the reasons why Linux and other Unix systems would want to simplify and unify on one encoding. Both have their merits; I don't want to get into that debate. But it is very clear that for any kind of Unicode output, you need to be using fputwc. It is also very clear that fputwc still does the correct thing if the stream is opened in ANSI mode: it still outputs correct ANSI text. So you can safely, blindly put fputwc into your code, and it will do the correct thing when not doing Unicode. fputc, on the other hand, will not do the correct thing when doing Unicode. Now, back to where I was showing off that this can actually work properly. I'll back out and show you just the specification. Actually, real quick: there are three variants of this package. They all contain the exact same stuff; the only difference is whether they work with Character and String, Wide_Character and Wide_String, or Wide_Wide_Character and Wide_Wide_String. They're in separate packages just because otherwise the downstream developer would have to deal with the collisions and have to specify which character type they mean, and that's just unpleasant for them. So instead, the implementation is split, just like Ada.Text_IO, Ada.Wide_Text_IO, and Ada.Wide_Wide_Text_IO are split. Essentially, these are specialized wrappers around those packages.
So, for example, Wide_Console is a specialized wrapper around Wide_Text_IO that specifically only deals with the standard input, standard output, and standard error streams. And because it knows it's working with them and only them, because it knows it's not working with a file, it can do the correct setup. So we have a bunch of wrappers for Get and Put and New_Line and Put_Line, plus some additional stuff like moving the cursor around and the bell. I don't have a lot of the unique console stuff supported, just those, because these are primarily what I'm interested in. But let's get into this. In the Unix version, because, again, most of these systems have completely unified their text streams, we don't need to do anything special; these can directly map to the Ada.Text_IO stuff. However, inside the Windows version is where it gets interesting. You can see that what I'm actually doing is making calls to fputwc. I set that up somewhere, or did I set that up in the spec? Where did I set that up? Okay, either way, I'm making calls to fputwc, so each of these correctly maps to the appropriate Windows function. Furthermore, as I said in the Stack Overflow post, the Windows console is not a file, and it's an error to treat it as a file. Or rather, the standard input, standard output, and standard error streams on Windows are not files, and it's an error to treat them as files. They're really more like objects that need to be instantiated or adjusted on Windows, and part of that is associating each stream with a character encoding. These three calls do exactly that: we set standard input to UTF-8 text, we set standard output to UTF-8 text, and we set standard error to UTF-8 text. That's all that needs to be done for correct output. And because this package wraps that behavior and hides it away, what's exposed is exactly the same interface.
You could quite literally drop the console packages I've written into any Ada project in place of the Text_IO packages, and they should just work. If they don't, that's an error on my part, because they are superficially replicating that behavior while doing the correct thing under the hood. Now, I have tried to get this fixed since I discovered this was actually an error on their part. The GCC developers have told me that I'm mistaken. The GNAT developers have told me that I'm mistaken. Two AdaCore employees have told me that I'm mistaken. And yet you can clearly see this working right. If you believe these pictures are doctored, guess what: the console package is open source. I encourage you to download it and use it. See for yourself; it works correctly. I'm not mistaken, and Microsoft's own documents are not mistaken about how Windows works. There are just a lot of developers who don't understand Windows. So that's it for this part. I'm not sure what topic I'll be covering next, but until then, have a good one.