Good morning, afternoon, or evening, ladies and gentlemen. My name is JeanHeyd Meneide, and I am a consultant software developer from Shepherd's Oasis. The thing I'm here to talk to you about today is the fundamental underpinnings of a text encoding API that's being proposed for standardization in C++, but that I'm also developing independently alongside Shepherd's Oasis, and how that new API is going to enable us to make some significant and powerful changes to the way that we handle text in C++ in general.

The previous work that we had was both part one and part two. Part one was given at CppCon in 2019; part two was done at Meeting C++ in 2019. Part one gives a very broad overview of the entire design space, where we pulled inspiration from, and who we were working with. Part two is a closer look into some very specific points of the API: how to make it scale, how to make it work with type-erased encodings so that we could support use cases like that, and also how to use things like error handling to do more powerful things in the API. But now we're going to talk about some of the basis operations, because it's been asked a couple of times what the basis operations are and how we should be handling them. So let's take a look into those.

First we're going to talk about some constraints. If you watched the previous talks, some of these constraints are going to be repetitive, but basically: char is bad, and char is bad for a lot of different reasons. The first problem with char is that it has a fundamental issue: what is its encoding?
You don't necessarily know. The minute you interact with the system, the minute you interact with the C API, you don't really know whether you're actually getting proper UTF-8 or Windows-1252 or some other thing. So the real fundamental improvement that we want to make here is trying to fix this problem of what the encoding of stuff is, because right now people don't know what the encoding is, they just make assumptions, and the assumptions break down a lot.

We also know that wchar_t is bad, and it's a real dead end. It's UTF-16 on Windows, except when you use it with the standard library, in which case it will cut your surrogate pairs in half or ignore them entirely. So you end up with something closer to the 1990s UCS-2, where 16 bits was the maximum size of a character; if you have more than 16 bits, they just say, well, we don't need that. And so it'll get mangled today, when you really want to be using UTF-16. We have UTF-32 on POSIX machines, which is nice, except if it's an IBM machine you're working on, and then you get UTF-16 on 32-bit machines and UTF-32 on 64-bit machines. And then you get none of the above if you're on a Chinese- or Japanese-based locale on any of those machines. So that's incredibly unfortunate, and it's just the way the cookie crumbles here.

The other thing that's bad is char16_t and char32_t, particularly because there is a define in the C standard that says: if __STDC_UTF_16__ or __STDC_UTF_32__ is defined, then it's UTF-16 or UTF-32. And otherwise, well, it's just one big shoulder shrug. What exactly is the encoding in char16_t and char32_t at that point?
Who knows? There's not really a way to query it, figure it out, or get an answer. You just have to consult your documentation and pray that the developer of your compiler and your environment isn't a jerk.

Except not anymore. Thanks to some influential work done by RMF, we have now mandated that char16_t and char32_t will be UTF-16 and UTF-32. This goes for C++20 onward. Thankfully, we didn't find any C++ implementations that differed from this behavior, and we had kind of strongly implied it with the wording. So we get the benefit that char16_t and char32_t will be UTF-16 and UTF-32 from C++20 onward. And even if you're not working with C++20, you still get the semi-de-facto guarantee: again, we haven't found a compiler author who's done the wrong thing and tried to fit some weird encoding in char16_t or char32_t. But there's always room for a special C++ implementation, right?

So let's talk a little bit about the general API support for C and C++. What does the standard give you to go from one encoding to another in a way that works, doesn't break, and isn't fragile? This slide is intentionally blank. Why? Because there is no support.
It's a garbage chute. Everything about it ends up in the dumpster. Every API design is bad for a wide variety of reasons, whether it's the locale-based codecvt stuff, or the C APIs, which have real problems with outputting multiple different characters (I had to have a defect report filed on some of the functions, but they're still not proper for other functions, and it's a mess), or the lack of wide character support for getting in and out of the wide character encoding to the normal encoding. It's basically a nightmare in every single aspect of C and C++, and that nightmare has infected the world, because every single low-level or embedded device that's built on top of these small C APIs and small C standard libraries shipped by vendors has very much left people in the dark. So you have cash registers and other devices that really just can't stand up to the needs of their users. It's a real shame.

It also causes a lot of frustration. I get emails semi-regularly about this commit specifically, but also other things. This commit has some really bad ableist language in it, but it appropriately captures the frustration of locales, of having locale-based encodings, and the nightmare that it is to work with C and, by proxy, C++.

And so there are a lot of things people want to do. People say: just fix it. The first thing people always say is that char should just be UTF-8. Just make char UTF-8, mandate that in the standard, get rid of all these locale encodings; nobody cares about all these old systems. Just do it. People at committee meetings, and everybody who didn't fully read UTF-8 Everywhere, right?
They always come to me like: why isn't char UTF-8? Come on. Yes, the signedness of char is completely implementation-defined, so it could be signed or unsigned, but I pass my compiler flag and my chars are of unsigned type. So I can just use UTF-8 and know the math is correct, there's no bad overflow or underflow, and it does exactly what I want. So why don't we just do that?

Here's my hot take: we won't just do it. We don't do that, and we can't do that, for a lot of different reasons. A lot of the problems come up as: oh, why does my string contain garbage? And that's because a lot of people run on the assumption that char is totally UTF-8. They bring that assumption all over the code base, and then garbage happens when they use the C API or interact with the environment directly, and they just lose, because that's not what it is.

Using char for both the system encoding and UTF-8 is wrong. It's flagrantly wrong, because it'll always make the wrong choice at some point. Someone's going to port some software, do some thing, and they're not going to be thinking about UTF-8 or whatever else. They're just going to be using strings, and that code is going to be wrong in that environment. So you have to go with the assumption that char will not always be UTF-8. We hear about it in the committee, and other proponents say: excuse me, I can enforce it in my code, in my pristine, beautiful code base that I control completely. There's no way that it could possibly be anything but UTF-8, right?
And that's what they say, and then this happens, and they lose the game. They always lose the game because they don't control the environment, as much as they would love to. Whether it's the char* argv, or the data that comes over the wire, or the C API that doesn't really care and just generates the data in whatever your locale-based encoding is, you end up losing. There's really no way around this fundamental fact that the environment has already been thoroughly poisoned by locale-based encodings, and there's not much that we can do about it.

You also have to remember that you're not Google. You're not Bloomberg, you're not Facebook, you're not Microsoft. You're not some big tech company. I mean, some of you listening might be, but you don't own the entire tech stack. You don't own the user's locale, you can't tell them what to do with it, and you are a piece in a much bigger pie. So you can't just say everything's going to be UTF-8 and that's just the way it's going to be, because you don't own it. When Microsoft gets a machine (if you're working in Global Foundation Services or whatever else), they control it from the minute that machine gets hooked up to their racks and their data centers all the way out to the time that they send you something. They control every single part of the stack: the operating system and everything else. Google, similar deal. Bloomberg, similar deal. Facebook, not exactly the same deal, but again, for server-based things they can control everything. Part of their spin-up is that UTF-8 is applied as the locale, et cetera, and they verify this with a check. But you can't do that as the end user. And remember that the C++ standard is for everybody, right?
Not just for the big tech companies. So you need to remember that you are a piece of a much bigger pie. Please, please, please don't forget that. For as much as you'd like to rage against the machine and say everything should be UTF-8, please don't forget that you are a piece of a much bigger pie, that you fit in with the legacy, and that not everybody can afford to have your crazy, super awesome Kubernetes setup that makes everything wonderful.

So, now that we've gotten past some of the constraints and issues that we have, let's talk about encoding objects and how they are useful. I've talked about this before, but I'm going to give a quick run-through again of what an encoding object is. Basically, it is at minimum a collection of: three type definitions (code_point, code_unit, and state); two static member variables, each just an integer, which tell you the maximum number of code points and the maximum number of code units that can be output by a single operation; and two single-step functions. You have an encode_one, which takes some code points and outputs some code units in the specified encoding, and a decode_one, which takes some code units of that encoding and outputs some code points, typically your Unicode (UTF-32) code points. And that's basically how that works.

And that's it. This is the hill that I'm going to die on. This is the rung that I'm going to hang my hat on; this is where I'm going to put my coat. This is all you need, period, point blank. You only need those seven things. That's it. That's lucky seven. You're set; you can build literally everything on top of that. Now, some of you are probably looking at me like: pardon me, is that true? I'm going to prove it to you. I'm going to prove to you that that's all you need to do everything. So here are some supporting structures, right?
So we have an empty struct. It's solely just an empty struct; it's got nothing, just two braces. We've got a byte span, which is just a span of std::byte, and a bunch of typedefs for spans of various types. So you have c_span, which is a char span; u8_span, which is a char8_t span; u16_span, char16_t; et cetera.

Then we have an encoding error type. It can be encoding_error::ok, which means everything's fine, and which is zero because that's what the APIs do. Invalid sequence means you tried to encode or decode something, but the bytes were just wrong. Incomplete input means we read everything you gave us and it was all correct, but you didn't finish giving it to us. So if you gave me two UTF-8 code units, two bytes, and I needed three to complete the smiley-face emoji you wanted me to make for you, then incomplete input is exactly what you get. Insufficient output space is the last one, where there just isn't enough space in the buffer you handed me. By default, we don't overflow your buffers. We tell you: no, there's not enough space to do what you want. And if any of these trigger, then we don't output any information at all into the final output buffer.

So, for encoding, here are some of the result types you get back. When you call these low-level functions like encode_one and decode_one, these are the kinds of things you get out. And they contain five pieces of information.
It contains the input, which is what remains of the input you put in, post decode or encode. We also give you the output, which is the rest of the buffer after the encode or decode. We use up some of the buffer for the output (your code points or code units), then we leave you with the rest of the buffer. We return you a reference to the state that you handed us. We return an encoding error, so if something does go wrong, if this is not equal to encoding_error::ok, you get to know about that. And we also have a Boolean for whether an error was handled. This is kind of important, because there are some error handlers that will insert replacement characters and other things, and then erase the encoding error to say everything's fine, it's okay. But you still want to know that an error happened and was handled anyway. If you did make replacements, but we scrubbed the encoding error code, you still want to know that that happened.
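To make the five pieces of information concrete, here is a minimal sketch of what such a result type could look like. Every name in it (`view`, `decode_result`, `failed`, the enumerator spellings) is an assumption made for illustration, and `view` is only a pointer-plus-length stand-in for a real span type; the slides may well spell things differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical spellings; the talk only describes the four outcomes.
enum class encoding_error : int {
    ok = 0,                    // everything is fine
    invalid_sequence,          // the units that were read are simply wrong
    incomplete_input,          // input ended before one full unit of information
    insufficient_output_space  // the output buffer handed in is too small
};

// Minimal stand-in for a span: a pointer plus a length.
template <typename T>
struct view {
    T* ptr;
    std::size_t len;
};

struct empty_struct {};

// The five pieces of information returned by one decode step.
template <typename Input, typename Output, typename State>
struct decode_result {
    view<const Input> input;    // what is left of the input after the step
    view<Output> output;        // the unwritten remainder of the output buffer
    State& state;               // reference to the state the caller handed in
    encoding_error error_code;  // ok unless something went wrong
    bool handled_error;         // true if a handler stepped in, even if it
                                // reset error_code back to ok
};

// Convenience check: did this step report an error?
template <typename Result>
bool failed(const Result& r) {
    return r.error_code != encoding_error::ok;
}
```

Keeping `handled_error` separate from `error_code` is what lets a replacement-character handler report success while still leaving a trace that something was repaired along the way.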
So that's what handled_error is for. And it's literally the same thing for both decoding and encoding; it just depends on what the input is and what the output is, and whether you're doing an encode or a decode. So for more result types, it's literally the same thing, except in this case it's for u8, and we're going to talk about what this u8 means in a second. You have an input of a char8_t span, an output of a char32_t span, and then everything's identical.

We have some error handlers, and I talk a lot about this in the part two presentation: what you can do with them, and the different ways you can do replacement characters, or find the first valid sequence and then insert a replacement character. I won't go into much detail here, but basically you have a function with a signature that takes the result by value and returns the result by value, takes the encoding that you gave it by const reference, and is also handed a span of any characters that were read but weren't used to produce the final value. This is helpful for things like input iterators. For example, if you're reading from std::cin with a std::istream_iterator, once we read a value and go forward, we can't really go backwards. So it's important that we give you any code points or code units that we read from the stream, instead of just losing them to time whenever an error happens. That's what these three parameters are. Again, we're not going to go too much into it; there are other talks that go into this in depth. This is just for the purpose of setup, to make sure that you can follow along.

So here's an example exotic encoding.
So there's actually an encoding called UTF-EBCDIC. You've probably never heard of it, and that's great. If you've never heard of it and you haven't used it, bless your soul.

So we have the lucky seven things here. We have our three typedefs. We have a code unit of char; when you're working on IBM machines and you're working with EBCDIC, it's just char. You have an output code point; it's char32_t, so we're outputting code points. Our state is just an empty struct, because there are no shift states or special sequences we need to calculate or do anything with. We have a max_code_points of one, which is the maximum number of code points a single decode_one operation can output. So when we call decode_one, we can only output one code point at most. And finally we have max_code_units, which in this case is six, because that's the maximum number of code units that can be output from a single operation. That's everything you're going to need as far as the types and variables are concerned.

Then for the functions you have an encode_one function and a decode_one function. The whole point is that for encode_one, you take in code points and it outputs the actual code units that are in the encoding; for decode_one, you take in code units and you output a sequence of code points, right?
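The seven-member shape is easier to see on a deliberately trivial encoding. The sketch below uses plain ASCII instead of UTF-EBCDIC so that it fits in a few lines; the flattened `step_result` (raw pointers and sizes, no state reference, no handled-error flag, no error handler parameter) is a simplification assumed here, not the result type from the slides.

```cpp
#include <cassert>
#include <cstddef>

enum class encoding_error { ok, invalid_sequence, incomplete_input, insufficient_output_space };

struct empty_struct {};

// Simplified result: what is left of the input, what is left of the output,
// and what (if anything) went wrong.
template <typename In, typename Out>
struct step_result {
    const In* input;
    std::size_t input_size;
    Out* output;
    std::size_t output_size;
    encoding_error error_code;
};

struct ascii {
    // 1-3: the three type definitions
    using code_unit = char;
    using code_point = char32_t;
    using state = empty_struct;
    // 4-5: the two static member variables
    static constexpr std::size_t max_code_points = 1;
    static constexpr std::size_t max_code_units = 1;

    // 6: code units in, at most max_code_points code points out
    static step_result<code_unit, code_point> decode_one(
        const code_unit* in, std::size_t in_size,
        code_point* out, std::size_t out_size, state&) {
        if (in_size == 0) return { in, in_size, out, out_size, encoding_error::ok };
        if (out_size == 0) return { in, in_size, out, out_size, encoding_error::insufficient_output_space };
        unsigned char unit = static_cast<unsigned char>(*in);
        // On error nothing is consumed and nothing is written.
        if (unit > 0x7F) return { in, in_size, out, out_size, encoding_error::invalid_sequence };
        *out = unit;
        return { in + 1, in_size - 1, out + 1, out_size - 1, encoding_error::ok };
    }

    // 7: code points in, at most max_code_units code units out
    static step_result<code_point, code_unit> encode_one(
        const code_point* in, std::size_t in_size,
        code_unit* out, std::size_t out_size, state&) {
        if (in_size == 0) return { in, in_size, out, out_size, encoding_error::ok };
        if (out_size == 0) return { in, in_size, out, out_size, encoding_error::insufficient_output_space };
        if (*in > 0x7F) return { in, in_size, out, out_size, encoding_error::invalid_sequence };
        *out = static_cast<code_unit>(*in);
        return { in + 1, in_size - 1, out + 1, out_size - 1, encoding_error::ok };
    }
};
```

A UTF-EBCDIC or UTF-8 object would keep exactly the same seven members and only change the constants and the bodies of the two functions.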
And so in this case you only get out one code point when you're decoding, but you can get from zero up to six code units from an encode_one call. That's basically how that works.

Here's a more common encoding. If we were to have a UTF-8, it's literally identical in almost every single way, except in this case max_code_units is four, because a single code point can only be expanded to four code units; that's the maximum expansion. The reason we have these max_code_points and max_code_units values on both UTF-EBCDIC and UTF-8 is that they let us know the maximum size of an output given any input, for any operation. And this is important for what we'll see later in this presentation, for memory usage and everything else.

I'm absolutely, deadly serious when I say that literally everything can be built out of these seven things. I can bulk encode, decode, and even transcode between A and B using just this interface. I can validate text using just this interface, and I can do counting: how many code points or code units will come out on the other side? I can also build some ranges on top of this. If I want to have a lazy range that doesn't bulk encode and take up memory, but I need to walk the code points one by one, I can create flexible ranges that take a fixed amount of memory and output code points that I can use for, say, FreeType or HarfBuzz, or maybe Pango or some other library. And then this all works.

Sort of. Almost. There's one other operation that can be added.
It's not required, but it can be added: decode_one_backwards and encode_one_backwards. Iterators are obtained from the encoding view and decoding view types, which are views that allow you to walk over a sequence of text and work with it. In order to be able to go backwards, you need to have one of these functions, because I can't synthesize a backwards operation from those seven. But this is a very rare case and it's not required. It's just good to know that if you want to go backwards over some text (which is rare; usually the only people who ask for reversed text are interviewers, honestly), these two functions are in support of that. But it's not part of the required core base.

I also want to be very specific about why encode_one and decode_one are the things we're using, rather than a bulk encode or bulk decode that doesn't output only one unit of information. The reason we do this is that it serves us at higher levels of abstraction. If we only output one unit of information and only consume one unit of information, it means I can predictably size my output buffer and know I have exactly enough to handle one complete unit of output. This is important when I want to do things like make ranges, or preserve certain memory constraints, or make it so that I never have an insufficient output error. If I have a range-based API that always greedily consumes the most information it can and outputs as much as it possibly can, then I end up in a really bad state where every single call I make to the API can have insufficient output as an error. By making it so I only output one unit of information, not only do I make the API less complicated for an end user to implement, right?
So if you were writing your own encoding and wanted to implement encode_one and decode_one, it's easier; but it also means that a class of errors never happens. It also means that I never have to do things like save state between encoding object calls. It also gives the end user access to the data, to do as they want with it. By not overly consuming and not having to store any extra state, I can enable people who have networking buffers and everything else to reuse their buffers, without requiring them to also cart around potentially stateless encoding types, which is very important.

Some of the standard encodings that we're going to get for C++23 are as follows. There is the encoding_scheme type, which allows you to take other encodings and apply an endianness to them. If you wanted UTF-16 little-endian, you can do that; if you wanted wide execution with a big-endian spin, you can do that; whatever else. It's just this generic scheme type. Then your concrete encodings are your ASCII, your narrow execution, and your wide execution. Narrow execution and wide execution correspond to char and wchar_t as defined by the locale in the library; that's why it's "execution". Then we have narrow literal and wide literal, which correspond to the assumed encoding that your compiler dumps out when you give it a string literal and say, put this string literal in my binary. When you're serializing, that's what narrow literal is, and it can be different from the actual narrow or wide execution encoding that ends up being run by your system. That's why those are two different things. And then we just have the typical UTF-8, 16, and 32. Typical basic stuff.
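As a rough illustration of what "applying an endianness" to an encoding means at the byte level, the sketch below serializes a single UTF-16 code unit in either byte order. It is not the proposed encoding_scheme type; `endian_order`, `put_utf16_unit`, and `get_utf16_unit` are hypothetical helper names invented for this example.

```cpp
#include <cassert>
#include <cstddef>

enum class endian_order { little, big };  // hypothetical name

// Write one UTF-16 code unit as two bytes in the requested byte order.
inline void put_utf16_unit(char16_t unit, endian_order order, unsigned char* out) {
    unsigned char low = static_cast<unsigned char>(unit & 0xFF);
    unsigned char high = static_cast<unsigned char>((unit >> 8) & 0xFF);
    if (order == endian_order::little) {
        out[0] = low;   // least significant byte first
        out[1] = high;
    } else {
        out[0] = high;  // most significant byte first
        out[1] = low;
    }
}

// Read two bytes back into one UTF-16 code unit.
inline char16_t get_utf16_unit(const unsigned char* in, endian_order order) {
    unsigned int first = in[0];
    unsigned int second = in[1];
    unsigned int value = (order == endian_order::little)
        ? (first | (second << 8))
        : ((first << 8) | second);
    return static_cast<char16_t>(value);
}
```

A scheme wrapper in the spirit of the talk would presumably do this for every code unit that the wrapped encoding's encode_one produces, and the reverse before handing bytes to decode_one.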
We talked about this in part one of these presentations as well, if you want more information.

Now some of you are like: okay, listen, there are a lot more encodings. I've spent my time on the web. I'm from Japan; Shift JIS is still very prevalent. I'm in Russia; I have a bunch of other encodings that I really need to handle. There need to be a lot more encodings than this if you want me to use this with, say, my mail client, or to build out some of these other abstractions. And so, in late C++23 or perhaps early C++26, we do plan to provide the entire WHATWG suite of encodings. I say "we" as in we at Shepherd's Oasis do plan to provide all that, but we'll see how it goes with the committee.

There are also legacy encodings and code pages. For example, with Microsoft, if you look in their open-source STL, you can see that they have a wide variety of encodings baked in, and maybe as a vendor it's interesting to them to ship additional encodings, and so they can. There are all sorts of other types too; you can collect these encodings and provide them. But the really big point here is that you can make your own. There are no special tricks, no secrets, no special magic implementer-fu that you need to know here. Everything just comes from these seven different basis operations, and that is incredibly important. We're going to talk about why it's important going forward, and why you can build almost everything based on these seven operations.

So let's extrapolate some base operations. Let's do the math here. If I want to do transcoding, validation, counting, or a bunch of other stuff, right?
The idea here is that with those seven operations, that's everything I need. So let's talk a little about transcoding: going from one encoding to another. I have a simple idea. I have a from encoding and I have a to encoding. They're both encoding objects, and I have these encode_one and decode_one functions. What I postulate here is that if I have a common code point type between them, if they can represent all the same values (within reason), and if nothing errors during the decode step or the encode step, then I can always transcode. That's it. If this was a math book, there'd be "Theorem", and it'd be in that cool box with the italic text, with a simple idea, a from encoding, a to encoding, the upside-down sigma signs, and all that other cool stuff you find in math books. But the idea is that our theorem is that we can always transcode if these hold.

This is the diagram from the part one talk, the CppCon 2019 talk. As you can see here, we have this idea that you can take an encoded single input. I can decode that to a Unicode code point. I can take that Unicode code point and encode it, and from there I have an encoded single output. Through this loop, this four-step dance we do, we get access to every single encoding, as long as they have a common code point type. In this case the common code point type is, 99.999 percent of the time, UTF-32: we get Unicode code points out, and it works. So this is the picture form of the theorem. It's probably not math-book ready, but it gets the job done. So let's do a little bit of setup here, right?
So we have this transcode type, and we're going from UTF-EBCDIC to UTF-8. In this case we have a c_span of input and a u8_span of output, the UTF-8 output. We have a from state and a to state, which again are just empty structs, because there's no state for UTF-EBCDIC or UTF-8. And we track the encoding error and the handled error. That's just how that works.

So let's take a look at whether this holds. Let's implement transcode. In this case, I'm being simple. We have the handler type; it's just a default text handler. It works. We get the encoding state of UTF-EBCDIC; that's our from state. We have the encoding state of UTF-8; that's our to state. And then we have this wonderful, wonderful in_between_t, which is just the encoding code point of UTF-EBCDIC.

In between some of these lines there are going to be some static asserts that confirm that the code point types are indeed compatible. This is implemented in the library, but it actually outputs a really big message, so I can't fit that on the slide; you'll just have to look at this. So we have our encoding code point, our in-between type, and now we can make a buffer of it, just a plain C array, sized to the maximum code points of UTF-EBCDIC. UTF-EBCDIC can output at maximum one code point, so we have a buffer that's big enough to handle all of the output from UTF-EBCDIC. Then we create a span over that buffer. You want to view the whole buffer, so we take a span of the intermediate buffer. And there we go. We got our array.
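The setup just described (two states, the in-between type, the compatibility check, and the fixed-size intermediate buffer) can be sketched on its own, without any encoding logic. The two encoding structs below are hypothetical stand-ins that carry only the type-level members; `unsigned char` stands in for `char8_t` so the sketch does not require C++20, and `transcode_setup` is a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

struct empty_struct {};

// Stand-ins for the slide's encodings: only the five type-level members are
// shown, encode_one/decode_one are left out of this sketch.
struct utf_ebcdic_like {
    using code_unit = char;
    using code_point = char32_t;
    using state = empty_struct;
    static constexpr std::size_t max_code_points = 1;
    static constexpr std::size_t max_code_units = 6;  // the figure given in the talk
};

struct utf8_like {
    using code_unit = unsigned char;  // char8_t in C++20
    using code_point = char32_t;
    using state = empty_struct;
    static constexpr std::size_t max_code_points = 1;
    static constexpr std::size_t max_code_units = 4;
};

template <typename From, typename To>
struct transcode_setup {
    // The check the talk says lives "in between the lines" of the slide.
    static_assert(std::is_same<typename From::code_point, typename To::code_point>::value,
        "transcoding needs a common code point type");

    using in_between_t = typename From::code_point;

    typename From::state from_state{};
    typename To::state to_state{};
    // Big enough for everything a single decode_one call can produce.
    in_between_t intermediate[From::max_code_points]{};
};
```

Because the buffer is sized from `max_code_points`, the decode step can never run out of intermediate space, which is the payoff of the one-unit-at-a-time design discussed earlier.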
It's all set. It's all cool. Perfect. Then we begin a for loop. In this for loop we just use the double semicolon to mean that we're going to run forever, until we reach our stop conditions. And here is the basic idea. If I have a from encoding, and I call decode_one on the input into the intermediate, and I pass the handler and the necessary state variable, that result will get me decoded code points. I'll have code points in my hands, in the intermediate buffer. I fix up the input after I do that by moving the from result's input back into the input variable, so the input is updated. Then I check whether the error code of the result is not equal to ok. If it's not, then we bail: we give our current input (how much we read), the output, the error code that we got, and then all the other information, like the state and everything else.

But if there is no error, then we compute the used portion. This looks kind of weird: we're calling data on the intermediate span, and then we're getting the from result's output, which is the same type as intermediate (it's another span), and calling data on that too. The way this works is very simple. The first row is our current intermediate; it's the actual span. The second row is our second span, which comes from the from result's output. What this means is that we are measuring from the beginning of the intermediate to where we stopped, at the from result's output. It always writes into the output range and then stops when it's done writing things. So we're just measuring the distance between those two, and that gives us the used portion of the data, right?
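The "used" measurement boils down to subtracting two pointers into the same buffer. A minimal sketch, with `used_count` as a hypothetical helper name and raw pointers standing in for the two spans' `data()` calls:

```cpp
#include <cassert>
#include <cstddef>

// intermediate_begin:     start of the scratch buffer handed to decode_one
// remaining_output_begin: start of the unwritten remainder that decode_one
//                         hands back in its result
// Both must point into the same buffer, with the second at or after the first.
template <typename T>
std::size_t used_count(const T* intermediate_begin, const T* remaining_output_begin) {
    return static_cast<std::size_t>(remaining_output_begin - intermediate_begin);
}
```

With real spans, the same number would be used to take the first `used_count(...)` elements of the intermediate buffer and pass them on to encode_one.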
And that's what we're getting. That's what we're gunning for here. Now, from there, we do the second half of the operation, which is encoding into new code units. We take our used span that we computed by doing all this, that blue-marked part, and then we give the original output that we're going to write into, the handler, and the state. We update the output, and then we check whether the to result's error code is wrong. If it's wrong, then we move the output and return the error code and whatever else. But if we succeed, we check whether the input is empty. If it is, we stop; otherwise, we loop back and start all over again. Until we break or return an error, we keep going, until we can finally return the input and the output as they are. The input and output here represent data that hasn't been touched yet. We incremented forward all the way, and then we hand it back to you to say: this is the part of the input span, or the output, that hasn't been touched yet. That's how that works.

And, well, that's it. This whole loop that I just described to you is the entirety of it. We just implemented transcode between two different encodings, by calling these defined functions on the encoding objects. That is literally all you need to do transcoding. And if you just replace the specific hard-coding of UTF-EBCDIC and UTF-8 here, you can do this between any two encodings, as long as the code point types are common. That is what, as I've just proven to you, is possible with this API. So let's move on from that and talk about something else, right?
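The whole loop can be put together in one small, compilable sketch. It is written under several assumptions that are not from the slides: Latin-1 takes the place of UTF-EBCDIC on the "from" side to keep the decoder short, each encoding only implements the one direction the loop needs, raw pointers and sizes stand in for spans, `unsigned char` stands in for `char8_t`, and there is no error handler parameter or handled-error flag.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

enum class encoding_error { ok, invalid_sequence, incomplete_input, insufficient_output_space };
struct empty_struct {};

template <typename In, typename Out>
struct result {
    const In* input;  std::size_t input_size;   // input not yet consumed
    Out* output;      std::size_t output_size;  // output not yet written
    encoding_error error_code;
};

// "From" side: Latin-1, where every byte is exactly one code point.
struct latin1 {
    using code_unit = char;
    using code_point = char32_t;
    using state = empty_struct;
    static constexpr std::size_t max_code_points = 1;
    static constexpr std::size_t max_code_units = 1;
    static result<code_unit, code_point> decode_one(const code_unit* in, std::size_t n,
        code_point* out, std::size_t m, state&) {
        if (n == 0) return { in, n, out, m, encoding_error::ok };
        if (m == 0) return { in, n, out, m, encoding_error::insufficient_output_space };
        *out = static_cast<unsigned char>(*in);
        return { in + 1, n - 1, out + 1, m - 1, encoding_error::ok };
    }
};

// "To" side: UTF-8, at most four code units per code point.
struct utf8 {
    using code_unit = unsigned char;  // char8_t in C++20
    using code_point = char32_t;
    using state = empty_struct;
    static constexpr std::size_t max_code_points = 1;
    static constexpr std::size_t max_code_units = 4;
    static result<code_point, code_unit> encode_one(const code_point* in, std::size_t n,
        code_unit* out, std::size_t m, state&) {
        if (n == 0) return { in, n, out, m, encoding_error::ok };
        char32_t c = *in;
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return { in, n, out, m, encoding_error::invalid_sequence };
        std::size_t len = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (m < len) return { in, n, out, m, encoding_error::insufficient_output_space };
        if (len == 1) {
            out[0] = static_cast<code_unit>(c);
        } else {
            static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
            out[0] = static_cast<code_unit>(lead[len] | (c >> (6 * (len - 1))));
            for (std::size_t i = 1; i < len; ++i)  // continuation bytes, 6 bits each
                out[i] = static_cast<code_unit>(0x80 | ((c >> (6 * (len - 1 - i))) & 0x3F));
        }
        return { in + 1, n - 1, out + len, m - len, encoding_error::ok };
    }
};

template <typename From, typename To>
result<typename From::code_unit, typename To::code_unit> transcode(
    const typename From::code_unit* in, std::size_t in_size,
    typename To::code_unit* out, std::size_t out_size) {
    static_assert(std::is_same<typename From::code_point, typename To::code_point>::value,
        "the two encodings need a common code point type");
    typename From::state from_state{};
    typename To::state to_state{};
    typename From::code_point intermediate[From::max_code_points];
    for (;;) {
        // First half: decode one unit of information into the scratch buffer.
        auto from_result = From::decode_one(in, in_size, intermediate, From::max_code_points, from_state);
        in = from_result.input;
        in_size = from_result.input_size;
        if (from_result.error_code != encoding_error::ok)
            return { in, in_size, out, out_size, from_result.error_code };
        // Measure how much of the scratch buffer was actually written.
        std::size_t used = static_cast<std::size_t>(from_result.output - intermediate);
        // Second half: encode those code points into the real output.
        auto to_result = To::encode_one(intermediate, used, out, out_size, to_state);
        out = to_result.output;
        out_size = to_result.output_size;
        if (to_result.error_code != encoding_error::ok)
            return { in, in_size, out, out_size, to_result.error_code };
        if (in_size == 0)  // nothing left to read: hand back the untouched remainders
            return { in, in_size, out, out_size, encoding_error::ok };
    }
}
```

For example, the four Latin-1 bytes of "café" come out as five UTF-8 code units, with the final é expanded to `0xC3 0xA9`. Swapping in any other pair of encoding objects with the same code point type leaves the `transcode` function itself unchanged.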
What about validation? I want to verify that some text is in the proper encoding, or that it can be in the proper encoding. The idea here is somewhat simple: it's the same loop-and-check idea. We get our code point and code unit types, we get the from state and the to state, we get a buffer of code points and a buffer of code units, and we create an intermediate buffer and an output intermediate buffer. Then we do the same loop. We call decode_one from the input to the intermediate to get the from_result, we check the error code, blah, blah, blah. Then we use the span to do the used calculation all over again. Then we call encode_one with the used to the output, and we give it the handler and the to state. It's the same loop, right? Any time we fail with an error, we return false, because obviously it can't have worked if we get an error. That means the text isn't valid, because there's no way it could possibly be represented in this encoding.

Then we move on to the next part of validation, in which we create a mirror input. It's the same used calculation, but we're doing it for the output of the to_result rather than the output of the from_result. So at the very end we're calculating the used of the output, and we get a c-span of mirror input. What this enables us to do is check the mirror input that we got from the operation. We did one cycle of the loop: we went from the input to the output and then back to the input again, using the exact same encoding type. I want to emphasize here that we're not using a from encoding and a to encoding; we're using the encoding itself, both times. And so when you do decode_one and encode_one and you loop it through, right?
The whole point is that if you round-trip through the encoding, no errors should happen, and the input should be exactly identical to the output that you get. So the input should be identical to the mirror input. We do std::equal: we get the iterators and we call the function, and if it's not equal, return false. Otherwise we update the input and loop back around. And if this loops through the whole thing and we reach input.empty(), then we return true. And, well, that's it.

We literally already defined transcoding as: decode some code points; if error, return with error; else take the decoded code points and put them into the encode step; if error, return with error; else loop back if the input is not empty. Except in this case, rather than returning with an error, we're just returning false, for the fact that it's not valid text; it's not valid in that encoding. But it's literally the exact same idea. And the whole point here is that this whole thing holds up and works without any additional effort.

Now, of course, you can also do this as lazily as possible. You can use transcode to do the actual validate: rather than implementing validate as a loop, you can implement it as a call to transcode. So we call transcode and we give it the input and an output buffer, but instead of providing a different to encoding and from encoding, we just use the same encoding twice: we use the encoding as the to encoding and as the from encoding. What that means is that we're basically checking, against itself, whether it can do a full loop of decoding and encoding. If it can do that, we check if the result's error code is ok, and we also do a std::equal to check that the input is exactly equal to the output in both size and actual values. And that's it.
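The loop form of validation described earlier (decode one, encode one, compare the mirror input against what was consumed) might look something like the sketch below. The names span_of, result, encoding_error, utf8, validate, and is_valid_utf8 are hypothetical, handlers and state are omitted, and the decoder here is deliberately lenient (it accepts overlong forms and surrogates) purely so that the round-trip comparison has something to catch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative stand-ins only; not the proposal's declarations.
enum class encoding_error { ok, invalid_sequence, insufficient_output_space };

template <typename T>
struct span_of {  // minimal stand-in for std::span (C++17 friendly)
    T* ptr;
    std::size_t len;
    T* data() const { return ptr; }
    std::size_t size() const { return len; }
    bool empty() const { return len == 0; }
    span_of subspan(std::size_t n) const { return { ptr + n, len - n }; }
};

template <typename In, typename Out>
struct result {
    span_of<In> input;    // not yet read
    span_of<Out> output;  // not yet written
    encoding_error error_code;
};

struct utf8 {
    // Lenient on purpose: checks only the byte structure, not overlongs or surrogates.
    static result<const char, char32_t> decode_one(span_of<const char> in, span_of<char32_t> out) {
        assert(!in.empty());
        if (out.empty()) return { in, out, encoding_error::insufficient_output_space };
        const unsigned char lead = static_cast<unsigned char>(in.data()[0]);
        const std::size_t n = lead < 0x80 ? 1 : (lead >> 5) == 0x06 ? 2 : (lead >> 4) == 0x0E ? 3 : (lead >> 3) == 0x1E ? 4 : 0;
        if (n == 0 || in.size() < n) return { in, out, encoding_error::invalid_sequence };
        char32_t c = n == 1 ? lead : static_cast<char32_t>(lead & (0xFF >> (n + 1)));
        for (std::size_t i = 1; i < n; ++i) {
            const unsigned char b = static_cast<unsigned char>(in.data()[i]);
            if ((b & 0xC0) != 0x80) return { in, out, encoding_error::invalid_sequence };
            c = (c << 6) | (b & 0x3F);
        }
        out.data()[0] = c;
        return { in.subspan(n), out.subspan(1), encoding_error::ok };
    }

    // Strict encoder: always emits the shortest form, rejects surrogates and out-of-range values.
    static result<const char32_t, char> encode_one(span_of<const char32_t> in, span_of<char> out) {
        assert(!in.empty());
        char32_t c = in.data()[0];
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return { in, out, encoding_error::invalid_sequence };
        const std::size_t n = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (out.size() < n) return { in, out, encoding_error::insufficient_output_space };
        static const unsigned char lead[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
        char* p = out.data();
        for (std::size_t i = n - 1; i > 0; --i) {
            p[i] = static_cast<char>(0x80 | (c & 0x3F));
            c >>= 6;
        }
        p[0] = static_cast<char>(lead[n] | c);
        return { in.subspan(1), out.subspan(n), encoding_error::ok };
    }
};

// Round-trip validation: the same encoding is used for both the decode and the encode step.
template <typename Encoding>
bool validate(span_of<const char> input, Encoding enc) {
    char32_t point_buffer[4];
    char unit_buffer[8];
    span_of<char32_t> intermediate{ point_buffer, 4 };
    span_of<char> output{ unit_buffer, 8 };
    while (!input.empty()) {
        auto from_result = enc.decode_one(input, intermediate);
        if (from_result.error_code != encoding_error::ok) return false;
        const auto used = static_cast<std::size_t>(from_result.output.data() - intermediate.data());
        auto to_result = enc.encode_one(span_of<const char32_t>{ intermediate.data(), used }, output);
        if (to_result.error_code != encoding_error::ok) return false;
        // The mirror input is what encode_one wrote; it must match what decode_one consumed.
        const auto consumed = static_cast<std::size_t>(from_result.input.data() - input.data());
        const auto mirrored = static_cast<std::size_t>(to_result.output.data() - output.data());
        if (mirrored != consumed || !std::equal(output.data(), output.data() + mirrored, input.data())) return false;
        input = from_result.input;
    }
    return true;
}

bool is_valid_utf8(const std::string& text) {
    return validate(span_of<const char>{ text.data(), text.size() }, utf8{});
}
```

With this sketch, the overlong sequence C0 80 decodes without error but re-encodes as a single zero byte, so the mirror comparison is what rejects it. The scratch buffers stay a fixed size no matter how long the input is.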
That transcode-based version is a valid, perfectly good implementation of validate, and it's built entirely off of the transcode call. Now, for obvious reasons, I don't recommend this: we are literally creating a std::vector the size of the input, and that's going to be a tad bit wasteful. So obviously we don't want to do it like this, but the whole point is that it works and it scales. We use the loop version because we don't want memory consumption that grows with the input, but the whole point is that it works, and that's extremely useful.

So now let's also do counting: how many code units or code points will this operation yield? I'm not actually going to do this one for you; we leave this one as an exercise for the viewer. But it's really not hard, and it's not a trick. It's the same idea. Instead of all of that, we just use the used calculation from the last portion of the loop, and we count the code points or count the code units, and bada bing, bada boom, we've got exactly what we're looking for. That's exactly how that works.

Now, I'm not actually going to leave it to you as an exercise. I'm not going to say, "Go take these encoding objects and go implement transcode, validate, encode count, decode count, and all this other stuff." No, we're not going to do that. In the paper (you can read the paper, the official C++ paper, the working draft that's on my blog) we provide all this for you. And not only do we provide it all for you, but, as I mentioned earlier, we do the templating. We take the error handlers. We'll do the checking: is this a proper range? Do we need to boil this down to a range? Et cetera, right?
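For the counting operation mentioned above, one possible sketch is to reuse the decode half of the loop and add up the used portion on every iteration. As before, span_of, result, encoding_error, utf8, count_result, count_code_points, and count_utf8_code_points are names assumed for this illustration, and the decoder is simplified (it checks byte structure but not overlong forms or surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative stand-ins only; not the proposal's declarations.
enum class encoding_error { ok, invalid_sequence, insufficient_output_space };

template <typename T>
struct span_of {  // minimal stand-in for std::span (C++17 friendly)
    T* ptr;
    std::size_t len;
    T* data() const { return ptr; }
    std::size_t size() const { return len; }
    bool empty() const { return len == 0; }
    span_of subspan(std::size_t n) const { return { ptr + n, len - n }; }
};

template <typename In, typename Out>
struct result {
    span_of<In> input;    // not yet read
    span_of<Out> output;  // not yet written
    encoding_error error_code;
};

struct utf8 {
    // Simplified decoder: validates lead/continuation structure only.
    static result<const char, char32_t> decode_one(span_of<const char> in, span_of<char32_t> out) {
        assert(!in.empty());
        if (out.empty()) return { in, out, encoding_error::insufficient_output_space };
        const unsigned char lead = static_cast<unsigned char>(in.data()[0]);
        const std::size_t n = lead < 0x80 ? 1 : (lead >> 5) == 0x06 ? 2 : (lead >> 4) == 0x0E ? 3 : (lead >> 3) == 0x1E ? 4 : 0;
        if (n == 0 || in.size() < n) return { in, out, encoding_error::invalid_sequence };
        char32_t c = n == 1 ? lead : static_cast<char32_t>(lead & (0xFF >> (n + 1)));
        for (std::size_t i = 1; i < n; ++i) {
            const unsigned char b = static_cast<unsigned char>(in.data()[i]);
            if ((b & 0xC0) != 0x80) return { in, out, encoding_error::invalid_sequence };
            c = (c << 6) | (b & 0x3F);
        }
        out.data()[0] = c;
        return { in.subspan(n), out.subspan(1), encoding_error::ok };
    }
};

struct count_result {
    std::size_t count;          // code points decoded before stopping
    encoding_error error_code;  // ok, or the error that stopped the count
};

// Count how many code points decoding would yield, without keeping any of them.
template <typename Encoding>
count_result count_code_points(span_of<const char> input, Encoding enc) {
    char32_t point_buffer[4];
    span_of<char32_t> intermediate{ point_buffer, 4 };
    std::size_t count = 0;
    while (!input.empty()) {
        auto from_result = enc.decode_one(input, intermediate);
        if (from_result.error_code != encoding_error::ok) return { count, from_result.error_code };
        // Same "used" calculation as in transcode, accumulated instead of encoded.
        count += static_cast<std::size_t>(from_result.output.data() - intermediate.data());
        input = from_result.input;
    }
    return { count, encoding_error::ok };
}

count_result count_utf8_code_points(const std::string& text) {
    return count_code_points(span_of<const char>{ text.data(), text.size() }, utf8{});
}
```

Counting code units for an encode would follow the same pattern with encode_one and a scratch buffer of code units.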
But we'll have decode and decode_into, where decode_into actually takes the output and we write into that output. If you don't care about the output, you can just call decode and we'll spin up a vector for you, or whatever, if you're lazy. We also do this for encode, where you have encode_into, where you can pass the output and we'll fill it up as much as we can. Or you just call encode, and we'll create an output, do a reserve call and a bunch of other stuff, and hand you a string; that's exactly what you wanted. We also have validate calls, and we also have the encode and decode counts. We provide this all for you. It's templated; we do all the shenanigans underneath. But the whole point is that you can plug in any encoding object, or any two encoding objects when it comes to transcode, and it works.

So, to give you an idea, this is just a quick look at using some of the basic overloads. In the desired API, you can call std::text::validate_as and check: can I take this UTF-32 heart and put it in my literal encoding? And this will assert at compile time, because all of this is constexpr, that the heart can be put in your literal encoding. So if you have certain things that need UTF support, you can static_assert a bunch of characters in the Basic Multilingual Plane, or some emoji that are farther out than that, and it works. You'll be able to check at compile time: no, your literal encoding needs to be able to handle this, right?
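The split between the "into" form and the allocating form described above could be layered roughly like this. It is only a sketch under assumptions: encode_into and encode here are hypothetical free functions written for this illustration (not the proposed std::text signatures), max_code_units is an assumed member used to size the buffer, error handlers are omitted, and failure is reported through std::optional purely to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Illustrative stand-ins only; not the proposal's declarations.
enum class encoding_error { ok, invalid_sequence, insufficient_output_space };

template <typename T>
struct span_of {  // minimal stand-in for std::span (C++17 friendly)
    T* ptr;
    std::size_t len;
    T* data() const { return ptr; }
    std::size_t size() const { return len; }
    bool empty() const { return len == 0; }
    span_of subspan(std::size_t n) const { return { ptr + n, len - n }; }
};

template <typename In, typename Out>
struct result {
    span_of<In> input;    // not yet read
    span_of<Out> output;  // not yet written
    encoding_error error_code;
};

struct utf8 {
    static constexpr std::size_t max_code_units = 4;  // assumed member: worst case per code point

    static result<const char32_t, char> encode_one(span_of<const char32_t> in, span_of<char> out) {
        assert(!in.empty());
        char32_t c = in.data()[0];
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return { in, out, encoding_error::invalid_sequence };
        const std::size_t n = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (out.size() < n) return { in, out, encoding_error::insufficient_output_space };
        static const unsigned char lead[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
        char* p = out.data();
        for (std::size_t i = n - 1; i > 0; --i) {
            p[i] = static_cast<char>(0x80 | (c & 0x3F));
            c >>= 6;
        }
        p[0] = static_cast<char>(lead[n] | c);
        return { in.subspan(1), out.subspan(n), encoding_error::ok };
    }
};

// "into" form: the caller owns the output and gets back whatever was left untouched.
template <typename Encoding>
result<const char32_t, char> encode_into(span_of<const char32_t> input, span_of<char> output, Encoding enc) {
    while (!input.empty()) {
        auto to_result = enc.encode_one(input, output);
        input = to_result.input;
        output = to_result.output;
        if (to_result.error_code != encoding_error::ok) return { input, output, to_result.error_code };
    }
    return { input, output, encoding_error::ok };
}

// Allocating form: builds the string for the caller; defaults to UTF-8 for UTF-32 input.
template <typename Encoding = utf8>
std::optional<std::string> encode(std::u32string_view input, Encoding enc = {}) {
    std::string out(input.size() * Encoding::max_code_units, '\0');
    auto r = encode_into(span_of<const char32_t>{ input.data(), input.size() }, span_of<char>{ &out[0], out.size() }, enc);
    if (r.error_code != encoding_error::ok) return std::nullopt;
    out.resize(out.size() - r.output.size());  // trim the part that was never written
    return out;
}
```

The allocating overload is just a thin wrapper over the "into" one, so any type that supplies encode_one gets both forms.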
And that's important. You can also just call std::text::encode. Like I said, there are simple overloads, so you can just pass a string in, call encode, and get out UTF-8 emoji. We generally assume that when you're encoding and you pass us a UTF-32 string, you want to go to UTF-8; that's the default. But if you don't, you can pass something like std::text::ascii with the replacement handler, and this actually just ends up as a question mark, because ASCII can't handle anything more than that, and that's just the way that works.

But most importantly, about all of this, about the simple API and everything else: the basis never changes. The seven operations are still the seven operations used to build everything. Now, obviously there are arguments to make for performance and everything else, but the entire point is that you can, at minimum, write these seven things and you will have perfect interoperability and safety and everything else for the entire ecosystem at no cost to you. And I really want to emphasize this: the basis never changes. The basis operations are what we compose the entirety of our text encoding APIs out of, and you enjoy full support without having to do any additional work or labor. No additional work on the standard library implementers' part and no additional work on your part. And that is why this API is infinitely scalable and better than almost every single API out there currently in the world.

These lucky seven are exactly what you want. And obviously, if you want more speed and safety, there are different hooks and other things you can do. I described some of that in part one, and I'll also be going into that in part four, which might happen at either C++ Russia, which might be online, or some other venue. But the whole point is that you have the magic number seven, and that encapsulates everything you need. And it's all yours. Each encoding object is its own type, and it strongly controls its semantics and
representation. And there's no committee telling you what to do, no standard library saying it's not important enough to be added, that your use case isn't good enough. There's no gods, no masters, no one to stand in our way, and that means that we will seize the means of production. Yeah. And that's the magical bit here. And, of course, the other magical part about this whole thing is you, for listening. Thank you so much for listening. Thank you so much for tuning into this presentation. I hope you learned a lot. I hope you're excited about the future of text for C++. I hope that we can get this API through to the standards committee and make a real difference.

I just want to spend a moment thanking all of my wonderful individual patrons and sponsors, who have wonderfully helped me out during this time, especially even now. I know it's very hard to part with your hard-earned dollars in a time like this, and I'm super glad that you are supporting my work and everything else that I do, whether it's standards work, itsy.bitsy (the bit library), or anything else. Please, pat yourselves on the back. I hope that by working on these things, and doing these things for the standards committee and the C committee, I'm returning great value to you.

I also wanted to thank NEN, the standards body in the Netherlands, who took me on on a recommendation from someone and have allowed me, by their sponsorship, to continue to attend the WG14 C standards meetings and push for new APIs that make this whole thing better. So, very much.
Thank you to NEN. You can check them out on their website; of course, there are emails, phone numbers, and everything else there. If you're Dutch, or even Dutch-adjacent, and you want to help with these things, you can definitely ask and they'll be happy to help you out, and maybe even help get you there, as long as you're pushing for standards and other things like that. And that would be great.

Now, I just wanted to thank all the various people who put together the various media that I use in these slides, just giving credit where credit is due. And, of course, if you'd like to be one of those patrons and you want to support a vision for fluid text handling in C and C++, there's a plan at the portfolio text link there, and you can support the plan at the link there. And if you don't want to just support the plan directly with donations, you can always contract and consult us. Please send an email to shepherd@soasis.org. We do pretty much everything: system profiling, hardening, testing, performance. We're also sort of known as the text people. Scripting on small devices, a whole bunch of things, C, C++, whatever language you've got in mind or whatever task you have at hand, we will be there to provide a wonderful place for you to rest your head easy, knowing that it will be taken care of. Any questions?