So I left off in the last video saying that I was going to talk about Unicode categories, and how I feel like the .NET libraries don't quite have it right. Categorization is part of the Unicode standard in a very limited form, and I'm not going to bitch about that, although I am going to suggest improvements after the fact, including the thing I'm basing those improvements on. But I don't think the .NET team quite got the Unicode standard implemented the best way it could have been. The standard I'm talking about is an annex, so you don't have to support it, but if you're going to support it, there's a standard way to do it: a Unicode Annex, UAX #44, specifically section 5.7.1. There's a whole chart of Unicode categories there, and it's very basic, very, very basic, but it does cover at least enough for general purposes. If you're not doing broad parsing things, it's typically sufficient. You want to validate that a password has at least a number and a symbol in it? It'll do the job. It'll take care of what you need. There's a lot of room for improvement, but we'll get into that. The things I don't think the .NET team is doing quite right: in a few cases, they're not using the same names as the standard, which is just dumb. It's pretty minor stuff; you're still going to understand what the thing is just by looking at it. "Not assigned" versus "unassigned": it's very obvious they're the same thing. But it's a standard. You should be using the same names, because it's a standard. If you were to make up your own names that are similar enough but don't fit with medical terminology, and then start going in to work at a hospital or whatever using your not-quite-standard terms, you're going to confuse the fucking shit out of people, and it's going to cause mistakes. This is very minor compared to that, but still: you strictly adhere to standards. If you disagree with the standard, then don't adhere to it at all.
Do your own thing. But as far as Microsoft goes, they don't really follow any standard; they're more concerned with bug compatibility than with actually reasonable behavior. There's a more technical side of this that I don't think they implement quite right either, and to be fair, I don't think a lot of people do, because they're not thinking about part of this. See, the categories aren't just specific things; there are broad categories too. If you look through that chart in section 5.7.1, you'll notice there are, just like I said, broad category definitions. There's a category for letter. It includes cased letter, which is itself a broad category covering lowercase, uppercase, and titlecase. Then there are the other forms of letter: modifier letter and other letter, I believe, are the other two categories. It's clear that there's a semi-hierarchical thing going on. So does that mean you implement this through objects? No, no, no. Do not do that. Given a considerably more sophisticated system, then yeah, maybe you have to use objects. But how many base-level categorizations are there? I don't remember quite off the top of my head, but I believe it's 30. It's fewer than 32, though. That's important, because with the 32 bits of an integer we can do a flags setup, which, conveniently, in the C# world is still an enumeration, just with the [Flags] attribute and a specific value set for each of these. That's also incredibly convenient, because when we say, hey, this category is a letter, then anything in those subcategories will have those bits set and will identify as a letter as well. We have hierarchical behavior with the same exact performance. That's what I mean when I say I don't think they quite implemented it right. It's not that they implemented it wrong, although I'll still argue that the names should exactly match.
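Here's a minimal sketch of that flags idea. The names and bit assignments are made up for illustration, not the actual .NET `UnicodeCategory` type or anything from Stringier; a real version would assign a bit to each of the 30 concrete general categories and only a few letter categories are shown here:

```csharp
using System;

// Hypothetical sketch: each concrete General_Category value gets its own bit,
// and the broad UAX #44 groupings are just unions of those bits, not new bits.
[Flags]
enum GeneralCategory : uint
{
    UppercaseLetter = 1u << 0, // Lu
    LowercaseLetter = 1u << 1, // Ll
    TitlecaseLetter = 1u << 2, // Lt
    ModifierLetter  = 1u << 3, // Lm
    OtherLetter     = 1u << 4, // Lo

    // Broad categories are masks over existing bits.
    CasedLetter = UppercaseLetter | LowercaseLetter | TitlecaseLetter, // LC
    Letter      = CasedLetter | ModifierLetter | OtherLetter,          // L
}

class FlagsDemo
{
    static void Main()
    {
        // "Is this a letter?" is one AND and one compare — no hierarchy walk.
        GeneralCategory c = GeneralCategory.LowercaseLetter;
        Console.WriteLine((c & GeneralCategory.Letter) != 0);      // True
        Console.WriteLine((c & GeneralCategory.CasedLetter) != 0); // True
    }
}
```

The point is that the hierarchy costs nothing at query time: a lowercase letter "is" a cased letter and "is" a letter purely because those broader masks include its bit.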
Maybe I'm just being pedantic, but again, standard terminology is standard terminology. You stick with it, or you don't follow the standard at all. Aside from that, I think it's more that they could have implemented it better. Not that they implemented it wrong, just better. So I don't quite agree with how they did it. Now, I said there were some improvements that could be made. These aren't going to make it into the v4.0 release; it's not worth delaying that to the point where I could implement this. But I'm still going to talk about it nonetheless, because I want it obvious that not only is this flags enumeration a better representation of UAX #44 §5.7.1, but I also want its hierarchical nature to serve as a sort of get-you-thinking about more developed and fleshed-out category hierarchies. You can kind of tell I've been implementing a bit of that as individual method calls, and there's a better way to do it going forward. I mean specific methods for detecting broad categories, like: is this a combining mark? That would cover the spacing combining marks, the non-spacing combining marks, and the enclosing marks. But also more granular ones: is this a superscript? Is this a subscript? Those are very granular, and not supported by UAX #44 §5.7.1. I haven't added any of those methods for a while, and it's because there was clearly a better way to go about this. We're dealing with structured data; there should be something to represent the structured data, not tons of these method calls. Because, good lord, what you do not want is someone hitting dot, typing 'i' to start looking for a method, and seeing an absolutely massive list of methods starting with "Is". That's namespace pollution. That's bad design. That is a nasty code smell, because it makes it very hard for people to find what they want. Instead, it's clear that this should be driven by some kind of data structure.
Now, that's a pattern I've been using a lot throughout Stringier, where things are driven by tables. This one, though, is not driven by a table. Remember how I said this flags thing sets up a basic hierarchy? Well, a hierarchy actually is the right way to deal with this. It would essentially be an object hierarchy, seriously. You wind up with a basic entry for any Unicode category at all, with everything indexed inside of it. And would it be an indexer, or would you do static properties? A static property probably makes more sense, because that's safer. So do static properties. Get your categories, hit dot: boom, all the subcategories come up. Select one of those, hit dot: additional subcategories come up, if there are any. As you drill down deeper, you get more and more granular. As you do the opposite and scale back, it gets more and more broad. Each broad category has to accept anything within one of its subcategories, and then you have essentially the same behavior. The way it's implemented winds up being a little different, but it's still the same behavior. Then all you've got to do is set up a way of allowing operators between these, so that you can still do arbitrary composition, because that's the other distinct advantage of flags. See, there are situations in various text-processing algorithms where you might want to, say, trim or filter out multiple categories that have nothing to do with each other; they don't share a single broad category. Say, filter out any control characters and symbols so that you can look at text and numbers. Control characters and symbols don't share a broad category. You could naively call the same thing twice, and that would work if you're filtering out. It would be less efficient, though, because you're going to allocate an additional copy in between.
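A rough sketch of that static-property hierarchy, under the assumption that each node still carries a bit mask underneath (all names here are mine, not Stringier's actual API):

```csharp
using System;

// Hypothetical sketch: each category node carries a bit mask, and broad nodes
// expose their subcategories as static properties, so discovery is just
// dot, complete, dot, complete.
sealed class Category
{
    public uint Mask { get; }
    private Category(uint mask) => Mask = mask;

    public static Category UppercaseLetter { get; } = new Category(1u << 0);
    public static Category LowercaseLetter { get; } = new Category(1u << 1);
    public static Category TitlecaseLetter { get; } = new Category(1u << 2);

    // The broad category's mask is the union of its children's masks, so it
    // accepts anything within one of its subcategories automatically.
    public static Category CasedLetter { get; } = new Category(
        UppercaseLetter.Mask | LowercaseLetter.Mask | TitlecaseLetter.Mask);

    public bool Contains(Category other) => (Mask & other.Mask) == other.Mask;
}

class HierarchyDemo
{
    static void Main()
    {
        Console.WriteLine(Category.CasedLetter.Contains(Category.TitlecaseLetter)); // True
        Console.WriteLine(Category.LowercaseLetter.Contains(Category.CasedLetter)); // False
    }
}
```

Same masking trick as the flags enum, but surfaced as objects, so the IDE's completion list presents the hierarchy instead of a flat pile of "Is" methods.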
It's not the end of the world, though. Trimming for those is going to be highly problematic, however, because unless you can guarantee that one is always inside the other, trim is not really going to work. So if you can specify combinations of these, well, that's better. And it conveniently gives us a saner operator, because we can say "and"; using "or" for combining things just reads weird. So being able to say "and" is a little better. Or maybe overload them both, because I can think of some contexts where "or" is appropriate. I can work out that detail when I actually come to it. That would require an additional type: a special kind of category that isn't publicly visible, but exists purely to combine arbitrary categories. I'm not the first person that's thought of this, nor the first to recognize that UAX #44 §5.7.1 is nowhere near as granular as it could be. There's a lot of convenience to be gotten from more granular categories. It's non-standard, as all technical notes are, but Unicode Technical Note #36 describes almost exactly what I am describing, and is in fact a large inspiration for where I'm coming from. I had very similar ideas, but still wasn't sure exactly how to structure this, and it would be a huge pain in the ass to implement the majority of it myself. Whereas, thanks to this work already largely being done, there's a table which can be used to set up this entire infrastructure. That would require parsing at module-initialization time, but that can be done thanks to Fody weavers, or the upcoming changes in C# 9 if that part goes through. It's looking like it might finally go through, because the Fody module-initializer weaver has made it very clear that that functionality is heavily desired. Would that require a v5? Would I be able to introduce it as a minor version bump in v4? Probably a minor version. That's probably how that's going to go.
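The composition idea, with the hidden combined type, might look something like this. Again, every name is hypothetical; the interesting part is that both operators produce a union of masks, and the combined type never appears in the public surface:

```csharp
using System;

// Sketch of arbitrary composition. Both & and | are overloaded to mean set
// union, since "control characters and symbols" reads naturally to a caller
// even though the bit operation underneath is an OR.
class CharCategory
{
    internal uint Mask { get; }
    internal CharCategory(uint mask) => Mask = mask;

    public static CharCategory Control { get; } = new CharCategory(1u << 0);
    public static CharCategory Symbol  { get; } = new CharCategory(1u << 1);
    public static CharCategory Number  { get; } = new CharCategory(1u << 2);

    public bool Overlaps(CharCategory other) => (Mask & other.Mask) != 0;

    public static CharCategory operator &(CharCategory left, CharCategory right) =>
        new CombinedCategory(left.Mask | right.Mask);
    public static CharCategory operator |(CharCategory left, CharCategory right) =>
        new CombinedCategory(left.Mask | right.Mask);
}

// Not publicly visible in the real thing; exists purely to represent
// arbitrary unions of categories that share no broad parent.
sealed class CombinedCategory : CharCategory
{
    internal CombinedCategory(uint mask) : base(mask) { }
}

class ComposeDemo
{
    static void Main()
    {
        // One trim/filter pass over the text instead of two, no extra copy.
        CharCategory junk = CharCategory.Control & CharCategory.Symbol;
        Console.WriteLine(junk.Overlaps(CharCategory.Symbol)); // True
        Console.WriteLine(junk.Overlaps(CharCategory.Number)); // False
    }
}
```

Because the combined type is just another `CharCategory` as far as the caller is concerned, trim and filter only ever need to accept one category argument.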
In fact, these changes are probably going to involve the linguistics portion of the literary library being yanked out into its own library that Core can depend on, with the literary functions being moved into Core. Because without the linguistics portion of that library, there's not a whole lot to the literary functions; it's not really special enough to justify its own library at that point. But I don't know, maybe I can come up with a hell of a lot more. Languages could essentially be treated as categories too, because, after all, every language's orthography has a set of specific characters. So it would be useful to be able to specify, in the same setting as a category, a language as a category, with languages essentially being a bunch of additional stuff on top. So, clearly, a lot of these functions are going to be driven through different mechanisms. As part of this, because I'm going to be doing categorization differently, and because languages can essentially be treated as a form of categories, it makes sense to stop using the StringComparison enumeration and instead have just a simple case enumeration, similar to what was done in Stringier.Patterns. In fact, I am yanking that enumeration out. The Core one is not going to have the no-preference value, but I'm going to do a special conversion thing so I can still use the case enumeration with no-preference inside of Patterns, and the one without it inside of Core. Because if they share the same enumeration values, you can just do an unchecked conversion and rapidly switch between the two. That's important so that I'm not doing switch-case statements all over the place. I have to figure this out because, as far as I'm concerned, the current behavior is a bug, and I don't share the same bug-compatibility level of backwards compatibility that Microsoft does.
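The shared-values trick might look like this; the names, values, and which enum owns "no preference" are my guesses for illustration, not the actual Stringier enums. The point is that when every common member is numerically identical, conversion is a free cast rather than a switch statement:

```csharp
using System;

// Hypothetical sketch: two case enums that agree on every shared value.
enum Case        { Insensitive = 0, Sensitive = 1 }                    // Core: no "no preference"
enum PatternCase { Insensitive = 0, Sensitive = 1, NoPreference = 2 }  // Patterns

class CaseDemo
{
    static void Main()
    {
        PatternCase fromPattern = PatternCase.Sensitive;
        // Unchecked conversion: just reinterprets the underlying integer,
        // no switch-case mapping table anywhere.
        Case inCore = (Case)fromPattern;
        Console.WriteLine(inCore); // Sensitive
    }
}
```

The one thing the caller has to be careful about is never casting `NoPreference` into the smaller enum, since there's no matching member on the Core side.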
A lot of these functions defaulted to StringComparison.CurrentCulture, which is problematic because that's not what you expect, even though that is what happens. In some ways it makes sense, but again, it's just not expected. People don't test for it, which is one of the big problems with it, but it also carries performance overhead that not everybody is aware of. So I've been thinking: how am I going to tackle that? I could stay with the .NET approach and leave the method without an explicit culture as current-culture. That works, but the problem, because there is a problem, is that, exactly, why the fuck did I do that? I just described the problem earlier. The problem is that you don't expect that it's the current culture, and people don't test for it. Run your tests on a machine with a different locale, and shit breaks, because it's doing the comparisons differently. So there are two options as I see it. The method without a culture could either use the invariant culture, so you'd have to specify a specific culture, or have a mechanism for specifying the current culture; regardless, the default would be the invariant culture unless explicitly specified, which makes a lot of sense. That is the reasonable default. But then how the hell do you specify ordinal comparisons? Culture.Ordinal? Ordinal is not a culture. I think what should actually be done is to use ordinal as the default, with all of these having an overload for a culture parameter, language parameter, category parameter, whatever. Invariant can obviously be gotten from that, current culture can obviously be gotten from that, and any arbitrary culture can obviously be gotten from that. No weird semantics like Culture.Ordinal. You could then, at some later point, provide an analyzer which simply checks for ordinal operations. And, hi, Oscar. That's my neighbor's cat. When the analyzer detects one, it's just a warning, because it's not an error.
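That ordinal-by-default shape could be sketched like so. This is a hypothetical API surface, not Stringier's actual signatures; in the real thing these would presumably be extension methods:

```csharp
using System;
using System.Globalization;

// Hypothetical sketch: the overload without a culture is ordinal — fast,
// predictable, identical on every machine — and culture-aware comparison
// has to be asked for explicitly. No Culture.Ordinal weirdness required.
static class TextOps
{
    // Default: ordinal comparison.
    public static bool IsPrefix(string source, string value) =>
        source.StartsWith(value, StringComparison.Ordinal);

    // Overload: any culture, including invariant and the current culture.
    public static bool IsPrefix(string source, string value, CultureInfo culture) =>
        culture.CompareInfo.IsPrefix(source, value);
}

class ComparisonDemo
{
    static void Main()
    {
        Console.WriteLine(TextOps.IsPrefix("Straße", "Stra")); // True
        // Same call, but under an explicit culture's collation rules.
        Console.WriteLine(TextOps.IsPrefix("Straße", "Stra", CultureInfo.InvariantCulture)); // True
    }
}
```

`CultureInfo.InvariantCulture` and `CultureInfo.CurrentCulture` are just particular arguments to the second overload, which is exactly the "everything can be gotten from that" property described above.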
It's not like it's obsolescent or anything. It's just a warning: hey, just so you know, this is ordinal; did you mean to do an ordinal operation? You may want to consider one of the culture overloads. Then, if you truly meant an ordinal operation, which, if you're doing performance-sensitive stuff, you definitely would, you can suppress that warning. And because, if you are doing performance-sensitive stuff, you're going to have those all over the place, you can put it in a suppressions file and suppress it for your entire project with one entry. I use my stuff primarily in performance-sensitive situations, and that doesn't sound like an impediment to me. So I think that's fine; I think it's a totally acceptable way to go about things, and that's what I'm going to do. I'm going to completely rework how that whole string-comparison thing works. StringComparison is not going to be used as a parameter anywhere in these libraries, and it's not like it's going to be hidden and wrapped up, either. As I develop these libraries more and more, there's more of the .NET runtime that I'm not just not utilizing but actively avoiding, and it seems like it's starting to get time for this. So, yeah, I talked about this side of things a lot longer than you guys probably thought I would. Some of this shit actually gets rather involved, and there are a lot of things you probably wouldn't consider. So, is there going to be another entry in this? I don't know. Am I going to notice even more shit as I get through? I have been. Some of it's getting put off to later releases, but I don't know. I can't safely say there's going to be another entry this time, so this might be it. Until the next video. Have a good one, guys.