Well, I already need to correct something I said about the progress and plans in yesterday's video. I've been going through and restructuring projects quite a bit. I believe I mentioned, though I don't remember for sure, that I was separating out the search and metrics functionality into their own libraries. That's not staying the case; it's too much of a pain for them to be separate. However, the encodings being separate is completely justified, and Rune being separate is completely justified. So I'm trying to find a sweet spot. You don't want to extract everything out into its own package, because then you get the mess Node.js has, where you've got a package for is-equal, a package for is-odd, and other nonsense. But you don't want huge monoliths either. The thing is, I'm one of those people who wants things to be rather modular, to use just the components they're interested in and not the ones they're not. That's definitely understandable: if something's too big, it's too hard to learn, too hard to discover, and it bloats your application. I don't want that. So, believe it or not, I actually looked at what the total executable size and library size are, as in the size in memory when it's loaded, and I want to make sure those are reasonable values even according to my standards, which are that everything needs to be modular.

Okay, that's part of what I was doing. The other part is the stack above Core. The Patterns engine largely doesn't need to be touched, but I do need to decide whether I'm going to merge tracing back into it or keep it separate. I'm not really sure. I'm not sure whether it's a complex enough component to justify using an interface and then having the implementation in a separate project. It might not be justified; it might be worth it to just leave the entire thing in there.
I don't know. But then, do the debug tools stay in their own repo, or do they get merged in as well? Because the testing tools are in there, although I think this time around that won't include the auxiliary testing projects, which would be convenient.

Okay, another thing: it's like eight o'clock in the morning on a dead-end road with only five houses on it, and three cars have driven by, trucks towing stuff. That is surprising. So anyways.

Literary is... well, okay, there's something I noticed in Patterns that I'll get to, but Literary is another project where there's not really that much in it, function-wise. There's considerably more I can add to it, but even then it's never going to be a particularly large project unless I resort to doing special text generators, like leet text, the mocking-SpongeBob text, a Zalgo generator, and things like that. But is that justified? How practical are those? Most people who are interested in them don't need them in a high-performance situation; they just want to copy-paste the result onto Twitter or something. Are you going to need those in some type of batch-processing, high-performance thing? Probably not. So in that case, would I be justified in taking the time to implement them?
No, probably not. So that means there's a rather limited amount of functionality inside of that project. What has been justifying it is the whole language, script, and orthography... I don't want to call it a table, but we could just call it a database; it's essentially a database. There's a lot I can do with that, a lot of clever things I can do to advance it: certain types I can derive from it, certain interfaces I can implement, that can greatly enhance what that system is capable of. That would be fantastic, and some of those enhancements are even compatible with the base .NET system, which again would be fantastic. There are things like equality operations that take a string comparer: if your sorting rules are defined within the language, then you have a string comparer.

But there's another thing. In the last video I was talking about how the UnicodeCategory enumeration in .NET is a little limited; Unicode's categorization is a bit more structured than that, and we'd like to enhance it. I'd said that, for at least the time being, it would be acceptable to use a flags-enum setup that would cover the Unicode categories. This is true to some extent: it does work, and if you are only interested in implementing UAX #44 §5.7.1, then it would completely cover your use case. But that's not my only use case. See, it wasn't just planning around eventually having the system of rich categorizations; it was also that I was actually utilizing some of them, and increasingly so as time went on. In fact, what the Patterns engine has been doing is accepting these broad categories as definitions of patterns, and it should accept the more granular categories as well. So, okay, if you have a flags object, could you write a pattern node that would be able to parse those categories? Yeah, absolutely. But you still run into the problem of not being able to expand it at all: you've got 32, or at most 64, bits that you can
cover. Are you realistically going to be able to cover everything with that? There's a tie-in that makes it very, very unlikely. What if you want to parse a specific orthography? You've got some situation in which this particular field has to be within this language. This is actually not a contrived example: in the Unicode Character Database itself, the names must be in English. I'm sure you could find other examples, but yes, that is a thing. It's also a convenient optimization and restriction at times. I'm not going to get into that because it's not the point of this video, but it has numerous, numerous uses.

Especially since the old way I had to do this was actually through a delegate. One of the pattern node types was what I was calling a checker: a function that describes what to check. So it's not like parser combinators, where the function describes how to parse that character or whatever the pattern is. It's rather a declarative thing: this is the function that describes what the thing to match is. It does not describe how to parse it, only how to identify it. But delegates are expensive compared to just executing the function itself. That's not a huge amount of overhead, but let's face it: when there's an easy way to remove overhead, you want to remove it. A granular category system is a fantastic way of covering one of the most common, probably the most widely used, cases of checkers there are. But it needs to be a granular system.

I'd stated in that video that there was the possibility of using Unicode Technical Note #36, which describes such a granular system, but I hadn't gone through it all that much. In fact, it's something I had just discovered a few days ago while looking through the technical notes for ideas. So it's there, but it's old: it covers Unicode 6.1, whereas the most recent Unicode version that .NET supports is something like version 11. .NET 5 is supposed to cover version 12, I believe; it looks like it, based on the documents they've been using for testing. Unicode 13 is out now.
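To make the two approaches concrete, here's a minimal sketch; all of these names are hypothetical, not the actual Patterns API. It shows a `[Flags]` enum over a few UAX #44 general categories, a match against it, and the delegate-based "checker" it competes with.

```csharp
using System;
using System.Globalization;

// Hypothetical flags enum over a few UAX #44 general categories.
// Each category burns one bit, so a plain enum caps out at 64 flags —
// there's no room left for richer, extensible classifications.
[Flags]
public enum GeneralCategories : ulong {
	UppercaseLetter = 1UL << 0,
	LowercaseLetter = 1UL << 1,
	DecimalNumber   = 1UL << 2,
	// ... one bit per remaining category ...
	Letter = UppercaseLetter | LowercaseLetter,
}

public static class Nodes {
	// Declarative node: matches any char whose category bit is set.
	public static bool Matches(char c, GeneralCategories cats) =>
		char.GetUnicodeCategory(c) switch {
			UnicodeCategory.UppercaseLetter => cats.HasFlag(GeneralCategories.UppercaseLetter),
			UnicodeCategory.LowercaseLetter => cats.HasFlag(GeneralCategories.LowercaseLetter),
			UnicodeCategory.DecimalDigitNumber => cats.HasFlag(GeneralCategories.DecimalNumber),
			_ => false,
		};

	// The old "checker" style: a delegate describes what to match,
	// at the cost of an indirect call per character.
	public static bool Matches(char c, Func<char, bool> checker) => checker(c);
}
```

With this, `Nodes.Matches('A', GeneralCategories.Letter)` is true without any delegate invocation, which is exactly the overhead the flags form removes.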
I don't know if they're going to update for that; if they're not, I would hope they do eventually. Regardless, I would like whatever I do to cover the most recent, up-to-date version. That's part of the reason why I'm increasingly removing the dependencies on .NET: so I don't have to depend on differing, out-of-date behavior, because I am supporting going so far back. Actually, it's essentially just that. And because I support so far back, this time around I'm going to be actually supporting as far back as .NET Standard 1.3. I figured out how to address the concerns I'd had; that came in part from implementing more and more of the base dependencies myself, and utilizing my side of things to enable that to happen. But we want consistent behavior across all of these as well; these are really just runtimes. That's what I'm trying to do: separate out the runtime and standard-library sides of things so that I'm only using .NET as the runtime. That has other advantages as well, like porting to another architecture being possible. The less dependent on .NET I am, the better.

It's clear that a granular categorization scheme would be superior: you can classify far, far better, and I have actual use cases for doing this. Those use cases sort of require being passed an object. Now, I'm not saying object in the object-oriented sense; it doesn't need to be a fully featured class that utilizes polymorphism.
Although that is probably how it's going to be implemented. If I didn't need this to be incredibly granular, that flags setup would be totally fine: you pass it one of the values and, boom, you've got your thing taken care of, no need for a delegate. Which is great. But what would we base this on? So I'm going through the Unicode material again, trying to find something, either in the standard or in one of the technical notes, that is reasonably up to date.

Oh, that was the other thing: with UTN #36 I had some not-so-simple disagreements about how certain things were being categorized. And because it's not a standard, why would I follow it exactly? So I would just go and do my own thing anyway. But I don't want to do my own thing: there are a lot of freaking characters to classify, and it'd be better if something was done for me, even if it was not as precise as it ideally would be; I could add in the stuff needed. UAX #44 §5.7.1 is a reasonable base, so we just need to go beyond that.

In the derived part of the Unicode Character Database there is a file, I believe it's called DerivedCoreProperties, that has additional classes you can utilize for the classification of a character. It doesn't cover all my needs, but it is a substantial improvement. Some of these are compositions of existing classes. It uses a lot of set operations: intersections, joins, unions (I guess union and join are the same thing), and whatever the hell the exclusions are called in set theory, the symmetric difference, where you've got your two sets, you exclude the elements that are common between them, and take the ones unique to each. There are a number of these set operations done, and they create all these different categories that are definitely useful.

So how do I go about doing this? The UCD categories in UAX #44 §5.7.1 are already supported by .NET, but they're not the easiest to access. That's sort of one of the annoying things.
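Those set operations are exactly what .NET's `HashSet<T>` already provides; here's a tiny sketch with stand-in code-point sets (not real UCD data) showing union, intersection, and the symmetric difference.

```csharp
using System.Collections.Generic;

class SetOps {
	static void Main() {
		// Stand-in code-point sets; real ones would come from the UCD files.
		var letters = new HashSet<int> { 0x41, 0x42, 0x61, 0x62 }; // A B a b
		var upper   = new HashSet<int> { 0x41, 0x42 };             // A B

		var union = new HashSet<int>(letters);
		union.UnionWith(upper);                  // everything in either set

		var intersection = new HashSet<int>(letters);
		intersection.IntersectWith(upper);       // common to both: A B

		var symmetric = new HashSet<int>(letters);
		symmetric.SymmetricExceptWith(upper);    // unique to each side: a b

		System.Console.WriteLine(symmetric.Count); // 2
	}
}
```

Composing derived categories is then just chaining these calls over the base category sets.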
I want it to be easy to do text processing, especially the common text-processing stuff; ideally it's a single function that you call and it just happens. But I want the more advanced stuff to be possible too: exposed and available. And .NET does this bizarre thing, most programming languages do this bizarre thing, where they pick this middle ground in between, where the advanced stuff is typically hidden away, sometimes exposed but not in the ways it should be, and there also aren't high-level enough functions to let you just take a simple declarative approach. So you're just fucked either way. It's lovely.

So, right: a simple console program that lives in one of these projects. I do think what I'm going to be doing is justified enough in its own project, and I'll talk about why. Have a console program in there whose sole purpose is to generate a file for one of the other projects, the actual library project, which would be called Categories. The point is to parse UnicodeData.txt, which is where UAX #44 puts a bunch of its information, including the §5.7.1 categories; parse that and extract the mappings between code points and categories. You now have a fast lookup from any code point to its category. You don't have to represent each category as a collection whose elements you check through to see if a character is there. You don't have to set up the categories as algorithmic collections, which are incredibly fast, by the way; any time you can implement an algorithmic collection, that's probably the best thing to do. But in this instance, because Unicode was not designed in a way conducive to that kind of programming (it should have been, but it's not), we have to go with a tabular approach. Which is fine.
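For contrast, here's what I mean by an algorithmic collection in its simplest form, using the ASCII digits as a hypothetical example: the block is contiguous, so membership is a range computation rather than a table lookup.

```csharp
public static class AsciiDigits {
	// An "algorithmic collection": membership is computed, never stored.
	// This only works because the ASCII digits occupy one contiguous block,
	// U+0030..U+0039 — most Unicode categories are not laid out this kindly.
	public static bool Contains(int codePoint) =>
		codePoint >= 0x30 && codePoint <= 0x39;
}
```

No table, no allocation, and it's branch-predictor friendly; the catch is exactly the one above: it only applies when the code points happen to be laid out in clean blocks.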
I'm no stranger to tabular programming, or table-driven programming. So set up a simple table: it's a key-value database, really, so a dictionary. And I could type all of those entries out myself, but that would be tedious as all fuck and make it hard to do updates, and Unicode goes through regular updates, so you don't want to do that. But what I can do, like I had described, is have that console program generate the source file, an actual C# file with that dictionary initialization. You now have your database. Come time for an update: take the new UnicodeData.txt, smack it into the generator, generate the new file, compile your library, and boom, you've got an updated database. That's fantastic.

Now, if .NET supported a more appropriate and well-designed shared-object system (shared objects, shared libraries, dynamic-link libraries, that kind of thing), then you could even do away with the pre-generator. Which, yes, I get is a bit of a performance thing, but I'd actually store the text file somewhere in the file system, because it parses really, really quickly and would only need to be loaded once. So: parse it the first time it's loaded, and then, as long as the state is resident in memory, it's just there. That would enable hot swapping of that file, which would be absolutely fantastic. But most languages, because of their semantics and whatnot, don't actually allow for appropriate swapping of shared objects like that. It doesn't seem like .NET is one of the ones that can, and C++ is definitely not.
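Back to that generator, here's a sketch of the idea; the type and file names are hypothetical, while the field positions come from the UnicodeData.txt format defined in UAX #44. It reads each semicolon-delimited line, pulls the code point and General_Category, and emits a C# dictionary initializer.

```csharp
using System.IO;
using System.Text;

// Console generator: UnicodeData.txt in, a generated C# source file out.
public static class Generator {
	public static void Main(string[] args) {
		var sb = new StringBuilder();
		sb.AppendLine("// Auto-generated from UnicodeData.txt; do not edit by hand.");
		sb.AppendLine("internal static partial class Tables {");
		sb.AppendLine("\tinternal static readonly Dictionary<int, string> Categories = new() {");
		foreach (string line in File.ReadLines(args[0])) {
			// UnicodeData.txt fields: 0 = code point (hex), 2 = General_Category;
			// fields 12/13/14 hold the simple upper/lower/titlecase mappings.
			// (A real generator also handles the First>/Last> range pairs.)
			string[] fields = line.Split(';');
			int codePoint = int.Parse(fields[0], System.Globalization.NumberStyles.HexNumber);
			sb.AppendLine($"\t\t[0x{codePoint:X4}] = \"{fields[2]}\",");
		}
		sb.AppendLine("\t};");
		sb.AppendLine("}");
		File.WriteAllText(args[1], sb.ToString());
	}
}
```

The sketch emits category abbreviations as strings for brevity; the real table would map to shared category instances instead.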
There's something really nice about Ida. Anyways. That covers the UAX #44 side; what covers the derived properties, those additional properties it was describing? Well, this is a simple thing. Remember that whole design philosophy I keep repeating: keep things publicly immutable, but internally mutable. The dictionary that is your database doesn't need to be publicly visible. You can expose APIs to work through it, but don't expose the database itself. And if you're not exposing the database, you don't need it to be an IReadOnlyDictionary; it can be an IDictionary. Now you have the ability to go through and modify the values. The keys don't need to change. The only thing you would be doing is going through and seeing if there is a more specific derived property to apply to each code point. If there's not, you don't touch it. If there is, you replace the value, because you're utilizing polymorphism, or any system that would allow these to still be equivalent; in my case, I am going to be doing polymorphism. There are, believe it or not, some performance advantages to this, and I'll get into how that makes any goddamn sense. That way, if you've got a more derived classification, a more specific classification, will they still be considered equal? Yeah, it's still derived from that base class; of course it's still equal.

I'll get into my own extensions beyond that, because, like I had mentioned, the language, script, and orthography stuff is being used for classifications too. But first things first: let me get out of the way how this works. Each of the category objects doesn't actually need to have anything in it. Why? Because of the table-driven approach.
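That refinement pass can be sketched like this, with hypothetical types: the table stays a private `IDictionary`, the public surface only reads it, and a second pass over the derived-properties data swaps in more specific instances, which still count as the base category.

```csharp
using System.Collections.Generic;

public class Category {
	public string Name { get; }
	public Category(string name) => Name = name;
}

// A more specific, derived classification is still its base category.
public sealed class MathSymbol : Category {
	public MathSymbol() : base("Sm") { }
}

public static class Classifier {
	// Internally mutable, never exposed; note IDictionary, not IReadOnlyDictionary.
	private static readonly IDictionary<int, Category> Table =
		new Dictionary<int, Category> { [0x2B] = new Category("Sm") }; // '+'

	// The public surface only reads the table.
	public static Category Lookup(int codePoint) => Table[codePoint];

	// The refinement pass: replace a value when a more specific derived
	// property applies; keys never change. (This would be internal in the
	// real library; it's public here so the sketch is self-contained.)
	public static void Refine(int codePoint, Category derived) =>
		Table[codePoint] = derived;
}
```

After `Classifier.Refine(0x2B, new MathSymbol())`, the lookup returns the derived instance, yet `Lookup(0x2B) is Category` still holds, so nothing that treated U+002B as a plain symbol breaks.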
When you want to know the category of a character, because you want to know whether or not this character is within this category, you look up what category is associated with that code point, not the other way around. So categories are simple identifiers, not actually collections. It's a little different. And, to be fair, I think I might have a situation in which it's justified to still implement these categories as collections. But the whole premise of this category system would not work with a collection-type approach regardless. So: you want to know if this character is within this category. It's a simple dictionary lookup, which of course is really fast, and since the only thing in the value part is the category, you simply get back a category instance. Now, you can make this incredibly more efficient by utilizing singletons, so that you're sharing one, say, Punctuation instance for all punctuation. That greatly reduces memory pressure. Like I'd said, this is utilizing polymorphism.
So, determining whether or not something is within a broader category: I would love for it to be as simple as reference equality, but in this instance you actually have to do a type pattern match. The performance overhead of those isn't terrible, though, and there might be some clever tricks I can work in to still avoid it; in fact, I could probably reutilize the special IEquatable-and-inheritance pattern I had worked out, to implement that kind of thing efficiently, so I don't have to play around with it. But regardless, even if I have to pattern match on a type, it's not terrible performance, especially since we don't have a nice algorithm to fall back on in most cases. In some cases there actually is one, and in those cases it may be justified to not do the dictionary lookup: cases where the block is actually laid out well enough to justify algorithmic collections. Not often... actually, I guess I should say it's more often than not, but Unicode is bizarre.

So that gets us a lot. That gets us an exposed category lookup. Which, to be fair, already existed, but it was exposed through different APIs and you had to do it separately: I'd go through a different API for characters than for runes, and that's obnoxious. Not a huge deal, though.
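Putting the singleton and the type pattern match together in one hypothetical hierarchy: the table hands back shared instances, reference equality answers the exact-category question, and a type pattern match answers the broader-category one.

```csharp
public class Category { }
public class Punctuation : Category { }
public sealed class OpenPunctuation : Punctuation {
	// One shared instance; every '(' , '[' , '{' table entry points here,
	// so a million entries cost one object.
	public static readonly OpenPunctuation Instance = new OpenPunctuation();
	private OpenPunctuation() { }
}

public static class Demo {
	// Stand-in for the dictionary lookup.
	public static Category Lookup(int codePoint) =>
		codePoint == '(' ? (Category)OpenPunctuation.Instance : new Category();

	public static void Main() {
		Category c = Lookup('(');
		// Exact category: reference equality against the singleton — cheapest.
		bool exact = ReferenceEquals(c, OpenPunctuation.Instance);
		// Broader category: a type pattern match — still cheap.
		bool broad = c is Punctuation;
		System.Console.WriteLine($"{exact} {broad}"); // True True
	}
}
```

The pattern match costs a type check per query rather than a pointer comparison, which is the overhead discussed above; it's small, and it buys the whole "more derived is still the base" behavior.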
There are other character properties, ones I'm not utilizing here, that you can't access at all through .NET, so I'll be keeping this approach for those properties when I eventually need to implement them myself. But I also get incredibly more granular categories out of this, including the ability to extend the system. And that extension is where we get into the languages.

See, the languages, or orthographies, are themselves character categories. If a character is within a certain language, that's a useful thing to be able to test. You want to be able to test that, so ideally you would reuse the same category system you already have in place. Now, there are additional Unicode properties I need to access for that. Conveniently, in the UCD there are properties for the uppercase, lowercase, and titlecase mappings. And in fact, the titlecase mapping is something I need to provide anyway inside of Core, because most programming languages, Microsoft included, don't provide it. That's confusing to me, because they operate in numerous countries where titlecasing is a thing that matters. Why it's not implemented, I don't know. Ignorance? You go through the same style of code path as you do for uppercase and lowercase; it's already in this database; you just look up a slightly different property. But ToTitlecase is not provided anywhere, and it needs to be. Sure, English speakers don't utilize it.
Sure, French speakers don't utilize it. But there are languages where titlecasing is significant, so providing those mappings actually allows for a tremendous opportunity to simplify how the orthography tables are written, but also a tremendous opportunity to extend their functionality. It can make it so that there isn't a separation between the uncased and cased orthographies, because the orthography doesn't need to know; it would have that mapping available in the form of these databases. That's fantastic. You'd be able to use languages themselves as categories: everywhere that categories wind up being supported, you can use a language. And that's fantastic.

There's more I could talk about, but I don't want to jump the gun too much; I'd like to get into implementing this now. Luckily, it's not too terrible to implement. In fact, the properties generator for the UAX #44 side of things (not the derived properties) only wound up taking me like twenty minutes to write, so it's really not that bad. This is not that much of a divergence, and each one of these things actually simplifies the rest of the code base, which is always fantastic. In fact, that's something that's been happening a lot through this audit: simplifying the shit out of the code base. A lot of methods are actually much shorter, which is wonderful, with better reutilization of existing code; yes, that's technically tight coupling, but it's not object coupling, so it's not a bad thing. All of this means much more sharing, much less I actually have to maintain, much less I actually have to sort through, and this is going to continue extending that, which is awesome.

So that's it for this video. Have a good one, guys.