I was thinking about recording a video outside today, but as you can see, it's a wee bit windy. And since I can't exactly record outside — holy hell does this mic pick up the sound of wind — I'm going to be talking in here.

So, as part of the work on Stringier Core, one of the requirements I've declared for version 4 is complete parity for Glyph: every single function that can work on a Char or a Rune should work on a Glyph. There are some tweaks I wound up having to make to Glyph, which I'm working on today. I don't know how long it'll take, but based on what it looks like, it shouldn't be too long. I'm going to talk about those changes.

First things first: I'm going to be open sourcing the entirety of it. It was originally done as a split — the tables were open source, Glyph itself was closed source. I had done that because — and this is still sort of an ongoing issue — I kept noticing certain parts of Stringier getting put into other libraries, other projects, in very similar ways, without proper credit. Which is unfortunate, and obnoxious. Stringier is under a very liberal license; it doesn't need to be copied like that. People can easily give credit — it's the BSD 3-Clause, a very liberal license with very simple terms. People suck sometimes. So certain things were closed source partly to stop that, but also to help catch it happening: if the work was originally being done completely in the open and I was seeing it followed almost exactly, and then I closed source part of it and they diverged, it's kind of obvious what's going on. But it's been closed source long enough that the people I suspect of doing that have come up with their own solutions for these things, and I don't think they're going to revert their changes and follow mine almost exactly again once I open source the entirety of it — boy, that would be a huge admission of "hey, we've been copying your work." So it should be fine at this point. This also simplifies things quite a bit, in that Glyph development can be done through one single repository and one story. There are changes I'm making to go along with that, though.

We'll cover the open source side of this first. Glyph is table-driven, and that's done as an optimization, because I am almost positive there's a faster way to achieve the intent of Unicode normalization without actually running the Unicode normalization algorithms. The reason is largely that the normalization algorithms go through the whole process of converting the entire string all at once: it creates a new allocation, because a new array is being created, and it's an expensive operation. It also raises the question: when you want to do things like enumerate the visual graphemes of a string, how do you go about that? My design was to have a strong type to represent part of that. Now, there are numerous different parts — it's not that Glyph alone covers every single instance where this kind of thing needs to be done; there are other types intended for those other situations. But as for the tables that drive this: there were two tables, an equivalency table and a variancy table, and I'm going to wind up having to add a third — I'll talk about that. The equivalency table is a mapping of the different variant representations of the same grapheme to an equivalency code.
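To give a rough idea of what the equivalency table holds, here's a minimal sketch — the names and shape here are illustrative only, not the actual source; the point is just that every variant representation of the same grapheme keys to one shared code:

```csharp
using System.Collections.Generic;

// Illustrative sketch of an equivalency table: every variant representation
// of the same grapheme maps to a single shared equivalency code.
internal static class EquivalencyTable
{
    // 'á' precomposed and 'a' + combining acute both key to the same code.
    private static readonly Dictionary<string, uint> Table = new()
    {
        { "\u00E1", 0x00E1 },  // U+00E1 LATIN SMALL LETTER A WITH ACUTE
        { "a\u0301", 0x00E1 }, // 'a' followed by U+0301 COMBINING ACUTE ACCENT
    };

    public static bool TryGetEquivalency(string sequence, out uint code) =>
        Table.TryGetValue(sequence, out code);
}
```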
So now we need to talk about what the hell the equivalency code is. This is a unique code space, kind of similar to the Unicode scalar value code space — and in fact, this time around I'm going to be directly mapping almost the entire Unicode scalar value code space into the Glyph equivalency code space. A code space, if you're unfamiliar — in encodings, especially direct encodings, not compression encodings — is a mapping of integer values (because that's how computers represent literally everything) to some kind of special semantic. In the case of the Unicode scalar value code space, it's a mapping of integers to the actual Unicode characters. So the Glyph equivalency code space should make sense by analogy: it's a mapping of those integers to an equivalency contract. This is actually kind of similar to how the equivalency contracts with the upcoming .NET records are going to work. If two glyphs — two grapheme clusters — have the same equivalency, then they are equivalent; they are equal, regardless of what their representation is.

Now, why am I doing the direct mapping where a Unicode scalar is already there? It largely has to do with optimization. If you've got, say, the exact code point for 'a' with an acute — that's a precomposed character, but also one you can compose through sequences — why would you not just use that exact same value as its equivalency? Then there's no work that needs to be done; you can convert directly in that case. For the cases where characters are composed, that's where the table comes into play. The way I was originally doing that table was to define the full sequence mapped to the equivalency code, so what the table looks like is: these composed sequences, these extended clusters, each have an entry, and they map directly to the precomposed character. For anything that does not have a precomposed character — that's where things are changing. And granted, you guys didn't see the internals of Glyph before, but I was doing the very stupid thing of actually storing the entire sequence, as it was found, inside the Glyph. That's done; I'm not doing that anymore.
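Putting those two paths together, the conversion might look something like this — a minimal sketch assuming the table sketched earlier, with every name here being mine rather than Stringier's:

```csharp
using System;
using System.Buffers;
using System.Text;

internal static class GlyphMath
{
    internal static uint GetEquivalency(string grapheme)
    {
        // Fast path: a lone scalar value is its own equivalency code, so the
        // precomposed 'á' (U+00E1) converts with no table lookup at all.
        if (Rune.DecodeFromUtf16(grapheme, out Rune rune, out int consumed) == OperationStatus.Done
            && consumed == grapheme.Length)
        {
            return (uint)rune.Value;
        }

        // Slow path: composed sequences like "a\u0301" resort to the table.
        return EquivalencyTable.TryGetEquivalency(grapheme, out uint code)
            ? code
            : throw new ArgumentException("Unmapped grapheme cluster.", nameof(grapheme));
    }
}
```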
Now think about the entire Unicode scalar value code space: it only goes from zero to 0x10FFFF — bigger than a 16-bit value can hold, but considerably smaller than the maximum value a 32-bit value can hold. This actually gives us a massive range to define our own equivalency contracts, because the entirety of 0x00110000 through 0xFFFFFFFF is completely unused, and the Unicode Consortium has declared it will never be used for new scalar values. So we can do a system where everything below that point maps exactly to itself. If you pass in, you know, the precomposed 'a' with acute, then we'll just use that value. It's when you have the Latin 'a' on its own followed by the combining acute mark that you have to resort to these tables, and you'd get back the equivalency code for the exact value of the combined 'a' with acute. Anything that doesn't already have a precomposed character will map to a single value in just the same way, but it'll be one of the code points above 0x00110000, because those will never be used by the Unicode code space. So now we never actually need to store the sequence inside the Glyph. This is part of why there's a third table that needs to be added.

But one of those tables, the equivalency table, was defined as, well, just a straightforward table, and that kind of makes sense — except it shouldn't have been a table in the first place. You see, it was an associative array, and for the other two (which we'll get to) it makes sense for them to be associative arrays, but not this one. The whole way determining the equivalency contracts works is by parsing a sequence of characters and then outputting the appropriate equivalency code, and there is a considerably more effective collection type for that kind of work: the retrieval tree, or trie — t-r-i-e, from, you know, the middle part of "retrieval". The useful thing about it is that it's a tree data structure, hence retrieval tree, but it's structured immensely differently. There are numerous variations of retrieval trees, from the standard trie to the Patricia tree to — I think it's the Aho-Corasick — which is technically just a different way to construct a standard trie but is often implemented as its own unique tree, and I believe there are some other variants beyond that. But essentially we're going to be doing a non-generic version, where each node in the tree is just a single character. It can then follow through the entire thing character by character, because that's literally how the sequences are, and when it reaches a terminal node, that terminal node contains the equivalency contract. Tries actually happen to have the fastest lookup of any collection when you can use them — you can't always use them — but they are fantastic for sequences, and here we literally have sequences of characters.

So I went looking through third-party libraries for trie implementations, and I found a few rather good ones. For general purposes, TrieNet seems to be absolutely fantastic; I strongly recommend using it. However, this is not a general purpose thing, and we can optimize a little further because of the very specific domain this is being used in: we know for a fact that this is always going to be character by character. That also lets us add some convenient APIs, because we don't need to match a string — well, I mean, obviously you do, but we're treating the string specifically as a sequence of characters.
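Here's roughly what that non-generic, character-per-node trie looks like — my own sketch of the data structure being described, not Stringier's actual implementation:

```csharp
using System.Collections.Generic;

// Sketch of a domain-specific trie: one char per node, with the equivalency
// code stored only at terminal nodes.
internal sealed class CharTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new();
        public uint? Equivalency; // set only on terminal nodes
    }

    private readonly Node root = new();

    // Adds a character sequence and the equivalency code it resolves to.
    public void Add(string sequence, uint equivalency)
    {
        Node current = root;
        foreach (char ch in sequence)
        {
            if (!current.Children.TryGetValue(ch, out Node? next))
            {
                next = new Node();
                current.Children.Add(ch, next);
            }
            current = next;
        }
        current.Equivalency = equivalency;
    }

    // Walks the trie character by character; null if the sequence is unmapped.
    public uint? Retrieve(string sequence)
    {
        Node current = root;
        foreach (char ch in sequence)
        {
            if (!current.Children.TryGetValue(ch, out Node? next))
            {
                return null;
            }
            current = next;
        }
        return current.Equivalency;
    }
}
```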
I don't want to get into the internals of this, but there are advantages to rolling our own trie. Luckily, tries are actually pretty simple data structures, because there are tons of collection interfaces they don't need to implement at all. In fact, the only two functions you're really interested in using are Add and Retrieve — lookup, whatever you want to call it; Retrieve is the convention, because it's a retrieval tree. So that's not a lot to implement, and it allows us to go character by character. I mean, this would literally follow the exact way it's parsed, which is fantastic. This also means the actual trie structure could potentially be used for parsing glyphs in the first place, which is an awesome prospect, because that is a tremendous amount of code reuse. So we'll be coding a domain-specific trie for these purposes, and it will take the place of the equivalency table.

The other table that was already there was the variancy table. The purpose of this one is to take an equivalency code and generate from it all the different character sequences that can represent that equivalency code. So in the case of 'a' with an acute, one of the strings you'd get back is obviously just a single-element, one-character string of the combined 'a' with acute, but it would also return a two-character string of the Latin 'a' with the combining acute mark. It's probably not super apparent why that would be useful, but it's actually very useful for the variant string, which primarily just drives the glyph adaptation into the patterns engine. It's not super useful outside of that, but it is very important for that, because if you put a glyph into a pattern, what it's saying is: hey, I don't care how this is encoded, I just want you to parse it. One thing I want to look into is whether or not I could completely remove that table, because if we're using a trie for the equivalencies, and I can adapt that trie to also drive the parsing of glyphs, then I don't need the variancy table. So there will potentially still be two tables, but one of them is going to be different — well, both of them are going to be different, but one of them is going to be entirely semantically different.

Then there's the table that definitely needs to be added. This is because, if I'm not storing the sequence — the string — inside of the Glyph struct, how the hell do I support things like ToString()? There are other functions, of course, that need to utilize the actual representation; how do I support those? That's where we do the reverse: the equivalency code needs a mapping back to a character sequence, to a string. That should make sense. You could use the variancy table for that; however, it's inefficient, because every time you do a lookup you're returning an array of strings and then only being interested in the very first one, which is incredibly inefficient. If instead we make this string lookup table return only the shortest sequence that represents that equivalency code, then that's more efficient.
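Sketching both of those remaining tables side by side — again with hypothetical names and a single 'á' entry purely for illustration:

```csharp
using System.Collections.Generic;

// Illustrative sketch of the two associative tables described above.
internal static class GlyphTables
{
    // Variancy table: equivalency code -> every sequence that represents it.
    // This is what drives glyph adaptation into the patterns engine.
    private static readonly Dictionary<uint, string[]> Variancy = new()
    {
        { 0x00E1, new[] { "\u00E1", "a\u0301" } },
    };

    // String table: equivalency code -> the shortest representative sequence,
    // so ToString() never pays for the full variant array.
    private static readonly Dictionary<uint, string> Shortest = new()
    {
        { 0x00E1, "\u00E1" },
    };

    public static string[] GetVariants(uint code) => Variancy[code];

    public static string GetString(uint code) => Shortest[code];
}
```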
So, obviously enough, this is going to lead to performance improvements. Where was the performance before? This is also part of why I would like to open source the entirety of it: the benchmarks for Glyph itself were private, because they were in the private side of the repository. That should make sense — you want rapid access to them, because when you make changes and you're concerned about performance regressions, or you're deliberately coding to get performance improvements, if the benchmarks are not in that same repository, then what do you do? Publish the entire thing, run the benchmarks, and hope you didn't break anything? That's fucking stupid. Now, there are ways around that, but I don't have a monorepo sort of setup for this yet, so that way around it is not there. And since I intend to open source the entire thing anyway, that whole thing's addressed.

Even with the rather bad implementation it already had, it was performance-competitive with Microsoft's TextElementEnumerator. Which is really interesting, because I do not have an efficient implementation at all — that is painfully obvious. So not only would this provide a type-safe implementation for working with, you know, grapheme clusters, it's going to wind up being faster. And that's fucking awesome: not only do you have the additional safety, which usually costs you, you're going to be getting better performance. That's fantastic.

Now, I believe part of the reason for the performance issues with TextElementEnumerator is that it's actually doing a lot more. Like I had stated, Glyph is really only an API for working with grapheme clusters — extended grapheme clusters — and there are other text elements that have equivalency. Ligatures are a fantastic example of that. Was it "sz"? I forget which languages actually use it, but they treat it as a unique character in their actual alphabet. Which is weird, because that's two letters, so why would that be a unique letter in their alphabet? But, you know, whatever. Regardless, this has led to the Unicode standard assigning specific code points for certain ligatures. And there are actually some advantages to that, especially when you have digraphs — technically that's a digraph, not specifically a ligature, but they're conceptually similar enough that you can treat them through the same API, and parsing them would be identical. You parse ligatures and digraphs the exact same way, so you might as well represent them both using the same type.
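To make those code point assignments concrete, the Serbo-Croatian digraph dž is a real example I'll supply here: it has a precomposed code point for each of its three case forms. The Rune demo itself is just my own illustration, not Stringier code, and it previews the title casing point coming up next:

```csharp
using System;
using System.Text;

// The digraph dž has precomposed code points for all three case forms,
// which makes case conversion a single-value swap:
Rune upper = new Rune(0x01C4); // 'DŽ' U+01C4 -- both parts uppercase
Rune title = new Rune(0x01C5); // 'Dž' U+01C5 -- first part upper, second lower
Rune lower = new Rune(0x01C6); // 'dž' U+01C6 -- both parts lowercase

Console.WriteLine($"{upper} {title} {lower}");         // DŽ Dž dž
Console.WriteLine(Rune.GetUnicodeCategory(title));     // TitlecaseLetter

// Decomposed, the same digraph is two scalars with no single value to
// case-map, so title casing would need sequence-aware handling instead.
Console.WriteLine("d\u017E".Length);                   // 2
```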
There's no situation in which they're contradictory with each other, so you can literally represent them through the same API — even though, technically, that is a digraph, not a ligature. And with both, you have title casing. Now, English speakers are familiar with uppercase and lowercase, but not title case. With title casing for a digraph, you only uppercase the first part, not the second part. You see how that's different: an uppercase digraph has both parts uppercase, a lowercase digraph has both parts lowercase, but title case is the first upper, the second lower. If they're represented as a single code point, that conversion becomes immensely easier — but they're not always, and in fact some languages, like German, specifically do not do that: you have an eszett (ß) for lowercase that becomes a capital S, capital S ("SS") when uppercased. That's interesting, but we need ways of supporting it, and Glyph isn't meant to handle any of that. The way you'd address it is to have, you know, another API to deal with it — and there's another one beyond that, but we don't need to talk about that, because it's not what this video is about. But I think part of the reason why TextElementEnumerator's performance is so fucking bad has to do with the fact that it's handling all of those through a single algorithm, whereas I could support the individual parts individually, and then, if you truly need them all, I can provide some unifying thing that handles that. But if you don't — and you often don't — you can just use the parts you need. Composition is a fantastic thing: not only is the code easier to maintain on my end, it allows you to select only what you need. Which is useful.

Once this work is done, I can go back to working on Core again, and that winds up serving as the foundation for actually providing everything I need to support Glyph in all of Core's functions. Because I want to do that: Core and Patterns and Streams should all support working with glyphs, and they will. They will.

So that's it for this video. Have a good one, guys.