Hey guys, I want to talk about the back-and-forth I do, where I'm at right now, and why. It's not a bad thing, but it's definitely an obnoxious thing.

Especially with more complex things that haven't really been done before, it's hard to know what the best approach is. If you're, say, part of a dev studio that makes websites for businesses, and especially if they're similar to businesses you've already made websites for, it's very easy to predict the best way to tackle it, how long it's going to take, all that stuff.

There's nothing like Stringier. Even though there are other text processing libraries, Stringier targets different things. For example, the work SIL does is drastically different from what Stringier is meant for, even though they're both text processing libraries. SIL deals much more with spoken languages, whereas Stringier is general text processing that also includes things like manipulation for creative writing. It's not focused on spoken languages. There's a component of it that does that in a much more simplified way, but it's really not addressing the same thing.

So since there's nothing like it, it's really hard to know how best to structure the project, how long anything's going to take, what the best way to implement anything is, and so on. I don't have the answers to those questions, and that means there's a lot of back and forth. I think that's important, though: you try different approaches and see what works. Sometimes it's "okay, that's just flat-out better, we're going to stick with that."
Sometimes it's "I'm not sure," and then you mull things over for a while and decide to try something else, and maybe that works out, or it's a huge disaster and you roll back. There's a lot of experimentation.

Not much of this actually involves the code itself, though. What I've been finding is that good project architecture is immensely more difficult to get right than the code itself. That's the big thing I've been back and forth on.

If you've been following this work, you'd know that Stringier started as a monorepo. That's unsurprising, considering it started as literally just some very basic extension methods. Why break that apart into separate projects? Splitting had started to occur a little bit, but it was still a monorepo when I added another project aside from the extension methods, for the patterns engine. I was experimenting with that based on some work I'd done in Ada on a similar idea, which I had largely worked out but didn't quite have working: implementing a SNOBOL-style pattern matching engine. It's not at all exactly what SNOBOL did, but it's closer to how SNOBOL did it than to how regex or parser combinators or anything else like that does it.

Numerous things have been added to Stringier since then. We'll skip the whole chronological history and jump straight to what's there now.
I now have a linguistics component that is, again, more basic than anything SIL does, but it largely addresses some concerns I have with CultureInfo and how that whole system works. It allows a much higher degree of granularity in the languages, which makes it easier to support obscure languages, things that don't necessarily have a culture identifier associated with them yet. That doesn't mean they shouldn't have working computer support. That's bullshit. So it's a system designed around addressing that much more easily.

The intention was to have all of the casing (to upper, to lower, and the to title case that everybody forgets about but that's important for some languages) go through the orthography, which is part of the linguistics component, instead of going through CultureInfo. Especially since CultureInfo doesn't support the to-title-case part, and that matters: any language with a digraph needs to support title casing, and there are plenty of languages that have digraphs. Some of them I can guarantee you've even heard of.

It's bizarre. Digraphs are very common, scattered throughout different parts of Europe. It's not even specific to one language group; there are several different languages that use digraphs. And it's weird: you get multinational companies that should know better, because they've got departments all over, that are disturbingly Anglocentric in how they do their programming. I don't get it. Meanwhile, you've got this bumfuck from a rural-ass area.
You can't even really consider this the country, because it's more like the backwoods that I'm in, and I care more about globalization than an actual international company does. That makes sense.

But you've got other components as well, like the categories that I did for far richer Unicode categorization than the UnicodeCategory enum. It's an actual class that can be derived from, and it supports all sorts of complicated extensions. Since it's an actual class, you can have immensely more than just 32 or 64 different categorizations. It is itself derived from Set, so set theory and set algebra apply to it, which means you can freely compose the categories through set algebra, which is fantastically powerful.

You've got the encoders API, which has been way faster and more efficient than the crap that Microsoft has produced: about a third of the memory usage, because it actually supports stream decoding and encoding. UTF is a stream encoding. Why would you limit it to buffers when it literally does not have to be limited to buffers? It's dumb. The encoding was specifically designed so that you could constantly have access to that information, so that you could decode on the fly or encode on the fly. You don't need to put that into a whole buffer. Conveniently, it is very easy to adapt stream encoders and stream decoders to buffers, so I still provide the buffer encoding and decoding. It just works on streams by default. Way more efficient.

It also means you don't have to buffer huge amounts of shit out of a stream as part of the encoding or decoding phase. You can buffer as part of an optimization instead. So you've just removed a buffer in a way that actually speeds things up. I know that's weird. I promise that makes sense.
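To make the "stream first, buffer as an adapter" idea concrete, here is a minimal sketch in Python. It is not Stringier's actual .NET API: the names `decode_stream` and `decode_buffer` are made up for illustration, and Python's standard incremental decoder stands in for a stream decoder.

```python
import codecs
from typing import Iterable, Iterator


def decode_stream(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode text as bytes arrive, without collecting the whole input first.

    Hypothetical helper for illustration; not part of Stringier.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # The decoder keeps any incomplete trailing sequence as internal state,
        # so a code point split across two chunks is still decoded correctly.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; this raises UnicodeDecodeError if the stream
    # ended in the middle of a multi-byte sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def decode_buffer(data: bytes, encoding: str = "utf-8") -> str:
    """Buffer decoding as a thin adapter over the streaming decoder."""
    return "".join(decode_stream([data], encoding))


payload = "naïve 🎉".encode("utf-8")
# Feed one byte at a time to simulate a stream with arbitrary chunk boundaries.
one_byte_chunks = (payload[i:i + 1] for i in range(len(payload)))
print("".join(decode_stream(one_byte_chunks)))
```

The only state carried between chunks is the handful of bytes of an unfinished sequence, which is why the buffer version can be layered on top of the stream version and not the other way around.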
There is such a thing as over-buffering, and it's bad. This allows the one buffer to exist where it needs to, to provide the optimization of reads and writes, and not just to support a feature that was implemented inadequately. That's never a reason to use a buffer. That's bullshit. You need to implement things right.

Glyph didn't exist when I first started this. Now, I really don't know of too many languages at all that have this. I did discover that D actually supports it in its own library. The implementation looks a little different from how I did it, but it's an actual type for working with Unicode grapheme clusters, which is fantastic. I've been going through and adapting everything to use Glyph, and making sure the approach is very efficient, because that matters. That's how people think they're working with text, and that's how you actually want to deal with it. Otherwise you get all sorts of globalization issues, and that's one of the things Stringier is trying to help prevent.

So there are search functions and metric functions, and the streams API. I realized I needed to completely redo the streams API, because Microsoft's streams are broken. As in, they did not test them adequately, and it wasn't even hard to find the area in which they were broken. They can't do something they say they're supposed to do, and it's not fixable, because fixing it requires moving a buffer from one class into another, and due to some leaky abstractions you cannot actually do that. So the Microsoft streams just have to stay broken. That's terrible.
So Stringier streams are a replacement for that, along with a redesign, because there are conflations of abstractions that I'll get into in a more appropriate video. "Stream" refers to a specific concept, and I sometimes see it shifted to the wrong level of abstraction, to where streams are kind of on the same level of abstraction as a file. That's not right. Files and streams are different, and if you have them operating at the same level of abstraction, then how are they any different at all? Why would you do that? So care was taken to make sure it represents a stream and only a stream, and that's the end of it. Putting that at a file level of abstraction would require a layer on top of it for adding the file-specific stuff. You know, appropriate abstractions.

The reason I was bringing up the architecture, and then all of this, is that architecture is something I've had a lot of back and forth on, a lot of struggling. Like I said, it started as a monorepo, but it didn't develop that way. I strongly favor modularity, and I think it's something we should all strive for. Modularity is a good thing. Having things as discrete modules enables them to be more efficiently placed. There's less coupling, so testing generally becomes far easier. Coupling in and of itself is not a testing problem, although it's very prone to introducing testing problems. With code that is tightly coupled, tests tend to run through multiple pathways and get really good coverage. Don't tightly couple your code just under the idea that you'll get good test coverage, though, because it is extremely error prone.
You need a lot of tooling to help you with that. However, you also need a lot of tooling to help you with the situation of high modularization.

Now, I didn't break Stringier apart to the microservices level of modularity. Then we'd be talking about a single package and library for a single function, and that's a little insane. I don't think microservices are a good idea. I'm still trying to find somebody to convince me otherwise, but I've listened to quite a few conference talks on it now, probably about 20, and it's just consistently: I can't see this ever being a good idea. And I'm somebody who thinks everything has a niche. It's too far in the other direction.

Things can be too monolithic, too, though I don't think the monolith introduces as many problems as microservices do. Ideally you have a good middle ground: a modularization based on discrete concepts, not necessarily the smallest level you could potentially separate at, but a conceptual boundary. And this is what I was doing with Stringier.
I'd broken down each of those components I mentioned into their own modules, their own libraries, their own repos, and I had built the entire thing up like that. Stringier version 3 had been designed entirely like that.

What I had been noticing, however, especially during the v4 audit, is that there were quite a few instances (a lot of instances, actually; "quite a few" downplays it) where code was being duplicated across projects for various reasons, and the duplication wasn't exactly orthogonal. This is a problem commonly seen with microservices, and I didn't even hit microservice level and I was still hitting this problem. It's a strong indicator that you do not have things as granular as they could be. I shouldn't say that: it may be an indicator that you do not have things as granular as you could have them, and that should be the first thing you start looking into. But the other possibility is that your logical boundary is wrong, and that's what it was in my case. Stringier actually needs to be tightly coupled because of how it works.

That seems a little weird, but let's consider an example. I had this little aside where I resurrected Collectathon and built it up to a level satisfactory for use as a collections framework and collections library with Stringier, to simplify a lot of the code I had been introducing. There was a growing number of data structures being implemented in Stringier, and having a framework to build upon makes those easier to implement, easier to test, and less code in general. It's very helpful.

One data structure you would like to implement is the rope. Ropes, for those unfamiliar, are a type of dynamic string. It's complicated; it's not just a matter of a dynamically resized array or a linked list, although I see a lot of people claiming to implement ropes and doing it through linked lists. That's not a rope, guys. Rope refers to a very specific computer science concept. If you're
gonna do a linked list of strings, call it something else. Ropes have benefits. I'll talk about that another time; I don't need to get into it for the sake of this.

But you want to implement this data structure and have it usable throughout Stringier wherever it would be appropriate to use it. This means the ropes need to be accessible to all the different functions that are implemented in Stringier. Those functions are in Core, which is generally supposed to be the lowest major component. There are a bunch of minor components below it, but it's supposed to be the lowest major component. Structures, much like the other single-type libraries, should probably be below Core, so that Core can depend on Structures and then use the rope type. But the rope type, and anything else that needs to be within Core, has to be implemented in part with some of the functions that are in Core.

This isn't the only case where this kind of coupling happened. Certain functions would make sense with a patterns overload, much like how I was introducing category overloads. A quick example of where a category makes sense as an overload is trim. It makes sense to pass trim a character or a rune or a glyph and trim those, but it also makes sense to pass in a category. Say you want to trim all whitespace characters, ignoring the fact that the overload without any parameters already does that, just because it's an easy example. Or trim all punctuation, or trim all box-drawing characters; that's another example where this actually makes sense and isn't provided by an existing overload. Do you want to list every single character that's in there? That's a little less than ideal. How do you make sure that you have them all?
Unit tests, but then everybody's got to duplicate their unit tests, and that means the library really isn't providing as much as it could be. Ideally you'd do something like that within the library. So: pass the category in, and the category does the check against every single character or rune or whatever it is the function needs to iterate over. For trim, it iterates over, I believe, runes, and it chops off all the runes that match that category. In the case of whitespace, it would clean off all the whitespace characters. Although for that overload specifically, since there are no whitespace characters above the Basic Multilingual Plane, what it actually does is iterate over the chars, because it's slightly faster. But that's an implementation detail; they accomplish the same thing. Stringier tries to be a little clever at times. As long as the behavior is the same, the implementation can deviate a little where appropriate. Where it can make certain assumptions, it makes those assumptions.

But there are certain instances where a pattern would also be appropriate. I'm not looking through the source code right now, so one example I can come up with off the top of my head is a variant of ensure-begins-with and ensure-ends-with where you give it a pattern along with a default, because once you have a pattern, you can't just attach that pattern as text.
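Going back to the category overload for trim described above, here is a rough sketch of the idea in Python. Everything in it is an assumption made for illustration, not Stringier's API: the `Category` class, the three predefined categories, and the `trim` function are all hypothetical, and Python's iteration over code points stands in for iterating over runes.

```python
import unicodedata
from typing import Callable


class Category:
    """A set of characters defined by a membership test (hypothetical type)."""

    def __init__(self, contains: Callable[[str], bool]) -> None:
        self._contains = contains

    def __contains__(self, ch: str) -> bool:
        return self._contains(ch)

    # Set algebra: each operator builds a new category out of two existing ones.
    def __or__(self, other: "Category") -> "Category":
        return Category(lambda ch: ch in self or ch in other)

    def __and__(self, other: "Category") -> "Category":
        return Category(lambda ch: ch in self and ch in other)

    def __sub__(self, other: "Category") -> "Category":
        return Category(lambda ch: ch in self and ch not in other)


# Assumed example categories; a real library would define many more.
WHITESPACE = Category(str.isspace)
PUNCTUATION = Category(lambda ch: unicodedata.category(ch).startswith("P"))
BOX_DRAWING = Category(lambda ch: "\u2500" <= ch <= "\u257f")


def trim(text: str, category: Category) -> str:
    """Strip leading and trailing code points that belong to the category."""
    start, end = 0, len(text)
    while start < end and text[start] in category:
        start += 1
    while end > start and text[end - 1] in category:
        end -= 1
    return text[start:end]


# Compose categories with set algebra instead of listing every character.
print(trim("┌─ title! ─┐", BOX_DRAWING | WHITESPACE))
print(trim("...hello!?", PUNCTUATION))
```

The caller never enumerates the box-drawing or punctuation characters; the category owns that knowledge, so the completeness tests live in one place.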
Attaching a pattern as text doesn't make any sense, but you can give the function that pattern, and if the string already has any of the matches, it just returns the string as is; if it doesn't, it attaches the default. Say something like the lineage designators: senior, junior, the third, the fourth, the fifth. If the name has any of those, don't add anything. But if it doesn't have any of those, then the assumption is that it's senior, and you ensure "senior". That kind of thing. It makes sense for some methods to provide patterns overloads.

So do you overload all of those functions and put them in the patterns library, even though the functions are supposed to be within Core? If you keep them within Core, you have a circular dependency, but if you don't, then you're not actually segmenting on logical boundaries. It's less than ideal.

What became clear is that, generally speaking, Stringier doesn't have these tight logical boundaries I was trying to find. In fact, Stringier overall is the logical boundary, with one exception: the Rune backport. That can stay as its own library. The entire rest of Stringier has to be merged back together. That is the appropriate architecture for it; I'm almost positive of that now. I've spent the last two days doing this, and what I have noticed is that when you start linking these back together, when you start putting them all within the same project, oh my god, the code gets a lot simpler.
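As a sketch of the ensure-ends-with variant with a pattern and a default described earlier, using the lineage designator example: this is Python, the function name and the abbreviations in `LINEAGE` are assumptions, and Python's `re` module is only a stand-in here, since Stringier's pattern engine is explicitly not regex.

```python
import re

# Assumed spellings of the lineage designators, for illustration only.
LINEAGE = r"\b(?:Sr\.|Jr\.|III|IV|V)"


def ensure_ends_with(text: str, pattern: str, default: str) -> str:
    """Return text unchanged if it already ends with a match of pattern.

    Otherwise append the default, since a pattern itself cannot be
    attached as text. Hypothetical helper, not Stringier's API.
    """
    # \Z anchors the match to the very end of the text.
    if re.search(rf"(?:{pattern})\Z", text):
        return text
    return text + default


print(ensure_ends_with("John Smith Jr.", LINEAGE, " Sr."))
print(ensure_ends_with("John Smith", LINEAGE, " Sr."))
```

The dependency problem shows up even in this toy: the function is a core string operation, yet it cannot be written without the pattern engine.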
Once everything is in the same project, you can bypass a lot of the public APIs and go to internals. Now, that doesn't necessarily seem like a huge deal, and there are some of you going "oh my god, what the hell is wrong with you." But here's a fantastic example of one of these cases: the Glyph type. The whole point of the Glyph type is to represent a Unicode grapheme cluster as efficiently as possible. Oftentimes the grapheme cluster can be taken directly from a single char or rune. Going through the public constructors involves a whole validation check... no, I can't use that example anymore, because it doesn't do that anymore. It actually just directly imports it. Never mind.

But there are other cases: the encoders. There are a lot of encoding changes done throughout Stringier for various reasons, and the encoder public API does validation. If you're encoding two surrogate halves into a single Unicode scalar value, it's going to make sure that the high surrogate is actually a high surrogate and that the low surrogate is actually a low surrogate. Then it's going to take that scalar value and construct a rune from it, and as part of that it's going to validate that the scalar value it's about to return is an actual valid scalar value. There is an internal unsafe version of this encoder, which the public one calls, that doesn't do those checks, because essentially it's the public API's job to do the checks and then just call the unsafe one. So if you're doing work elsewhere and you know you produced valid values, whether through some assertion you can make or because you've already validated them, why would you validate again by passing them into the public one?
Just pass it into the internal unsafe one. You don't need to check the same thing multiple times, and it's faster if you don't.

As for all of that modularity: given the right kind of system, the right kind of architecture and everything, you can accomplish that kind of modularity. In fact, a big part of why I constantly tried to do that with my projects is that the way Ada's namespacing works through its packages enforces that it's possible, which is clever. There are a few leaky problems: there's a certain exploit you can do because of an information leak that should not exist, and that exploit is problematic. But the general idea of how Ada deals with that is actually fantastic and allows that higher degree of modularity, which is then never utilized by any project I've seen out there. It's mind-boggling why that's the case, because you could actually build each package as a discrete shared library and swap them out freely. Ada has very strict rules on how that's done, so unlike C++, that's a much more feasible thing to do. That's fantastic. A good package manager would be able to exploit the living hell out of that to provide all kinds of special efficiencies, but it's just not done.

So Stringier has to be built as what is essentially a single monorepo, not including the Rune backport. It is what it is, but this is fine, because the point of Stringier is to be a large part of the runtime for Langley, and then for what Langley gets used to build. With those, I can design modularity right into the language to address this problem a little better. So it's only a temporary problem; it'll go away in time. For now, Stringier is just very tightly coupled.