So I've been working on this for a while, and it's time I actually formally introduce it, because I happen to be very impressed with this project. This is going to cover two separate assemblies, that is, in the .NET world, projects compiled into individual libraries that belong to the same solution. The original one was called Stringier, as it's sort of an extension library to strings in general, so it kept that name. The other one is an implementation of pattern combinators, not to be confused with parser combinators, that was built sort of on top of this and does use some of the Stringier extensions. We'll get into that.

So, first, covering the Stringier extensions library: what is it? I think one of the best ways to start is to actually talk about what it's not. It's not a new string class. It is quite literally an extension library on top of char, enumerables of char, string, and enumerables of string. Because it works on enumerables, whether you have an array of char or string, a list of char or string, or whatever, as long as it's an enumerable, it will work on those. So you've got your bases covered.

This is going to be a non-exhaustive coverage of some of the extensions it provides. It's pretty minimal, to be perfectly honest; there's a lot there. I will have, down in the video description, a link to the documentation website all this is on. I highly recommend checking that out, as it covers these in considerably more detail. But let's show off some of the useful ones.
Well, they're all useful, but Clean takes any instances of multiple spaces and reduces them down to a single space, while also trimming spaces from the beginning and the end. So if you have, you know, the string "hello world" but with excessive amounts of spacing that you don't want there, a single Clean operation will reduce it down to a nice, tidy "hello world" string.

EnsureBeginsWith, and there's also a related EnsureEndsWith: I wound up needing these quite a bit, and found it's much easier to just call this instead of checking whether it's there and then doing the appropriate thing if it is or isn't. So, ensuring that "Mr." is in front of any name, it takes "Bob Saget" and returns "Mr. Bob Saget", but if you've already got "Mr. Bob Saget", it just gives you back "Mr. Bob Saget".

I have Join listed here just for string arrays, but it works on any IEnumerable, so it's not specifically arrays. In this case there already is a String.Join; however, that was from before extension methods were a thing, so they kind of had to do it that way, where it's inside of, from the C# world, a static class, or from the Visual Basic world, a module. I added this in here to show that a lot of the things that were in these modules before, you can now call through extension methods. It just makes things, one, more orthogonal, and two, more convenient for seeing what's there, since when you're working with any of these languages you primarily work through methods, and when you're going through the instance methods, the one you want is not there.
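To make the behavior of the clean and ensure-begins-with operations described above concrete, here is a minimal Python sketch. It is not the library's code (Stringier is a .NET library); the function names and the space-only handling are assumptions made for illustration.

```python
def clean(text: str) -> str:
    """Collapse runs of spaces into one space and trim spaces from both ends."""
    # Splitting on a single space yields empty strings for every extra space;
    # dropping those and re-joining leaves exactly one space between words.
    return " ".join(part for part in text.split(" ") if part)


def ensure_begins_with(text: str, prefix: str) -> str:
    """Return text unchanged if it already starts with prefix, else prepend it."""
    return text if text.startswith(prefix) else prefix + text


def ensure_ends_with(text: str, suffix: str) -> str:
    """Return text unchanged if it already ends with suffix, else append it."""
    return text if text.endswith(suffix) else text + suffix


print(clean("   hello     world  "))
print(ensure_begins_with("Bob Saget", "Mr. "))
print(ensure_begins_with("Mr. Bob Saget", "Mr. "))
```

The point of the "ensure" helpers is that the caller no longer writes the check-then-modify branch by hand; calling them twice gives the same result as calling them once.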
A method like String.Join is actually a static method belonging to a static class, so it can be easy to forget that these things are there. So this just kind of helps with that, and I've got the example there for anybody who doesn't know what a join does, but most of you are probably aware.

Then there are some utility things like Occurrences, which simply counts the occurrences of whatever you specified, in this instance a single character. And, because these are useful and the conversions are a little bit trickier than some people expect, we've got ToCamelCase, which takes a string and converts it to camel case. ToPascalCase is another one, although there are others as well, just to expand beyond ToLower and ToUpper.

So then, some ideas for future work. Right now, as I explained, these only work on char, IEnumerable of char, string, and IEnumerable of string. It would be nice to make these take spans of char as a parameter and also return spans of char. Now, I'm pretty sure that Span already implements IEnumerable; I might be wrong there. But just because of what these specifically are, it would be a good idea to have a specific implementation for spans, to cut back on the copying, allocations, or other overhead that the more generic methods would have.

So then, getting into the real meat of this presentation: the patterns implementation in Stringier. What is this exactly? What is a pattern combinator? Again, I think a good starting point is to say what it's not. It's not regex. And I think I forgot to add this in here, but it's not a parser combinator either. This is quite a different approach, and that's why I'm explaining it here. (I did remember. Good on me.) So now, what is this?
It takes inspiration from SNOBOL4, SPITBOL, and Unicon; in my case primarily from Unicon, but Unicon ultimately comes from SPITBOL or SNOBOL4 and was heavily inspired by how they did parsing, because it was, I find, a very elegant approach. But I need to add some further information about what this is not: it's not a SNOBOL clone. It's not copying that approach. It's inspired by it, but it is not the same approach.

So why do this? Why create a new thing? Because regex sucks. Say we have a regex for matching a name; in this specific instance, really just any word followed by an optional amount of spaces and other words. That would handle many cases of names, although I'm sure someone's name out there violates this. But then you have a closing statement, like the closing of a letter where you sign off, and I'd like to combine these. I want "Sincerely", followed by the name that I had already written, followed by a comma. But instead I have to create an entirely new regex with everything contained inside of it.

Now, just to be fair, because I know somebody's going to bring this up: you can save parts of the regex as independent strings and concatenate them. There are issues with that, like having to remember downstream, when you're actually using the declaration from before, whether or not you need to add word boundaries. You have to be careful for a lot of reasons, and it winds up being quite fragile. Really, regex just was not designed around any kind of "don't repeat yourself" or reusability in any way.

So, that same kind of example using patterns. The name declaration is exactly the same. We'll cover exactly what these different operators are really soon; this is just to show the general comparison. The closing statement, or the closing pattern, rather, winds up working.
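One way to see the fragility of gluing regex fragments together as strings is a short Python sketch using the standard `re` module. The fragments and the second sign-off word are hypothetical, chosen only to show one failure mode (an unguarded alternation); the word-boundary issue mentioned in the talk is a separate problem of the same kind.

```python
import re

# A word, optionally followed by more words separated by spaces.
name = r"\w+(?:\s+\w+)*"

# Hypothetical fragment: two accepted sign-off words.
sign_off = r"Sincerely|Regards"

# Naive reuse by string concatenation. The '|' now splits the WHOLE
# expression into "Sincerely" OR "Regards <name>,", which is not the intent.
naive = sign_off + r"\s+" + name + ","

# The downstream user has to remember to wrap the fragment in a group.
fixed = "(?:" + sign_off + ")" + r"\s+" + name + ","

text = "Sincerely Bob Saget,"
print(re.fullmatch(naive, text))               # None: the alternation leaked
print(re.fullmatch(fixed, text) is not None)   # True
```

Nothing in the fragment itself signals that it needs grouping; the knowledge lives in the head of whoever reuses it, which is the kind of downstream bookkeeping the talk is objecting to.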
It does exactly what we want it to do, and it winds up basically being the same comparison. Some of the internals are slightly different, of course; it's not the same thing. But in these instances everything is so similar that it's pretty much a one-to-one comparison, and this is clearly a reusable approach.

Now, just to address this idea that patterns died out: there is actually quite a list of SNOBOL4- or SPITBOL-inspired pattern libraries. There's even a language group, a small one, but a language group, inspired by these.

So, some of the concepts behind this. A literal is just an exact match, and as you can see, it's declared by just assigning it a string, as if it were just a string. As long as the language you're using does implicit conversions, this works. I do have, in that documentation I alluded to earlier, examples in both C# and F#. Given how this works, I find the C# and Visual Basic usage is pretty much identical, so you don't really need additional examples for Visual Basic. But there is F#-specific support for this, partially dealing with the fact that there aren't implicit conversions in F#, and partially just making it a more functional-friendly design. There is some more I want to say about that later; you'll know when I get to it. Just keep in mind for now that, much like regex, you can use these patterns in any language. It's not a specific implementation.

So then we have the alternator: basically, just match either of these. You're defining the second choice as an alternate of the first, and it's just done with an or operator.

Chained matches, or concatenators. I see I used the old name I had here from when I was originally implementing it. That's my bad; this should say concatenator. But it's just chaining them.
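The three building blocks named so far (literal, alternator, concatenator) can be sketched in a few lines of Python. This is an illustration of the idea, not Stringier's implementation: the class names mirror the talk, but `consume`, `matches`, `lift`, and the choice of `&` for concatenation are assumptions of this sketch (the talk only says alternation is written with an "or").

```python
class Pattern:
    """A pattern describes what to match; consume() tries it at one position."""

    def consume(self, text, pos):
        """Return the position after the match, or None if it fails."""
        raise NotImplementedError

    def __or__(self, other):   # alternator: first | second
        return Alternator(self, lift(other))

    def __and__(self, other):  # concatenator: first & second (assumed operator)
        return Concatenator(self, lift(other))

    def matches(self, text):
        """True when the pattern matches the whole text."""
        return self.consume(text, 0) == len(text)


class Literal(Pattern):
    def __init__(self, value):
        self.value = value

    def consume(self, text, pos):
        if text.startswith(self.value, pos):
            return pos + len(self.value)
        return None


class Alternator(Pattern):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def consume(self, text, pos):
        # The second choice is only tried when the first one fails.
        end = self.first.consume(text, pos)
        return end if end is not None else self.second.consume(text, pos)


class Concatenator(Pattern):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def consume(self, text, pos):
        # The second pattern starts where the first one stopped.
        mid = self.first.consume(text, pos)
        return None if mid is None else self.second.consume(text, mid)


def lift(value):
    """Stand-in for the implicit string-to-pattern conversion."""
    return Literal(value) if isinstance(value, str) else value


name = Literal("Bob") | "Alice"
closing = Literal("Sincerely ") & name & ","   # reuses `name` as a value
print(closing.matches("Sincerely Alice,"))
print(closing.matches("Sincerely Carol,"))
```

Because `name` is an object rather than a string of regex source, reusing it inside `closing` needs no extra grouping or escaping.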
So the first comes before the second, which comes before the third, or conversely, the second comes after the first and the third comes after the second.

Repeating matches are just saying it has to happen this many times. I do have planned making this work with the new .NET Range type, so you could say it must happen between this and this other amount. Right now it's just a specific amount.

There's also the spanner, which is what I'm calling it; it's repeating, but repeating one or many times. You'll find in most cases these are very, very similar to the regex definitions. I'll cover a little bit more about this, but basically it's because of the design: you describe what it is that you want to parse, you describe the match, not how to do it. That's what regex does, and it's different from what parser combinators do.

Optional: same kind of thing. I do want to say that in this instance the regex works because it's declaring that it's lazy, that it should be lazily parsed, or lazily matched, or however it is they describe that. The pattern does not quite work the same way, but based on how most people use it, it's the same kind of thing, and in fact in the regex example I gave there it is the same thing. Where that lazy symbol, the question mark, actually differs is when you put it behind something like a dot-star; then it changes behavior quite a bit. The next example will cover a little bit about that. Wait, I think the next example's on Kleene. Well, it is. Okay.

So what about the Kleene star? I don't have a specific symbol for the Kleene star. You just make a spanner optional, and it does exactly what you would think it does. Just to allude to one of the features of this: there are some pitfalls, like if you were to reverse these, making the example optional and then spanning it, so you've reversed the operators.
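The repeater, spanner, and optional, and the Kleene star built as an optional spanner, can be sketched as follows. This is a hedged Python illustration, not the library's code: the unary `+`/`-` spelling follows the "plus minus" wording in the talk, while `*` for the repeater, the class layout, and the `ValueError` are assumptions. The constructor check shows why the reversed order is a problem: an optional always succeeds without consuming anything, so spanning it would never stop.

```python
class Pattern:
    def consume(self, text, pos):
        """Return the position after the match, or None if it fails."""
        raise NotImplementedError

    def __mul__(self, count):  # repeater: pattern * 3 (assumed operator)
        return Repeater(self, count)

    def __pos__(self):         # spanner: +pattern, one or more times
        return Spanner(self)

    def __neg__(self):         # optional: -pattern, zero or one time
        return Optional(self)

    def matches(self, text):
        return self.consume(text, 0) == len(text)


class Literal(Pattern):
    def __init__(self, value):
        self.value = value

    def consume(self, text, pos):
        return pos + len(self.value) if text.startswith(self.value, pos) else None


class Repeater(Pattern):
    def __init__(self, inner, count):
        self.inner, self.count = inner, count

    def consume(self, text, pos):
        for _ in range(self.count):
            pos = self.inner.consume(text, pos)
            if pos is None:
                return None
        return pos


class Optional(Pattern):
    def __init__(self, inner):
        self.inner = inner

    def consume(self, text, pos):
        end = self.inner.consume(text, pos)
        return pos if end is None else end  # never fails, may consume nothing


class Spanner(Pattern):
    def __init__(self, inner):
        # Construction-time check: spanning something that can succeed
        # without consuming input would loop forever while matching.
        if isinstance(inner, Optional):
            raise ValueError("cannot span an optional; did you mean -+pattern?")
        self.inner = inner

    def consume(self, text, pos):
        end = self.inner.consume(text, pos)
        if end is None:
            return None  # needs at least one match
        while end is not None:
            pos, end = end, self.inner.consume(text, end)
        return pos


kleene = -+Literal("ab")               # optional spanner: zero or more "ab"
print(kleene.matches(""), kleene.matches("ababab"))
print((Literal("ab") * 2).matches("abab"))
```

Writing `+-Literal("ab")` in this sketch raises at the point where the pattern is defined, before any text is ever matched against it.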
So that's plus-minus example, and that actually winds up creating an infinite loop when parsing. That is caught at construction time: when you define the pattern, when the runtime goes to initialize that variable, it will throw an exception right there, fully explaining exactly what was wrong. So it will catch that issue and tell you that you probably meant to write those the other way around.

Capture and back reference. This one is on the capture part of it. Basically, you just call a capture method and have your output with the capture type, the capture type basically being a string, just with some trickery to make it lazily resolved, because that's actually important when you go to use it as a back reference. For the actual back reference side of this, you just take that output result and throw it back into a pattern, and because of lazy resolution it winds up working quite well.

So, those of you who watch my other videos know damn well that I care quite a bit about performance and don't think people measure it anywhere near enough. In fact, a big gripe of mine is the large number of programmers who love to say things like "oh, our solution is really performant and lightweight" but have no evidence. I'm not doing that. So let's get into this. I've benchmarked these extensively, and the benchmarks are in the repo, so if you want to run them yourself on your own machine, go right ahead. I want to see any variations, any oddities. I don't want to pretend that I outperform these other implementations of text parsing, matching, whatever you want to call the broad group. I don't want to say I'm better than them or whatever. I want it to be very, very clear how these stack up.

So, when it comes to parser performance: this is specifically when you already have the pattern, or regex, or parser combinator thing done. You've already initialized that to a variable.
This is the performance when you call Regex.Match, the run for FParsec, or any of the parsers that my approach works with. There was no section on that; technically, because of the separation, because what you're defining is a pattern, there are different parsers that can be called on it. Again, check the documentation. There'll be a link that goes into far more detail, especially since I don't want to cover them too much here, because I don't know how many more I'm going to add.

But basically, as you can see, the performance of this stacks up pretty well. I find the performance of Microsoft's regex implementation tends to lack quite a bit, although it's not terrible. Generally speaking FParsec beats it out, but it's not terrible. Overall, though, you can't say whether I'm a clear winner or loser. It's just kind of: I'm competing.

One thing to note: I am not super familiar with FParsec. I did not want to create misleading benchmarks for things that I could not figure out how to implement well, or could not find a good implementation of already. So I left out the web address and IPv4 address benchmarks for FParsec; you will not see them for it. I'm sorry. If you know how to implement those well in FParsec, please add them to the benchmarks. I don't, and I'm not going to mislead people.

So the next thing is how much memory usage occurs while parsing, and you can see that FParsec uses up quite a bit. Microsoft's implementation is a bit more lightweight as far as memory goes, and in, what, two cases,
I'm competitive with Microsoft's, and in one case Microsoft beats me out really badly. I'm pretty sure I do not have an optimal description of a web address pattern yet. That's sort of a pitfall of using a new design: I don't necessarily know the best way to use my own thing. But in two cases I'm using substantially less memory, and I'm quite happy about that. One thing to note is the phone number: while it looks like I forgot to do the benchmark or whatever, at least on my machine it doesn't require any memory allocations to do that. So that's cool.

So then we get into how much time is spent actually constructing these. That is, when you assign the regex or the pattern or whatever to a variable, how much time does it take for the runtime to construct that object? For both my own and FParsec's, it's pretty efficient, actually. It doesn't take very long at all. I beat FParsec by quite a bit, and regex is just, oh my god, it takes quite a bit more time. Now, while this might seem quite alarming, you only initialize it once and you're probably using it multiple times.
So it's really not that bad. Don't use this as your primary factor when deciding what approach you want to use. This is not your top priority in the majority of situations. But the fact that mine winds up being the fastest is actually extremely important for something I'm going to talk about towards the end: how patterns can be optimized under the hood. Self-optimizing, whatever you want to call it, but it's possible to have the constructor rewrite parts of the pattern in ways that are known to be more efficient. That is extra work, but because mine is so fast when it comes to construction, that extra work still leaves me as fast.

So then, the memory usage. As you can see, it winds up basically being the same. The exact values are a little off, so the ratios relative to each other are off, but it basically looks like the same thing: mine winds up being the most lightweight as far as construction goes, followed by FParsec, with Microsoft being abysmal.

But this isn't the whole picture. In fact, one thing you want to do is check against a failing match. That is, you have something like a pattern, regex, whatever, that would match a phone number, but the phone number is typed incorrectly. There's a letter in it or something. How does it respond to a bad match? You can see mine does well, really well. In fact, the numbers wind up being basically the same as for the successful match. Microsoft's, as far as performance, wound up being about the same, but when it failed it did not need to use any memory.
I believe that is because regex systems typically do not, and Microsoft's definitely does not, use any type of error reporting when it doesn't work. You just kind of know that it didn't work, and you don't know at what point it failed. That's part of the reason why these regex tester websites are quite popular, and I know I've used them extensively, because it can be a little tricky to debug those kinds of things. But my approach, as well as FParsec's, does error reporting. Clearly one of us handles failures and does error reporting much more efficiently.

So then I stress tested them. Stress testing was done by generating a very large string, specifically from zero all the way up to the integer max size divided by four. If you were to write this out to a file, it would be just over half a gigabyte. The string contains a random selection from an array of every lowercase letter in English, as well as a space, so what you get are these words delimited by spaces. This was done because it's quite easy to implement a matcher or a parser for these, and because the primary purpose of this test is: how does it respond when consuming incredibly large files? Things don't scale linearly. Something that performs really well on a small scale doesn't necessarily perform the same way on a large scale. And you can clearly see from the results here that the performance did not scale; what the charts I showed before suggested is not reflected here.

And what are the findings? Good job with the scalability, Microsoft. I seriously commend that. I expected you to be the worst performing, and as it turns out, you beat mine across the board. Like, seriously, really good job. Clearly, if you're working with absolutely massive files, there's a good approach to use. Mine holds up well, though.
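For anyone wanting to reproduce the shape of the stress-test input described above, here is a small Python sketch of the generator. It is not the benchmark code from the repo; the function name, the seed, and the reading of "integer max size" as the 32-bit signed maximum are assumptions. The sample is kept small so the sketch runs instantly.

```python
import random
import string

# Every lowercase English letter plus a space, as described in the talk.
ALPHABET = string.ascii_lowercase + " "


def make_stress_text(length: int, seed: int = 0) -> str:
    """Build a string of random lowercase letters and spaces."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    return "".join(rng.choices(ALPHABET, k=length))


INT32_MAX = 2**31 - 1            # assumed meaning of "integer max size"
full_length = INT32_MAX // 4     # 536,870,911 characters
print(full_length)

sample = make_stress_text(1_000)
print(len(sample), len(sample.split()))  # characters, then space-delimited words
```

At one byte per character, 536,870,911 characters is about 0.54 GB, which agrees with the "just over half a gigabyte" figure.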
It's just barely trailing behind. One thing to note is that Microsoft's approach does use a little bit of memory, 8 KiB exactly. I'm very surprised to see them using less memory than both mine and FParsec's. I don't know why this is the case. I don't know why mine is using so much. It's, you know, still a fairly new product, so I still have quite a bit of performance optimization to do. I clearly need to do some diagnostics and figure out why that's the case. I did not expect it to use that much.

But still, it's not that bad. It's definitely beating out what is claimed to be just as efficient as a handwritten parser. That's FParsec; that's what they like to claim. I think they should do a little more testing, because that performance for FParsec is really, really bad. Do not use that on large files at all. So if you went and implemented a thing in FParsec for really massive bits of data, like a CSV from, I don't know, seismographs or whatever, I'm sorry. You've got to rewrite that whole thing. The performance is terrible.

So, ideas for future work for this, because there are a lot in this instance. There need to be more pattern types and parsers. Something that I have not implemented at all is look-ahead and look-behind. I'm looking into, no pun intended, a way to unify this, so that you can simply state that a specific part of a pattern should be looked for. Basically, what a look-ahead or look-behind is: you check to make sure that it's there, and it has to be there for a successful match, but it's not actually going to be returned as part of the results. I'd like to unify that to where you can just say "hey, look for this", and based on where it winds up in the constructed pattern, it becomes a look-behind, a look-ahead, or, I guess, a skip: something in the middle that must be there but that you don't want as part of the results.
I don't have a find parser yet either. That would work similarly to Match in regex, where, unless it's anchored to the beginning, it goes and finds the thing. And then there are others. There is something I'm working on that I will talk a bit about, I don't know where.

There's also some need for further optimizations, in that I need to identify more instances where patterns can be rewritten, and implement those rewrites. Luckily these rewrites are actually really easy to implement, because they work through a sort of synthesized multiple dispatch. It's not exactly multiple dispatch, but it's not function overloading either. There is some dispatching going on.

Another thing is to investigate whether or not to use unsafe code. I'm not sure about Microsoft's approach, especially since they have different implementations for the .NET Framework and .NET Core; I'm not sure what they're doing in either of those. But FParsec is using unsafe code, and that is part of why it's so fast. I'm not using that yet, and to be perfectly honest, I'm not sure if I want to.

Another thing that I need to wait for: both of these libraries are implemented on .NET Standard 2.0. I don't really have access to the stuff that was provided with .NET Core 2.1 or .NET Core 2.2 yet. The ability to make a span of char from a string was introduced with .NET Core 2.1. I will implement this as soon as I have the appropriate .NET Standard version, but it hasn't been officially released yet, so that's that. I'm sorry, but I've got to wait.

Another thing is more predefined patterns. I completely forgot to mention this, but inside of the patterns class there is a large collection of static predefined patterns.
This includes basic things like what you would know as a character group in regex, things like "match any letter", "match any number", "match any space", "match any control character". These are already defined. It would be good to have predefined patterns for bigger things, like an entire email address or an entire URL. This way, not only do you not have to worry about the complexity of some of these patterns, because some, like URLs, are really complex, but also they're implemented in the most efficient way known.

So I want to give some notes about FParsec. I suppose you could say I'm shitting on it a little bit here. First, some stuff I've got to say about language compatibility. FParsec was defined in F#. There's part of it that's defined in C#, so that they can have the unsafe code bits, but the assemblies are force-merged into one single NuGet package. There isn't really C# support, not officially anyway. It was defined overwhelmingly in F# and is meant to be used in F#. There is a third-party binding to it for C#; it's a little wonky. My approach, however, was defined in C#, and Visual Basic is very similar.
I did some basic tests just to make sure that the operators did not collide and that you could define the same things in Visual Basic, and it just works exactly as you would expect. You could do that same approach in F#, but to make it more functional, I do have a full binding library for F# where everything is exposed in the way you would expect in a functional environment. All of these are officially supported; the entire design took into consideration the needs of all these languages. There is also an official API for bindings that I created, which was used for the F# implementation but which you could use for bindings for other languages as well.

Updates for FParsec seem to have slowed down substantially. I don't know if the project has been abandoned or not, but not a lot has happened in, what, over six months now. There's a growing number of reported issues that aren't even getting comments, let alone fixes. I don't know what's going on there. It's definitely not finished, and it hasn't been declared complete. There are things they have declared to be in the works, and it just seems like nothing is happening anymore.

Another note is that because FParsec is a parser combinator, you're stringing together parsers. You're describing how to parse the text. So there are specialized parsers that you have to use, even though they parse what is basically the same thing. You have to know which of these you want to use in which situations. You have to go through and structure the thing optimally, because you're responsible for performance, whereas with my approach you're only responsible for describing it.

So there is what I'm calling the underscore nightmare with these specialized parsers. There are languages like F# that allow underscores inside of numeric literals. Ada is another one.
In fact, if you're watching this video, there's a good chance you're one of the people who regularly watches the stuff from my channel, and I primarily cover Ada. It's another one that allows underscores inside of numeric literals. They're used as digit separators, like you would use a comma. (Let's not advance yet.)

The issue is that the rules for where to put these underscores vary quite a bit between these languages. In some, you are allowed to have multiple underscores next to each other, while in others that would be a violation. So because of these specialized parsers, they have to make compromises that wind up either strictly supporting one approach and kind of saying "sorry, tough shit" for the other ones, or implementing all of the approaches, which means either you have a large set of these specialized parsers and need to pick the right one for your grammar, or you have one single one that unifies these but where you have to specify the configuration for your grammar. Or, there's one other one. What was it?
It winds up being a very complicated matter, and it's just sort of easier on everybody when you describe the grammar instead of how to parse it. The same exact thing happens with how to specify bases. While many languages use the syntax that originated with C, there are other syntaxes for specifying a numerical base, and it creates that same kind of problem.

So, finishing this up, some of the technical details about this approach: why it works, what works well. First, the error reporting, and why mine was so efficient. The errors are not exceptions, at least not until you need them. There is an actual error type, and the errors are mapped to exceptions; that is, there's a little tree of errors that is mapped to its own little, otherwise identical, tree of exceptions. Throughout the entire parser for the pattern, the errors are assigned. This winds up being considerably more lightweight than exceptions, clearly, because you're only assigning a structure with the relevant information inside of it. When you create an exception, you wind up with this whole stack information and other shit that gets carried along with it, which makes it very, very heavyweight.

Well, I'll say this first: these errors allow you to get fine-grained information about the failure. You try parsing something and it doesn't match, and it will tell you at what part of the pattern it failed, which is super helpful when debugging and is something that I really wish regex supported. But because of this mapping between the errors and exceptions, you can throw the corresponding exception: when you take the result type from the parse operation, you can call a throw-exception method, and it will automatically convert back to the appropriate exception and throw it. If there didn't happen to be an error, then that method call winds up doing nothing, so it is safe to use. You don't have to worry about null checks or anything like that.
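The error-to-exception mapping described above can be sketched in Python as follows. Every name here (`MatchError`, `Result`, `throw_exception`, `match_literal`, and the two specific error kinds) is hypothetical and chosen for illustration; only the overall shape comes from the talk: lightweight error values during matching, a parallel tree of exception types, and a throw method that does nothing when there was no error.

```python
from dataclasses import dataclass


# A small tree of exceptions...
class MatchException(Exception): pass
class ConsumeFailedException(MatchException): pass
class EndOfSourceException(MatchException): pass


# ...and an otherwise identical tree of plain error values that point at them.
@dataclass(frozen=True)
class MatchError:
    position: int                      # where in the source the failure happened
    expected: str                      # which part of the pattern failed
    exception_type = MatchException    # class attribute, not a dataclass field


@dataclass(frozen=True)
class ConsumeFailedError(MatchError):
    exception_type = ConsumeFailedException


@dataclass(frozen=True)
class EndOfSourceError(MatchError):
    exception_type = EndOfSourceException


class Result:
    def __init__(self, text="", error=None):
        self.text, self.error = text, error

    def throw_exception(self):
        """Raise the mapped exception if there was an error; otherwise do nothing."""
        if self.error is not None:
            e = self.error
            raise e.exception_type(f"expected {e.expected!r} at position {e.position}")


def match_literal(literal, source):
    """Match a literal at the start of source, recording an error value on failure."""
    if len(source) < len(literal):
        return Result(error=EndOfSourceError(len(source), literal))
    for i, ch in enumerate(literal):
        if source[i] != ch:
            return Result(error=ConsumeFailedError(i, literal))
    return Result(text=literal)


ok = match_literal("555", "555-0100")
ok.throw_exception()                   # no error recorded, so nothing happens
bad = match_literal("555", "55x-0100")
print(bad.error)                       # the failure is available without any raise
```

The expensive part, building an exception with its stack information, is deferred until the caller asks for it, and callers who only check `error` never pay for it.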
All of that is done for you, and clearly it significantly improved performance.

Another thing to note is that the entire thing makes extensive use of being mutable. The whole thing with functional design that caused it to not scale well at all, "don't ever mutate something, make everything immutable", has got problems. I don't wholeheartedly disagree with it, though, because there are plenty of situations in which you don't want things to be mutable externally. That is, you, as a consumer of this library, cannot mutate the source or the result objects. They are immutable as far as you can tell externally. Internally, I mutate the hell out of them. For example, when parsing, the result object is created at the very beginning, when you call the parser that you're interested in. The result object is created immediately and is passed by reference through the entire chain, mutated the entire way, but what you get back out you can't change at all.

This, I feel, is the appropriate compromise between the two. Mutations are definitely faster. They're definitely much easier on the garbage collector, or on the number of times you need to manually deallocate; either way, you wind up doing more work when you avoid mutation, which slows things down, but it prevents the errors that accidental mutation can cause, because you don't want to parse something, accidentally change the result, and then be wondering why you don't have what you expected to have.

And lastly, I want to finish this up by discussing how the self-optimization, or the pattern rewriting, works. So here we have a little pattern tree. Yeah, it would be more correct to view it as sort of like a branch list, but here it's a tree.
There are three literals. We've got "Hello" and a space; those two are joined by a concatenator, and then we've joined that by a concatenator with the literal "there". So it would parse "Hello there".

As it turns out, parsing a literal is faster than parsing a concatenator. So during construction you can replace concatenators of literals with another literal that internally has those concatenated. It's just slight tweaking. The result is exactly the same, but it winds up being faster. So if we concatenate the "Hello" and the space, we can replace them with just a literal "Hello " concatenated with the literal "there". Now here we also have two literals that are concatenated, so we can do this rewrite again, getting "Hello there".

And on that note, hopefully you found this interesting. Try it out. Again, documentation is available, and there's quite a bit of it. I have the link down in the video description. And until the next video, have a go
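The literal-merging rewrite walked through at the end of the talk can be sketched in Python. This is an illustration only: the talk says the real rewrites go through a kind of synthesized multiple dispatch, whereas this sketch uses a plain type check inside a hypothetical `concat` factory function.

```python
class Literal:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"Literal({self.value!r})"


class Concatenator:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def __repr__(self):
        return f"Concatenator({self.left!r}, {self.right!r})"


def concat(left, right):
    """Build a concatenation, rewriting it at construction time when possible."""
    # Two literals in a row match exactly the same text as one longer
    # literal, so the cheaper node is built instead.
    if isinstance(left, Literal) and isinstance(right, Literal):
        return Literal(left.value + right.value)
    return Concatenator(left, right)


# "Hello" then " " folds into "Hello ", which then folds with "there".
tree = concat(concat(Literal("Hello"), Literal(" ")), Literal("there"))
print(tree)  # Literal('Hello there')
```

Because the rewrite happens once, when the pattern is built, every later match runs against the single merged literal.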