All right. I hope you all had a nice lunch, and I hope the afternoon sessions are just as informative. The next talk is from Patrick. Patrick will be going over pattern combinators. It's all yours, Patrick.

All righty. It is good to be back. Thank you again, Buscar, for organizing this whole thing. Thank you. Last year, or I guess earlier this year really, I talked about text processing in general, and one of the things I mentioned was parsing frameworks. This time I'm going to be talking about a very particular style of parsing framework that I have been working on and developing. Some people like to have slides in front of them, so there's a link to the deck, and I've posted that on Twitch as well. We're not going to get into everything that's in there, because toward the end it's a deep dive into performance details that most people aren't going to care about. If you are interested in that, you can still get that information from the slides without bogging down the entire talk explaining charts. Let's get this moved over. Okay, can everyone see everything all right? Yes. All right. Great.

So this is on pattern combinators, not to be confused with parser combinators, and I will be talking about how those differ. The overall plan for today is to talk about the problem that I'm facing, because the various options that are out there, from regex to parser combinators to parser generators, haven't really been satisfying my use cases. Then I'll go into a general overview of pattern combinators. I will be referencing the specific one that I'm working on, but nothing about the general concept requires my specific implementation, so if you're interested in this but can't use mine for whatever reason, say you're in a different language, you would be able to recreate the idea elsewhere. Then we'll dive a little bit into the technical details that make things work, and work well, as well as future work that's on the backlog, either in progress or planned. Then we'll do the Q&A, and if there's time after the Q&A and people haven't had questions, we can delve into the performance numbers, but that's not as interesting to most people, so it goes at the back. There will be a little bit of performance material scattered throughout, but the details are toward the end.

So, the problem I've been facing. Most exchanged data is text. For anybody who's worked in business, this shouldn't be surprising at all; that's what the overwhelming majority of data is. It varies, obviously. If you're working with sensor networks, say at Boeing, that's probably not the case for you, but the overall majority of business is working with text. Text parsing is difficult, and there's a variety of reasons for that, from rather poorly designed libraries that are out there to the complexity of language itself. Text is widely believed to be unstructured. I'm going to be a bit pedantic here: all language is structured, but its grammar can be so remarkably complex that you might as well consider it unstructured anyway. Text parsing is also very computationally expensive. There are a number of reasons behind that, but one of the biggest barriers we've been facing is that it's very hard to parallelize text parsing algorithms.

So, as far as existing solutions go, obviously there's ad hoc. There are always ad hoc solutions.
I would generally say this is ill-advised. There are complexities in text parsing that you probably don't want to deal with in most cases. That being said, there are examples where this has been done amazingly well, like simdjson, where they were not able to parallelize the parsing but were able to use SIMD vector instructions to do it. It is fascinating. I'm very curious how they worked that out, but I haven't had the chance to dive into the source code yet. If you're working with JSON a lot, that is the library I would recommend more than anything else, even over my own framework. Use simdjson if you're working with JSON.

As far as regular expressions go, they are ubiquitous, which is fantastic. Pretty much any programming language and runtime is going to have regex support. Regex is declarative as well, which is also nice: you don't have to tell it how to parse or how to search for the thing that you're looking for, you just tell it what to look for. That is ideal. It is moderately well integrated. That varies, of course; certain languages offer better support than others. Perl is notoriously fantastic. JavaScript also has dedicated regex support. In other cases it's often just something that lives inside of a string, which is a bit obnoxious. With the .NET languages it's sort of a middle ground: not too bad, but it could be improved. Regex is also not composable, and this is the big issue most people have with it. When you're doing more complicated work, you'd like to be able to take discrete units and use them in other places by combining them together, much like you do with functions and objects and everything else. You can't really do this with regex. You could try string concatenation; people have tried, and it fails in a lot of cases. I would not advise doing it at all. Regexes can also be a bit hard to read. This one is a double-edged sword, because what regular expressions are ideal for is search and replace. You don't want to type out some big long expression in a search box, and for that kind of thing regex is actually fantastic, but for more complicated parsing situations it winds up being way too hard to read and maintain.

Parser combinators, on the other hand, are often, but not always, language specific. You have cases like FParsec or Haskell's Parsec, which are very tightly tied to a language because they rely on language-specific features. Other options, like Pidgin or Sprache for .NET, are much more general; you can use them from any of the .NET languages. So it varies, but more often than not they are language specific. They're also not really declarative. There are some claims that they are, but I have here two examples from FParsec that parse the exact same thing in different ways, and you always want to use the bottom one. If this were a truly declarative system, you could use expression rewriting to always get the second one even if the person wrote the first one, but that's not what's being done, so I'm hesitant to call it truly declarative. What parser combinators do is combine specific parsers, hence the name, rather than combining descriptions of what to parse and letting the system figure out how to parse it.
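To make the regex composability point above concrete, here is a minimal C# sketch that is not from the talk; the pattern strings are purely illustrative. The pitfall itself, alternation binding looser than concatenation, is standard regex behavior, and it is why naive string concatenation of regexes breaks down.

```csharp
using System;
using System.Text.RegularExpressions;

// Two patterns that each work fine on their own.
const string greeting = "hi|hello";    // matches "hi" or "hello"
const string target   = "world|there"; // matches "world" or "there"

// Naive composition by concatenation yields "hi|hello world|there".
// Alternation (|) binds looser than concatenation, so this actually means
// "hi" OR "hello world" OR "there" -- not what was intended.
var broken = new Regex(greeting + " " + target);
Console.WriteLine(broken.IsMatch("hi"));           // True (!) even with no target word

// To compose safely you must know to re-wrap each piece in a group,
// which is exactly the kind of bookkeeping that does not scale.
var grouped = new Regex($"(?:{greeting}) (?:{target})");
Console.WriteLine(grouped.IsMatch("hi"));          // False
Console.WriteLine(grouped.IsMatch("hello there")); // True
```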
Parser combinators are very well integrated, though, which is nice. The only barrier to entry is referencing a NuGet package in the .NET world, or some other library with whatever your ecosystem is. This is ideal, because the more you increase the barrier to entry, the harder it is for your product to be used, and as a user, do you really want to be fumbling around with huge amounts of setup just to be able to use something? Usually not. They are very composable due to their basis in combinator theory. That is a huge, huge benefit and a big reason why they have been taking off recently in comparison to regex; the composability is extremely important for managing more complex systems. Readability varies by implementation, and I find readability for parser combinators tends to be a very subjective thing, because so many of them rely on unique operators for their operations. Some people like that, some people don't.

Parser generators, on the other hand, are a class of frameworks like ANTLR or Yacc. They are extremely powerful. If you've worked in compiler development at all, you have almost certainly used one of these. Integration for them, however, is very complex. It's really an entirely separate compiler that you have to integrate into your toolchain; you then have to take the output it generates and integrate that into the actual thing that you're building, and it's substantially more involved. This barrier to entry discourages a lot of people from using them. They are, however, very composable. It's not strictly based on combinator theory, but it is still a very well designed system for composing patterns together. And because there's a dedicated language for this, they are extremely easy to read.

All of this ties back to an old language called SNOBOL. SNOBOL, for those who don't know, is a language that was designed specifically around text parsing. It wound up becoming largely obsolete because, and I don't want to put this too harshly, at the time it was created regex was just far more preferable. When you're dealing with a mainframe-terminal style of situation, typing out a long run of characters and having to send all of that over an RS-232 line is cumbersome. We don't really have that problem now, though. One other major issue with SNOBOL, however, is that it's pre-Unicode, and that's actually a really big problem, because it's not just the United States or England anymore; there's a whole planet of people who speak all sorts of different languages, and if a technology only works for a very small subset of those people, that's not acceptable anymore.

So, the overview: what are pattern combinators, and what was focused on in the design? The key things. They should be tightly integrated. We'll get into how this is sort of a multi-stage thing, but at its core, how I suspect most people will be using this is just a simple reference to a NuGet package, just as easy as parser combinators. Another key is that they should be highly composable. I don't want to repeat the problems you see when trying to use regex in more complicated situations. I had developed a rather sophisticated syntax highlighting extension for Visual Studio Code, and because I couldn't reuse anything, it was a nightmare; there was an insane amount of duplication. I felt that it should be fully declarative: there should be a strict separation between what is written and what is actually done. And I don't want to recreate everything under the sun, so it should be based on well-established theoretical models like combinator theory, graph theory, and algebraic identities. If you're not familiar with those, that's fine.
I'm going to have plenty of examples, so you'll still get it regardless; but if you know what those are, then you know what I'm talking about up front. Algebraic identities might seem a little weird, a little out of place, but they turned out to be critical for the way these patterns get optimized. Another goal that I don't see much focus on elsewhere is that it should be debuggable. This is actually an advantage regex has over some other options, in that sites like regex101 exist which let you actually see what's happening as the expression is evaluated and the match occurs. That kind of facility needed to exist from the get-go, and the design I came up with takes it into consideration: there is a trace mechanism where you can go through and see, at each stage, what was consumed by the parser. And while not as important as usability, performance is incredibly important. Say your business is evaluating a thousand user documents a day, and those documents matter for orders or something like that. If things perform well enough that you're able to go up to 1,380 documents a day, you just increased your business throughput by 38%. That's more profit. That matters. Performance is not king, but it definitely matters. And then there's single responsibility. Something I said about parser combinators that I disagree with a bit is that what you're composing is the individual parsers, even though it's often described as if it's the patterns. By strictly separating these, you enable some very important things that we'll talk about later.

So what is a pattern, and why do I keep saying this is different from just stringing together a series of parsers? It's an object expressing what a parser should attempt to parse. It's not how to parse it; it's what to parse. It is purely declarative. It is a first-class type, so it's not an expression that lives inside a string and needs to be analyzed every single time; it's a type you work with just like any other type in your language of choice. Internally, it is an expression graph, not an expression tree, and that was done deliberately so that recursive grammars can be supported, which is important for certain types of languages. By being a graph, cycles can be created. I know that's kind of a scary word, because it is kind of a scary concept, but you're not going to get cycles, and therefore not recursive grammars, with trees. If you're using this as a library from, say, C# or F#, there are some hoops you have to go through to enable recursive grammars, so it's not something you can accidentally create. You won't just go, oops, I have an infinite loop now; you have to do some specific things so that you're aware you're now in a situation where cycles can exist. Something we'll talk about toward the end helps address this without you having to jump through those hoops. And the pattern is not, at any point, a parser. Rather, it feeds information into the parser so that the parser knows what to do. It's just what to parse.

As far as pattern types go, this chart is everything I had implemented as of a few months ago. I've been going through some major changes that aren't done yet, so there is new stuff being added. Despite how complex this looks, it is overwhelmingly stuff that exists under the hood; you only work up front with a very small subset of these.
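To illustrate the "pattern as a first-class, declarative type" idea, here is a minimal C# sketch. These types are hypothetical stand-ins, not the actual API of the library discussed in the talk; the point is only that a pattern is an ordinary object you can name, reuse, and compose, and that recursion has to be set up explicitly.

```csharp
// Hypothetical pattern types, for illustration only. Nothing here parses
// anything; a pattern only describes WHAT to match.
public abstract record Pattern
{
    public static Pattern Text(string literal) => new Literal(literal);
    public static Pattern operator &(Pattern left, Pattern right) => new Concat(left, right);
    public static Pattern operator |(Pattern left, Pattern right) => new Alternate(left, right);
}

public sealed record Literal(string Value) : Pattern;
public sealed record Concat(Pattern Left, Pattern Right) : Pattern;
public sealed record Alternate(Pattern Left, Pattern Right) : Pattern;

// Recursion is opt-in: you create a forward reference and tie the knot
// yourself, so a cycle in the pattern graph never appears by accident.
public sealed record Forward : Pattern
{
    public Pattern? Target { get; set; }
}

public static class Demo
{
    public static void Main()
    {
        // Patterns are ordinary values: define once, reuse and compose anywhere.
        Pattern greeting = Pattern.Text("hi") | Pattern.Text("hello");
        Pattern phrase   = greeting & Pattern.Text(" ") & Pattern.Text("there");

        // A recursive grammar (e.g. nested parentheses) needs the explicit
        // forward reference -- the "hoop" that makes the cycle deliberate.
        var parens = new Forward();
        parens.Target = Pattern.Text("()") | (Pattern.Text("(") & parens & Pattern.Text(")"));

        System.Console.WriteLine(phrase); // prints the nested Concat/Alternate structure
    }
}
```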
The whole point of it being based on graph theory is that graph rewriting can be used, where you take certain parts of the graph and replace them with something identical, those algebraic identities. I'll have a very simple example of that later. Most of these node types exist for that purpose, essentially providing optimizations that you don't have to go out and deliberately use. They just exist, and the system will do the right thing for you.

So, a quick dive into performance with some real-world examples. First, I have to explain the keys. MSRE is the Microsoft regular expression library; if you're using .NET, that's the standard regex you're using. PCRE is Perl-compatible regular expressions. I wanted to compare them to see how MSRE stacks up against PCRE; MSRE is generally preferable, so don't go out of your way to use PCRE unless there's a very particular need. The two entries with a C after them are the compiled versions. Because regexes are expressions, there is an expression compiler for them; if you've dealt with the regex flags, you've probably seen that and maybe even wondered about it at times. You usually want to enable it, since you get some pretty big improvements. FParsec is, I believe, the very first parser combinator library implemented on .NET. It's inspired by the Parsec library from Haskell, but it's implemented for F#. Consuming it from C# is a little bit weird at times, but there are adapter libraries to help with that. Pidgin is written by Benjamin Hodgson of Stack Overflow; overall a pretty good parser combinator library, one of the better ones I've seen. Stringier is my implementation of pattern combinators.

First up among the more real-world situations: a source code identifier. This would be something like the name of a function. You can see that compiling your regex definitely makes an improvement, but I still beat them out by a little bit. An IPv4 address: everybody here is a bit more competitive with each other. Again, compile your regular expressions. A phone number: it shouldn't matter all that much, but this is the US phone number format. There may be very minor differences for other countries' formats, but I would not expect much deviation from this. Line comments: this would be like C or C#, where you have the double slashes and it reads everything until a newline. And a string literal, just like an ordinary string literal in any programming language, which may or may not contain an escaped double quote.

But it's not like raw performance is the only thing that matters; memory allocations matter too. You'll start to notice something kind of interesting here: mine does not do any heap allocations at all. That winds up being very important at scale, because you can have a large number of things going on and not be filling up your memory with anything other than the text that you are parsing. There is no extra junk.

Now, a stress test, because we're always concerned about how these things perform at scale. I have two examples here, and they're on different slides because you'll see with the next one why that's the case; everything else would be impossible to read. Overall, things perform really well. When Microsoft said they had improved their regex engine with the .NET 5.0 release, they definitely did; some rather substantial improvements in the stress test. Good job on that one, but there's a caveat. I got ahead of myself. I don't know why this happens, but there is exponential growth in how slow MSRE gets when a very large source text does not contain the expected pattern. Not with mine; you get very predictable behavior.
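As a concrete illustration of the "compile your regular expressions" advice above, here is a minimal sketch, not from the talk: in .NET the compiled variant is opted into with RegexOptions.Compiled, and the phone-number pattern here is just an example.

```csharp
using System;
using System.Text.RegularExpressions;

// Interpreted: the pattern is walked by the engine's interpreter each time.
var interpreted = new Regex(@"\d{3}-\d{3}-\d{4}");

// Compiled: .NET emits specialized IL for this pattern up front, trading
// slower construction for faster matching on hot paths.
var compiled = new Regex(@"\d{3}-\d{3}-\d{4}", RegexOptions.Compiled);

Console.WriteLine(interpreted.IsMatch("555-867-5309")); // True
Console.WriteLine(compiled.IsMatch("555-867-5309"));    // True
```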
As far as technical details go, what is making this work? Why am I getting performance benefits, especially over the parser combinators, considering things sound rather similar? One major factor is reference slicing. This was introduced to the .NET ecosystem with Memory, ReadOnlyMemory, Span, and ReadOnlySpan. These work by not allocating whole new chunks of memory; rather, the existing string or source text that's already there just gets two references to part of it, and those two references get passed around instead of a whole copy of the relevant part. That winds up mattering a lot.

Another is implementing stream parsing. I haven't fully implemented this yet, but it winds up being very significant compared to what I've seen parser combinators do. Stream parsing is different from buffer parsing. With buffer parsing, which is what regex always does, the whole source exists in memory. With stream parsing, like how streams work in general, you seek through the stream finding what you need; the whole thing never has to exist in memory. You copy only the part that's relevant to you, and the rest of the stream just gets discarded.

Another one: I know this is kind of taboo in certain programming circles, but a lot of the internals get mutated, a lot. This is important for performance, but I'll quell some people's rage on this matter by saying that once the relevant objects, the things generated during parsing, leave the parser, they are no longer mutable. When you as the user get, say, the captured result, it's not mutable. You cannot change it unless you resort to funky things like reflection. Why would you do that? Don't do that.

Graph rewriting by dispatch. I mentioned this one, and again I'll have an example, but this wound up being one of the most critical performance boosts I found. The dispatch part matters too, because normally, in a more academic setting, when people talk about graph rewriting it's usually a phase that runs after you've built the entire graph. Instead, because of the large number of node types you saw in that chart earlier, everything is implemented by dispatch, so virtual method calls and overloads. When certain parts of an expression are combined, because dispatch is taking place, there's less information to work out at runtime; the compiler has in many cases already figured it out before the program even runs, and the replacements can happen based on algebraic identities.

Another is using error codes instead of exceptions. Exceptions, while useful, are very heavyweight, so instead of throwing an exception when a match doesn't occur, the match simply doesn't take place and the parser backtracks, and backtracks efficiently. I worked out a way in which backtracking only writes two Int64 values, so it's very fast compared to some other methods of backtracking. Errors are reported through an enum value, similar to C-style int returns, but as an actual enum with meaning.
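To illustrate the reference slicing and error-code points above, here is a minimal C# sketch. It is not the library's actual internals and the names are made up; it only shows the two techniques: the source text is never copied, only sliced, and failure is signaled with an enum rather than an exception.

```csharp
using System;

// Hypothetical error enum -- illustrative, not the library's real type.
enum ParseError { None, UnexpectedEndOfSource, NoMatch }

static class LiteralMatcher
{
    // Tries to consume `expected` at `position`. On success, `capture` is a
    // slice of the original source -- no new string is allocated.
    public static ParseError TryMatch(
        ReadOnlySpan<char> source, ref int position, ReadOnlySpan<char> expected,
        out ReadOnlySpan<char> capture)
    {
        capture = default;
        if (position + expected.Length > source.Length)
            return ParseError.UnexpectedEndOfSource;
        if (!source.Slice(position, expected.Length).SequenceEqual(expected))
            return ParseError.NoMatch;            // report failure, don't throw
        capture = source.Slice(position, expected.Length);
        position += expected.Length;              // advance; caller can rewind to backtrack
        return ParseError.None;
    }
}

class Demo
{
    static void Main()
    {
        int pos = 0;
        var error = LiteralMatcher.TryMatch("hello there".AsSpan(), ref pos, "hello", out var capture);
        Console.WriteLine($"{error}: '{capture.ToString()}' (position now {pos})");
        // Output: None: 'hello' (position now 5)
    }
}
```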
Graph building and rewriting: how is this actually working? Let's visualize it. Say we have an expression like this: a string literal of "hello", a concatenator, a single character of a space, a concatenator, and then another string literal of "there". The full thing is just "hello there", but broken up. Just like you would when evaluating an arithmetic expression, we start with the first operator and its two operands, and then we add the last part of the expression on top of that. Now, the whole point about algebraic identities: one of the identities I found is that instead of concatenating two different strings inside of the graph, it's actually more efficient to do normal string concatenation and replace them, so we can take the "hello" and the space and replace them with a single string containing both. This leaves us again with a situation in which we have two literals, two non-variables, and an operator. We have the identity again and can replace that with a single node of "hello there". In this instance the dereferences went from seven down to one, which is a performance boost, a rather substantial one. Actually, from fourteen to two, because the method is virtual, so each of those winds up being a dispatch as well. That's a pretty considerable decrease in the amount of dereferencing you do, and it's responsible for the performance improvements. Now, sometimes the result of applying an identity would be parsed differently, and in those instances it will call a different parser, but you don't need to know any of that. The system figures it out for you by rewriting on these identities.

As far as future work goes: in general, I have implemented a system for effectively working with Unicode grapheme clusters. Think letters with different accents, but it includes other things that more complicated writing scripts tend to have, where you're composing different symbols together into a single character. I've implemented that, but I haven't tied it into the parsing framework yet, so that work still needs to be done. Stream parsing: I've had to implement an entirely new streams API that bypasses the .NET streams, because of a design decision that makes them unable to work with parsing; once you've created a text reader or writer and have used it, you cannot seek the stream anymore, which winds up being quite an issue. I have implemented those streams for all the different types you would expect, so file streams, memory streams, network streams, and so on; I just have not tied that into the parsing framework yet either. Categories, like from category theory: think Unicode character categorization, just done as a full expressive system that you can compose as well. That I have implemented, and again have not tied into the parsing framework yet. Pattern searches: I have a few search algorithms implemented, but have not implemented an expression-based search yet. Full genomic parsing, so parsing genetic sequences, is largely supported. Two of the pattern nodes in that chart were the negator, for matching something that does not match an expression at all, and the fuzzer, for fuzzy string equivalence, so if one of the nucleotides had changed for another, but only one, it would still match, and the tolerance of that is adjustable. I am not sure everything necessary for genomic parsing is present yet, but there is at least some support for it, and I just want to make sure that it is well supported.
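Going back to the rewriting walkthrough above, here is a minimal C# sketch of that literal-folding identity, using made-up node types rather than the real ones. Two adjacent literals under a concatenation node collapse into one literal, which is why the "hello", space, "there" graph ends up as a single "hello there" node. In the design the speaker describes, this happens through virtual dispatch as the graph is built; the sketch does it as a separate pass only for clarity.

```csharp
// Hypothetical node types, for illustration only.
abstract record Node;
sealed record Literal(string Text) : Node;
sealed record Concat(Node Left, Node Right) : Node;

static class Rewriter
{
    // Apply the identity Concat(Literal(a), Literal(b)) => Literal(a + b),
    // bottom-up, so nested concatenations of literals fold all the way down.
    public static Node Fold(Node node) => node switch
    {
        Concat(var left, var right) => (Fold(left), Fold(right)) switch
        {
            (Literal a, Literal b) => new Literal(a.Text + b.Text),
            (var l, var r)         => new Concat(l, r),
        },
        _ => node,
    };
}

class Demo
{
    static void Main()
    {
        // ("hello" . " ") . "there"
        Node graph = new Concat(new Concat(new Literal("hello"), new Literal(" ")), new Literal("there"));
        System.Console.WriteLine(Rewriter.Fold(graph)); // Literal { Text = hello there }
    }
}
```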
Stringier, together with some other libraries, forms the runtime library of Langly, which is meant as a sort of SNOBOL replacement. You can use all of this from any .NET language on its own, but the idea behind Langly is to provide a very SNOBOL-like experience, where you have a dedicated language to get this kind of work done. Just a simple DSL. It will not be a full programming language, because C# and F# already are; we don't need to compete with that. It would just be there to assist with these kinds of text processing tasks. And because the .NET runtime treats everything as the same bytecode, and because CLS compliance exists, you'd be able to use them together very seamlessly: just open up Visual Studio, create two projects in the different languages, and reference them from each other. Nice and simple.

We got through that quickly because I already knew the material. So, any questions? I don't see anything on Twitch, and how do I see the chat on Zoom now? It should be at the top; there will be three dots. Oh, over to More and then Chat. Perfect. Okay, any questions now that I can see them?

So overall, you have been writing about Stringier and you are doing text parsing. I think we discussed it the other day, where you were mentioning ropes. Is that something you will be using going forward?

Will I be integrating ropes with this? Yeah. They are not as necessary, although they may be useful. One thing I was thinking: when you're parsing a buffer, you can obviously use reference slicing. The thing is already in memory, and so you don't want to copy it. But with streams, it's not guaranteed to be in memory, because you could have something like a network stream, where one computer is sending text to another and you're trying to parse it without saving the entire thing, or you could be dealing with a multi-gigabyte file. That happens. So what I was thinking is that for the stream parser, the result type obviously has to be different, and it may be useful in the context of stream parsing to return a rope. I'm not sure if it would make more sense to just return an IEnumerable of string in that instance, though, so that's something I have to play around with. But I make my decisions based on benchmarks and usability studies, so we'll see. It's something I'm considering. Any other questions?

I think we don't have any further questions, Patrick.

Okay. Let's see, I started at 1:45, so I'm done at 1:55; that leaves us with 10 more minutes. Yeah, we do have time. Okay, so that means we can dive a bit into specifics about performance, which should give you some idea of, one, how the syntax varies between these options, and two, how things stack up at a very nuanced level. One thing you'll notice across these examples is that Stringier's and regex's syntax wind up being actually pretty similar. Just, in the case of things like escape sequences, I don't use those; I'd rather have a full variable name. But things are much more implicit than with Pidgin or Sprache, which require you to specify that a string is a string, which I've always found weird. So, the performance of even the most primitive aspect of any parser. For this specific example I'm using the same pattern I showed you earlier. So for a source text of "hello", obviously, that would be a successful match. Everything except Sprache performs really well here.
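For readers who want to reproduce this kind of comparison, here is a minimal sketch, not the speaker's actual harness, assuming the BenchmarkDotNet NuGet package; the pattern and inputs are made up. It measures a literal regex match on both a matching and a non-matching input, since both cases matter.

```csharp
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]   // report allocations as well, not just time
public class LiteralMatchBenchmarks
{
    private static readonly Regex Pattern = new("hello there", RegexOptions.Compiled);

    [Benchmark]
    public bool Success() => Pattern.IsMatch("well, hello there, friend");

    [Benchmark]
    public bool Failure() => Pattern.IsMatch("completely unrelated text");
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LiteralMatchBenchmarks>();
}
```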
One thing I have noticed, and even if you're not working with text processing, I highly recommend this: benchmark not just your successful cases, but also cases where you know your thing will fail, because the drastically different performance FParsec gets in cases where it does not match is really something. One thing I am happy to say is that my performance across both successes and failures is consistent, and consistent is important. You really want to be able to predict how your system will perform, and curveballs make that a lot harder.

Some memory allocations as well. You'll notice that MSRE and PCRE wind up being highly consistent. One thing that makes measuring PCRE allocations a little difficult is that because it's a native library, BenchmarkDotNet can only measure the allocations made on the managed side. So I really have no idea what it's doing allocation-wise in the native code; there may be heap allocations going on there, and I have no way of finding out, at least using BenchmarkDotNet. I could get into native analyzers as well and combine the two sets of results, but I have more important things to do than just that.

So, the alternator. The concept is that you match one pattern or the other. You've written this numerous times with regex, like this, and then the others have their own syntax associated with it. Again, I can emphasize how my approach actually requires much less typing, because things are much more implicit. Stringier specifically happens to implement both expression and fluent syntax, so you'd also be able to use a similar Or method if you find that more readable. And performance-wise, similarly, I'm doing quite well. Allocations, same kind of thing.

Concatenators: one pattern followed by the other. This was something I had to synthesize with regex, because regex doesn't have any concept of combinators at all, so I synthesized it by just writing out the full expression that I'd expect to match. Here is where Pidgin and Sprache become incredibly verbose; I am surprised this wasn't something they had addressed. One of these is a clear winner in the syntax department, right? And in the performance department.

The Kleene closure. This one doesn't have the same kind of name as the others because it's a formally described concept: zero or more of something. You may also have heard it called the Kleene star, which comes from the specific regex usage. There are different ways of writing it in FParsec, Pidgin, and Sprache, and then there's how it looks in Stringier. This is also another fantastic example of graph rewriting by algebraic identities. There's no upfront method or operator for specifying the Kleene closure in my implementation; instead, you call Many and then Maybe, since it's sort of both of those behaviors. What's done internally, however, is that the Many and Maybe parts are rewritten into just the Kleene closure, so you wind up with much less junk, and the parser knows specifically what it's looking for. It winds up being a bit more optimal; I think for this one I got a 9% performance improvement just from a simple rewrite like that. So it adds up quite a lot.
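To make that last identity concrete: "maybe (zero or one) of many (one or more)" is exactly "zero or more", i.e. (x+)? is equivalent to x*. Here is a minimal sketch of that rewrite in C#, again with made-up node types rather than the library's real ones, and with Many assumed to mean "one or more".

```csharp
// Hypothetical node types, for illustration only.
abstract record Node;
sealed record Literal(string Text) : Node;
sealed record Many(Node Inner) : Node;          // one or more, like regex +
sealed record Maybe(Node Inner) : Node;         // zero or one,  like regex ?
sealed record KleeneClosure(Node Inner) : Node; // zero or more, like regex *

static class Rewriter
{
    // Identity: Maybe(Many(x)) == KleeneClosure(x), i.e. (x+)? == x*.
    public static Node Simplify(Node node) => node switch
    {
        Maybe(Many(var inner)) => new KleeneClosure(inner),
        _ => node,
    };
}

class Demo
{
    static void Main()
    {
        Node spelledOut = new Maybe(new Many(new Literal("ab")));
        System.Console.WriteLine(Rewriter.Simplify(spelledOut));
        // Prints: KleeneClosure { Inner = Literal { Text = ab } }
    }
}
```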
The negator is something that does not exist at all in regex, so this is something where I really had to work out how to synthesize the behavior; don't take these results too seriously, since it's something regex cannot support at all. The idea is that it should not match the source text: it will match anything that is not the expression, but if it is the expression, it's considered a failure instead of a success. And you can see, again, compared to both Pidgin and Sprache, you don't have to deal with explicitly specifying that a string is a string. The compiler already has that information; let's just use it.

The optor, for when something is optional. Oh, I had a copy error there: that should not say "not the pattern", it should say that the thing is successful regardless. If the expression is there, then it is captured and reported as a success; if it is not there, nothing is captured, but it is still reported as a success. Essentially, it's optional. You're definitely familiar with the regex syntax for that. Again, you don't have to specify the type with my implementation, and pretty consistently I'm higher performing, which is always nice to see.

Ranges are not a concept that regex really has, so don't take this one too seriously either. Within a single line, this is something that can kind of be supported by using a combination of the any dot, the Kleene star, and then the lazy modifier, so dot, asterisk, and then the question mark. That only works within single lines, so if it's a range like a beginning bracket and an ending bracket for a function, regex is going to have a substantially hard time with it. I greatly limited the example just to make sure it worked. That Pidgin example should not be there, so that's my bad. But my approach has direct support for this kind of thing, because I designed the system largely around parsing programming languages, and this kind of thing is hugely common there. In fact, it winds up being very useful in that if you start your compiler with two tasks doing the interpretation of the text, you can take that range, pass it to the secondary task, and have it continue parsing while your main task continues with the rest of the source code, and you effectively get a limited form of parallelization, which is very difficult to do with text. So that's nice; it speeds up the whole thing.

And that is time. Thanks, Patrick. Thanks for going over pattern combinators and the work you're doing on text parsing frameworks. Absolutely. And if anybody is interested in trying this out or seeing how it works or anything like that, there is a link to the GitHub repository where this lives. Yep. Thank you. You're always welcome to reach out to Patrick if you have any further questions about string processing or text parsing in general; I think Patrick is very active on social channels as well, so you can always reach out to him. Yeah, granted, I'm not always talking about text stuff, so it's a pretty grab bag, but you can reach out if you have questions on anything from literary analysis to parsing frameworks or any of that stuff. Okay. Next we have Rob with us. Rob will be going over GitHub Actions and everything about GitHub Actions. So let me make