Okay, thank you. Does this work? Yeah. Okay, so my name is Danny Engelbart. I've been working in Unix since 1991, so quite a while, and I wanted to talk to you about something completely different. You've been here for a few days, right? Some of you even a few days longer, you had some trainings. You've seen all this nice new stuff, and I'm going to talk about very old stuff. Extremely old stuff. So I thought, that's something completely different, and, well, if I'm going to do a Monty Python quote, then I need to add Monty Python. Sorry for that, but I had to do that.

Another thing I have to do is warn you, because this talk may contain traces of Perl. Yeah, I'm sorry. Actually, who am I kidding? This talk contains traces of Perl. There's no way around it. Sorry for that. Just so you know.

Okay. So, regular expressions, or regexes, in Python. Why would I want to talk about that? I mean, it's all really old, right? Well, I've seen a talk, and I'm going to get back to that later, which says everything you know about regexes is wrong. And I think that applies to a lot of people here as well, and I want to talk to you about that.

So first, a little history. Regexes, like I said, are old. We started with regexes, actually, when Professor Stephen Kleene was looking into something called regular events back in the 50s, and that was a way to program nondeterministic finite automata. That's all theoretical computer mathematician stuff, right? So in the 60s Ken Thompson came along, and he was working on an IBM 7094 computer, and if you don't know what that looks like, it was something like that. Well, actually, that is only the CPU module. There's a bit more to it. But he was working on that thing, and they had the QED text editor, and he added regular expressions to that. So back in the 60s we already had regexes on computers.

Now this Ken Thompson fellow, he moves on, and in the 70s he writes an operating system called UNIX.
You might have heard of it, right? On this operating system you got the ed editor, which he wrote as well, and so it includes regexes, and others picked up on this. First we got grep, which is actually just an ed function. It's actually "global regular expression print", and that works on a directory full of files. Other things: awk includes regexes, sed includes regexes, and all of these have their own little syntax, because they have to fit with the original program, right? So they all have a slightly different dialect. Emacs has regexes as well, in a dialect which contains a whole shitload of backslashes. I've been told, if it doesn't work in Emacs, just sprinkle on a few more backslashes and there you go. The ex editor also includes regexes, and you're thinking, ex? What's ex? Ex is vi before it became visual.

So we move on to the 80s, and that's when this guy Larry Wall came along. You might know him. He's that evil person that created Perl. Right, and Perl does some funky stuff with regexes. Really, back in that time? Oh God. So, Perl. But well, let's just forget about Perl. We'll just use PCRE, the Perl-compatible regex engine. This caught on later, in more recent times, but we move on first to the 90s, and of course in the 90s we get Python. Yay.

So Python has its own little history with regexes. First we had the regexp module, which had API version 1, and which was replaced by the regex module, not to be confused with the other regex module that came a lot later. This was the first regex, right? This had API version 2, and it had its regex_syntax submodule, which actually allowed you to choose
which dialect you wanted to use in your regex, and for some reason Emacs was the default. Go figure.

So then we get to Python 1.5 alpha and we get an re module, re with the version 3 API, and you're thinking, hey, that's the one I know. No, you don't. Because that module was never... well, it was released in the final release, but as re1, and re itself was replaced with another module called re, re version 2, which used the pre submodule. And pre is actually based on PCRE, and that's where the Perl comes in.

So, okay, we've got pre, we've been working with that for a while, and then we get Unicode and we need something that can work with Unicode. So we get the sre module, and the pre module is deprecated and a little later removed, and the re module actually only contains the sre module. So we moved the sre module into the re module, and that is what you're working with today. Because of all this, I think it should look like this, really.

So, back to that statement: everything you know about regexes is wrong. Like I said, this was the title of a talk by Damian Conway, and Damian Conway is also a Perl guru. He does a lot of things with Perl 6, and so we'll just call him that other evil person. But he does in fact give very, very good talks. So if you have a chance, and you dare to go into the camel camp, if you will, then go and see his talks. He's really good.

But the thing I learned from his talk, first of all, is that you might think you know regular expressions, but they might work a little differently from what you think they do. And it's not your fault. You see, you've seen the history, right? We are heavily influenced by Perl, and, well, Perl is a bit fuzzy. So regexes probably are as well.

So if you write up a definition of a regular expression, it might look something like this: a regular expression is a declarative specification describing the textual structure to which a matching string must conform. That sounds about right, right?
Yeah, just a few things wrong with that. First of all, a regular expression is not declarative. It's not descriptive. It does not specify structure, and it does not conform. So what do we get? We get something like this. And well, yeah, maybe this would be better. It's just not that.

Right. So what is it then? What is a regular expression? A regular expression is code. And I might give a fancy definition for that too. Yeah: a specification of a block-structured instruction sequence, which is designed to execute some task on a highly specialized virtual machine. But it's just code. It's commands, loops, assertions, exception handling. Exception handling? Yes, exception handling. It's code. Rotten code, I'll give you that, it really is, but it's code.

So that regex string thingy, that's actually a function you're calling. It's a function by itself, and it's executed by the regex engine. So this function returns the result. It returns true, false, or maybe some extra stuff. We'll get back to that later. Right, but they're code. A regex is code, and to understand how to create a regex, you just need to learn the language. That's all there is to it. Now, you've all done that. So how hard could it be? Just learn another one, right?

Okay, so you need to understand the execution model as well, because that might be a bit weird. I mean, regexes just run in their own little world, or VM, as I said in the definition, right? Theoretically, regexes are based on a finite state machine. But they're not really; in practice they're just a stack-based machine like everything else. But theoretically it's a finite state machine, and that's handy to use, because if I use this theory I can explain regexes using graphs, state graphs, right?
So if we look at a very simple regex that just says: match an a, then match a b, then match a c. And that's how you need to think about regexes: match an a, then match a b, then match a c. If I put that into a state diagram, it would look something like this, right? We start, we first match an a, then we go to the next state. We then match a b, we go to the next state, and then match a c, and we go to the match.

If I were to code that in something that looks a bit like Python, it would look something like this. Yeah, we've got the for loop there. We've got an index that walks over the whole string, and then we try whether the position at index matches an a. If it does not, we raise an error. Index plus one: match a b, if not, raise. Index plus two: match a c, if not, raise. And if we did not raise, then we return True. If we get an exception, either because of one of the raises here or because we are going beyond the end of the string, we continue. We're not going to stop short because the regex would never fit. No, we're going to try up until the last character. And only if we tried all that and did not return True anywhere, then we did not match, so we return False.

Right, in the diagram it looks a little like this. Okay, we've got the regex: match an a, match a b, match a c. And we've got a string to search, that's 12ababc. So we start with the first character in the string and we match that one to the a, right, the first step in our diagram. It does not match, so we try the second character. Does the 2 match the a? No, it doesn't. Okay, does the a match an a? Yes, it does. Yay. Move on. Does the b match a b? Yes, it does. Does the a match a c? No, it doesn't. Okay. Now what? We move back, and we move one character forward in the string. Remember, the last one was the a. We move one forward and we try again. Now this might take a while, right?
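The pseudo-Python described above can be written out as a minimal, runnable sketch. The function name `search_abc` and the use of `ValueError` are assumptions made for illustration; the slide itself is not reproduced in the transcript, and a real regex engine does not work literally like this.

```python
def search_abc(text: str) -> bool:
    """Naive sketch of running the regex 'abc' against text."""
    for index in range(len(text)):           # try every start position
        try:
            if text[index] != "a":
                raise ValueError("no 'a' here")
            if text[index + 1] != "b":        # may raise IndexError at the end
                raise ValueError("no 'b' here")
            if text[index + 2] != "c":
                raise ValueError("no 'c' here")
            return True                       # all three commands succeeded
        except (ValueError, IndexError):
            continue                          # backtrack: next start position
    return False                              # every start position failed


print(search_abc("12ababc"))
print(search_abc("12abab"))
```

Running past the end of the string raises `IndexError`, which is treated the same as a failed comparison, so the loop keeps trying until the last character.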
Anyway, we match all three characters and then we have a match. Fireworks. And that's it, really. Well, more or less.

You see, there's of course some other stuff in regexes. What if we have an or statement? So we want to match either abc or abx. Now, if we look at this, then we might come up with a state diagram that looks something like this: match an a, match a b, then match either a c or an x. But no. Because the regex engine is dumb. It's really dumb. It will create two paths. We try to match abc, and if that doesn't work, we try to match abx.

What does that look like? Well, if I match that against pabx, we first try the a in the upper path. It doesn't match the p, so we try again to match an a, and it still doesn't match. Weird. So we move on to the a and we match that to the a in the upper path. Yes, it matches. Okay, the b matches, the c does not match. So the upper path does not match. Okay, we move on and try the lower path: the a, the b, the c... sorry, not the c, the x. Yes, we've got a match.

So what the execution model does is, it tries every path. It tries the leftmost path in your regex first, or, when you look at the graph, the upper path in the graph is tried first, and it returns success on the first full match. It fails when all paths fail, and it will try all paths before moving along.
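The two-path behaviour described above can be observed with Python's `re` module. This is a small sketch; the factored pattern `ab[cx]` is added here only for comparison and is not from the talk.

```python
import re

# Two alternatives: the engine tries 'abc' first, then 'abx', at each position.
pattern = re.compile(r"abc|abx")
match = pattern.search("pabx")
print(match.group(), match.span())

# A hand-factored pattern that finds the same text for this input.
factored = re.compile(r"ab[cx]")
print(factored.search("pabx").group())
```

Both patterns find the same text in this input; the difference is only in how many paths the engine has to walk to get there.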
Now, let's try that in a little more complicated regex. Still nothing fancy, but we're looking for either anteater, antelope, or ant. And I'm using re.compile here. It has been said that compiling is not necessary, because the regex would be compiled for you anyway and it would use that compiled regex later on. It does not. Just try it with timeit, and you'll see that if you do not use compile, but re.search for instance, every time you call re.search that regex is compiled again and again and again. Compiling it first and then using it later in your loops is going to save you a lot of time.

So what would this compiled regex match when I try it against the string "an ant encountered an anteater"? It will match ant. Because it tries anteater against ant, and that doesn't match. It tries antelope against ant, doesn't match. And it then tries ant against ant, and we have a match.

Now, if I were to rewrite that regex and place all the words in alphabetical order, a bit easier to read maybe, and we try that again against a slightly different string, "an anteater encountered an ant", what would it match? The same bloody thing. It matches ant, and it does not look any further. It's done. There you go. So it looks for the first match. It does not look for the best match.

Right, so, okay, if this regex is a programming language, then there must be a command language, right? There is. Every alphanumeric character is a command, and the command is: does the current character match me? If so, move on; if not, backtrack. And some punctuation is also a command. And I know you're thinking, what, some punctuation? Yeah, this is where it gets a bit icky. Some punctuation. This, for instance, is not a circle. It's a dot. And the dot matches a dot, so it does exactly what an alphanumeric character does, except that it also matches any other character. So actually a dot matches any character, except for a newline, because that's something different.
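The anteater example above can be reproduced as follows. The exact strings are reconstructed from the audio, so treat them as assumptions. The timing part is only a sketch of the suggested timeit comparison: the `re` documentation notes that recently used patterns are cached by the module-level functions, so any difference measured is likely cache lookup overhead rather than a full recompilation, and the numbers will vary by machine.

```python
import re
import timeit

longest_first = re.compile(r"anteater|antelope|ant")
alphabetical = re.compile(r"ant|anteater|antelope")

# First match wins, not the longest or "best" one.
first_hit = longest_first.search("an ant encountered an anteater").group()
second_hit = alphabetical.search("an anteater encountered an ant").group()
# Putting the longer alternatives first lets 'anteater' win where it occurs.
third_hit = longest_first.search("an anteater encountered an ant").group()
print(first_hit, second_hit, third_hit)

# Rough timing sketch: precompiled pattern versus module-level re.search.
text = "an anteater encountered an ant"
precompiled = timeit.timeit(lambda: longest_first.search(text), number=20_000)
module_level = timeit.timeit(
    lambda: re.search(r"anteater|antelope|ant", text), number=20_000
)
print(f"precompiled: {precompiled:.4f}s  re.search: {module_level:.4f}s")
```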
That's when we get to multi-line regexes. Don't think about that right now.

We also have a caret, and the caret doesn't even match a character: it matches the start of the string. The dollar does the same thing, but then for the end of the string.

And we have some other tools. We've got "or", for instance. That's two square brackets and then the characters between the square brackets, and this matches either an o or an r. We also have ranges, so we can match all lowercase a through z. That dash, the hyphen in there, right now doesn't do anything except create a range. If it was in another position, like this, it would match a minus or an a or a z. If the hyphen is in front of the square brackets, it matches a minus or... sorry, a minus and then an a or a z.

We can concatenate those ranges. So if I want to look for a hexadecimal digit, this is what I do: we match either a decimal digit, or capital A through to F, or lowercase a through to f. Now you might be tempted to put some spaces in there, just for visibility, right? But then we'd match a 0 to 9, or a space, or an A to F, or a space, or an a to f. So that does something different.

We also have "nor", not-or. The same square brackets, and now that caret has a different meaning all of a sudden. It's between the square brackets, so now it means anything that is not an o or an r.

And we have loops. And you're thinking, loops? What, loops in regexes? Yes, loops. Just like the while and the for loop, except in regexes the loops are actually both a while and a for loop at the same time. That sounds great, doesn't it? What does it do then? Well, it loops while there's no exception, but only for m to n iterations. Now, m to n may be zero to infinity, but it's still only that much, right?
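The square-bracket forms described above, tried out in Python. The sample strings are made up for illustration.

```python
import re

either = re.findall(r"[or]", "door")             # an o or an r
lower = re.findall(r"[a-z]", "aZ-m")             # the hyphen creates a range
hyphen_first = re.findall(r"[-az]", "aZ-m")      # a minus, an a, or a z
hyphen_outside = re.findall(r"-[az]", "a-z -q")  # a minus, then an a or a z
negated = re.findall(r"[^or]", "door")           # anything but an o or an r

hex_digit = re.compile(r"[0-9A-Fa-f]")
spaced = re.compile(r"[0-9 A-F a-f]")            # the spaces are part of the set

print(either, lower, hyphen_first, hyphen_outside, negated)
print(bool(hex_digit.fullmatch(" ")), bool(spaced.fullmatch(" ")))
```

The last line shows why spaces added for readability change the meaning: the spaced class accepts a space as a "hex digit".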
And this is how we program a loop. This means: match the previous character zero or more times. Zero to infinity. If I have a regex a star, we will match an a zero or more times. So this will actually also match an empty string, because there are zero a's in an empty string. We can look for a repetitive pattern, actually, like this: if we match an a and a b, put that between the round brackets, and put a star behind it, we're looking for ab, abab, ababab, zero or more times. An a or a b, zero or more times: use the square brackets. And dot star: we match anything, zero or more times. Dot star is used a lot in regexes. But it's not your friend. It really isn't. And I'll show you why, but first I have to show you how the state diagram for this regex works.

Right, so we remember that we try the upper path first. So we try to match against an a, and then we end up back in the same state. We try that until there are no more a's. Then we take the lower path, and we move on to the next state in the regex. If we have the or statement, we try either the a or the b and end up in the same state, or otherwise we move on to the next. If we have the abc, we match an a, then we match a b, then we match a c, repeat, unless we encounter something else, and then we move on to the next state.

The plus, another punctuation that is a loop, matches one or more times. And it looks exactly the same as the a star, well, except that we do one extra stage in front of it: try to match an a, then loop zero or more times. The question mark is a very short loop: it matches zero or one. And we can specify exactly how many matches we want to have. That's where we use the curly brackets and specify how many times we want to match. So we match n times. We can also use m comma n, so we match m to n times.

Now, those are loops. Except that in regexes we can unloop. And I don't think I've ever seen this in another language, but we can roll back the previous loop match. What? How does that work?
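The loop commands covered so far can be checked with `re.fullmatch`, which only succeeds when the whole string is consumed. The patterns follow the talk; the test strings are invented for illustration.

```python
import re

cases = [
    (r"a*", ""),          # zero a's: the empty string matches
    (r"a*", "aaa"),
    (r"(ab)*", "ababab"), # the group ab, repeated
    (r"[ab]*", "abba"),   # an a or a b, repeated
    (r"a+", ""),          # plus needs at least one a
    (r"ab?", "a"),        # question mark: zero or one b
    (r"a{3}", "aaaa"),    # exactly three a's, so four is too many
    (r"a{2,4}", "aaa"),   # between two and four a's
]

results = [bool(re.fullmatch(pattern, subject)) for pattern, subject in cases]
for (pattern, subject), matched in zip(cases, results):
    print(f"{pattern!r:12} against {subject!r:10} -> {matched}")
```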
Well, let's have this example here. Right, we've got a short sentence with some stray HTML codes in it, and I want to find that HTML code. So I create a regex with a loop. I look for a pointy bracket, and I look for anything in between, and then I look for a closing pointy bracket.

Right, we start to look for that and we find the first opening pointy bracket. Then next we match any character zero or more times. Well, the b is any character, the pointy bracket is any character, the e, the x, the a and so forth, they're all any character. And then we reach the end of the string. But we still have some regex left. We need the closing pointy bracket. It's not there. We're at the end of the string. There's nothing there. Right, so we fail. Except the regex engine goes back and says, hey, give me back that last character. So we get the dot back and we match that against the pointy bracket. Still doesn't match. Okay, give me back the other character, then. Okay, we get back the pointy bracket, we match, and yay, we've got a match. And we rolled back two characters.

So in this example that's not so bad, right? We only rolled back twice. But what if I have a longer text? And this is just generated lorem ipsum, don't try to read into it, but there's a bit of stray HTML over there. Now, if we had to roll back here, we'd have to roll back all the way from the last "enim." back to that pointy bracket. That's more than half the text it would have to unroll to get to the match. That would take a little long.

So what do we do? We use minimal loops. And the minimal loop is exactly like the maximal loop, except that it has a question mark. It's more like "zero, or more". So what does that do?
Well, the same thing: it finds the opening pointy bracket. And then it has to match anything, but because there's a star question mark here, it first checks whether the next character is the next character in the regex. Well, this doesn't match, so it matches anything. We move on. Match again. Hey, this does match. So we get a match and we finish. Now, this might even be a better match than we had with the previous example, right? And it's way faster. So the minimal loop executes the commands in the loop as few times as possible.

So what would that look like in a diagram? I think you're going to like this. We just do this. We try the upper path first, and then we do the loop.

So what do we do? What do we use, minimal or maximal? Well, first of all, use the regex that produces the expected result. We saw the difference in the results just now. Second, use the one that does the least backtracking.

And on that backtracking, I've got the lorem ipsum text again, but this time I've got two hashes in there. So I'm just going to make this a little smaller. It doesn't contain any readable text anyway. So let's just say we use the maximal loop to search for the text between those two hashes, including the hashes. Right, if I run that through timeit on my laptop, it gives me 633 nanoseconds on average. Okay, what about the minimal loop then? If I run that, I get a whopping 1.79 microseconds. It's three times slower. Why? Well, because most of the text is in between those two hashes. So in between those hashes, it's comparing each and every character to a hash. So it takes a bit longer.

So you'd say, okay, that means that the maximal loop is the winner here, right? Well, hold on. There's another trick we can do. We can forget about the dot star, and instead match anything that does not match a hash, zero or more times. So how would that do?
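The three variants compared here can be written out as follows. The sample strings are stand-ins, since the slide text is not in the transcript, and the timings printed will differ from the numbers quoted in the talk depending on the machine and the input.

```python
import re
import timeit

# Maximal versus minimal loop on a short string with stray HTML.
html = "Some <b>bold</b>."
greedy_hit = re.search(r"<.*>", html).group()   # runs to the end, then rolls back
lazy_hit = re.search(r"<.*?>", html).group()    # stops at the first closing bracket
print(greedy_hit, lazy_hit)

# Two hashes with most of the text in between, as in the lorem ipsum slide.
text = "lorem ipsum #" + "dolor sit amet " * 40 + "# consectetur"
patterns = {
    "maximal": r"#.*#",
    "minimal": r"#.*?#",
    "negated class": r"#[^#]*#",
}

hits = {}
for name, source in patterns.items():
    compiled = re.compile(source)
    hits[name] = compiled.search(text).group()
    seconds = timeit.timeit(lambda: compiled.search(text), number=20_000)
    print(f"{name:14} {seconds / 20_000 * 1e9:8.0f} ns per search")
```

With exactly two hashes in the text, all three patterns return the same match, so the comparison is purely about how much work each one does to find it.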
Well, 429 nanoseconds. We have a winner. It does not do backtracking, because when it hits that second hash, it stops and it moves on in the regex, matches the hash, and it's finished. So, one takeaway at least: don't use this, use this, right?

So, a little bit about the re module. I was going to talk about regexes and Python, right? So the re module, like I said, has the compile function. You compile your regex, and then later in the function you use your regex to search a string. If I use the search method, we will match anywhere in the string. We could also use the caret, we've seen that, which would match only at the beginning. So if I search the string here, we'll match from the start of the string. Now the regex has a helper method for that. You can use match, which actually does the same thing, so you don't need to use the caret.

If I want to match against the whole string, the complete string, I will use the caret to indicate this regex must start at the beginning of the string, and we use the dollar to indicate that it must end at the end of the string. So it will only match if the whole string matches the regex. Again, there's another helper method for that, which is fullmatch. And fullmatch, well, secretly I think it just adds the caret and the dollar. That's what it does.

The split... too fast here, sorry. The split is just like str.split, except it allows you to split on a regex, which could be handy. findall will find all matches and return them as a list, and finditer does the same but returns an iterator, so you can loop over your matches. sub: myreg.sub is the same as the string replace. It replaces whatever the regex matched with another string.

So, we talked about Unicode. The sre module supports Unicode. Now, if we have the range zero to nine, that does... well, I'm not sure how many digits there are that are not just like the zero to nine, but the backslash d is a shortcut.
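The methods listed above, shown side by side on a compiled pattern. The pattern and sample strings are arbitrary choices for illustration.

```python
import re

word = re.compile(r"[a-z]+")
text = "one two  three"

anywhere = word.search("12 abc")        # matches anywhere in the string
at_start = word.match("12 abc")         # only at the start: no match here
whole = word.fullmatch("abc12")         # must cover the whole string: no match

pieces = re.split(r"\s+", text)         # like str.split, but on a regex
found = word.findall(text)              # all matches as a list
spans = [m.span() for m in word.finditer(text)]  # iterator of match objects
replaced = word.sub("X", text)          # replace every match

print(anywhere.group(), at_start, whole)
print(pieces, found, spans, replaced)
```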
It will match any digit, even if it's a Unicode one that I don't know about. Backslash capital D, and this is within the regex, right, backslash capital D does not match a digit. Backslash lowercase s matches any whitespace character, and that's not just the space. That's also the tab, the newline, the form feed, the vertical tab even. Backslash capital S matches any non-whitespace character. And then we have the w: backslash w matches a word character, capital W matches a non-word character. And finally, a lowercase b matches a word boundary, which means the regex found a word, and we're at either the start or the end of that word. So we have the lowercase b, and uppercase is not a boundary. I had thought of doing an example of this, but I forgot about it. Sorry.

Okay, other things: the flags. The re module also supports flags, and this is just a few of them. It's not complete. But we've got the flag re.ASCII, and you can inline that into a regex with a question mark a between the round brackets, and that will ignore all the Unicode compatibility. It will just do ASCII. Ignore case is handy. You've seen that I use the uppercase A through F and lowercase a through f. With ignore case I can leave one of those ranges out.

We've got multi-line, and that's a special thing. If you want to do a multi-line regex, then look that up, because it might not work exactly as you thought. The caret and the dollar actually work per line in a multi-line regex, and not for the beginning of a whole paragraph and the end of a whole paragraph, for instance.

The re.X, or re.VERBOSE, is a way to enable you to add comments into your regex. So you can build a multi-line regex and add comments, so that your colleague who comes across it three months from now still understands what you're trying to do in this thing.

So there are multiple ways you can set these flags. You can add them at the beginning of your regex: opening bracket and a question mark, and then the different options, closing bracket. And another way is
to add them as flags in your function call. And here you need to do that thing there, the pipe, to combine the different options you want to use.

And that's it, really. It took me a bit longer when I practiced this, but that's it. Any questions?

Thank you, Danny. So we have four microphones in the aisles for the questions, and we have some time for questions.

Hi, just out of curiosity: do regexes support Chinese characters? You said Unicode, but what is each... is that a word, is that, uh...

Yes. In the sre module there's Unicode support. So yeah, I don't really know how that works with Chinese characters, but yeah, it should support that in some way. I've never tried, but it should.

Hi, you mentioned compile. Can you give us some indication of what happens under the hood when we compile the regex? Does this produce the code you showed before, or something different?

The compile will make sort of like the bytecode array that is executed by your regex engine. So I don't know how I could show that, but it's actually what is physically executed within the engine that is produced by the compile. Yeah.

Hello, I have a question about parentheses. Because you showed an example of looking for an a-plus-b string in parentheses and then an asterisk. How does it relate to groups in regular expressions? Because as far as I know you use parentheses for groups as well.

Yeah, that is actually a group you are looking at. I wanted to include groups, but I was afraid that we were going to go beyond my time limit. But that is actually an unnamed group.

Okay, thank you. We also have lookahead and lookbehind assertions. Is it easy to give us an example of what happens, like the diagram you used so much before for simpler things?

Oh. No, I'm sorry. I wouldn't know how to create that just now. Sorry. Anyone else?
Just sorry anyone else I do have one um, it's more basically what based on your experience with regexes and your examples with minimal loops maximal loops and um, and refined character ranges So, yeah, when we have or when we know the the input then we can compare the performance and decide which one to use What would do you have recommendations or how would you how would you go? Before you actually have input, right? I design time. Which which loop would you would you choose? Um, well between the minimum and the maximum It's actually pretty easy because if you look at that if you get if your match is going to be beyond Half of your string Then it's better to use the maximum Because that will have the least backtracking if it's shorter then it's better to use the minimum But yeah, that's Of course, you you need to know what your input is going to look like before you can make that that decision Yeah, no civil bullet Okay, thank you Uh Maximum and minimum loop is what is the documentation is stated as greedy and non greedy regular expressions Yeah, it's the same thing. It's the same thing. Yeah All right. Thank you again, and let's let's just give a round of applause for Danny