Okay. Next up is one of my favorites, and I think a favorite of a lot of other people out there: regular expressions. We've got Ilya right here. Let's have a round of applause. Let's hear it. Thank you. So, I am indeed Ilya, and I am indeed here to talk about regular expressions in Python. I'll just quickly say a couple of words about myself. I'm just finishing my first year in a master's program at the University of Potsdam. I tinker a lot with the Natural Language Toolkit (NLTK) library in Python, and I've made very, very small bug fixes to CPython and Matplotlib. I also work at eGym, which is a German startup that's using digital technology to really change how we interact with fitness. But my talk is not really related to any of the things I do. In November, there was a post by Armin Ronacher, who I'm sure you all know as the author of Flask and Jinja2 and a bunch of other useful web libraries. He wrote about how he used an undocumented feature of the regular expression module to improve his lexer's performance. Reading that, I thought, hey, what other hidden gems are there in the regular expression module that we just don't know about? So I went through it a little bit and compiled a bunch of things that I thought were interesting, and that's what I'm going to present today. The talk will consist of the following parts. First, I'll give a very brief history of the module's development. Then we'll talk a lot about compilation. Then I'll go over the re module's flags. And finally, we'll talk about the match object interface: what to do once you actually have a match. All right, the history part. The current implementation of the regular expression module in Python is actually the third attempt at tackling this problem. At first, Python came with a module called regex that was more similar to awk and grep, in the sense that it was a deterministic engine with very basic functionality.
Then people heard about Perl and said, we want the same in Python. So the regex module was phased out and replaced by the re module, which had pre as the back end. The origin of the P is a little bit unclear to me; I think it's probably because of Perl. And finally, pre was optimized and basically rewritten from scratch as sre. It's called sre because it was written by Fredrik Lundh of Secret Labs, and I'm guessing that's where the S comes from. Since then, for about 15 years, really the only major feature that was added to it was Unicode support; other than that, it was just basic bug fixes. So it's kind of old, as far as code goes. You can see that it consists of a C module and a Python component, and sometimes, if you put the two next to each other, it's kind of hard to tell which one is which just from the way they're written. Another feature that sort of got carried over I'll mention later, when we talk about flags. So, enough about history; let's now let it rip on a real problem. Namely, let's tackle something that's been bothering humans for a very long time: let's search for God. And I think the most appropriate place to start searching for God is actually the Bible. So let's take the King James Version, since it's freely available and it's just a four-megabyte text file, really easy to load into memory as a string and then let re do its thing. So we just import re and perform re.search. Wonderful. We get some results. Interesting. But we kind of want to expand our search. We want to start looking at other texts, other gods we can find. So let's try the New American Bible, or let's try the Wall Street Journal, just for the heck of it. And we can keep adding these until we're blue in the face, but probably you're all itching because I keep rewriting God all the time, and it would be nice if, when I wanted to change this regular expression, I didn't have to go and change it in 50 places.
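A minimal sketch of that first search, with a one-sentence stand-in for the four-megabyte Bible file (the filename in the comment is hypothetical):

```python
import re

# Stand-in for loading the King James Bible as one big string, e.g.:
# text = open("kjv.txt").read()
text = "In the beginning God created the heaven and the earth."

match = re.search(r"God", text)
print(match.span(), match.group())  # (17, 20) 'God'
```

re.search returns None when nothing matches, so real code would check for that before calling .span() or .group().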
So let's reuse the pattern. The naive way to do this is to just save it to a variable and then plug that variable in everywhere we had God before. Another way is to compile it into something mysterious called a pattern object and then use the methods on this pattern object to search. And the question is: why would we want to compile? Why would we want to use this method instead of just using a string? There are several arguments used to encourage it. The official documentation says we can modify the search scope a little bit: a pattern object's search is different from re.search in the sense that you can give it a start position and an end position, so you can search in a part of the string instead of the whole string. That's cool. That's neat. Other people say it improves readability. That's a question of taste, so I'm not going to touch it in this talk at all. I'm instead going to zero in on an argument I've seen on Stack Overflow, which claims it's faster. And I'm not entirely sure about that. So: is re.compile faster? The claim is that plain re.search is slow and compiling first is better for speed. Let's investigate that using the implementations of all these methods. Let's look at re.search first. As we can see, it uses something called _compile and then calls search on the resulting object. Wonderful. What does re.compile do? Oh, it uses the same function. So just based on this evidence, we could think that it's probably better to compile first and then use search, because we would be saving ourselves a step. But it's not all that simple. If we look at the implementation of _compile, we can notice a couple of things. First of all, it uses a cache. Secondly, before it does anything else, it checks that cache. So what we thought was essentially two compilations really boils down to one compilation in our first re.search, and the second time we search, it's just a dictionary lookup.
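A small sketch of both points: the pos/endpos arguments that only pattern objects accept, and the shared cache. The identity check at the end relies on a CPython implementation detail, not a documented guarantee:

```python
import re

text = "God said: let there be light. And God saw the light."
pattern = re.compile(r"God")

# pos/endpos exist only on compiled pattern objects, not on re.search:
print(pattern.search(text, 10).span())   # (34, 37): skips the first "God"
print(pattern.search(text, 0, 3).span())  # (0, 3): restricted to the start

# In CPython, re.compile and re.search share one internal pattern cache,
# so compiling the same pattern again is just a dictionary lookup:
a = re.compile(r"God")
b = re.compile(r"God")
print(a is b)  # True, as long as the cache hasn't been cleared
```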
So we're not actually saving that much speed. Of course, this depends on when the cache gets cleared. The limit is set at about 500 entries, and normally, just based on playing around with it, I wouldn't expect you to run into that limit. But you might, if you're loading some framework or module that uses regular expressions heavily, and then you get somewhat unpredictable performance. Realistically, though, for most programs there's not going to be a serious speed benefit to using compile. It's going to be slightly faster if the cache gets cleared. And if you really, really care about optimizing that much, I would recommend you think hard about the regular expressions themselves, because Python, and Perl for that matter, like most advanced regular expression libraries, uses a non-deterministic regular expression engine in the back end, and its performance is entirely driven by your regular expressions. So if you find a way to optimize those, you'll gain lots of speed. I'm not going to talk about that specifically in this talk, because people write books about it; it's kind of a heavy topic. Instead, I'll close this topic by saying: yeah, sure, use re.compile, but don't expect it to magically make things fast. All right, let's get back to the Bible. You were reading it and you came across this line, and you realized: oh crap, my regular expression doesn't capture this. What do I do? So you go to the documentation, you read a little bit, and you find that there is a solution. You can use something called re.IGNORECASE, give it to re.compile, and then search, and your searches will be case insensitive. But what is re.IGNORECASE? If we print it, it's just an integer. But we can stack them, so we can combine several flags together using this pipe character, the bitwise OR. We can do this ad infinitum. So what happens underneath the hood? The bitwise OR basically takes advantage of the fact that all integers are encoded in binary, obviously.
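A sketch of the flag mechanics just described:

```python
import re

print(int(re.IGNORECASE))  # 2: a flag is just an integer under the hood

# Combine several flags with the bitwise OR:
flags = re.IGNORECASE | re.MULTILINE
pattern = re.compile(r"god", flags)

match = pattern.search("AND GOD SAID")
print(match.span())  # (4, 7): matches despite the case mismatch
```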
And if you choose your integers well, namely if you choose them all to be powers of two, they're basically one-hot encodings, where the single one bit in each value has a unique position. So combining them, chaining them with the bitwise OR, records which ones are set. And conversely, using the bitwise AND, you can figure out which options are present and which are not. Now, this pattern was not really on my radar, for one simple reason: I realized I almost never use it in Python. And then I thought, well, maybe I'm crazy. I'm a linguist by training; maybe I just write weird Python code. But I also use other people's libraries, and they don't use this pattern either. So maybe it's just rare. Maybe it's uncommon in Python these days. I decided to verify this, and what better way to verify it than to check the standard library? So I read through the documentation for the whole standard library, a couple of sleepless nights, and I found only two modules out of the 240-ish that use these bitwise flags: os, for opening and accessing files, and socket. There are two interesting things about this. A: the standard library confirms my intuition that they're not very common. And B: the two modules that do use bitwise flags are very low-level stuff. So to me it seems like the re module somehow miraculously retains something from an older era that was refactored out of more or less the rest of the standard library but survived in the modules that deal with low-level operations. And in re. Cool. Well, that was a fun rabbit hole. Now, the natural next step in the life cycle of a regular expression would be to talk about searching and matching, but unfortunately my C skills are just not up to par to present a coherent picture at this point. They've improved quite a bit since I started working on this, but not enough for anything I can present publicly.
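A sketch of checking combined flags with the bitwise AND, next to the same idiom in os (the particular os flags here are just one illustrative combination):

```python
import os
import re

combined = re.IGNORECASE | re.MULTILINE

# Bitwise AND tells you whether a given flag is present in the combination:
print(bool(combined & re.IGNORECASE))  # True
print(bool(combined & re.DOTALL))      # False

# os uses the same style for its low-level open() flags:
open_flags = os.O_WRONLY | os.O_CREAT
print(bool(open_flags & os.O_CREAT))   # True
```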
So we're going to go straight to the match object. This part of the talk is a little different from the previous ones, because I'm actually not going to try to say anything new whatsoever. I'll just be rehashing things that everyone already knows. There are no real hidden pitfalls or weird surprises when it comes to match objects, and the documentation is actually very clear about them. And yet I find that, at least personally, whenever I use the re module, I have to look it up every time. The difference between group, groups, groupdict, and all the other things you can do with match objects throws me off a little bit. And I don't think I'm the only one, because I occasionally see code like this when I read other people's code, and my hope is to come up with a simple and succinct rule of thumb that will encourage people to avoid patterns like that, because they're not playing to the match object's strengths. So let's have an example. We compile a regular expression, and I chose this one on purpose to be a little bit complicated: it has two groups, both named. The first one is called leads, and there we're searching for the string God; then we have a space, and then a second group named follows, where we match one or more alphanumeric characters. Then we take a text (I just chose one sentence) and we get a match by calling the pattern object's search. And if we print the match, we see it. Wonderful. Now, what can we do to get more information out of it? What I really want folks to take away from this is that the match object responds to three types of requests, three questions. First, you can ask it for the whole matched string; this includes groups, non-groups, everything in between. Second, you can ask it for an individual sub-match; so you can ask it either just for God or for the follows group.
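Here's a guess at what that pattern looks like; the exact regular expression from the slides isn't shown, so the group bodies are assumptions based on the description, with the three kinds of questions asked of the resulting match object:

```python
import re

# Assumed pattern: a named group "leads" matching the literal "God",
# a space, then a named group "follows" matching one or more word characters.
pattern = re.compile(r"(?P<leads>God) (?P<follows>\w+)")
text = "In the beginning God created the heaven and the earth."
match = pattern.search(text)

print(match.group())           # 'God created': the whole match (group 0)
print(match.group("follows"))  # 'created': one sub-match, by name
print(match.group(2))          # 'created': the same sub-match, by position
print(match.groups())          # ('God', 'created'): all sub-groups, as a tuple
print(match.groupdict())       # {'leads': 'God', 'follows': 'created'}
```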
And finally, you can get all the sub-groups together: you ignore the parts of the regular expression that are not in groups, and you just get the groups individually. All right, so the total-match case. You simply call match.group() and you get the entire string that matched. You can also call match.group(0); zero is implicit, so the clearer way is to call it without any argument. And that's literally it; that's all there is to the total match. Then, if you want individual sub-groups, you can call .group() with integers starting from one, because zero is taken. Or, if you named your groups, you can give it the names instead, and that works too. Oh, I made a typo in there: that second match should say "created". Finally, if you want all the sub-groups at once, you just call .groups(), and that returns a tuple. And if you have named groups, you can also call .groupdict(), and that returns, obviously, an unordered dictionary. So when people call .groupdict() and then access individual keys in it, what they're really going for is .group() with the key name. You only really need .groups() and .groupdict() if you plan to pass the whole data structure on to whatever does your processing. And that's more or less it. The things I want you to take away from this talk: number one, the re module is old. Number two, the use of bitwise flags in re is kind of unique; you don't really get that anywhere else in Python, these days at least. Number three, use re.compile, but don't hope that it will magically speed up your code by large factors. And finally, I hope, fingers crossed, that you walk away with a slightly clearer notion of what the match object does and how to access it. Thank you. Okay, do we have any questions? Do you know about Python RECs? Python RECs? Yeah, it's like requests: regular expressions for humans. Oh, okay, no, I haven't heard of it.
I've heard that there was an attempt to rewrite the regex module again and expand its functionality quite a bit, and a few years back there were lots and lots of wars on the mailing list about adding it to the standard library, but in the end they basically decided not to do it. You can get it on PyPI. Yeah, yeah, I think it uses re anyway. Oh, okay, cool. I've got two questions. The first is about compile. You compared compile and search by comparing code. Have you measured whether one of them is actually faster? No, I haven't, I have to admit. I went by which steps would be necessary to go and do it. From the code, I couldn't see any hidden optimizations that would then show up in timing, but we could definitely have a look at that as well. Okay, and the second question is about the bitwise flags. Which modules use them? Socket and os. Okay, thank you. Yeah, also a question about the flags. I mean, I've used them a few times and I didn't think there was anything strange about them, but what's the alternative anyway, if you want to do something case insensitive? I mean, the usual thing would be to have Boolean keyword arguments, right? So you have explicit arguments saying this option is true. Or, what I've also seen in some places, people use strings: you set a flag to a string and then check later in your code whether it's equal to that string. Okay, thanks. Any other questions? No? Okay, thank you. Thank you.