size. All right, so we go there.

So the exercise is this: we go back to the other example, where we had this FASTA file with a bunch of sequences, so text sequences, A, T, G, C and so on, and you have your sequence similarity function. Your goal is to try to optimize it using either NumPy, Numba, or Cython, preferably all three, and to see which one gives you the best result and what sort of tricks you can put in place. Try to think it through and get the best performance that you can. Let's say that this is really the challenge here. It has happened a few times that I have been quite surprised by some of the results people came up with, with very, very good improvements in performance. So my hopes are up.

Okay, so now we want to try to optimize this. I think you have now seen that these various methods can all kind of work. On the very simple examples provided in their documentation, they work super, super well. But as soon as you try to apply them to something a bit different from what they were designed for, you can have varying degrees of success. That's perfectly normal, and it's part of the learning curve to learn how and when to apply them. It also takes a bit of time and practice, so don't hesitate to come back to this sort of exercise later on.

All right. So we have our two functions. The first is sequence similarity. It takes two sequences and computes the similarity between them by just going through them element by element: if the two elements are the same, you add plus one to a little similarity score, which is then divided by the length of the sequence. And then the second one just wraps that function for all pairs of sequences in there. When we time it, we see that it's fairly slow, right? There are only 500 sequences of small size, but it still takes a third of a second. So we may want to change that using our various libraries.
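For reference, the two functions described above could look roughly like the sketch below. The names `sequence_similarity`, `pairwise_similarity` and `lseq`, and the three toy sequences, are assumptions made for illustration; the actual notebook code is not shown in this transcript. The sketch also assumes all sequences have the same length.

```python
def sequence_similarity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences match."""
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            score += 1
    return score / len(seq_a)


def pairwise_similarity(lseq):
    """Apply sequence_similarity to every pair of sequences in lseq."""
    n = len(lseq)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            matrix[i][j] = sequence_similarity(lseq[i], lseq[j])
    return matrix


# Toy input (hypothetical); the exercise uses about 500 sequences from a FASTA file.
seqs = ["ATGC", "ATGG", "TTGG"]
result = pairwise_similarity(seqs)
```

This is the plain Python baseline that the NumPy, Numba and Cython variants are compared against.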
So first things first, and I will spare you that one: you can already get a times-two speedup very easily by just using the little trick that we used earlier. Remember: if j < i, continue, and then at the end you just say that the matrix is equal to itself, but by symmetry. All right. So just remember that one. We will forgo that trick for now, because we want to focus more on the libraries themselves and what they do. But remember that this should always be your first go-to, right? Because this will always work. And now we are going to see what we can get without relying on that, presuming that you've already removed the redundant stuff. And if, for instance, our similarity measure were not symmetric, we could not apply that trick. So, yeah.

Okay. So now, first, let's start with Numba. I think Numba is always a very good first approach, maybe NumPy or Numba, but Numba is nice because it's just a decorator and it can get us very good results. So here there is just the exact same code, but with the @jit there. Now, if you try to run this, you will encounter a first little problem. So let me try that directly. Let's say I just take that, try it out, apply this function there, and see what we get. One, two.

What you see is problematic stuff. It's mostly a Numba pending deprecation warning. So in itself that's not an error, but it's something you might want to care about. What they say is that right now you use a type that is scheduled for deprecation: type "reflected list" found for argument lseq. So we don't know exactly what a reflected list is, but we know that it has to do with this argument there. It's not very happy with our list. And a second little thing: we see that our performance is actually worse than before. So first, let's try to address the warning.
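As an aside before the warning is dealt with, the symmetry trick recalled at the start of this part can be sketched as follows. This is only one possible way to write it, in plain Python, with hypothetical function names and toy data; it assumes equal-length sequences and a symmetric similarity measure.

```python
def sequence_similarity(seq_a, seq_b):
    """Fraction of matching positions (equal-length sequences assumed)."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def pairwise_similarity_symmetric(lseq):
    """Compute only the upper triangle, then mirror it."""
    n = len(lseq)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < i:
                continue  # skip the lower triangle, it is filled by symmetry below
            matrix[i][j] = sequence_similarity(lseq[i], lseq[j])
    # Mirror the upper triangle into the lower triangle.
    for i in range(n):
        for j in range(i):
            matrix[i][j] = matrix[j][i]
    return matrix


seqs = ["ATGC", "ATGG", "TTGG"]  # toy input, hypothetical
sym = pairwise_similarity_symmetric(seqs)
```

Roughly half of the calls to `sequence_similarity` are skipped, which is where the factor of about two comes from.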
So, because they are nice, the warning gives you this little link there. If you follow it, you will learn that what they call a reflected list is a Python list that could potentially be changed inside the function. In a pure Python function that's fine, and that's something you can do. But when you have an interface to account for between two languages, such as Python and C, it becomes a nightmare. So they decided that they would deprecate support for this sort of feature, and they let you know.

The way they propose you go around that is by building a little interface, and that's what I built here: from numba.typed you import List, with an uppercase L, and then you create that list with about the exact same code that they propose when you follow that link. And then you feed that to the function. So we do that, and we call it again. I will just do it now with -n 1 so that it runs a bit faster. Now you see we don't have the warning anymore. We are still not much faster, but at least we don't have the warning anymore.

Next, and I know that I'm kind of fighting a losing battle there, but I still want to spend a bit of time on it: before we worry too much about the timing, we have to check that we get the same result. Because if you have a function that is faster, it doesn't really matter if you don't get the same results in the end. So you need to check that you still get something correct. Here there is not too much risk, because we made only fairly minimal changes, but nevertheless it's nice to check. So I propose two ways of doing that. The first is that I create a set of random sequences, and then I check that both functions return the exact same result.
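A minimal sketch of both steps, the typed list and the random-sequence check, might look like this. It is not the code from the session: the function names, the helper `to_typed_list`, and the sizes (20 sequences of length 30) are assumptions. The `try`/`except` is also an addition for this sketch, so that it still runs, without any speedup, when Numba is not installed.

```python
import random

import numpy as np

try:
    from numba import njit
    from numba.typed import List
except ImportError:
    # Fallback (assumption for this sketch only): no compilation, plain lists.
    def njit(func):
        return func
    List = list


def pairwise_python(lseq):
    """Reference implementation: explicit loops over pairs and positions."""
    n = len(lseq)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            a, b = lseq[i], lseq[j]
            score = 0
            for k in range(len(a)):
                if a[k] == b[k]:
                    score += 1
            matrix[i, j] = score / len(a)
    return matrix


# Same code, compiled by Numba (equivalent to putting the decorator on it).
pairwise_numba = njit(pairwise_python)


def to_typed_list(lseq):
    """Copy a Python list into a numba.typed.List to avoid the reflected list."""
    typed = List()
    for seq in lseq:
        typed.append(seq)
    return typed


# First check: random sequences, both implementations must agree.
rng = random.Random(0)
seqs = ["".join(rng.choice("ATGC") for _ in range(30)) for _ in range(20)]
expected = pairwise_python(seqs)
observed = pairwise_numba(to_typed_list(seqs))
same = np.allclose(expected, observed)
```

If `same` is false, there is no point in timing anything yet.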
All right. And then the second one would be maybe more visual. I just create a very small example where it's fairly easy to compute manually what should happen. Like, you know that between this one and this one there is one difference, and then two differences, and then here six, and so on and so forth. So I can just apply both and check that they return the same matrix. Here they do. So I am happy to say that at least the two functions behave the same way.

And then, now that we have checked what they do, we can actually benchmark the two implementations. And what we see is that the Numba one actually performs worse. So here Numba really fails to do its magic. It's not always that simple. The main problem is, as I told you, that Numba is super good with numbers, but when it comes to anything else, it's actually quite bad. And you see that in action. So, something to think about.

All right. So that's the Numba one. Are there any questions at this stage, or have you done things differently? No questions. Everything good so far. I know it's getting a bit late; maybe some of you are getting a bit tired. Yes. Ah, some signs of life. At least one person is alive. For the rest, we'll have to see.

Okay. So the next one would be NumPy. For the NumPy solution, I describe here a little bit what I do, but the idea is to say: rather than doing a for loop, with NumPy what we can do is this. We have a NumPy array... why doesn't it type? Sorry. Okay, there we go. So we have an array with, say, 1, 2, 3. And then we have another NumPy array whose first element is the same, then different, then the same, for instance 1, 4, 3. And what we can do is an equal-equal between them, and then we get True, False, True. So we get False for everything that is different and True for everything that is the same.
And then if we sum this thing, True is counted as one and False as zero, so it counts the number of elements that are True. Oops, sorry. There you go. So the sum is the number of things that are True. And then, because we want to divide by the number of elements, we could do a divided-by-three there, but it's actually equivalent to just asking for the mean of this True/False vector. So far so good? Yes.

Okay. So with this in mind, we rewrite the sequence similarity in NumPy. We have seq_a and seq_b, and we just say (seq_a == seq_b).mean(). And that's it: check equality and compute the mean of this equality. That's all there is to it.

The little thing we then have to do is use a little trick that I've given you there. If we want to turn a string into a NumPy array and we just do np.array of "ATGGCA", we see that it doesn't really work: it's an array with a single element, which is the whole string. That's not what we would like. We would like to have the letters separated. So we have to add this little additional step of transforming the string to a list first, and then to an array. And then we get an actual array with each letter being a distinct element. So that's the trick I have given you to transform a string into an array.

With that, we have the two elements that we need. The first is just that similarity function. And then in the second, we first take lseq and convert it to NumPy arrays of characters, and the rest is the same thing: we call the NumPy sequence similarity, and so on and so forth. And that's it. So I first run this one, then I do the two checks to verify that they give the same thing. All right. So indeed this gives the same result.
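The NumPy version described above can be sketched as follows. The function names and toy sequences are assumptions made for illustration; the two ideas taken from the explanation are the `list(...)` step for turning a string into an array of letters, and the mean of the elementwise comparison.

```python
import numpy as np

# A string passed directly to np.array stays in one piece.
whole = np.array("ATGGCA")
# Going through list() first gives one element per letter.
letters = np.array(list("ATGGCA"))


def sequence_similarity_numpy(seq_a, seq_b):
    """Mean of the elementwise equality between two character arrays."""
    return (seq_a == seq_b).mean()


def pairwise_similarity_numpy(lseq):
    """Convert every sequence once, then compare all pairs."""
    arrays = [np.array(list(seq)) for seq in lseq]  # conversion done once per sequence
    n = len(arrays)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            matrix[i, j] = sequence_similarity_numpy(arrays[i], arrays[j])
    return matrix


seqs = ["ATGC", "ATGG", "TTGG"]  # toy input, hypothetical
result = pairwise_similarity_numpy(seqs)
```

Note that the conversion to arrays happens before the double loop, not inside it, so each sequence is converted only once.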
And I can check visually on a very simple example that they do. And then finally, I can time the two against one another. What I found when I tested this is that I get about a times-four speedup. While this runs... okay, here only times three.

So I know that many of you tried NumPy but did not see an improvement. What did you do when you tried to switch to NumPy? Please write, or raise your hand and speak, to tell us how you tried to tackle that and why you think it maybe didn't work. Yeah, you're okay.

I have to find it, just a moment, but I created two NumPy arrays for the sequences, and then I compared them and summed it up. But that was much too complicated. It was for individual statements.

Yeah. Okay. So you did the comparison with a for loop maybe, or something like this.

I can paste it.

Yes, I think I see. That might be a bit complex on the surface. But what happens there is that you convert the sequence from string to list several times, because you do n squared operations. So each sequence gets converted from string to list n times. This is a costly operation, and if you had instead done it once, as I do here, then I think you might see better performance. But yeah, as you can see, it can sometimes be a bit tricky and we have to think. And that's also where you could use profiling to see which call takes the most time, the call to list or the call to equal. Then you would see that it's maybe the call to list, or the conversion from one to the other, and you would work on reducing the redundancy there. So it's also kind of a process. But yeah, you could try that otherwise.

Okay. So then we do get a speedup. So, right: Numba, not easy; here, we do get a speedup.
And then of course, as I've shown, you can also apply the little trick; here I do it slightly differently from what I did before. But if you just ensure that you don't compute the same thing twice all the time, then you can also multiply your speedup by two, just because you don't do redundant computations. So you get something better.

Now, are there questions about this NumPy optimization, things that you may have done differently, or things that are still a bit unclear? None? All right. Yes.

So there was a quick question about broadcasting. I've pasted the link in the chat to explain what broadcasting is. I surely don't think I can do a better job than they do at explaining it, because I think they do a very good job. In this particular case, broadcasting is not really going to help you, because as we have seen, what we want to do is just compare two vectors which are the exact same size. Broadcasting in NumPy refers to the rules that apply when you try to do operations on arrays which are not of the same size. NumPy has a number of small rules, but basically what it does is try to find dimensions where the arrays on which you do the operation have the same size, or where one of them has size one, which is what they say here. They go through all the shapes. So let's say you have a 2D or 3D array: they go through all the dimensions that you have and try to find ones that fit, or where one is equal to one. For example, say you have an array which is 3D, an image of 256 by 256 pixels times 3 for your R, G and B, red, green, blue channels. If you multiply it by a vector of size 3, it will understand that this 3 corresponds to that 3. And then it multiplies element by element, such that you have one number that applies to each of the colors. All right.
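The image example can be written out as a small sketch. The array contents and the three channel weights are made up for illustration; only the shapes come from the explanation above.

```python
import numpy as np

# A 256 x 256 image with 3 color channels (R, G, B), filled with ones here.
image = np.ones((256, 256, 3))

# One number per channel (hypothetical values).
weights = np.array([0.5, 1.0, 2.0])

# Shapes (256, 256, 3) and (3,) are compatible: the trailing dimensions match,
# so each pixel's R, G, B values are multiplied by the matching weight.
scaled = image * weights
```

The result keeps the shape of the image, and every pixel now holds 0.5, 1.0 and 2.0 in its three channels.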
So those are the sorts of rules, but they are not really what we want to use here, because we want to compare two things which are the exact same size. And that's why we can simply say this equals that, and that does the test for us. Okay.

So I think that this was a question from Tess. Tess, does that clarify what you wanted, or did you have further questions?

Yes, thank you. I think I just overcomplicated it for myself. I was trying to make a smaller similarity array, then applying it to the original.

Yeah, I see. To me, it's also a fairly common thing: when we start playing with these libraries, we tend to go very complex, when in fact, and that's one of the beauties and the elegance of these libraries, at least for NumPy and Numba, the solutions they offer us are sometimes deceptively simple. And it can take a while to get used to doing something simple again. Right. So that was for NumPy. Yes, Tess?

Oh, no, nothing. I just want to say thank you.

Ah, cool. All right. So now the next one we have is Cython. And with Cython, what I just said doesn't apply: in Cython, things are complicated. That's one of the ugly realities of Cython: it's not always easy to play with. So first off, I did something relatively simple here, where I just added some simple typing, and for the Python strings the type is str. So it's simple Python typing. There is a smarter way of typing them, which I will show just next, but it's a bit more complex to play with, so I'll start with the simple one. So this is the exact same thing as what we have done earlier, and then this here is str, str. And inside, you see that we are kind of back to our native Python: I just do for element in, and so on and so forth. And my goal is to have this be a fully typed function.
Now inside that one, I've done the same thing again. This is the exact same type of code as what we had before with our pairwise distance. I copy-pasted that from, I think, this example there; you can find the exact same style. And then here, this is the link between the two functions. So nothing too crazy there.

Now, what we see when it has compiled is that it's not too bad, but it has some difficulty in places. If you delve into that a little, you see that it mostly has to do with the strings. Each time you have a reference to one of the strings, it's unhappy, because it's still a Python string. So it tries to access it, and that creates interaction between C and Python elements. Otherwise, it's not too unhappy.

All right. When you execute this, you will see that you already get a good speedup. Just now, while playing with that one, I had a speedup of nine, and that's without the little divide-by-two trick. So that's actually three times the speedup that we had with NumPy, if I'm not mistaken, right? So now we really have a nice speedup. And that's, as I said, one of the nice things with Cython: you have more leeway in the way you do the optimization. It's a bit more flexible than Numba, because with Numba, if you are not in the configuration that Numba likes, well, tough luck, right? It can be hard to correct. But with Cython, you can always play around.

Now, with Cython, we can go more complex. We can always go a bit more complex, up to the point where we write pure C++. So here I go a bit further, and what I mostly do is switch from the Python string type, which caused problems earlier, to the C type, which is a character array, written like this.
If you know a little bit of C or C++, this will speak to you. For the others: this is how we represent a pointer, so something that references some characters. It might be one, it might be several; in this case, it's several, right? That's just a difference of type. You see that the rest of the code is unaffected; it's only this that changes. So it's really kind of a Cython trick whenever you work with strings.

Next, what we also need to do is ensure that the strings we pass in are properly encoded to be understood by Cython. This one took me a bit of time, to understand how to do it and how to do it efficiently, but this is, I think, the simplest way: I just encode them to UTF-8, and that seems to interface very well with the C code. So I do this. And now when I compile, you see that my functions are pure C, except of course when you get in and out of them, but otherwise it's proper C. And there as well, you only have a few interactions, and they are fairly limited.

So now, let's say in two levels, I'm able to achieve better and then even better performance. If you execute that, you see that you have a speedup of... here I achieved 17, but during my tests, when I didn't have Zoom to slow me down, I achieved a 21 times speedup. So now I'm seven times faster than the NumPy implementation. So I really get the best speedup there. Right. So that's the Cython one, which I think, when it comes to not changing the problem too much, yields the best result.

Do you have any comments, or any further ideas that you thought about but have not seen me test, so that we can discuss them together? I know that I have one further idea that I want to show you. Nothing for now, maybe, or maybe you're just frantically typing on your keyboard.
Well, before we end this exercise session and go on a well-deserved break, let me show you what I think is also a good way of going at it. I told you repeatedly that Numba is super, super good with numbers, but not very good with strings. Well, one reaction you could have to that is to say: let's shift the problem around, and work not with strings but with numbers. We have A, T, G and C. Let's map those to the numbers zero, one, two, three, without even thinking too much. So that's what I tried. Sometimes, you know, you go for a very simple idea and it's nice. So I just made a simple function with a dictionary mapping to zero, one, two, three. You give it a string and it swaps the A, T, G, Cs for zero, one, two, three, in the form of an array. And then I create a function to automatically change a list of these strings to a list of arrays. So, for instance, AAGC becomes 0, 0, 2, 3, or ATAGG becomes 0, 1, 0, 2, 2. So far, so good?

All right, so we have one person still alive. That's good to know. For the others, don't hesitate to use the reactions to show your enthusiasm, I would say.

Okay. So now we have that. Now the question is: will that still work? If we come back to our code... I will not come back to it. Or, yeah, I will; actually, I think it's quick enough to see. So if we come back to our code, we see that there is nothing specific to strings there. This bit of code should work the exact same way whether you have numbers or strings. And that's one of the beauties of Python sometimes, I would say. So from there, we can just come back to our code and say: I take the Numba implementation and give it the string version, and now I give it the string version that has been transformed to integers.
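A sketch of the string-to-integer conversion described above is given here. The names `BASE_TO_INT`, `encode_sequence` and `encode_all` are hypothetical, and the choice of `np.int64` as the array type is an assumption. The similarity function is left undecorated to keep the sketch free of a Numba dependency; it is only there to show that the same loop works on strings and on integer arrays.

```python
import numpy as np

# Mapping from the explanation: A, T, G, C become 0, 1, 2, 3.
BASE_TO_INT = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode_sequence(seq):
    """Turn a string such as "AAGC" into an integer array such as [0, 0, 2, 3]."""
    return np.array([BASE_TO_INT[base] for base in seq], dtype=np.int64)


def encode_all(lseq):
    """Encode every sequence of a list."""
    return [encode_sequence(seq) for seq in lseq]


def sequence_similarity(seq_a, seq_b):
    """Nothing here is specific to strings: it only indexes and compares."""
    score = 0
    for k in range(len(seq_a)):
        if seq_a[k] == seq_b[k]:
            score += 1
    return score / len(seq_a)


seqs = ["AAGC", "ATGC"]  # toy input, hypothetical
encoded = encode_all(seqs)

sim_from_strings = sequence_similarity(seqs[0], seqs[1])
sim_from_integers = sequence_similarity(encoded[0], encoded[1])
```

Both calls give the same value, which is the property the check in the session relies on.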
So basically that. And we will first check, of course, that we get the exact same result, right? You can also do the other check, and you will verify that you do indeed get the same result. Once you are happy with that, you can check which one is the fastest. And when you do that, you see that you go from something that is worse to something that is actually much, much better. I think this is about the same performance as the best Cython, right? Yeah, it's about the same thing: about a 16 or 17 times speedup. So just by shifting around the data type that we play with, we have gained a ton of performance.

And last but not least, of course, we want to be as fair as possible. You can't just count this, because you also have to count the fact that you then have to apply the function that transforms the sequences from strings to arrays of integers. And this in itself can take a bit of time. Here you see it's really not negligible: it's about 25%. So in fact, this whole solution takes about 25 milliseconds. So it's slower than Cython, but you see, it's getting there; it's very comparable. And it asked fairly minimal development of me: I still just have my @jit on the original function, and then I just had to think of the trick of switching from strings to integers. And I think this could maybe be sped up too: I could make the conversion a bit faster and apply a couple more tricks. So I might be able to get slightly below the Cython implementation if I spent more time on optimization. But yeah, that's a good remark, and that's a good thing too.

Okay, so now I hope that I've shown you different ways to go at this. I've also shown you how different libraries can sometimes have different areas where they shine.
And sometimes there is a catch. I think that here, transforming to integers worked, and that was lucky for me. But for more complex operations, I'm absolutely certain that Numba in the end will not work that well, and that Cython would be the one that shines. So it really depends on what sort of operation you want to do, and how deep you want to go as well.
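On the fairness point made earlier, that the string-to-integer conversion has to be counted in the total time, a rough way to measure the two parts separately is sketched below. It uses only the standard library and plain Python lists rather than the jitted function, and the sizes (100 sequences of length 50) are arbitrary assumptions, so the numbers it produces will not match the ones quoted in the session.

```python
import random
import time

BASE_TO_INT = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode_all(lseq):
    """Convert each string into a list of integers (plain Python stand-in)."""
    return [[BASE_TO_INT[base] for base in seq] for seq in lseq]


def pairwise_similarity(lseq):
    """All-pairs similarity on already encoded sequences."""
    n = len(lseq)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a, b = lseq[i], lseq[j]
            matrix[i][j] = sum(x == y for x, y in zip(a, b)) / len(a)
    return matrix


rng = random.Random(0)
seqs = ["".join(rng.choice("ATGC") for _ in range(50)) for _ in range(100)]

t0 = time.perf_counter()
encoded = encode_all(seqs)              # conversion step
t1 = time.perf_counter()
matrix = pairwise_similarity(encoded)   # actual computation
t2 = time.perf_counter()

convert_time = t1 - t0
compute_time = t2 - t1
total_time = convert_time + compute_time
convert_share = convert_time / total_time  # fraction of the total spent converting
```

Reporting `total_time` rather than `compute_time` alone is what keeps the comparison with the other implementations fair.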