All right, so now we know how to time one line, a bunch of lines, and specific sets of lines nested in the middle of all the others. There is one more important thing that we are building up to, which is actual profiling. Profiling is a very generic term that just means we want to see what resources each part of the code takes, in order to decide when and where to intervene with some optimization, or simply in order to know. Of course you can do that sort of manually, as we have been doing here, and that is a form of profiling, but there are handy built-in tools that do this for you, or at least help you.

So here is a little bit of code that we are going to profile together. First I have a function called readFASTA that reads a FASTA file and returns it as a dictionary. Then I have a function that computes the GC% of a sequence, and computeGCdict, which wraps that computeGC function and returns the GC% of every sequence as a dictionary. Then I have a function that computes sequence similarity: it takes two sequences and, presuming they are both the same length, for each position i it adds one to the similarity score if the two sequences match at that position, and at the end it divides by the length. And then we have our main function, the main script, which first reads the FASTA file, computes the GC%, sorts the sequences by GC%, then computes a pairwise distance matrix, and finally writes this matrix to a file. That is classical bioinformatics stuff, but you can also think of it as a classical data pipeline: we read some data, we perform a few operations on it, we compute some metrics and so on, and then we write out some results. So it is quite generic in that sense.

So I do that, and then maybe you come along and say: I have a new function that I think will compute the GC% faster than the one above. The one above uses a native for loop, while the new one uses the built-in count method of Python strings, so you think it will be faster. You execute it and you demonstrate that indeed you get a very, very good speedup: we go from 1.32 milliseconds to 36 microseconds, so using this function is definitely much better. A commendable speedup. But now one could say: all right, this one happens to be super simple, but imagine it took you three days to create this nice function. Was it worth it? Yes, it goes faster, but was optimizing the compute GC percent function really what you should have started with? It is not because you have a good idea for one function that it is necessarily the one you should focus on first, and profiling is there to help you with exactly this sort of choice.
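To make the walkthrough easier to follow, here is a rough sketch of what a pipeline like the one described above could look like. The function names, the file handling details, and the exact main script are illustrative and may differ from the course notebook; the point is just the overall shape of the code being profiled.

    # Illustrative sketch (not the exact course notebook code) of the pipeline
    # being profiled: read a FASTA file, compute GC%, compare all sequence pairs.

    def read_fasta(filename):
        """Read a FASTA file and return it as a {name: sequence} dictionary."""
        sequences = {}
        name = None
        with open(filename) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    name = line[1:]
                    sequences[name] = ""
                elif name is not None:
                    sequences[name] += line
        return sequences

    def compute_GC(seq):
        """GC% with a plain Python loop (the slow version discussed above)."""
        gc = 0
        for base in seq:
            if base in "GCgc":
                gc += 1
        return gc / len(seq)

    def compute_GC_fast(seq):
        """GC% using the built-in str.count method (the faster variant)."""
        return (seq.count("G") + seq.count("C")
                + seq.count("g") + seq.count("c")) / len(seq)

    def compute_GC_dict(sequences):
        """Wrap compute_GC to get a {name: GC%} dictionary."""
        return {name: compute_GC(seq) for name, seq in sequences.items()}

    def compute_sequence_similarity(seqA, seqB):
        """Fraction of identical positions, assuming equal-length sequences."""
        similarity = 0
        for i in range(len(seqA)):
            if seqA[i] == seqB[i]:
                similarity += 1
        return similarity / len(seqA)

    def main_script(fasta_file, out_file="similarity_matrix.tsv"):
        """Read sequences, sort them by GC%, build a pairwise matrix, write it out."""
        seqs = read_fasta(fasta_file)
        gc = compute_GC_dict(seqs)
        names = sorted(seqs, key=gc.get)          # sort sequence names by GC%
        n = len(names)
        sim = [[0.0] * n for _ in range(n)]
        for i in range(n):                        # the double loop discussed later
            for j in range(n):
                sim[i][j] = compute_sequence_similarity(seqs[names[i]], seqs[names[j]])
        with open(out_file, "w") as out:
            for row in sim:
                out.write("\t".join(f"{x:.4f}" for x in row) + "\n")

The 1.32 ms versus 36 µs comparison above is the kind of measurement you would get by timing compute_GC against compute_GC_fast on a single sequence, for example with the %timeit magic seen earlier.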
So: you can run the profiler from the command line, and it would look like this. Here is the script you want to profile, you specify that you want to use the cProfile module, which will do the profiling for you, and you want the output written to profile.log. Then -s cumtime sorts the report by the metric called cumtime, the cumulative time spent in each function, because cProfile will execute your code and, for every function that is called, record how many times it was called and how long was spent in it, either in code specific to that function or in its sub-functions.

The log you get back is a long text file; you can read through it, and most of the time that is what we do. If you are curious, you can also experiment with this little visualization library. It is not magic, but with a slightly complex report it can be a little help. You will see that most of the time, when you know your code and you don't have too many functions (here we only have five or so), the plain text profile suffices.

So you can use the command line, or you can use the Jupyter magic: here %prun, then -l 30 so that I only report 30 lines rather than the full report with everything, then I sort by cumulative time, and then I call the main script function on the medium file and write the output to a file. So let's execute that one.

Your first exercise will then be to look at this profile and try to determine where your optimization effort should go first. Also ask yourself what the effect would be of using our better GC implementation, the much faster one: how much would we gain from that? I will ask you to do that on your own; don't hesitate to ask questions if anything is unclear. When you think you have a good answer for both questions, please put a little green tick reaction in the chat so I know you are done. Let's give it three to five minutes, and then we will see what everyone answered and do a little correction together. All right, I'll let you work. Execute this line of course to get started, so that the report appears soon enough.
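While you work, here for reference is roughly what the two ways of invoking the profiler look like. The script and file names are only examples, not the exact course files, and when the command-line profiler writes its dump to a file, that dump is typically inspected afterwards with the pstats module.

    # Command-line profiling (illustrative invocation):
    #
    #     python -m cProfile -o profile.log my_pipeline.py
    #
    # The dump can then be loaded and sorted by cumulative time with pstats:

    import pstats

    stats = pstats.Stats("profile.log")
    stats.sort_stats("cumtime").print_stats(30)   # 30 lines, sorted by cumulative time

    # Inside Jupyter, the %prun magic profiles a single call directly, e.g.:
    #
    #     %prun -l 30 -s cumulative main_script("medium_file.fasta")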
Okay, so from the answers: we should optimize the double loop in the main script and the compute sequence similarity function, rather than compute GC. All right, thank you, great. That is indeed, I think, also the best answer. Let's go through what we have there; I will copy the report here so that it is a bit nicer and we can comment on it.

We see that in total it took something like 17 seconds to run; that is one thing we can check while we look at the report, and sometimes it pays to just do that. Then tottime is the total time spent in a function excluding the time spent in sub-functions, which is nice because it tells you how long the operations in that function itself took, and cumtime is the cumulative time spent in the function and all its sub-functions. We have sorted by cumulative time, which is usually what I do, so we start with the bigger-level stuff. At the top we have the entry for the whole module, which in our case covers the whole call, and then the main script. You see that the main script accounts for essentially all of the time, of course, because it is the call, but the total time spent in it, in itself, is only 0.5 seconds; almost all of our time, 17-point-something seconds, is spent in sub-functions.

And then the sub-function that pops up first is this compute sequence similarity function, in which we spent about 17 seconds of cumulative time. Of course it is called a very large number of times: it is called the square of the number of sequences, and 500 squared checks out with the number of calls reported. So per call it does not take a lot of time, but accumulated it takes a large amount of time. Then we have compute GC percent and compute GC percent dictionary: computing all of the GC% takes 0.042 seconds cumulatively, of which 0.041 seconds is the operation itself. That answers the second question: if we optimize this, say we speed it up by a factor of 100, we have gained maybe 0.04 seconds in a 17-second execution. We would not even see that; it is within the marginal variation. But if we merely halve the time spent in compute sequence similarity, we cut the execution time of this little code by about 50%. Is this interpretation clear to everyone, and are there other things that you noticed or that you would like me to elaborate on? So far so good, then.

Using %prun and cProfile is really the tool of choice to understand where you should start optimizing first. It is super useful, especially when you don't yet have precise ideas. Once you know what you want to do, you can decide to optimize this process or that process and switch to the other methods we have seen, such as %time and %timeit, but your first approach should always be to run the code through the profiler. Sometimes it also pays to run it with different problem sizes. For example (I am making these numbers up): maybe compute sequence similarity has linear complexity, so its execution time increases linearly with the problem size, and maybe, because of a weird implementation, compute GC has exponential complexity. If you test with the medium FASTA file you might see what we see here, where compute GC takes very little time and compute sequence similarity takes most of it, but on much bigger data you might see the converse ratio, where most of the time is spent in compute GC. So sometimes it pays to try two or three problem sizes to see the effect on the different functions.

Of course, most of the time we have at least small intuitions about this sort of thing. For instance, looking at the code of compute sequence similarity, there is a single for loop over the positions, so we can expect it to scale linearly, and compute GC percent also loops once over each sequence, so it should scale linearly as well; they should indeed have the same complexity. But as I said, as code grows more complex it sometimes pays to check and take this into account, because you might have surprises: if you always test with a very small dataset, things might not behave exactly as you expect when you move to a bigger one.
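If you want to see how the balance between functions shifts with problem size, a small sketch along these lines can help. The file names are placeholders and main_script is the illustrative pipeline sketched earlier, not the exact notebook code.

    # Profile the same pipeline at a couple of problem sizes and compare how much
    # of the total time each function accounts for.
    import cProfile
    import pstats

    for fasta in ["small_file.fasta", "medium_file.fasta"]:
        profiler = cProfile.Profile()
        profiler.enable()
        main_script(fasta)                         # the pipeline sketched earlier
        profiler.disable()
        print(f"--- {fasta} ---")
        pstats.Stats(profiler).sort_stats("cumtime").print_stats(10)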
Okay, now I want to pick your brains a little. Please think a bit about this part here, not the compute sequence similarity function itself but the whole step, let's say step four, which computes the sequence similarity matrix for the whole set of sequences, and try to think of a way that we could fairly easily make it faster. Without even going to NumPy or anything like that, just by thinking about the way we do this operation, how could the whole script run faster? Think a bit and write in the chat when you have an idea.

Okay, so Elisa wrote a good answer for us, let's say this is idea one, and something else was also proposed. Elisa, you remarked that the sequence similarity will be the same irrespective of whether seqA and seqB are switched: the measure is symmetric, which means the output matrix is symmetric, but we compute all pairs of i and j. That means that, for instance, we compute the similarity between sequence 1 and sequence 10 and then also between sequence 10 and sequence 1; in effect, we compute each similarity twice. This is something that could be avoided if we were a bit smart, and it would cut in half the number of operations we do: rather than computing 100 times 100 similarities, we would compute 100 times 100 divided by two. So that is an easy way to get the same result in a fraction of the time, just by thinking about which operations we do and which are redundant.

Jörg, you also proposed saving the sequence lookup in a temporary variable to reduce the access time. That is something that might be useful in another language, but in Python the access time of dictionaries is extremely fast and does not depend on the size of the dictionary: even in a very large dictionary, accessing an element takes constant time. So in this particular case I don't think it would help much; I am really not sure you would even see an increase in performance. In other languages, for instance R or C++, the access time of a dictionary-like structure can, I think, depend on the log of its size, so there it would indeed be an issue and your proposed solution would, I think, have an effect. I also want to point out that when we look at our profile, most of the time here is spent inside the similarity function itself, not in the for loops around it, so optimizing those accesses around the call would have minimal effect. But again, that is for Python; in C++ we might spend relatively more time on those accesses before entering the function, so it might matter more.

So, if I just take Elisa's proposition: we create a new cell and duplicate our main script, calling it main_script2. Then we say something like: if j is under i, we just skip that pair; and we need one little additional line that fills the similarity matrix symmetrically, so for the half of the pairs that I skip, I fill cell [j][i] from cell [i][j], which I have just computed.
And I think that is all that is needed: three lines added to the code. If we look at the timings, the original took about nine seconds, and if I call main_script2 it should hopefully take maybe five seconds or so to run through, so about half the time.

So, what I wanted to say there: we won't spend too much time on the algorithmic aspect of things; we will mostly see programming tricks to make the same code run faster. But it does pay to also think about this sort of complexity, about the redundancy of the operations and so on, because at very little cost you can already gain something. You trim the fat first, and then you optimize what is left after that.

One last thing: you may have noticed that executing this took me nine seconds, but the same call through %prun actually took 17 seconds. So the time measurement, the profiling, has an overhead: it takes some time to monitor things, and that is sort of unavoidable. Different modules have different overheads; some profiling modules have a huge overhead, some a very small one. So always keep this a little bit in mind, and take all of these numbers as relative to one another, not as the absolute value it would take to run the whole thing without the monitoring.

All right, so now we are back from the break. We just saw how to monitor the time it takes to execute a bunch of lines of code, and we finished with the time profiling of all the functions called in our code. As I said, that is something we really have to do every time before we start optimizing, because sometimes the functions that take the most time are not the ones we think, and you don't want to spend three days optimizing a function that only accounts for 2% of your execution time.

But time is not the only thing we should care about. Here, for instance, I show you what happens if I try to do the same thing on the large FASTA file. Of course it is larger, so it takes more time, but what we will run into is not a problem of time but a problem of RAM; let's wait for it to reach the point where the problem appears. Sorry, this one takes a bit longer than the others. Ah, there we go. The point is that it is always possible to wait longer, within reason of course, you cannot wait five or ten years for your result, but often we have a piece of code that takes maybe one minute or two hours, and up to a point that works: you schedule it to run during the night or before you go to a meeting, and you just have to wait a little. Sometimes, though, the problem is that to do what you would like to do, you would need more RAM. Here, what the error says is that when we try to create the matrix of differences between all our sequences, because there are one million sequences, it tries to create a one-million-by-one-million matrix, and it tells us that this is not possible: it would need about 7.1 terabytes of RAM, and you don't have that on your computer.
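As a quick sanity check of the order of magnitude, assuming the failing allocation is a dense matrix of 8-byte floats (for example a NumPy float64 array), the arithmetic looks like this:

    # Back-of-the-envelope estimate of the allocation that fails above.
    n = 1_000_000                       # one million sequences
    bytes_needed = n * n * 8            # one 8-byte float per pair of sequences
    print(bytes_needed / 1024**4, "TiB")  # roughly 7.3 TiB, the same order as the ~7 TB reported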
I mean, on most of our laptops and small computers nowadays we have between 16 and 32, maybe 64 gigabytes of RAM. That is not too bad, but we are very, very far from seven terabytes. Of course you might have access to a computing cluster, which might have a machine with maybe one or two terabytes of RAM, maybe four, but more than that is still fairly rare, and access to such big machines is a fairly costly resource. So that is when it becomes interesting to monitor and optimize not only the time it takes to execute something but also the RAM you will need, because RAM is much more of a hard resource: you can always wait one more hour, but if you don't have enough RAM, the only solutions are to find a bigger computer or to rethink the structure of your code in order to reduce your RAM requirements. Right, so that is what I wanted to illustrate there, and it lets us jump to the second part, about measuring RAM.