All right, so as I said, this first notebook is about measuring resource usage and profiling some Python code. Before you start doing some optimization, you first need to figure out where your optimization effort should be applied. Otherwise you can spend tons of time optimizing something that is already quite fast, or something that is not very efficient but only takes 1% of your execution time. So first off, I start with a few setup cells. I encourage you to do what I do, which is to just run these first few cells. Before reading too much into what's in there: this cell here takes a lot of time, so I would just launch it right now, so that while it's running I can come back and comment on what's above. So just go and launch everything up to part one, and then you can scroll back up and we can see a little bit of what's in there. This first part is just a Python magic command that allows me to reload the same module or file again and again, instead of doing a lazy load. I'll use that later on, you will see, when we monitor memory. Then comes the little basic function which we'll play around with. For our play-around function, I chose a task that is ubiquitous in data analysis, in bioinformatics, in whatever you want: computing a bunch of pairwise distances between many points. If you think about it, any sort of distance-based clustering, maybe a K-means, maybe a hierarchical clustering, maybe a KNN classifier, maybe some sort of alignment, almost always relies on the concept of a distance between different points or between different samples. And so it's very, very common that we have to compute all distances between all points, or at least between subsets of points.
It's also a fairly demanding task that gets lengthy as you add more and more points, which makes it a good candidate for optimization efforts. So our function there is pure base Python. It accepts a two-dimensional NumPy array, so basically a matrix of points, where each point is a row. Each point is made of different measurements, which correspond to the columns. We get the number of vectors and the number of measurements that we have there, and then we compute a distance matrix. First we start with just a bunch of zeros: a nested list with tons of zeros, square in shape, number of vectors times number of vectors in size. Then we go through each pair of vectors, which is why we have two nested loops, and inside that we go through all the dimensions and compute the squared difference between this dimension of these two points. Summing those and taking the square root gives us the value at position i,j of the distance matrix. And then we return this thing. So far so good. Yes, you can use the reactions to let me know if everything is fine at this point. Okay, right. So that's kind of our core task, and from there we'll see if we can go farther. The data that we'll be playing with is created this way: I have a number of vectors, 200, and each vector has a number of measurements, 100. So we have 200 points with 100 measurements each, drawn from a uniform distribution, so randomly between zero and one. It doesn't matter too much exactly what values we have at the moment; the shape of the data is what matters when it comes to performance.
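A minimal sketch of what this base-Python version might look like, along with the data creation (the function and variable names here are my own labels; the notebook's actual code may differ slightly):

```python
import math

import numpy as np


def compute_pairwise_distances(vectors):
    """Euclidean distance between every pair of rows, in pure Python.

    `vectors` is a 2-D array-like: one point per row, one measurement
    per column.
    """
    num_vectors = len(vectors)
    num_measurements = len(vectors[0])
    # Start from a square nested list full of zeros.
    dist = [[0.0] * num_vectors for _ in range(num_vectors)]
    for i in range(num_vectors):
        for j in range(num_vectors):
            total = 0.0
            for k in range(num_measurements):
                # Squared difference along this dimension.
                diff = vectors[i][k] - vectors[j][k]
                total += diff * diff
            dist[i][j] = math.sqrt(total)
    return dist


# Data shaped like in the notebook: 200 points with 100 measurements
# each, drawn uniformly between 0 and 1.
data = np.random.rand(200, 100)
```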
Right, so the first thing that we are going to look at is monitoring time. We also use that when generating some data for a slightly different problem that we'll tackle later on. For that one, I create two FASTA files. FASTA files are text files that contain sequence data, organized such that one line contains what we call a header for a sequence, then the next line contains the sequence itself, and so on and so forth. So I generate two files there, one of which is large, with one million random sequences of 500 nucleotides each. They are created randomly, and the content of this code is not really what is interesting to us. What is interesting is what comes at the top: what we call an IPython magic command. These are specific to IPython or Jupyter notebooks, which means that in plain Python code you will not be able to use them. They generally come in two flavors: either they start with a single percent sign, in which case they apply to a single line, or they start with two percent signs, in which case they apply to the whole cell. There are many built-in commands for many purposes. The one of interest to us here is %%time. Because it has two percent signs, it applies to the whole cell, and it measures the time that it took to execute that cell. So you see here, I launched that one, and it tells me that it took 2.23 seconds to run through. You also get some additional information: if you look just above, you see the CPU times, and you can see that the cell spent most of its time on the CPU and not much on system calls, that is, reading and writing. So that already lets us monitor how long it takes to run one cell.
So basically one task, if we want. Now, this works inside the Jupyter notebook, but of course you're not always using that. For whole scripts we usually use the built-in commands of your OS. We won't experiment with that, but I still give it to you here. On Linux and macOS it's usually the same: you just write time followed by whatever command you want, for instance time python myscript.py. On Windows you have the Measure-Command cmdlet, to which you give the command that you need to benchmark. So that's for when we want to measure a whole script at once. That can be useful, but oftentimes it's a bit too coarse-grained for what we want to do. That's why it's often useful to take your code, cut it into smaller pieces, and maybe put it in a Jupyter notebook to experiment, to test different functions or different implementations and so on, such as we do here. So for example, here is our compute-pairwise-distance function with our basic data, and when we use %time we can measure the time that it takes to execute this single line. We can contrast that with what I've just shown you above, where %%time applies to the whole cell irrespective of the number of commands in there. We can see that the results are relatively similar, but there is some difference, maybe on the order of 5%, so something that is not entirely negligible. This also depends on the number of commands, on the amount of data that you have to go through, on the time that it takes to run the command. Basically, the faster the command, the larger the uncertainty that you have on your measurement.
We can see that if we just run it on a smaller subset of data, only 100 points with 10 measurements each, it now takes only 110 milliseconds, but the variation is more on the order of 10 to 20%: I can go from 110 to 85 milliseconds. So which one should we take into account? And when we compare two implementations, we should make sure that we take this uncertainty into account. That's why we go further than just the %time command and use the timeit module, which tries to solve that. You can either use it as a Python module, launching it from the command line with this syntax, or you have the timeit magic command: %timeit or %%timeit, with a number of options. The idea is that it will execute a line of code many times and then report the result. If we just execute the default one, it looks like this. You see that it takes much longer to run, and what it reports is 80.7 milliseconds, plus or minus 4 milliseconds per loop, with 7 runs of 10 loops each. So what it has done there is say: there is some inherent variation when we run. So it runs batches of 10 executions and takes the best out of those 10, then repeats that 7 times, and uses those values to compute a mean and standard deviation for you. That's why we get this result here. You can also manually specify the number of repeats and loops performed, because in its default mode, the faster the command that you try to execute, the more loops and runs it will do; it adapts so that you always keep about the same level of precision in the report.
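Outside a notebook, you can get the same kind of repeated measurement from the plain `timeit` module; here is a sketch (the timed statement is just a stand-in computation, not the notebook's function):

```python
import timeit

# Stand-in computation to time: sum of squares of 0..999.
stmt = "sum(i * i for i in range(1000))"

# repeat=7, number=10 mirrors a "%timeit"-style "7 runs, 10 loops each":
# each of the 7 results is the *total* time for 10 executions of `stmt`.
raw = timeit.repeat(stmt, repeat=7, number=10)
per_loop = [t / 10 for t in raw]

# The minimum is the least-disturbed measurement: background OS activity
# can only make a run slower, never faster.
print(f"best: {min(per_loop) * 1e6:.1f} µs per loop over 7 runs")
```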
But then sometimes, if you want to measure something that is very fast, it will do so many repeats that the measurement itself actually takes a very long time. It's nice, but sometimes we just want a quick check, and it's faster to manually set the number of repeats, because then it will take less time. You get less fine-grained precision, but for quick checks that's often enough. Very often, the sort of performance gains that we want are not on the order of 10%; what we aim for is more like a speedup of two times, three times, 10 times, 100 times. At that level there is no use for all of this precision. So that's what we do with the -n and -r options; let's see what we get as a report. You see that now it runs much faster, that we have more uncertainty, and that we have done 10 runs of two loops each, matching -n and -r. Okay, now I have a little question for you: why should we take the best out of our loops? Why not the mean? Please write in the chat if you have ideas why, or if you think that this is a bad idea and that we should not do that. Okay, so Yerk proposes that longer times are probably caused by background operations on the OS side. Someone else says that maybe it's because Python may be caching our function. I would say that it's mostly Yerk's answer. The idea is that on your OS there is sometimes other stuff happening, and if a background process executes at the same time as the function you monitor, it artificially slows your function down. That is noise, if you will, that you would like to avoid. And that's why we run it several times and take the best.
So that we get out of this noise; and this best is of course itself repeated in order to get an idea of the variation around it, but that is the main reason. All right, so thanks to this we are now able to compare different implementations of our function. For that, I repeat the implementation that we had before, the pure basic Python version of the function, and I did the same thing using NumPy functions. NumPy is Numeric Python, a library that implements tons of operations on vectors and matrices, so vectorized operations, and it is supposed to be much faster than native Python for these very numerical tasks. So let's put that to the test. I get my number of vectors and measurements there. Then I create an empty NumPy matrix, which I will populate, and I still have my double loop there, this hasn't changed; but now, instead of going through each measurement with a for loop, I vectorize the operation: point i minus point j. I compute that difference, square each element, sum, and take the square root of everything, all with NumPy functions, and populate my matrix. Now, if we compare these two implementations using a 100-by-100 matrix of random values as input, we can see how long each takes. The native Python takes a bit of time: you see, 700 milliseconds, plus or minus 13 milliseconds, whereas NumPy takes 78 milliseconds, plus or minus 3 milliseconds. So we have a speedup on the order of about eight or nine, if I'm not mistaken. You see also that the fine-grained precision we might get from a large number of runs and loops is not 100% necessary there, because, at least at first, we are looking for super large speedups.
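The NumPy variant described above could look like this sketch (again, the function name is my own, and the notebook's exact code may differ):

```python
import numpy as np


def compute_pairwise_distances_numpy(vectors):
    """Same double loop over pairs of points, but the inner loop over
    measurements is replaced by vectorized NumPy operations."""
    num_vectors = vectors.shape[0]
    # Empty (here: zero-filled) square matrix to populate.
    dist = np.zeros((num_vectors, num_vectors))
    for i in range(num_vectors):
        for j in range(num_vectors):
            diff = vectors[i] - vectors[j]           # whole-vector subtraction
            dist[i, j] = np.sqrt(np.sum(diff ** 2))  # square, sum, sqrt
    return dist
```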
Later, maybe, when we have very efficient code and we are aiming for a smaller reduction, because we have already done most of what we can, then we will go for fine-grained precision, when we are just trying to grab a few milliseconds here and there. But at first you don't need that, at least most of the time. Okay, any questions so far? Is everything still making sense? Yes, all good. All right, is the NumPy implementation also making sense for everyone? I know that not all of you might be extremely familiar with NumPy itself, and maybe you don't necessarily know all of these functions, but is the concept okay for everyone? Okay, right. Otherwise, don't hesitate to ask me to repeat something. Then one thing that can also be useful: so far I run the measurement and show it on the screen, which is super nice, but if you want to compare tons of implementations, it's useful to keep the result in a variable. This is achieved by passing the -o option to the timeit command; you can then grab everything in an object, which I call here timeit_res. We can have a little look at it, because it contains all of the results. So you have a TimeitResult there: you see that it contains all of the results that you want, and typically you can access the average, but you can also access precisely how long each run took, and so on and so forth. Now there is a question by Kionhei, asking whether it is true that modules always run quicker than basic Python. That is not true. Some modules have been built explicitly to be super fast, in which case they will generally be faster than basic Python, but plenty of modules just reuse basic Python, and so they are not faster.
Also, some modules have been developed by people who did not care too much about performance, and sometimes they even perform worse than basic Python, because they did not always use the most appropriate functions or built-ins when developing their module. So that's not a general rule. With that being said, a lot of modules that were designed with performance in mind, and I'm thinking about NumPy, about SciPy, and maybe scikit-learn as well, are in general quite fast. You can create functions that are faster than what they propose, but it requires a bit of legwork; you'll see this afternoon that we actually can do something faster. All right, so then we have our timeit_res object. That's nice, because then we don't only show something on the screen, we can also play around with this data, monitor it, and so on. For example, one application that I propose: instead of just comparing two implementations and seeing how long they take, we can see how the execution time changes with the size of the problem. So we try to see how the execution time evolves with the data size. I test different numbers of vectors, applying the function to bigger and bigger datasets: a little for loop in which I create the data, grab the result, and add the timing to a little list. So let's execute that; it will take a few seconds. Meanwhile, there's a question by Elisa asking what type timeit_res is. It is a sort of custom TimeitResult type, if you will. It contains as attributes all of the elements: the average, the standard deviation, and so on. If we look at it, you see it's this TimeitResult type.
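The scaling experiment can be sketched with the plain `timeit` module rather than the `-o` magic; everything below (the helper `pairwise`, the list of sizes) is my own stand-in, not the notebook's exact code:

```python
import timeit

import numpy as np


def pairwise(vectors):
    """NumPy-flavored pairwise Euclidean distances (stand-in)."""
    n = vectors.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = vectors[i] - vectors[j]
            dist[i, j] = np.sqrt(np.sum(diff ** 2))
    return dist


sizes = [25, 50, 100]  # growing numbers of vectors
best_times = []
for n in sizes:
    data = np.random.rand(n, 10)  # n points, 10 measurements each
    # Best of 3 single executions, like grabbing the .best attribute
    # of a TimeitResult.
    t = min(timeit.repeat(lambda: pairwise(data), repeat=3, number=1))
    best_times.append(t)

for n, t in zip(sizes, best_times):
    print(f"n={n:4d}: {t * 1000:.1f} ms")
```

The collected `best_times` list is what you would then plot against `sizes` to see the quadratic shape of the curve.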
In there, there are the runs, the average, the best, the time it took just to compile the little function, the loop, and so on. All right, does that answer your question? Yes it does, thank you. Okay, so you see that it took a bit of time to create, but now we have kept all the timings across all of the different problem sizes, and we can plot them, and that is what we see: the number of vectors on one axis and the evolution of the execution time on the other, and you can see that the curve is not a straight line. Now a little question for you: what do you think about the shape of this curve? Is it what you expected? Take a little time to think about it and then please write what you think in the chat. So you wrote, basically, that you expect the time to grow with the square of the number of vectors, though there is a little bit of confusion there. I want to dispel that confusion, because two people mentioned an exponential growth. This is not an exponential growth. An exponential growth would be something that grows like a constant raised to the power of N, where N is the problem size. What we see here is N to the power of two: that's sub-exponential, a quadratic growth. It's maybe just a point of vocabulary, but I think that as scientists it's important that we see this difference, because a quadratic growth is much more controllable than an exponential growth, which grows much, much, much faster. When we think about our computational problems, it's very important that we keep this difference in mind. Okay, so as you've seen, indeed the growth here is not linear.
It gets steeper as the problem grows, and in particular the time evolves with the square of the number of vectors. The reason why is fairly easy to get if we look at the code: we have two loops nested inside one another. There is for i in range(num_vectors), and inside this loop there is a for j, and inside that we do something. So for each of the N vectors we do an operation, and for each of these we again go through all N vectors. So for each n, we go through each n again: we do n times n operations, hence the square. Right, is it clear to everyone why we have this quadratic growth? Okay, so let's spend a little bit of time on that. I think most of you already have the ideas in place, but it's nice to lay out the concepts, because this is something that you will see written here and there: the complexity of an algorithm, and here we are going to talk specifically about the time complexity of an algorithm. It is basically what we just did: given the parameters of my algorithm, I want to describe how long it takes as a function of these parameters, in very broad strokes. I don't care about precise measurements, mostly about the broad strokes of how the time scales. For example, if we take our algorithm: there are two for loops, both over the number of vectors, and then all of the operations that we do inside also depend on the number of dimensions. Because, if you remember, in the NumPy version we do a subtraction of two vectors, each of size number-of-measurements, so we need to subtract all of these numbers; and in the native Python we have a for loop over the number of measurements as well.
Both of these operations, this one and that one, of course won't take the same amount of time in practice, but they are both dependent on the number of measurements. Is that clear to everyone? Okay, so if we write our function again, it is as if, irrespective of which of the two implementations we use, there is a third loop that depends on the number of measurements. If we call this number of measurements M, our code depends on something that is N times N times M. Now, of course, we have seen together that the NumPy implementation is about 10 times faster than the basic native Python implementation, but both have what we would call the same complexity, which is usually noted with a big O: O(N × N × M). Basically, what we say is that for both implementations, if we double the number of vectors, we multiply by about four the time that it takes, and if we double the number of measurements, we double the execution time. So this gives us an additional dimension to think about when optimizing: sometimes it's not only about the time that it takes to do one specific task, but also about how a specific implementation or algorithm will see its execution time evolve when the problem changes. That is very important to keep in mind, because most of the time we do our little testing on smaller datasets, and being aware of the time complexity lets us make better predictions for when it comes to applying the code to our real dataset. This is also what lets you know that if, on your test dataset, it takes one hour, then, knowing the complexity, you can tell whether with the real data it will take four hours, or four days, or three weeks, or 500 years.
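You can make the O(N × N × M) bookkeeping concrete by counting the innermost operations instead of timing them; this toy sketch (my own, purely illustrative) shows the doubling behaviour described above:

```python
def operation_count(num_vectors, num_measurements):
    """Count one unit of work per (i, j, k) triple in the three
    nested loops of the pairwise-distance code."""
    count = 0
    for i in range(num_vectors):
        for j in range(num_vectors):
            for k in range(num_measurements):
                count += 1
    return count


base = operation_count(50, 20)            # N=50, M=20
print(operation_count(100, 20) / base)    # doubling N -> 4.0
print(operation_count(50, 40) / base)     # doubling M -> 2.0
```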
So it sometimes pays to just draw this little line and think about the sort of code that you use. During this course we are not going to delve super deep into algorithmics, but when trying to find alternative solutions for the same problem, it can be interesting to dig into that, because if you are able to reduce the complexity of a problem, you win big. For example, for these n-squared problems there are very often tricks to rewrite them with n log n complexity; I'm thinking for instance about the algorithms we can use to sort data, a classic algorithmic exercise. You obtain not only something that is faster for the same size, but also something where the difference between the two implementations grows bigger as the size of the problem increases. And suddenly, problems which were not tractable before become tractable. As I said, we won't see many examples of that here, because that belongs to an algorithmics course rather than a coding course. We will see, let's say, programming tricks to make the same code go faster, but it's still something interesting to keep in mind when you go and look at your code: see if there isn't a nested loop that is not really useful, some computation that you do several times for no reason, and things like that. All right, any questions so far on what we've seen and on this little blurb about complexity? No, everything okay? Cool, right. So let's continue a little bit. We have seen together how we can time a whole cell or a single line, but sometimes you have bits of code that intermesh different operations. Typically, take our large FASTA file: let's say we want to read it and compute the GC percent for each sequence line. The GC percent is basically: you've got some sequence of DNA.
It's made of ATGCCGTA, something like this, and the GC percent is just the percentage of Gs and Cs in it. So what we do there: I open the file, and for each line in the file, if the line doesn't start with this little chevron symbol, '>', which specifies that what follows is a sequence ID and not a sequence, then it means I have a sequence. I count the number of Cs and the number of Gs, multiply by 100, divide by the length of the line, and I get my GC percent. It takes me about 4.5 seconds there. But you could ask yourself: there are two operations intermeshed here, reading a file and computing a GC percent. So it would be legitimate to ask: how much time does it take to do this part, and that part? Of course you could refactor your code to split these two operations into separate steps, but this would very likely force you to store all the sequences in RAM, in the form of a list or something like that. If your file is very, very large, that might not be super nice, and sometimes not even feasible. So for this we need another trick, and the one that we use is very classical: the time module, in particular the time.time function. So I import time and then I call time.time(). What it gives me is a very large number: the number of seconds that have elapsed since what we call the epoch, that is, January 1st, 1970, which is, let's say, the accepted start time for all machines. And you can see that it returns it with a certain degree of granularity as well. So now what we do is keep a start time, in epoch seconds, and compare it with a stop or end time, again in epoch seconds.
The difference tells us how much time has passed. Applied to our code, it looks like this: I record what time it was when I started entering that code, and then the time after it finished. By taking the difference between the stop and the start, I know the total time elapsed running this bit of code. On top of that, inside the loop, I can take a start time just before the GC count and a stop time just after, and keep the difference in a running sum, such that each time we get there, I compute the time it took to compute the GC. Summing them all up gives the total time it took to compute all of the GCs across all of the lines, without counting the reading part. So I can fairly easily get the total time, the time it took just to compute the GC, and, by taking the difference, exactly how long it took to do the actual reading of the file. So there we go: it took about 5 seconds in total, 3.8 seconds to compute the GC percent, and about 1.2 seconds to just read the file. So we now have a small idea of what each operation costs, and we also know a little bit about what we could do. For instance, we know that even if we made the GC percent computation instantaneous, this code would still take 1.2 seconds to go through the file.
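Put together, the instrumented loop might look like this sketch; I write a tiny FASTA file inline so the snippet is self-contained (the real notebook uses a large generated file, and the filename here is my own):

```python
import time

# A tiny stand-in FASTA file: header lines start with '>'.
with open("tiny.fasta", "w") as fh:
    fh.write(">seq1\nATGCCGTA\n>seq2\nGGGGCCCC\n")

start = time.time()   # epoch seconds when we enter the loop
gc_time = 0.0         # running sum of the GC-only time
gc_percents = []

with open("tiny.fasta") as fh:
    for line in fh:
        line = line.strip()
        if not line.startswith(">"):  # '>' marks a header, not a sequence
            t0 = time.time()          # start time just before the GC count
            gc = 100 * (line.count("G") + line.count("C")) / len(line)
            gc_time += time.time() - t0  # stop time just after, accumulated
            gc_percents.append(gc)

total_time = time.time() - start
read_time = total_time - gc_time  # whatever is left is reading/looping

print(gc_percents)  # [50.0, 100.0]
```

For short intervals like these, `time.perf_counter()` is a drop-in alternative with finer resolution, but the `time.time()` pattern above is the classic one described in the text.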