Okay. So in this last notebook, we will see a short introduction to multiprocessing and multithreading with Python, using some of the libraries we have worked with so far. I will not give you an extensive course on parallelization; we don't have enough time for that, and we have courses dedicated to it. It's really a topic that deserves a lot of time by itself; SIB already has an OpenMP course, for example. We will only touch on multiprocessing and multithreading, not on MPI-style parallelization, which is an entirely different style and, again, a topic unto itself, with some very good Python libraries. That goes beyond what we are going to see today, but in case it is something you want to do, I heartily recommend mpi4py: a very nice library for MPI parallelization in Python.

This being said, let's get to what we are here for. I want to show you some cases where multiprocessing works well, but also some of the caveats, because a lot of people will tell you that multiprocessing simply speeds things up, and the truth is a bit murkier than that. There are catches you need to be mindful of, otherwise it can be very frustrating.

The first thing: in Python, we generally talk about multiprocessing rather than multithreading. Python is an interpreted language, and by design it has the GIL, the global interpreter lock, which blocks us from doing proper multithreading. That is why we turn to multiprocessing instead, where we start up new, separate processes on your computer and send each of them some computation to do.

Now, to know whether a job is parallelizable in terms of tasks, you have to check a few things. It is somewhat of an art in itself to see where and how tasks can be properly parallelized, but the smallest checklist is that your main task can be divided into subtasks which (1) do not depend on one another, (2) are very similar to one another, and (3) use independent parts of the data. Ideally all three should be true, but the first is the most important: if you have elements that depend on each other's results, you cannot really run them fully in parallel, because the parallel processes would need to communicate constantly, and each communication takes time and erodes the parallelism of the whole computation.

Let's consider again our small example of computing a simple integral. By default, the native version takes, let's say, about 200 milliseconds. As we have seen, what we are doing is a for loop, and inside this loop we apply the function f_native to different data points. The results don't interact with one another, except that at the end we sum them. That makes this a fairly good candidate for parallelization.
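To make this concrete, here is a minimal sketch of the kind of serial version being described. The names f_native and integrate_native follow the transcript, but the integrand (x**2 - x) and the Riemann-sum details are assumptions, not the notebook's exact code:

```python
import time

def f_native(x):
    # Assumed integrand; the notebook's actual function may differ
    return x**2 - x

def integrate_native(f, a, b, n):
    # Plain Riemann sum: evaluate f at n points, then apply the
    # normalization factor dx = (b - a) / n at the end
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        total += f(a + i * dx)
    return total * dx

t0 = time.perf_counter()
print(integrate_native(f_native, 0, 2, 1_000_000))
print(f"{time.perf_counter() - t0:.3f} s")
```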
A good paradigm for thinking about this, in particular with the multiprocessing library, is the map function: with map, you apply a function to every point in some set of data. So we would go from an expression like "for i in range(len(data)): result[i] = f(data[i])" to something like "result = map(f, data)"; you map a function onto all points in the data.

So let's first rewrite our problem in this form. I put it outside of a function just so it's a bit faster to write. We define a, b, n and dx; our data is now just the range of indices we had before; and our function is now slightly more complex, because we need to go from the index i to x and then apply the integrand, x squared minus x. Then we map the function onto the data to get the result, and at the end, once everything is mapped, we sum as before and apply the little normalization factor. And it's always good practice to check that we get the exact same thing. We do get something very slightly different there; I think it is just a numerical difference. But that's the idea.

Once we have the problem in the form of a map, it is fairly easy to apply multiprocessing to it, because the multiprocessing library offers exactly this concept: you create a pool of processes and then map a function over some data using that pool. So the multiprocessing version of the code above looks like this: I import multiprocessing, I start a pool of, here, just two processes, and I call pool.map, which spreads the data and the function onto the different processes I just created. At the end I sum and multiply to get my final result, and you can see that I get the same value.

Now, of course, the big question: how long does it take? Is it much better? One could say: we have two processes, so it could be twice as fast. That's our hope, our best-case expectation. But what we see here is actually not so good; it's even worse than I expected. In an earlier run, I got something like 130 milliseconds for the non-parallelized version and something closer to 100 milliseconds for the parallelized one: not divided by two, but still slightly better. Here we are actually worse, and depending on the precise implementation and how happy your computer feels at that moment, you may see slightly different numbers. This is because starting the processes, disseminating the data, and gathering the results back all carry overhead, and here the time gained from the parallelization is less than the time lost to that overhead.
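Here is a sketch of both steps, the map rewrite and the pooled version. The module-level function f_point, the bounds, and the __main__ guard (needed on platforms that spawn new interpreters rather than fork) are my additions, not the notebook's exact code:

```python
import multiprocessing as mp

A, B, N = 0, 2, 1_000_000
DX = (B - A) / N

def f_point(i):
    # Go from the index i to x, then apply the integrand (assumed x**2 - x)
    x = A + i * DX
    return x**2 - x

if __name__ == "__main__":
    # Map form of the serial loop: apply f_point to every index, sum, normalize
    serial = sum(map(f_point, range(N))) * DX

    # Multiprocessing form: a pool of two worker processes does the mapping
    with mp.Pool(2) as pool:
        parallel = sum(pool.map(f_point, range(N))) * DX

    print(serial, parallel)  # should agree up to floating-point noise
```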
If you don't know about this, it can be surprising and frustrating: you give more resources to your system and it performs worse. That's not what you want. The problem here is that we are spreading out a lot of super tiny tasks. What we ask each process to do per mapped item is ridiculously fast and small, so very little time is spent computing inside the processes, and a lot of time is spent starting them and shipping data to and from them. It usually works better if, rather than thousands of very small tasks, you have a few long tasks. That is when we start to see better performance with this library.

So let's see. For example, let's say our long task is to run integrate_native on 400,000 points: a subset of what we need, but large enough that it takes around 80 milliseconds, so 0.08 seconds, much longer than before. That is our task. Now say we have 100 such tasks to perform, so that all together they cover 4 × 10⁷ points. (Of course, you would adapt this so that the load is spread evenly over however many points you want to run.) Our serial execution should then be about 100 times that, so roughly eight seconds. I now realize I should have started running this much earlier... there we go. And now, if we run the multiprocessing version, we should actually gain some time. Fingers crossed... 6.46 seconds. So we have indeed gained time: we should have spent eight seconds computing, it was spread over a pool of two, so roughly four seconds of computation, and the extra 2.46 seconds or so is the overhead. My overhead here is bigger than usual; normally it is more like 0.45 seconds, not 2.45, and I think my computer is just unhappy about Zoom doing the recording and so on. But now we can start to see some improvement.

From there, we can also play with the number of processes we allow and see what sort of speedup we get. This will of course depend heavily on what your computer is doing right now and on the number of threads your CPU allows. Here I know that Zoom is permanently taking at least one and a half CPUs, so I don't know exactly what performance I will get when I allocate more threads than there are CPUs available.

There is a question from Alessia: when we call mp.Pool(2), is the data split into two lists? Not really, no. That statement opens and manages a certain number of processes, and while you are inside the with statement, you have those processes available to you. Does that clarify it a bit? "I was confused by the fact that task takes an integer as input while data is a list." Sorry, could you repeat that? Your sound was not too great. "Sorry, my sound is very bad." Okay. To answer: the data ends up spread among the different processes that are started, but the list itself is not literally split in two or anything like that.
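Coming back to the code, here is a sketch of this "few long tasks" layout, under the same assumed integrand; task, POINTS_PER_TASK, and the block splitting are illustrative names, not necessarily the notebook's:

```python
import multiprocessing as mp
import time

A, B = 0, 2
POINTS_PER_TASK = 400_000
N_TASKS = 100
N_TOTAL = POINTS_PER_TASK * N_TASKS   # 4e7 points overall
DX = (B - A) / N_TOTAL

def task(k):
    # One long task (~80 ms): integrate the k-th block of 400,000 points
    start = k * POINTS_PER_TASK
    total = 0.0
    for i in range(start, start + POINTS_PER_TASK):
        x = A + i * DX
        total += x**2 - x
    return total

if __name__ == "__main__":
    t0 = time.perf_counter()
    with mp.Pool(2) as pool:
        # Only 100 items cross the process boundary, so communication
        # overhead stays small relative to the compute time
        result = sum(pool.map(task, range(N_TASKS))) * DX
    print(result, f"{time.perf_counter() - t0:.2f} s")
```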
All right, so there we see what sort of performance we have. For the moment, adding processes does increase performance, but it comes at a cost: we get something better, but not twice better. Going from one process to four does not divide your compute time by four; you may have to spend several times the resources, here it looks more like eight times, to actually halve the execution time. There is always this overhead, and that's why it pays to think a little about what your tasks are and how you spread them. As I said: in general, think longer tasks; subdivide your work into longer subtasks and you will usually get better performance out of multiprocessing.

Alessia's question also arrived in the chat, about being confused that task takes an integer as argument while data is a list; I hope the answer above covers it.

Then a question from the chat: can multiprocessing run on GPUs? No, it cannot. To run something on a GPU with Python at the moment, I think the simplest is to use a subpart of Numba, numba.cuda, but then you have to rewrite and rethink exactly how you send data to the GPU, get data back, and so on. We will give a course at SIB on this in February; we don't know exactly when, but it should be announced within a month or two, so keep an eye on the SIB training mailing list and you will see this GPU-coding course appear.

Then from Qionhei: "I tried a pool of three and it errors; does it have to be even?" I don't think so. Let me try: you see, I can just put three here, and I don't get an error, so I don't know why it fails for you. Maybe paste the error you get into the chat so we can help.

And a message from Jörg: no question, he just wanted to say thanks because he has to leave. Thanks a lot then; I hope you had a good day, and I wish you a good evening. Bye!

All right. So now I've given you the theory, shown you briefly how this works, and given you an idea of the little catches you have to take into account. Now it's your turn to practice. If you go back to our exercise notebook, you have our little function there; try to rethink it so that you make it properly parallelizable, using this idea that we want large tasks. Because what I've done here is mostly a proof of concept to measure timings; it does not actually compute the integral properly. So you have to leverage the same idea, but do the work of returning the proper result.

[Exercise break.] Okay, I'm going to resume the recording. So, let's see what we wanted to do there. We have a task with our two functions, and we have already seen earlier in the course that if we just apply multiprocessing without thinking too much, we get worse performance.
The reason we get that is that each task here, which corresponds to a single call of f2, is too small. We get better performance with bigger chunks, as we tested above. So my idea, if you will, is this: say we have four processes; then let's cut the data into four chunks, and each chunk is one task. That way the tasks are as big as they can be for X processes.

So first we need to do some coding: a little function that cuts our data into X parts. It takes the total number of tasks and the number of processes we have, and it cuts everything into sub-parts, with a little bit of smartness to account for imbalance. We may have, say, 1000 tasks but only three processes, and 1000 is not divisible by three, so there will be a very slight imbalance, and we want to take care of that. What I do is create nested lists, and then, for i in range(n), I append i to sublist number i modulo the number of processes. Modulo is the remainder of the integer division: with two processes, if i is 3, the remainder is 1, so it goes to sublist 1; if i is 4, the remainder is 0, so it goes to sublist 0; and so on. This ensures that all chunks get a near-equal number of tasks. (A minimal sketch of this follows below.)

Then, of course, you test it and write your little functions. First I get my data parts. Then the task is to apply f2 not to a single point but to a set of points: my function task takes some data (a chunk of indices), a (the starting point), and dx (the spacing between two points), and for each index in the data it computes x from a and dx and applies f2 there. If I try it with n = 10 on two processes, I get my data parts, and they are basically [0, 2, 4, 6, 8] and [1, 3, 5, 7, 9]: both chunks of equal size. So far, so good. Everything clear? Yes? Okay.

There was also a comment from Alessia, who creates the chunks with np.arange instead; I gave it a thumbs up because that is also perfectly fine, another way to go at it, depending on exactly how you want to do things.

Now we want to apply this to a large amount of data: 2 × 10⁷ points, so twenty million. Again, when we have bigger chunks we have bigger tasks, and the benefit starts to outweigh the overhead of starting multiple processes. Here is how it works, first manually for just two processes: a = 0, b = 2, plenty of points, dx; you get your data parts; then I compute task on the first part and task on the second part by hand; and then you need to sum. That's what you have to do.
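Here is a minimal sketch of that chunking function and the per-chunk task; make_chunks is an illustrative name, and f2 is again an assumed integrand:

```python
def make_chunks(n, n_proc):
    # Spread task indices 0..n-1 over n_proc sublists via i % n_proc,
    # so any remainder is distributed as evenly as possible
    chunks = [[] for _ in range(n_proc)]
    for i in range(n):
        chunks[i % n_proc].append(i)
    return chunks

def f2(x):
    # Assumed integrand; the notebook's actual f2 may differ
    return x**2 - x

def task(data, a, dx):
    # Apply f2 to every index in the chunk, reconstructing x from a and dx
    return sum(f2(a + i * dx) for i in data)

print(make_chunks(10, 2))   # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```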
So indeed, there our data parts are the two chunks. Remember, our reference sum is always this little number there, and you can see that my two chunks give the result and the right sum. Of course, now I want to make a small adaptation: to be applied with map, I make a specific version of task with a and dx fixed, where I use the global definitions of those two variables. Relying on globals is not always the best of practice, but for a specific case like this, where I need a function that takes a single argument, because that is what map requires, it's what I will do. (An alternative is functools.partial; note that a plain lambda will generally not work with Pool.map, because lambdas cannot be pickled and sent to the worker processes.)

And then this is what happens, first without any multiprocessing, then with multiprocessing for different numbers of processes: I have my with statement to start the processes, I get the data parts, and I time the mapping. The only thing remaining after that would be the summing, but that is nearly instantaneous, since it's just summing a handful of numbers. So there we go: without multiprocessing, 3.5 seconds; then you see the overhead of merely being in the multiprocessing framework; and then, boom, boom, boom, at 4 processes I get slightly better performance. If I use the results sent to me by Robin, who does not suffer as I do from hosting the Zoom call, you get about six seconds without any multiprocessing, and the same with one process; and as soon as you have two or more, you see much better performance; the gain is much, much clearer. It's really my case that is skewed, because I'm also hosting Zoom and a bunch of other things.

So I hope this helps you gain some understanding of multiprocessing and how to actually make it work for you. The secret is really this chunking concept; that is what makes it work in practice, because otherwise it can frankly be a bit frustrating. Are there any questions? Is everything still okay? I know it's getting a bit late and there are quite a lot of new concepts, but hopefully it all still makes sense. Yes? Okay.

Then let's see another way of doing parallelization in Python, which is maybe even a bit nicer: again with Numba. With Numba we can do the compiling, as before, but then also some parallelization. It works best if we do super simple things; on slightly more complex code it can break down a little, because Numba being Numba, it sometimes has a few troubles, but for simple things it is super nice. So first, I import njit from Numba and take my integrate function, the same as before. Then I make a Numba version of integrate: njit is shorthand for jit in nopython mode, and I do the same thing but set parallel=True, so that Numba will try to apply its magic to understand what can be parallelized.
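Here is a sketch of what that looks like. Note that I have written the body with NumPy array operations, which Numba's auto-parallelizer handles reliably; the notebook's own version may be written differently:

```python
import numpy as np
from numba import njit

@njit
def integrate_numba(a, b, n):
    # Same Riemann sum as before, compiled in nopython mode
    dx = (b - a) / n
    x = a + np.arange(n) * dx
    return np.sum(x**2 - x) * dx

@njit(parallel=True)
def integrate_numba_parallel(a, b, n):
    # Identical body; parallel=True lets Numba split the array
    # expression and the sum reduction across threads on its own
    dx = (b - a) / n
    x = a + np.arange(n) * dx
    return np.sum(x**2 - x) * dx

# First calls compile the functions AND let us check the results agree
r1 = integrate_numba(0.0, 2.0, 10**7)
r2 = integrate_numba_parallel(0.0, 2.0, 10**7)
assert abs(r1 - r2) < 1e-6
```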
When we try it, first off, we want to check that it gives us the same results. And a little trick of the trade: that first call is also what triggers the compilation, so by checking the result I compile at the same time, and I won't pay that cost when I time things later. Next I can time the different versions. The native one takes 77 milliseconds. The plain Numba one... actually takes more time, surprisingly; it has a bit of trouble optimizing this one. But the parallelized version, you see, is much faster: almost 10 times as fast, because I think it is using all the threads it can on my computer. I have 16, with a few of them used by Zoom, and apparently Numba is much better able to use them than multiprocessing was.

The reason is that Numba tries to understand what to do, and it understands that I am doing a reduction here, so it parallelizes it. And where multiprocessing uses a system of processes, because with the GIL it cannot do real multithreading and has to hack around it, Numba compiles down to actual native code, and at that level you can do real OpenMP-style threading. The communication between the different threads is handled much more efficiently, and that's why it generally has much better performance. So when you do things that Numba can recognize as parallelizable, you can get super nice performance.

Now, that is when everything goes nicely, but sometimes it's not that simple. Let's come back to an example of basic Python code for which we know that Numba usually works well. I write the super simple native Python version, then the njit version, then njit with parallel=True, and we see what we get. We get the proper result, but also a little warning from Numba saying: you gave parallel=True, but I was not able to transform this into something parallel, so I defaulted back to the non-parallel version. If you are curious, you can turn on the parallel diagnostics to understand where this happens; I won't do that here, but it's fairly simple to check, because it tells you where it tried to parallelize and where it failed. Here, it failed everywhere.

That's when you have to become explicit about where you would like the parallelization to occur. Here it's actually fairly simple: you want it to happen on this for loop; you would like the operations in there to run in a parallelized fashion. So you rewrite your code a little to specify that to Numba: you import prange, for parallel range, with from numba import prange, and you just change this range to prange. Then Numba says: okay, now I know how to do that, and I will do so. Of course, if you've messed something up, for instance if iteration i depends on iteration i-1 or i+1, you have real parallelization problems, and it will be harder to solve. But here it's fairly simple, and if we time it, we see that the Numba f2 got much faster, because the code is simple.
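A minimal sketch of the prange version; the function name, signature, and integrand are again my stand-ins:

```python
from numba import njit, prange

@njit(parallel=True)
def f2_sum_parallel(a, dx, n):
    # prange tells Numba explicitly that the iterations are independent;
    # it also recognizes the += accumulation as a safe parallel reduction
    total = 0.0
    for i in prange(n):
        x = a + i * dx
        total += x**2 - x
    return total * dx

print(f2_sum_parallel(0.0, 2e-7, 10**7))  # first call includes compilation
```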
And the parallelized version with prange is better still. It's not 16 times better like the one we had before, but you do get a good gain at the minimal cost of just switching a range for a prange, so it's not too much work. Sometimes Numba does this automatically; sometimes you have to specify it manually. If we increase the amount of information a little and compare the version it parallelized automatically with the one we parallelized manually, you will see that they have fairly similar performance. The automatic one is a bit better, maybe because it was able to understand the structure of the parallelization a bit better, but it's very, very close. So it's really not rocket science: you have this parallel=True argument, it's usually fairly smart, and when it's not, you can help it out with prange. And that is again one of the beauties of Numba: when you are working on simple enough code, it's very simple to get a big boost in performance.

One thing that might be interesting: by default, Numba will use all the threads it can, but you might want to tweak that to keep tighter control over the number of threads it uses. For that there is numba.config.NUMBA_NUM_THREADS; you can see that by default it will use eight threads here, and I can use the function set_num_threads to tune that. With the default number of threads it took 35 milliseconds to run through the data, and as I change the number of threads, the performance changes. Again, it is not a simple division or multiplication by two; it's a bit more complex than that, because there is some overhead involved.

That's most of what there is to know here. Of course, you have the documentation with a few more details, because as you play with multiple functions that may have to respond to one another, you sometimes want a parallelization scheme with some amount of communication, but not too much, and then you have to go into the details and get your hands a bit messy. For a simple enough use case, though, this is usually more than sufficient. Are there any questions on this, on parallelization with Numba, at this stage?

Okay. Then Bush has to leave, so bye, Bush! We have a little bit of time, two minutes remaining. You have here some additional material: for instance, the parallelization of our pairwise-distance function, both with multiprocessing and with Numba, where you can see how this works and why it works better. I won't spend time on that, but feel free to go through it later on your own time. Otherwise, I will close up the session. I hope that throughout the day we have been able to show you a number of considerations, but also a number of tools and ways to take the code you currently have, understand how long it takes, and see where you might want to spend some time optimizing it. Don't start optimizing blindly: really do some profiling first, to determine where your effort might be the most rewarding.
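For reference before wrapping up, here is a minimal sketch of that thread-count control. set_num_threads, get_num_threads, and numba.config.NUMBA_NUM_THREADS are Numba's actual public API; the work function is just a toy load of my own:

```python
import numba
from numba import njit, prange, set_num_threads, get_num_threads

# How many threads Numba detected and will use by default
print(numba.config.NUMBA_NUM_THREADS)

@njit(parallel=True)
def work(n):
    # Toy parallel workload, just to have something to time
    total = 0.0
    for i in prange(n):
        total += i * 0.5
    return total

work(10**7)               # warm-up call, so compilation is not timed
set_num_threads(4)        # cap the pool at 4 (must not exceed the default above)
print(get_num_threads())  # -> 4
work(10**7)               # subsequent calls use at most 4 threads
```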
And then we have tried to show you different methods for doing this optimization, at least at the level of the libraries, though that should not dispense you from thinking about the structure of your algorithm. We have shown you that these tools work very well, but they work within their framework; when you try to go outside of that framework, the performance you see may vary and may not be as beautiful as what the documentation sells you. We have also briefly discussed multiprocessing, and some of the little caveats you have to take into account to actually make that method work. I hope you learned a lot of things, and that you will have plenty of opportunities to practice them later on in your own Python playground. The only thing that remains, I think, is to thank you all, and to thank Robert for his help all throughout today.