Are we ready? Yeah. Okay, so the first topic is parallel programming, and that's an interesting topic to talk about in Python. You'll see why in a moment. The main thing in Python is that you're often using Python to connect packages written by someone else, possibly written in other languages. So the first thing, when you start considering making your program faster, and especially if you start thinking about making it parallel, is to check whether the libraries you're using are already parallel or not. That's probably the most important point of this lesson. If you come away with just one thing, that's it. We will also talk a bit about how parallel programming actually works, what it actually does, and how it can, or maybe cannot, make your code faster. So maybe I could ask: what do we mean when we talk about parallel programming? Is it like two people on one keyboard? Right, that's called pair programming, which is a slightly different thing, although it's actually a pretty good metaphor for what the computer does when you run a program in parallel. If you are writing some code and you want it done twice as fast, hiring a second programmer will make it faster, but probably not twice as fast. So basically, the starting point for thinking about making your code parallel is that you want to make it run faster, because it's taking too much time for one reason or another. If your program is fast enough, you don't need to make it parallel. And would you say that the speedup comes from utilizing multiple processors in one computer?
Yes. Parallelization means you try to make the program faster by using multiple computers, or the multiple processor cores that most computers have these days. If you have a laptop or desktop, it has four, six, eight, maybe twelve cores that sit on the same chip but run calculations independently of each other at the same time. Making your program parallel means all of those cores are doing something at the same time, so in parallel. The first thing, though, if your code is too slow, before going into parallel programming, because that can quickly become complicated, is to check why your code actually is slow. Where are the slow spots? Use some profiling tools, think about whether you can use existing libraries or somehow make that part faster, then profile again, see what part is slow now, and so on. When that no longer helps, you may think about parallel programming. There are essentially two different modes of parallelism that people generally use, and you can think about whether one or both of those will actually work in your program. The first thing to check is whether there is something that really needs to run sequentially: if you have to know the result of the previous computation before starting the next one, you can't do those two things at the same time, so that part is not parallelizable, and you have to think of something else. But there will probably also be parts of the code where two calculations can be done at the same time, and those you can parallelize. And would you say that, in some cases, let's say you want to run the same code for multiple data sets or something like that? Yeah, that's one example.
Maybe you want to run the same code for multiple different parameter values, or you have a bunch of files that you want to process. That's an example of something embarrassingly parallel, which is an interesting expression, but it is the official term these days. The reason it's "embarrassing" is that you don't need to do anything to make it parallel: you just run multiple copies of the program on those different processors or computers. You can do that by running the command multiple times, or you can program it in some fancier way that starts multiple copies of the program; we will see ways of doing that. I would say the embarrassingly parallel case is usually this: if you notice that your code has a for loop at the outermost layer, and inside the for loop you run a model or something, you know that you can basically take that for loop outside of the program and just run multiple copies of the program instead. That's why it's barely even called parallelism; it's so embarrassingly parallel that you don't need any fancy structures to make it happen. The main thing about the embarrassingly parallel category is that the different processes, the different copies of your program, don't need to communicate at all. They just do their own thing, and when one finishes, the others do not care. That is the most common situation; it's what you will usually end up doing: running the same thing many times. But if that's not possible, there are two other options. There's multi-threading. Multi-threading means the copies run on the same computer, inside the same process, as threads, and the important thing is that they can share memory.
When you have a variable in the program's memory, the other copies, which are really called threads, have access to that memory. That makes sharing information between them easier. Then there is message passing; MPI, the Message Passing Interface, is the common term and a common framework for this. It means you have a bunch of independent processes, possibly running on different computers, that send messages to each other over some kind of network. In Python, you also have multiprocessing. What that basically means is that Python launches multiple processes, multiple Python interpreters that each run stuff, and they communicate by serializing objects and passing them between the processes. That is a Python-specific thing, it's important, and we'll talk about multiprocessing in a moment. Let's go and check the first example of parallelism. Okay, if you go down a bit, there's the multi-threading example. This is still relatively straightforward, especially in Python, because essentially Python itself doesn't do multi-threading, but the libraries written for Python can. If you're using NumPy, SciPy, Pandas, and so on, they will be using libraries in the back end that are already multi-threaded. So you don't need to do all that much: just make sure you're not running a for loop in Python, but rather using a NumPy operation if you can. And there are complicated enough NumPy operations that you can do most things without a Python for loop.
For the technical side, for people who want to know how it works: NumPy uses lots of libraries internally, like linear algebra libraries. If you think about what a function like "sum all values in this array" does, there's a for loop inside many of the NumPy functions, and those for loops have been threaded in NumPy itself. If you tell NumPy to use multiple processors, it will try to run those internal loops in parallel whenever you call a NumPy function, and it all works behind the scenes. The only thing you need to set is the `OMP_NUM_THREADS` environment variable, which is a pretty cryptic name, but basically it tells NumPy's back end how many threads to use. There are also `MKL_NUM_THREADS` and `OPENBLAS_NUM_THREADS` as related options, depending on which back end your NumPy was built with. Usually these are already set so that you don't need to touch them; NumPy will just use multiple threads. You will see that if you run NumPy and check your processor load: the Python process will be using something like 400 or 800 percent of a CPU, which means it's already running on multiple cores. Threading is also common for web applications and that sort of thing, but since those are not scientific computing, they are not so relevant here. Python has lots of these async things, but we won't go into them; if you see parallelism done that way, it's mainly for web applications. One big thing worth mentioning, if you are trying to do multi-threading on your own: I said Python doesn't do it. That was a choice made when the Python language was developed. There's something called the global interpreter lock (GIL), and it means that there can only ever be one thread running Python code at a time.
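As a concrete sketch of these settings (`OMP_NUM_THREADS`, `MKL_NUM_THREADS`, and `OPENBLAS_NUM_THREADS` are the real variable names; the thread count of 4 is just an example), you can set them from Python, as long as you do it before importing NumPy:

```python
import os

# These must be set BEFORE NumPy is imported, because the backing
# libraries read them once when they are loaded.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"

# import numpy as np   # import only after setting the variables
```

Setting them in the shell instead (`export OMP_NUM_THREADS=4`) before starting Python works the same way and avoids the import-order pitfall.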
So if you want to have multiple processes running your Python code, you actually need to start multiple Python interpreters, which is exactly what the next thing does. The multiprocessing module is a Python library and a way around this issue. If you find that the libraries you are using are not multi-threaded, so you're not doing your calculations in NumPy, or you are doing something more complicated in Python and you know you need to split it into multiple processes, multiprocessing is a way to do it. There's also a bunch of nice libraries built on top of multiprocessing that might be more useful; we'll write a list into the notes in a moment. But here we'll just use multiprocessing directly. Let's run something here and see what happens. So we have a function that, well, maybe you can explain while I write. This is a very simple function: it just calculates the square of the number you put in. It's a pure Python function on purpose; we're pretending it's something we cannot do in NumPy directly. And then map will run that function on every number in the list you give it. You notice it gives you the square of every number: one squared is one, two squared is four, and so on, six squared is 36. That was really fast because it's a small list, but it's running pure Python code, so for a big list it would be a lot slower than, say, NumPy. For people who haven't used functional programming, this might look pretty strange, but it's basically a for loop in a small space: it applies the same function to every element in the list. A more Pythonic version might be something like this.
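A minimal sketch of the serial version being described (the function name `square` follows the lesson's example; everything here is standard Python):

```python
def square(x):
    """Square one number -- pure Python on purpose."""
    return x * x

numbers = [1, 2, 3, 4, 5, 6]

# map applies the function to every element; list() forces evaluation
squares = list(map(square, numbers))
print(squares)   # the squares of 1..6

# the more Pythonic equivalent: a list comprehension
squares_too = [square(n) for n in numbers]
```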
Yeah, okay, that's the same thing; a list comprehension is basically the map function written differently. But using map makes sense when we go to the next example: the reason we demonstrated map is that multiprocessing comes with its own version of the map function. So you import Pool from multiprocessing, and Pool contains a map function. Oh, okay, it's `from multiprocessing import Pool`, not `with`. I was looking ahead already. Then you need to create a pool. The reason it's called a pool is that it gives you a pool of worker processes that you can use to run stuff, so you can run the square function with multiple processes. It didn't actually return anything; you probably need to store the result somewhere. Yeah, okay. So this does the same thing, and you didn't see a speedup, of course, because it's so fast anyway; with a much bigger list you would see it. What the multiprocessing pool does is take the list, split it up between all the available worker processes, and run the function on each piece, so all of those workers do some part of the work. And in general, here we give it a list, but map works on any kind of iterable. In Python you see a lot of these iterables; a list is an iterable, but you can have other things as well, like iterators. What map does is take items from the iterable and run the function on them. If you use the normal map function in Python, it does it one by one, in one process. But here we have a pool of worker processes.
Usually the number of workers is automatically set to the number of processors in your computer; of course, you can set it yourself. So you have a number of worker processes, we read from the iterable, and each item is handed to a worker. One at a time? Well, really many at a time. And the multiprocessing library then collects the results back in the same order, so you get the same correspondence between input and output. But basically, now you do the mapping in parallel. Okay, so let's go to the exercises. There's exercise one, where you use multiprocessing, and then you can also move on to exercise two, which is more of a discussion about running on a cluster. A cluster is a supercomputer: a system with a lot of computers in a fast network. But mainly do exercise one, and if you have extra time, take a look at exercise two. We'll have 15 minutes for it. During the exercise we'll try to answer any unanswered questions in the notes, and if there's anything interesting, or any especially good questions, we'll bring them up afterwards. Okay, that's it for now. See you in 15 minutes. Bye. Bye. So yeah, there was one more thing we wanted to mention before MPI, which is that a lot of the libraries we mentioned in the beginning have some sort of parallelism built in, usually multi-threading. Often it just works out of the box, but there are also ways of setting the number of workers and controlling the parallelism through some settings. Do you have an example? Yeah. It's usually a good idea to use these: the developers of the packages have most likely tested that this parallelism actually speeds up the code.
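One concrete shape such a setting takes is scikit-learn's `n_jobs` parameter, discussed next. This sketch assumes scikit-learn is installed; the estimator and synthetic data are just an illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# a small synthetic classification problem
X, y = make_classification(n_samples=1000, random_state=0)

# n_jobs=-1 means "use all available cores"; the parallelism is
# handled internally (via joblib), not by our own code
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```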
So instead of bolting some parallelism on from the outside, using the method the developers intended is most likely the most efficient way of getting parallelism. For example, a library you might encounter is scikit-learn, if you're doing machine learning or data analysis and fitting models. Let's go to the user guide. This is what I usually do when people ask about parallelism: I open the package's documentation, press Ctrl-F to search, and search for "parallel", and here there's a whole page on parallelism. It mentions that scikit-learn uses joblib, which is built upon multiprocessing; it's a nice library, and we'll mention other tools like it in the notes after the MPI session. Basically it says there's an `n_jobs` parameter on estimators that you can set, and then it uses parallelism. There's also higher-level parallelism and the OpenMP stuff, and they explain various ways to get the best performance. So it's usually a good idea to check the user guide for any mention of parallelism; "parallel" is the magic word that can usually be found in packages' documentation. Another thing that's good to mention in the notes: at least one person got an error from multiprocessing, the library, saying that it cannot pickle some object. That happens because, as we mentioned, Python doesn't allow multiple threads to run Python code at the same time, so the processes cannot simply read the same memory. What multiprocessing does to get around this is essentially serialize, pickle, the function and everything needed to run it, start another Python process, and send the pickled data over for that process to read.
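That pickling step is where the "cannot pickle" error comes from. A tiny stdlib-only sketch of the restriction: a top-level function pickles fine, while a lambda (or a locally defined function) does not, which is the same failure you hit when passing one to `Pool.map`:

```python
import pickle

def works(x):
    # a top-level function is pickled by reference, so this is fine
    return x * x

roundtripped = pickle.loads(pickle.dumps(works))

# a lambda has no importable name, so pickling it fails
try:
    pickle.dumps(lambda x: x * x)
    failed = False
except pickle.PicklingError:
    failed = True
```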
The data doesn't necessarily touch the disk, but everything still needs to be picklable, and that causes some restrictions, so sometimes things just don't work. There are ways of getting around this; you can read the instructions or ask people, but it gets a bit complicated. Yeah. And in general, say you have a function that does really complex things: if you want to run it in parallel, all of the different parallel processes need to know about the function and everything it depends on. If it uses global variables or whatever, all of that needs to be transferred to the other process. So the more complex the thing you parallelize, the harder it usually is to parallelize. It's usually a good idea to keep it simple: have a simple function that gets executed in parallel, and then return to the bigger program. Basically, if it's a method of a class, it needs the entire class; if it uses a global variable, it needs the entire file; and if you're calling multiprocessing from that same file, it may just fail. One way around this is to move the function you're running into a different file; often that helps, but sometimes not. So just be aware that it gets complicated, and try to use the parallelism inside existing libraries rather than writing your own, if possible. MPI, though, is usually not built into the libraries, for a very important reason: it always needs to be launched with the mpirun program, which sets up the environment for the processes. What MPI does is run completely separate processes and then tell each of them some information about the others so that they can communicate.
The processes can be running on completely different computers, as long as they know the IP address of the other computer and there's an Ethernet cable or Wi-Fi connection or something so that they can communicate. Of course, that would be quite slow if you send a lot of information back and forth, but in principle it's something you can do. MPI is a very different paradigm, because you just run copies of the program: all the copies run everything from the beginning, all the copies run all of your code. That's the main difference. Whereas multiprocessing takes just the function, serializes it, and runs it elsewhere, with MPI all of the copies run everything unless you specifically tell them not to. There are other frameworks that work in a similar way: they set up some network configuration and do their own coordination. For example, PyTorch has torchrun, and there's Ray, which is a parallel library as well. Many frameworks do this kind of thing: everybody starts, they find out where the others are, and they communicate in some way. In MPI's case, MPI handles the "who are you, where are you" part; in many other cases you might have a master process and workers. MPI is kind of the lowest level, the basis: just as multiprocessing is the basis of a lot of libraries, MPI is the basis for a lot of stuff too. So we'll just show a quick example. This is something you may not be able to run, because it requires installing not just a Python library, but MPI itself as a system-level library; it's not written in Python. It's possible to do with conda, but we don't assume that you have done that. You can try; maybe it works for you, maybe not. Okay, so what Simo is doing here is first importing mpi4py.
And then there's some setup code that I could explain in detail, but the main thing is that you get the size, which is the number of processes running this program, and the rank, which is the unique identifier of this particular copy: a number from zero to size minus one. That's really what you have to work with: a single number that identifies this particular copy. And then there are MPI functions that allow you to send information to other copies; you have to know the rank of the other copy. From these building blocks you can build a lot, of course, but it takes a bit of work. Okay, so let's just try this. This example just prints the rank, the identifier number, and the number of processes. To run an MPI program, you need to use mpirun, and you can give it the number of processes. Okay, well, I gave it the wrong name. Oh, yes, you need to give it the correct program; if you try to run the wrong program, it will not run as well. Yeah, that's a good hint. Okay, so it ran two copies of the program: one of them has rank one, and the other has rank zero. If you were, for example, doing what we did previously, splitting a list between multiple processes, you would usually set it up so that process number zero has the list, and then you actually have to manually write the code that sends parts of the list to the other processes. Let's just run the example in the notes. It has all the stuff we just ran, but now the different ranks, the different processes, are each doing a different thing. Yeah. And in the end, the results are being collected.
Yeah, so in this code, what's important is that we have the function that all of the copies run independently, each doing its own part of the whole. The key part is on line 39, where we have the communication call, gather. We gather everything, and we have this root, so all of that information goes to process zero. It's a collective operation: everybody sends their information to process zero, and process zero then prints it. That's an example of one of the functions that allow the processes to communicate with each other. There's also send, for example, which means one process sends information to one other process. Okay, we probably don't want to go into too much detail here, because we also want to spend some time on the rest of the material. Yeah, I'll quickly mention that MPI is commonly used in scientific codes, for the kinds of problems where you solve something on a big grid and everybody talks to their neighbors. But the communication pattern depends heavily on the problem, so you usually need to decide yourself who tells whom what, based on your problem. There's one big rule of thumb when it comes to multi-threading and multiprocessing versus MPI: multi-threading and multiprocessing usually parallelize very small parts, like a single for loop, whereas MPI runs all of the code in parallel. So with MPI, you want to split at the highest level possible, so that as much as possible gets divided between the processes and they only occasionally send messages to each other. With multi-threading, you basically parallelize the smallest loop, one loop at a time. So they tend to be used quite differently.
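A minimal sketch of the gather pattern being described, assuming mpi4py is installed (save as, say, `gather_example.py` and launch with `mpirun -n 4 python gather_example.py`; the squaring is just a stand-in for each rank's share of the work):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this copy's identifier: 0 .. size-1
size = comm.Get_size()   # total number of copies

# every copy runs this same code, but computes its own piece
part = rank * rank

# collective communication: everyone sends its part to rank 0
parts = comm.gather(part, root=0)

if rank == 0:
    print("gathered:", parts)   # only rank 0 receives the full list
else:
    assert parts is None        # the other ranks get None back
```

Note that, unlike `Pool.map`, every copy runs the whole script from the top; the `rank` check is what makes them behave differently.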
That's also the reason why libraries are often parallelized with these multi-threading and multiprocessing things and not with MPI; if you're using MPI, you usually have to do it yourself, or use something specifically built on MPI. Another thing: as we said, a lot of libraries use multi-threading, and the reason they can do it efficiently is that they're not written in Python. Python needs the multiprocessing approach because only one thread can run Python code at a time, but in C, C++, Fortran, and the other fast compiled languages, multiple threads can run code at the same time. So almost everything that does heavy computation has a back end that runs the fast parts in libraries written in those languages. If you end up in a situation where you need to extend those libraries a bit, there are multiple ways of writing C, C++, Fortran, or Rust code and calling it from Python to get the thing done. Yeah, you can just call your own library from Python. There are also Python libraries such as Numba and JAX nowadays that do things like just-in-time compilation: they take your Python function and compile it into fast machine code without you ever leaving Python. They usually need additional things, though: your code needs to be written in a way that can be compiled, so they don't support fully general Python, only a subset that you have to stick to. But it lets you write the function in Python and run it as if it were written in C, which is really convenient, and usually it's actually the first thing you would want to try. Yeah, and before we leave for a break, we can mention Dask as well.
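A small sketch of the just-in-time idea, assuming Numba is installed (`pip install numba`; the function here is a hypothetical example, not from the lesson):

```python
import numpy as np
from numba import njit

@njit  # compile this function to machine code on first call
def sum_of_squares(arr):
    # an explicit loop like this would be slow in plain Python,
    # but Numba compiles it -- note it accepts only a subset of Python
    total = 0.0
    for x in arr:
        total += x * x
    return total

data = np.arange(1_000_000, dtype=np.float64)
print(sum_of_squares(data))  # the first call triggers compilation
```

After the first (slower, compiling) call, subsequent calls run at compiled-code speed.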
So if you're dealing with big pandas data frames, a lot of data that you need to process, Dask is a kind of more parallelizable version of pandas. What it does, basically, is take the operations you write in your pandas-style code, select certain rows, calculate their average, and so on, build a computational graph out of them, and then execute that graph in parallel, for example by processing the data in pieces, handling the parallelism in the back end. If you're dealing with large data sets, large data frames, it's a very useful tool; it's used in banks and that sort of place because it makes it possible to analyze huge amounts of, say, customer data. So I started a list of useful parallel Python libraries in the notes, and I wrote that Dask is useful for large data sets. There was also joblib, which is, I guess, an easier way of doing maps with multiprocessing; it probably does other things as well. Yes, we'll continue adding more of those, and if you have any more questions, just put them in the notes. Yeah, and please add any library you know of or use that we may not know of yet; it's always good to keep that up to date. Also, at the end of the day, if there are certain cases or certain libraries you would want us to present in the coming years, let us know, because there are so many of these nowadays that it's hard to say which are the most important for users, so tell us what sort of use cases you want us to demonstrate. Otherwise, we intentionally left a good amount of time for discussion here, so let's look for questions in the notes that we might want to bring up. There's a question: how can I install OpenMPI?
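As a sketch of what the lazy, graph-building style just described can look like, assuming Dask is installed (the file pattern and column name here are hypothetical):

```python
import dask.dataframe as dd

# read many CSV files lazily as one logical data frame
df = dd.read_csv("customers-*.csv")

# this only builds the computational graph -- nothing runs yet
mean_balance = df[df["balance"] > 0]["balance"].mean()

# .compute() executes the graph, in parallel, piece by piece
print(mean_balance.compute())
```

The pandas-like API is the point: the code reads almost the same as plain pandas, but the execution is deferred and parallelized.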
Well, conda has an OpenMPI package, so you can just conda install OpenMPI, at least from conda-forge, and mpi4py as well. I would highly recommend that over installing mpi4py yourself, unless you're using it at a very large scale on a computational cluster or something. If you want to work on one machine, the MPI from conda is good enough in most cases. If you're working on a computational cluster, the maintainers of that cluster will usually provide an MPI for you, because it needs extra configuration to be able to use the fast network. Absolutely; there will probably be multiple versions you can choose from. Another question, which is answered in the notes, is about benchmarking libraries, because as I said in the beginning, before thinking about parallelizing you should benchmark and try to make the code faster with existing libraries, which is usually enough. If you have one specific function you know you want to benchmark, then the timeit module is really good. And if you have an entire code and you're just starting, you want to figure out what the slow part of that code is, then a profiler like Scalene is a good option. So that's this question; it was question number nine. Is it easy to find in the notes? Yeah, I added it to the library list. Okay, good. Then there's: benchmarking aside, are there guidelines for deciding in what way I should parallelize my code? I would say the first thing is to check for the embarrassingly parallel option. That's usually the most efficient parallelization: if you have a natural thing in your code that is embarrassingly parallelizable, for example you run it with multiple parameters or multiple data sets, that is usually the way to go, because it basically scales infinitely; you can always launch more processes. As long as you have more data.
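A small sketch of the timeit workflow for benchmarking one function (everything here is standard library; the function being timed is just an example):

```python
import timeit

def slow_square_list(n):
    # a pure-Python loop: a typical candidate for benchmarking
    return [i * i for i in range(n)]

# run the call many times and report the total wall time in seconds
seconds = timeit.timeit(lambda: slow_square_list(10_000), number=100)
print(f"100 runs took {seconds:.3f} s")
```

The same module is available from the command line as `python -m timeit` and, in Jupyter, as the `%timeit` magic.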
Yeah, as long as you have more data. After that, I would say check whether the libraries you're using support parallelization. Then, if a single case just takes too long to run, like several days: on a cluster or supercomputer you can usually reserve nodes for a few days, but if something breaks, you might lose everything. So it's important to build in some checkpointing: write your state to disk so that if you have to restart, the job can continue from where it was. That's already very useful. It's not parallelization exactly, but it allows you to run for longer. Then, if it still takes way too long and you just need to make it faster, you might need to parallelize in some way. If there's a way of splitting the data, say you're running a simulation on a grid of points and you can split those points and work on each part independently, that's a good case for MPI, possibly. If you have big loops over a lot of small things, you might be able to parallelize with the multi-threading options. Basically, though, the multi-threading route means: use libraries that are multi-threaded, which means NumPy, SciPy, Torch, TensorFlow, whatever fits what you're doing; almost everything that does a lot of calculations is multi-threaded. And try to combine calls: if you have multiple NumPy calls on the same data, try to make them one call so the work doesn't get split up. One more thing: there are a few questions in the chat about the multiprocessing example locking up the program. It might be due to Jupyter and the global interpreter lock we were talking about, because in Jupyter you have JupyterLab running a Python interpreter, and then you try to run multiprocessing inside that.
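A minimal sketch of the checkpointing idea mentioned above (the file name and the parameter loop are hypothetical; the point is only that finished work is persisted, so a restart skips it):

```python
import json
import os

CHECKPOINT = "results_checkpoint.json"   # hypothetical file name

def expensive_computation(param):
    return param * param                 # stand-in for hours of real work

# load whatever a previous (possibly crashed) run already finished
done = {}
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        done = json.load(f)

for param in [1, 2, 3, 4]:
    key = str(param)
    if key in done:
        continue                         # skip work finished before a crash
    done[key] = expensive_computation(param)
    with open(CHECKPOINT, "w") as f:     # persist after every item
        json.dump(done, f)
```

If the job is killed halfway through, rerunning the same script resumes from the last saved item instead of starting over.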
It might be that Jupyter and the way it processes the cells somehow interacts with the lock. It should work, but there might be situations where it doesn't; it's hard to say. We'll check the examples and verify, since there are at least two questions about this. If you take the solution from the web page and it doesn't work, you might want to compare against the solution. Okay, we are out of time, so let's take a break and then move on to packaging. Yeah, I'll quickly mention that there was also a question about pooling: someone was getting worse results with Pool than without it. And this is exactly what can happen when you parallelize something that doesn't actually benefit from parallelization. Our example is, of course, trivial: the processor goes through it in an instant anyway, nanoseconds or microseconds. So adding the construction of "we'll build a parallel pool and give everyone their own process to run" is a huge amount of overhead. But when you get to run times of seconds or minutes, suddenly the overhead isn't that big. Usually you need to figure out which part of the code actually needs the parallelization. And I'll also mention that for the map example, writing the data as a NumPy array and just squaring the array would be much faster than any pooling, because NumPy already does the parallelization inside. Not using Python objects but NumPy arrays will essentially always beat the pooling approach here; the example is trivial, it's just there to demonstrate how to use the tools. Okay, but yeah, we are out of time, so do take a break and walk around a bit. Let's come back in 10 minutes. Yeah. All right, bye. Bye.
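For contrast with the Pool version, the vectorized form being recommended is just this (standard NumPy, same numbers as the lesson's example):

```python
import numpy as np

numbers = np.array([1, 2, 3, 4, 5, 6])

# one vectorized operation: the loop happens inside NumPy's
# compiled code, with no per-element Python overhead
squares = numbers ** 2
print(squares.tolist())
```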