So, hello, everyone. Can you hear me well? Okay. Welcome to my talk. I'm Michał, and today I'll be speaking about running your Python code in parallel, for the most part, and asynchronously. To be honest, I've never spoken to such a big group of people, so excuse me for being a little bit overwhelmed. And let me say a little about where this talk fits in the conference program. You can see that asynchronous and parallel topics are hot right now; even during this conference, we have several talks about them. I want you to understand what my talk is about, so you don't mix it up with the other talks, which I also encourage you to see. There are some asynchronous talks and some parallel talks, and there's also a session about Python at CERN, which I think will be interesting; I believe it's a poster session. This talk tries to be an overview of the topic, not an introduction; that's why I labeled it as advanced. You might also feel that I've skipped some parts, but I wanted to put it together as an overview, so you can later research what's interesting for you, without being bogged down in details.

Okay, so, a few words about me. I worked at the LHCb experiment at CERN, looking for antimatter, for some time. Later I decided to pursue a PhD in computer science, but then I heard that if I drop out, I will probably start a multi-billion-dollar business. For some reason, that hasn't happened yet. And currently, I work at Akamai, where I'm known as a FAT developer; FAT, obviously, stands for Frameworks And Tools. My job at Akamai is to make sure that we use the best tools we can. And how do you define the best tools? Sometimes you hear that Facebook is using some tool, or that Google is using some other. But do we need that? Do we have the exact same scenario as them? So, my job is to create tools, and to select tools, that are the best fit for us. And Akamai itself is a content delivery network and cloud services provider.
We are not very well known in Europe for some reason, but we have one of the largest networks, or the largest network, of computers talking to each other, and we're responsible for between 10 and 25% of all web traffic. We also have some security products, launched recently, and 16 offices in EMEA, both sales and engineering.

Okay. There's a lot of mess when it comes to the basic concepts of asynchronous and parallel programming, so let me clarify some things first. When you have one pipeline and one worker working on it, you have serial, or sequential, execution. When you have one pipeline but multiple workers, and they work at the same time but not in parallel, I would call it concurrent. You may not agree with me, and some people do not, but let's assume, at least for this talk, that that's correct. And then we have parallel execution: multiple pipelines, multiple workers, and they actually do their things in parallel.

When I think of concurrency, I usually think about preemption. How many of you know what preemption is? Okay, half of you. Preemption occurs when a thread has CPU time and the operating system's scheduler decides that some other thread needs that time more. So the preemption occurs: one thread is stopped, the other thread is put in its place, and then they keep switching roles until their jobs are complete. This is why things can look concurrent, because you achieve results within a certain window of time, but they are not truly parallel.

Okay. So, how would you call this? I would call it a headache. Or you might call it parallel and asynchronous. Another thing I need to clarify is the difference between threads and processes, because they are often confused, or processes are wrongly described as just bigger threads. Threads are the place where your code is executed. Each process has at least one thread.
A thread can be scheduled for execution and get CPU time. All threads share the virtual address space and system resources of their process; each thread has its own stack and local variables, but they all share the process heap. A process, in turn, is an execution environment for threads: it has its own address space, it has the executable code, it holds handles to system objects. It provides everything a thread needs in order to run. I wanted to clarify that because sometimes people don't understand why the GIL in Python complicates things.

So, how does this apply to Python? In Python, we have multi-threading and multi-processing. When we talk about multi-threading, we have one process, so one environment, many threads, and only one interpreter. And due to the GIL, there is a rule which says that in a Python process, only one Python bytecode instruction executes at a time. So if you have many threads, they cannot execute bytecode instructions in parallel. But with I/O it's a little bit of a different story, because I/O does not execute any bytecode instructions. So if you have threads and you do some I/O in them, you can actually see a speedup, because that work is not going through the Python interpreter. And when we talk about multi-processing, we have many processes, many threads, at least one thread per process, and many interpreters; each process's threads have their own interpreter, and that's why they can execute in parallel.

So, do we have Alex Martelli here? Okay. It's always dangerous to cite someone sitting in front of you. During a chat with Raymond Hettinger, he proposed the following classification, which I think is simple but nice. If you have one core, you usually want to run a single process with a single thread. For two to 16 cores, because that's how many cores you can get in consumer PCs nowadays, you can have multiple threads and multiple processes.
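Going back to the I/O point for a second, here is a minimal sketch of why I/O-bound threads give a speedup despite the GIL. I'm simulating blocking I/O with time.sleep, which releases the GIL just as a real socket or disk call would; the numbers are my own illustration:

```python
import threading
import time

def fake_io(delay):
    # time.sleep releases the GIL, just like blocking socket or disk I/O,
    # so several threads can all be waiting at the same time.
    time.sleep(delay)

def run_threads(n=4, delay=0.2):
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    # Four 0.2 s waits overlap: total time is close to 0.2 s, not 0.8 s.
    print(f"{run_threads():.2f}s")
```

If fake_io did pure arithmetic instead, the threads would take roughly the sum of their run times, because only one of them can execute bytecode at once.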
So, why should you not use multiple threads on a single core? Because even though I/O done in a thread can give you a speedup, it still needs some CPU time. Not a lot, but some. So with only one core, you will not see a speedup. And when you have 16-plus cores, you usually have multiple CPUs, so you enter the area of distributed computing. Alex suggests that as time goes by, the middle category becomes less relevant, as we are in the era of big data, and even one CPU with 16 or 32 cores is not enough. I would argue that for some cases, like back-end web services, it is. But you can hear more about that in Raymond Hettinger's talk.

Okay, so you should have some background now. When I, as a back-end developer, think of a speedup, a performance boost, which one do I want to use, parallel or asynchronous? Parallel, because I want to execute many things at the same time. And if I want to gain responsiveness and lower latency, I choose asynchronous.

So, when is running things in parallel useful? When you have big datasets or complex computations, when you have problems with a parallel nature, so-called coarse-grained problems, or when you have multi-worker applications. I/O-bound problems are not a good fit for parallelization, as they require a lot of I/O, which is mostly sequential. Also, problems need to be complex enough that the parallel overhead, caused by process maintenance, communication, scheduling, and synchronization, is negligible compared to what's going on inside the process.

Okay. So, who knows Amdahl's law? Some of you. Okay. Amdahl's law tells you how much speedup you can get when running in parallel. You need to know how big a part of your program has to be sequential, and if you know that, you can approximate how much speedup you can get with a given number of CPUs.
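As a sketch, the law can be written as a one-liner. Here P is the fraction of the run time that can be parallelized and n is the number of CPUs; the function name is my own:

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    # Amdahl's law: S(n) = 1 / ((1 - P) + P / n)
    # The (1 - P) term is the sequential part that no number of CPUs removes.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cpus)

print(amdahl_speedup(0.5, 4))           # 1.6
print(amdahl_speedup(0.5, 1_000_000))   # just under 2.0: the asymptotic limit
```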
So, let's say we have a task that runs for 10 minutes, but five minutes of that time is sequential work, like loading data. You can see that even with an infinite number of CPUs, we can only achieve a speedup of two: the parallel part's time goes to almost zero, but you still have those five sequential minutes. So you really need to know your problem when you start with parallel programming.

And just to give you an example of that, some of you might say it's a really trivial problem, but to show you how this works I had to choose something like that: here we have a small dataset and a really simple operation. We have an input vector of one million elements, and we want to calculate outputs which are the inputs plus one. We can run it sequentially, and we can also run it in parallel in different processes. So, what do you think the speedup will be? We are running on four cores. Two? Four? None? It will actually be slower, because the problem is really simple, the dataset is small, and that's not enough for any gain; in fact, you lose something. And with eight cores or more, it gets even more complicated and you get even worse results.

Okay. So, a common pattern in parallel programming demos is to make the problem more expensive by running it in a for loop. Here we have a problem that's 200 times more expensive. How big will the speedup be now? Two? Four? Almost four. Yeah. The speedup comes from using processes, because, as I mentioned earlier, in Python we need processes to execute truly in parallel; arithmetic operations go through the interpreter, so we need separate interpreters. So, here we get almost four.

Okay. So, some problems like that have a parallel nature. Here I was easily able to divide my dataset into four subsets, and most of the program runs in parallel. This type of problem has a parallel nature.
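A minimal sketch of that kind of decomposition might look like the following. The chunking scheme and the add_one helper are my own illustration, not the exact code from the slides:

```python
from multiprocessing import Pool

def add_one(chunk):
    # Each worker gets a whole chunk, so the per-task work is coarse-grained
    # enough to pay back the process start-up and data-transfer costs.
    return [x + 1 for x in chunk]

def run_parallel(data, n_procs=4):
    # Split the input into one chunk per process; the last chunk
    # absorbs the remainder when the length does not divide evenly.
    size = len(data) // n_procs
    chunks = [data[i * size:(i + 1) * size] for i in range(n_procs - 1)]
    chunks.append(data[(n_procs - 1) * size:])
    with Pool(n_procs) as pool:
        results = pool.map(add_one, chunks)
    # Flatten the per-chunk results back into one output vector.
    return [x for chunk in results for x in chunk]

if __name__ == "__main__":
    data = list(range(1_000_000))
    assert run_parallel(data) == [x + 1 for x in data]
```

Timing the sequential version against this one reproduces the effect from the slides: for a cheap operation the process overhead dominates, and only with heavier per-element work does the near-4x speedup appear.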
So, usually, when we talk about a parallel nature, we talk about coarse-grained problems: a loop of loops, multiple images to process, multiple datasets, or one really big dataset, or maybe the dataset is not big but the operations we want to run on it are long. Those problems are coarse-grained, and there you can easily apply parallel programming. But for fine-grained problems, it's a different story. When you have a single iteration of a loop, a single image, or a single small dataset, you should not parallelize that, at least not on a CPU. Nowadays, we can actually parallelize fine-grained problems on massively parallel devices like GPUs, because they have a huge number of processing units and their threads are really lightweight.

So, in parallel programming, we have different memory architectures, and the two best known are shared memory, where each process connects to the same shared memory and works on the same dataset, and distributed memory, a.k.a. message passing, where we need to pass data to the processes and later get it back; that's why it's called message passing.

So, how do you apply them in Python? For shared memory, we have shared ctypes objects. Those are objects created in shared memory that can be inherited by child processes. If you import Value from multiprocessing, you can say what type the value is and assign its initial value, and there are also some other types and primitives. So, let's see how they behave. I have two programs; the difference is that one uses locking and the other does not. The one on the left does not use locking. We have shared memory, so all processes have access to the same memory: all can read from it and write to it at the same time. If you do that, you will get something called race conditions: sometimes two or more processes read the same value.
So, let's say that at index two I have the value two, and four processes read it. When they read it, each adds one to it, so it's three, and then they write that three into memory four times. That's how you get results like this: when you run the left program, you will get different values depending on what's going on in your system, but the answer will be wrong. So, for shared memory, you always need some kind of synchronization, and in this case I used locking. Here we ensure that only one process can read the shared memory at a time, and only one can write to it. With that, you get a correct result.

But what about the time? You might say that the problem is too small, or the dataset is too small, but that's not the case here. The point is that when you use locking and you have multiple processes, you in fact get sequential execution, because only one process at a time can take something from memory, do its calculation, and write back. So your code will either be slower or run in more or less the same time. And believe me, here it's really easy to spot, but usually our problems are not that simple. Actually, you can use something else: these shared ctypes objects have their own locks, so you can use those. The output will be the same, but you will not create additional locks. And you really want to keep the number of locks as low as possible, or you will no longer know what's going on.

Okay. So, we also have managers in Python, which are a hybrid between shared memory and message passing. Managers are proxies through which child processes can access data. When you create a manager, it spawns a new process which communicates through sockets. So, if you create a multiprocessing Manager, it will create a new process, and you can later hand it to the children of that process, or you can even use it for remote access, because it's using a socket. And for distributed memory, the most commonly used tool is Pool.map.
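Before we get to Pool.map, here is a minimal sketch of the shared-Value pattern with its built-in lock, along the lines just described (worker counts and iteration numbers are my own illustration):

```python
from multiprocessing import Process, Value

def add_n(counter, n):
    # get_lock() exposes the lock that every shared ctypes object
    # already carries, so no extra Lock object has to be created.
    for _ in range(n):
        with counter.get_lock():
            counter.value += 1

def run_counter(workers=4, n=1000):
    counter = Value("i", 0)  # shared 32-bit int, initial value 0
    procs = [Process(target=add_n, args=(counter, n)) for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value

if __name__ == "__main__":
    # With the lock, every increment lands: 4 * 1000 = 4000.
    # Drop the `with counter.get_lock():` line and the total will
    # usually come out lower, thanks to the race condition.
    print(run_counter())
```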
How many of you have used Pool.map? Yeah, some of you. So, it's really simple and nice: you just define how many processes you want, you map a function over a collection of arguments, and it just runs. It's a really high-level, nice tool, and you can get a speedup very simply. But you need to remember to always close or terminate your pool and later to join it. And if you're the kind of person who doesn't remember, like me, you can use it as a context manager.

We also have things that look more like classic message passing: pipes and queues. The basic difference is that a pipe has only two ends, and it's really fast because it usually uses operating-system pipes, while a queue can have multiple producers and consumers. But you need to keep in mind that behind the scenes, there are pipes connecting all the elements of the network.

Yeah, so Pool has some overlooked features, because people usually use it like this: they create a pool with a number of processes and then just map some function over some input. One thing you can set is the maxtasksperchild argument. Sometimes your worker processes grow and consume more and more memory, and you want to restart them once in a while; with this argument, you define how many tasks a child executes before it's replaced with a new one. You can also set a chunksize, which I didn't know about until yesterday, I think. map by default sends one element per task, so if you have a pool of four processes and, say, 12 inputs, the default chunk size of one means 12 round trips between the workers and the main process before the results come back. So, you can optimize that with this parameter.

There are also imap and imap_unordered. The difference is that with map you get a list with the results, while with imap you get a generator.
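A minimal sketch of those Pool features, used as a context manager, might look like this (the square function and the parameter values are my own illustration):

```python
from multiprocessing import Pool

def square(x):
    return x * x

def run_pool(n=12):
    # Context-manager form: the pool is closed and joined automatically.
    # maxtasksperchild=100 recycles a worker after 100 tasks, so a slowly
    # leaking worker cannot grow forever; chunksize=3 sends the inputs in
    # batches of three, cutting 12 round trips down to 4.
    with Pool(processes=4, maxtasksperchild=100) as pool:
        return pool.map(square, range(n), chunksize=3)

if __name__ == "__main__":
    print(run_pool())  # [0, 1, 4, 9, ..., 121]
```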
But with imap you still need to wait for the results in order, while with imap_unordered you get whatever finishes first, which is useful. There's also the apply_async approach, which is what's actually going on behind the scenes, but using it directly is discouraged, because the map functions are considered higher-level and better tools.

Okay, so we have different models for parallel programming, and we also have different models for distributed memory itself, so-called worker-based models. You can have a pre-fork model, which Gunicorn uses: you create your workers beforehand, defining that your application starts up with, say, four processes or four threads. You can have a worker model, where you decide during execution how many workers you need, for example tuning that to your dataset and how well it divides; a multiprocessing Pool is an example of that. And you also have a hybrid approach, where you define the number of workers beforehand and later scale them dynamically, which is useful when you're running something like a back-end server.

Okay, so when you want to create a multi-worker application, and let's say you want to respond to requests, you basically have two approaches. You can use SO_REUSEPORT and related flags for the socket. I won't really go into details; there's a really nice description of it on Stack Overflow. Basically, you can create as many processes or threads as you want and bind the same socket to them, so all those workers can listen to what's coming in on the socket. But in this scenario, you have to take care of locks and synchronization, because if you just read from all those threads, you'll get garbage. In Twisted, there's a really neat way of doing this: you create a socket, let's say a TCP socket, and you spawn child processes.
Then you can adopt the socket in the child processes. This is the approach you should choose if you really want to tune your performance and have access to the low-level stuff. If you don't, you can take a different approach, which is the most common: you have just a single thread reading from the socket, this thread is responsible for I/O, and it delegates the work to other workers through a queue, the queue I mentioned earlier. Then you don't have any problems with synchronization and things like that.

So, up till now, we've talked about so-called intra-node communication: communication within a single CPU, or just one server. But you can also run your code on multiple machines. There are many libraries; you've probably heard about MPI, which I think is still the most commonly used library among scientists. But there are other tools. I personally like SCOOP, maybe because it has a really nice slogan. It uses ZeroMQ sockets for communication, it's really similar to a multiprocessing Pool, and it uses SSH connections for execution: you need SSH access to the machines you want to run your application on, and it then connects to them, sends the data, and executes it. You can see that it's really, really simple to use.

Okay. So, I've encountered some traps and some weird behavior over the years, and I'd like to share them with you. One possible trap is hyper-threading. CPUs are often advertised as 16 cores, 32 cores, 64 cores, but how many physical cores do you get? Usually half of them. Hyper-threading works like this: you have a CPU pipeline with, let's call them, slots in it. If there are slots free to run two things at the same time on one core, then your two logical threads will run in parallel; if there aren't, they won't. So, I had a problem. I had a 12-core Intel Xeon machine with 24 logical cores.
And when I ran my computation on it, a really complex computation, and I'm sure the result was not caused by communication or anything like that, I achieved these results. I've heard that Intel is launching a new tool for tuning and profiling Python, so I think it might be interesting to work with that.

Also, you don't always want to target 100% utilization. If you have four cores and you prepare four workers, you might still see 10% of each core unused. What you might want to do is just add workers to use that spare capacity, but you won't gain anything; actually, you will lose. Does anyone know why we lose time here? Because in theory we should utilize that additional spare 10% and it should be faster. Yes, exactly: we are switching contexts. All the processes are fighting for resources, and switching them and copying them between cores is a really expensive operation. So, don't always target 100%.

Also, there's a funny thing in how pipes are implemented. OS pipes cannot send things both ways, so if you create a multiprocessing Pipe with duplex=True, you will actually get a socket. And if sometimes you get a socket and sometimes a pipe, and you take into consideration that they have different buffer sizes defined in the kernel, then you might encounter a situation where sometimes you are able to send something and sometimes not. So, that's interesting.

And then there's the usual topic, which is deadlocks: one process holds a resource, a second process holds its own resource, and they wait for each other, but neither frees what it holds, so they wait forever. And do you know how to kill processes and threads in Python? Is there a kill method? Who has used the kill method on a thread? Okay. You couldn't have used a kill method, because it does not exist: you cannot kill a thread.
It's by design, because you might end up in a situation where your thread holds a resource, and when it's killed, other threads will never get that resource, because it's never freed. That's why you need to use different mechanics.

And while we are on threads, there's a common misconception about daemons. If you have a while True or something similar in your thread, it should be a daemon, and daemons should not be joined. Once you set a thread up as a daemon, it just runs as long as its process is running, and the only clean way of stopping it is up to you. Also, don't use global variables: don't define stop = False and then loop until it changes, because you never know what will happen, and when those threads will actually be stopped. The common pattern is to use events: you set an event in the main thread, and the worker threads watch for that event.

Okay. So, parallel and asynchronous: we finally reach the topic. We have basically two options, the thread option and the process option, for which we can define executors. We can submit jobs to them, and what you get back is futures. You can also use them as context managers. You can run them without starting the IO loop and just get futures, or you can use them with an IO loop, but then you need adapters, and sometimes those adapters work and sometimes they don't, so you need to be really careful. Also, keyword arguments are not allowed when you run jobs through an executor this way; you might want to read the PEP to learn why. If you really need keyword arguments, just use functools.partial. You can also submit several jobs and wait for all of them to finish.

So, coming to an end, why would you want all this? You might want it for long-running tasks that would block your IO loop, or if you have some code which is incompatible with your IO loop and will most certainly block it, like requests.
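For example, a minimal sketch of pushing a blocking call into a thread-pool executor from an asyncio loop might look like this. The blocking_fetch function is purely illustrative, a stand-in for something like requests.get:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def blocking_fetch(url, timeout=5):
    # Stand-in for a blocking call such as requests.get; the name and
    # behavior here are my own illustration.
    return f"fetched {url} (timeout={timeout})"

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # run_in_executor passes positional arguments only, so keyword
        # arguments are bound up front with functools.partial.
        futures = [
            loop.run_in_executor(pool, partial(blocking_fetch, url, timeout=2))
            for url in ("a", "b")
        ]
        # gather returns the results in submission order.
        return await asyncio.gather(*futures)

results = asyncio.run(main())
print(results)
```

While the blocking calls run in the pool's threads, the event loop stays free to service other coroutines.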
You cannot use requests with any IO loop that exists right now. And also if you have blocking tasks that you want to run in parallel. So, what will you get when you use all this? A headache. Because running things asynchronously is troublesome, and when you also introduce running in parallel, that's troublesome too. So, you should really know that you need it.

Okay. So, let me rant for a moment just before I finish. You all know this gentleman: it's Tim Peters, and he said that there should be one, and preferably only one, obvious way to do it. So, where have we gone wrong? We currently have four commonly used IO loops and three types of asynchronous calls. So, if there are some decisive people in this crowd, let's think about how to fix that, because Python 3 was created in order to clean up the mess which had accumulated over the years, and I feel that now we are creating such a mess again.

Okay. So, in summary: Python has a wide variety of options when it comes to parallel and asynchronous programming. You should really know your architecture when you use parallel programming. You should always test your code before entering the parallel and concurrent world: first sequential, then parallel. After you enter the concurrent world, you should test it with fuzzing; I didn't say anything about that, but you can research it. Be aware of incompatibilities between modules, and I assure you that they do exist. Know when to expect awaitable objects in asynchronous programming, and handle them properly. And also, you know, those tools are there for us, and they mostly work; you can even build production code with them if you test them well. So, don't be scared to seek out new options and to boldly go where no man has gone before. Thank you. Hopefully it will be a piece of cake. Thanks.

Do we have time for questions? Yeah, we have three minutes for questions. Questions for me? Many questions.
So, actually, I brought something nice from Poland for the people who ask the best questions.

You mentioned deadlocks. I'm curious if there's a library that can detect deadlocks which may appear in a program.

I know that such solutions exist, but I don't remember the names. But you can mostly catch these with fuzzing, or just by testing some unexpected behaviors.

Okay, thanks. On your last slide, you mentioned incompatibilities between modules. Maybe you can tell us something more about that. What was your experience? What were those incompatibilities?

Okay, could you say that again, a little bit louder?

So, on your last slide: "Be aware of any incompatibilities between modules you use." What were you referring to with incompatibilities?

Okay, sorry for that. So, like I mentioned, we have different IO loops. Let's say that you want to use Tornado; it has its own IO loop. But you also want to use some process executor, which does not run really well with it without adapters. So, what you might do is first adapt your Tornado program to run on asyncio and then run those executors. Curio, for example, does not work with any other IO loop; you don't even get anything to connect them. And in Tornado, for example, you should not just use yield from anywhere; you should yield from a coroutine. So, there's also some incompatibility there. That's what I meant.

Okay, thanks. We have a presentation.