Okay, so: parallelism shootout. Threads, multiple processes, async IO. Just to set expectations before I start, this is not a deep dive into any of these. So definitely not an advanced talk; probably intermediate, maybe even beginner, depending on how much you already know. My name is Sharyar, I'm a software engineer in London, working at BOSPA, and I don't have a presentation up yet because it quit unexpectedly.

Today we're gonna talk about parallelism, and the point of it is to take one problem (it will come up on a slide eventually) and try to solve it using different techniques, I mean threading, multiprocessing, or async IO, and just get a feel for how each of them works. That's me. So the problem: we have, let's say, lots of URLs in a file, and we want to download their contents and store them on our machine, right? And the point is to use the threading, multiprocessing, and asyncio libraries, or modules, separately, and then, firstly, get a feel for the mechanics of how they work, and secondly, be able to do a simple benchmark. Now, benchmarks make me nervous, especially for parallelism, so don't take it too seriously; it's just to give you an idea, or me an idea, of how they compare to each other.

So before we start, I'm just gonna break the problem down into three main bits. First, we read the URLs from a file. Second, we download the content from the internet. Third, we store it on our machine, right?

But before we begin, just a quick reminder: who is familiar with IO-bound and CPU-bound types of computation? Excellent. Just a quick recap: CPU-bound computations are basically computations that are hungry for CPU, so if you give them more CPU, or a faster CPU, they perform faster and finish quicker.
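To make the distinction concrete, here's a minimal sketch with two made-up functions: the hash loop burns CPU, and a sleep stands in for real IO like a disk read or HTTP request.

```python
import hashlib
import time

def cpu_bound(n):
    # Hungry for CPU: a faster CPU finishes this sooner.
    digest = b"seed"
    for _ in range(n):
        digest = hashlib.sha256(digest).digest()
    return digest

def io_bound():
    # Dominated by waiting; the sleep stands in for a disk read or an
    # HTTP request. A faster CPU barely changes the elapsed time.
    time.sleep(0.05)
    return "done"
```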
And IO-bound computations are ones where the time it takes to complete depends on how long they're waiting for IO; you can give them a really fast CPU, but it won't make a difference, because they're blocking on IO. So to go back to our three original sub-problems: reading URLs from a file is IO-bound, right, because it's disk access, we're reading from disk. Downloading content is IO-bound: HTTP requests, again, we have to block and wait. And storing the content on our machine, again, we're writing to disk, so that's IO-bound too. And just as an aside, generally a lot of the things we do are IO-bound, for a loose definition of "generally", but usually day-to-day tasks are IO-bound.

Before we even parallelize, though, I think it'll be good to quickly go through the sequential approach, and I think that'll be a good baseline for comparing how much parallelizing actually improves things, and how different methods give different improvements. It's a bit of a mouthful; I've put the whole thing on the slide because you can actually run this and it works. The interesting, highlighted bit is the sequential approach: we go over the URLs (those helper functions are just for convenience, so I don't have to write things out again, but they do what they say they do), we get the content, and we put it on the machine, but we do this sequentially. We do one, then we do the next one.

And when I think about tasks, and I want to make them faster, I have to think about how this looks on my CPU over time. The way this looks is that it's only running on one of the cores; let's say we have two cores. As far as I'm concerned, the second core is skiving, right? It's doing its own stuff, but as far as my task is concerned, it's not doing anything. So I do a bit of work getting the URL, downloading it, storing it on the machine, then the next one, and it just continues like that.
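The slide code itself isn't in the transcript, so here's a hedged sketch of the sequential baseline. The function names (get_url_content, store_content) are assumptions, and a short sleep stands in for the real download so the example is self-contained.

```python
import time

results = {}

def get_url_content(url):
    # Hypothetical stand-in for the real HTTP download: the sleep
    # plays the role of the network wait (the IO-bound part).
    time.sleep(0.01)
    return "contents of %s" % url

def store_content(url, content):
    # Hypothetical stand-in for writing the content to disk.
    results[url] = content

urls = ["http://example.com/%d" % i for i in range(5)]

# The sequential approach: handle one URL completely, then the next.
for url in urls:
    store_content(url, get_url_content(url))
```

With a real download taking around a second per URL, this loop's total time grows linearly with the number of URLs, which is exactly the baseline the talk measures.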
But to be even more accurate, what's actually happening is that we do a tiny fraction of CPU work, then we do nothing (the dotted lines), because we're blocking on IO; the CPU is not doing anything, and then we do a tiny amount more CPU work. And the reality is this isn't actually to scale: if I showed it to scale, the bit where the CPU is actually engaged for this particular task would be very, very small, right? So this is a proper IO-bound task. And just to show how this works over multiple URLs: one URL takes a tiny amount of time, 30 URLs take a lot more time. By the way, in the beginning I said the problem statement is that we have lots and lots of URLs. I've just used 30 in this case, because it was much easier to run multiple times, but imagine this over a gazillion URLs. That's not gonna happen with the sequential approach; it just goes up, predictably, linearly.

So threading is the first method we're gonna use. Threads in Python are actual, real threads. Also, there's no controversy here: I'm not gonna talk about the global interpreter lock, or how to fix it; it's just there. That's not gonna happen. But just so we know, Python threads are actual pthreads, or Windows threads, or whatever; they're real threads, right?

And a quick recap on how to make them. Is everyone familiar with how to use threads? Okay, fair enough. We can make them two ways: either subclass threading.Thread and override the run method, or just have a plain function, pass it as the target to the normal Thread class, and let it do the work. To run a thread, you call the start method, not the run method. Call start and it goes and does its stuff. And the thread stops when your actual function (in the left case the run method, in the right one the do_work function) reaches its end. But what if that function never reaches the end? What if we have a while True in it?
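The two creation styles just described might look like this; a minimal sketch where Worker, do_work, and the appended strings are all made-up names.

```python
import threading

output = []

# Way 1: subclass threading.Thread and override run().
class Worker(threading.Thread):
    def run(self):
        output.append("from the subclass")

# Way 2: write a plain function and pass it as target=.
def do_work():
    output.append("from the target function")

t1 = Worker()
t2 = threading.Thread(target=do_work)
t1.start()   # call start(), not run(): start() spawns the OS thread,
t2.start()   # which then calls run() for you
t1.join()    # each thread ends when run() / do_work returns
t2.join()
```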
So we want it to do work constantly. Then we have daemon threads: we pass daemon=True to the constructor, and that tells the thread to stop whenever the main thread stops. So when the main thread stops, that thread will stop. If we don't, the interpreter will hang, because the main thread stops but everyone else is still running. It's confusing.

The threading code: again, this is the full code. I don't usually like putting up lots of code, but this is it, so I thought it would be cool to go through it bit by bit. First and foremost, we add the URLs to a queue. I didn't mention the queue: we need it so that different threads can, well, not talk to each other exactly, but use something to find out what they wanna do next, right? Again, Python just gives this to you; most of you probably know this. It's thread-safe, so we don't have to worry about it. You just create it at the top: unvisited_urls. The first thing I do is add the URLs to the queue, so that our threads can then consume from it. You get an interesting edge case if you do that in a separate thread too. You don't wanna do that, because your queue might never get filled, threads might read from it and think it's empty, and your program ends, but it's not actually empty.

Then we have a number of workers, let's say N. We go through them, and for each of them we create one thread and give it the target function, visit_urls, which basically does what the sequential version did. It literally does that, except it gets the URL from the queue and it marks the task done, which is what you do on a queue to say a task is finished. So we create the worker threads and we start them, right, and that does the actual work. That's it.

And to go back to how my CPU and my time look, it would look something like this. This is not accurate, but if we had three threads, then you would have three threads.
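The worker structure just described can be sketched like this. It mirrors the talk's points (fill the queue before starting the workers; call task_done after each URL), but the names and the sleep that replaces the real download-and-store are assumptions.

```python
import queue
import threading
import time

unvisited_urls = queue.Queue()   # thread-safe, straight from the stdlib
results = []
results_lock = threading.Lock()

def visit_urls():
    # Each worker pulls URLs until the queue is drained.
    while True:
        try:
            url = unvisited_urls.get(block=False)
        except queue.Empty:
            return
        time.sleep(0.01)          # stand-in for download + store
        with results_lock:
            results.append(url)
        unvisited_urls.task_done()  # tell the queue this task is finished

urls = ["http://example.com/%d" % i for i in range(10)]
for url in urls:                  # fill the queue *before* starting workers
    unvisited_urls.put(url)

workers = [threading.Thread(target=visit_urls) for _ in range(4)]
for w in workers:
    w.start()
unvisited_urls.join()             # blocks until every task_done() call
for w in workers:
    w.join()
```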
And once one of them is done with the CPU and is waiting, well, we can go on to the next one. I'm being very vague here: the OS, or someone, decides it's gonna move on. So in the same amount of time, we make better use of our resources and do a lot more work, right? And the yellow thing there is the global interpreter lock, which we shall not talk about any more after this, but that's just to say the lock makes sure there's only one thread running on a core at any particular time, right? So again, our second core is skiving; it's doing nothing.

And to look at the speed and performance of this approach, this is how it works. On the x-axis we have the number of threads. So if we have one thread, it takes ages; it should probably even take more than the sequential version, because there's a bit of overhead. By one thread I mean not the main thread, but one extra thread created after the main thread. But we see that as we create more and more threads, this goes down. However, it does flatten out: in this case, somewhere between 11 and 17 threads, you're not really getting any more advantage, and that makes sense, because by the time that 17th thread comes up there might not even be anything left for it to do, right? But this is only for 30 URLs. If we had a gazillion URLs, it would flatten out a bit later. And so, okay, this is good. We've reduced the time: I think sequential took about 30 or so seconds, and we've gone down to, what, in a good case about five seconds. So that's okay.

But we wanna try multiprocessing now and see how that performs. Multiprocessing: again, I assume most people are familiar. Hands up. Yay. Multiprocessing spawns actual processes, the real thing, right? So they can run on separate cores. And the cool part about it is that the API is very, very similar to threading, as in very similar. So it sidesteps the interpreter lock. Oh, I said I wouldn't mention that again.
This is it, I promise, the last time. And it's really easy to change our threading example into a multiprocessing example. This is the exact threading code; only the highlighted lines change, right? Instead of getting Queue from the threading side, we get it from the multiprocessing module, and instead of a Thread we create a Process. That's it. Everything else is the same. This is beautiful, right? So I just changed that in five seconds.

But multiprocessing also gives you something else, amongst many other things, that I'm gonna talk about here, which is the Pool object. The Pool object is a way to parallelize execution of a function over a number of arguments, right? So what this allows us to do is, instead of changing our threading-style code to use multiprocessing, we can even change our sequential code to use multiprocessing. And again, this is the sequential code, as I showed you on one of the first slides. All you change is that you read the URLs in advance and you create a Pool. That Pool has a map method, and what map takes is a function and a list of arguments. Every time it calls the function, it gives it one of the items in the list and says, do your thing. And you can give it the number of worker processes that you want.

So to go back to the time-and-CPU picture: if we had two processes, this is hopefully how it would look, assuming they actually get scheduled to run on two separate cores. But the idea is that multiprocessing should allow you to sidestep the GIL and be able to run things properly in parallel. So true parallelism, hopefully. Yeah. And if we had more and more processes, then this (again, not accurate) is how it would look. It's like having two of those threading diagrams. The graph looks the same as the other one.
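A hedged sketch of the Pool.map version of the sequential code. get_and_store is a made-up name, the sleep-free string return stands in for the real download-and-store, and the explicit "fork" start method (Unix-only) is an assumption made so the snippet can run as a plain script without the `if __name__ == "__main__"` guard that the "spawn" method requires.

```python
import multiprocessing

def get_and_store(url):
    # Must be a top-level function so worker processes can find it.
    # Stand-in for the real download-and-store of one URL.
    return "contents of %s" % url

urls = ["http://example.com/%d" % i for i in range(8)]

# map() hands one URL at a time to the worker processes and collects
# the return values, in the original order.
ctx = multiprocessing.get_context("fork")
with ctx.Pool(processes=4) as pool:
    contents = pool.map(get_and_store, urls)
```

Note how close this is to the sequential loop: the for-loop body became a function, and the loop itself became one map() call.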
But again, this is not exactly accurate, because they're processes, but you get much more work done and you have lots more cores to employ. And the performance is very similar to threading in terms of the way the time drops. But first and foremost, with one single process, it takes longer than both sequential and threading, because the overhead of creating a process is large; a thread's overhead, by comparison, is nothing. But again, you get a healthy drop. This was, again, for 30 URLs, so after a certain point it's diminishing returns; it's not really doing much. So that's cool too.

But asyncio, right? I think asyncio is to Python as big data is to middle management. I don't know, it's kind of, woo, asyncio. It's a new module in Python 3.4 and it gives you the infrastructure for writing single-threaded concurrent code. It is meant to be quite low-level, and the point is that you can use other stuff like Tornado and Twisted on top of it. I don't, in this presentation, but it is quite low-level and it's fairly compatible with everything else. Well, except, you know, it's Python 3.4 onwards, mainly.

And, is anyone familiar with asyncio? Cool. asyncio has a lot of concepts; I'm going to go through just two of them, because I think they're the most important ones and also the ones I'll be using in the code later on. One of them is that we have coroutines, and coroutines are basically functions that can pause their execution in the middle of what they're doing and return; you know, something else does its work, and then you can go back to that function and carry on. So this should immediately remind you of yield, basically. It's like a generator, right? Because it keeps its state. You do something, it yields. You do something else, but then if you go back to it, it continues from where it was, and it keeps its state.
That's what coroutines are. And the way they're used: if you have three separate functions and you wanna run them normally, you run one, then the next one, then the third, in a row. Whereas with coroutines you can say, okay, I'm gonna run the first one until it needs to block. When it needs to block, it can yield, and I can do my own stuff; I can run the second function. And it does the same thing. I think this is demonstrated well where the three separate functions run in a row; but if you take the case of the blue one, you know, it suspends because it's blocking. So it gives other things a chance to run, but they also suspend halfway through too. And when blue carries on, it's just making progress; it's not starting again. It's just stopped blocking, it's ready to go again, so it can make progress. But also notice that these don't run in a fixed round-robin order; the order changes. So someone needs to keep track of, you know, how they're scheduled, and generally just keep track of all these coroutines flying around. And that's where the event loop comes in. The event loop is in charge of keeping track of the coroutines (that's mainly what it does) and deciding which one goes next. I ran through this much quicker than I did last night when I practised it. Okay, anyway.

So the code for using asyncio looks like this. yield from is new. What it gives you is basically a two-way channel of communication. Usually, when you just do a yield, like in a generator, it just returns something, whereas yield from allows you to kind of refactor your generator out of your generator. It sounds weird. It probably doesn't make sense. Just don't worry about it. Delegation. Delegation. Yeah, delegation. It does that too.

So, just to walk through what's happening here. First, we get all our coroutines. do_work is a coroutine.
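The delegation idea can be seen with plain generators; a minimal sketch (inner and outer are made-up names).

```python
def inner():
    # A sub-generator that the outer generator delegates to.
    yield 1
    yield 2

def outer():
    yield 0
    # `yield from` transparently forwards every value from inner()
    # (and also routes send()/throw() both ways): the "refactor your
    # generator out of your generator" delegation mentioned above.
    yield from inner()
    yield 3

values = list(outer())
```

After this runs, values is [0, 1, 2, 3]: the consumer can't tell where outer() ends and inner() begins.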
Basically, you know, a function that can be suspended halfway through, blah, blah, blah. We first create all of them, with all our URLs. Then we need an event loop. So from asyncio, we can get an event loop. And then the run_until_complete method allows you to pass it a bunch of coroutines or futures or whatever (in this case, coroutines), and it will run all of them until they're complete. And I do an asyncio.wait there because I want to actually wait for everything to be completed first.

And the way do_work works is that we first need to get the content of the URL. So at this point, this is fairly IO-heavy, right? So it yields from get_url_content, which again has a lot of blocking in it. So while we're waiting for that to happen, we can just go back and run the next task, and that's okay. By the way, there are a lot of different ways of writing this; I was trying to make the shortest possible one so I could fit it all on one slide. But this is roughly how it works. And get_url_content has yields too, so halfway through, if it's blocking, other stuff can just carry on and do its work.

And the performance of this looks pretty cool. So with this number of URLs, it's pretty quick, right? And I think what's really cool about it is that the slope of the line as you add more URLs is less steep than it was in, let's say, the sequential case. So this is quite promising. Drumroll, for a winner: I'm just gonna put all four approaches that I used (well, counting sequential as one) next to each other to see how they performed, for 30 URLs again, not a gazillion, which is the whole point of doing stuff like this. We can see sequential is just not gonna happen, and threading, multiprocessing, and asyncio are all fairly good. I tried running this on lots and lots more URLs, and asyncio did properly outperform the others in that case.
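The slide code isn't in the transcript, and it used Python 3.4's yield from style, which was removed in Python 3.11, so here's a hedged sketch of the same structure in today's async/await syntax. get_url_content and do_work are assumed names, and an awaited sleep stands in for the HTTP download.

```python
import asyncio

async def get_url_content(url):
    # Stand-in for the real HTTP download: awaiting the sleep suspends
    # this coroutine and hands control back to the event loop, just
    # like awaiting a socket read would.
    await asyncio.sleep(0.01)
    return "contents of %s" % url

async def do_work(url, results):
    content = await get_url_content(url)
    results[url] = content  # stand-in for storing the content on disk

async def main():
    urls = ["http://example.com/%d" % i for i in range(10)]
    results = {}
    # One coroutine per URL; gather() runs them concurrently on the
    # single-threaded event loop and waits for all of them.
    await asyncio.gather(*(do_work(url, results) for url in urls))
    return results

results = asyncio.run(main())
```

All ten "downloads" overlap their waiting, so the whole batch takes roughly as long as one sleep rather than ten in a row.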
But again, you have to take these with a pinch of salt, because they're very dependent on the tasks you're doing. For IO-bound tasks, you can use threading, you can use asyncio, and that's fine. But if this were a CPU-bound task, threading wouldn't stand a chance, because of, you know. And asyncio wouldn't do that well either, as far as I understand it. So multiprocessing would be the answer.

So this, I'm tempted to conclude here, but I don't like making conclusions when it comes to parallelism. I think the whole point of it is that every single task, every time I've come across a separate task, it's just different. You have to look at it: is it IO-bound or CPU-bound, but also, how IO-bound is it actually? So I can't make a conclusion and say, well, always use coroutines, or asyncio, or whatever. I think you have to be pragmatic about the task you have at hand and just play with it a bit to see which one works well. So I'm definitely not gonna use this slide to say, ooh, asyncio is much better. No, it's not. It really depends on what you're doing and the type of computation you're doing.

So, this was meant to be a half-hour talk. I don't know why it's 20 minutes and 48 seconds, but this is it. Sorry to disappoint. Just to waste another minute: please, if you wanna give me feedback, other than "your talk was too quick", anything else, please get in touch; we can talk about it. If you wanna try other stuff with my code, there are also some other resources I've put together, some great links and videos, that will be at that URL on GitHub right after this talk. I'll do it when I get out there. It's there, I just have to make it public. Yeah, this is it really. Q&A. We have a bit of time left; we have time for like 4,000 questions.

So, did you ever do something crazy, like combine these techniques: have multiple processes that run threads and use asyncio in the threads?
So the idea for this talk, initially when I proposed it, was to do something like that at the end. But then I didn't have time. Yeah, and I didn't realize that I would have like 12 spare minutes for doing crazy stuff. So no, I didn't. Sorry.

Hi, thanks for the talk. It was very interesting, quite concise. But there's something that really puzzles me. Can you please show the slide with the threading or the multiprocessing times again? I really don't understand why you would have one, I don't know, one thread, it takes... we'll see the number. Sorry, one sec. This one okay? You want the diagram? The times, please. Let's check this out. The next one, I think. Oh, so whichever, the processes one, the next one. This one, yeah. I really don't understand why, when we have one process, it takes more than 30 seconds. Oh yeah, so what did I do wrong? Or did you? No, no, no, this is one process for 30 URLs, right? I mean. Okay. So compared to the sequential version: if you want to download 30 URLs with one process, that's basically sequential, so each URL takes roughly about one second, just under. Okay. Does that make sense? You scared me there for a second; I thought I'd got my axes wrong. Okay, thank you.

Guys, we can talk about life too if you've run out of questions, because we have another 15 minutes. About life, and everything, 42. But besides that, what about gevent? Yes. And green threads. So, you can use stuff like gevent on top of asyncio, which is pretty cool. I haven't done it, but you can definitely use it. You can do that with Tornado, Twisted, everything. I think the way asyncio was designed was for other frameworks like that to be able to build on top of it; that's why it's quite low-level. But I don't have any performance charts for it. What I can do, however, if you're interested, if anyone else is: I can try it, add it to the slides, and when I put the slides online it could be included. Is that okay?
That would be great. Okay, yeah. gevent does not run on top of asyncio. There we go. It's its own event loop; it's completely different. I'm wrong. Okay, well, yeah, I'll have to look into that again. Yeah, it's not, right? But I was under the impression you could use them together. I was just saying, you said gevent, and that does not fit in there, but Tornado and Twisted definitely do; they can run on top of asyncio. gevent is really its own thing, more low-level, and does it its own way. Correction: you can't run gevent on top of asyncio. Ah, okay, good to know. I just wanted to make sure everyone's on the same page here. So yeah, good, you're paying attention at least. Not that I have the microphone anyway, sorry.

So, one thing to add: you presented yield from as the way to do delegation in asyncio. Python 3.5 will have new syntax for that. It was changed because there were problems with yield from and generators, with generators and coroutines getting mixed up, and that was all a bit of a mess. So there's new syntax now: coroutines are defined with async def, and then, instead of yield from such a coroutine, you use await. Yeah, I actually linked a post in the resources, Guido's response on why he chose yield from; it's a good read. He just goes through why he chose yield from and not await at the time.

And then, as you mentioned, guys, we have at least 20 more minutes. Yeah, just something very quick: what about memory overhead? Because I've been reading that processes, or even threads, can take much more memory than asyncio, for example. Yeah, they can. Thank you. So, but again, I really get nervous when I have to draw a conclusion like that. Yes, they do.
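A minimal before/after of that syntax change (fetch is a made-up name, and a sleep stands in for real IO):

```python
import asyncio

# Python 3.4 style, now removed (gone as of Python 3.11):
#
#     @asyncio.coroutine
#     def fetch(url):
#         yield from asyncio.sleep(0.01)
#         return "fetched %s" % url
#
# Python 3.5+ native-coroutine syntax:
async def fetch(url):
    await asyncio.sleep(0.01)   # `await` replaces `yield from`
    return "fetched %s" % url

result = asyncio.run(fetch("http://example.com"))
```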
I think you have to look at your tasks to see, but yeah, I mean, they do. About the overheads: did you happen to get a chance to test something like, say, a four-core hyper-threaded processor for the multiprocessing? I didn't. Maybe I should; it might be interesting to see. I feel I've disappointed a lot of people today by not doing all this. It's time, it's fine. Yeah, if I could do it. But you know what, that's great: I want to put this code up. Well, it is up; I want to make it public, and feel free to contribute. And next year I can do the same talk, and it can take longer. Any more questions? Lunchtime? Yes. Thank you very much.