Hi, everyone. Thanks for being here, and thanks for the intro. As mentioned, my name is James Saryerwinnie, and today I'm going to be talking about downloading a billion files in Python. Really, this is a case study in trying to figure out whether we should be using multithreading, multiprocessing, or asyncio. To help motivate this, let's say we've been given a task. We come into work and we've been asked to do something: there's a remote server that stores files, and these files can be accessed through a REST API. Our task is to download all of the files from this remote server onto the client machine that's going to be running our code. Conceptually, that seems pretty straightforward, right? But we should ask some clarifying questions to make sure we understand the problem entirely. First: what machine is this going to run on? Do we have a cluster of machines, or just a single machine? Let's say, for this task, we have one machine we can use, with 16 cores and 64 gigabytes of memory, so a machine with a good amount of computing power. We also want to know about the network. What's the latency between the client and the server? Are we making calls across continents, or is it all in the same network? We'll say, for this purpose, that the client machine is on the same network as the service. Then we ask: how many files are on the remote server? As the title suggests, there are a billion files we need to get. Fortunately, the files are fairly small. And we're talking really small here: about 100 bytes per file. The last thing we should always ask is: when do you need this done? And of course we get everyone's favorite answer: please have it done as soon as possible. OK. So we start looking at this REST API.
Before we start getting into the various approaches, we want to understand how this API works, to better understand the problem. The API we have includes, of course, a way to get a file: given a file name, you can download the contents of that file. There's also the ability to list files, since we need to know what files are available on the service. The list API works the way a lot of APIs with large result sets work: it's paginated. Instead of returning a billion file names and saying, here are all the files on this server, it gives you a fixed number of files at a time, say 1,000 files per page. You get 1,000 file names and then a marker to get the next page, which you can think of as a pointer to the next page of results. Concretely, we get a JSON response with file names, a list of files, and a next marker, which is our pagination token. We can then request the next page of results and keep going serially through all of the files until we get no next marker, which means we've listed all the files. OK, so that's the list API. The other one, shown in that third row, is a GET with the file name in the URL that gives us the file contents. So this is the API we're working with. It's a little bit weird that there's "list" and "get" in the URL, but that's what we have. I think we now have enough to understand the problem and start experimenting with a few different approaches. Before we get into that, a couple of caveats, so that no one misreads what I'm trying to show in this talk: this is a simplified case study, and the results don't necessarily generalize. Just because asyncio or multithreading comes out ahead here doesn't necessarily mean you should always be using it.
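To make the API shape concrete, here is a hypothetical page of results from the list call. The exact field names ("file_names", "next_marker") and token format are assumptions based on the talk's description, not a documented API:

```python
# A hypothetical page of results from the paginated list API described above.
# The field names ("file_names", "next_marker") are assumptions from the talk.
example_page = {
    "file_names": ["file-000001", "file-000002"],  # up to 1,000 names per page
    "next_marker": "opaque-token-abc123",          # omitted on the last page
}

def get_next_marker(page):
    """Return the pagination token, or None when this was the last page."""
    return page.get("next_marker")
```

Each response hands back at most one page plus the token needed to ask for the next one, which is why listing is inherently serial.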
It's really just a specific case study showing that for this problem, with this client, this environment, and this network, this is what ended up working for me. But I do think that having a concrete example is still really helpful. One of the things I've struggled with is that you might understand how, say, multithreading or multiprocessing works, but really understanding when you should use one over the other, and what the trade-offs are, is still challenging. Hopefully, by sharing some of this data today, I can help people be more aware of what some of those trade-offs are going to be. That said, always profile, always test for yourself. There's no getting around that. But let's get into the various approaches. Before we do any of this multiprocessing or asyncio stuff, maybe we can just use a synchronous version. It's the simplest thing that works, and if it's good enough, we don't have to do anything crazy. It wouldn't be that interesting a talk, but we should at least make sure it works and see what the timing looks like. To be clear, by synchronous I mean we just take a page of the listing and download the files one at a time: for each file on a page, download the file, and once we're finished with that page, get the next page and do the same for all of those files. The code for that is fairly straightforward. We're just using requests, with a for loop to go over all of these pages. We have some constants at the top that match the API we saw in the previous slide, then we make the first list request to get the initial set of files, and then we have a for loop.
For each file in the file names, we make a call to download file, and we'll look at that definition in a second. After that, we check whether there are any more pages: if there's a next marker, we call requests.get on the list URL, providing the marker in the query string, and we keep looping until we've downloaded all our files. The download file function is also not many lines of code. We make a requests.get call to the remote URL, check that it's a 200 status code, then open the local file name we were passed and write the data out to that file. So those are the synchronous results. One request takes about 0.003 seconds. That's essentially the network latency of a single request, amortized over the list call and the get call, plus the overhead of processing it in Python and going through the requests library. Three milliseconds: is that good or bad? Well, if we take three milliseconds and multiply it by 1 billion, we get 3 million seconds, which is 833 hours, or about 34.7 days. If we were to say, OK, we have this version running, check in in a month and we'll have this done, that's probably not going to work, right? So the synchronous version isn't good enough, and we need something faster. OK. Now we can get to multithreading. This is the first thing we're going to look at. It's nice because it's been in Python for a long time, a lot has been written about it, and it's been fairly well studied, so it's easy to get started with multithreaded code. What we're going to use here is a producer-consumer queue. It's a fairly standard pattern in multithreaded code, and even in multiprocessing, which we'll look at next.
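The synchronous loop described above might be sketched as follows. This is a reconstruction, not the talk's actual slide code: the host URL is made up, the standard library's urllib stands in for requests so the sketch is self-contained, and the list call is injected as a callable so the pagination loop can be exercised without a real server:

```python
import urllib.request  # stdlib stand-in for the requests library used in the talk

HOST = "http://fileserver.example"  # hypothetical host; the talk doesn't show the real one
LIST_URL = HOST + "/list"

def download_file(filename, out_dir="."):
    # GET /<filename>, check the status, and write the body to a local file.
    with urllib.request.urlopen(f"{HOST}/{filename}") as response:
        assert response.status == 200
        data = response.read()
    with open(f"{out_dir}/{filename}", "wb") as f:
        f.write(data)

def download_everything(list_page, handle_file):
    """Walk every page serially, calling handle_file for each file name.

    list_page(marker) -> dict with "file_names" and an optional "next_marker".
    The talk's version calls requests.get(LIST_URL) directly; injecting it
    here lets the loop run without a network.
    """
    marker = None
    while True:
        page = list_page(marker)
        for filename in page["file_names"]:
            handle_file(filename)
        marker = page.get("next_marker")
        if not marker:
            break
```

In the real version, `handle_file` would be `download_file`, so every file is fetched one at a time in a single thread, which is exactly where the 3 ms per request adds up.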
What we're going to do is take one page of results and, instead of downloading it directly, put it on a queue. Worker threads then pull file names off that queue and download those files concurrently. In the main thread, the list-files loop keeps running and keeps putting work on the queue, while the workers continually download files. At the end, the workers put their results on a result queue, so we can track status and progress, look at errors, and that kind of thing. One observation here is that we're trying to parallelize where we can. Notice that the list call is inherently serial: there's no way to run multiple list calls in parallel, because to make the next list call we need the next marker from the previous response. We have to go in order. But once we have a thousand files, those can all be downloaded in parallel. So we make sure the things that can run in parallel do run in parallel, and the things that can't are separated out. We also get a nice decoupling: listing files is completely independent of downloading files. The code for this is, again, fairly straightforward. It's similar to the beginning, with our constants at the top, but then we create the two queues we saw in the diagram and set up our worker threads. For each of a configurable number of threads, we start a thread running a worker function, which we'll look at shortly. We also start a result thread that pulls from the result queue and prints out results.
If we look at the actual main loop, it's pretty similar to the synchronous version. We go through each of our pages; this is the main thread doing the list call. The only difference is that instead of calling download file, it now calls work_queue.put. The theory is that it's quicker to put something on an in-memory queue than to actually download the file in that thread. The worker threads, again, don't do a whole lot: each one waits for something to come through on the work queue, and once it gets a file name, it calls download file, the same function we saw in the synchronous version, just now running in a separate thread. OK. So we run this with 10 threads. How do we think we do? Better than the synchronous version? It turns out it's actually slightly worse: about 3.6 milliseconds per request, which comes out to about 41.6 days. Still not very good. Maybe we think, OK, that was with 10 threads, and things are happening concurrently, so let's just bump it up to 100 threads, because that'll make things faster, right? It turns out that makes things worse still: about 4.2 milliseconds per request, or about 48.6 days. So things aren't looking good so far. Why is this happening? Well, we're not really IO bound here. Threading is great when you have a lot of IO-bound work, but in this case, given the low, single-digit-millisecond latency of the network and the small file size, there's not a whole lot of IO. Writing a 100-byte file is very different from writing a two- or three-megabyte file.
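The producer-consumer setup described above might be sketched like this. It's a reconstruction under stated assumptions: the sentinel-based shutdown and the injected download function are my choices for a self-contained sketch, not necessarily what the talk's slides did:

```python
import queue
import threading

_SHUTDOWN = object()  # sentinel telling a worker thread to exit

def worker(work_queue, result_queue, download_file):
    # Pull file names off the work queue until the shutdown sentinel arrives.
    while True:
        filename = work_queue.get()
        if filename is _SHUTDOWN:
            return
        download_file(filename)
        result_queue.put(filename)  # lets a result thread track progress

def run(pages, download_file, num_threads=10):
    """pages: iterable of lists of file names (one list per list-API page)."""
    work_queue = queue.Queue(maxsize=1000)  # bounded, so listing can't race ahead
    result_queue = queue.Queue()
    threads = [
        threading.Thread(target=worker,
                         args=(work_queue, result_queue, download_file))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for page in pages:                 # the serial list call feeds the queue...
        for filename in page:
            work_queue.put(filename)   # ...and workers download concurrently
    for _ in threads:
        work_queue.put(_SHUTDOWN)      # one sentinel per worker
    for t in threads:
        t.join()
    return result_queue
```

Everything here stays on queue.Queue from the standard library, which matches the advice later in the talk about sticking to well-tested primitives.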
Things like GIL contention and the overhead of passing data from the main thread to the work queue, from the work queue to the worker threads, and from the workers on to the result queue all add up. Each of those queues has locks and condition variables, so there's a lot of work going on, and because there's not a lot of IO, all of that overhead ends up making things slightly slower. There are a couple of other things to keep in mind with the multithreaded approach, based on my experience working with multithreaded code on projects where the real code is much more complicated. Handling a graceful shutdown, where you hit Ctrl-C and want all the threads to actually end without force-killing them, is challenging. You can't shut down threads explicitly; you have to request that they shut down by setting some sort of event or variable that they periodically check, and hope they see it and exit. Debugging is hard as well. It's not deterministic, and you typically can't use pdb; you'll need the PyCharm debugger or something else that can handle multiple threads. And in my experience, the more you deviate from the standard library and build your own primitives instead of using things like queue.Queue, the more likely you are to encounter race conditions. It just always seems to happen. The things in the standard library are fairly well tested and have had a lot of people using them over the years, so I usually try to stick to them when dealing with threads. One more note: you might be wondering why we don't use concurrent.futures here. There is an executor.map, where you give it an iterable and it runs the work concurrently and gives you the results back. In this case, though, it would create a one-billion-element list.
If you try to create a one-billion-element list in Python, you will definitely run out of memory and crash, so we can't use that. But there is still a way to use concurrent.futures, and we'll look at it next with processes, where hopefully we'll have better luck. Remember from the problem details that we have 16 cores to use, and with threads we were only using a single core. Maybe if we can use multiple cores, things will go better for us. So with multiprocessing, instead of threads we're going to use processes, and we're going to try to use concurrent.futures, though we have to use it a little differently because we have such a large dataset. The way this works is that we take one chunk at a time, one page, and have all of these worker processes download it as quickly as possible: 1,000 files at a time. Once that's done, we move on to the next page, queue it up, and have those workers download the next 1,000 files. We keep going until we've got all the billion files. The code for this is slightly different. We have our constants at the top, the hostname and the list URL. Then we have an iterate-all-pages function; we're starting to put some abstractions in place, separating the iteration from the actual logic and orchestration of downloading files. Then there's a downloader class, which I'll talk about in a minute. The main logic is in the ProcessPoolExecutor. The concurrent.futures package provides a ProcessPoolExecutor that manages all of this for you, and by default it creates one worker per core, so in our case we'll have 16 workers. And all we do here is a two-step process.
First, we iterate through all of the pages and queue them all up; we start the parallel downloads by submitting each page with executor.submit, which gives us futures back. Then we wait for them to complete using futures.as_completed, which yields futures as they finish, in any order, and we look at the results. The iterate-all-pages function is kind of similar to before; the main difference is that instead of yielding one file at a time, we yield whole pages. We're hiding all of the pagination from the main process so it doesn't have to worry about it. Otherwise it's the same thing: we make the list request and keep looping while there are next markers, yielding more and more pages. The downloader is similar to what we saw in the synchronous version; we're just using a requests Session's get and then opening the file and writing it out. One thing you have to be careful of with multiprocessing is that, because it's passing data across processes, whatever your task is, whether it's a string or an object, has to be serialized, sent to another process, and deserialized. It uses pickle for that, so you have to make sure your objects can handle it. In this case we're using requests.Session, which can be sent across process boundaries. OK, so this is the multiprocessing version, using 16 processes. What do we think? Better than threading? Yeah, definitely better. In this case it's about 0.00032 seconds per request, an order of magnitude better: before we were at 3 milliseconds, now we're at about 0.3 milliseconds, which if we do the math comes out to about 88 hours, or about 3.7 days.
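The two-step ProcessPoolExecutor flow described above might look roughly like this. It's a hedged reconstruction: the function names, the stubbed-out download step, and the injected list call are mine, chosen so the pagination generator can run without a real server:

```python
from concurrent import futures

def iter_all_pages(list_page):
    """Yield one page (a list of up to 1,000 file names) at a time.

    list_page(marker) -> dict with "file_names" and an optional "next_marker";
    injected as a callable so pagination can be tested without a real server.
    """
    marker = None
    while True:
        page = list_page(marker)
        yield page["file_names"]
        marker = page.get("next_marker")
        if not marker:
            return

def download_page(filenames):
    # In the talk's version, a downloader class holding a requests.Session
    # fetches and writes each file here; this stub just reports the count.
    # It must be a top-level function so it can be pickled to workers.
    return len(filenames)

def main(list_page):
    # One worker process per core by default (16 in the talk's setup);
    # one page of 1,000 files per submitted task.
    with futures.ProcessPoolExecutor() as executor:
        pending = [executor.submit(download_page, page)
                   for page in iter_all_pages(list_page)]
        for future in futures.as_completed(pending):
            print("downloaded", future.result(), "files")
```

Submitting a whole page per task, rather than a file per task, is what keeps the future list at around a million entries instead of a billion.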
That's definitely better than a month, but it's still not great; we were hoping to do this a little quicker. Some things to keep in mind with multiprocessing: the speed improvement comes because we're now truly running in parallel. We're leveraging all 16 cores and actually making requests in parallel. But just like multithreading, debugging is much harder; pdb typically doesn't work out of the box. And there is overhead. What we were doing was sending each file name to a worker process to download, so there's a lot of IPC overhead in the main process saying, I want this worker to download this file and that worker to download that other file. So maybe we can improve things. There's also a trade-off between doing things entirely in parallel, as in the multithreaded approach, where listing files was completely decoupled from getting files, and doing things in parallel but in chunks, as here. Even though we weren't completely parallel, it still ended up faster, we got to leverage the standard library, and it was less code for us, which is great. OK, let's move on to asyncio. Hopefully things get better, right? With asyncio we're going to do something slightly different. It's a little hard to visualize, because what's different is mostly how it's run in the event loop. But conceptually, we make our list call like we've been doing, and then start a new asyncio task for each download. The lighter color here indicates a task that is running but not done. We spawn all of these tasks, move on to the next page, and start queuing up more tasks, and they're actually running now.
As we create more of these tasks, some finish, and we keep creating more, and the tasks can finish in somewhat arbitrary order. That's what we're doing with asyncio. The benefits here are mostly in how this is run. We're creating a new task for each one of these requests, which is something we really couldn't do with multithreading; we don't want to create a new thread for every file we download. But a task object is much more lightweight. With asyncio, everything is in a single process, on a single core, and you switch tasks when you're waiting for IO, or really any time something yields or waits on something. In theory this should keep the CPU busy; we should get much more efficient use of the core. Now, the code for this is a little more involved compared to the other approaches, and we're using some different libraries: asyncio from the standard library, plus two others. There's aiohttp, which is our substitute for requests, and uvloop. In my experience, uvloop is basically a free speedup. I haven't had any problems with it, and it just makes the event loop run faster, so I always use it. The setup still has the same hostname and list URL, but now we have a semaphore to control how many downloads are running, a task queue to make sure we only have a certain number of tasks at any given time, and a results worker that's similar to the results thread from the multithreading approach.
One of the things I found with asyncio, and really with any of these approaches, is that you can list files much, much quicker than you can actually download them. And because there are a billion files, if you don't cap that, you will eventually run out of memory, because you're queuing way too fast. The way we use aiohttp is with async with: we create a session, and from that session we iterate over all the files (we'll look at that definition in a bit). For each file, as we saw in the diagram, we create a task that downloads the file, spawning a new task per file, and put it on the task queue. The iter-all-files function is similar to iter-all-pages; the only difference, aside from using async and aiohttp, is that we yield one file name at a time. Again, we're hiding all the pagination details and just handing out file names. The download-file coroutine is again fairly similar: we use async with session.get to download a file, and we use the semaphore to control how many download tasks run at a given time. You can actually do this with aiohttp too, by setting a limit on the number of connections to a host; but using async with on the semaphore lets us control it in our own code, and it's more explicit, so it's easier to follow how many download tasks are running at once. One thing I want to point out, and this I'm not entirely sure about, is that there's an aiofiles library that I initially used to do all of the file IO; it basically gives an async interface to file IO.
I believe it does that behind the scenes using threads, but in my experience with this specific problem it ended up being a little bit slower. I suspect it's for the same reason the threads didn't help much: there just isn't a lot of IO. Normally, writing to a file inside an event loop is a really bad idea, because you're stopping the whole event loop, but when I compared the numbers this was actually faster, so it's what we went with. Again, I'm not entirely sure if there's a better way to do that type of IO. OK, so the asyncio results: faster than multiprocessing or not? Well, it turns out it's faster than the threaded and synchronous versions, but not quite as fast as multiprocessing. It comes out to about six and a half days, which was a little unexpected. Perhaps I was biased, but I was thinking asyncio would be the clear winner here. So here's the summary of the results so far: we went from 34 to 48 days with the synchronous and multithreaded code, all the way down to about 3.7 days with multiprocessing. So we'd say, out of those three approaches, go with multiprocessing, right? That's the clear winner; that's the takeaway. But for me, that still didn't seem like the best we could do. Normally I might stop here, but I was curious and wanted to see if we could do even better. My thinking was that asyncio is the best way to take advantage of a single core and a single thread, but multiprocessing is the best way to make sure we take advantage of all our cores. Maybe there's a crazy way to combine them all and get even better performance.
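Before moving on, the asyncio approach described above might be sketched as follows. This is a reconstruction under stated assumptions: the list call and the fetch are injected as coroutines (standing in for aiohttp's session.get), the store callback stands in for the blocking file write, and the concurrency cap of 100 is a made-up number. For brevity it also collects tasks in a plain list rather than the bounded task queue the talk uses, which with a billion files would itself exhaust memory:

```python
import asyncio

MAX_IN_FLIGHT = 100  # assumed cap; the talk doesn't give the exact number

async def iter_all_files(list_page):
    """Async generator yielding one file name at a time across all pages."""
    marker = None
    while True:
        page = await list_page(marker)
        for filename in page["file_names"]:
            yield filename
        marker = page.get("next_marker")
        if not marker:
            return

async def download_file(semaphore, fetch, store, filename):
    async with semaphore:             # cap how many downloads run at once
        data = await fetch(filename)  # aiohttp session.get(...) in the real code
    store(filename, data)             # plain blocking file write, per the talk

async def download_everything(list_page, fetch, store):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    tasks = []
    async for filename in iter_all_files(list_page):
        tasks.append(asyncio.create_task(
            download_file(semaphore, fetch, store, filename)))
    await asyncio.gather(*tasks)
```

The semaphore bounds concurrent downloads; bounding the total number of live task objects, as the talk's task queue does, is the separate safeguard against the listing side racing ahead of the downloading side.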
It turns out you also have to use threading as well, which I'll explain in a bit. So we're throwing everything at this, and we'll see how fast we can get it. I won't be able to show the code for this, because it's really complicated and we're running a little short on time, but I'll walk through how it works. We have a worker process; within the worker process there are two threads, and one of those threads is running an event loop. It's crazy, but stay with me. We're going to have 16 of these processes going, so 16 event loops, and the idea is to keep each core as busy as it possibly can be. We need a thread in each process because we need something to interface with the multiprocessing queues, which use pickle and sockets or pipes and aren't async-aware, and we need something to bridge that into async land and back. Here's roughly how the workflow goes. This was something I was playing around with, and it's a little involved. We take a pagination token, just some opaque value like "foo", from the main process, and send it to the worker process through a multiprocessing queue. The worker then sends it into the event loop through an asyncio queue. The event loop side then does the pagination, similar to what we saw in the asyncio approach: it gets a thousand files and a next marker, say "bar", for the next page of results.
Now, before it starts downloading, it takes that "bar" token and immediately sends it back to the thread, and that value eventually makes its way back up to the main process. Skipping a few steps, it goes to the output queue, which gets fed back into the input queue, which then goes to another process, which does the next pagination. While that's happening, the first worker process downloads its thousand files in its event loop, while another process is doing its own list call and will eventually download its own files. The reason I like this approach is that, number one, you leverage all the cores, and we clearly saw that doing so gives a nice performance boost. Number two, we download the files efficiently: we saw that asyncio is the best way to take advantage of a single core and a single thread. And third, we've minimized the inter-process communication overhead. Before, we were sending a message for every file we wanted a process to download; now we're just sending a pagination token, and since each page has a thousand files, we've essentially cut the IPC overhead by a factor of a thousand. So this is a really nice way to take advantage of everything we've seen so far in this talk. OK, combo results: better or worse? It turns out it's actually awesome: about 0.03 milliseconds, around 30 microseconds per request, which comes out to about 30,300 seconds, or 8.42 hours. So the real summary is that by combining all of these, we went from something that took about a month all the way down to about eight hours, about a third of a day, using asyncio, multiprocessing, and threading together.
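The key piece of the combo, bridging a non-async-aware queue into the event loop, might be sketched as follows. This is my own minimal sketch, not the talk's code (which the speaker says is too complicated to show): a thread-safe queue.Queue stands in for the multiprocessing.Queue so the sketch runs in one process, the sentinel and all names are made up, and handle_token stands in for the pagination-plus-download work:

```python
import asyncio
import queue      # stands in for multiprocessing.Queue so this runs in one process
import threading

DONE = "__done__"  # sentinel marking the end of the token stream

def bridge(loop, input_queue, async_queue):
    # Runs in a plain thread: multiprocessing queues aren't async-aware, so a
    # thread blocks on .get() and hands each item to the event loop safely.
    while True:
        item = input_queue.get()
        loop.call_soon_threadsafe(async_queue.put_nowait, item)
        if item == DONE:
            return

async def consume(async_queue, handle_token):
    # Runs in the event loop: each token triggers one page's list + downloads.
    while True:
        token = await async_queue.get()
        if token == DONE:
            return
        await handle_token(token)

def run_worker(input_queue, handle_token):
    # One worker process's two-thread layout: a bridge thread plus a loop thread.
    async def main():
        async_queue = asyncio.Queue()
        loop = asyncio.get_running_loop()
        feeder = threading.Thread(
            target=bridge, args=(loop, input_queue, async_queue))
        feeder.start()
        await consume(async_queue, handle_token)
        feeder.join()
    asyncio.run(main())
```

In the full design, handle_token would make the list call, immediately push the next marker back toward the main process, and then download the page's thousand files with aiohttp, which is what cuts the IPC traffic to one token per thousand files.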
So would I actually recommend this? Probably not; it's really involved to write. Some of the lessons learned: the different approaches span multiple orders of magnitude, and it's really a trade-off between simplicity and speed. Maybe multiprocessing is enough and four days is fine for you, or maybe you really need this done in a day. One last lesson: always set max bounds, because one thing I ran into all the time was memory errors from not bounding things correctly. But yeah, I hope you got something out of this, and that it gives you more context about multithreading, multiprocessing, and asyncio. That's all I have. Thanks, everyone. Wow, thank you, James, for getting us so many files in less than a day, when the first version took more than a month. Thank you. We're a bit short on time for questions, so I'll hang out outside if anyone has questions. All right, thank you very much.