Does that work? OK. So hello, everyone. My name is Pierre Glaser. I work at Inria, which is a research institute in France. Most of my team does research in machine learning for neuroimaging data, but there is a small subset of ninjas in this team who work mostly on open source projects. You may know them, because a bunch of them maintain the scikit-learn machine learning library. My job is, among other things, to improve the performance of scikit-learn when it's asked to do tasks in parallel, whether multi-processing, multi-threading, or multi-machine. So I'm going to talk about how we collaborate with the whole ecosystem, NumPy, Python itself, and a bunch of other libraries, to make multi-processing and multi-machine computation faster for everyone. The first section will basically be a walkthrough of why parallel processing is important in machine learning. The second part is going to be a catalog of the different libraries that do multi-processing and that scikit-learn, for example, uses. And the last part will be about the most leading-edge advances that we pushed into Python 3.8, and that hopefully you'll be able to benefit from. OK, I'm just realizing this slide doesn't fit well, but anyway. So, hopefully, by the third day or so of a Python conference, you all know that Python is used by at least a decent share of its users to do data analytics and machine learning. I think Python is really good for interactive development and fast iteration in data analysis. So there is a huge ecosystem of open source libraries that has gained a lot of momentum in the last 10 years: NumPy, used for matrix computation by hundreds of thousands of people, pandas, and of course scikit-learn. So why is parallel computing so important in machine learning?
I think it's because parallel computation happens a lot in machine learning. Machine learning, at its core, is statistics, and in statistics you tend to repeat independent operations to make estimates more robust. Sometimes you also want to span a parameter space to explore it. Computations like these, that you can run completely independently, are what we call embarrassingly parallel tasks. And embarrassingly parallel tasks happen a lot in machine learning: in cross-validation, in random forests. In a random forest, for example, you want to fit a model, and the model consists of a bunch of trees, so you can fit all these trees completely independently. Conceptually speaking, distributing such tasks shouldn't be a burden, right? You have tasks that are completely independent from each other, you have resources, say workers, processes, machines, and you want to distribute the tasks over them. There is no communication between the processes, so it's technically super easy, and it should be made easy for users of scikit-learn who would like to enable parallel machine learning in their code. So, as a summary: parallelization in scikit-learn is ubiquitous, and it's super easy to enable. For example, if you want to train a random forest classifier on four cores, you do it simply with the n_jobs parameter. It's a one-liner. Now, to get into a little more detail. There is a little overlap between this talk and other talks, maybe Victor's talk, but I'll still dive into it to make sure everyone here is up to date. Parallelism on a single machine exists in two different forms. You can create different processes,
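As a sketch of that one-liner (assuming scikit-learn is installed; the synthetic dataset and the parameter values here are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; stands in for whatever dataset you are working with.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# n_jobs=4 fits the independent trees on four workers; n_jobs=1 would be
# sequential, and n_jobs=-1 uses all available cores.
clf = RandomForestClassifier(n_estimators=50, n_jobs=4, random_state=0)
clf.fit(X, y)
```

Each tree is fit independently, which is exactly the embarrassingly parallel structure described above.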
so basically different Python programs that each have their own memory space, or you can have one single Python program executing different threads in parallel. Spawning multiple threads in a single Python program seems like the smart way, right? You save memory, you have everything in a single process, it's lighter. So why don't we always use threads? In this example here, you have three threads that each have to fit a different tree of a random forest, and they all point to the same data: to fit the model, they all share the same input data. But the problem is that Python, in general, forces threads to run sequentially. It's not always the case: for example, NumPy and SciPy are written in native code, and because they run native code they can release the global interpreter lock that prevents threads from running in parallel. But you can't count on that happening all the time. When you write your own Python program, unless you dig into the source code of the libraries you use, you don't know in advance whether your threads are going to run in parallel or sequentially. I think the safe way to enable parallelism in your code is process-based parallelism. You have different Python processes with their own memory spaces, and because they are completely independent processes, if you have multiple cores on your machine, you can actually ensure that the different sequences of Python bytecode are going to run in parallel. So, as long as your code is CPU bound and not IO bound, you're going to get a speedup. This scheme also illustrates how many copies of the data you have to make, because at this point all processes have different memory spaces.
You have to copy your data into the memory spaces of the different processes, which adds overhead. First, it increases the memory footprint of your program: there is a bunch more data just sitting in RAM. And there is also the overhead of transferring the data from the main process to the others, which can take a while, especially if the data is large. Shared memory is a thing, and it becomes an official thing in Python 3.8. It was also a thing before, but you probably didn't know it, because it was hidden, not in the standard library, just behind some weird hacks that we used in scikit-learn to enable shared memory between processes. Let's not talk too much about that, because it adds complexity; I'm just telling you that it exists, and that it also helps make multiprocessing faster. This graph shows benchmarks of fitting different scikit-learn models. It's a 2D scatter plot: each point represents one machine learning model, the x-axis is the time it took to fit that model sequentially, and the y-axis is how long it took to fit the same model using four workers. You would expect most of the points in this graph to be close to the green line, because the green line represents the ideal speedup. Its equation is y = x / 4: the ideal speedup you can expect from running a model on four workers instead of one. So hopefully all the points are distributed along this line. Of course it's not exactly going to be the case, because of all the data passing and so on, but we can hope. In practice, here is what we get. You'll also notice two colors: purple points and yellow points. The yellow points are runs where the parallel backend was process-based, so we didn't run different threads,
we ran different Python processes. The purple points are runs where the parallel workers were threads. As you can see, not all points lie close to the green line. Most of them are in the good-enough zone, the green zone, close to the optimal speedup. But some of these points, and you'll notice it's only purple points, are far away; they are outliers. The red line marks the break-even point: above this red line, there is no speedup at all from fitting a model using several workers. So these points represent cases where fitting a model using several workers is worse than fitting the model sequentially. This could be considered a bug, but as you can see, it's only purple points, so it typically represents the global interpreter lock contention that can exist in Python. I think the takeaway of this graph is that process-based parallelism almost always gives you a proper speedup, provided the data is not too big and fitting your model is not too fast. And that's why I think we have to focus on making multiprocessing faster: for now, it's the safe way to enable parallel processing in Python. Also, many of the improvements you make for multiprocessing carry over to multi-machine computing, because the concepts are the same: you have to pass data around, between machines or between processes, it doesn't matter that much. So as soon as you make an improvement for multiprocessing, you also make an improvement for multi-machine execution. The second part of this talk is going to be more of a catalog of the libraries that provide multiprocessing capabilities for your code.
But more than a catalog, I think it represents the development workflow my team went through when they wanted to have multiprocessing code run in scikit-learn. They started from the standard library, from multiprocessing, then realized that sometimes it was just not enough, so we ended up creating our own library and building on top of it to make things super easy for data scientists. A key takeaway of this part, and I'm sorry for giving the conclusion before even running the slides, is that a lot of what we did was to ease the usage of multiprocessing code for data scientists. First, let's talk about the standard library. The standard library has a module that costs its maintainers a lot of pain, making sure it runs on all the different platforms: it's called multiprocessing. multiprocessing is a huge module providing wrappers around system calls to create processes and to make those processes communicate. You can create the processes, you can create data structures such as queues to make the processes communicate, and you can synchronize the processes using locks. It's a very rich library, but it's also a rather low-level library. When you think about embarrassingly parallel tasks, independent computations that can run completely in parallel with no communication between the workers, the programs that parallelize them all have a common structure. What you want to do is create a queue, put all the tasks you want to parallelize into that queue, spawn a bunch of workers that each pull tasks from the queue, and then have the workers send the results back to the main process.
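That common structure can be sketched with the standard library primitives directly; the `square` task function here is made up for illustration:

```python
from multiprocessing import Process, Queue

def square(x):
    return x * x

def worker(task_queue, result_queue):
    # Pull tasks until the None sentinel, send results back to the main process.
    for x in iter(task_queue.get, None):
        result_queue.put(square(x))

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(2)]
    for w in workers:
        w.start()
    for x in range(6):          # queue all the embarrassingly parallel tasks
        tasks.put(x)
    for _ in workers:           # one sentinel per worker to shut them down
        tasks.put(None)
    print(sorted(results.get() for _ in range(6)))  # [0, 1, 4, 9, 16, 25]
    for w in workers:
        w.join()
```

Every embarrassingly parallel program repeats this same queue-plus-workers boilerplate, which is exactly why it got abstracted away.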
And this happens whatever your use case is, as soon as you deal with embarrassingly parallel tasks. So I think this should be abstracted away, and the multiprocessing developers thought so too: they created a class called the worker pool, the multiprocessing.Pool class, that abstracts all of this away and that you interact with when you want to run embarrassingly parallel tasks. Let's say for example I want to greet all my friends. I have two friends only, Alice and Bob. So I map these greetings using the pool: pool.map of greet over Alice and Bob, and it runs in parallel. You don't have to create a queue, you don't have to create processes; it all happens completely smoothly. The problem with multiprocessing is that in some cases it's not super portable. I think we sometimes forget Windows. I develop on Linux, for example, but a lot of data scientists run on Windows, and what data scientists do on a day-to-day basis is interact with Jupyter notebooks. They launch a notebook in the browser, run some code, interact with it, figure out it's not exactly what they want, rerun cells, and then run machine learning models, and everything happens in an interactive session. The problem with multiprocessing is that it's not super good at parallelizing code that was defined in an interactive session. And when I say it's not super good, I mean it's just not possible. So that's the first big concern. I don't have the exact numbers, but I think around 50% of scikit-learn users are on Windows, so we can't really leave them out. Another problem is that even on POSIX systems, the default way multiprocessing starts processes is arguably not POSIX compliant: it only calls fork, and not exec afterwards.
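The greeting example from above looks like this with multiprocessing.Pool (the `greet` function is just the toy example from the talk):

```python
from multiprocessing import Pool

def greet(name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    # Pool hides the task queue, the worker processes, and the result
    # passing entirely behind a single map call.
    with Pool(processes=2) as pool:
        print(pool.map(greet, ["Alice", "Bob"]))
        # ['Hello, Alice!', 'Hello, Bob!']
```

Note that if `greet` were defined in an interactive session instead of a module, this is exactly where plain multiprocessing would fail on Windows.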
This actually causes crashes with some external libraries that are very useful when you want performance in your code, in particular the GNU implementation of OpenMP. So this is another problem, and this one doesn't happen on Windows; it happens on POSIX systems. Also, multiprocessing pools are not super user friendly when it comes to unexpected worker crashes. Workers can die suddenly, for example if they segfault, or if they consume too much memory and get killed by the OS. And at that point, multiprocessing pools will just wait for them to finish; they are never going to finish, they just hang until you hit Ctrl-C or something, which is not great. Because of all these problems, which sat at the core of the module and were not easy to fix upstream, we decided to create a whole new library. It's not quite created from scratch, it's actually derived from concurrent.futures, but I'm not going to dwell on that. This library, which is what scikit-learn actually uses, is called loky. loky is another library that provides worker pools. It's a third-party package. It supports all Python versions; hopefully we'll drop support for Python 2.7 soon, so please update to Python 3. It has consistent behavior on Linux, on macOS, and of course on Windows, and it works in interactive shells. So it's the way to go if you want to parallelize code used by data scientists, because, I cannot stress this enough, working interactively is a key advantage of Python and we have to support it. loky's API for worker pools is basically the same as another library in the standard library, concurrent.futures. I would say it's a drop-in replacement, although I'm not 100% sure; at least the syntax is very, very similar.
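To show how similar the two APIs are, here is the same toy task with the stdlib concurrent.futures executor and, commented out, the loky equivalent (loky is a third-party package; `get_reusable_executor` is its documented entry point, but treat the exact call as an assumption):

```python
from concurrent.futures import ProcessPoolExecutor

def cube(x):
    return x ** 3

if __name__ == "__main__":
    # Standard library version.
    with ProcessPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(cube, range(4))))  # [0, 1, 8, 27]

    # loky version (third-party), nearly the same syntax, but its workers
    # also handle interactively defined functions and survive crashes:
    # from loky import get_reusable_executor
    # executor = get_reusable_executor(max_workers=2)
    # print(list(executor.map(cube, range(4))))
```

The near-identical surface is the point: you can move from the standard library to loky without restructuring your code.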
So for example, here you parallelize a task using concurrent.futures, and here you parallelize the same task using loky: basically the same syntax, right? Why did we create loky instead of just using concurrent.futures? concurrent.futures has the same problems in interactive sessions and so on, so we couldn't use it; we had to create a new library. It's not because we like writing code, it's because we had to. Finally, the last part of this catalog of libraries: let me introduce you to joblib. joblib is a library built on top of loky that was created to ease the development workflow of data scientists. joblib is what scikit-learn actually talks to: scikit-learn doesn't interact directly with loky, it interacts with joblib when it wants to parallelize tasks. It's a parallel computing library, and among the features it includes, I think the killer one, which I didn't know about when I started using it, is memoization. Basically, suppose you want to parallelize a bunch of CPU-bound tasks that take a lot of time. Let's say, and this example is going to be silly, you want to greet a thousand friends. You start greeting all your friends in parallel, and at the five hundredth friend, you crash. Well, joblib remembers the tasks you already did. Greeting a friend may take a lot of time, you have to ask about their kids and so on, and joblib has collected the results of those tasks and stored them on disk. So if for some reason the joblib call failed in the middle of executing a bunch of tasks, all the results you already computed are stored on disk, and if you rerun the call, you won't have to rerun all the tasks.
For the tasks you already ran, you only have to fetch the result from disk; the tasks you didn't run yet, you still have to run. This is really, really useful. Another useful feature of joblib that's not in loky is optimized transfer of NumPy arrays. As I said, data scientists work with NumPy arrays all the time, and if you want good speedups from multiprocessing, you have to make sure transferring a NumPy array doesn't take too much time. joblib takes that into account. It also has a backend-agnostic user API, so it's really easy to switch from a threading backend to a multiprocessing backend, and here is a two-line code snippet that shows how to do it. So that was it for the catalog. The last part of this talk, I absolutely don't know how much time I have left. Okay, 10 minutes, great. The last part is going to get a bit more involved. It's about what happened recently in the multiprocessing corner of Python, and the problems in this area that we managed to fix, not socially but in code. By now, I think you've understood that what matters when writing multiprocessing code is making sure that the data you transfer from one process to another is transferred fast, and also, because you have many processes holding copies of the same data, making sure that the memory footprint of the overall system of workers is not too high. These are the two main things I worked on over the last year, plus one last thing, fixing deadlocks, typically pools hanging because a worker crashed, but I'm not going to talk much about that last point; I'll talk about the first two. One disclaimer before I start: this is all about CPython. I'm sorry for the PyPy and IronPython folks. This is all about CPython.
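A sketch of the two joblib features just mentioned, assuming joblib is installed; the cache directory and the `slow_greet` function are made up for illustration:

```python
import tempfile
from joblib import Memory, Parallel, delayed

# Disk-backed memoization: results survive crashes and reruns, so a failed
# batch can be restarted without recomputing the finished tasks.
memory = Memory(location=tempfile.mkdtemp(), verbose=0)

@memory.cache
def slow_greet(name):
    return f"Hello, {name}!"

names = ["Alice", "Bob", "Carol"]

# Backend-agnostic parallel API: switching backend is a keyword change.
threaded = Parallel(n_jobs=2, backend="threading")(
    delayed(slow_greet)(n) for n in names)
processed = Parallel(n_jobs=2, backend="loky")(
    delayed(slow_greet)(n) for n in names)
print(threaded)  # ['Hello, Alice!', 'Hello, Bob!', 'Hello, Carol!']
```

The second `Parallel` call is the same code with a different backend; that is the two-line switch the talk refers to.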
Maybe those other implementations don't even have these problems, I don't know. So, when you want to transfer data from one process to another, you have to perform an operation called serialization. Think about it: when you run a Python program, you have a bunch of objects in memory, and these objects form what we call an object graph, because an object can hold a reference to another object. Say I am an instance of a class and I have an attribute that is another instance of some class: there is an edge between the two vertices that are these objects. In the end, if you look at the complete memory layout of your Python program, it's one big object graph. And my use case is: I have an object in one process and I want to send it to another process. To serialize this object, you have to make sure that all the objects in the object graph that are linked, directly or transitively, to the object you want to serialize are also properly serialized. So serialization is a recursive operation, where you make sure all objects linked to the object you want to serialize are serialized too, and at the end you hit completely atomic objects that don't hold references to other objects, like integers; integers don't hold any references. They are the leaves of the graph. Well, at that point it's a tree rather than a graph, but anyway. This little schema illustrates what you want to do: take an object and convert it to a format suitable for inter-process communication. Bytes are a good format for inter-process communication.
So here I have an estimator, say an instance of a scikit-learn class, and I serialize it into a bunch of bytes; these bytes can be sent through a pipe and read back by another process, which recreates the estimator by executing the instructions contained in those bytes. This whole thing, of course, you don't have to do yourself: there is a module in the standard library that does it for you, called pickle. Actually, pickle is a protocol, and the pickle module is an implementation of this protocol. Here is a very simple code snippet: I have a list of three elements, I serialize it using pickle.dumps, and I get a sequence of bytes. Then, in the snippet below, I launch a completely independent Python process, I send it this bunch of bytes, and using pickle.loads it can completely recreate the original object. This illustrates that from one process to another, I sent a list. That's pretty cool. But the pickle implementation in the pickle module is by design somewhat limited, and not for arbitrary reasons. For most things for which serialization isn't supported, you can generally submit a patch and the maintainers will be happy. But there is one feature that Python core developers are really sensitive about and don't want to add, and it's understandable, it's not irrational: with pickle, it is not possible to send interactively defined functions. Now look back at the workflow of a data scientist: you create a bunch of functions, and then you want to execute them in parallel, and you want to iterate fast. You don't want to leave your session and rerun everything; you just want a bunch of code cells that you modify and run on the fly, and then send to workers.
And typically in this situation, pickle will fail at sending those functions. It's just not possible, using the pickle module's serialization, to run functions in parallel when you're a data scientist working in a Jupyter notebook, especially on Windows. Typically, this example here, where I create a function completely dynamically, a lambda, won't work: it throws a pickling error. I don't know if you've ever seen that error while running your code; it's an error I've seen many times, and it's frustrating. So in practice, many distributed computing libraries, and scikit-learn in general as well, don't rely on pickle internally. They rely on pickle extensions that palliate this problem, ultimately not to make things faster, but simply to make this kind of code parallelizable at all. In practice, Dask (the second logo), Ray (the first logo), scikit-learn, Spark, Prefect, all those libraries don't rely on pickle internally; they rely on pickle extensions such as, for example, cloudpickle. And cloudpickle, here in the last snippet, completely manages to serialize this function. But here's the problem with pickle extensions. The pickle module is actually implemented in two different ways: once as a C extension and once in pure Python, both in the standard library. And the C version is about 30 times faster at serializing big objects than the pure Python one. So when you extend pickle, you don't want to extend the Python version, right? You want to extend the C version. They both provide the same classes: at its core, serializing objects in Python means calling methods of the Pickler class, and both the C module and the Python module have the same Pickler class. What you want, when you write a pickle extension, is to extend the C Pickler class. But sadly, that's not possible.
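Both behaviors can be reproduced with the standard library alone: round-tripping a plain list works, while a dynamically defined lambda raises the pickling error mentioned above (cloudpickle, if installed, serializes the lambda by value instead):

```python
import pickle

# A plain container round-trips fine.
payload = pickle.dumps([1, "two", 3.0])
assert pickle.loads(payload) == [1, "two", 3.0]

# A lambda cannot be pickled: pickle stores functions as a module-level
# reference, and an anonymous function has no importable name.
try:
    pickle.dumps(lambda x: x + 1)
except (pickle.PicklingError, AttributeError) as exc:
    print(type(exc).__name__)

# cloudpickle (third-party) serializes the code object itself:
# import cloudpickle
# blob = cloudpickle.dumps(lambda x: x + 1)
# assert pickle.loads(blob)(2) == 3
```

The cloudpickle output is loadable with plain pickle.loads on the receiving side, which is why worker processes don't need cloudpickle themselves to reconstruct the function.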
At least, it wasn't possible, spoiler alert. In Python 3.7, for example, if you were using cloudpickle, you had to rely on the Python Pickler. So say I want to serialize a large list: it's going to take about 30 times longer using cloudpickle than using pickle, for the same dumb object. It's very frustrating, and it was causing real slowdowns in scikit-learn, for example when sending the vocabulary of a TfidfVectorizer, these kinds of things. So this was definitely something we had to improve. And actually, since Python 3.8, pickle extensions can now extend the C-optimized Pickler. This is a real achievement, because it helps a lot when sending large objects, not NumPy arrays, but large lists or dicts, which are typically attributes of scikit-learn estimators. If we run exactly the same benchmark on Python 3.8, we see that the difference in speed is no longer significant. The takeaway is that the order of magnitude is now the same: using cloudpickle is no longer an order of magnitude slower than using pickle. So this is one thing that got improved in Python 3.8. One last feature, which is maybe even more significant, is the birth of pickle protocol 5. Pickle has a bunch of protocols: basically, each time you want to make a substantial addition to the pickle protocol, you have to create a new protocol version, and there were already several of them. This latest protocol addition was made by a Python core developer called Antoine Pitrou. He realized, and he wasn't the only one to realize, that pickle was originally created for on-disk persistent storage of Python objects: you have Python objects in memory and you want to store them on disk. But that's not how it's mostly used nowadays.
How it's used nowadays is as a way of communicating data between Python workers: not storing objects on disk, but communicating data. And because it wasn't designed for that, the RAM usage was suboptimal: we were making spurious memory copies. This actually caused workers to be killed by the OS for using too much memory, which caused tons of deadlocks, and it was a pain for developers of distributed computing libraries such as Dask, joblib, and all the others. This problem was completely solved in pickle protocol 5, thanks to a new addition called the PickleBuffer. It's a low-level object that you will rarely see, but using PickleBuffer ensures that no copies are made when dumping or loading the large contiguous buffers held by NumPy arrays or Arrow tables. So typically a pandas DataFrame, or a scikit-learn model that holds references to such objects, will be serialized super fast, and with no memory overhead. It's a really big addition, and I think it goes even one step further: it's not only a memory optimization. It allows delegating the compute-heavy part of serialization to third-party libraries. You probably know what a NumPy array is, but at its core, a NumPy array is just a bunch of metadata, the shape, the strides, the flags, plus a huge contiguous data buffer. And what I want to do personally, as someone extending pickle, is this: I see a NumPy array coming and I want to serialize it. When I see the lightweight objects, the flags, the strides, the shape, I want to hand them to pickle, because pickle is good at reconstructing Python objects. But when I see the huge data buffer, I just want to take it and serialize it my own way.
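The out-of-band mechanism can be sketched with the standard library alone; here a bytearray stands in for a NumPy array's data buffer (this is the raw PickleBuffer machinery from PEP 574, not scikit-learn's actual code):

```python
import pickle

# Stand-in for a large contiguous data buffer (e.g. a NumPy array's memory).
data = bytearray(b"A" * 1024)

# The buffer_callback collects buffers instead of copying them into the
# pickle stream: the stream only keeps a reference.
buffers = []
stream = pickle.dumps(pickle.PickleBuffer(data), protocol=5,
                      buffer_callback=buffers.append)
assert b"A" * 1024 not in stream      # the payload is NOT inside the stream

# The receiving side supplies the buffers back, without an extra copy.
restored = pickle.loads(stream, buffers=buffers)
print(bytes(memoryview(restored)) == bytes(data))  # True
```

In real use, the collected buffers would travel over a separate zero-copy channel (shared memory, a socket with scatter/gather IO), while the small pickle stream carries only the metadata.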
And then, in the process that recreates this object, I take the two streams, the stream from plain pickle and the stream I created when serializing my buffer optimally, and reassemble them. That's what pickle protocol 5 does: it allows what we call out-of-band serialization of big data buffers. And that's very useful for all the distributed computing libraries, Dask, Arrow, scikit-learn, and all of those. So it's time to draw conclusions. I think what we've seen is, first, that parallel computing makes things faster in general in scikit-learn; that's what the benchmark graph showed. Which backend to use is a problem-specific question: you have to try the different options and see what works best for you, depending on what your code is bound by. And finally, working with upstream is worth the hassle; it makes the ecosystem safer. So please, work with upstream. [Host] Thank you very much for the very nice talk. We have time for a few short questions. [Audience] I didn't understand why the CPython developers don't care about the interactive users. [Pierre] I think it's a very, very difficult question. A lot of people complain about the security aspects of pickle; pickle is known to be unsafe. Typically, when you define a function interactively, you cannot serialize it as a reference to its module, because the main module is not the same in all of your workers. Generally, when you serialize a function, for example the exponential function from the math module, the byte string will only contain instructions to fetch the function back from the math module. It's a very lightweight approach: there is no code serialized.
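The by-reference behavior described here is easy to verify with the standard library: pickling math.exp stores only the module and attribute names, a few dozen bytes, and unpickling looks the function back up:

```python
import math
import pickle

blob = pickle.dumps(math.exp)
print(b"math" in blob and b"exp" in blob)  # True: just a named reference
print(len(blob) < 100)                     # True: no bytecode is serialized

# Unpickling fetches the function back from the math module on the
# receiving side, so that module must exist there.
assert pickle.loads(blob) is math.exp
```

An interactively defined function has no such importable home, which is why it has to be serialized by value instead, code objects and all.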
But if you create a function interactively, there is no way to fetch it from the main module of the other process, because it just won't be there. So you have to serialize the actual code: when you serialize an interactively defined function, you serialize code objects, you serialize Python virtual machine instructions. And that can be seen as somewhat unsafe. So there are fundamental problems with serializing interactively defined code. And actually, in the cloudpickle documentation, we stress the fact that cloudpickle is not super safe, and that you shouldn't load a cloudpickled object if it doesn't come from a verified and secure source. [Host] I'm afraid we don't have time for any more questions. Thank you very much. Give him a round of applause.