All right, so Joris is going to give the next talk. So I'm going to talk about Dask, a way to work with Python in more parallel and distributed ways. But first, a small intro for those who are not really aware of the Python ecosystem of scientific software and data tools. It's a very rich ecosystem. One of the foundational libraries is NumPy, which provides the numerical base for doing numerical computing in Python. A lot of other libraries build on top of this core: the plotting libraries, the typical scientific routines, symbolic computing, pandas for data analysis. On top of this base you have a lot of packages, a rich ecosystem. A bit more detail about pandas, because I'm going to mention it a bit more. Maybe just to get an idea: who knows pandas? Quite some. Who uses pandas? Quite some. Okay. So maybe to clarify: I'm a core developer of pandas, not of Dask. The thing that I'm presenting today is just something that I have used a little bit and find a very useful tool. For those who don't know it, pandas provides high-performance, easy-to-use data structures in Python. It's typically used for tabular, spreadsheet-like data. So not for images or large arrays, but for structured, tabular data. You can read and write a lot of data files, and you can easily do a lot of computations, group-by operations and things like that. I'm not going to go into detail on that. So there is a very rich data science ecosystem in Python. There are only some problems with it. Most of the libraries, not all of them but most of the ones I showed, are restricted to in-memory data and to single-threaded, single-core computing. For example, you can work with large pandas data frames very easily, as long as your data frame fits into memory, and preferably fits comfortably, so you need quite some memory. So that's a problem if you want to keep on using your familiar NumPy, pandas and scikit-learn tools, but your dataset grows and you get bigger data, or you want to parallelize your workflows. That's where Dask comes in. It's a flexible library for parallelism. What is it exactly? It does parallel computing, which lets you work on datasets that are larger than memory, to solve what I mentioned on the previous slide. But it's a very simple library, written in pure Python. How does it achieve this goal of efficient parallel and distributed computing in Python? It just uses the existing ecosystem. It doesn't do any computation itself; it just builds task graphs based on blocked algorithms, schedules these graphs, and tries to do that efficiently. Now, I had a little demo to quickly show you some things, but most of them are also on my slides. So what is the idea? Dask provides some collections. For example, it provides Dask arrays, which try to mirror the NumPy array interface, but under the hood a Dask array is just a collection of many NumPy arrays that don't all have to be in memory. The Dask array coordinates that collection of NumPy arrays. If you, for example, have a very big array that doesn't fit in memory, or you want to do a sum in parallel, it will read in the different chunks, take the sum of each chunk, and then combine those partial sums. That's the idea of blocked algorithms: Dask knows about NumPy arrays and knows how to make a task graph that can do such algorithms, but the actual computation is still just done by NumPy.
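As an illustration of that idea, a minimal sketch of what such a blocked sum looks like in code (the array shape and chunk size are just arbitrary illustration values):

    import dask.array as da

    # A large array of ones, split into 1000 x 1000 NumPy chunks.
    x = da.ones((100000, 100000), chunks=(1000, 1000))

    # This only builds the task graph; nothing is computed yet.
    total = x.sum()

    # Only now is the graph executed: NumPy sums each chunk and the
    # partial sums are combined.
    result = total.compute()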
In that way, Dask arrays are used, for example, in the xarray project, typically for large climate datasets, for example the temperature of the full globe for several years: typical array data that don't fit in your memory, but that you can still work with on your own laptop. I can work on my own laptop with a very large dataset, because Dask doesn't read in all the data at once, and it can also use the different cores of my laptop to work more efficiently. Many of the functions that are available in NumPy are provided by Dask as well, so it's very familiar to use if you already know NumPy or pandas. The same holds for the Dask data frame, the same concept, but now instead of wrapping many arrays, it wraps many pandas data frames. Again, in this way you can work in parallel, out of core, with a very large data frame that doesn't fit in your memory, or, in the more distributed case, where your actual files or data are distributed over the different nodes of your cluster. Again, a small syntax example: instead of reading with pandas, you do the read_csv from the Dask data frame module (a small sketch of this follows below). What Dask does is that all those operations, reading a bunch of CSV files, doing a group-by operation, taking the mean of a certain column for example, are all done lazily. It builds a task graph to do those computations, and that's the difference with using pandas directly. It's only when you explicitly say "now I want to compute my result" that the task graph that has been built up is executed, in parallel or distributed on a cluster. For example, if you take a simple sum, the graph will not look that complex: these are all your chunks, all your different arrays; it reads them in, takes a sum of each, and combines them, so it typically reduces them to smaller values. But you can also get much more complex graphs, for example in this case a matrix multiplication. With time series in pandas, you can typically do a resample to convert, for example, from an hourly time series to, in this case, a weekly time series. So that is Dask: it doesn't do any computations itself, as I said, but it knows what a resample is, it knows how to do that and how to translate it into a task graph. It knows, for example, that if I do a resample, at the boundaries some data will have to be passed on from one chunk to the other to create a continuous time series. Another example is a rolling operation, so a moving average or moving window: if your data are divided into different chunks, each time the last part of a certain chunk will have to be combined with the next chunk to have a continuous rolling operation. Of course, those collections work nicely if the exact computation you want to do fits an array or a data frame, but that's not always the case. So Dask is also very flexible: you can create such task graphs yourself, it can be an arbitrary task graph, and it also provides some functionality to do that more easily. For example, here is a very simple example: you do some for loops, you call a certain function on some data, you store the results in a dictionary, and then at the end you have another function that combines those results into a final output. If you want to do this in parallel, or you want to distribute it, then the only thing you need to do with Dask is import the delayed function and wrap your function calls with this delayed function.
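First, to make the data frame syntax example mentioned above concrete, a minimal sketch (the file pattern and the column names 'name' and 'value' are hypothetical):

    import dask.dataframe as dd

    # Reads a whole collection of CSV files as one logical data frame.
    df = dd.read_csv('data-*.csv')

    # Recorded lazily in the task graph, with the same syntax as pandas.
    result = df.groupby('name')['value'].mean()

    # Only .compute() triggers the parallel execution and returns a
    # plain pandas object.
    print(result.compute())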
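And the for-loop pattern just described, wrapped with delayed, might look roughly like this (process and combine are hypothetical placeholders for the real functions):

    from dask import delayed

    def process(item):
        return item * 2          # stand-in for an expensive computation

    def combine(results):
        return sum(results)      # stand-in for combining partial results

    data = [1, 2, 3, 4]
    results = {}
    for item in data:
        # Wrapping the call with delayed records it in the task graph
        # instead of executing it right away.
        results[item] = delayed(process)(item)

    # Also lazy: the final combining step just extends the graph.
    total = delayed(combine)(list(results.values()))
    # total.compute() would then run the whole graph in parallel.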
By using this delayed, the code won't execute directly when you run it, but it will build up the corresponding task graph, and then you can let Dask compute it. So those collections, the Dask data frames and arrays, create task graphs, and then we need to run those task graphs. Here you see an example: all the blue ones are done, and the red ones are either being calculated or keeping their results in memory because there is a later task that depends on them. So the task graph is very central in the Dask library. As I already said, you have arrays and data frames, and you also have bags, which are more for working with plain Python lists or dictionaries, for example if you have large JSON files to process. The scheduler is the part that will execute the task graph. There are a few schedulers built into Dask. The main one here is the single-machine scheduler, which you can use on a single laptop to work with different threads or with different processes, but you also have, in an external library, the distributed scheduler. And that's the nice thing: you have different schedulers, but this part, the way you interact with your code, how you write it, how you work with arrays or data frames, is almost exactly the same, whether you work on your own laptop, for example with multiple processes, or you work on a cluster with a lot of nodes. That makes it very easy to change where exactly you run it, because the only thing you have to change is the scheduler at the start. So with the single-machine scheduler, which is part of Dask, you can choose between multiple threads or processes. For working with arrays, it will by default use threads, because NumPy and pandas can release the GIL for many operations. This is a very small and stable part of the library. With the distributed one, you still build the task graph on your own machine, but the scheduler can be on a remote cluster with many workers. That distributed scheduler tries to optimize where the computations happen, because the data will be spread over the different workers, so the computations should happen on the workers where the data are. Built into the project there are also some visual dashboards where you can see, for example, the different tasks that are executed over time, which makes it very easy to understand what is happening, but also to look into problems. For example, if you see a lot of white space, that means there are no computations running, so you can diagnose certain problems. There are also interactive graphs, so you can see what exactly is happening at certain points in time. The red ones are the data communication; if you see a lot of red, that is also a sign that something may be wrong. So to summarize: Dask is a task scheduler that feels very familiar if you already know NumPy and pandas, so you can very easily start using it. It's very flexible: you can create custom graphs as well, or use the built-in collections. It scales up and scales down: with almost the same code, you can run it on your own computer or on a cluster with thousands of cores. It's also very useful for interactive computing: even when your computations are running on a large cluster, you can interact with them directly. How does it do it? By building on the existing Python ecosystem.
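As an illustration of that scaling with almost the same code, a hedged sketch of switching schedulers (the scheduler address is a hypothetical placeholder):

    import dask.array as da
    from dask.distributed import Client

    x = da.ones((10000, 10000), chunks=(1000, 1000))

    # Single-machine scheduler: run the graph in local threads (the
    # default for arrays) or in local processes.
    x.sum().compute(scheduler='threads')
    x.sum().compute(scheduler='processes')

    # Distributed scheduler: the same code, but the graph is sent to a
    # scheduler that coordinates many workers on a cluster.
    client = Client('tcp://scheduler-address:8786')  # hypothetical address
    x.sum().compute()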
Just to finish, I want to give thanks, because a large part of these slides was based on slides by Matthew Rocklin and Jim Crist, core developers of Dask, who are funded by Continuum Analytics, the providers of the Anaconda Python distribution. So that's it. Thank you very much, Joris. Time for questions. Yeah. So the question was how it compares to PySpark. I don't know exactly, performance-wise or things like that, but the main thing is, of course, that Spark is a very established technology and this is a much newer library. For me, the main advantage is that it uses the existing Python ecosystem, so it's also more in the C environment and not in the Java, JVM-based environment, and you don't have to switch between the two; that makes it more familiar if you're already using the existing Python ecosystem. So it tries to fill the gap that exists in the Python ecosystem, so that you don't have to switch to other ecosystems. Great. Any more questions? Yes. How easy is it to reason about the performance of the operations that you're doing in Dask if they're hard to parallelize? Yeah, sorry, the question was how easy it is to reason about the performance of hard-to-parallelize problems. The visualizations that are built in make it a lot easier to look at what exactly is happening; you can see which workers are the bottlenecks. Certainly, if you work with the data frames or arrays, then many of the group-bys and algorithms under the hood are already optimized; they keep trying to optimize, it's an ongoing process. Also the shuffles for sorting and joins, they already try to optimize those. So many of those optimizations are built in if you use the actual data frames and arrays. But of course, if you do something more custom and make your own graphs, then I would say: use the visualizations, and you can always also directly access certain workers and interact with them to see if there is something wrong with a specific worker. Sorry? When you're running... sorry, yeah. The question: if you're running on a cluster, is there a specific scheduler? So for now, for the cluster, there is only one scheduler that they are developing, the distributed scheduler, which just creates a lot of workers and communicates with all those workers. It's only on your own laptop that you have some different options. It's a new project; is it already often used in production? So the question was: it's a new project, but is it already used in production? I know that the Dask developers are, for certain parts, working together with some companies in the US. I think Capital One is one of them that is using it, and through that usage they are trying to optimize certain shuffles and joins and things like that. But I don't know if they already use it in production; I think they are still developing within the company. Also the xarray package in the Python ecosystem, which is a bit of a merger of pandas and NumPy since you have multidimensional arrays but with labels, is used a lot with climate data, and they also heavily use Dask to parallelize their workflows. We can also see some pickup there. Yeah, I think so. I'm not very well positioned to evaluate that. There are certainly some things, for example setting it up on a cluster: it's less integrated than other solutions, you will have to do some more manual things, but the tools to do that more easily are growing.
So it's certainly less well integrated than, for example, Spark, which you can easily set up. Time for one more question. I have one. Do you have any idea of the biggest scale that Dask has been used on, in terms of number of cores or workers? I know from Matthew Rocklin that it was used on somewhere between one and two thousand cores, and it should in principle scale further. He also said in a blog post recently that he would very much like people to try it out on larger setups, because as a developer it's very difficult to try it at such a large scale. But in principle, it should scale higher; the only limitation will be the scheduler, the communication time of the central scheduler. Thank you very much, Joris.