Thanks. Thanks, everyone, for being here, and thanks for the introduction, Dimitri. As he told you already, I'm Peter Entschev, I'm a software engineer at NVIDIA, and today I'm going to be talking about GPU computing with Dask, CuPy, and RAPIDS.

The outline of this presentation is basically the one you're seeing: interoperability and flexibility of the PyData and Python ecosystem in general, acceleration, or scaling up, with GPUs, and distribution, or scaling out, with multiple nodes. The talk is mostly intertwined, there are no clear boundaries, so I'll be talking about these topics together.

I'll start by introducing what we aim to achieve with RAPIDS, Dask, and CuPy in this context. Let's take a look at this simple example, a very simple example of a typical data science pipeline. You start by loading some data or creating a dataset, in this case with make_moons, and then you create, for example, a DataFrame. It's not strictly necessary in this case, it's there for illustration purposes. Then you do some clustering, with scikit-learn, for example.

But what if we want to accelerate this? We also don't want to reinvent the wheel, so we want to keep things as simple as possible for users. What we can do is simply change the imports: instead of importing pandas and scikit-learn, we import cuDF and cuML, which are part of the RAPIDS ecosystem. They provide the same APIs that pandas and scikit-learn provide for you; the only difference is that they run on a GPU.

So what is RAPIDS? This part may sound a bit like a sales pitch, because I borrowed, not to say shamelessly copied, from other RAPIDS presentations. RAPIDS is an open-source suite for end-to-end data science pipelines. It's built on top of CUDA to leverage the full performance of GPUs, it's a unifying framework for GPU data science, and it provides pandas-like and scikit-learn-like APIs, so nobody has to learn anything new; you just change your imports, as I showed in the previous example.

What does a regular data science pipeline look like? Something like what you see here: we begin with some data preparation, say with pandas, then we do some model training, say with scikit-learn, then we do some visualization, check what we're getting, and then we iterate. This is the basic flow that RAPIDS is tackling, and we have several libraries composing this ecosystem, such as cuDF and cuML, which I mentioned before. We also have cuGraph for graph analytics, and there are cuxfilter and kepler.gl for visualization. They all interconnect via the Apache Arrow standard on the GPU, and they can also interconnect with deep learning frameworks, such as PyTorch, Chainer, and MXNet, all through the Apache Arrow memory layout.

The lesson we learn from Apache Arrow is that we don't always want to be doing expensive copies and conversions of data. We want a unified memory layout so we can get rid of all the overhead that is slowing down the entire pipeline. So we use the Apache Arrow memory layout, and with it we can provide zero-copy memory interoperability, not transfers, between the different frameworks.
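Coming back to the import swap at the start, here is a minimal sketch of what it looks like in code. It assumes cuDF and cuML are installed and a GPU is available; the choice of DBSCAN as the clustering estimator for make_moons is mine, for illustration:

```python
from sklearn.datasets import make_moons

# CPU version would be:
#   import pandas as pd
#   from sklearn.cluster import DBSCAN
# GPU version: same API, only the imports change
import cudf as pd
from cuml import DBSCAN

# Create a dataset and put it in a (GPU) DataFrame
X, _ = make_moons(n_samples=10_000, noise=0.05)
df = pd.DataFrame({"x": X[:, 0], "y": X[:, 1]})

# Cluster on the GPU with the familiar scikit-learn-style API
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(df)
print(labels[:10])
```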
This is just a fancier visualization of the pipeline I mentioned before: we start with some data, do some data preparation, machine learning model training, and data exploration, then we get some predictions, and probably deploy the result later.

As I mentioned, I work at NVIDIA, but RAPIDS is not just an NVIDIA effort, it's a whole community effort, and I can cite some very important contributors here, like the scikit-learn and Ursa Labs people, who are also here at the conference, and they've been very helpful to us, as have Anaconda, Quansight, and other ecosystem partners such as Walmart. They've all been very helpful both in development and in providing use cases that we can build upon.

I'll focus mostly on the machine learning part for this talk, because time is very limited, so I can't cover the data frame and cuGraph parts as well. This is how the machine learning technology stack looks in RAPIDS: we have CUDA at the bottom, and we use libraries already distributed with the CUDA toolkit, such as cuBLAS, cuSOLVER, and cuSPARSE, to speed up the computation. On top of them we build the cuML primitives, and finally the cuML algorithms, written in C++ and CUDA, which we expose through Cython to Python, and that gives us the nice scikit-learn-like API.

The model is something like this: we have two ways of parallelizing. The first one is model parallelism, which means we actively write the code in CUDA and C++, so we have people writing this code for cuDF, for cuML, et cetera, and this is the part where the model itself parallelizes the work, attempting to use the GPU to the best of its capabilities. But we also do data parallelism, mainly for distribution: we use Dask for that, with chunked arrays, and we distribute over the various nodes of a cluster, for example.

One of the interesting algorithms available in cuML, just as an example, is UMAP. You've probably seen UMAP before, in the keynote, being used for clustering of words. UMAP is basically a faster algorithm targeted at visualization of clusters, faster than t-SNE, but it can also be used as a regular dimensionality reduction algorithm. This is what UMAP looks like for the Fashion-MNIST dataset: on the right it runs on a CPU, and on the left on a GPU. The clusters are very well defined in both cases, but on the CPU it takes about 100 seconds to run, while on the GPU it takes 10 and a half seconds. So just by switching to cuML, you get a 10x speedup.
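As a sketch of what that looks like in code, assuming the umap-learn package on the CPU side and cuML on the GPU side (the random data here is just a stand-in for Fashion-MNIST):

```python
import numpy as np

# CPU version would be:
#   from umap import UMAP
# GPU version, same scikit-learn-style API:
from cuml.manifold import UMAP

# Stand-in for Fashion-MNIST: any (n_samples, n_features) float32 array
X = np.random.random((70_000, 784)).astype(np.float32)

# Reduce to 2 dimensions for visualization
embedding = UMAP(n_neighbors=15, n_components=2).fit_transform(X)
print(embedding.shape)  # (70000, 2)
```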
Dask, I'm sure a lot of you are familiar with Dask already, but just for the sake of completeness: Dask is a distributed compute scheduler that can scale from laptops to supercomputers, so it's a great candidate for leveraging distributed systems in RAPIDS, because it's already well known and a lot of people already use it, and we can just connect a CUDA backend to Dask to leverage even more performance. Because it's extremely modular, it fits RAPIDS well, and since we can have multiple workers on a single node with Dask, we can also have one worker per GPU, which makes things much simpler to develop and to debug.

But how does Dask really operate? Normally you would have a NumPy array, for example, execute some computation on that array, and get some result. But maybe NumPy is too slow, because many algorithms are single-threaded, or you cannot distribute them. So what you do is create a Dask array, which is one big block composed of many smaller blocks that are NumPy arrays.

But what if we want to use Dask on a GPU? We can do that, and we use CuPy for it. If you're familiar with NumPy, you're automatically familiar with CuPy as well, because it implements the same API. What we have to do is basically say: my Dask array now uses CuPy as a backend. All the blocks become blocks on a GPU, and you can distribute those blocks later on with the same Dask scheduler you would already have for NumPy.

This is part of the interoperability effort I mentioned before. Before, there weren't many interoperability capabilities in the Python ecosystem. Of course you can always copy data around, but that hurts performance badly, and we want to address that. You also could not, for example, use a CuPy array in Dask, because Dask was simply not written for that purpose. To address this sort of issue, NumPy introduced several protocols, and here we are particularly interested in NumPy Enhancement Proposal 18 (NEP 18), which defines __array_function__. This is a function dispatch mechanism that allows you to use NumPy simply as a high-level API: you call, for example, np.sum on an array, and depending on the type of the array, the work is dispatched to the library that actually implements it, for example CuPy or Dask. What those libraries need to implement is only this __array_function__ protocol, which can be done with something like 20 lines of code, obviously for libraries that are NumPy-like, that is, libraries that operate on arrays.

Here we have a simple example of computing an SVD with Dask; this is more or less how you would do it before, and very similar to how you do it now (see the sketch below). You import dask, dask.array, and numpy, and create a NumPy array of random values, for example. Then you chunk it with dask.array.from_array. You could obviously create the array directly with Dask too, but for example purposes I'm doing it like this to be clearer: we create a NumPy array and convert it into the various blocks of a Dask array. Finally, we call np.linalg.svd on dx, the Dask array. This is something that wasn't possible before __array_function__; you would have had to call da.linalg.svd, which means everyone who wanted to support Dask needed to know about the existence of Dask. Now that's no longer the case: you just need to know the NumPy API. For this example, the array took one minute and 21 seconds to compute.
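A minimal sketch of that example; the sizes and chunking are assumptions of mine, not necessarily the ones from the slide. Note that NEP 18 dispatch must be active: it is enabled by default from NumPy 1.17 on, while NumPy 1.16 required setting the NUMPY_EXPERIMENTAL_ARRAY_FUNCTION environment variable.

```python
import dask
import dask.array as da
import numpy as np

# Create a NumPy array and split it into Dask blocks. Chunking along
# rows only keeps the array "tall and skinny", which Dask's SVD needs.
x = np.random.random((100_000, 1_000))
dx = da.from_array(x, chunks=(10_000, 1_000))

# np.linalg.svd dispatches to Dask's implementation via __array_function__
u, s, v = np.linalg.svd(dx)
u, s, v = dask.compute(u, s, v)  # Dask is lazy, so trigger the computation

# GPU variant: only two lines change --
#   import cupy
#   x = cupy.random.random((100_000, 1_000))
# (older Dask versions may also need asarray=False in da.from_array)
```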
Now, if we want to do it with CuPy, we do almost the same thing: we have to change two lines, which are an import of cupy and saying that the array we're creating is a CuPy array. Everything else remains the same, so the Dask array is now an array of several CuPy blocks, and we use np.linalg.svd on the same Dask array to compute on a GPU. This takes 41 seconds, so roughly half the time it took before, and on a single GPU, by the way.

But as with all good things in life, there are limitations to the protocol. One of them is universal functions; fortunately, these are already addressed by the __array_ufunc__ protocol. But np.array and np.asarray would require their own protocols if we wanted to pursue that path, because these are two functions that are meant to coerce an array to NumPy itself. If you coerce an array to CuPy instead, it might break compatibility with the various libraries that rely on asarray to give them a NumPy array specifically, and nothing else. Another limitation is dispatch for methods of any kind, such as RandomState: in that case we cannot identify what kind of array we're operating on, because we're not passing an array to the function. We can pass a seed, but it gives us no reference to dispatch on, and that reference is always the array in the case of __array_function__; if we don't have an array, we just cannot do anything.

There are alternatives to the __array_function__ protocol. One is uarray, an effort from Quansight that intends to address the shortcomings of NEP 18 that I mentioned earlier. It's a generic multiple-dispatch mechanism, and it looks a bit different from __array_function__. In this situation, instead of creating a NumPy array directly, we set the backend, so we're no longer explicitly saying "now I'm creating a CuPy array or a Dask array" or whatever type of array. At the beginning we say: this block of code will use CuPy arrays. With that, everything else can remain the same: np.ones will create arrays of whatever backend you're using, and all the operations will also be dispatched to that library, in this case CuPy, for example.

So this is what it should look like, and it actually does look like this; this is perfectly fine code to use. In this short example, all I wanted was to create a small array, compute a sum on it, and check that the types really match. I begin by creating an array of ones, then I print the sum, so 4, which matches, and the type of a and the type of np.sum(a) are both CuPy arrays, which is exactly what we expected.

We can also do multiple-library dispatch, multiple backends, say Dask with CuPy internally. Doing this is also very simple: we need another import to say that we're working with the Dask backend, and we also have to set that we're using multiple backends, CuPy internally and Dask at the higher level. And since Dask is a lazily evaluated library, we need to add .compute(), which is something that breaks the NumPy API, but that's already known and accepted for this kind of application. In this case, we again check that the sum of the array matches 4 (it does), that the type of a is a Dask array (it is), and that the type of np.sum(a) is a CuPy array, and it is not: it's a NumPy scalar. The reason is that Dask still needs to add explicit support for uarray. What we would expect in this example, instead of the numpy.float64 on the line before the last, is for the result to actually be a CuPy array.
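Here's a rough sketch of the single-backend case, based on my recollection of the 2019-era uarray and unumpy APIs; treat the exact module paths (unumpy, unumpy.cupy_backend, unumpy.dask_backend) as assumptions, since they may have changed since then.

```python
import uarray as ua
import unumpy as np                        # NumPy-like API with swappable backends
import unumpy.cupy_backend as cupy_backend

# Everything inside this block is dispatched to CuPy
with ua.set_backend(cupy_backend):
    a = np.ones(4)
    print(np.sum(a))        # 4.0
    print(type(a))          # a CuPy ndarray
    print(type(np.sum(a)))  # also a CuPy ndarray

# Multi-backend sketch (Dask at the high level, CuPy inside the blocks):
#   import unumpy.dask_backend as dask_backend
#   with ua.set_backend(cupy_backend), ua.set_backend(dask_backend):
#       a = np.ones(4)
#       print(np.sum(a).compute())  # Dask is lazy, hence the .compute()
```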
On the GPU side, we also have other protocols, such as the CUDA array interface. This is again to address the problem of data copying and conversion: the CUDA array interface basically provides you a pointer to GPU memory, and we can pass this pointer between libraries. Numba, CuPy, and PyTorch, for example, all implement this interface, and we can just hand the pointer over instead of copying any data around. Besides the CUDA array interface, there's also DLPack, which comes from the deep learning world, but you can also use it with RAPIDS, so you can pass data around with zero copies in case part of your pipeline is doing some deep learning. RAPIDS doesn't intend to address deep learning at all; it's about data science, so conventional or classic machine learning, not deep learning, but we want to have the capability to interoperate with everyone in the Python ecosystem.

There are also some challenges, and one of them is, again, communication. Copying data is a problem, but in some cases, for example with multiple nodes, we have to move data around; there's no way around that. And Dask, by default, uses TCP sockets, which are slow. One of the alternatives is UCX, which gives you uniform access to different transports: TCP sockets, InfiniBand, shared memory, or NVLink, which is an NVIDIA interconnect that links GPUs at a rate faster than PCI Express, of course. But UCX is a C++ library, very targeted at the hardware, so we need Python bindings for it. Those bindings are in the works, some of the work is already done, but it's not complete yet, and what it will allow is for Dask to communicate efficiently depending on the hardware you have available on your node or on your cluster.

Here we already have some preliminary benchmarks, let's say performance analysis. One trace shows the situation before, the other after. If you're familiar with the Dask dashboard, the red part is memory copying: the time that Dask is actually waiting for some memory transfer to be done. In the "before" trace, after the 20-second mark, it spends basically four seconds doing nothing but copying, whereas when we add UCX into play, that same block becomes what, half a second, maybe one second. So it easily buys us several seconds of speedup, and of course this will depend on the hardware you have available.

These are some benchmarks for CuPy, all on a single NVIDIA Tesla V100 versus NumPy, however NumPy implements things internally, so maybe multi-threaded, maybe not, depending on the operation. We see different gains depending on the nature of the operation, ranging from around 270x for element-wise computation, which is bound almost entirely by computation and not at all by data communication, down to around 17x for SVD, which is more communication-bound, and that's still a very decent speedup.
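As a rough illustration of how such a comparison can be run, here's a sketch, not the benchmark from the blog post; the size, the operation, and the resulting speedup are all placeholders and will vary with your hardware.

```python
import time

import numpy as np
import cupy as cp

x_cpu = np.random.random((10_000, 10_000))
x_gpu = cp.asarray(x_cpu)  # one explicit host-to-device copy

# Element-wise operation on the CPU
t0 = time.perf_counter()
(x_cpu ** 2).sum()
print("NumPy:", time.perf_counter() - t0, "s")

# Same operation on the GPU; a warm-up call beforehand would exclude
# one-time kernel compilation overhead from the measurement
t0 = time.perf_counter()
(x_gpu ** 2).sum()
cp.cuda.Device().synchronize()  # kernels are asynchronous, wait for them
print("CuPy:", time.perf_counter() - t0, "s")
```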
If you want more details later, there is a blog post that I wrote about this; it has all the details, including how to reproduce the tests if you're interested.

I also have some single-GPU cuML versus scikit-learn benchmarks. As expected, we're faster than scikit-learn, because we're using GPUs, which are great for this kind of linear algebra, and we can get up to 120x speedup, for example for PCA. This is again a Tesla V100, versus an 80-core node.

Here is also a distributed benchmark; this is the more interesting one, in my opinion. We have four lines here. The top line is the time it takes to compute an SVD: in the case written there, 611 seconds for 10 million rows by 1,000 columns on a CPU with 80 threads. On a single GPU, it takes a bit over half that time. When we expand this to multiple GPUs, say the eight GPUs on a DGX-1 machine, which is an NVIDIA supercomputer with eight GPUs in this case, it takes 51 seconds. And if we add a second node, communicating over InfiniBand, for example, it takes 33 seconds. So we have very good scalability here, not perfect. I didn't plot more for a single GPU, because we ran out of memory, and this is one of the problems we address with multiple GPUs. And plain NumPy would take too long, so I gave up on that. If we scale up to 20 million rows, so we double the size of the problem, it takes about 107 seconds for a single node and 60 seconds for the dual node. That is about 80% scaling efficiency, which is not too bad.

To wrap up: RAPIDS is there to scale up. If you have your PyData ecosystem, doing scikit-learn, doing pandas, doing NumPy, you can scale that up with RAPIDS and leverage the performance of GPUs. And if we bring Dask into account, we can also scale out: you can already scale out CPU processing, but we also want to scale out GPU processing, and this is how we use Dask for that purpose; a minimal sketch of such a setup follows.
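This sketch assumes the dask-cuda package, which starts one Dask worker per GPU on the machine, and uses the documented pattern of backing a Dask array with CuPy blocks; sizes are placeholders.

```python
import cupy
import dask.array as da
import numpy as np
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per visible GPU on this machine
cluster = LocalCUDACluster()
client = Client(cluster)

# Build a Dask array whose blocks are CuPy arrays, created on the GPUs
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
dx = rs.random_sample((1_000_000, 1_000), chunks=(100_000, 1_000))

# As before, np.linalg.svd dispatches to Dask via __array_function__
u, s, v = np.linalg.svd(dx)
s = s.compute()
print(s[:5])
```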
This is the roadmap, back from the beginning of RAPIDS, when it was first released in October 2018 as version 0.1, with almost no algorithms whatsoever. But we are committed to increasing the number of algorithms that are available. This is the current state as of June 2019, when RAPIDS 0.8 was released, and this is where we want to be by the end of the year, RAPIDS 0.12, somewhere around the target for RAPIDS 1.0. We are also focused on robust functionality, on deployment, and on user experience. You can use RAPIDS on many cloud platforms already, and if you have a GPU at home, you can also use that for data science. It has to be at least a Pascal GPU, if I'm not mistaken, so the GTX 10 or RTX 20 family, for example. And everything is open source: you can get everything on GitHub, you can install via Anaconda or via NVIDIA GPU Cloud, or via Docker, and deploy anywhere you have some GPUs.

I have some additional reading material here if anybody is interested; this presentation will probably be available later. These are posts from different people about the protocols I mentioned before, and also about Python performance and GPUs from Matt Rocklin, the BDFL of Dask. He's also at NVIDIA now, and he has been a great asset for this distributed data science world, and for RAPIDS in particular. And that's it, thank you very much.

Thank you, Peter, very interesting talk. We have time for one question.

Thank you. I wonder how good the compatibility between NumPy and CuPy is? I expect that most of the functionality is there, but there are probably corner cases, and I use, for instance, a lot of structured arrays in NumPy. So I want to know if I should expect problems or not. Thanks.

Of course. It doesn't implement every single piece of functionality that NumPy implements, but it has a pretty large API, and you can find the compatibility list in the documentation. Of course, I don't know everything off the top of my head, but there is a very useful compatibility list of what is implemented in CuPy and what is not, so I think that's the best way to figure out whether you can use it for your application. But it implements a lot of the NumPy API. And CuPy predates RAPIDS, just in case you don't know that: it's not developed by NVIDIA, it's developed by Preferred Networks in Japan, and it's a very stable library. It's been around for several years now, so it already has good compatibility with the NumPy API. Thank you.