Okay, hi everybody. So I want to talk about my kind of personal opinions about the GPGPU developer experience. I feel like we don't talk about developer experience enough. When we talk about GPGPU, we tend to focus more on performance issues and distributed computing and stuff like that. I know a lot of the audience here is from an academic background, and so folks who focus on GPGPU in academia may not have fully realized how incredibly popular GPGPU has become in the last few years. To give you a sense, this is the downloads for the CUDA toolkit from just one source, which is the Anaconda Python repository. And as you can see, 11.3 has 1.1 million downloads, 11.4, 1.1 million downloads, 11.1 million downloads. We've got to a point now where literally over a million people are downloading CUDA. So what are all these people doing? They are not writing CUDA kernels. If you look at the Kaggle developer survey, actually most developers, most data scientists, are now using things like TensorFlow and PyTorch and Lightning and fastai. And so GPGPU is being used extremely extensively around the world now through these higher-level libraries, and nearly always via Python. But the thing is that these libraries, like PyTorch, behind the scenes are calling compiled C libraries, such as, for deep learning, cuDNN, or the PyTorch C++ library, or the C and C++ mixed library. So although the Python developer is working in Python, there's a point at which they can't easily dig any deeper because it's jumping into compiled code. And in the case of things like cuDNN, it's not even open source code. So what's the issue? Well, the issue is that for Python programmers, there are things that they either can't do at all or can't do conveniently. So because it ends up being turned into these really very big C libraries or pre-compiled libraries, edge deployment can be very difficult. For example, when you install PyTorch, you're actually installing over a gigabyte.
It's a download of over a gigabyte. And trying to turn your Python code into something that you can then put onto a mobile phone or Raspberry Pi or whatever is incredibly challenging. From a developer experience point of view, it's actually very difficult to debug your work, because Python programmers are used to using the Python debugger, but most of the real work that's being done in your code is not happening in Python. It's happening in these lower-level libraries. So trying to understand what's really going on is extremely challenging. Same problem for profiling. So obviously we all want our code to run fast. And that's challenging to do when you can't easily just use your Python profiler to jump in and see what's going on: where are the holdups, how do I make it faster? A lot of people, when I speak to them, think that this is not important. They say it's not important that Python programmers can dig into the underlying kernels and understand them and debug them and customize them, because Python programmers are happy working at these higher levels. But actually, this is a big challenge, because realistically, whether you're doing research or production in industry, at some point you want to dive in and change things. And in my experience, most of the time there's something I would like to try and change that's buried down inside one of these pre-compiled libraries. Also, as an educator, it's very hard for me to teach people what's going on, because I can't show them the actual code that's really running behind the scenes. And so understanding the implementation details, whether it's for educational reasons or because you want to understand how the algorithm works to think about how you can improve it, is either impossible or extremely difficult. And this kind of hackability is critical for the developer experience, in my opinion. So there are various hacks to try and handle these deficiencies.
So for example, PyTorch now has a specialized profiler just for profiling PyTorch. NVIDIA has a specialized profiler as well. These are really neat tools and it's really cool that they're being provided for free. But the fact is that it's still not a great developer experience to have to learn a whole new tool which works in a different way and which is not actually giving you a consistent view of all of your code. Then for edge deployment, or even sometimes web hosting, there are hacks, in particular tracing and a just-in-time compiler, that are provided by both TensorFlow and PyTorch. So the idea is that you use the JIT or the tracing mechanism to turn your Python code into code in a different form. In particular, it's likely to be ONNX, which is kind of an open standard for sharing these kinds of models. The problem is that Python is a really rich and dynamic language, and so in either of these cases, they're not capable of handling all of the things that Python can do. So for example, in the case of the PyTorch just-in-time compiler, there's all kinds of things where it's just going to give you an error and say, I'm sorry, I don't know how to do that. More frustrating for me, I find, is that very often it does something slightly different to how Python works, and it's then very difficult to know why it worked in Python and didn't work when I compiled it to ONNX. Another very interesting technology is XLA, which comes out of Google and is now available as a backend for both TensorFlow and PyTorch. So this is an accelerated linear algebra compiler. It's a similar kind of idea to the PyTorch JIT, but it's something which is specifically designed around creating a really accelerated, fast version of your code. And so nowadays, for example, when PyTorch wants to talk to a TPU, it will go through the XLA compiler, because at this stage XLA is the best way to create TPU code.
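The point about tracing behaving slightly differently from Python can be illustrated with a toy example. What follows is only a minimal sketch in plain Python, assuming a tracer that runs the function once on an example input and records the arithmetic it sees. The names `Tracer`, `trace` and `model` are hypothetical and are not part of PyTorch, TensorFlow or ONNX, and real tracers are far more sophisticated. The idea it shows is general, though: a data-dependent `if` is evaluated on the example input and never makes it into the recording.

```python
import operator


class Tracer:
    """Hypothetical stand-in value that records the arithmetic applied to it."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape  # shared list of (function, constant) pairs

    def _record(self, fn, const):
        self.tape.append((fn, const))
        return Tracer(fn(self.value, const), self.tape)

    def __add__(self, const):
        return self._record(operator.add, const)

    def __mul__(self, const):
        return self._record(operator.mul, const)

    def __gt__(self, const):
        # The comparison runs on the example value and is NOT recorded.
        return self.value > const


def trace(fn, example):
    """Run fn once on an example input and return a function that replays the tape."""
    tape = []
    fn(Tracer(example, tape))

    def replay(x):
        for op, const in tape:
            x = op(x, const)
        return x

    return replay


def model(x):
    # Data-dependent branch: ordinary Python decides this on every call.
    if x > 0:
        return x * 2
    return x + 100


traced = trace(model, example=3)  # only the x > 0 path ends up on the tape

print(model(3), traced(3))    # both follow the same path as the example input
print(model(-5), traced(-5))  # Python takes the else branch, the replay does not
```

For a positive input the two versions agree, but for a negative input the replayed version silently multiplies by 2 instead of adding 100. No error is raised, which is what makes this kind of mismatch hard to track down.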
So these are all nice to have, but they have a lot of shortcomings. It's not nearly as convenient, and not nearly as good a developer experience, as using just Python and the Python tools that Python programmers are familiar with. Another very interesting new approach is JAX. JAX is another Google project and it's also a Python library, but it's specifically designed to bring Python over to XLA. So it's written from the ground up for XLA. And what's particularly interesting about JAX is that you can kind of write your own kernels, so you're not as limited as you are with the tracing and JIT approaches. With those, you're still limited to doing just the stuff that your underlying C or CUDA or whatever library has written for you, whereas with JAX you can do a lot more stuff. There's a lot more flexibility. And so this is a very interesting approach, but we still have the problem that the code that's running on the accelerator is not the code you wrote. It's a transformation of that code through XLA. And so again, profiling it and debugging it and understanding what's really going on is difficult. Also, in order to provide these composable transformations, JAX has a very interesting, but in some ways very limited, programming model. It's highly functional and immutable. And so JAX ends up with this kind of complexity from the functional programming model. State management becomes difficult. Things like random number generation become particularly challenging. And obviously in my world of machine learning and deep learning, random numbers are very important, as they are in many other GPGPU areas. So I feel like these are all amazing technologies. There's so much impressive work going on, but they don't feel like, you know, the really long-term solutions. I don't see how any of these things quite end up giving us the developer experience we would like to be able to offer. Another very interesting technology I wanted to mention is TVM.
So TVM is an Apache project nowadays. And you can use TVM directly from Python, and you basically end up creating these compute expressions, in this case using a lambda. And if you're familiar with something like Halide, it's a similar kind of idea: you can basically create a schedule which will figure out how to... Well, you can show various ways that you think it might best be run on an accelerator. And in this case, you're actually binding axes to blocks and threads on the accelerator. So this is a super convenient way to write kernels. And more importantly, perhaps, it also has things like auto-schedulers. So this is how you can create things that run as fast as cuDNN or, you know, specialized linear algebra libraries from NVIDIA or whatever, without having to write all those, you know, unrolled loops and memory management and whatnot. But as you can see, in the end it's still not anywhere near as convenient as writing normal Python. And the thing you end up with is, you know, this kind of compiled code that again has all the developer experience issues I described before. Perhaps the most interesting path for the future for me right now is Julia. Julia is a fairly new language. But what's really interesting from a GPGPU standpoint is that it handles nearly all of the developer experience problems I described. Nearly none of them exist in Julia. And the key thing is that in Julia, you can write kernels that look a lot like what you would write in CUDA, but with less boilerplate. And you can do parallelized operations. You can handle memory. That can all be done in Julia. And so I think this is a really underappreciated, important idea, which is that developers should be able to use the same language and the same tools throughout the hierarchy of abstractions in their program. Again, speaking as an educator, this is incredibly important for teaching people what's going on. It's really important for a researcher because you can hack in at any level.
It's really important in industry because you can jump in and make sure the performance is working properly for you at every level. And it also kind of opens up the research world in such a way that things aren't off the table. You know, I find that the things that get worked on in deep learning research are the things that are conveniently accessible through libraries, and a lot of stuff that isn't has just not really been touched, because it requires people to go in and write their own CUDA kernels. And very, very, very few people have the patience to do that, at least in the deep learning world. So, yeah, really, I guess this is a bit of a plea for the GPGPU community to consider, you know, building the next generation of languages and tools, which allow developers to really do everything that they might want to do in a convenient way. For Julia, I feel like there's a lot of gaps in the developer experience more generally, which I think the community is very familiar with, around deployment and the amount of memory it uses and the amount of latency it takes to start up and so forth. But I do think, at least with Julia, it feels like there's a path there that could eventually lead to a really beautiful developer experience. And that's not a path that I see available in really any of the Python frameworks that I see right now. And I would love to see things like TVM being taken and, you know, more integrated with those ideas into languages and tools. So, yeah, that's the end of my thoughts on that. Thanks very much.