Hey everyone, thank you for coming to this talk, and thank you to EuroPython for hosting me and for supporting virtual talks, this is very helpful. So we are going to talk about PyTorch 2.0, and why you should care about PyTorch 2.0. Before that, a little bit about me, just to clarify: I'm not a co-author of this framework. I'm a research engineer at Meta and my work is mostly focused on building reinforcement learning agents, and I'm happy to talk about that after the talk.

So what are we going to talk about? Mostly PyTorch 2.0, and this is roughly what the formal agenda looks like. Now, I'm not really going to spend time on what PyTorch is, I assume people are familiar with it, but just to bring everyone onto the same page: it's an open source machine learning framework that provides NumPy-like arrays with GPU acceleration. It's super useful for training deep neural networks, and what differentiates it from other deep learning frameworks is the ease of use it provides. Back in the day when PyTorch came out, the other frameworks that were available were usually compiled frameworks, and it was very difficult to debug the errors you would come across when training your models. PyTorch provides what is called eager mode, which basically means you run your network architecture statement by statement and get errors that are very easy to debug through. So much so that there's this famous quote from Andrej Karpathy, a very well-known machine learning researcher. It's a pretty old tweet, but it basically says: I've been using PyTorch a few months now and I've never felt better. My skin is clear. My eyesight has improved. So this is just a testimonial to PyTorch, or rather to PyTorch 1.0.

Now, the point is, if PyTorch 1.0 was so good and people were writing these nice testimonials about it, why do we care about PyTorch 2.0, and what led to the need for it? I think it can be broken down into three reasons. One, deep learning uses GPUs for accelerating training, and over the last couple of years GPUs have been getting faster and faster, but the default eager mode, which is what the PyTorch 1.x family supported, was just not able to keep up with the GPUs getting faster. So your GPUs would get faster, but your machine learning code would not, and performance matters when you're training these systems. The second aspect was that the overhead of the framework was becoming more and more evident. PyTorch was written to a large extent in Python, and Python is not as fast as we want it to be, so there was a framework overhead. The developers were working around this by pushing more and more of the Python parts of PyTorch into C++, so you had PyTorch, but most of it was written in C++, and even that was not enough to take care of this overhead. And the third aspect was that PyTorch code was becoming more and more C++, which meant contributors had to do more work, and it was slowing down development velocity. All these reasons prompted the developers to take a hard look and figure out how to design PyTorch 2.0.

So these are the broad goals of PyTorch 2.0. The goal is to make training 30% faster and reduce memory usage, while making sure there are minimal changes to the code or the workflow. And this "minimal changes to code and workflow" part is important.
You can always make a system faster, but then you can break the UX which made PyTorch so useful in the first place, so they wanted to do it while keeping the UX similar. They wanted to make it easier to write a PyTorch backend. A PyTorch backend basically means: I write down the machine learning operations that I want to execute, and then I write a system which takes these operations and actually executes them. That's what a backend refers to in this case. They wanted to improve support for dynamic shapes and for distributed capabilities, training models across multiple machines and those kinds of things. And the fourth aspect was they wanted more of PyTorch to be written in Python, so that more people are able to contribute to it. Through this talk we'll see how PyTorch 2.0 fulfills all these goals.

Starting with the performance benefits. Before I actually talk about performance benefits, I want to say that PyTorch 2.0 is completely backward compatible with the PyTorch 1.x series. In fact, the last stable release in the 1.x series was PyTorch 1.13.1, and you could very well argue that PyTorch 2.0 should really just be PyTorch 1.14. So in that sense, PyTorch 2.0 is fully backward compatible: if you have a PyTorch model, you can start using PyTorch 2.0 without changing a single line of code and everything would just work. Some of the performance benefits you start getting out of the box are, for example, an implementation of high-performance transformers, and transformers are the most standard machine learning architecture in use right now, so having faster transformers means your machine learning workflows go faster. A bunch of other operations have also been accelerated, and again, this is all out-of-the-box performance: you don't change anything, you just install PyTorch 2.0, that's it. Some of the other operations which have been accelerated are multi-head attention, convolution transpose operations, and interpolation operations.

So this is all good, but it turns out these changes alone are not enough to get the 30% speedup the team was targeting initially. And so they introduced something called torch.compile, and torch.compile is where we'll be spending the bulk of this talk, because torch.compile is the function you would be using to accelerate your PyTorch models. So what is torch.compile? torch.compile is basically a function which takes a callable and returns a compiled callable, and the hope is that this compiled callable is faster to execute and requires less memory than the underlying function. torch.compile can take a function, it can take a neural network, anything that's a callable is accepted here.

Let's take an example. I've written this foo function which is just taking two tensors, x and y, computing sine of x and cosine of y, and adding them together. Pretty straightforward, nothing fancy. So I have this function, and then I compile it by calling torch.compile, and I get back compiled_foo. The first thing to note is that compiled_foo is a drop-in replacement for foo: wherever you are using foo, you can just start using compiled_foo and it would work as-is.
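As a minimal sketch in code, following the example on the slide (variable names are as described in the talk):

```python
import torch

def foo(x, y):
    # plain PyTorch: element-wise sine and cosine, then add
    return torch.sin(x) + torch.cos(y)

# torch.compile returns a drop-in replacement for foo
compiled_foo = torch.compile(foo)

x, y = torch.randn(1024), torch.randn(1024)
print(torch.allclose(foo(x, y), compiled_foo(x, y)))  # True: same result
```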
On the slide it's just a simple assert showing that it gives the exact same output for a particular x and y, but in general it's going to compute the exact same thing, and this compiled foo is going to be much faster in practice and will use much less memory than the original foo function. And the good part is the foo function can have branches, it can have if conditions, it doesn't just have to be a linear flow of execution. You can have branches and torch.compile will be able to handle that as well. I think this was a big win compared to the previous attempts that PyTorch, and some other frameworks, made at compiling functions. And you don't just compile functions: you can also take a machine learning model, a PyTorch module specifically, and compile that, again by calling torch.compile. The compiled model that you get is a drop-in replacement for the model that you had, and again, it works everywhere.

Now, the thing is, the examples I'm showing you are of the "hey, I ran it and it works" variety, and that's good, but you are not going to be working on my machine. So the question is, how do we know that torch.compile works widely, and not just for these specific examples? What the team did was take 163 open source models from three popular machine learning libraries and benchmark them. The libraries they used were Hugging Face Transformers, which those of you who work in natural language processing would be familiar with, it's the de facto standard library for getting started; then there's a library called timm, which is a collection of state-of-the-art PyTorch image models, and they took 61 models from the timm library; and then there's a benchmark suite called TorchBench, which is a set of popular codebases from across GitHub, and they took 56 models from TorchBench. So all in all they had these 163 open source models, and the team just applied torch.compile, a one-line change, to all of them to see whether torch.compile works out of the box or not.

And the results were quite impressive. First, in 93% of the cases torch.compile just worked: no errors, and the model is optimized. In the other 7% of cases there are tweaks and additional flags that need to be passed, but in the vast majority of cases it just works out of the box. And this is what the speedups look like. Now, a couple of caveats here: these are somewhat old slides, or rather this particular result is somewhat old, so it's possible that with newer additions to torch.compile these results have become even better. What exactly are we looking at here? We are comparing the performance of eager mode, which is PyTorch 1.x, against torch.compile. So take a model you wrote using PyTorch 1.x, compile it, and compare the average gain in execution speed. And the last detail is that these benchmarks were done on an NVIDIA A100 GPU; I'll come back to that in a minute. But basically what you see is roughly a 38% speedup on average on the timm models, 76% on the TorchBench models, and 52% on the Hugging Face models. And again, all of this just by doing a torch.compile. That's it, nothing else.
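That one-line change is the same pattern we saw earlier. For a Hugging Face model it might look something like this (the model name here is just an illustrative choice of mine, not necessarily one from the benchmark set):

```python
import torch
from transformers import AutoModel  # Hugging Face Transformers

model = AutoModel.from_pretrained("bert-base-uncased")
model = torch.compile(model)  # the one-line change; still a drop-in nn.Module
```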
Now, the detail about the NVIDIA A100 GPUs. torch.compile is going to compile your model and then rely on the GPU to accelerate it, and we'll talk a bit about that. But that means it requires fairly recent hardware to do those optimizations. If you have a very old GPU, those optimizations will not be available, and in those cases torch.compile would just say: look, I can't optimize this. It will still compile it, but there aren't going to be any real performance benefits. Even in those cases, though, torch.compile would not fail; it would just throw a warning saying it can't optimize, and you're going to get the same performance as before.

And just to use another testimonial, this is from the primary maintainer of Hugging Face Transformers, who said that just adding this one single line of code gives them a speedup of between 1.5x and 2x. And this is pretty amazing: it's just one line of code, and that's all you need.

Now, one caveat. What I'm showing here is the time it takes in eager mode, which is PyTorch 1.x, versus compile mode, which is what's supported in PyTorch 2.0. I took a standard model, I think a DenseNet from the PyTorch model hub, and I ran it in eager mode and in compile mode. The first thing you will see is that the eager mode run takes 30 milliseconds, whereas the compiled one takes 35 seconds. That's not what we were promised just now, right? But remember, when you compile models, the first pass is going to be expensive, because in the first pass the model is actually being compiled. When you call torch.compile, it doesn't really do anything; it's on the first call that the model gets optimized. So if you compare just the first run, the numbers look completely upside down. But when you run this a couple more times, you see the difference: the eager eval takes about 0.29 to 0.3 seconds, whereas the compiled one takes about 0.19 to 0.2 seconds, and if you compare the median times you see a speedup of about 1.49x. I'm showing this to say that if you run it for the first time and you're like, hey, my compiled model is taking a lot longer, that is expected for the very first run, but all the subsequent runs will be faster when you're using compile. So, just a little caveat out of the way.
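A rough sketch of that kind of timing comparison (the DenseNet choice follows the slide as I read it; the exact numbers will vary with your hardware):

```python
import time
import torch
from torchvision import models  # assumes torchvision is installed

model = models.densenet121().eval()
compiled_model = torch.compile(model)
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for label in ("first call (pays compile cost)", "second call", "third call"):
        t0 = time.perf_counter()
        compiled_model(x)
        print(label, time.perf_counter() - t0)
```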
Okay, so that checks off the first goal, which was to make training faster and lower the memory usage without changes to code or workflow. The only change we have introduced so far is torch.compile, and really that's the only change there is going to be. For the next part, we'll look at how torch.compile works. I'll introduce a few other arguments that torch.compile accepts, but that's pretty much it.

Okay, so what does behind the scenes look like? This is what the entire process looks like when torch.compile is called, and I'm going to break it down into pieces. It's not super important to understand how all of this works, but it is useful for reasoning about the performance gains, because in some cases you might not actually see a substantial gain, and then it's helpful to know how the process works and whether the results you're seeing match the intuition or not.

The first part is where you write a function that you want to optimize. So we've got another foo function. Just to quickly describe what it's doing: it takes a tensor x, it runs a conv2d on it, then it applies a batch norm layer, a normalization operation, and then it applies a ReLU non-linearity, which is basically a rectified linear unit. These are standard machine learning operations. If you do that, the first thing that happens is called graph acquisition. PyTorch uses a component called TorchDynamo, and we'll talk about these specific pieces, which basically does the tracing: it looks at the execution of the model, breaks this function foo down into its operations, and says, okay, there's an x, there's a conv2d, there's a batch norm, and there's a ReLU. So this is the first step, the graph acquisition stage.

In the second stage, the graph is lowered, and all that means is: sure, conv2d, but how exactly do you execute conv2d? Batch norm, but what exactly does batch norm mean? ReLU, but what exactly does ReLU mean? A conv2d is implemented as, for example, take a weight matrix, apply it to the input, and then add a bias. A batch norm is implemented as: subtract the mean and divide by the standard deviation. And ReLU is implemented as a max of the input and zero. So lowering is breaking the acquired graph down into lower-level operations. The third stage is where graph compilation happens: you've got the graph, you have lowered it, you have broken it down, and now you actually compile the graph. The end result of all this is the compiled function that you get out.

So there are four main pieces. We'll spend some time on TorchDynamo; the other pieces we'll quickly skim over, because they are very backend-ish and don't change a lot about how you interact with the system. So, starting with TorchDynamo: TorchDynamo is basically a Python-level JIT compiler. What it's doing is, when you execute your code, it looks into the Python frames and extracts the sequence of PyTorch operators. And just to go back, that sequence of PyTorch operators is what you're seeing here: this conv2d, this batch norm, these are PyTorch operators. This representation is called an FX graph. The name doesn't really matter, but just so you know what it is. And then you compile it.

So just to show you what this looks like: this is our standard foo function, there's a sine and a cosine. I'm going to skip over this, but this is just some boilerplate code to show how the graph is being captured. And this is what the captured graph for this function looks like: you've got an x and a y, which are placeholders, so these are inputs; then you apply the sine operation and the cosine operation, and then you add everything for the output. So the graph in this case looks pretty much as you would expect it to look, and TorchDynamo captures that graph for you. Also, don't worry about the code; there's a Colab notebook linked at the end which has all this code, so you'll have access to it.
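The boilerplate on the slide follows roughly this pattern: a "backend" for torch.compile can be any function that receives the captured FX graph, so a trivial one can just print the graph and run it unchanged (a sketch, not the exact slide code):

```python
import torch

def foo(x, y):
    return torch.sin(x) + torch.cos(y)

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    gm.graph.print_tabular()  # placeholders x, y; then sin, cos, add, output
    return gm.forward         # run the captured graph as-is, no optimization

compiled_foo = torch.compile(foo, backend=inspect_backend)
compiled_foo(torch.randn(8), torch.randn(8))  # first call triggers the capture
```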
And just to take another quick example: what if the code has if conditions? In that case it breaks the graph into three graphs. It basically has one graph corresponding to the first statement, then one corresponding to the case where b.sum() is less than zero, and another corresponding to b.sum() not less than zero, so greater than or equal to zero. So it breaks the function down into multiple graphs in this case, basically to show that it handles these cases correctly.

It also provides a bunch of helper functions, like torch._dynamo.explain, which explains to you how it is breaking down the graphs and those kinds of things. So for example, for the first function it says: I'm producing one graph, there are no graph breaks, and there are three operations. Okay, sounds reasonable. And for the branching one it says there are six operations, but there's one graph break, because there's an if condition. It just makes it easier to reason through these things.

Okay, then there's a third thing called full-graph mode. Full-graph mode can be enabled by passing fullgraph=True when you call torch.compile; by default fullgraph is False. Again, a full-graph compiled function is also a drop-in replacement, everything works as-is, but there's a spectrum here. On the left-hand side of the spectrum we have eager mode, which is PyTorch 1.x: it's full Python, you do not need any change in your code at all, but there's a lot of framework overhead, you do not get any fusion, and you do not get any static analysis. On the other extreme we have full graph, which is a restricted subset of Python, and using it would sometimes mean you have to make a significant amount of code changes, but full-graph compilation is going to be the fastest. And in between you have the default mode, which is torch.compile with fullgraph=False: it supports pretty much all of Python, there's no code change needed, and you do get some fusion and some static analysis, but it's not as fast as full-graph mode. So the trade-off is: with full graph you'll likely have to make some changes to your code, with the partial graph you don't have to change anything, and eager is no change at all. There are some other examples, but I'm going to skim over them; they just show that if you try to compile a function like this, which has branches, with fullgraph=True, it fails with an error saying there's a condition and you need to get rid of that condition.

Okay, so we talked about Dynamo. The next piece we'll talk about is Inductor. Inductor is basically a compiler backend; TorchDynamo supports a bunch of backends, and this is one of them. The way it works is it uses a library from OpenAI called Triton for generating code for GPUs. So basically it uses Triton to generate the kernels that are executed on the GPU, and instead of you having to write CUDA kernels yourself, Triton does that for you. I'll show an example of what a kernel looks like. But before that: torch.compile accepts an argument called backend, and if you don't pass it, it defaults to Inductor. So the thing I want us to notice is that we've got three operations here: there's a cosine, there's a sine, and there's an a plus b. And this is the Triton kernel that Inductor generates. The thing to look for here is that this is one single kernel, which means this is one single operation that the GPU will be executing.
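The generated code itself is on the slide; as a rough hand-written approximation (my own illustration, not Inductor's actual output), a single fused Triton kernel for these three operations might look like:

```python
import triton
import triton.language as tl

@triton.jit
def fused_sin_cos_add(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # one GPU kernel computes sin(x) + cos(y) in a single pass,
    # without writing any intermediate tensor back to global memory
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.sin(x) + tl.cos(y), mask=mask)
```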
I think the main thing I want to show here is that, if you look at the kernel, the sine and cosine operations are now fused together, in the sense that they are in the same kernel now. So even though they are written separately in the Python code, they are part of the same kernel, and being part of the same kernel means they're fused: it requires less overhead and the execution will be faster. So this is how Inductor makes the code go faster. This also means that more of PyTorch can now be written in Python, because Inductor is generating Triton code, which looks a lot like Python. So implicitly it enables you to write more of PyTorch in Python. There are a bunch of other backends which are also supported; I won't go over them, just to keep to time.

Okay, the next piece is called PrimTorch. So what's the idea behind PrimTorch? In PyTorch 1.x there were something like 2000+ operators, and if someone wanted to implement a backend for PyTorch, they had to re-implement those 2000 operators, and that's a lot of work. Why do we have 2000 operators? Well, because we want the operations to be fast: in some cases the developers would fuse kernels together, hand-write a new kernel, and that would become a new operator. PrimTorch is an effort to reduce those 2000+ operators to something like 250, and then anyone who wants to implement a backend for PyTorch would just have to write these 250 operators. It's somewhat of a work in progress, but Inductor makes it possible, because now you can have these 250 base primitive operators, compose them, and Inductor will make sure the result goes fast. So PrimTorch basically makes it easier to write a PyTorch backend.

The last piece is called AOT Autograd. Again, we won't go too deep into it, but basically when you're using these libraries for machine learning, you want an autograd engine. PyTorch 1.x already had an autograd engine, and for 2.0 they decided to repurpose it. What they introduced is called AOT Autograd, ahead-of-time autograd. What it basically does is: when TorchDynamo captures your graph during the forward call, it also generates the corresponding backward graph. And so when you call backward on your model, backward being the computation of gradients, the compiled graph for the backward pass is already available and just gets executed; that's why it's "ahead of time". AOT Autograd makes it easier to support dynamic shapes and distributed capabilities, and, taking all these things together, it also makes it easier to write a PyTorch backend, because you don't have to worry about how the backward calls work. All the pieces we've looked at so far, the four of them, satisfy the four goals that we had for PyTorch 2.0.

Just to wrap up the torch.compile piece: this is what the full signature looks like. We already talked about what backend does and what fullgraph does, so let's talk a bit about what mode does. There are three different modes for now. The default mode tries to compile the model quickly, without taking too much time or memory. Then there's a mode called reduce-overhead, which tries to reduce the framework overhead as much as possible, but it might increase memory consumption. And then there's max-autotune, which takes a lot of time to compile, but it will try to give you the fastest code possible.
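In code, the three modes are just an argument to torch.compile (a quick sketch; foo is the same toy function from earlier):

```python
import torch

def foo(x, y):
    return torch.sin(x) + torch.cos(y)

quick   = torch.compile(foo)                          # mode="default": fast to compile
low_ovh = torch.compile(foo, mode="reduce-overhead")  # less overhead, may use more memory
fastest = torch.compile(foo, mode="max-autotune")     # slowest to compile, fastest to run
```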
So really it's a trade-off between how quickly you want the model to be compiled and what properties you want the compiled model to have.

This is pretty much what's coming up in the 2.x series: there are going to be more improvements to torch.compile, and there's support coming in for distributed tensors, tensor parallel, 2D parallel, which are different distributed training techniques. I won't be covering them as part of this, but feel free to ask questions about them. And these are a bunch of resources to learn more about what I just talked about. Thank you so much. The slides and the notebook are available at this link. With this, I'll stop, and it's time for questions.

Super, thank you very much. We have five minutes for questions. Please queue up, or ask your questions in the Discord if you are remote. Okay, I'm going to ask a short question: what do you think your personal favorite feature would be, not in this release, but in the next one?

I'm really looking forward to distributed tensors. A lot of the work that I do is basically training these models fast enough, and the kind of bottlenecks and overheads we run into are the communication costs when you're talking to different GPUs across different machines. So the hope is that distributed tensors and the tensor parallel effort, which are basically unifying all the distributed efforts, will just make that go faster. That's the piece I'm most excited about in the upcoming releases. torch.compile has also been a great help; again, out of the box, 1.5x to 2x performance is what I've been consistently seeing across all my modules.

I have a quick question, thanks for the great talk. In your opinion, how can you reduce this cold start of compilation? Are there any techniques that you are aware of?

Cold start of compilation, which specific piece are we looking at?

I mean the time that the compiler takes. Is there any facility to pre-compile or something like that?

Yeah, I think it's a trade-off between how much you want the code to change versus how much of the UX you want to remain unchanged. So for example, let's go back to this example to make it more explicit. If the compiler already knows what the shapes of x and y are going to be, then it has a lot more information and it can cut down the search space where it's trying to optimize. And on top of that, if it knew, for example, that x is always going to be of some fixed size N, then it can generate an even more specialized kernel, again by cutting the search down to those kinds of subspaces. Whereas in the general case, it has to look at x first and then determine what kind of kernel to generate, especially when you throw in branches. In that case, if when you compile it for the first time it goes through a particular branch, then that branch is going to be fast subsequently. But if you go through the second branch the next time, then that is the cold start problem again.
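What the speaker describes next amounts to warming up the compiled function ahead of time; a sketch of that idea (the function and inputs are purely illustrative):

```python
import torch

def branchy(b):
    if b.sum() < 0:   # data-dependent branch: each path gets compiled separately
        return -b
    return b * 2

compiled = torch.compile(branchy)

# warm-up: send one input down each branch so both specializations
# are compiled before latency actually matters
compiled(torch.ones(8))    # takes the b.sum() >= 0 path
compiled(-torch.ones(8))   # takes the b.sum() < 0 path
```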
So maybe you can compile both branches ahead of time, or you can break down your code so that it doesn't have branching or big conditions. So yeah, I see this as a spectrum: how much work do you want the user to do to make the compilation process faster, versus how much do you want to keep a nice UX, even if that means compilation takes a little longer?

Okay, we have time for maybe one more question. On the Discord, the North Hall channel, is there anything right now, or are we going to finish? Okay, thank you very much. Let's give another round of applause. Thank you.