My brief, fun PyTorch intro story: less than a year ago I decided I wanted to do machine learning, and the first thing I did was go to the open PyTorch issues with very little ML background. To me, as an outsider, PyTorch was basically synonymous with machine learning, so it's cool to be here, really fun. Anyway, my name is Will, I work at Hugging Face on diffusers, and this talk is about PyTorch 2.0 and diffusers.

Some background on diffusers, since not everyone knows it: Transformers is our big library, and diffusers is the little-brother library. Fun fact: we have just as many interactions on GitHub, but fewer people working on it because we have fewer users, so it's fun and stressful, but really cool. The main discrepancy with the models you'll find in Transformers is that inference is an iterative denoising process where we predict the whole target sequence (frequently an image) at a time, whereas the inference procedure you see in Transformers is generally autoregressive and incremental. The standard caveat is that you can of course use transformer models, or other non-diffusion models, to predict whole target sequences at a time; mainly it's a separate library for branding, in my opinion, but it works. The mathematical way to say it is that we model the joint distribution of a set of latent variables and a target variable, if anyone's interested in that.

The backstory behind this talk is that we did a blog post on diffusers with PyTorch 2.0, "we" being the other members of my team, since I didn't actually write it. It's a great post with lots of multi-dimensional measurements across different GPUs and the different things you can enable in PyTorch 2.0. If you're building production applications and want a feel for what to expect in terms of inference speed, it's definitely a good place to start. This talk is more of a micro view, because if you want the blog, go read the blog; it's really good.

So, PyTorch 2.0: what do we care about in diffusers? One is torch.compile, and the other is accelerated transformers, really the fast attention blocks (I just needed a cool picture to use). The TL;DR of this talk is that if you're using diffusers as an inference API, you only need a four-line diff to enable the PyTorch 2.0 features. Really it's a two-line diff, because we'll do the attention-processor bit for you if we detect that PyTorch 2.0 is installed. On the attention processor: we have a few different implementations of attention blocks that you might need in diffusers, and that's just the API we expose for choosing one. Examples where you'd pick a different one are pre-training, or training with LoRA blocks, which is technically a different attention processor. You still have to opt into torch.compile yourself, obviously; we don't do that for you. This talk is only going to be about that one line. I wanted to cover torch.compile as well, but I didn't have enough time to look into it in depth, so we're just going to look at what the new attention mechanisms in PyTorch 2.0 are.
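To make the "four-line diff" concrete, here is a minimal sketch of what enabling the PyTorch 2.0 features in a diffusers inference script might look like. The exact class and method names (`AttnProcessor2_0`, `set_attn_processor`) depend on your diffusers version, and, as noted above, the attention-processor line is normally applied for you automatically when PyTorch 2.0 is detected.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Lines 1-2 of the diff: route attention through
# torch.nn.functional.scaled_dot_product_attention
# (diffusers does this automatically on PyTorch 2.0; shown here for clarity).
pipe.unet.set_attn_processor(AttnProcessor2_0())

# Lines 3-4: opt into torch.compile for the UNet, the expensive part of the pipeline.
pipe.unet = torch.compile(pipe.unet)

image = pipe("an astronaut on the moon").images[0]
```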
Okay, so this is the starting point: a UNet, the backbone of diffusion models (my notes here lost their bullet points, so this slide is a bit hard to read). It's the common backbone of diffusion networks: a bottleneck architecture of the kind you usually see in autoencoders, a set of downsampling convolutions followed by a mirrored set of upsampling convolutions, with residual connections from the downsampling side to the upsampling side. The additional thing not shown here, and generally not shown in most UNet diagrams online, is that there are also transformer decoder blocks intermixed throughout. Those transformer blocks are the ones doing cross-attention on text embeddings: when you use something like Stable Diffusion with a text prompt, say "give me an astronaut on the moon", that's how the model incorporates it. It does cross-attention between the intermediate feature maps of the image and frozen text embeddings; the text encoder is always pre-trained, we don't really train custom ones for any of these models. There's no inherent reason you have to use UNets; you can also use transformers, and in fact we have transformer diffusion models in diffusers as well. Image modeling is just sequence-to-sequence, so it's a decent fit.

Step two: where are UNets expensive? These are rough benchmarks I took with NVIDIA's system profiler on our UNet. A caveat: after doing this I realized I forgot to sweep across batch sizes, so let's pretend a batch size of one is representative. I hope the font's big enough, but if you can't read it, the single kernel that takes the most time is the batched attention kernel. It's not more than 50% of the time, more like 20 to 40%, but of all the kernels it's the single most expensive one. Another caveat: it's sometimes hard to figure out what the actual kernels mean, because especially with compiled kernels you get really funky names.

Okay, so attention blocks occur in UNets and they're expensive. Step three: what are attention blocks? This is the canonical attention equation; we've evangelized it a lot, but it's not that complicated. Think about it from the perspective of plain self-attention: all we're doing is taking an input sequence, performing an inner product of it with itself, and then using that to compute a weighted average of the sequence itself. The only thing that's really important here is that the expensive part is the memory cost of that inner product of the thing with itself, and then writing the result back out to memory. What I like about this algorithm figure (it's from the paper on one of the later slides) is that it shows explicitly where things are written to memory: that S matrix and that P matrix are both n² writes to memory, where n is the length of the input sequence. The TL;DR is that those are expensive memory writes; in the naive implementation they might all be launched as separate kernels, and if we're doing backprop we also have to save those intermediate values for the backward pass. So now we know, approximately, why attention is expensive.
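For reference, the canonical equation on the slide is standard scaled dot-product attention; writing it with the intermediate matrices makes the memory cost explicit (the notation below is the usual one, not taken verbatim from the slide):

```latex
\[
S = \frac{QK^{\top}}{\sqrt{d_k}} \in \mathbb{R}^{n \times n}, \qquad
P = \mathrm{softmax}(S) \in \mathbb{R}^{n \times n}, \qquad
\mathrm{Attention}(Q, K, V) = P\,V .
\]
% For self-attention, Q, K, and V are linear projections of the same length-n input,
% so S and P are the O(n^2) intermediates that a naive implementation materializes
% in DRAM (and saves for the backward pass).
```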
This slide is a dump of all the attention blocks in the Stable Diffusion 1.5 UNet (I forgot to mention earlier: everything in this talk is Stable Diffusion 1.5), along with approximate FLOP counts. As we said before, as we move down the UNet the channel count of the feature maps increases but the spatial resolution decreases, and the resolution is what determines the sequence length of those attention blocks. The naive memory cost is O(sequence length squared), and the sequence length shrinks as you go down, which means the most expensive attention blocks in your UNet are the ones at the very beginning, where the sequence is longest, and at the very end on the other side, where you're back at the original resolution.

The TL;DR of how to make this faster is that you can use really smart kernel implementations that avoid materializing those O(n²) matrices, and PyTorch 2.0 makes these natively available to you. Through the scaled_dot_product_attention API, PyTorch has three different kernel implementations: FlashAttention, memory-efficient attention, and a PyTorch C++ implementation, usually referred to as "math" in the docs, which is the one that gives you the most consistent (or maybe always consistent) floating-point arithmetic if you need reproducible outputs.

Okay, we know what PyTorch gives us; now let's talk about why it's better. Side note: I'm 99% sure the algorithm on this slide is the memory-efficient attention that's in the docs; I couldn't find an explicit link, but I'm pretty confident it's the right one. Anyway, look back at the attention equation: the output is really just a weighted average over the vectors in the value matrix, it's not much more complicated than that, and the numerator weighting and the denominator normalization are the things that come from the softmax. If you apply standard high-school algebra for a weighted average, we don't have to compute it the way the attention equation tells us to; we can rearrange it. They call it a "lazy softmax": we accumulate the numerator on the fly and the denominator on the fly, and only at the end take the quotient, instead of materializing the matrix, taking the softmax, and then doing the weighted average afterwards. As a result, we don't need to write the intermediate sequence-length-squared matrices to DRAM, and if we need them in backpropagation we can recompute them with selective gradient checkpointing. It's mentioned a bit in the papers, and it isn't totally clear to me, but we can also save parts of the intermediate softmax statistics to avoid recomputing the whole thing. What's interesting is that this is more FLOPs once you include the recomputation, but it's actually faster in wall-clock time, which is a really cool little tidbit.
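A sketch of that rearrangement, in my own notation rather than the slide's or the paper's, and omitting the max-subtraction trick for numerical stability mentioned below:

```latex
% Lazy-softmax view of one output row i:
\[
\mathrm{out}_i \;=\; \frac{\sum_{j=1}^{n} e^{s_{ij}}\, v_j}{\sum_{j=1}^{n} e^{s_{ij}}},
\qquad s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} .
\]
% Processing the keys/values block by block, keep a running numerator
% (num_i += e^{s_{ij}} v_j) and a running denominator (den_i += e^{s_{ij}}),
% dividing only once at the end, so the n-by-n matrices S and P are never
% materialized in DRAM.
```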
Next is FlashAttention. It's effectively the same idea, but the method of tiling and summing the softmax statistics is slightly different; the TL;DR, as far as I understand it, is that it does fewer memory accesses. I don't entirely understand it, so that's the section of the paper to read if you want to investigate. One other caveat I forgot to mention: this isn't the exact formula you'd use in practice, because there are numerical-stability tricks; when you take a softmax you subtract the maximum of the inputs to the numerator before exponentiating. But that's not super important for this talk.

Anyway, this is a rough performance analysis of the different attention blocks, and as we said earlier, the ones at the beginning, with the longest sequence length, are the most expensive. Note that FlashAttention is not always available; it depends on the attention dimensions and the floating-point data types you use, and I don't think it's available for 32-bit floating point in PyTorch. So the choice is nuanced, but PyTorch always chooses for you. This slide is me forcing it: there's an API to choose which attention implementation gets used. I don't think the logic for which kernel is chosen is explicitly documented anywhere, but it's in the sdp_utils header; I didn't read it in its entirety, but you can go look at it. This was the simplified benchmarking code; it took me about 20 minutes to write, PyTorch is a really good library, I can't get enough of it.
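A rough sketch of what that kind of benchmark might look like, assuming a CUDA machine with PyTorch 2.0 (the shapes are made-up examples; `torch.backends.cuda.sdp_kernel` is the PyTorch 2.0 context manager for forcing a particular backend, later superseded by `torch.nn.attention.sdpa_kernel` in newer releases):

```python
import time
import torch
import torch.nn.functional as F

device, dtype = "cuda", torch.float16
batch, heads, seq_len, head_dim = 2, 8, 4096, 64  # illustrative shapes only

q, k, v = (torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
           for _ in range(3))

def bench(label, **backend_flags):
    # Force a single SDPA backend via the PyTorch 2.0 context manager.
    with torch.backends.cuda.sdp_kernel(**backend_flags):
        for _ in range(3):                      # warm-up
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(20):
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        print(f"{label}: {(time.perf_counter() - start) / 20 * 1e3:.2f} ms")

bench("flash", enable_flash=True, enable_math=False, enable_mem_efficient=False)
bench("mem_efficient", enable_flash=False, enable_math=False, enable_mem_efficient=True)
bench("math", enable_flash=False, enable_math=True, enable_mem_efficient=False)
```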
In summary, PyTorch and diffusers are, in my opinion, a really good example of why the open-source ecosystem is so great: everyone gets to build on top of each other. That was a lot of fairly complicated math, but PyTorch does the work of implementing it and gives us a single function call as the API. In diffusers we then get to look at our code, figure out where in our architecture that performance bottleneck sits, and expose another abstraction to our users; in the case of attention we show two lines here, but really we do it for you automatically. The user gets a four-line code diff, everyone's happy, and everyone gets to make images faster with diffusers. Again, this is the blog post we did; it's much more macro, looking across different GPUs and model types at the effect you'd expect in production when upgrading diffusers to PyTorch 2.0, and it also compares against xFormers, which is where you would originally have gotten these optimized attention blocks. It's a really good post; if you're doing anything in production, that's probably where you should go. And that's where you can find us on the internet: the Hugging Face website and diffusers, plus we have a bunch of other open-source stuff that's probably better known than diffusers. Thank you.

Audience question: how would you distribute the workload across GPUs? You don't need to distribute a single generation; you'd want to batch-distribute across them. Answer: yes, you could batch-distribute it. I don't think you'd need to do anything in the library itself; you could do that in your own user code. Basically, we have these things called pipelines, and you could instantiate one on one of your GPUs and the other on the other GPU, and then feed them separately. We get a lot of GitHub questions like "how do I use this for such-and-such in production", and the answer is that we don't handle that; we just do model definitions and inference code. Our training material is sometimes where we get into more of the nuances of what kind of machines we run on. Yeah, Accelerate is a good library; it's sometimes hard to figure out how to use, but yes. Well, thank you for listening to me.

(Moderator) Okay, if there's Q&A at the end, since it's a little hard to hear folks in the front, I'll run up and give you the mic so everybody can hear.

Hi, I'm Ashok Emani, I work on the PyTorch team at Intel. I'll be walking through a few PyTorch optimizations from Intel, Intel Extension for PyTorch, and some of our efforts in upstream PyTorch and community projects. Here's the big picture of how the different layers fit together. At the bottom layer we have highly optimized performance libraries for deep learning and distributed training, oneDNN and oneCCL; these libraries work with all the Intel platforms, including CPUs and GPUs. We upstream these optimizations to frameworks such as PyTorch and to Intel Extension for PyTorch (IPEX), and then we make sure the optimizations are enabled in ecosystem projects. Usually most of these optimizations get picked up by default by the ecosystem projects, but sometimes, for specific use cases, we have to enable and fine-tune them using Intel Extension for PyTorch. I'll go through some of these for Hugging Face, TorchServe, PyG, and DeepSpeed.

Here are some of the milestones, from Intel's point of view, in upstream PyTorch. We've had oneDNN as the default since 2018, which means end users get the best performance out of the box in eager mode. Since then we enabled the blocked layout, which required converting tensors into a blocked-layout format; with the later channels-last support, we no longer have to do that. We then enabled training and DLRM optimizations, and added automatic mixed precision with bfloat16 support and the XPU device, which means Intel GPUs can be plugged in. More recently, in 1.13, we added the channels-last layout I mentioned, so you don't have to deal with blocked layouts anymore; we enabled oneDNN fusions in the TorchScript JIT mode; and we added the oneDNN quantization backend, which has recently been promoted to the unified default x86 backend for quantization in 2.0. In 2.0 we worked closely with the Inductor CPU team to enable CPU optimizations with oneDNN fusions, and we also added support for graph neural networks and the oneDNN Graph integration. I'll go through some of these features in the next few slides.

So 2.0 is a game changer in the sense that Dynamo lets us expand our model coverage. Unlike TorchScript, where we were limited in supporting graph mode for all models, with Dynamo the model coverage is much better, and we worked with the Inductor and Dynamo core teams to enable oneDNN fusions by default. So with torch.compile, we have FP32 inference on CPU working now.
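As a minimal illustration of that last point (a generic sketch, not code from the talk, assuming torchvision is installed for the example model), compiling a model for CPU inference is just the standard torch.compile call:

```python
import torch
import torchvision.models as models  # any FP32 model works; ResNet is just an example

model = models.resnet50().eval()          # FP32 model on CPU
compiled_model = torch.compile(model)     # TorchDynamo + Inductor backend

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = compiled_model(x)               # first call triggers compilation; later calls are fast
```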
As you can see, we started working on this in October 2022 and have kept adding new optimizations; right now it's about a 1.8x speedup multi-threaded and 1.44x single-threaded, across the different benchmark suites: TorchBench, Hugging Face, and the TIMM vision models. We plan to keep adding optimizations, bfloat16 and training optimizations, so you can expect those in 2.1.

You can think of Intel Extension for PyTorch as a staging ground: we add the latest and experimental features and support for new platform features there, and eventually they're upstreamed to PyTorch. It also has an easy-to-use Python API, meaning users can often enable most of these optimizations with a single line of code; I'll go through some examples. If you look at the picture here, we add optimizations at every layer: the ATen ops level, the graph level, custom operators, custom optimizers, and the oneAPI stack for GPU is also enabled. Finally, there's support for both Python- and C++-based deployment, so users can deploy directly with C++ or, as usual, through Python.

For the big picture of the optimization techniques, think of them as three buckets: operators, graph, and runtime. In the operator bucket we have the ATen operator optimizations: AMP (automatic mixed precision), layout optimizations, and vectorization and parallelization at the operator level. At the graph level we had TorchScript support in 1.x graph mode, and since 2.0 we support Dynamo and Inductor; and through IPEX, as I mentioned, a simple ipex.optimize can trigger most of these optimizations with a single line of code change. The runtime bucket is primarily about making sure that, during deployment (and even benchmarking), OpenMP and the threading runtime are heavily optimized and tuned, things like core pinning and affinity, and we provide a launcher that simplifies this process. We also have weight sharing if you want to run multiple streams in the same process, for example; I'll walk through some examples.

So, quickly through a few optimizations. This is an example of how you can enable autocast: it's the typical PyTorch usage, nothing special here, just the highlighted code with autocast, and that triggers the automatic mixed-precision feature for you. For channels-last: unlike the blocked layout, where users had to insert to-dense or layout conversions, channels-last gives the same performance as the blocked layout without users having to insert those conversions, which gets us equal performance and better model coverage. You enable the feature using the channels-last memory format; with IPEX it's even simpler, because these optimizations are enabled for you automatically with a single line of code, ipex.optimize.
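A minimal sketch of those three pieces together (the model and input shapes are placeholders; ipex.optimize is the real IPEX entry point, but exact defaults vary across IPEX versions):

```python
import torch
import intel_extension_for_pytorch as ipex  # assumes IPEX is installed

model = MyModel().eval()                     # placeholder FP32 model
x = torch.randn(1, 3, 224, 224)

# Channels-last memory format instead of explicit blocked-layout conversions.
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

# One-line IPEX optimization (operator/layout optimizations, optional bf16 weights).
model = ipex.optimize(model, dtype=torch.bfloat16)

# Autocast on CPU for bfloat16 automatic mixed precision.
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model(x)
```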
For quantization, the typical flow in upstream PyTorch is a prepare step, then a calibration step for static quantization (for dynamic quantization you don't have to do anything there), and then a convert step; that's the standard upstream PyTorch way of doing quantization. With IPEX the flow is similar, but we also integrated Intel Neural Compressor, our low-precision optimization library, through an autotune feature; option three here enables that. So we have prepare, convert, and autotune available through IPEX: you can import those, run the three steps, and benefit from the autotuning from Intel Neural Compressor.

For graph mode, there are the 1.x TorchScript optimizations: you enable them the usual way, either JIT trace and freeze or scripting, and you can optionally enable the oneDNN Graph extension. In 2.0 you just call torch.compile, as you know. Graph mode in IPEX is again simpler: you call ipex.optimize with graph_mode=True and it takes care of it.

Finally, the runtime. As I mentioned, the runtime side has a few features; one of them is the launcher, plus memory buffer pooling, so you can launch multiple streams and do weight sharing. And because it's the IPEX launcher, we can enable the IPEX optimizations without any code change. Here's an example of using the launcher to automatically enable the optimizations for bfloat16 BERT inference, and you can apply the same thing to other models; and here's an example of the multi-stream mode I just mentioned.

The oneDNN Graph extension was recently added to oneDNN, and we've enabled it in 2.0. It primarily targets subgraph fusion patterns for compute-intensive ops and their neighboring ops, so it tries to do more aggressive fusions than the ones available in the existing TorchScript path. As I mentioned, it's a beta feature in 2.0, with support for FP32 and bfloat16 inference. The usage is simple, with an easy-to-use interface: you just turn it on via the JIT enable call I showed in the previous code snippet, and you can learn more at the link here. Because it has more aggressive fusion patterns, you can generally expect to see more performance across different models.

In 2.0 we also added optimizations for graph neural networks, both to PyTorch 2.0 and to PyG. In 2.0 we added support for sparse matrix multiplication, scatter-reduce, and gather, which enable the GNN models. On the standard PyG benchmark we see up to 4x performance improvements. In PyG itself we added index sort and CPU-affinity handling, which helps performance there as well, so you get optimizations in both PyTorch 2.0 and PyG for GNNs.

We work closely with Hugging Face to make sure all the optimizations I just discussed are properly enabled. Most of the upstream PyTorch optimizations just get picked up automatically, but we've worked closely with Hugging Face to enable Intel Extension for PyTorch and the Intel Neural Compressor I mentioned earlier. As an example, for Stable Diffusion you can do few-shot fine-tuning on your own custom dataset using four Sapphire Rapids nodes in under five minutes, then run inference in under five seconds on the same chip, and we have a live demo of this on the Intel Dev Cloud and AWS at the links shared there.
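Referring back to the prepare / calibrate / convert flow described above, here is a minimal sketch of the standard upstream PyTorch static post-training quantization path with the x86 backend (FX graph-mode API; the model, calibration loader, and shapes are placeholders, and the IPEX / Neural Compressor autotune variant is not shown):

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = MyModel().eval()                       # placeholder FP32 model
example_inputs = (torch.randn(1, 3, 224, 224),)

# 1. Prepare: insert observers using the default x86 (oneDNN-backed) qconfig.
qconfig_mapping = get_default_qconfig_mapping("x86")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)

# 2. Calibrate: run representative data through the prepared model.
with torch.no_grad():
    for batch in calibration_loader:           # placeholder data loader
        prepared(batch)

# 3. Convert: produce the int8 quantized model.
quantized = convert_fx(prepared)
out = quantized(example_inputs[0])
```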
We also work closely with the TorchServe team at Meta and other TorchServe contributors, and we've enabled Intel Extension for PyTorch and the IPEX launcher there; the upstream PyTorch optimizations get picked up as well. Again, enabling this in TorchServe is a single line of configuration, ipex_enable=true, and the launcher can similarly be enabled in the config.properties for TorchServe. We benchmarked a couple of models on the torchvision side and the NLP side and saw 7.7x and 2.2x improvements, and you should expect similar improvements across a wide set of models in the TorchServe repository.

We also enabled optimizations in DeepSpeed. They were enabled through a device-independent abstraction: on the CPU side there are the CPU op-builder and accelerator implementations, and on the GPU side there's Intel Extension for DeepSpeed, which provides similar abstractions with SYCL op builders and XPU kernels. Because it's a device-independent abstraction, you can expect to see the performance gains in your DeepSpeed workloads; the integrations and optimizations we discussed earlier should carry over through this abstraction. There's a URL here if you're interested; please check it out.

Finally, GPU support. As I mentioned, we have full GPU support for Intel GPUs. The interface is the standard PyTorch API, using the concept of the XPU device I mentioned earlier, and the backend kernels are implemented with the oneAPI programming model. This is released in the same Intel Extension for PyTorch, so if you go to the Intel Extension for PyTorch GitHub repository, it supports both CPU and GPU starting with the 1.13 release. A few links and resources are here.
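A minimal sketch of what using the XPU device looks like, assuming an Intel Extension for PyTorch build with GPU support is installed (the model is a placeholder and defaults may differ across IPEX releases):

```python
import torch
import intel_extension_for_pytorch as ipex   # XPU-enabled IPEX build assumed

assert torch.xpu.is_available()              # "xpu" device registered by IPEX

model = MyModel().eval().to("xpu")           # placeholder model moved to the Intel GPU
model = ipex.optimize(model, dtype=torch.float16)

x = torch.randn(1, 3, 224, 224).to("xpu")
with torch.no_grad():
    out = model(x)
```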
I think that's all I had; I'll leave the links up, and if you have any questions I can take them.

Audience question: I was very interested in the Stable Diffusion example you showed, the five minutes for fine-tuning and five seconds for inference. You said a few images for fine-tuning; can you give an example of the kind of thing you could do? Answer: this is more Stable Diffusion specific, but we used a small custom example dataset, just a few images, maybe even one, and applied personalized fine-tuning on them. The main takeaway I wanted to give is that once you enable all the optimizations, such as bfloat16 and, on the inference side, int8 using our Neural Compressor, you can shorten the time enough to do your own custom few-shot fine-tuning of Stable Diffusion in about five minutes on CPUs, without having to use GPUs, and then run inference on the same CPU in under five seconds. We have an article with the details, and I also provided a live demo so you don't have to build your own pipeline; you can try it out on Hugging Face and, I believe, AWS at the links.

Follow-up: I think I missed that link; I went to it and it wasn't online, maybe I got the wrong one. Answer: let me follow up offline; I can show the link here and make sure you can access it. We also have a tutorial and a Medium blog post that go into detail on how you can bring in your own custom dataset and do this. Very exciting, thanks.

Any other questions for Ashok? Question on on-device deployment: an older on-device deployment story would be ONNX Runtime with an OpenVINO execution provider. Given what you've shown here for a desktop, what would be the new story for the most optimized on-device deployment? Answer: ONNX Runtime you can use as a runtime for deployment; PyTorch is not just for deployment but also for doing your own model development. Follow-up: I understand that. I'm saying, if you were deploying an ONNX model on a Windows device, you would use the OpenVINO EP to get really good performance. Does anything you're showing here replace that on-device, or are we only talking about cloud inference and training? Answer: no, it doesn't replace it; you can still use OpenVINO with ONNX Runtime as a backend for deployment, and it gives the best performance for those use cases. This is more of a complementary option if you want to deploy, even on Windows, but most of this talk is about Xeon, Linux, and GPUs. For Windows there's upstream PyTorch; you can install PyTorch on your Windows device and run it, but it's complementary, not a replacement. Thank you very much. Thank you.