Well, thank you, and thanks for coming back after the break. A quick introduction: I'm Derek Boyas, I'm with AMD. I've been at AMD as a product manager for the GPU compute software stack for the last five and a half years, and through that time I've spent a lot of it looking at PyTorch as one of the main machine learning frameworks. I'll give a little bit of history today and introduce some of the interesting libraries we've been working on.

At AMD, AI is really one of the key pillars. It's important for both training and inference, in the cloud and on edge devices. AMD has a wide range of hardware, and we see that AI is becoming pervasive across the industry; everyone here knows that, based on even the last few months of interest. What I really want to talk about today is the different hardware platforms, how those are enabled with PyTorch, and then some additional libraries.

First, in the machine learning flow, different amounts of compute are required for the different stages. Pre-training is really intense: a large number of servers, with many GPUs or other accelerators connected, and a large amount of data. For fine-tuning, and you've seen all the papers come out recently, there are a lot of ways to reduce the amount of data required, and you want to iterate quickly. On the inference side, it all comes down to latency and throughput: how fast can you respond and get that answer out? Through all these different use cases you have different hardware requirements, and AMD has a large suite of devices, everything from small CPUs to large accelerators. I'm in the Data Center GPU group, so Instinct is the brand of our product, but really it depends on what your application is. You can't just say, "oh, you need to use this device." It comes down to the workload you're doing, the performance you need, and the power, which is actually quite important; again, depending on where you live, power is rising very quickly.

So if we look at the suite of devices, this gives a bit of an overview of the branding. On the left-hand side, the CPUs are called Ryzen and EPYC: EPYC is the server CPU, Ryzen is the client CPU, and Pensando is the network processing component, from a company AMD bought last year. On the accelerator side we have the Radeon graphics cards, which are also used for gaming but contain a lot of math accelerators. Then Instinct is the data center GPU component, which has a lot of matrix math operations and high-performance compute support for running in large systems, like Frontier, an exascale-class system. It doesn't have the graphics components, so you don't do a lot of graphics rendering or ray tracing on it. And then there's adaptive computing, which comes from the Xilinx acquisition (AMD acquired Xilinx last year as well), and there are a bunch of different brands there.
Zynq is a small Arm-based processor with FPGA logic gates, and there are the Virtex, Versal, and Kintex families, various devices for your different workloads.

Now for the architecture of these devices. If we start on the left-hand side, on the CPU we call it Zen (or "Zed-N," depending on whether you're from Canada). Then on the accelerator side, for Radeon it's RDNA and for Instinct it's called CDNA, so that's R for Radeon versus C for compute in the name of the architecture. They have somewhat different properties in terms of how the hardware has been optimized. And then on the adaptive computing side it's called XDNA, and we'll see that as we come through a couple of the other slides.

So let's talk a little bit about PyTorch and AMD catching the bus. We were talking earlier about when PyTorch first started, back around 2016-2017. At that time I was at AMD and we were looking at all the different frameworks available; I think there were seven or eight on my little roadmap, and they've since consolidated, but PyTorch has been there for quite a while. We had a beta in 2017 and did some porting of Caffe to HIP Caffe. HIP is the programming language that's very similar to CUDA, so if you have experience with CUDA it looks the same, with a similar API. We created a tool called hipify to convert CUDA code to HIP; that's what we did with Caffe, the first project to really use that tool, and hipify then became part of the compile path. As we worked through this we kept that tool updated, so as new APIs were implemented it would automatically convert them, and that's actually still the way it works today: the whole code base is still CUDA, and as you compile it, it targets the AMD device with the HIP API calls. But as you can see, in the phase between 2018 and 2021 we were catching the bus, and this is where we spent time on CI. We set up servers so that people's pull requests would actually run on AMD hardware, really getting to the point where we could have front-page support, and then finally this year I would say we have day-zero support on a new release. That's a really great milestone to see, because trying to maintain your own fork is not a winning battle.

To run the code today, it's very straightforward. You either get a container or you install the ROCm driver, and then you just install the pip wheel from pytorch.org. You go through the little selection box, copy and paste the command, and you're up and running on our hardware. And if you've already written an application using PyTorch, you don't actually have to change any of the code: you leave it where it says device equals "cuda" if torch.cuda is available, that code stays there, it works, and it uses the AMD hardware. We have support for the nightly builds too; that's where a lot of the development happens, and I invite you to check it out there.
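To make that concrete, here's a minimal sketch of what running unmodified PyTorch code on a ROCm system looks like. The install command and wheel index URL below are placeholders; copy the exact command that the selector on pytorch.org gives you for your ROCm version.

```python
# Install the ROCm build of PyTorch -- take the exact command from the
# selector on pytorch.org; the index URL below is only illustrative:
#   pip install torch --index-url https://download.pytorch.org/whl/rocm5.4.2

import torch

# Unmodified CUDA-style code: on a ROCm build, "cuda" is the AMD GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
y = x @ x  # runs on the Instinct/Radeon GPU when one is present

# ROCm builds report a HIP version string here; CUDA builds report None.
print(torch.version.hip)
```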
So, a little more detail on what's happening behind the scenes. I mentioned that we're converting the code, using hipify as the tool, but you can still find out at runtime whether the device is a HIP device; there's a little bit of code (as in the sketch above) that checks torch.version.hip to tell whether this is a ROCm build of PyTorch.

A lot of this work happens in the background through libraries. There are, of course, lots of common math and communication libraries that help accelerate the workloads, and these are integrated into PyTorch and/or other projects. Some of them are more focused on HPC, but there are a lot that are interesting for the machine learning space. RCCL is the collective communication library; it actually uses the same API as NCCL, so it doesn't go through the HIP conversion: it's the same API, with the AMD implementation underneath to connect to our hardware. Composable Kernel is interesting because it gives you options for building up fused kernels, so if you're really spending a lot of time building your own operators, you can build fused kernels and get very good performance that way.

If you do want to build the PyTorch code base yourself, there's a script included in the repository to do that. It's pretty straightforward and works out of the box. Then, on the various back-end implementations, there's eager mode, and with the new torch.compile in PyTorch 2.0 that's available as well; it uses the OpenAI Triton implementation (there's a small sketch of that at the end of this section). We've also worked with the AITemplate library, so there's a Composable Kernel component that plugs into that, and that's one of the methodologies we've seen give the best inference performance on the Instinct GPUs. There's a nice blog post, so when you get the slides you can click the link and view all the details.

So that's the ROCm side, on the Instinct (CDNA) and Radeon (RDNA) type devices. Next up is the Zen devices, meaning EPYC and the other CPU infrastructure. ZenDNN is very similar to oneDNN, as you can tell from the diagram; it's mostly used for inference, and we're working towards getting it upstreamed into the main PyTorch repository as well. It looks at the EPYC CPUs and creates optimized kernels for our hardware. Then I also mentioned XDNA, which is the FPGA side. There isn't a built-in path in PyTorch, but it can take PyTorch models, convert them, and then run an inference workload on most of those devices. I'll talk a little more at the end about how we're merging some of this work together.

In terms of the ecosystem, it's really important to be able to change code and see what's happening all the way from the application layer down to the device driver, and we've open-sourced our full stack. So you can actually go and rebuild the kernel driver if you find you need some tweak, or something needs to be modified at that layer. You can tell that there are a few different licenses required for the different parts of the stack; that's really due to their heritage. For example, the GDB debugger is still GPL-licensed because of its history, but the advantage is that since we're upstreaming that code into GDB, you can use your standard GDB commands to do debugging.
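Picking up the torch.compile point from above, here's a minimal sketch of switching from eager mode to the compiled path. The model is just a stand-in; on a ROCm system the Triton kernels that get generated target the AMD GPU.

```python
import torch

# A stand-in model; any nn.Module works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to("cuda")                      # "cuda" is the ROCm GPU on an AMD system

compiled = torch.compile(model)   # TorchInductor emits Triton kernels for the backend

x = torch.randn(32, 1024, device="cuda")
out = compiled(x)                 # first call compiles; later calls reuse the kernels
```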
And we really wanted to work with the community, so all this code is posted up on GitHub. It's been, let's say, a struggle to maintain all the requests, but we're getting better at monitoring the GitHub repositories for input like issues and pull requests, and actually tracking the components there.

One of the other interesting tools is the Kineto profiler. This uses the core rocprof component, with roctracer, as the way to extract information from the GPU, and then it plugs right into TensorBoard. So you can use the tools you're familiar with today and get a performance perspective on what's happening in the kernels: you can see the memory transfers from the CPU to the GPU, and the time the different kernel components take (there's a small sketch of this at the end of this section).

On the ecosystem side, there are quite a few libraries even within PyTorch, and as of the PyTorch 1.12 release these four core domain libraries were all enabled with the ROCm ecosystem: text classification, recommender systems, the computer vision libraries, as well as support for audio and signal processing.

On the expansion side of working with PyTorch is ONNX Runtime. We spent a lot of time supporting this infrastructure to get good performance with our core libraries, but also upstreaming it into the ONNX Runtime project. For both inference and training you can get pre-compiled components for the ROCm system; torch-ort is the Python module you use, and it's pretty straightforward to plug in (there's a small sketch of that below as well). As you can see from the architecture of how ONNX Runtime is built, it uses the execution provider, and that execution provider is a nice abstraction for the different hardware implementations. There are quite a few available for the different hardware devices in the market, and I've highlighted in purple the ones for AMD: you have the standard CPU components, the MIGraphX and ROCm implementations, as well as Vitis AI for the Xilinx FPGA support.

Then another interesting project is DeepSpeed. This gives you great support for running those large language models on a reduced number of GPUs: you get better memory optimization, and you can offload to the CPU or even offload to disk. There are a couple of links there for getting the upstream components or prepackaged binaries, and a sketch of what the configuration looks like follows below.
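Here's the Kineto/TensorBoard profiling sketch mentioned above. This is the standard torch.profiler flow; the model and step counts are placeholders, and viewing the trace assumes the torch-tb-profiler TensorBoard plugin is installed.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(4096, 4096).to("cuda")   # "cuda" maps to the ROCm GPU
inp = torch.randn(64, 4096, device="cuda")

# GPU activity is reported under the CUDA activity type even on ROCm builds.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
) as prof:
    for _ in range(6):
        model(inp)
        prof.step()

# View with:  tensorboard --logdir ./profiler_logs
```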
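And the torch-ort plug-in mentioned above. A minimal sketch with a stand-in model, assuming the pre-built torch-ort/ONNX Runtime packages for ROCm are installed:

```python
import torch
from torch_ort import ORTModule  # from the pre-built torch-ort packages

# Any ordinary PyTorch model; this small MLP is a stand-in.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).to("cuda")

# Wrapping the model routes forward/backward through ONNX Runtime's
# execution provider; the rest of the training loop stays plain PyTorch.
model = ORTModule(model)

x = torch.randn(64, 784, device="cuda")
loss = model(x).sum()
loss.backward()
```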
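And a rough sketch of the DeepSpeed side. The config keys are standard ZeRO options (stage 3 with CPU offload; disk offload would use the "nvme" device instead); the exact initialization API can vary by DeepSpeed version, and training jobs are normally started through the deepspeed launcher rather than run standalone.

```python
import torch
import deepspeed

model = torch.nn.Linear(8192, 8192)  # stand-in for a large model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, optimizer state
        "offload_optimizer": {"device": "cpu"},  # push optimizer state off the GPU
        "offload_param": {"device": "cpu"},      # and parameters too, when not in use
    },
}

# Normally launched with:  deepspeed train.py ...
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```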
Then this is a topic I find pretty interesting that's happening in this space: the concept of intermediate representations, and working through these different layers to figure out how you can abstract the hardware but still get very efficient kernels. Usually, as you abstract more, you get less efficiency, so there's always this trade-off between building pre-compiled libraries, doing something on the fly, or working at a very high level in the framework. We've been doing some work with the Triton IR there, as I mentioned with torch.compile, but OpenXLA with StableHLO and IREE is another methodology: using MLIR to reduce those calls into a nice abstraction for the hardware, and then have that hardware support for different devices. What this allows you to do is generate kernels for new hardware, or a different architecture, very quickly. It may not give the best performance, but it gives you that portability.

Speaking of performance, with PyTorch 2 we did a quick study on some of the core benchmarks included there with the Hugging Face models, and it shows a nice boost: on average about 1.5 times the performance going from PyTorch eager to the torch.compile mode, with some models showing much higher gains. Again, it depends on the model and on the kernels that get generated, and I see this improving over time as people become more aware of how to build those with MLIR.

For the long term, this is where AMD is looking to put all of our hardware platforms together under one roof, specifically for inference. We call it the Unified Inference Frontend. It's available today, and we keep adding more devices under that structure. It's available through the inference server, so if you have a whole suite of models already developed and deployed, you can run an inference server; it's a standard kind of JSON interface, with an API you can use to get the components and run them on the different devices. We have pre-trained and pre-optimized models for all the different hardware platforms, and we'll expand that over time. This supports many different frameworks, but as you can see there are sort of three core stacks for the three types of hardware architectures, and we're also looking at how we can optimize that so it's less effort to migrate from one device to another.

That's all I had for today; I'm open for questions.

[In response to an audience question:] I don't think any of it's intrinsic. It's more just looking at how to expose hardware features. One thing we've spent time on is the matrix math operations we have in the hardware: making sure that the upper layers can see that those exist and know that they can generate code that will then make use of them. So it's all about exposing it to the right layer, and then making sure those layers can create the code that's optimized. I don't see anything limiting; it's just a case of spending time on the intermediate layer.

Any other questions? Has anyone used an AMD hardware device? I see a few, nice. Maybe I should have shown the Hugging Face side of things too, because I know a lot of people have tried just the standard models, anything that's there for PyTorch, and they've actually run it out of the box. It's a testament to the whole ecosystem working; it's nice to see. And if you find something that doesn't work, file a bug and let us know.

All right. Well, thank you for your time.