Yeah, so my name is Chen and I am a software engineer at Meta. Today I would like to introduce ExecuTorch, a native solution from PyTorch to enable and accelerate PyTorch models on edge devices. We will cover four items today: why on-device AI is important, why it is challenging, what ExecuTorch's value proposition is, and a high-level introduction to the stack.

First, what is "on device"? It can be a laptop, AR/MR devices, AR glasses, embedded devices, mobile devices -- all of these are edge devices. One newly popular kind of edge device is probably the wearable I'm wearing every day. The advantages of running on device are straightforward: privacy and security, performance, and energy saving, and we will cover each in a bit.

First, privacy and security. For any model running in the cloud, I need to send my data to the cloud, the model runs inference, and the result is sent back. If we can run the model on device, the user won't need to share data with the cloud. For example, I wear my watch every day and use it for running and exercise. The watch can monitor my heartbeat and my cadence, and I rely on it to be my own coach. In this case, I probably don't want to share my daily routine and other sensitive biometric information. It also enables training my own personalized model with my own data, without worrying about sharing sensitive data with the cloud.

Performance. Running a model in the cloud means every request has to make the round trip from device to cloud and from cloud back to device. The time spent on that traffic is unavoidable, and it affects user experience. Say, for example, I am in a rural place and the network is terrible -- the user experience probably won't be great.

Lastly, energy saving. If we can run inference on device, the cloud won't need to provision resources for it; the cost of running one inference transfers from the cloud to the device.

Now for some example use cases. Vision models: some of the most common ones include image segmentation, object detection, classification, and augmented reality. Here the input is usually an image, and the output differs depending on the model. Speech models: automatic speech recognition is one of the most common speech models, converting audio to text. Another example is noise suppression, which is used very often in video calls so we can suppress background noise. Language models usually just take text; common examples include assistants and intent understanding. The popular large language models we use so often nowadays are also typical language models. Generative AI is definitely a hot topic today and yes, we are working on it. Llama 2 is one of our focuses, and we are also looking into other models like Emu for image generation. The generative AI use cases overlap with the other three, but we list them separately because of their uniqueness: they are usually larger models and compute heavy.

For Meta, why do we care about on-device AI? One important reason is the family of apps. In Facebook and Instagram, most users are on mobile devices, and they love to create and share experiences with AI effects. For me, I like using different filters in Reels and in video calls.
Say, for example, filtering out the background or noise suppression -- all of these are super handy for users. And then for Reality Labs, mostly AR and VR, on-device AI is also crucial to enable many applications. In mixed reality we may interact with virtual objects in the real world, and we have an assistant that is always available on the smart glasses, so I can use my voice to take a picture or make a call.

So here is the question mark: from a PyTorch program, where everything happens in Python, what are the options to enable running on device -- say on a resource-limited device where only C++ is available? What are the options here?

This diagram shows the current machine learning landscape. It is a somewhat PyTorch-centric view, but if you look at the path to accelerating a PyTorch model for on-device AI, it has to cross multiple boundaries. A trained PyTorch model is often converted to another IR, an intermediate representation, leaving the PyTorch ecosystem there; from that IR, or directly from the PyTorch model, it enters yet another vendor-specific arena. The model or IR is then converted to the specific IR that the toolchain can understand, and at that step we generate artifacts that can be deployed on device. This is the current landscape because AI on edge is characterized by diversity: diversity of hardware, diversity of operating systems (on some embedded devices there is not even an operating system), diversity of chipsets, proprietary IP used by OEMs, and platform-specific proprietary toolchains.

So how does this impact developers? The resulting production workflow involves lowering scripts -- lowering here means generating platform-specific artifacts using the targeted toolchains. Such lowering crosses the boundary from one domain to another, and it will likely introduce loss of information, because each domain has its own language and the translation from one language to another is not perfect. Thus, by the time the model is deployed, it is no longer a PyTorch model, and we cannot reason about where the parts of the deployed model came from in relation to the original PyTorch module. Maintaining the versions of the tools is also non-trivial: imagine one of the conversion tools gets upgraded -- then the workflow chain here is likely just broken, and that's hard.

Now you might say: sure, there is a loss of information, but I am able to get an awesome acceleration of my model. I grant that you get performance, and maybe some device reach as you deploy some of your models on different devices using different toolchains. I say "some" because these toolchains are all or nothing: if I have a model that is not entirely lowerable, then lowering just fails.

It also has a cost, because we lose portability: the production code needs to be aware of the target device, as the model loading and running APIs become platform specific. We lose productivity, because the deployment engineers need to use ten different toolchains, each of which may produce its own error codes, and the round trip all the way back to the beginning can be really time consuming. And we also lose visibility in terms of profiling and debugging: if some part of the model is really slow, it is now really painful to figure out where that op is and what the original PyTorch line was.
What we also lose is coverage, because the all-or-nothing nature of the lowering process means it is not composable. Say, for example, I cannot lower part of my model -- then pretty much we just cannot use this hardware to run this model.

So now we know the problem: machine learning engineers love authoring models in PyTorch, but on-device deployment requires developers to leave the PyTorch ecosystem via various converters and platform-specific toolchains. Our solution is ExecuTorch. It integrates partners and platforms with the PyTorch ecosystem: by providing a set of integration points, which we will talk about in a bit, we want vendors and OEMs that provide on-device machine learning acceleration and toolchains to become our partners and be part of the PyTorch ecosystem.

The value this brings to PyTorch users is, first, performance. We also gain portability, because the same APIs can be used for model loading and running regardless of the hardware it is accelerated on -- integration into the ExecuTorch stack means the user-facing API is the same. It also addresses the productivity pain point, thanks to the standardized op set and minimal loss in translation. Furthermore, being part of the ecosystem improves performance and functional debuggability, which we will also cover in a bit. We also improve model coverage, because the composable nature of the solution means that if, say, part of a model is not supported by the device, the rest of the stack can still support it.

So here I would like to introduce how, with ExecuTorch, we enable a PyTorch program to run on device. We start with the original PyTorch program and use export, an API introduced in PyTorch 2.0, to export the PyTorch program to a graph representation. ExecuTorch provides transformation and optimization entry points for the target hardware, and then it compiles down to the ExecuTorch program, which contains instructions that the runtime is able to execute. The ExecuTorch runtime, which is installed on the edge device, loads the ExecuTorch instructions and then runs inference -- the runtime pretty much just passively executes instructions.

Let's zoom in a little bit and first take a look at the PyTorch program. This is a snapshot of a PyTorch program: a typical PyTorch program is a torch.nn.Module, usually composed of multiple modules and a set of PyTorch operators. By leveraging torch.export, which was introduced in PyTorch 2.0, we are able to capture the graph. The graph is represented in Export IR, and as we can see here, each node represents an operator. With the graph it is much easier to work on the program. If we look at the example code, there are two convolutions: a convolution followed by a ReLU, then the next convolution and another ReLU. And if we look at the graph, it is exactly that -- convolution followed by ReLU, convolution followed by ReLU, and then the output. So with the graph, we are able to recognize which operators are used in the PyTorch program. The PyTorch 2.0 capture mechanism we use here, torch.export, generates a sound graph in Export IR that is concise yet can capture a wide range of dynamism.
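To make this concrete, here is a minimal sketch of the kind of program and capture just described. The module and tensor shapes are hypothetical, invented for illustration; torch.export.export is the PyTorch 2 capture API the talk refers to.

```python
# Minimal sketch: a two-conv/ReLU module and its graph capture via torch.export.
# The module definition and shapes are hypothetical examples.
import torch
import torch.nn as nn

class TwoConvRelu(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        return x

example_input = (torch.randn(1, 3, 32, 32),)
exported = torch.export.export(TwoConvRelu(), example_input)

# Printing the captured graph shows convolution -> relu -> convolution -> relu -> output,
# mirroring the structure of the original Python code, with each node being an operator.
print(exported.graph_module.graph)
```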
torch.export generates standardized Core ATen operators and ensures consistency between authoring and deployment. And then here is the compile process: users can pick, choose, and customize their optimizations through a set of well-defined APIs. Let's zoom in a little and see what the customizable options are.

First, a very commonly used API is quantization. Quantization is a typical way to compress model size, and it is also a critical step for compressing large models so that we can run them on device.

The next common API is delegation. Delegation means delegating to an accelerator: we can delegate part of the graph, or some of the graph, to hardware or software that is more performant. Say, for example, we have XNNPACK, which we know is a highly performant library; if the graph has a linear operator and we know XNNPACK can run linear way faster, then we can delegate the linear operator to XNNPACK. There are other powerful backends available on device, like the GPU and other accelerators, and we can leverage those backends through delegation as well.

Lastly, memory planning. During inference there can be many intermediate tensors, and we can plan their memory ahead of time to save compute. Say, for example, we add two tensors and the result of the add is fed to the next matrix multiply operator; the intermediate tensor can then be freed and its memory reused by other computation.

With all these entry points, developers can pick and choose based on their use case, their requirements, and the target hardware architecture; a minimal end-to-end sketch of this flow follows after the SDK notes below. By leveraging the ExecuTorch SDK -- we provide native SDK support as well -- we get a tight feedback loop, including profiling and benchmarking. Say, for example, we observe that linear is super slow for some reason on device and notice it is the bottleneck; by leveraging the SDK we can recognize that and further optimize it.

So now we have finished the ahead-of-time preparation stage and generated the model. We leave Python and enter the ExecuTorch runtime, in C++. The ExecuTorch runtime is very lean; the goal is for it to be portable and lightweight, and the target is to run on any platform -- mobile, embedded devices, microcontrollers. It is embedded-friendly C++, which means no dynamic memory allocation on the heap (everything is statically allocated) and minimal dependency on the C++ standard library. There is no assumption of an OS or file system, the runtime size is very small, and we link only the selected kernels that are actually used by the model. Core ATen compliant reference kernels are available, so results stay consistent with the original PyTorch model, and the runtime can link against third-party kernels or delegates as desired.

With ExecuTorch we also provide native SDK support. Taking this one as an example: if there is a runtime failure, say in this instruction, we know that node 73 fails. The 73 here is what we call a debug handle. Using the debug handle, we can trace back to the exported graph, where the top node is also 73, and then trace back to the exact Python line. We also provide a list of other SDK tools aiming to provide a smooth developer experience, mainly focused on our pillars of visualization, profiling, and debugging.
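Putting the ahead-of-time pieces described above together -- export, delegation to a backend such as XNNPACK, memory planning, and serialization -- here is a minimal sketch of what the flow can look like in Python. It assumes ExecuTorch's Python APIs roughly as they appear in the open-source tutorials (to_edge, to_backend, XnnpackPartitioner, to_executorch); exact import paths and names may differ between releases, and quantization is omitted for brevity.

```python
# Minimal sketch of the ahead-of-time lowering flow, assuming ExecuTorch-style
# APIs as shown in the open-source tutorials; import paths may differ by release.
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = TwoConvRelu().eval()                     # the example module sketched earlier
example_inputs = (torch.randn(1, 3, 32, 32),)

# 1. Capture a sound graph in Export IR with torch.export.
exported = torch.export.export(model, example_inputs)

# 2. Convert to the edge dialect built on standardized Core ATen operators.
edge = to_edge(exported)

# 3. Delegate the parts XNNPACK can accelerate; anything unsupported stays in
#    the graph for other kernels, which is what makes the flow composable.
edge = edge.to_backend(XnnpackPartitioner())

# 4. Compile to an ExecuTorch program; memory planning for intermediate
#    tensors happens as part of this step, ahead of time.
et_program = edge.to_executorch()

# 5. Serialize the program that the lean C++ runtime loads and executes on device.
with open("two_conv_relu.pte", "wb") as f:
    f.write(et_program.buffer)
```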
On the SDK side, we aim to provide easy-to-use Python APIs that bring profiling and debugging results together into insights.

So where are we now? We announced the ExecuTorch MVP release in October this year during the PyTorch Conference, and it is already available in open source. For users: feel free to try it out and file issues or bugs if there are any; we will respond as quickly as possible. For contributors: it is not open to everyone yet, except for strategic partners developing delegates for ExecuTorch -- so if you have your own hardware or your own compiler, you are welcome to reach out, or just follow the guideline in the tutorial. We plan to do an alpha release in April 2024, targeting delegate enhancement, a new SDK, and a backward/forward compatibility contract. We also plan a beta release in October 2024 with strong out-of-the-box performance, community involvement, and full ATen compliance.

So how do you get started? Check out our website and the GitHub repos. We provide a list of examples in the tutorials showing how to deploy models across different platforms, and we would love to hear your feedback and use cases. Yeah, thank you, and I'm happy to take questions if there are any.

On the question about the minimum resources required: currently our runtime targets a size of less than 50 KB. But for any model -- say I want to run MobileNet -- we will also need some kernels. When I say 50 KB, that is the runtime size; we then link the kernels needed to run, say, MobileNet, and those kernels are an additional resource. So essentially, taking MobileNet on device as an example, the footprint is the runtime size plus the model size, plus the size of the kernels needed to run the model. Does that answer your question?

Next question: most of these devices run on battery and have to conserve power. If my model is running and has to do real-time data crunching, how do you balance model performance against battery life?

Oh, that's a very good question. Nowadays the model usually runs as part of a pipeline. Usually there is a trigger word, and the model that recognizes the trigger word is very lightweight; the heavy model is only enabled once the trigger model detects the keyword. For example, "Hey Portal" is a common trigger word: a small, lightweight model tries to recognize the "Hey Portal" keyword, and the heavy model only runs after the trigger model has detected it. So that is how we balance performance and battery. Does that answer it?

So you mean the case where the large model is running on the device versus not running at all? I think having the model running will definitely add a lot of compute. The follow-up concern was the battery consumption when the model is continuously running and hard on the CPU. I see -- and have we taken those things into consideration in benchmarking? That's a really good question; I'm actually not sure about that, so maybe I can check it out later. Yeah, I can take a look later.
That is a very good question. Right now we are C++ only. Extending to other languages is something we have discussed, but our focus right now is C++; we would need to look into whether there is a use case or something specific we want to enable.

Oh, we actually can. Contributions to ExecuTorch in general are not open yet, but if you want to develop a delegate as a partner, you are welcome to reach out -- and we actually have a guideline in the tutorial that introduces how to bring, say, your own hardware and your own compiler into the flow.

Oh, that's a very good question. I think what I can do is file an issue on GitHub so we can track it, because we probably need some internal discussion. I think it is possible, but we need that internal discussion first. Yeah, thanks.

Oh, I think AOT Inductor is mostly on the server side, whereas this talk is mostly about edge devices.

You mean, from the PyTorch program, will the export ever fail? That's a very good question. It is a work in progress, so it's hard to give a number right now, but I think many models can be exported directly. For the export API, our goal allows for some user interaction: say torch.export fails -- it will try to tell the user why it is failing, and the user can modify that specific line (because of dynamism or something like that), come back to the torch.export API, and export the model. During the PyTorch Conference, the numbers claimed for torch.export were actually pretty good. The goal is to be able to capture a sound graph from any PyTorch program, so if your model cannot be exported, that is something we want to check -- it would be good to file an issue and we will look into why it is failing.

Is it better than -- sorry? Oh, torch.export compared with exporting an ONNX graph? I think the ONNX graph is for ONNX Runtime. Is your question about the success rate? Oh, that I am actually not sure about -- sorry about that.

Any more questions? Oh, you mean comparing ExecuTorch with the TensorFlow library? I'm actually not quite sure. TensorFlow itself is in C++ with a front end on top -- there could be a Python front end, I remember, in TensorFlow, but the TensorFlow library itself, if I remember correctly, is in C++. Thank you, everyone.