So good morning everyone. My name is Harenome Razanajato. I'm a senior engineer at Huawei France in Paris. I worked on this presentation with Cédric Bastoul, who was chief scientist at Huawei France at the time. It was also presented at the IMPACT workshop in June, so the content may be a bit similar, but I hope it will be interesting to you all. Today I'm going to talk about MindSpore, and more specifically about AKG, which is one of the components of MindSpore. MindSpore is an AI framework from Huawei, and its main challenge is to be an all-scenario framework; I'll come back to that a bit later.

First of all, before I go into details, I'd like to very quickly present MindSpore. It is an AI framework from Huawei whose main feature is to map high-level DNN models to efficient parallel implementations. It does that by hiding the complexity of the architectures from the AI developers, and one of its goals is also to support the fast development of new models. For that there are many components, some of which are completely architecture independent, such as type derivation, automatic differentiation, and second-order optimization. As you go lower and lower, you get to optimizations that have to be architecture-aware, because you have to know the architecture to do automatic parallelism, to optimize memory, or to do operator fusion, which is what I will talk about later.

Operator fusion is important because it enables critical optimizations. We can improve data locality and the use of the processing elements. It allows us to minimize the communication and storage of intermediate results, and when the target architecture is an accelerator, a GPU or a TPU for example, it also allows us to reduce the cost of kernel launches, which can be very high. For that we need what we call automatic kernel generation: when you are targeting specific operators you can sometimes use optimized libraries, but since we want to support the fast development of new models, from time to time we encounter new operators that have not been seen before, so we need to be able to compile and optimize them. In the auto kernel generator we exploit the fact that operators are by nature regular and hence are a good fit for polyhedral compilation. However, polyhedral compilation, for those who are not very familiar with it, mostly targets outer parallelism and data locality. Most of the time that's fine, but as I said we want to target multiple architectures, and that is not always the best strategy for each and every scenario.

So here's the outline of this talk. First I'd like to present a little bit of context, then I will explain a bit more what AKG is, and finally I'll talk about the challenges of polyhedral compilation for AI.

First of all, in AI, the models, the data, and the operators are represented using computational graphs. These graphs are directed graphs whose nodes are the operators, with input and output tensors which may need to be communicated. These operators can either be framework built-ins or custom operators provided by the users. The edges are tensors that can have various shapes and types; they are used to store input data, model parameters, or just temporary results. These graphs are built using the framework's input language, or they can also be automatically generated.
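To make this graph representation a bit more concrete, here is a minimal sketch in plain Python of a computational graph for y = a * b + c. This is purely illustrative (it is not MindSpore's actual IR or API); the intermediate tensor t is exactly the kind of result that the operator fusion discussed next tries to keep out of slow memory.

```python
# Minimal, illustrative computational-graph sketch (not MindSpore's real IR).
# Nodes are operators, edges are the tensors they produce and consume.

class Tensor:
    def __init__(self, name, shape, dtype="float32"):
        self.name, self.shape, self.dtype = name, shape, dtype

class Operator:
    def __init__(self, name, op_type, inputs, output):
        self.name, self.op_type = name, op_type
        self.inputs, self.output = inputs, output   # the edges of the graph

# y = a * b + c expressed as a two-node graph
a = Tensor("a", (1024,))
b = Tensor("b", (1024,))
c = Tensor("c", (1024,))
t = Tensor("t", (1024,))   # intermediate result: the tensor fusion can eliminate
y = Tensor("y", (1024,))

graph = [
    Operator("mul0", "Mul", [a, b], t),   # framework built-in
    Operator("add0", "Add", [t, c], y),   # consumes the intermediate tensor t
]
```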
Then these graphs need to be compiled and executed. Usually they are split into subgraphs that select different operators and prepare them for scheduling, and then at runtime we execute those subgraphs and launch the operators whenever their input is ready, meaning when the previous nodes have produced their results.

One of the key optimizations you can do here is operator fusion. The idea is that, given a graph with several operators, we fuse those operators in order to get something a bit more optimized. The benefits are that we get better data access and storage optimization, because we can promote intermediate results to fast memory instead of having to communicate those results between operators. We get better hardware usage and occupancy. Fusion enables more loop-level optimizations and reduces the cost of kernel launches, since we end up with a smaller number of kernels; sometimes we may also fuse independent operators to better use the available resources. It also allows us to do inter-operator analysis, which can lead for example to common subexpression elimination, dead code elimination, algebraic simplifications, and so on. On the right we have a small example with two operators, a multiplication and an addition. Without operator fusion we would have to execute the first operator and communicate the output tensor to the second operator, whereas with operator fusion we end up with only a single operator: only one kernel launch, only one loop nest.

In MindSpore, this graph kernel fusion happens in three phases: the partition phase, then the fusion phase, and then code generation for the optimized kernels. I will go a bit faster on that. Here is an example: we have a subgraph that we want to optimize using operator fusion; we can expand some of the operators, do some analysis to find out how we could fuse them, and then apply the fusion.

Here is a quick overview of the impact of this graph kernel fusion on some key networks for MindSpore. On the right we have results for NLP, recommendation, and computer vision networks. At the lowest we get an average speedup of 30%, and on recommendation networks we even get 136% speedups. So this seems very nice: we do manage to get some great speedups using operator fusion. Regarding the polyhedral aspects, it is less challenging than usual polyhedral compilation because we do not have to raise the code; the graph is easier to translate into a polyhedral representation. But unfortunately, the impact of this fusion may highly depend on the target architecture.

This is what I was talking about beforehand, the multi-scenario challenge. In the case of MindSpore, we target three architectures: CPU, GPU and NPU. Doing that is a challenge because we do not want big software fragmentation, but we need to apply different optimization strategies as transparently as possible. For example, here we have a very basic example, a matrix multiplication. On CPU, the naive way to optimize it would be to exploit vectors and cache locality, so we would interchange the two innermost loops. For GPU this is less important because there are more threads, so we want to maximize thread usage instead. For NPU, we can just use an intrinsic for that. A small sketch of these different loop orders follows below. So here the challenge is: we have a subgraph to optimize.
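A minimal, purely illustrative sketch of those loop orders in plain Python (the real kernels would be generated as LLVM IR, CUDA, or CCE code, so these loops only stand in for iteration orders; the GPU mapping noted in the comments is an assumption about the strategy, not AKG's actual output):

```python
# Naive matrix multiplication: i, j, k order.
# The innermost loop walks B column-wise (B[k][j] with k varying): poor spatial locality.
def matmul_naive(A, B, C, N):
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]

# CPU-oriented schedule: interchange the two innermost loops (i, k, j order).
# The innermost j loop now streams through B[k] and C[i] contiguously,
# which is friendly to caches and SIMD vectorization.
def matmul_cpu(A, B, C, N):
    for i in range(N):
        for k in range(N):
            for j in range(N):
                C[i][j] += A[i][k] * B[k][j]

# GPU-oriented view: the (i, j) iterations are independent, so they would be
# mapped to blocks and threads to maximize thread usage; only the k reduction
# stays sequential inside each thread. (On an NPU, the whole product could
# instead be handed to a matrix-multiply intrinsic.)
def matmul_gpu_like(A, B, C, N):
    for i in range(N):          # -> block / thread index in the generated kernel
        for j in range(N):      # -> thread index in the generated kernel
            acc = 0
            for k in range(N):  # sequential reduction per thread
                acc += A[i][k] * B[k][j]
            C[i][j] += acc
```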
We have a framework to do that, and we want to target multiple architectures. How do we do that? This is where AKG, the auto kernel generator, comes in. I will give a brief overview of its architecture, then talk a bit about polyhedral scheduling inside AKG and some other optimizations that are needed inside it.

AKG is one of the components of MindSpore. It supports three inputs: MindIR, which comes directly from MindSpore, custom operators, which can be defined by users, and the TVM hybrid DSL for fast design. There are three main passes. The first one is operator normalization. This is necessary to prepare the polyhedral processing: we have to make sure that the input complies with linear constraints, because the polyhedral scheduling needs linear constraints. It may also apply transformations to reduce complexity, for example inlining or fusion. Then there is the core of the kernel generator, the polyhedral scheduling, which can do loop optimization, thread parallelization, tiling, things like that. And then there is the backend optimization, which generates the code for the target architecture: for CPU we generate LLVM IR, for GPU we generate CUDA code, and for NPU we generate CCE code.

In AKG we have a lot of passes around the polyhedral phase. As I said, the first phase is operator normalization, then the polyhedral part, and then the backend optimization. I will not go into detail on all of the passes; the point is just to show that there are many passes before and after the polyhedral scheduling, and many of them are specific to whatever architecture we are targeting. It is the same inside the polyhedral scheduling: we have several common passes, the initial schedule, the actual polyhedral scheduling, a few analyses, the tiling, and then the final code generation. And then for each target architecture we also have many, many passes, because of whatever is needed for each architecture.

The challenge here is that polyhedral scheduling is responsible for critical actions and decisions. It can extract parallelism and expose the parallel loops. It can extract what we call tileable loops, which is important for memory efficiency. And it can also leverage loop-level optimizations such as data locality, access patterns, et cetera. However, polyhedral scheduling up until now was not really designed for what we call the all-scenario context. The current state-of-the-art scheduling algorithms from Pluto and isl are supposed to be domain and target independent: they extract outer parallelism and data locality. For example, when we are targeting NPUs, we end up having to use rescheduling passes to enable specific optimizations. The scheduling algorithm lacks the ability to inject new constraints for specific targets, and that is one of the things we have added in AKG: the ability to inject new constraints in order to guide the scheduling algorithm towards a better solution. In our case, these new constraints that we want to inject are decided using nonlinear approaches. One application was optimizing load/store vectorization on GPU. We called this mechanism MindTricks, as a reference to MindSpore. Maybe I will not go too much into detail on that part, but we enabled it in AKG and we managed to get sensible speedups on some networks.
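To give an intuition of what injecting an extra scheduling constraint can buy on GPU, here is a toy before/after illustration in plain Python. This is only a conceptual sketch: the two loop nests stand in for schedules, and the thread/coalescing mapping described in the comments is an assumption about the GPU strategy, not the actual MindTricks mechanism.

```python
import numpy as np

N = 256
A = np.arange(N * N, dtype=np.float32).reshape(N, N)
out = np.zeros((N, N), dtype=np.float32)

# A legal schedule where the innermost dimension walks A column-wise:
# consecutive innermost iterations (and hence consecutive GPU threads or
# vector lanes) would touch addresses N elements apart -> no coalescing,
# no easy load/store vectorization.
def strided_schedule(A, out):
    for j in range(N):
        for i in range(N):          # innermost: A[i][j] with stride N
            out[i, j] = 2.0 * A[i, j]

# The schedule we would rather obtain after injecting a constraint that keeps
# the contiguous array dimension innermost: consecutive iterations read
# consecutive addresses, enabling coalesced, vectorizable loads and stores.
def contiguous_schedule(A, out):
    for i in range(N):
        for j in range(N):          # innermost: A[i][j] with unit stride
            out[i, j] = 2.0 * A[i, j]
```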
Compared to the standard scheduling, which would be the Pluto one, with these new constraints we managed to get speedups on some operators and some networks, because we managed to better leverage vectorization, data locality, and so on.

Another key optimization that is possible thanks to polyhedral scheduling is tiling. This is another part that is very important for data locality and memory optimization, especially on accelerators where we sometimes have limited memory. Determining the best tile sizes is critical, because otherwise the kernels do not fit in the memory of the accelerators. Again, I am not sure I can go into details here, but the main point is that AKG can find tile sizes both through tuning and through automatic systems, and this symbolic tiling works as an automatic tile-size optimizer, so we do not rely on tuning for that. A small toy sketch of that working-set reasoning is given a bit further below.

Then I'd like to talk about the challenges of polyhedral scheduling for machine learning. This is a call we made a few months ago; it is not public anymore. The thing is, polyhedral scheduling relies on linear programming, and it can be very, very slow depending on the case, because sometimes we end up with kernels that have hundreds of statements and hundreds of tensors. Polyhedral scheduling for this kind of operator can actually be very slow, so there is a challenge here in terms of compilation time. Another challenge is the ability to support sparse computations, because polyhedral scheduling mostly applies to regular codes: when dealing with sparse computations, you do not want to iterate over all the elements. Maybe it could be possible to do that using polyhedral scheduling, but it is not very well understood for now. Another challenge is approximation. We know that AI is full of approximations: we tend to use lower and lower precision floating-point arithmetic, and there is quantization, sparsification, things like that. Usually in compilers we want to preserve semantics, but in this case maybe it is not necessary, so we may be able to approximate things. There has been work on approximate computing, and there has been more recent work on applying transformations that are actually not equivalent and then deriving corrections for these transformations.

The key point here is that polyhedral compilation has generated a lot of interest in the AI community: we have Tensor Comprehensions, PlaidML, and MLIR has some polyhedral technology as well. But the problem is that the polyhedral community is actually a very small community, so we need new people coming in, and we actually need help to do all of that.

So I'll just conclude on that. We have polyhedral frameworks for AI, but the challenge is not the same as for usual compilers, because here we need to optimize operator fusion. As opposed to usual codes, which we would have to analyze to extract the parts that can be optimized using polyhedral compilation, here we know it is possible; the challenge is to select the appropriate fusion strategy: which operators should be fused, and which should not. There are other, bigger challenges, because polyhedral scheduling is at the moment not very scalable. Extracting parallelism is not the problem here; the problem is to actually be able to exploit it on each target architecture. We have made improvements to the scheduling for that: we try to inject new constraints that we derive from nonlinear optimization.
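As a toy illustration of the tile-size question mentioned above (not AKG's actual symbolic tiling algorithm, just a sketch of the kind of reasoning involved), here is a small Python helper that picks a square tile size for a tiled matrix multiplication so that the tiles' working set fits in a given fast-memory budget.

```python
# Toy tile-size model: for a tiled matmul C[i][j] += A[i][k] * B[k][j]
# with square T x T tiles, one tile each of A, B and C must live in fast
# memory at the same time, i.e. roughly 3 * T * T * elem_size bytes.
def pick_tile_size(fast_mem_bytes, elem_size=4, max_tile=1024):
    best = 1
    for T in range(1, max_tile + 1):
        working_set = 3 * T * T * elem_size
        if working_set <= fast_mem_bytes:
            best = T          # largest tile that still fits
        else:
            break
    return best

# Example: a 64 KiB scratchpad / shared-memory budget with fp32 elements.
# Prints 73, since 3 * 73 * 73 * 4 = 63948 bytes <= 65536.
print(pick_tile_size(64 * 1024))
```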
We have new ways to find tile sizes, but there is still a lot of work to do, and that is where we would like everyone here to maybe help us. We already have something that is working and efficient, but we are facing challenges. The code is open source, so you are kindly invited to have a look at it, to test it, and maybe even contribute. That concludes my talk. Thank you for your attention. Any questions?

I have a quick question. My question is, how well does quantization work for the models, for training and things like that? How effective is this compared to the non-quantized version? What is the actual delta in efficiency when we have quantization?

Sorry, I couldn't hear very well, so could you repeat the question?

Yeah, sorry. I was asking what the actual impact of quantization is for your models. In terms of efficiency, how much has it improved when you quantized for training and testing, things like that?

So, the impact of the polyhedral scheduling on the networks... First of all, in terms of operator fusion, we had results on several networks: depending on the type of network, we have average speedups from 30% to 136%. Then, if you want to know about our specific strategies for polyhedral scheduling, here are some results regarding what I mentioned before about the nonlinear injection of constraints. Again, it depends on the networks, but on some networks we managed to get up to a 7x speedup, thanks to the fact that our new constraints guided the scheduling towards better vectorization and coalescing on GPU.

Okay, so for the quantization alone, how much is the actual impact? Did you micro-benchmark some of the optimization strategies you tried?

So, these measurements were done on the execution times of the fused operators. Within MindSpore we have the ability to profile most things; in that case, we specifically measured the time spent in the fused operators.

Any other question? So I guess I'll end it here. Thank you for your attention.