Good morning. So, she's going to talk about GlobalISel.

Okay. Thank you. So, some of you may have already heard about GlobalISel. It's the new instruction selection framework that's being developed in LLVM, and it's been spearheaded by Apple for the 64-bit ARMv8 target, also known as AArch64. Recently, other people have started contributing. For instance, there's Kristof from ARM, there's myself from Linaro, and some people from Intel and AMD are also sending in patches. We're going to talk a bit about status at the end of the talk. Right now, let's see what GlobalISel is and how it works.

First of all, so we're all on the same page: instruction selection is a phase in the compilation process where the code is translated from the target-independent intermediate representation used by the middle end into the machine-specific representation used by the back end. In our case, these are the LLVM IR, which we all know and love, hopefully, and machine instructions, or machine IR, or MIR, which may be a bit new to some of you if you've never worked in the back end. So we're going to talk a bit about it.

This is a simple example with a function that just takes two values and returns their sum. Like I said, this is very close to the machine. For instance, you have this ADDWrr. This is an add between two registers; that's the "rr". There's also a version for immediates, that's "ri", and so on. The W means that it works on the W registers, which are the 32-bit registers in ARMv8. So that's pretty close to the machine, but it's far from being just an in-memory representation of assembly. It's still abstract in many ways. For instance, it's in SSA form, so it can contain phi instructions. It does contain phi instructions up until register allocation. It also has virtual registers. If you'll notice here, we have these %w0, %w1. These are physical registers, but you will also notice these %0, %1, %2, and so on, which are the virtual registers.
There are infinitely many of these virtual registers, and they're the ones that the backend mostly works with. The physical registers are usually used for very specific things, like if you know from the ABI that something is going to have to be in that register. Otherwise, you're going to want to use a virtual register, so that's why we insert these copies from the physical registers into the virtual registers that we know how to work with. And these registers, unlike the ones in LLVM IR, don't have a type, but they have a register class, which in this case is GPR32, so a 32-bit general purpose register.

Another thing that exists in the machine IR is pseudo instructions, which are instructions that don't correspond directly to a hardware instruction. They're used for various purposes in the backend. For instance, if you want to have a certain sequence of instructions that really have to stay in that order, and you don't want the scheduler to put anything between them, like accesses to the thread-local storage descriptor or something, you're going to use a pseudo instruction for that and then expand it later on in the backend when you know it's safe to do so. So that's enough about the machine IR.

At the moment in LLVM, instruction selection uses yet another intermediate representation, known as the SelectionDAG. It's a graph of nodes. We're not going to talk about it. All you need to know is that it's pretty complex and it's different from both LLVM IR and the machine IR. And instruction selection on it goes in a number of steps. First of all, we build this new representation. Then we run a series of combines, which means we replace certain sequences of nodes with different sequences, and we do this to make it easier for the subsequent steps to deal with the code. Then we run type legalization, which means we get rid of any types that aren't natively supported on the target.
So for instance, if you have a 64-bit type, but your target only has 32-bit registers, you're going to want to break it up into two 32-bit values and so on. Then we run more combines, then we legalize vectors, then we legalize types again, because maybe legalizing vectors introduced more illegal types. Then we run combines again, then we finally legalize the operations. So for instance, if there's a division operation, but you don't have hardware division, you might want to replace it with a library call, stuff like that. Then again, more combines. Then we finally select the instructions. At this point we know everything in the code can actually be handled by the target, but we still have to replace it with actual instructions. And we do this with a complicated pattern-matching algorithm that is not the purpose of this talk. Then at the end, we schedule the instructions, which at this point is really just a way to linearize this graph. It's not the scheduling in the backend; we have other passes further along that deal with proper scheduling. And then way at the end, we finally emit the MIR.

Okay, so as you can imagine, this has quite a number of drawbacks. First of all, it's difficult to learn, because it's a whole new representation and a whole lot of steps. There are a lot of subtleties for each target, like what each series of combines can do or not. There are all sorts of issues with that. It's very difficult to maintain, test, and debug, because you can't run just one of those steps at a time. You have to run the whole instruction selection pipeline. And the most you can get is some debug dumps after each step, which is good, but it's not great. So we want to do better there. It's not very flexible, because every target has to go through the exact same steps, and it only has a limited set of hooks to customize the behavior.
And because of this, we have all sorts of pressure on those combines to try to fit the code into something that the rest of the framework understands and can do a good job with. And we also have fix-up passes that run after instruction selection on some targets to patch up things that can't be selected properly with this framework.

Another problem is that it has a lot of inherent overhead, because you have to build a whole new representation and you have to run all those steps. There's actually so much overhead that we have yet another instruction selection framework, which is not GlobalISel: it's FastISel, which was introduced several years ago. It only runs at -O0, and it's basically a trade-off. You're going to spend a lot less time compiling, but you're going to generate really naive code. It doesn't use any intermediate representation. It basically just generates the first thing that comes to its mind. And when it doesn't know how to handle something, it falls back to SelectionDAG. But other than that, they share very, very little code.

So for these reasons and others, people have started working on GlobalISel, which is meant to address all these issues. We want it to be easier to develop, easier to test, easier to maintain. We want to have the same path for both fast instruction selection and high-quality instruction selection. We want more flexibility for the targets. And since you're probably wondering about the name: the previous instruction selection frameworks worked at the basic block level. During instruction selection you could only see the current basic block. With GlobalISel you have access to the whole function, so you can see where the operands come from and where you're using the result, even if it's in a different basic block. So we're hoping to be able to make use of that in the future.
So the way we're going to achieve all these wonderful goals is by not creating yet another intermediate representation. Instead we're just going to use the machine IR, but with a few new concepts like register banks and generic instructions that I'm going to talk about soon. The core point here is that the meat of the representation is the same, which allows us to structure the whole instruction selection process as a series of machine passes. Which is great, because we have a lot of infrastructure in place for dealing with passes. We can run a single pass at a time, which makes it easy to test. We can dump the IR before and after a machine pass. We can get debug dumps for a single pass at a time if we want to. It's also very flexible, because each target can now introduce any number of custom passes at any point in the instruction selection pipeline. It can also replace one of the standard passes with something custom if it needs a different approach. And hopefully it will be faster. We can't know that yet.

So now that we're convinced that machine passes are awesome, let's see the standard pipeline. If you recall the SelectionDAG stages, we're kind of doing some of the same things. For instance, the IR translator basically just builds our representation, so we're getting our custom machine IR at this point. We're going to talk about each of these in detail later on. Then we run legalization as we did before, but this time it's just one step. Then we have register bank selection, which is a new concept. It was introduced specifically for GlobalISel, and we're going to see why. And finally we do instruction selection, which as before means selecting target-specific operation codes. So let's take them in order.
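To make the "series of machine passes" idea concrete, here is a minimal Python sketch. It is a toy model, not LLVM code: the four pass names follow the stages listed in the talk, but `make_pass`, `run_pipeline`, `insert_after`, the `stop_after` parameter, and the `MyTargetFixup` pass are all made up for illustration. It only shows the two properties the talk relies on: you can stop after any single pass to inspect the result, and a target can slot its own pass into the pipeline.

```python
# Toy model of a pass pipeline. This is NOT LLVM's pass manager; every
# helper below is hypothetical and exists only to illustrate the idea.
from typing import Callable, List, Optional

Pass = Callable[[List[str]], List[str]]


def make_pass(name: str) -> Pass:
    """Build a stand-in pass that only records that it ran."""
    def run(trace: List[str]) -> List[str]:
        return trace + [name]
    run.__name__ = name
    return run


# The four standard stages, in the order given in the talk.
STANDARD_PIPELINE: List[Pass] = [
    make_pass("IRTranslator"),
    make_pass("Legalizer"),
    make_pass("RegBankSelect"),
    make_pass("InstructionSelect"),
]


def run_pipeline(passes: List[Pass], stop_after: Optional[str] = None) -> List[str]:
    """Run passes in order, optionally stopping after the named pass."""
    trace: List[str] = []
    for p in passes:
        trace = p(trace)
        if p.__name__ == stop_after:
            break
    return trace


def insert_after(passes: List[Pass], anchor: str, new_pass: Pass) -> List[Pass]:
    """Return a copy of the pipeline with new_pass placed right after anchor."""
    result: List[Pass] = []
    for p in passes:
        result.append(p)
        if p.__name__ == anchor:
            result.append(new_pass)
    return result


# A target adds its own (made-up) pass between two standard stages.
custom_pipeline = insert_after(
    STANDARD_PIPELINE, "Legalizer", make_pass("MyTargetFixup")
)

print(run_pipeline(STANDARD_PIPELINE, stop_after="Legalizer"))
print(run_pipeline(custom_pipeline))
```

Stopping after a named pass is the toy equivalent of testing one stage in isolation, which is the property the monolithic SelectionDAG pipeline lacks.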
The IR translator is going to take LLVM IR as input, and it's going to output generic machine IR, which means that instead of those target-specific opcodes like ADDWrr we're going to use generic opcodes like generic add (G_ADD), generic branch, generic store and so on. To get a feel for this, this is the same code as on the previous slide. This is the final machine IR that we're trying to obtain, and on the left side you can see the corresponding generic machine IR. This is as far as the IR translator gets us. What you should notice here is that, as I said, we have the generic add, and at this point we don't care that it's adding registers, and we don't care if it's legal to add things on this target. We just know that the intention of the code is to add two values. And these values don't have register classes yet, as you can see there's a dash here, but instead they have types. These are different from the types in the LLVM IR. They're closer to the machine, and basically they're scalar values of any number of bits, pointer values into any address space, and vectors with any number of elements of any size.

Some of you may notice that at this point we already have the physical registers here and some of the final operation codes. This is because of one very important thing that the IR translator does, which is ABI lowering. At this point we already know that the target says the parameters are going to be in W0 and W1, so we're just going to put them there right from the start. You should note here that although it's possible to have final machine IR at this point, it's not compulsory. The target can choose to use generic opcodes for ABI lowering as long as it preserves the intention well enough.
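As a rough illustration of what the IR translator produces for the add example, here is a small Python sketch. Everything in it is an assumption made for illustration: `ToyAddFunction` and `translate` are hypothetical, the argument and return registers are passed in by hand rather than taken from a real calling convention, `RET` is a placeholder for whatever return opcode the target uses, and the printed text only loosely imitates MIR syntax. The point is the shape of the output: copies from physical registers come from ABI lowering, while the operation itself stays generic and carries only a bit-width type.

```python
# Toy IR translator. The classes and names here are hypothetical; the output
# only loosely imitates MIR syntax and is not guaranteed to match LLVM's.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyAddFunction:
    """Stand-in for an IR function that returns the sum of its two arguments."""
    bits: int  # width of both arguments and of the result


def translate(
    fn: ToyAddFunction,
    arg_regs: Tuple[str, str] = ("w0", "w1"),  # assumed argument registers
    ret_reg: str = "w0",                       # assumed return register
) -> List[str]:
    ty = f"s{fn.bits}"  # low-level scalar type: only the bit width is kept
    lines = []
    # ABI lowering: copy each incoming physical register into a virtual one.
    for vreg, preg in enumerate(arg_regs):
        lines.append(f"%{vreg}({ty}) = COPY %{preg}")
    # The operation itself stays generic: "add two values", nothing more.
    lines.append(f"%2({ty}) = G_ADD %0, %1")
    # ABI lowering again: the result goes back into a physical register.
    lines.append(f"%{ret_reg} = COPY %2")
    lines.append(f"RET implicit %{ret_reg}")  # placeholder return opcode
    return lines


for line in translate(ToyAddFunction(bits=32)):
    print(line)
```

Note that nothing in the sketch asks whether a 32-bit add is legal on the target or which register bank the values live in; those questions are left to the later passes.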
Right, so now we have our representation, and we're going to legalize it. As I hinted earlier, the legalization in GlobalISel is a lot simpler than the one in SelectionDAG because of one key decision: it's not types that are legal or illegal, it's the combination of operation and type that is legal or not. And it's interesting, we actually had discussions about this, which some of you may have seen on the mailing list, with x86 and the AVX instruction set, where suddenly we have legal i1 vectors. But if we legalize them only for the AVX instructions, then we're kind of breaking things in other places where we don't want them to be legal. So now we're not going to have that problem anymore, because we can say, okay, it's legal only for this operation, not for everything. Hopefully this is going to solve a lot of similar problems.

At this point the target is going to have to say, for each combination of operation and type, what it wants to do. It can mark it as legal, in which case the legalizer does nothing. Or it can choose one of the predefined actions: widening or narrowing a scalar, which means introducing extensions to larger types or breaking it up into smaller types. For vectors, you can ignore some of the lanes, or you can break the vector up into smaller vectors, and so on. You can replace the operation with a library call, or you can have your own target-specific custom C++ code that does whatever it wants with that operation, whatever floats its boat. Of course, you can also mark it as unsupported, in which case instruction selection will just fail. So that's it about the legalizer.

The next phase is register bank selection, which as I said before is new to GlobalISel. The concept of register banks is also new to GlobalISel, and it roughly corresponds to the hardware concept of register banks or register files. So for instance on ARMv8 we have two register banks: there's the general purpose register bank, and there's another one for floating point and vector values. So these can have
different sizes, different numbers of registers. Naturally, copying data between them may be more expensive than within the same bank, and certain instructions have different variants depending on which register bank their operands live in. So for instance, you can load a value into a GPR or you can load it into an FPR, and they're entirely different operations. Or you can do a bitwise or on GPRs or on FPRs. And it's very important to get it right from the start. This is another pain point with SelectionDAG, where many times we selected the wrong instruction because we didn't know where the operands would live, and then things could be a lot slower, or in some cases they could even be incorrect, because at that point you have to introduce copies between register banks, and there are targets where it's legal to copy in one direction but not in the other, and instruction selection had no idea about that. So now it's going to be a first-class citizen of instruction selection, and hopefully we can handle all those problems up front.

Oh, and another thing here is that we can decide to spend as much effort on this as we want. I mean, if we're running at -O0, we don't care that much about the quality. Sure, we're going to introduce a lot of copies, and as long as they're legal, why not? At this point we actually have two algorithms for this, fast and greedy. We can add any number of other algorithms in the future. It's easy now, because we can just replace things.

And finally there's instruction selection. At this point we know that everything is legal, so it corresponds to something that the hardware can actually handle. We know roughly where it lives, in which register bank. We don't know the exact register yet, but we don't care. So at this point instruction selection just replaces the generic opcodes with the machine-specific opcodes. On the same example, we'll just have to replace the G_ADD with the ADDWrr, because here we have two registers that live in the
GPRs, which is really nice. And again, we have to constrain the virtual registers, so that instead of having a type and a register bank they have a register class that the rest of the backend can understand.

So you're probably wondering at this point what's the difference between a register bank and a register class, and why we have two. The reason is that a register class is much more specific than a register bank. For instance, a register class can be the general purpose registers including the stack pointer, depending on whether the instruction can handle it or can't handle it. Or for instance on Thumb, some instructions can only access eight of the general purpose registers, so we're going to have a register class for that. We know they're in the general purpose register bank, but we want to be more specific than that. And this will be important, obviously, for the register allocator and other passes.

So after we've done all of this, we're ready to pass it on to the backend to do its magic. Like I said, this is where we're going to do register allocation, scheduling, target-specific optimizations, whatever.

Okay, so now let's talk a bit about the current status of all this. Like I said, this has been in the works for over a year, but we're still considering it a prototype. It is built by default, but it's not enabled by default. You have to pass a certain flag, like -global-isel, or if you're working from Clang, you have to tell it to pass it to the backend, so we use -mllvm -global-isel. We also have another nice flag, which is -global-isel-abort=0, which means: if instruction selection fails, don't abort, and instead fall back to the previous instruction selection framework. This is to make things more robust and to allow us to test things. We're probably going to be using this for a while, even after we enable it by default.

There's a lot of work in progress in GlobalISel at the moment, on several fronts. One of them is improving the framework itself, and one of the directions here is to generate
more code automatically. Some of you are probably familiar with TableGen. If you're not, it's a tool that we're using in LLVM to generate code from some very simple descriptions of registers or instructions or whatever, and these are used all over the backend. This is an actual example of what the definitions for the register banks look like for AArch64. So for instance, you have the GPR register bank. It has a name, and it has the register classes that are associated with it. This is one register class, but the nice thing is that register classes have subclasses, so if you're covering one register class you're covering all of its subclasses. So what we've actually said with this line is that we're covering about 11 or 12 register classes. It looks really simple, and behind it we're going to generate a lot of code which tells us, you know, if a register in this class can live on this bank, or the other way around. So we can generate a lot of stuff from this small snippet of information. And for the FPR, again, this is a lot of classes here, it's like 60 or something.

Another front where we're working hard is target adoption. Since this was developed initially for AArch64, that's naturally the one where it's progressed the most. It now passes over 60% of the test suite without falling back to SelectionDAG. There are actually plans to replace FastISel for -O0 this year. At the moment it's naturally a lot faster than SelectionDAG because it's not doing as much. For instance, all those combines, we're not doing that. We're not selecting anything too complicated. At the moment the instruction selection in GlobalISel is not very intelligent. Like I said, we're trying to replace FastISel, which just generates the simplest thing possible. So the hope is that in the end it will get within 1.1x of FastISel. There's work going on here, especially at Apple, and we're trying not to step on their toes too much. We're letting them do their work. Most other
people are working on the other ports. Like I said, I'm working on the ARM port, which is the non-64-bit ARMv8, so that means 32-bit ARMv8 and all the older ARM, Thumb and so on. AMD are also contributing patches for their GPUs, and Intel for x86. Nobody's in any rush with this because, as I said, we're working a lot on improving the framework itself. So for instance, for ARM I'm working a lot on ABI lowering, because that's target-specific and it's very likely to stay that way. I can write that code and it's probably going to stay. On the other hand, if I write things in the instruction selector, that's very likely to go away. It's probably going to be replaced with something generated automatically by TableGen. So I don't want to invest a lot of time in supporting a lot of instructions and stuff. I'd rather support as many calling conventions as I can instead, because that's a better way of spending time at this point.

So to summarize: this is happening. As developers, we're very excited about it. I think most of the people working in the back end are going to be happy with many of the design choices in GlobalISel. If you're just an LLVM user, unfortunately you're probably not going to see the effects of this anytime soon. I mean, even if it gets enabled by default this year for -O0, that's just going to be on AArch64, and it's probably not going to be great from the start. So there's still a lot of work before this can reach the users. But as with any change that makes the developers happy, eventually the users will feel the results.

For reference, we have docs which explain how to port to a new target and everything. There's also a very in-depth presentation about the APIs and everything, which was given by Apple at the US LLVM Developers' Meeting. You can watch it if you have a target that you want to port this to. And that's about it. Any questions?
Go ahead. So the question was if we're going to reuse the current TableGen descriptions, and the answer is yes. Apple is working hard on this. This is actually the first point here. We're trying to use those to generate code that fits GlobalISel, so we're trying to keep the same descriptions but do things in the GlobalISel framework instead.

What's the current plan for the combines? Oh, that's a lot of code. The plan is that it's not part of this prototype, but in the future we're hoping to reuse them. So that's also going to be part of this whole TableGen effort.

Yes? So that's the slide with the IR translation? Here? One more? Here. The question is from the one down the back, yeah. So you mentioned that you translate this to scalar and pointer values. Is there an idea to distinguish native pointers versus pointers to managed memory, for languages which... do you see a way? Oh, I honestly don't know. If the address space is not enough, then it's probably not supported right now. The only thing you can specify for a pointer is the address space and the number of bits that it occupies, so if you can model it with that, yeah.

Okay, please. Can you estimate the work to port an architecture at the moment? I would say there's a lot of work, because we're not generating enough stuff automatically. But hopefully, you know, if we can replace the instruction selector with one generated automatically, and the legalizer with something generated automatically, then there shouldn't be much effort. All you have to do is the ABI lowering, as I said, which is basically just one class that you have to inherit, and you have to tell it how to lower returns, arguments and calls, so that's not an awful lot of stuff. And you have to write the descriptions for the register banks, which again shouldn't be too much effort. And then of course there's a lot of work with tuning it and getting it to do things.

Yes? So what is the current status for AArch64, are we able to build the compiler
or do we have things to add? There's still stuff to be added. So 63% of the tests was the last number that was published. And it doesn't include self-hosting? No, it doesn't include self-hosting.

Yeah, go ahead. So the question is if it's built by default, right? Yes, it is. This has happened quite recently, I think in the past couple of weeks. So now we're building it by default, and all the bots are building it and testing it.

Yeah, anything else? No? Okay, well, thank you very much.
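As a footnote to the legalizer part of the talk: the key decision, that legality belongs to the combination of operation and type rather than to the type alone, can be pictured as a lookup table. The Python sketch below is only an illustration under stated assumptions. The `Action` names paraphrase the actions listed in the talk, and the `RULES` table describes an imaginary 32-bit target without hardware division; none of it is taken from a real LLVM backend.

```python
# Toy legality table. The rules describe an imaginary 32-bit target and are
# assumptions for illustration, not the rules of any real LLVM backend.
from enum import Enum
from typing import Dict, Tuple


class Action(Enum):
    LEGAL = "legal"                  # the legalizer leaves it alone
    NARROW_SCALAR = "narrow scalar"  # break into smaller pieces
    WIDEN_SCALAR = "widen scalar"    # extend to a larger type
    LIBCALL = "library call"
    CUSTOM = "custom"
    UNSUPPORTED = "unsupported"      # instruction selection fails


# Keyed by (operation, type), not by type alone.
RULES: Dict[Tuple[str, str], Action] = {
    ("G_ADD", "s32"): Action.LEGAL,
    ("G_ADD", "s64"): Action.NARROW_SCALAR,
    ("G_ADD", "s8"): Action.WIDEN_SCALAR,
    ("G_SDIV", "s32"): Action.LIBCALL,
    ("G_STORE", "s64"): Action.LEGAL,
}


def get_action(opcode: str, ty: str) -> Action:
    """Look up what to do; anything without a rule is unsupported."""
    return RULES.get((opcode, ty), Action.UNSUPPORTED)


# The same type gets different answers depending on the operation.
print(get_action("G_ADD", "s64"))
print(get_action("G_STORE", "s64"))
```

In this toy table a 64-bit scalar is fine for a store but has to be narrowed for an add, which is exactly the kind of distinction a type-only notion of legality cannot express.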