Welcome to this last talk of the LLVM toolchain devroom. My name is Dávid Juhász. I work with software at Imsys, and in this talk I will tell you about how we use LLVM at this company. But first, let's warm up with some very basic LLVM stuff. How do you use LLVM? If you have your software application in some high-level code and a target architecture where you want to run it, then you need to compile. You take LLVM, and what happens inside LLVM is this: first, a frontend turns your application from your favorite language into LLVM assembly, or IR, the intermediate representation. Then you can use the LLVM middle end to optimize this intermediate representation code. And finally, a backend targets the device architecture you want to run your code on; this backend turns the LLVM assembly into assembly code, or binary executable code, for your target architecture. What this talk is about is how we want to improve efficiency, and I will tell you efficiency in what sense. If you consider LLVM assembly and a typical target architecture, there is a big gap: the backend typically needs to do complex translations. There is a big semantic gap, because instruction set architectures are typically not designed with the compiler's intermediate representation in mind. What we want to do at Imsys is lift the instruction set architecture closer to the LLVM assembly level, reducing the gap, and then use an LLVM backend to target this architecture. That is the main topic of this talk. But first I would like to say a few words about the company itself and what we are doing, then about the Imsys processing technology, our core technology, which we utilize to be able to reduce this gap. And last but not least, I will also say a few words about our tailor-made instruction set architecture for LLVM. So Imsys AB is a Swedish semiconductor SME, located in the north Stockholm area. 
We are working with our own proprietary processor core. We sell devices, modules, and the processor IC, and in the future we plan to sell IP as well. The company has a history as a supplier of networked embedded controllers with some special features; these pictures show some of the actual devices where our processor was used. But now we want to retarget our processor for the Internet of Things, as we see a good match there, and using LLVM is part of this retargeting. As a device, we want to provide a single-controller solution for IoT applications. That's the Imsys AMLA. It's a small, handful-sized device with some IO capabilities. It can be connected directly to an LCD display and a touch panel, and the different IO capabilities can be used with an extension board, for which we have a reference design. Software-wise, which might be more interesting in this session, I would first like to give a very high-level overview. You have our development device and you develop your application code, and we support you with an Eclipse-based integrated development environment. You can develop your application in C and C++, in which case we use LLVM to generate executable code, and we also support Java execution. All right, that's about the company. But don't forget, this talk is about how to reduce the gap and have a tailor-made instruction set architecture for LLVM. So first I will tell you about the technology which we use to reduce the semantic gap. Let's revisit the software layers, the abstraction layers. We have the Imsys processor core, which supports an instruction set architecture. Then you have your application code again, which can be Java or C/C++. And our instruction set support provides basically two instruction sets: one we call ISAL, which is for LLVM, and ISAJ, which is an instruction set architecture for Java. I won't talk more about Java; let's focus on the LLVM stuff. 
Here, if you think about the main levels of abstraction, you usually say: we have the hardware, the hardware provides us the instruction set architecture, and then we have our software, which runs on this instruction set architecture. But actually there is a, let's say, forgotten layer of abstraction. Historically it was there, but nowadays it's very hidden and typically not used much: microcode. Microcode makes it possible to have really, really tight control over what the processor does. So I would like to give you a brief idea of what microcode is. The processor has a micro-program, and the micro-program is a list of micro-instructions. Each micro-instruction consists of separate control fields, and each control field has a value which directly controls the behavior of one of the functional units of the processor. One of the functional units is the sequence control, which decides which micro-instruction to execute next. So this is very direct, close control over the features of the hardware, and an architecture operated by microcode is basically an operation-oriented hardware architecture: every step inside the processor is directly controlled by the microcode which we develop. All right, so now you have some idea of how microcode works. But what is it good for? There are a few things which we value in microcoding. First, as I already mentioned, we have complete and actually deterministic control over what the processor is doing. Second, our hardware, the processor core, can be minimal, since all the complex control logic is implemented in microcode: there is no need for pipelines, we don't have a cache hierarchy, out-of-order or speculative execution, and we don't have a complex hardware state to maintain. Also, since we implement the different features in microcode with direct, tight control, the utilization of the actual hardware can be maximized. 
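The micro-program mechanics described above, a list of micro-instructions whose fields directly drive the functional units, with a sequence-control field picking the next step, can be sketched in Python. This is purely an illustrative toy model, not Imsys's actual microcode: the field names, the tiny ALU, and the halt convention are all assumptions made up for this sketch.

```python
from dataclasses import dataclass

# Hypothetical micro-instruction: each field directly drives one
# functional unit; "next_ui" is the sequence-control field that picks
# the following micro-instruction (no hidden hardware state).
@dataclass
class MicroInstr:
    alu_op: str   # controls the ALU ("add", "sub", "nop")
    src_a: str    # register selected onto ALU input A
    src_b: str    # register selected onto ALU input B
    dst: str      # register written from the ALU output
    next_ui: int  # sequence control: index of next micro-instruction

def run(uprogram, regs):
    """Execute a micro-program until sequence control halts (-1)."""
    pc = 0
    while pc != -1:
        ui = uprogram[pc]
        if ui.alu_op == "add":
            regs[ui.dst] = regs[ui.src_a] + regs[ui.src_b]
        elif ui.alu_op == "sub":
            regs[ui.dst] = regs[ui.src_a] - regs[ui.src_b]
        pc = ui.next_ui  # every control step is explicit in microcode

# A two-step micro-routine: t = a + b, then t = t - c.
prog = [
    MicroInstr("add", "a", "b", "t", 1),
    MicroInstr("sub", "t", "c", "t", -1),
]
regs = {"a": 5, "b": 7, "c": 2, "t": 0}
run(prog, regs)
print(regs["t"])  # → 10
```

The point of the sketch is the determinism the talk emphasizes: there is no implicit pipeline or speculation, every state change is one explicitly microcoded step.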
So we provide maximum efficiency with our hardware. Microcode also means flexibility: we can implement basically any kind of computation in microcode, which makes our processor core a multi-purpose device. We can develop a general-purpose instruction set architecture, but we can also microcode special digital signal processing features like FFT, or encryption, and so on. And since microcode is a very special kind of software, stored in a special memory in our processor, we can overwrite the microcode and reconfigure the device dynamically. That's also an important feature for the future. If you think about the abstraction layers again, we can match them to these hardware, microcode, and software layers. The processor core itself is of course hardware. The instruction set architecture is actually not hardwired in the processor core; it's defined by microcode. And above the microcode we have a quite thick layer of software: at the highest level the application code, and at the lowest level the instruction set, which is implemented in microcode. The high-level and the low-level software are connected by LLVM. OK. I just would like to point out what it means to have a microcode-defined instruction set architecture. Typically you have your application code, and a compiler turns it into assembly code and a binary executable for an instruction set that is hardwired into the processor. But with an operation-oriented hardware architecture, the microcode allows us to lift the instruction set architecture from the hardware to a higher abstraction level. What this abstraction possibility gives us is that we can implement domain-specific operations, like I mentioned, FFT, encryption, and whatever, and it also provides us with the possibility to have a rich and balanced ISA which the compiler can target. OK. 
So that's about our core technology. For the next few slides I will focus on the actual instruction set architecture which we are implementing for LLVM. We have LLVM assembly, which everyone knows very well, and then we have our own ISAL instruction set architecture. First, what do I mean when I say that we lift ISAL to match LLVM assembly? It means that we provide semantically matching instructions for basically all LLVM assembly instructions. If you have an addition in LLVM assembly, then you have an addition in our instruction set architecture. Of course, this is not a big thing for such basic operations; every processor has an addition instruction. But we have the same semantically matching, corresponding instruction for complex operations as well: for example, the bit-reverse operation, count leading zeros, or even intrinsic floating-point operations and things like that. We developed our own LLVM backend, which turns LLVM assembly into ISAL. This backend does not actually need to do very complicated things; it's simple and quite efficient, and it mostly uses general LLVM facilities, thanks to the matching semantics. Also, LLVM assembly code can of course be optimized using the LLVM middle end, and we are very happy with that, because we get direct use of those general LLVM-assembly-level optimizations: our backend does not modify the semantics of the code much, so we benefit directly from the optimizations. And since the ISAL instructions match the LLVM assembly instructions, we don't need to do much target-specific magic in the backend. So far so good, but LLVM assembly is based on a theoretical model with some characteristics that make it practically impossible to implement directly, of course. So we needed to think about how to constrain ourselves to be able to implement an instruction set architecture for LLVM. The first thing is operations. 
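The one-to-one semantic matching described above is why the backend stays simple: instruction selection comes close to a table lookup. A minimal sketch of that idea follows; the ISAL mnemonics here (`bitrev`, `clz`, `fsqrt`) are illustrative assumptions, not the real ISAL names, while the LLVM operation and intrinsic names on the left are real.

```python
# Hypothetical one-to-one selection table from LLVM operations and
# intrinsics to ISAL-style mnemonics. Because the semantics match,
# "instruction selection" degenerates to a lookup with no lowering.
LLVM_TO_ISAL = {
    "add":             "add",
    "mul":             "mul",
    "llvm.bitreverse": "bitrev",  # bit reverse has a direct instruction
    "llvm.ctlz":       "clz",     # count leading zeros, likewise
    "llvm.sqrt":       "fsqrt",   # even float intrinsics map 1:1
}

def select(llvm_op):
    """Toy instruction selection: direct lookup, nothing to expand."""
    return LLVM_TO_ISAL[llvm_op]

print(select("llvm.ctlz"))  # → clz
```

On a conventional target, an intrinsic like `llvm.bitreverse` would instead be lowered into a multi-instruction bit-twiddling sequence; the lifted ISA removes exactly that expansion step.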
LLVM assembly has instructions and intrinsic functions. As I said, we provide the semantically matching operations in ISAL. Additionally, we needed to add some system and management operations, like handling IO, managing the execution state, and also some special data movement operations. The next thing is the supported value types. LLVM assembly has a virtually unlimited number of single-value types: there are a lot of floating-point types and many, many more possible integer types. They are typically not used, but in theory they are there. So of course we needed to define a set of integer, floating-point, pointer, and vector types which we support. Then registers. LLVM assembly has an unlimited number of registers, which is, again, impractical to implement, so we defined a set of register windows which we support. And while talking about registers, it is worth mentioning the big difference between LLVM assembly and reality: LLVM assembly is in SSA form, static single assignment form, which is, again, not practical to implement, so of course we don't support that. As for the arguments, basically each instruction has source and destination registers as arguments. I will tell you on the next slide why it is not always efficient to have only that; we added special instruction variants to support accumulating in source registers and to work directly with immediate values. The binary representation is also very important for us. LLVM assembly has bitcode as its binary form; we developed our own dense binary coding, and I will tell you some more details about that too. All right. So, optimizing operation sequences. First, let's have a look at accumulating in source registers. 
If you consider the simple addition A = A + B in, let's say, the regular form of an instruction, you write add A, A, B: add A and B, and store the result in A. That means you have an opcode, the destination register, and two source registers. But it's quite obvious that the destination and one of the source registers are the same, so why should we store it in the program memory twice? We can have a special accumulating variant, or, as we call it, an in-place-update variant: add.update A, B, in which case the first source register is special, as the result will be stored back there. So here we saved one argument in the binary representation, and since this kind of in-place update is possible very often, we can save a quite considerable amount of program memory. The other special variant of instructions works with immediate values. Now consider A = A + 42. If we only have the regular addition instruction like in the previous example, add A, A, B, then 42 has to be in a register. So you need an extra operation, move 42 into register B, before you can use the addition: you actually use two instructions to implement this behavior. If you have a special add-immediate variant, then you can use the immediate value directly instead of the second source register, so you save binary space again. And of course you can combine these two ways of handling the arguments: the combined special variant is an in-place add with an immediate value, where you just specify the combined source and destination register and the immediate value to add to it. Here we save even more program memory space. In ISAL, we have these kinds of special variants for basically all similar instructions where they are possible. OK. And then, optimizing the binary representation. The things on the previous slide already contribute to a reduced binary size in program memory. 
But it's also important to be clever when you design the binary encoding of an instruction set architecture. We want a high code density, which means that a particular piece of software should consume as little program memory as possible. But we have a lot of instructions in ISAL, so to reach our goal we must have variable-length instructions. The length of our instructions varies between one and 10 bytes, and here you can see the distribution: most of them are three bytes long, and the average is somewhere around 3.4 bytes. I don't want to go into much detail about how the actual binary encoding is structured, but I would like to give you some idea of the kinds of problems and characteristics you need to think about. As I already mentioned, we want to maximize code density, so the more frequent instructions should have shorter representations; then the repeated instructions will consume less space. We also want to optimize the footprint of the microcode implementation itself: we want to be able to reuse code for parts of the decoding logic, so we of course want some regularity in how we encode the operations. And we want to optimize not just the decoding but also the computation logic, by sharing those parts between similar instructions and grouping them together. Related to this third consideration: in the end, performance matters, so we want to minimize the actual execution time as well. The binary encoding is relevant here because decoding depends on it. With clever spacing and formatting of the encoding formats, we can reduce the decode time, and we can also make it possible for decoding an instruction and actually starting to perform the operation to overlap. In this way we can minimize the execution time. I talked a lot about binary coding and code density. 
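The variable-length idea above can be sketched as a toy decoder. The format table, the lengths, and the "high nibble selects the format" rule are all assumptions invented for this sketch (ISAL's real formats are not public in this talk); what the sketch shows is the property the talk relies on: the first byte alone determines the instruction length, so fetching and decoding the next instruction can overlap with executing the current one.

```python
# Toy variable-length decoder. The first byte selects the format and
# thereby the total length (ISAL uses 1..10 bytes; the lengths here
# are illustrative). Frequent operations get the shortest formats so
# repeated instructions cost the least program memory.
FORMAT_LEN = {
    0x00: 1,  # very frequent ops: opcode only
    0x10: 3,  # common ops: opcode + two register bytes
    0x20: 5,  # rarer ops: opcode + registers + 16-bit immediate
}

def decode(code):
    """Split a byte stream into instructions. Length is known from
    the first byte alone, so decode of the next instruction can start
    while the current operation is still executing."""
    pc, instrs = 0, []
    while pc < len(code):
        length = FORMAT_LEN[code[pc] & 0xF0]  # high nibble picks format
        instrs.append(bytes(code[pc:pc + length]))
        pc += length
    return instrs

stream = bytes([0x00, 0x10, 1, 2, 0x20, 1, 2, 0x34, 0x12, 0x00])
print([len(i) for i in decode(stream)])  # → [1, 3, 5, 1]
```

Assigning the short formats by instruction frequency is what pushes the average length down toward the ~3.4 bytes quoted in the talk.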
So we have some preliminary results with ISAL about code density, or actually this diagram shows binary size, so in this case smaller is better. We compiled the Texas Instruments benchmark suite with LLVM and normalized the results to our ISAL. As you can see, ARM Cortex requires at least 35% more program memory, and x86 requires 80% more program memory, to store the same benchmark applications. So one of our targets was a very dense code representation, which we think we have reached. With these exciting results I would also like to wrap up. I talked about our core technology, the Imsys processing technology: our operation-oriented hardware architecture, the processor core, and our firmware, which is based on microcode and implements the actual instruction set architecture, which we tailor-made for LLVM assembly. I also talked about the relation between them and how we use an LLVM backend to efficiently, and I can say simply, generate efficient code for ISAL. I would also like to mention that ISAL, the implementation of the instruction set architecture itself and the software ecosystem around it, is still work in progress. It's not available yet, but we plan to release it as an upgrade for our Imsys AMLA device sometime next year. So thanks for your attention, and I'm ready for questions. Yes, please. Can we see the microcode that you wrote? You want to check the microcode? Well, we can discuss that offline if you really want to have a look into our microcode; it's not open right now. It's really connected to the hardware architecture, of course, and that's a proprietary thing. But if you're interested, we can take it offline. So you would release a part of it as open-source LLVM but keep the microcode closed source, and we are not able to check it at that point. 
No, maybe I can answer this question by showing another slide. Right now we develop the microcode by hand, and the ISAL microcode is not released yet in any way, neither as binary nor as source code. Previously the practice was that the company developed the microcode, and yes, it was proprietary, closed microcode firmware. With ISAL, I cannot tell you yet how it will be released. But we actually plan to use LLVM not just to target our instruction set architecture but also to generate microcode from high-level software code. In that case the microcode itself would be generated by open-source software, and so this problem could be resolved. Okay. Yes, please. How did you find the trade-off in complexity? Because it seems like you've got a simplified backend, but haven't you replaced part of that backend by something written in assembly? Yes, so you are thinking that the complexity from the backend was actually moved down to the microcode implementation, right? Well, yes, that's true. And of course it means that the microcode needs to implement some complex features, which could be error-prone and needs discipline to implement correctly. That's actually why we want to replace this, let's say, handcrafted microcode development with generating microcode. This slide might also be related. Our strategy is not to implement ISAL all at once, because that would be quite a big piece of work. In the first phase, where we are now, we identify a base subset of the instruction set and implement that in microcode as a proof of concept, and it will be fully functional. Of course, it will not immediately provide the same code density, because we will not be able to utilize all the instructions. 
But after that, and we have partly started this work too, we want to implement emulation support: complex ISAL instructions could, as a first step and as a reference implementation, be implemented in ISAL itself, using the simpler, already microcoded operations. And later we plan continuous development, to support on demand the features which actual customers or the community need. Did that answer your question? Okay, thank you. Yes? Can you explain why you had to do this? Basically, you're constructing your ISA based on LLVM IR to simplify how you write the backend; what constraints forced you to do this instead of choosing something like MIPS or ARM or whatever? Okay, let me try to rephrase your question. One part was, sorry, so you asked what the benefits are of using microcode to implement this ISAL, a special instruction set architecture, and you also asked why we use this processor. It's somewhat related, but I would like to take it as two parts. For the first part, sorry, I just want to find a slide. Of course, having a simple way to implement the backend is good. It's good for Imsys, because Imsys is a processor company: we have expertise in processor and microcode development, and implementing complex compilers is not really our field. So if we can have a simple backend, that's just good. But actually, that's just a side product. We wanted to utilize microcode for the characteristics I listed. In general, we believe that by matching the LLVM assembly level, we can have a rich instruction set which provides complex instructions, so the binary code size required in program memory for the executable can be minimized, smaller than for other mainstream architectures. That's one benefit. 
And we also believe that having the complex operations in microcode helps improve performance. In this case, the whole complex operation is in microcode; otherwise, if we had only a simple RISC instruction set, the same operation would take a sequence of assembly instructions, which would take extra time to decode all of those instructions, and so on and so forth. And then, why not use other processors in general? Well, because we have our own processor technology, and of course we want to explore the possibilities in that. I was wondering what the gain actually was, why it was better to design your own compared to using something existing. Yes, then, sorry, it's better to show a slide directly. This is a bit more detailed overview of what we support. What is interesting are these green things: we have the Java support, we have ISAL, and we plan to have a special part of the instruction set support called DSIX, domain-specific instruction extensions. As I said, microcoding provides us flexibility and reconfigurability, and we plan to generate microcode from hotspots of application code as well. So not just the general-purpose ISAL itself can be utilized to exploit the processor core; we can also implement complex application-specific features in microcode to improve performance. That's also a special feature. Yes, please. You made some claims about how your ISAL compares in terms of code density to existing ones. These claims look very similar to the claims made yesterday in the RISC-V talks. So how does yours compare to RISC-V in code density? Yes, that's a good question. We didn't compare with RISC-V, and that's probably a missing thing which we should do. But we believe that RISC-V is still a RISC architecture. This of course relates back to the previous question. 
So, as for why we decided to implement something specific for LLVM: yes, RISC-V is there, but we believe that our binary coding, matching the LLVM assembly instruction set, provides better code density. I cannot give you actual figures right now, but we will check that out. Does that answer it? Okay, yes. Yes, you in the black shirt. I was wondering, do you have any benchmarks on the performance? And a similar question to that one: this looks to me like you are pushing more complex operations down to the hardware side, and there have been a couple of attempts, for instance, to have a JavaScript interpreter or a Java bytecode interpreter supported by the hardware side itself. Do you have any ideas going in that way, so you could implement those opcodes as your own instructions? Yes. So, about interpreting different things: microcode is not a hardware implementation. It is very low level and very close to the hardware, but microcode is rather software than hardware, if we have to compare. And of course it's possible to implement other instruction sets, other kinds of operations, in microcode. Right now we don't have any plans to do that, but it would be a possibility if we see a customer need or something like that. But are you accumulating all these instructions and getting the real benefit out of that, rather than generating efficient RISC-like code? Ah, okay, I see. So you are asking whether we are thinking about directly executing some special instruction set rather than just compiling into a good instruction set architecture. Would that be a basis for some future ISA from you to implement in hardware, or do you plan to just emulate it from now on and work on that basis? 
Okay, so your idea is to have this microcode implementation as a reference and then move everything into hardware? No, we are not planning to do that. We believe that microcode itself provides us the flexibility to change things; maybe I'll show another slide then. As I mentioned, our core... But are you getting some benchmark results? Oh yes, the first question, sorry: the performance. As I said, the ISAL implementation is still work in progress, so we couldn't really execute anything yet, but we are close. I cannot tell you any actual execution-time figures, but we expect that we can be better than an ARM Cortex-M0, for example; I don't have any figures yet, that part is work in progress. I could tell you about our estimates, but that's nothing firm. But about keeping everything in microcode and not moving into hardware: as I mentioned, our processor core is quite small compared to the mainstream ones. We use 65 nanometers now, and we don't have a really massively multi-core solution yet, but it is theoretically possible to have several thousands of cores on state-of-the-art process nodes. In that case, each processor core has its own micro-program and is software-configurable, so depending on the application, the software itself would be able to, let's say, dynamically reconfigure and repurpose each one of the cores separately, depending on the actual application requirements. If we put something into hardware, then of course this wouldn't be possible; everything would be fixed. We believe this kind of flexibility will be very important in the future, and that's why we want to keep everything in microcode. All right, I hope that answered your question, and we can take it offline. So thank you for your attention again.