Shall we just start on time? Yeah. All right. Those who are late, well, they'll have to suffer. So welcome to what I claim is the nerdiest talk of the conference — I claim that title. If you don't want to wade into technical details which are probably irrelevant for your future life, you might better leave now, or you'll waste 40 minutes of your life. This is something which is of interest to many people inside Red Hat, but not really too many outside, though it applies to other lines of work as well. So I'm talking about the really intricate details of how the CPU actually executes things. And I will use a lot of references to old technology, because it's just simpler. We will see this over time.

So when I talk to people, even people coming with a CS background, most of the time they don't have any understanding of how a CPU actually works and what it is operating on. So I'm looking at this at a very high level here. Let's start with something very simple. To perform the operations which a CPU is normally supposed to do, we have at the very core, from very early on, something called the ALU, the arithmetic and logic unit. In the earliest days, the entire computer consisted only of that. There were paper tapes and punch cards and so on which were fed into the machine; everything was simply read from these media, run through the arithmetic and logic unit, and then you got an immediate output of some sort. So this is the very core, and this is why all computers exist to this day: the only reason they exist is that they can perform something according to the wishes of the user or the programmer.

But to do this, we need what is called state. In today's more complex machines, we don't want data streaming in and streaming out without any dependence between the different instructions or the different operations which we're performing. So we need to keep track of information. One of the most obvious pieces is what is called the instruction pointer. Even the simplest model needs this — if you think back to the Turing machine, in the theoretical treatment the Turing machine had the position on the tape at which it is currently reading. So we need this kind of information, and we still have it to this day. But if we want to influence this kind of thing, the CPU itself needs to operate even on something like the instruction pointer. So we need, as a core part of the CPU, instructions which operate on the future way the CPU will execute — we need execution control instructions inside the CPU itself.

And we are not at the point anymore where we hard-code the instructions which execute on the CPU. They now live, in either the Harvard or the von Neumann model, in memory. So we need to be able to read instructions from memory, while also keeping state and data in memory. That's the part which is easy to understand; everyone knows about this. But for the CPU itself, it means that we need access from inside the CPU to memory — load and store — and this is not something which comes naturally. Even in today's machines you see that the memory is normally not part of the CPU itself; it is somehow attached to it. There is a non-trivial path there, with lots of implications for the operation of the CPU itself.
So when we are loading things, what became clear very early on is that we need some form of representing the content of what we're loading. We cannot just say: load from there, do the operation, and store it somewhere else. It turns out that this is far too slow — we'll get to that later on. So very early on we introduced in computers the concept of registers. These are simply specific high-performance locations close to the CPU core where we can store intermediate data for a short period of time for certain operations. So we have this kind of infrastructure in the CPU itself, but to be able to utilize it we need more operations inside the CPU. We need load and store operations, and we need transfer operations to move things between the different registers — so not just between memory and the registers, which is load and store, but also between the registers themselves.

And the whole thing nowadays has to run an operating system, which adds more requirements on the operations. We need operations which allow us to administrate the running programs themselves. In the early days we had only one single program occupying a single machine at any one time. This became really, really inefficient over time. So now we have in most systems a clear distinction between the system mode of the CPU, where we can arrange that multiple programs run on the same machine at the same time as if each were running on its own — the programs themselves don't necessarily have to know about this — and this needs some form of abstraction and some instructions which allow us to perform operations in system mode. This, as I said, is there not only for isolation but also for security reasons.

So all of these kinds of operations have to be performed by a CPU nowadays, but how does this actually work? Well, we'll go into some of the details here. I'm not going into all of it — that would be far too much, unfortunately; I don't have the week with you which I would need, at the very least, to cover everything. So I will not go into many of the details related to memory handling and so on; there will be just a very short excursion into that. I'm mostly looking at some of the more nerdy aspects: instruction encoding, which is the next thing we're talking about, how individual instructions actually get scheduled in the CPU, and what kind of advancements we have made in these areas over the last decades.

Instruction encoding is something quite abstract to most people. For those who live in the compiler world and with those kinds of tools, this is a natural way to think, but just imagine we have a general-purpose CPU and we have to tell it somehow what kind of operation it's supposed to do next. We can do this in various different ways. In the simplest form, which is usually called the three-address form, we encode, in some way or form using a number, the operation which we want to perform. Then we have one or perhaps two source operands, and then a destination target, all specified in some form which can be stored in memory. That's what the assembler creates. That's what is usually called machine language: a sequence of these encoded instructions in memory which, when executed sequentially, perform the operations the program is designed to do.
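To make the three-address form concrete, here is a minimal C sketch of encoding such an instruction into a machine word. The field layout and the opcode numbers are invented for illustration only; they do not correspond to any real architecture:

```c
#include <stdint.h>
#include <stdio.h>

/* A toy three-address encoding, invented for illustration:
 *   bits  0-7   opcode (which operation to perform)
 *   bits  8-15  first source register
 *   bits 16-23  second source register
 *   bits 24-31  destination register
 */
enum { OP_ADD = 0x01, OP_SUB = 0x02 };

static uint32_t encode(uint8_t op, uint8_t rs1, uint8_t rs2, uint8_t rd)
{
    return (uint32_t)op
         | (uint32_t)rs1 << 8
         | (uint32_t)rs2 << 16
         | (uint32_t)rd  << 24;
}

int main(void)
{
    /* "add r3, r1, r2": take operands from r1 and r2, store into r3. */
    uint32_t insn = encode(OP_ADD, 1, 2, 3);
    printf("machine word: 0x%08x\n", insn);   /* prints 0x03020101 */
    return 0;
}
```

The only point is that the operation, the sources, and the destination all end up at known places inside one word that can be stored in memory.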
So in the simplest form, an operation — as in this case here, for instance — operates on two registers, which are all internal to the CPU itself. We're saying: take the operation which is specified in the opcode subword there, take the two operands from these registers, and once you're done with the operation, store the result in that register. That's the simplest form. The same encoding has to, in most cases, also allow for a form where some of the operands, or all of the operands, come from memory. At the very least this has to be done for operations like load and store, the things which I mentioned at the beginning: in load and store we have to somehow encode a memory address from which we want to load the data, or to which we want to store it, and in this case also a register — the register into which the data is loaded or from which it is stored. So these are some of the simple instructions which we can encode. Other operations might vary slightly — maybe they need a couple of different fields — but in general you get the gist: this is the kind of thing we have to do and how we can encode it.

How this looks in practice for different processors is actually quite different. In what is called the RISC world — reduced instruction set computers — we usually have the instructions encoded in 32-bit words. There are variants: we have compressed formats, we have truncated formats and so on, where this is somehow made smaller; that is mostly done for embedded systems. But in general we have 32 bits available to encode our instructions with all their parameters, et cetera. What you see here at the top is the encoding of a single instruction, and at the bottom an overview over the different instruction formats. This comes from RISC-V, which is a completely freely, openly developed CPU architecture, initially out of Berkeley. And you see that it's really uniform, in the sense that you have fixed fields inside the 32-bit word which makes up the instruction. That makes it very easy for hard-coded logic, in the form of an ASIC or something like this, to get to the individual pieces of information. We always know that at certain bit positions inside the word we find the destination register or some other form of register, and we can pull this information out directly. There's no real decoding needed — we just extract it — and the actual operation to be performed can also be looked up very easily. You see there the arrows originating from some of the first bits in the 32-bit word; those are the ones used to index this table here. The table describes what kind of operation, at a high level, is to be performed when these five bits have a certain value. For instance, all kinds of load instructions have all five of the bits you're seeing here set to zero. So in this case we can write some very simple logic for the CPU which initiates the execution of a load instruction whenever it sees these zero bits there. It's very simple, it's very fast, there's not much logic necessary, not much electricity necessary to do the decoding and actually start the execution.
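Here is what that fixed-field extraction looks like in C, following the published RISC-V base instruction layout — a minimal sketch of what the hardware does with plain wiring, all fields pulled out in parallel:

```c
#include <stdint.h>
#include <stdio.h>

/* Field extraction for a 32-bit RISC-V R-type instruction word.
 * Every field sits at a fixed bit position, so real hardware can
 * route them all out simultaneously -- no sequential decode. */
static void decode_rv32(uint32_t insn)
{
    uint32_t opcode = insn & 0x7f;          /* bits  6..0  */
    uint32_t rd     = (insn >> 7)  & 0x1f;  /* bits 11..7  */
    uint32_t funct3 = (insn >> 12) & 0x07;  /* bits 14..12 */
    uint32_t rs1    = (insn >> 15) & 0x1f;  /* bits 19..15 */
    uint32_t rs2    = (insn >> 20) & 0x1f;  /* bits 24..20 */
    uint32_t funct7 = insn >> 25;           /* bits 31..25 */

    printf("opcode=%#x rd=x%u funct3=%u rs1=x%u rs2=x%u funct7=%#x\n",
           opcode, rd, funct3, rs1, rs2, funct7);
}

int main(void)
{
    decode_rv32(0x002081b3);   /* encodes "add x3, x1, x2" */
    return 0;
}
```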
On the other hand, here is the exact counter-example. This is how the x86 instruction format looks nowadays, and if Intel gets its way, it gets more complicated every single day — they keep inventing more and more of these instruction encoding extensions, as you can imagine. This is not a single 32-bit word: all of these blocks are individual blocks, these are individual bytes, and they cannot all appear in just any order. You actually have to decode the first byte to know which path to take to decode the second byte. That is a sequential operation, and we hate sequential operations; this had all better happen in parallel. In a RISC architecture we can decode all of the different fields at the same time. Here we can't. This means that to accelerate these kinds of operations, and to perform them at all, an enormous amount of logic is necessary in what's called the instruction decoder of an x86 system. And if we go further — we want to be able to decode more than one instruction at the same time — imagine the nightmare: we don't even know where the second instruction starts, let alone what the individual fields of the first instruction are. These kinds of operations require lots and lots of logic — thousands, I don't know the actual number, maybe millions of gates just for the decoding — and all these gates have to be kept running with electricity. You can argue that the instruction decoder of one of the high-end Xeon chips probably takes as much energy as an entire low-end ARM chip. It's mind-boggling what we're stuck with. So that's something to think about when you're looking at CPUs: if you're looking into really energy-efficient computing, x86 really should not come to mind. And Intel has the problem that even when they wanted to base their future architectures on something else — they had other things in the making — they still stayed on the x86 architecture; they simply cannot let go. It's a really, really strange situation for them.

So how does the CPU now actually execute what it's supposed to do? With the instruction format I introduced the concept of decoding, which also appears here, but the actual sequence of executing an instruction can be summarized in these steps. And this derives from the fact that, yes, we started out defining what a CPU does even before we had integrated CPUs, when we still built them explicitly out of discrete transistor logic. So we still have this sequence. We fetch the instruction from memory. We decode it — that's the thing from the previous slide — trying to figure out what this instruction actually does. Then we find out: ah yes, here are the parameters, fetch them. Fetching them can happen in different ways: we can read them from memory, or we can get them from the instruction itself — there are so-called immediate instructions which encode parts of the operands for the operation in the instruction itself — or, hopefully in most cases, the data actually comes from the registers. But all of this depends on the decoding having happened before; before that we cannot really start. And once we have all the parameters in place, wherever they are needed, we can then finally start the execution.
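To make concrete why x86 decoding is inherently sequential, here is a grossly simplified C sketch. It only walks the legacy prefix bytes; a real length decoder additionally has to work through REX/VEX prefixes, the opcode, ModRM, SIB, and displacement/immediate sizes, each depending on the bytes before it:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Classify the legacy x86 prefix bytes. */
static int is_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67:             /* operand/address size override */
    case 0xf0: case 0xf2: case 0xf3:  /* LOCK, REPNE, REP */
    case 0x2e: case 0x36: case 0x3e:
    case 0x26: case 0x64: case 0x65:  /* segment overrides */
        return 1;
    default:
        return 0;
    }
}

/* Returns the offset of the opcode byte.  Note the data-dependent
 * loop: byte n must be classified before byte n+1 can even be
 * looked at -- exactly the serialization a fixed 32-bit format
 * avoids. */
static size_t skip_prefixes(const uint8_t *code)
{
    size_t i = 0;
    while (is_prefix(code[i]))
        i++;
    return i;
}

int main(void)
{
    const uint8_t insn[] = { 0x66, 0x01, 0xc8 };       /* add ax, cx */
    printf("opcode starts at byte %zu\n", skip_prefixes(insn));  /* 1 */
    return 0;
}
```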
I'm not sure how many EEs are here, and how many of you have designed your own CPU. You know that if you're building something like an ALU, an arithmetic logic unit, it's not as if you say: ah, here are the electrical inputs, and a nanosecond later I expect all the output signals to be available. There are propagation delays; there is a lot of logic in between. In short, it takes time to actually finish these operations, so there are limitations when it comes to execution. Only if we reduce the frequency of the chip to a really, really low number can we expect that the propagation doesn't have an effect. Back when we started out with machines that had less than a megahertz of CPU frequency, we didn't care about that: one single cycle went by and the result of the instruction was available. That was nice. But over time we sped the process up by a factor of 5,000 or even more, and all of a sudden the speed of light and the propagation of signals actually make a difference. So this is not the case anymore.

Then, once we have the result of the computation, in whatever form it takes, we have to write back the result. This can be into memory; it can be into a register. We also have to update what's called the state of the CPU, usually in the form of status flags for the arithmetic unit and other things. And once we're done with that, we have finished executing the instruction. That's the way, up to this point in the logic, that we execute instructions. But if we constrained ourselves to executing them really like this — all sequentially, so that before we have reached the end of step five we cannot start over for the next instruction — we could not scale CPUs up to the performance levels we are seeing right now. So what I'm going to describe going forward is how this actually works and how many of these things have been improved over time.

But first, let's take a step back in history. Does anyone know what kind of CPU this is? Anyone? Pardon me? Well, it could in theory be the Z80, but it's actually the 8080 — what you're seeing are the registers which were in the 8080; the Z80 was a successor of it. I like the 8080 as an example because it's so simple: we actually understand it at the transistor level nowadays, and everything is freely available. And I want to go through these steps on the example of the 8080 because, in theory, really nothing has changed. We have the same components — to some extent they are a lot more complicated now and work differently, and there are some additional components — but all of these still exist as well.

So for step one, for fetching, what is involved? We have an internal register, the temp register, which takes the current instruction to be executed, and we load the instruction from memory. The memory is addressed — at the bottom right you see the address bus. We put the address on the bus, and in the next cycle we can read, through the data bus at the top, the byte from memory, which is stored into the temp register. Then the decoding happens — in the 8080 that's the PLA, but nowadays it's a lot more complicated, of course, as we alluded to before with instruction encoding. These things used to be very simple. What the PLA does is set various internal lines to steer the data flow and also the execution selection and so on internally. It's basically a lookup: if this instruction comes in, then set these lines at this cycle.
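As a toy model of the whole fetch/decode/operand-fetch/execute/write-back sequence, here is a minimal C sketch. The mini instruction set is invented for illustration, and the decode step is a plain dispatch on the opcode, much like the 8080's PLA lookup:

```c
#include <stdint.h>
#include <stdio.h>

/* A toy machine walking the classic steps for every instruction:
 * fetch, decode, operand fetch, execute, write-back. */
enum { HALT, LOADI, ADD, SUB };   /* invented opcodes */

typedef struct { uint8_t op, rd, rs1, rs2; int32_t imm; } insn_t;

int main(void)
{
    insn_t program[] = {
        { LOADI, 0, 0, 0, 40 },   /* r0 = 40       */
        { LOADI, 1, 0, 0,  2 },   /* r1 = 2        */
        { ADD,   2, 0, 1,  0 },   /* r2 = r0 + r1  */
        { HALT,  0, 0, 0,  0 },
    };
    int32_t reg[4] = { 0 };
    unsigned ip = 0;                  /* the instruction pointer */

    for (;;) {
        insn_t i = program[ip++];     /* 1. fetch */
        switch (i.op) {               /* 2. decode: dispatch on opcode */
        case LOADI:
            reg[i.rd] = i.imm;        /* 3. operand is an immediate */
            break;                    /* 4.+5. execute, write back  */
        case ADD:
            reg[i.rd] = reg[i.rs1] + reg[i.rs2];
            break;
        case SUB:
            reg[i.rd] = reg[i.rs1] - reg[i.rs2];
            break;
        case HALT:
            printf("r2 = %d\n", reg[2]);   /* prints 42 */
            return 0;
        }
    }
}
```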
So next comes the fetching of the parameters. The 8080 was a simple machine; it didn't really have the three-address form, so there can be only one additional parameter in every single instruction. We only need to worry about loading one operand, and that is loaded into this temp register which you see there. The accumulator, ACC, is an explicit register, and it is always implicitly used in the arithmetic operations. To load something, you again put an address on the address bus and get the value from the data bus in the next cycle, or you get it from the register block on the right-hand side. Then you do the operation: the ALU is triggered by the PLA — based on the instruction decoding, the PLA sets the appropriate bits to tell the ALU what kind of operation to perform. And then we write back the results: either into any of the registers, including the ACC register, or we write them out, for instance for store operations, onto the data bus, after previously having selected the target using the address bus. And we also update the flags. So these are the kinds of operations which are going on all the time.

But at that point we run into one fundamental problem, and that is that we cannot speed this up indefinitely, because memory is slow. Think about how we used to do this — Φ here is the clock. In a one-megahertz world, with static RAM attached to the CPU, we could rely on the fact that after we put the address on the address bus in one cycle, in the next cycle we can read the memory content. But now let's speed the whole thing up to gigahertz clock speeds. Now it takes a hundred cycles to read from memory. Just imagine what this means: from fetching the instruction from memory to the decoding phase, we would have to wait a hundred cycles, and the effective frequency would not be two gigahertz, it would be 20 megahertz — or less. This is not going to work; we need to be able to do something in the meantime. We have to amortize the memory accesses: we should not load just a single byte, we should load more than that and make it available inside the CPU. And we need to do something while the memory accesses are actually happening — we need to keep the CPU busy. These have been the guiding principles of the last 25 years of CPU design: increase what is called the IPC, the instructions-per-cycle rate, to more than one, so that we can actually perform more work.
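A hedged sketch of what "keep the CPU busy while memory is slow" means in practice: in the pointer chase below, every load depends on the previous one, so the core pays the full memory latency on each step, while the independent indexed loads in the second loop let the hardware overlap many outstanding misses (and prefetch cache lines). Exact timings of course depend on the machine:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 22)   /* 4M entries * 8 bytes = 32 MiB, bigger than LLC */

/* Tiny xorshift PRNG so we don't depend on rand()'s limited range. */
static uint64_t rng = 88172645463325252ull;
static uint64_t xorshift64(void)
{
    rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
    return rng;
}

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm builds one big random cycle, so the chase
     * below visits all N cells in an unpredictable order. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64() % i);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Dependent loads: each iteration needs the previous result. */
    clock_t t0 = clock();
    size_t p = 0, sum = 0;
    for (size_t i = 0; i < N; i++) { p = next[p]; sum += p; }
    clock_t t1 = clock();

    /* Independent loads: addresses known up front, misses overlap. */
    for (size_t i = 0; i < N; i++) sum += next[i];
    clock_t t2 = clock();

    printf("pointer chase: %.3fs   linear scan: %.3fs   (sum=%zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(next);
    return 0;
}
```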
So this is the overview which I'm going to use now. It describes exactly the same steps as before, with a couple of additional blocks. At the top you have the decoder which, after fetching the instruction from memory, decodes it. What has crystallized over the last couple of decades is that after that we put things into what's called the decoded instruction cache — it has various different names; I'll just call it that. This simply stores the decoded instruction, in whatever internal form the developer of the CPU finds useful, for each incoming instruction. That is not that problematic for a RISC CPU — in theory we could store it in the RISC format there as well, unless we actually want something else — but for a CISC CPU like x86 this is crucial, because the decoding, as we said, is so complicated. And, more to the point, nowadays x86 instructions don't actually get executed as-is: the x86 front end translates each x86 instruction into a number of micro-ops which are then executed, so the decoded instruction cache actually caches all these micro-ops which came out of the decoding. After that we get into what's called the reorder buffer. This is something I'm going to talk about in detail now: it is the piece of the structure which pulls instructions out of the decoded instruction cache, one after the other, once they can be executed. And that's the important part: the first part — the decoder and the decoded cache — processes all the instructions in order; the rest does not necessarily do so, and what that means is what we're going to talk about now.

So let's look at what this actually means in terms of some actual code. I have to apologize that it looks a little bit weird going forward, because the font for some reason changed since I wrote the slides; here it's okay, but later on you'll see the markers offset by some random amount. I gave each of the units you see there an individual number, and we are now looking at how an instruction sequence — which you can see on the left-hand side — proceeds through the CPU. The first thing is that the first instruction, FLD, which is a floating-point load, gets decoded: first it is fetched from memory, then decoded — and we're talking about a single-issue machine here, the traditional way of doing things. Then it can be operated on: the instruction sits in the decoded cache, and because there is nothing else going on, it can immediately be executed, and so on. In the meantime — remember, we want to do multiple things at the same time — the decoder isn't doing anything anymore, so we can already get the second instruction decoded. Why is there no cache access for the second instruction in my example here? Remember, one of the things I said before: we want to amortize accesses, we don't want to go to memory for every single instruction. So along with the memory fetch for the first instruction, we got the memory for the second, third, and following instructions as well. These kinds of things are necessary for performance; we cannot wait on memory every single time. That's what instruction caches are for. So we now have some form of parallelism: the first instruction is being executed while the second instruction is being decoded, and this continues — once the first one starts executing (because it's a load, it also uses the caches, et cetera), the second instruction can be put into the reorder buffer and the third instruction can start being processed.
That's already much more efficient than what we saw with the 8080 execution model, where we had a single instruction in flight at any point in time. But it's still not really that great, specifically because the instructions you see here — the first and the second instruction, for instance — do not depend on each other in any way or form. They could even be executed in the reverse order without affecting the correctness of the program. So what has been done in high-end CPU design is to analyze on the fly, while the program executes, what the dependencies between the different instructions are, and that's what is represented here. I gave every single instruction a number in sequence, and on the right-hand side you see a dependency graph of the instructions. Only if there is an arrow pointing between two nodes do we actually have a dependency. Which means, if you look at this: there's no arrow between one and two, or two and three, or three and four and five — the first five instructions could actually be executed at exactly the same time, if we have the necessary bandwidth. And it turns out that high-end processors nowadays have exactly that. We are talking about multi-issue CPUs, where the decoders are capable of decoding more than one instruction per cycle, and the decoded instructions are stored in these caches. Therefore, if at some point we have enough instructions in the decoded cache, we have the resources to execute them in parallel, and there are no dependencies between them, we can execute more than one instruction at the same time. This is how we get IPC numbers larger than one — a very important thing for performance.

The other part is basically the same thing in graphical form: instead of the first instruction being decoded first and so on, all of the instructions are decoded together. Here again, for RISC this is trivial, because we know every single instruction is 32 bits wide; for a CISC, especially x86, it is horrendously difficult — you can see what kind of hoops they have to jump through to actually make this work.
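To see the effect of such dependency chains from the programmer's side, here is a minimal C sketch: both functions do the same summation, but the second breaks the single serial chain into four independent ones that a multi-issue CPU can work on simultaneously (modern compilers may perform this transformation themselves at higher optimization levels):

```c
#include <stddef.h>

/* One long dependency chain: every add needs the previous sum,
 * so at most one of these adds can be in flight at a time. */
double sum_serial(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the adds feeding s0..s3 have no arrows
 * between them in the dependency graph, so a multi-issue CPU can
 * keep several floating-point units busy at once. */
double sum_ilp(const double *a, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```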
So once we do the decoding, we can store all of the instructions in the decoded instruction cache, and depending on what we have in terms of execution units, we might be able to start executing them. But that's not always the case: we don't always have enough execution units. In this example there's a single one, so even though we have the second, third, and fourth instructions already decoded and already in the reorder buffer, if we don't also have the right execution units, we are still bottlenecked — there is only one single operation we can perform at any time. Which means we want parallelism in execution as well, in addition to parallel decoding and all that. Nowadays, if you look at the block diagram of a CPU — we'll see one at the end — you have multiple execution units, usually built for a special purpose: you have adders, you have floating-point units, et cetera. They're specialized in some way or form, but you have multiple of these pipelines going at the same time. This means that when we reach the execute stage, we can fire off perhaps not only the first but multiple instructions at the same time — if we have the units and there are no dependencies between the instructions. This is why this dependency business is so important, and you can imagine it's also something compiler writers now have to take into account: they generate code so that the CPU has a much higher chance of working on multiple instructions at the same time, because there are no dependencies between them.

So in this case we're happy. But now, what happens here? The J stands for jump — that's RISC-V assembly, by the way, for those who don't know. What happens with a jump instruction? We get to the point where we have decoded the jump instruction, but we cannot actually start doing anything else until we have executed it. Executing the jump instruction means that the instruction pointer — which is a register in the CPU — gets updated to point to the next instruction, at which point we can fetch that instruction from memory and decode it, et cetera, et cetera. And because fetching from memory, decoding, and getting an instruction to the point that it sits in the reorder buffer takes time, we have dead air here — bubbles in the pipeline, as people like to call them. Until we can fetch from the target location — and here, you see, the label is unfortunately not where it's supposed to be, that's the font problem again; until we are actually at the L3 label here — we have nothing else to do in the meantime. That's really, really bad. This is where branch prediction comes in — it has been in the news a lot for us over the last couple of months, and we see here how necessary it is. If we have what's called branch prediction available, then after this jump instruction is decoded, we can already make a guess as to where execution will go next, and the CPU will in most cases already start fetching memory from somewhere — not necessarily the right place, but it will start fetching memory and decoding instructions. This is going on all the time.
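The simple state machine shown on the slide is essentially a two-bit saturating counter; here is a minimal C sketch of that idea (real predictors keep whole tables of such counters, indexed as described next):

```c
#include <stdio.h>

/* The classic two-bit saturating counter. */
enum pred {
    STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN
};

/* Prediction: the upper two states mean "taken". */
static int predict(enum pred s)
{
    return s >= WEAK_TAKEN;
}

/* Update: move one step toward the observed outcome.  It takes two
 * consecutive surprises to flip a "strong" prediction, which is why
 * a single odd iteration (like a loop exit) doesn't ruin the state. */
static enum pred update(enum pred s, int taken)
{
    if (taken)
        return s == STRONG_TAKEN ? STRONG_TAKEN : (enum pred)(s + 1);
    else
        return s == STRONG_NOT_TAKEN ? STRONG_NOT_TAKEN : (enum pred)(s - 1);
}

int main(void)
{
    enum pred s = WEAK_NOT_TAKEN;
    int hits = 0;
    /* A branch taken 7 times, then not taken once (a loop exit),
     * repeated: after warm-up the predictor is right 7 out of 8. */
    for (int i = 0; i < 80; i++) {
        int taken = (i % 8) != 7;
        hits += predict(s) == taken;
        s = update(s, taken);
    }
    printf("correct predictions: %d/80\n", hits);   /* prints 69/80 */
    return 0;
}
```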
But at some point the guess might be wrong, and we have to handle that — good prediction or bad prediction — because in the case of a bad prediction you have to roll back all the computation you've done and start from scratch. That's a pipeline stall, and you're losing a lot of performance. So what is done is that we have branch prediction units which implement this state machine — that's a simple one. Nowadays in x86 you will find the most advanced versions, which look something like this: the branch address, the actual address of the branch, is taken into account together with a couple of other inputs, run through a hash function, which is then used to index a global history table that uses the state machine from the previous slide — and this gives you: well, the target address is this. And it actually works remarkably well.

There's another limit to parallel execution, and it's this — and I think the markers here might be off again. These four instructions can be executed in parallel. But what about the next one? This instruction here loads into the a2 register, but this instruction here is using the a2 register. So you could say, well, this is a dependency — but in reality it's not. It's called a false dependency, and we have to recognize these kinds of things. If we have a single location in the CPU where the content of a register is stored, then we cannot handle this efficiently. So instead of having one fixed location for each register, what is done nowadays is what's called a register file. That's a concept which, I've found, most people have no clue about. Registers are nowadays not actual fixed locations for data: registers are pointers into a data structure, a block of very, very fast RAM, where the actual content lives. And when we come across a false dependency, we simply allocate a new entry in the register file for the newly loaded value and point to this new location: oh yeah, by the way, from now on the a2 register is actually here. The old location stays in place; it is not affected, so the older instruction can still execute at exactly the same time. It's a clever concept in itself, and it means we can actually execute all of these instructions at the same time, because they don't really have dependency problems.
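A minimal C sketch of the register-file idea just described — architectural register names as indices into a rename table, with each write allocating a fresh physical entry. The table sizes and the never-recycling allocator are simplifications; real hardware reclaims entries when instructions retire:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_ARCH 16
#define NUM_PHYS 64

static int64_t phys[NUM_PHYS];      /* the actual storage           */
static int rename_table[NUM_ARCH];  /* arch reg name -> phys entry  */
static int next_free;               /* toy allocator, never recycles */

static void rename_init(void)
{
    for (int i = 0; i < NUM_ARCH; i++)
        rename_table[i] = i;        /* arch reg i starts in phys[i] */
    next_free = NUM_ARCH;
}

/* Reading goes through the rename table ... */
static int64_t read_reg(int arch) { return phys[rename_table[arch]]; }

/* ... and a write gets a brand-new physical register, so older
 * in-flight readers of the same architectural register are not
 * disturbed -- the false dependency disappears. */
static void write_reg(int arch, int64_t value)
{
    int p = next_free++;            /* assume we never run out here */
    phys[p] = value;
    rename_table[arch] = p;
}

int main(void)
{
    rename_init();
    write_reg(2, 100);              /* instruction A: a2 = 100 */
    int phys_a = rename_table[2];   /* where A's result lives  */
    write_reg(2, 200);              /* instruction B: a2 = 200, new entry */
    /* An older instruction that was told "a2 is in phys_a" can still
     * read 100, even though a2 now names a different entry. */
    printf("old copy=%lld, current a2=%lld\n",
           (long long)phys[phys_a], (long long)read_reg(2));
    return 0;
}
```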
The last thing I want to introduce is the concept of pipelines, and for that I'll give you a couple of nice diagrams. This is the concept of how an adder works, and I already talked about signal propagation being limited in its speed. There are lots and lots of gates involved here, and therefore we cannot perform arbitrarily complex operations in a single step. As an example — and this is not how it's done today anymore — imagine you want a 16-bit adder. You can implement it by widening this one here to 16 bits; that gets more and more complicated, because you need more and more levels of logic to handle the propagation of the carry bit. But you can also construct a 16-bit adder out of two 8-bit adders. The problem is that you need to hold the inputs for the second adder while the first 8-bit adder does its work, and you need to hold the results of the first adder until the results of the second become available. That's done using latches, which are synchronized on the clock of the CPU. It means that, using limited logic, you have your 16-bit adder — but the result is not available after one cycle, it takes two. On the other hand, the lower half is not busy while the second half is working. So what happens in the CPU pipeline is that, instead of waiting for the result of an arithmetic operation to be fully available at the end, we start the next operation before that. For a complex operation such as multiplication — which nowadays has a latency of, I think, seven cycles or something like that — we can have multiple of these operations going on at the same time. Without this we'd have no efficient way of doing it, but it also requires help from the compiler and lots and lots of logic in the CPU.

I have to make it quick now; this is the last slide I have. Everything in this talk was very, very simplified. If you want to get an image of how complicated this really is: this is a block diagram of Intel's Skylake. You see all the things we talked about here, and a lot more, and you can imagine how all these pieces have interdependencies between each other. All of this has to be put in place, and we have to write code — generate code, compile it — to actually allow this interplay to happen efficiently. We can very easily write code that makes this thing stall and behave like a 10-megahertz processor. The art is to write code which utilizes all of this logic, all of it, in parallel at the same time; only then are the CPUs really, really good. That's the magic of how we write high-performance computing code, where much of this works: we need to express things like parallelism not implicitly, for the CPU and the compiler to discover, but a little more explicitly. This requires the programmer's help to utilize fully, and this is why we have to understand how CPUs work. As I said, this talk isn't even trying to give you a complete picture of our knowledge of how this actually works; hopefully it just serves to make you walk out of here interested in the topic. There's lots and lots of literature available about these things, and in my opinion you cannot ever write good code without understanding how the CPU works and how the compiler works. Right, that's it. Any questions?

[Audience] Yes — how does one address those register... register files?

You don't see that at all; they're completely, utterly transparent. If you're addressing a register, at any point in time there is normally one single entry in the register file which corresponds to that register. So think about the registers suddenly being pointer variables — that's what I said: most people have no fucking clue about this.

[Audience] They're complicated. How does it keep track of putting all the pieces of the instructions back together without getting things mixed up?
Well, yeah — there's this thing here at the bottom; it shows up back in there as well. The reorder buffer, the ROB, is the piece which keeps track of everything. It gives every instruction basically a number and keeps track of which instruction has to be what's called retired before which other instructions. That's the magic piece there, and it's unbelievably complex logic. There was a guy at IBM called Tomasulo who — back in the sixties, I want to say — started devising algorithms for these kinds of things, which have since been implemented in hardware to keep track of all this. The literature is out there; you can just pick it up.

[Audience] Is this one core?

This is one core only.

[Audience] Isn't that two...?

No, no, that's different — it's one core. Well, it can be two threads; I didn't even start to put hyperthreading in here, because that gets more complicated still. But yes, this one core implements two threads; that's the logic. I don't want to go into that, it's too complicated.

[Audience] What about things like virtualization extensions, where you're allowing a second operating system...?

I should repeat this: how does virtualization figure into this? I mentioned that I didn't talk at all about memory — you didn't hear me talk about how we actually implement memory, especially virtual memory and so on. If you're interested in that, I have a paper I wrote on this a long time ago; it's just 120 pages or something like that. To really cover it, I'd have to come back and talk to you for a month — it's not so easy. But how this works: with normal virtual memory there is a translation between the addresses a program sees — a normal program — and the physical addresses; the translation mechanism in there is called page table trees. Virtualization just adds another layer of it, a completely separate additional page table tree: the virtual address of the program gets translated into a virtual physical address, and the virtual physical address gets translated into a real physical address. And there is some kind of logic in the CPU which makes the inner OS think it's alone in its world and can do its separation and so on. But that's not the most important part — the most important part is the memory part. And yeah, read the paper; I've heard people can read it in a month.

[Audience] Where is Minix running?

Oh no, that's not here — not on these kinds of things. The Minix part is on the...
...so: this is a core. The large CPU itself has many of these cores, and it has things like memory controllers, it has PHYs for the network and so on, and somewhere it has a little embedded processor — it used to be an ARC core, nowadays it's an x86 core — which is running Minix. That's outside of the cores; it lives somewhere else on the chip. Nowadays in the high-end Xeons we have 28 of these cores on a single chip. In addition to that we have the memory controllers — massive amounts of logic by themselves — then we have the PHYs for Ethernet et cetera, then the PCIe hub, et cetera, and then we have the out-of-band management controller itself. That's this tiny little processor: Intel is actually using x86 for it nowadays — they used to use ARC — and it runs its own little operating system. This is where the Minix stuff comes from. All right, I guess we have to get out of here. So, thanks.

[Audience] So in college we designed processors, and they ran at 1 gigahertz, so I could see what was happening. I didn't know that if you ran past 1 or 2 gigahertz you didn't get this stuff back in one cycle — that's news to me. I was not an EE, I was a computer science guy, so I didn't know...

I'm a CS guy too.

[Audience] Now it is over. I'm older than you, I bet.

So Peter and I are, perhaps... it is safe.

[Audience] I have a quick question for you. Okay, so I write garbage collectors, and we have read barriers, and I have long maintained — and papers have long supported — that you can't afford a conditional in a read barrier. You can do a load, you can do two loads, but you can't do a conditional. And I had somebody challenge me recently and say...

What do you mean by the conditional — if a GC cycle is...?

[Audience] If a thread-local flag is true, do something different. And that would take way longer than double the loads...

It just completely depends on what you're doing. If the caches are primed and so on, if you make everything local to the core itself, the "if" doesn't really take that long if it's predictable — the state machine for the branch prediction. Conditional loads and whatever else you want to do can be terribly efficient; you just have to make sure that this hash lookup which is happening there, if you design the thing right, actually always yields the right address. For instance, in the thread library — in the C library itself — what I do to avoid paying for the lock prefix all the time is check a global variable, and if the variable says we're not a multithreaded program, I jump over the lock prefix — one single byte. And that actually performs better than executing a single atomic operation.

[Audience] You just proved my point.

Oh, that's my job.

[Audience] I'm wondering what you used to take the complexity of this topic and put it into such a comprehensible form... it makes my slides look like crap.

I'm probably the only presenter here with his own presentation software — I wrote it. Very good setup: it worked the first time, doesn't hang, doesn't disconnect. Of course, this time I checked ahead of time.