Hello. In this session, Alex will talk about the challenges when vectors meet virtualization. Okay, Alex, the stage is yours.

Okay, thank you very much. Well, I'll just start with a brief introduction. My name's Alex Bennée. I work for a company called Linaro, where I'm a member of the virtualization team. As a result I tend to work on projects like QEMU and KVM, and because Linaro is interested in the ARM ecosystem, ARM is basically what I do.

So let's quickly introduce QEMU. What is QEMU? According to its web page, it's a generic open source machine emulator and virtualizer. I think the Q originally stood for "quick", but we like to think of it as pretty fast now. QEMU supports two types of virtualization. The first is probably the most familiar to most of you, and that's when it's used to launch guests using the Linux Kernel Virtual Machine (KVM). This requires support from the hardware and uses hardware features to accelerate the virtualization. The second mode is less well known. It uses a just-in-time recompilation engine called the Tiny Code Generator, or TCG, and this allows you to fully emulate a system of any supported architecture on any other machine with a potentially different architecture.

So let's have a closer look at the two in detail. As I said, hardware-assisted virtualization is what most people are probably familiar with. This is the sort of high-performance virtualization that you see in the cloud, or if you're doing old-fashioned things like server consolidation. Here QEMU is responsible for launching the virtual machine, but it hands most of the work off to the KVM subsystem inside Linux, and most of the work that KVM does is supported by the hardware. The hypervisor only really gets involved when the guest accesses something that's virtualized and the hypervisor needs to do something about it. Most of this is dealt with by KVM itself, but ultimately we can always go all the way back to QEMU for QEMU to deal with the access. QEMU usually gets involved for older emulated hardware, or in various I/O situations. Just a quick note: the EL modes here are ARM terminology, standing for exception level. EL0 is user space, EL1 is kernel space, and on ARM we have a specific EL2 which is the hypervisor layer.

The second mode is full system emulation. This is the sort of thing you'd see if you're running the Android emulator. It's also quite heavily used in embedded development, because it's often easier to debug stuff on an emulator than on the real hardware. Another use case for full system emulation is bringing up new architectures; in fact, before RISC-V hardware came out, one of the fastest ways of running RISC-V was under a QEMU branch. In a TCG system the emulation looks quite different from KVM. Everything runs in a single user space process. QEMU allocates a block of memory to represent the guest system and then dynamically recompiles the guest code to emulate the guest system in user space. It's quite slow, because every guest instruction takes a good number of host instructions to emulate, and there's also a cost to emulating the memory management unit. This, in case you're wondering, is why the QEMU system binaries are often called qemu-softmmu. But as far as the guest is concerned, it's running on real hardware.

There is actually a third mode that we use specifically for emulating Linux user space binaries.
You tend to see this in situations where people are doing cross development, so cross tools, or if they want to run stuff in a root FS. Another use case is if you've got a legacy binary for an old architecture that you can't rebuild for whatever reason. The mechanics of the JIT are mostly the same, but it runs quite a bit faster because we're not emulating an MMU. It also only works for Linux binaries, because QEMU is not emulating the kernel itself: it takes a system call from the guest, munges it a bit so it can be passed directly on to the host on the guest's behalf.

So now I've talked about QEMU, let's talk a little bit about these vectors I'm discussing and what they're all about. A lot of the activities that we take for granted in modern computing have been enabled by vector processing. Things like video playback, audio processing and 3D modelling all involve large amounts of data processing, and most of that processing exhibits what's called data parallelism, which is where vector processing really shines.

Now here's a quick quiz question: can anyone name this machine? Very good. Yes, it's the Cray-1, one of the very early supercomputers of its time, built a few years after I was born. Even though it only clocked in at 80 megahertz it could do 250 million floating point operations a second, and this is in part due to its vector-based design. It had 8 vector registers, each capable of holding sixty-four 64-bit elements, and this allowed it to execute the same floating point operation across multiple elements in a vector register. I don't know if anyone saw Liam's "Circuit Less Traveled" talk yesterday, but as you can see, they did it first quite a long time ago.

While supercomputers were early users of vectors, eventually workstations started to use them as well. SPARC was quite early on with its VIS, or Visual Instruction Set, but most PC users will probably have come across vectors for the first time when Intel's MMX extensions were introduced. These are sometimes known as the multimedia extensions, because media processing was one of the very first workloads these extensions were designed to accelerate. The early workstation extensions only supported integer operations, but it didn't take too long before floating point became the norm; AMD's 3DNow! was, I think, the first to introduce it on the PC.

So let's have a closer look at what a vector register looks like. This is a wide register which contains multiple elements, usually referred to as lanes. In this example we have a 128-bit wide vector and it holds four 32-bit values. Now why is this useful? Let's look at an example vector operation: a vectorized add. With a single instruction we're adding the contents of the Vn and Vm registers together, but importantly each individual lane is processed separately. This is why these instructions are often referred to as SIMD instructions, which stands for single instruction, multiple data. By saving the cost of having to decode an instruction for each element, and by having multiple arithmetic units in your CPU, you get a simple parallelism in your processing.

If you look at the history of these instructions on the PC you can see there's been a steady growth as they've tried to catch up with their supercomputer counterparts. The first expansion from MMX doubled the size of vector registers to 128 bits. The current iteration, AVX-512, is, as will surprise no one, 512 bits wide.
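To make the lane-wise add from a moment ago concrete, here is a minimal C sketch (illustrative only, not from the talk): four independent 32-bit additions with no dependencies between lanes, exactly the shape of loop a vectorizing compiler can collapse into a single 128-bit SIMD add.

    #include <stdint.h>

    /* Illustrative sketch of the data parallelism a single 128-bit SIMD
     * add exploits.  Each iteration is an independent lane, so a
     * vectorizing compiler can replace the whole loop with one vector
     * add instruction. */
    void add_lanes(const int32_t *a, const int32_t *b, int32_t *out)
    {
        for (int i = 0; i < 4; i++) {
            out[i] = a[i] + b[i];   /* lane i is computed independently */
        }
    }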
So as you can see, with a 512-bit register you can do 8 double-precision operations at a time, or 16 single-precision operations, all the way down to 64 byte-wide operations.

Now ARM have taken this to its logical conclusion with a thing called the Scalable Vector Extension, SVE. This introduces vectors which can be up to 2048 bits in length, and also introduces some novel instructions that allow you to utilize these vectors without having to hard-code assumptions about the number of lanes available to you. The idea is you can write code that will run on your phone, which might have a small vector length, and without having to recompile it also run it on a supercomputer with a wide vector length and just get an automatic performance boost.

Let's have a look at an example. This is a classic C string copy function. I guess quite a number of you could probably come up with a simple assembler version that loops through, reading a byte at a time and storing it at the destination. But how would we do this if we wanted to operate on an unknown number of bytes at a time in a vector register? So prepare yourself: here's some assembly. You don't have to take this all in at once; I'm going to go through it bit by bit. But that is the complete string copy.

One of the keys to understanding how SVE works is the idea of the predicate register. A number of operations either set a predicate register or use one to control the vector operations themselves. If we look at our vectorized add from earlier, it's now being controlled by this predicate register P. In this example the predicate register only allows two of the lanes to be calculated, leaving the other values completely untouched.

So let's look again at the assembler. These first two instructions are setting things up for the copy. The first instruction, ptrue p2, sets the predicate register P2 to all true. We don't actually know how wide the register is; it just says use all the available lanes whenever we use this predicate register. The next step is to load as much of our source string as possible into the Z register. But you can see there are actually two additional instructions around this, the setffr and the rdffr, and these refer to a thing called the first-fault register. Let's just have a quick look at that.

The first-fault register solves one of the problems you have when dealing with large chunks of data: running over things like page boundaries. It might be that our string finishes just before a page boundary, but if we're reading things in a big chunk at a time we'll run over the page boundary and generate a fault. This is obviously going to suck for performance. So what the first-fault register allows us to do is speculatively load as much data as possible, and the first-fault register reports how far we got afterwards. So here we go: the setffr sets up the first-fault register, then we load as many bytes as we can based on our predicate register. Finally we read the first-fault register into P0. P0 is now set to the number of bytes we actually read, which is either the full vector length or, if it faulted, truncated to the end of the page.

The next thing we need to do is test for a null termination. Again we're doing this across the whole vector register at once, and this compare instruction basically sets the predicate register for every byte that has a terminating zero in it.
The next instruction, the brka, simply sets another predicate register, P0, up to the point of the first zero, because we don't want to copy anything after the termination of the string. Finally we need to store the result back. So now we're using the P0 predicate register, which covers exactly the number of bytes we need, and we store that back to our destination. And finally we need to know how many bytes we copied, so we increment X2 with the number of bytes that were involved in the operation.

All of this code has no actual knowledge of the size of the vector, but it still copies as many bytes as possible at once. Now if you want to compare this to some other code, you can go and dive into the glibc string copy functions, in a directory called sysdeps, and you'll see that with some of the other vectorized implementations you have to go to extraordinary lengths to make sure things are aligned so you don't run over an edge, and so on.

So before we move on to the next bit, let's just recap. We've talked about virtualization; we've got many flavours, software based and hardware based. Vectors have been around a long time, but their usage is growing, especially given all the data-intensive processing we have to do, and their key feature is their length, which makes them very useful when the task exhibits data parallelism.

So let's talk about the challenges that vectors present in pure software virtualization, when we're using QEMU's Tiny Code Generator. Now we have a problem. QEMU aims to be a flexible system, so it supports a large number of guest architectures. There are currently 20, and we're going to be at 21 soon when the RISC-V work gets merged. The back end also supports most of the popular architectures. So we have to have a system that takes full advantage of the architecture we're running on, but without being hard-coded for any particular X-to-Y translation, and this means we have to be more flexible than a lot of single-purpose translators.

Now why do we do code generation? Well, interpreting instructions is going to be slow, and all processors have common functionality: they all do logic, arithmetic, flow control. We should take advantage of the features of the host we're running on. You can think of code generation as simply a compiler, but instead of working with source code we're working with the machine code of our guest. The process is fairly simple: on demand we take a block of machine code from the target, we convert it into an intermediate form we call TCG ops, and from that the final JITted code is generated.

So let's have a look at an example. This is a fragment from a little benchmarking utility that I wrote to test out vectorizable kernels. It's a very simple one: it simply goes through an array of floating point numbers and multiplies them together. Let's look at the assembly quickly, step by step. We load the two values from our two pointers. We do a multiplication; you'll see here that the guest code we're running is actually vectorized, so it's doing four multiplies at a time. And then finally we save the result. Oh, sorry, and then we do our loop.

So let's have a look at how this is broken down into TCG operations. The first instruction, and this is just the first instruction, ldr q0, [x21, x0], loads 128 bits from the address pointed at by X21, indexed by the register X0. So first of all, we need to get the address of the load by adding X0 to X21.
So that's three operations. The next thing we need to do is the load from memory into one of the temporary registers, which will end up being a host register. As you can see, we're actually doing two loads, because we're doing two 64-bit loads, so we also need to calculate the offset for the second part of the load. And then finally, we store the results into the register file, which is QEMU's internal representation of the guest CPU state. Well, that's quite a lot of TCG ops just for our first instruction.

Let's have a quick look at the calculation. I'm not going to go through this in too much detail, but the key thing to take away is that instead of actually generating code, we're calling a helper function. We load two values from registers, which is what these mov ops are, call a helper function, and then we store the result. But this is only doing one 32-bit operation, so we actually need to repeat that code another three times. This isn't too bad if we're doing four multiplies, but it soon adds up if we're doing 16 multiplies on an AVX-512 register, or 64 potential multiplies if we've got a full-width SVE register.

And why are we going through all this marshalling? Well, TCG only really understands two things: 32-bit registers and 64-bit registers. The other two types are just aliases, depending on the size of your guest and the size of your host. So clearly it's time for TCG to move with the times and have first-class support for these vectors. But there are a couple of problems we need to get over first. First of all, there is an intrinsic link between TCG types and TCG operations. So this addi, add immediate, has a 32-bit version and a 64-bit version. With that in mind, how are we going to define our vector types?

One naive approach is to just introduce a TCG type for each vector size we've got. So we could add 128-bit and 256-bit and 512-bit, and it's already getting quite long. But that's not actually enough, because we're doing operations on smaller chunks of those vectors. For example, we might be operating on two 64-bit elements at a time, or four 32-bit elements at a time, so maybe we need a type to represent each way of slicing up the vector. But even that's not enough, because most vector operations can also run on sizes smaller than 32 bits; one thing that's been introduced recently is half-precision calculations on 16-bit values. So you end up with an exploding plethora of TCG types, and that's a problem, because for each TCG type we introduce, we're introducing more TCG operations. We need to do something a little bit smarter.

So we introduced a thing called TCGv_vec. This is a special type that represents a vector. We knew from the start that we needed to support multiple vector sizes; as I've pointed out, there's been a steady growth in the size of vectors, and it's not a trend that's likely to stop. Even SVE, with its 2048-bit vector support, leaves plenty of space in the architecture for even bigger ones. So we need to do this in a way that doesn't explode the TCG ops space. Secondly, helpers are still going to dominate things like floating point, and are likely to do so for the foreseeable future, so the interface presented to our helper functions should be as efficient as possible; we don't want to end up doing lots of marshalling back and forth between registers.
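As a rough illustration of the per-lane marshalling just described (hypothetical names and a simplified register file, not QEMU's actual helper code), the old-style generated code effectively expands one guest four-lane multiply into something like this:

    #include <stdint.h>

    /* Hypothetical stand-in for the guest register file QEMU keeps in
     * host memory; the real per-CPU state structure is far more involved. */
    typedef struct {
        float vregs[32][4];     /* vector registers as four 32-bit lanes */
    } GuestCPU;

    /* Illustrative per-lane helper: real FP helpers also track exception
     * flags and rounding modes, which is why they stay out of line. */
    static float helper_fmul32(float a, float b)
    {
        return a * b;
    }

    /* What the pre-vector TCG effectively expands a single guest
     * "fmul v0.4s, v1.4s, v2.4s" into: marshal each lane out of the
     * register file, call the helper, store the lane back, four times. */
    void emulate_fmul_4s(GuestCPU *cpu, int vd, int vn, int vm)
    {
        for (int lane = 0; lane < 4; lane++) {
            cpu->vregs[vd][lane] =
                helper_fmul32(cpu->vregs[vn][lane], cpu->vregs[vm][lane]);
        }
    }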
So we avoid the marshalling by passing pointers directly into our CPU environment. This actually helps the helpers as well: it means the compiler can vectorize the helper function, and that gets a bit of performance back. And finally, there are enough operations that can still be dealt with in generated code that we need to maintain the ability to do that, and we need to ensure there's enough information in each TCG operation that the back end can make the most efficient use of the host processor it's running on.

So let's just have a quick look at the sort of TCG IR we get with a TCGv_vec. Here we have the guest instruction, a 128-bit XOR. It's working byte-wise, but that doesn't really matter, because an XOR is an XOR; I don't care about each bit. But now you can see we've got quite a nice, compact representation of that instruction. We still have to deal with loading the value from the register and storing it back, but the generated code is now a lot more compact. In fact, because we're running on an x86 that's got SSE, we can take full advantage of the SSE registers and do the operation 128 bits at a time. So we've got a much closer mapping between the guest vector operation and the host vector operation. This gives us better code generation, and it also gives us more efficient helpers for the times we need them.

So let's have a look at the blazing performance that we now have. Well, that's a little disappointing. In most of the test cases here, the TCGv_vec code is running a lot slower than our existing TCG code that does all the marshalling. There is one case, though, this byte-wise bitfiddle, that's running a bit faster. Let's have a quick look at what this function is doing. The main difference is that the bitfiddle does more logic operations per loop. In fact, this is exactly the reason Seymour Cray introduced vectors back in the 70s: his observation was that although the process of loading stuff into a vector register was pretty expensive, when you chain operations together, staying in the registers, you get that performance back. It might be clearer if we look at the guest assembly. Don't worry about understanding all of it; the key thing to see is that it's doing a lot more calculations in each loop, so there are more operations in each translated block. This makes sense, because the target workloads of SIMD instructions are large streams of data, and as a result the CPU designers have sacrificed latency in favour of throughput. Executing a single SIMD instruction per loop is really the worst case you can possibly have.

So is there a way we can test this? I should point out now that I'm a Gentoo user, so let's recompile all my benchmarks with -funroll-loops. For anyone not familiar with this compiler optimization, it basically tells the compiler to unroll its loops as much as it possibly can, use as many registers as it can, and do as many things at once as possible. And now we can see our performance is getting a lot better. Although the existing QEMU code has improved as well with this unrolled code, two of our test cases now beat it using the TCGv_vec implementation. So we're getting closer, but there are a few more things we need to look at. There's some further work to do on the handling of loads and stores, that is, the act of loading stuff into registers from memory and storing it back out, which is currently not being converted to TCGv_vec.
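To contrast with the per-lane sketch from earlier, the pointer-based helper style described above looks roughly like this (hypothetical names, simplified from the real generic vector helpers): the generated code hands the helper pointers straight into the guest register file plus a size, and the single tight loop is something the host compiler can auto-vectorize.

    #include <stdint.h>

    /* Sketch of a whole-vector helper: no per-lane marshalling through
     * 32/64-bit temporaries, just pointers into the register file and a
     * byte count.  The host compiler is free to vectorize this loop with
     * SSE/AVX/NEON, which is where helpers get some performance back. */
    void helper_vec_xor(void *vd, const void *vn, const void *vm,
                        uint32_t bytes)
    {
        uint64_t *d = vd;
        const uint64_t *n = vn, *m = vm;

        for (uint32_t i = 0; i < bytes / 8; i++) {
            d[i] = n[i] ^ m[i];     /* bit-wise, so lane width is irrelevant */
        }
    }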
The other thing we need to look at is register liveness. When you're doing a compilation, you try to keep values in registers as long as possible. Although QEMU has to make sure it stores values out to memory after each operation, so that we're always correct, we could do a better job of reusing values in registers for follow-on operations rather than loading them all back again.

Right, now we've looked at TCG, it's time to look at what we can do with hardware. Surely hardware makes things a lot simpler, right? It's doing all the hard work for us. I should note that although I'm talking about KVM here, the same sort of considerations also apply to other hardware-assisted options, things like Xen, HAXM, and HVF on macOS.

So let's have a look at the typical architecture of a virtualized machine. Here we have a single execution unit, the CPU. The hypervisor is a software layer, and its job is to make sure the shared resource of the CPU is shared out between the host and the guest, and it can do that in a number of ways. As I say, it's a shared execution environment, but for the virtualized guest there are two things we can do. We can trap and emulate: any time the guest accesses a virtual resource, you trap into the hypervisor, the hypervisor emulates the behaviour and returns to the guest. Now this is fine if all you're doing is updating a TLB entry; it's a bit expensive, but you're soon going to get back to running code. But if you have to trap and emulate every time you access an SVE register, your performance is going to fall through the floor. The other option is to do a context switch: we copy out the state of the guest when we return back to the host, and then copy it back in at the end.

Now this is nothing new; it's something your host kernel does all the time. The kernel has to switch between the different applications that are running, and as it does that it saves their state, switches to another task's state, and eventually switches back. The key thing here is that occupancy is actually quite important: the kernel tries to keep busy tasks on the same CPU for as long as possible, and if possible schedules other tasks onto other CPU resources.

Let's just have a quick look at the size of these contexts, though. ARM, for example, has 32 64-bit general-purpose registers, so you need to save all that state into memory before you switch: 32 times 8 bytes, or 256 bytes of data. That's quite a lot, but it's not too bad. The SVE registers, however, can be a lot bigger; with 32 registers of up to 2048 bits each, that's 8K's worth of data. So if we suddenly start copying the whole of the SVE state every time we transition from user space to the kernel, or from the kernel to the hypervisor, things are going to get slow pretty quickly.

But do we have to save this context every time? Certainly, if you're going to switch to another user space process, it's highly likely it's going to access the vector registers. Even if the application itself hasn't been compiled as a dedicated vectorized workload, the chances are it's calling library functions that are likely to use the vector registers; things like strcpy, memcpy and so on will all have accelerated versions, depending on what your library detects. And there's also the kernel. Now, it used to be said that the kernel doesn't do floating point. That's not actually strictly true.
When it comes to SIMD registers, usually there's an aliasing between the floating point registers and the SIMD registers, and kernels certainly take advantage of SIMD registers when they can; things like in-kernel crypto or RAID checksumming all use these registers. Hypervisors, less so. Hypervisors tend to be written to be very small and tight and minimal in what they do, so you're okay there.

So how can we make this a little bit faster? Well, the first thing we need to do is detect usage. What we do here is disable access to the SIMD and FPU when we enter the guest; well, not every time, but the first time. Then, the first time the guest accesses a SIMD resource, that causes a trap to the kernel, and at that point the kernel can swap the context. It then re-enables the SIMD/FPU, returns to the trapped instruction, and the user space process gets on with its life. We need to do a little bit of bookkeeping for this, though. We need to keep track of whose SIMD state we have on each CPU, and we also need a per-task area where we can save the SIMD state when it's not being actively used. And we also introduce an additional flag, simply a state flag, so the kernel knows, when it switches to user space, that it needs to enable trapping so it can pick up whether the SIMD registers are accessed.

So that's what we do in a host kernel. Luckily, for a VM this is going to be pretty much the same thing. The hypervisor has a slightly different mechanism for trapping its guest's accesses, and of course the kernel inside the virtual machine will also be tracking its own user space usage. But the ideal case is that if a VM has to exit for something unrelated, then as long as the host kernel doesn't schedule anything else onto the CPU and the VM gets back to running pretty quickly, we don't need to save and restore our SVE state at all.

Now, on the ARM-specific side of enabling SVE: the kernel already has support for SVE as of 4.15, and this extended the deferred save/restore system which was already being used for NEON and Advanced SIMD, which are sort of like earlier iterations of the SIMD instruction set. But currently in KVM you can't have guests that access SVE registers. There's work underway to do this, and it will almost certainly be ready before you actually get your hands on any hardware with SVE. So I'm expecting most open source developers will get their first experience of using these SVE registers on QEMU.

So with that, let me just summarize. Vectors are great; they really help us with heavy data-processing workloads. But vectors are large, and this means we have to add special handling in kernels, in hypervisors and in emulators. And with that, that's the end of my talk. I shall open the floor to any questions.

So the question was: can we use the same technique to trap access to vector registers on x86? You're asking the wrong person, because I'm not very familiar with x86. x86 does have some code to deal with this sort of thing, so it actively tracks what the kernel and user space want to do. I think it always swaps context if it switches to a new task, but if it then wants to do something in the kernel, for some crypto-related function say, it will save the state at that point, trash over it in the kernel, and then restore it on return. But for the details you'll have to ask someone who understands x86, sorry. Yeah.
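Going back to the lazy SIMD switching described before the questions, a very rough C sketch of the bookkeeping might look like the following. All names and structures here are hypothetical illustrations, not the actual arm64 kernel code, and the privileged operations are stubbed out.

    #define FPSIMD_STATE_SIZE 512   /* illustrative size of the save area */

    struct task {
        unsigned char fpsimd_state[FPSIMD_STATE_SIZE]; /* per-task save area */
    };

    /* Per-CPU bookkeeping: whose SIMD state currently lives in the
     * hardware registers (zero/NULL if nobody's). */
    static struct task *fpsimd_owner;

    /* Stubs standing in for the privileged trap-control and register
     * save/restore operations a real kernel performs. */
    static void disable_fpsimd_access(void) { /* set the trap-on-use bit */ }
    static void enable_fpsimd_access(void)  { /* clear the trap-on-use bit */ }
    static void save_fpsimd(unsigned char *buf)          { (void)buf; }
    static void restore_fpsimd(const unsigned char *buf) { (void)buf; }

    /* On a context switch we do not touch the vector registers at all; we
     * just arrange for the next task's first SIMD/FP instruction to trap. */
    void fpsimd_switch_to(struct task *next)
    {
        (void)next;
        disable_fpsimd_access();
    }

    /* Runs the first time the current task actually touches a SIMD/FP
     * register.  Only now do we pay for a save/restore, and only if the
     * hardware registers hold somebody else's state. */
    void fpsimd_trap_handler(struct task *current)
    {
        if (fpsimd_owner != current) {
            if (fpsimd_owner) {
                save_fpsimd(fpsimd_owner->fpsimd_state);
            }
            restore_fpsimd(current->fpsimd_state);
            fpsimd_owner = current;
        }
        enable_fpsimd_access();
        /* return to the trapped instruction, which now executes normally */
    }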
Yeah, so let me just see if I can find the example. Sorry, I'm looking at a very small version of this. Right, there we go. Next. Right, yes. So the key is that the TCG vector operation explicitly has an encoding of the size: that value at the end is actually a log2 of the element size, so we can pass the size as part of the TCG operation. But we do still end up having to have, well, it doesn't matter for things like XOR, but you still have to have a TCG op for each lane size that you're dealing with, so you end up with, for example, a vector multiply 32 and a vector multiply 64 for the sizes you support. Any more? Well, thank you very much.