If you don't understand some of the details of this, don't worry; it's really just to open up what's going on under the covers in the compiler, and how you can use it to optimise a specific Java method by generating a handcrafted compiler graph that bottoms out to some super-efficient assembler. So I'll motivate this by talking about the actual use case, which is using NVRAM. I'll talk about the equivalent library that Intel has developed for doing that from C, and show how we need to provide a Java equivalent. I'll show you what changes I would propose as a first attempt to do that from Java, with all the functionality of the C library implemented efficiently in Java, but without the compiler support that really gives it all the extra performance. And then I'll show you how I built the intrinsic into the compiler, so that you can get very, very simple instructions inlined into the generated code, and inlined through into the client code that uses this API, just to show how you can really tweak the maximum performance out of things with the compiler. So to start off, let's talk about NVRAM. This is basically a memory device you plug into your machine. It's just like memory, except that when you write to it on one program run, if you bring the machine back up and map it back into your virtual address space, you'll find all the data you wrote last time is still there; it's got persistence. You can think of it as an extension in the middle of the memory tier, between archival storage and the volatile running program memory.
You can have some sort of storage that will persist across runs. Your choice at the moment is a spinning-platter disk or a flash-based disk. Well, NVRAM is memory that acts like memory, that you can write to like memory, but it's still there next time you come back up again. It's presented as a device, so the memory is managed as files of data, blocks of memory in the device, and you map those files into your address space. At least that's the model I'm looking at for how you use it. Of course, once you've mapped it into your address space and you write through the virtual address, things are flushed back into the cache system and are available to other threads in the same program; it's also mapped back through into the device memory, and it will stay there. But that's not automatic, and there's a difference between the synchronisation that's used to synchronise things between caches and the operations needed to flush things back into the actual memory, and that's where the interesting thing comes in. The application really needs to be involved in making sure not just that things are synchronised between threads, but that data is also, with another level of synchronisation, synchronised back into the physical memory. I'm going to show a simple example of how that comes in and why the application has to be involved in that. So imagine we've got a block of NVRAM; say you had half a gigabyte of memory that you're using for a rolling transaction log. Here we've got it divided up into words, and there are counters and indexes, cursors, that are used to keep track of the live segment of the log. So the first two words in the log identify the start of the live section and the first non-live word at the end of the live section, 4 and 14. And there are two records in there.
One has size 6 in its header word, so that's six words in all; and there's another with size 4 in its header word, so there are three data words for that one. There are actually three different tags you use in the header word when using this as a log. When you allocate a log record for a transaction, you haven't yet got to the point where the transaction is committed, so it's an uncommitted record. If you crash, the data in there is rubbish; you throw it away, because there's nothing waiting to roll forward. That transaction is automatically rolled back. If you have a committed record, as we have here in the first entry, that means the transaction is committed: the coordinator of the transaction has written this and made sure it's out on disk, because at that point, if we crash, we have all the information to make sure all the other participants in the transaction roll forward. So we've got the safety guarantee. Once we've got the commit in the coordination process, we can make sure everything actually happens consistently; everything is replayable. There's a free record at the end there. Obviously that transaction started later, but at some point you need to free up records in the log, and eventually they get reclaimed and you can roll the log round and keep it continuously running. So those are the three different tags you'd have in the header word. Now imagine that somebody wants to allocate a new record. Here's an example of why the write-back ordering is really important. The first thing that's done when you want to do an allocate is to find the end of the log. Well, actually, the first thing you have to do is take a lock, because there may be many threads trying to allocate transaction records, so they have to synchronise on the shared state. That's an in-memory volatile lock, but it's a synchronisation of some sort. Having got the lock, you're then able first of all to write the record header with its length; it's uncommitted by default.
Then you write the cursor, so that the cursor is now beyond the end of that record; it's now in the log. At that point you can release the lock, and then the thread that's managing that transaction can go and write the data independently, because it owns that bit of space; nothing else is going to overwrite it. You've got consistent state for the log, except it's really important that the write of the record information and the write of the end cursor are done in the right order, and that they're flushed back to physical memory between the two writes. So what you need to do is write the header there, make sure it's actually flushed back into physical memory, and only then update the cursor. Now imagine if you wrote them both and then flushed afterwards. It might be that only the cursor line gets evicted, so you've got a cursor of 18, but the log doesn't actually have any new record data flushed back into memory, and you crash. When you come back up, there might be an old transaction record in there which looks committed but holds invalid data, so you don't have a consistent log. Worse, there might be a record with the wrong length, and the end cursor and the point you reach when you skip through the records wouldn't add up. Your data is actually inconsistent, not just out of date. So it's important that you write, and commit back to physical memory, the record length before you update the cursor. And the application has to do that. It's the application's use of this memory that determines how that's done; it's not something that the memory system or the write-back machinery can do for you, because it's about the semantics of the application. Here's another example, a transaction log commit. At this point there's no need to hold a lock. There's a thread managing a transaction, and it changes the record state from uncommitted to committed just by writing that field and changing the tag in it.
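The allocate and commit orderings described here can be sketched in miniature. This is a simulation, not the real transaction code: the log lives in a plain long[], the header encoding and the flushWord method are invented for illustration, and a real implementation would flush actual cache lines rather than record events in a list.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the rolling transaction log described above.
// Word 0 = start cursor, word 1 = end cursor; records follow as
// [header][data...], where the header packs a size and a tag.
class TxLog {
    static final long UNCOMMITTED = 0, COMMITTED = 1, FREE = 2;

    final long[] words;                              // stand-in for the mapped NVRAM
    final List<String> flushes = new ArrayList<>();  // records the flush ordering

    TxLog(int size) { words = new long[size]; words[0] = 2; words[1] = 2; }

    static long header(long size, long tag) { return (size << 2) | tag; }

    // Simulates a per-cache-line write-back; hypothetical, for illustration.
    void flushWord(int index) { flushes.add("flush@" + index); }

    // Allocate: write the header, flush it, THEN advance the end cursor
    // and flush that. Reversing the order risks an inconsistent log.
    synchronized int allocate(int dataWords) {
        int rec = (int) words[1];
        words[rec] = header(1 + dataWords, UNCOMMITTED);
        flushWord(rec);                  // header must be persistent first
        words[1] = rec + 1 + dataWords;  // only now expose the record
        flushWord(1);
        return rec;
    }

    // Commit: the record's data must be flushed before the tag flips.
    void commit(int rec, long... data) {
        for (int i = 0; i < data.length; i++) words[rec + 1 + i] = data[i];
        for (int i = 0; i < data.length; i++) flushWord(rec + 1 + i);
        words[rec] = header(1 + data.length, COMMITTED);
        flushWord(rec);                  // tag change is persisted last
    }
}
```

The flushes list makes the invariant checkable: after a crash, any prefix of the flush sequence leaves a log you can recover from.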
But before the thread changes that tag, it's going to write some transaction data for that record. Now again, that data has to be committed, written back to physical memory, before the header of the record is changed to reflect the commit. If you didn't do it in the right order, you might find that the tag gets changed and written back, but then you crash and the data hasn't actually been written. Not that the program necessarily writes the tag back explicitly: it could be written back because of evictions going on for the cache line that happens to hold that record header. So it's really important, again, that it happens in the right order; otherwise you might get a situation where you've got a committed record alongside data associated with some other transaction, or even inconsistent data. So the application has to have the ability to manage not just making things visible in the cache system, but making sure that the changes in the cache system have actually been persisted, and persisted in a specific order. So let's talk about libpmem. This is Intel's C library that gives you an API for doing exactly this. It allows you to allocate data, map it into memory, update it, and have it mapped back to physical memory. It's been implemented for Linux and Windows on x86_64. It actually exists for Linux ARM64, except there isn't actually physical hardware for ARM64 yet, but that should be available very soon. So the work had to be done on both, and we've simulated it to test it on ARM64. So what you do is you use fopen to open a file from the memory device and allocate a block of memory as storage for that file. You'd use the mmap call to map it into memory, which is the old way of doing it. There's actually an extension to mmap that allows you to say this memory is backed by a synchronous device, and the payoff there is that you can write it back more efficiently.
So the problem if we just use plain mmap, which is what I've called plan A, is that when you come to write back to physical memory you've got to do it via a file descriptor flush. Now normally, when you do a flush on a file, you know which blocks are dirty; the dirty blocks get written back and the other blocks don't need to be touched. So that's fine, but when you're writing mapped memory as memory with this model, there's no way of knowing from software which cache lines are dirty. So if you were to do it via a file descriptor, the driver associated with this non-volatile RAM would have to write back all of the cache lines. It's very inefficient: every time you want to flush some little bit of memory, it's going to have to write back every single cache line in the mapped range. So it's really not an option. The alternative is that if you map it with a sync mapping, so the operating system knows about it, you can then use a cache line write-back operation to make sure the memory is out on the device. There's a hardware instruction on ARM, well, three on Intel, that allows you to do that. So that's the option the library provides: it gives you functions you can call to do that, as long as you use the map sync mode. So that sounds great; why don't we do something like that in Java? That's what we set out to prototype, and I worked with our transactions team to do it. The problem is it doesn't really work well in Java. You can open a device file using a file open, that's fine. You can map things into memory, except that if you use the file channel map method to do that, it doesn't know about the map sync flag; that's not built into the underlying implementation in the JDK at the moment. So there's no plan B option; you're going to have to use the force that goes via the file descriptor. You can write directly to the memory using the mapped byte buffer API.
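A sketch of that plan-A code against the standard API, with an ordinary temp file standing in for a file on the NVRAM device (the path and sizes are made up for illustration). The point is the last line: force() has no idea which cache lines are dirty, so on a real persistent-memory device this flush is the expensive part.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PlanA {
    // Maps the file, writes one long at offset 0, and forces it back.
    static void writeAndForce(Path path, long value) throws Exception {
        try (FileChannel fc = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // READ_WRITE mapping grows the file to the mapped size.
            MappedByteBuffer buf =
                fc.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putLong(0, value);   // write straight into the mapping
            buf.force();             // flush; cannot target dirty lines only
        }
    }
}
```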
But the problem is that when you come to do a force, it's going to have to be that really inefficient write-back, and we found it was very, very slow. So there's a bit of code to show how you do that. You get a path to a file on the device. You create a file under that name. You map it in. You do your buffer put, and when you want to make sure that thing is out in memory, it's going to be really slow every time. So it's not really an option to do it that way: you've got all the functionality, but you haven't got the performance. That's the one thing missing in the standard way of doing it. Okay, so what about using JNI? Well, unfortunately, that's not going to work via a byte buffer either. If you think of what you'd actually have to do, you'd have to call out to JNI to create a mapping and then pass the information about where the mapping is back through in some sort of handle. Then all your writes would have to go out via JNI calls, a call for every write of a block of memory, and all of your flush operations would be JNI calls too. This is going to be incredibly expensive, so you really can't do it via JNI. Not only do you not get the performance, it's actually worse: you don't even get the easy buffer views. So it's really not an option to do it that way. What you really want is to build this into Java. Okay, so one way to build it into Java is to take those existing APIs and extend them in some way, so they know about this stuff and can do things using a more efficient model, in the way that the pmem C library does. So we take that program from the start. We need to tell it, when we do the map, that we want a read-write sync map, not just a read-write map: we want to use synchronous flushes for this. And when it comes to the force, we really want the force API to say, actually, it's this range of the data I want forced.
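The extended API described here did eventually ship, as JEP 352 in Java 14, so the client side can be sketched against the real JDK: MappedByteBuffer gained a ranged force(index, length) (available since JDK 13), and the jdk.nio.mapmode module provides ExtendedMapMode.READ_WRITE_SYNC for mappings on a DAX-mounted NVRAM device. The sketch below uses a regular temp file and the ordinary READ_WRITE mode so it runs anywhere; the NVRAM variant is shown in a comment.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class RangedForce {
    static void writeRecord(Path path) throws Exception {
        try (FileChannel fc = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // On real NVRAM (DAX filesystem) you'd map with:
            //   fc.map(jdk.nio.mapmode.ExtendedMapMode.READ_WRITE_SYNC, 0, 4096)
            MappedByteBuffer buf =
                fc.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putLong(64, 42L);   // write the record header...
            buf.force(64, 8);       // ...flush just those bytes (JDK 13+)
            buf.putLong(0, 72L);    // only then advance the cursor...
            buf.force(0, 8);        // ...and flush that too
        }
    }
}
```

Note how the ranged force lets the application express exactly the header-before-cursor ordering the log example needs.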
Because we haven't got dirty marks for the dirty pages any more, we need to say which particular bytes we want written back, and we want the JVM to do that in an efficient way. So we'd implement that by changing the client APIs there, and also re-implementing the mapping and the byte buffer to do the relevant operations. And that's relatively straightforward. We add a couple of extra modes for the map mode that the file channel uses when it maps stuff in. The internal implementation of the map method keeps track of whether we've got a synchronous mapping or a normal mapping, and when it comes to do the mmap further down inside that map call, we change it so that it passes that through in the native call that actually does the operating system call. We also tell the buffer whether it's synchronous or not. So we've got a couple of changes to propagate through. In the actual native call that does the mapping, if the sync map flag was passed in, we add in the relevant operating system parameters. There's a bit more error checking, because the kernel might not support those flags, or we might find that you've used the wrong device and it's not sync-mappable. But basically we can get ourselves a sync map of a bit of persistent memory if we want, a bit of NVRAM. Now, we also need to change the mapped byte buffer class. It needs to know whether it's working in sync mode or not, and we need to change its constructors so that by default we work non-sync, but if the sync flag is passed through, we get a synchronous buffer. And that means the force API has to change. The old force API used to take the start of the mapped range and the length of the mapped range and just say, do an fd flush on all of this, please, using a native call, force0, which does the relevant file descriptor operation.
So if we change that, we can add a new method which takes a start offset and a length. We can call it from the old force method to say, do everything; but in the new method, we can call something which is going to do something a bit smarter. I've added a method on Unsafe that is going to evict cache lines one by one. This is going to use native code to do it, so it's not the most efficient method, but it'll be a lot better than doing a force: it's only going to evict the cache lines that are affected. So what does the Unsafe method look like? Well, it's a nice simple method. It checks to make sure the address and length make sense. Then there are two things we need to worry about as far as cache-level synchronisation is concerned. Before we flush the cache lines, depending on the hardware, we may need to be sure that all previous writes to those cache lines are visible in the cache. Now, that turns out to be a no-op on both x86 and ARM64, but I left it in as part of the model. Inside a loop, we round the address down to the start of a cache line, and iterate through the range a line at a time, calling a write-back method which is going to flush that cache line back to memory. Possibly evict it: on Intel some of the instructions evict, some don't; on ARM it just flushes the line back to memory. And then afterwards, we may need to make sure all those cache line evictions or flushes have happened before we allow other memory operations to happen, so we may need another memory barrier operation to make sure the cache is consistent before we carry on. The implementation of that is three native methods. We also need to know what the size of a cache line is; that's hardware-specific as well. So we have another native method which you use to retrieve the size of a cache line. We use that to set up the size, and also the mask which you use for rounding down the address.
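The line-at-a-time loop just described can be sketched like this. The alignment arithmetic and the pre/post fences follow the description above; the three native helpers are replaced by a recorder so the sketch is runnable, and the method and field names here are illustrative, not the real Unsafe signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the unsafe write-back method's cache-line loop. The real
// method calls three native (eventually intrinsified) helpers; here
// they just record what they would do so the logic can be checked.
class WritebackSketch {
    final int lineSize;        // queried from the hardware in the real code
    final long lineMask;       // used to round an address down to its line
    final List<Long> lines = new ArrayList<>();
    boolean preSynced, postSynced;

    WritebackSketch(int lineSize) {
        this.lineSize = lineSize;
        this.lineMask = ~((long) lineSize - 1);
    }

    void writebackMemory(long address, long length) {
        if (length < 0) throw new IllegalArgumentException();
        preSync();                          // no-op on current x86 and ARM64
        long line = address & lineMask;     // round down to the line start
        long end = address + length;
        while (line < end) {
            writeback0(line);               // flush/evict one cache line
            line += lineSize;
        }
        postSync();                         // order later memory operations
    }

    void preSync()  { preSynced = true; }
    void postSync() { postSynced = true; }
    void writeback0(long line) { lines.add(line); }
}
```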
So there's the actual Java implementation underneath that. The problem is we want to do something that's hardware-specific: we want to write back a cache line, or do a memory synchronisation. So we can't just write some C code to do it. What we actually have to do is JIT some code using the stub generator, which generates lots of little stub routines, and we'll have a stub which does a cache write-back. So we can call that out of the stub routine code. It takes an address argument, which comes in as a long and gets cast to a void star, and we use that bit of jitted code to execute a write-back instruction. We'll have an implementation of this for x86 and one for ARM64. There's another helper routine which does a synchronisation operation; it takes a boolean argument, true or false, to say whether it's a pre-sync or a post-sync. So there's another jitted routine for that. And that jitted code gets called from the implementation of the unsafe methods: we call out to those helpers to do the two different write-back operations. Here's the jitting process. This is something in the stub generator class. It's got a buffer which it's writing into. It gets the start pc, that's the function entry, and builds a stack frame. Then there's the ARM instruction that cleans a single cache line out to memory, and it takes as input r0, which is what we passed in: the line address. So it'll take that address and make sure that line is now out in memory. Then we tear down the stack frame with a leave and return. So it's a nice simple little jitted function, and there's an equivalent jitted function for the synchronisation. Now on ARM, we don't need a synchronisation before doing the flushes, because the flush instructions automatically ensure that we've drained all memory writes and the cache is consistent before they start. So if we come in with true in arg0, we have a compare-and-branch-if-non-zero.
So we'll jump to where that label, skip, is bound below, and we'll bypass the memory barrier. Otherwise, if it's false, we've got a post-synchronisation and we need a full memory barrier, and that makes sure the cache lines are actually evicted to memory before any more writes can happen. OK, so how do we now make this really efficient? Well, what we need is plan C. Basically, we need to get the compiler to recognise these methods, and instead of calling out to native code, which is really expensive, we want to get the compiler to generate the machine code inline in the compiled method, and inline that into caller methods. So there's a basic simple plan for it. We need to say which methods we want inlining: that's those native methods. We need to tell the compiler how to recognise them, and tell it they really are intrinsic methods. We need to say, yes, we want this intrinsic translated, and these are the functions to use for it. And then we need to, for each of those methods, build a little function graph that represents a function which does a cache write-back or a synchronisation. So we basically build a compiler graph, a high-level graph, and then we add some rules to translate that to machine code. So this is the bit that I hope you'll find interesting. Don't worry if you don't understand it; it just shows you where to look in the code, and you can go and study it afterwards if you want. It's very easy to say this is a method that we want the compiler to recognise: you just put an annotation on it saying it's an intrinsic candidate. Then, in vmSymbols.hpp, there are a lot of templates that describe the bits and segments of a method signature by which you can identify a method as an intrinsic. So, for example, the template there says that this string is the signature of a method that takes a long and returns void.
do_name says that writeback0_name is the name of a method called writeback0; that was the name of the unsafe method. And do_intrinsic says, well, we've got an intrinsic called _writeback0. It's on the class defined by this symbol, which is another name, the Unsafe class. Its name is the writeback0 name, it has a long-void signature, and it's a method called by the normal protocol. And the same with the pre-sync and post-sync methods: we define those too. So that's made them known to the compiler; it can see a call and say, yes, I know this is an intrinsic. Now, inside the C2 compiler, there's a method called is_intrinsic_supported. It looks at a particular intrinsic ID it's found and says, do we really do this? There's a big switch on the ID. I've put in a little hack so that we can say, actually, don't recognise that, so we can switch it off and on; normally we just drop out of that and return true, which means, yes, translate this. But we can also switch it off to do a comparison, and I did that comparison. Inside the compiler's library call kit, which is the thing that does all the translation of intrinsic methods, there's another switch on those intrinsic IDs, and there are two functions used to translate the three calls: one that generates a compiler graph for a write-back, and one that generates a compiler graph for the synchronisation operation, either pre or post. It obviously takes an argument, true or false. So we just call into those methods. These are the ones that build the stuff: basically a compiler graph that can be translated to machine code. Now, when we come into the inline-unsafe-writeback function, we've got the stub of a graph for a function, and it's a function that takes a long. So the first thing we do is plant a null check to make sure we haven't been passed a null Unsafe object. Null-check-receiver says, put in a test there to make sure there's an Unsafe object, or throw an exception.
The next thing we do is take argument one from that function call, which is the address; we get the node that represents that input argument and we create a cast node, because it comes in as a long and we want to process it as a pointer. So we pass it into a new cast node, which now has type void star; it's a pointer node. We then use the GVN transform call to add it into the graph, so that's now linked in. Then we create a new node that I've added, called a cache write-back node. It takes three inputs; we create it and add it into the graph. Now, the three inputs are: the address, which is input two; input one, which is a link into the current chain of memory nodes for pointer types, because we want to link it into the memory graph so it's ordered with respect to previous memory operations; and input zero, which is the current control link, because we want it linked into the control flow. So it's got three inputs, and that's all it is, it's just a node. It represents an abstract write-back operation; in the back end we'll translate it to the appropriate instruction. We link it into the graph with the transform call, and then we update the memory chain: the pointer memory chain is now this node. We've added a node in, so we've updated the end of the memory chain, and other operations happen in order with it. That means these will be scheduled in the right memory order, so we'll get the right cache synchronisation, and the write-back will also ensure the right memory synchronisation. Okay, and there's a similar thing if we call the unsafe write-back sync method: we'll create a pre-sync node or a post-sync node. They need to be linked into the control and memory structure. They're both memory nodes, so they become the new tail of the memory chain. And the actual nodes are just classes; they're instances of class Node. They have three inputs, which they pass up as their three ins.
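A toy model of what's going on when those nodes get planted: input 0 is control, input 1 is the memory chain, input 2 is the address, and planting a node makes it the new tail of the memory chain so later memory operations are ordered after it. This is a drastic simplification of C2's real Node/GVN machinery, written in Java purely for illustration; only the node name CastX2P corresponds to a real C2 node.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of planting a cache write-back node into a compiler graph.
// Input 0 = control, input 1 = memory chain, input 2 = address.
class Node {
    final String op;
    final List<Node> in = new ArrayList<>();
    Node(String op, Node... inputs) { this.op = op; in.addAll(List.of(inputs)); }
}

class GraphKit {
    Node control = new Node("Start");
    Node memory  = new Node("StartMem");   // current tail of the memory chain

    // Mirrors the inline-unsafe-writeback translation: cast the long
    // argument to a pointer, plant the node, make it the new memory tail.
    Node plantWriteback(Node longArg) {
        Node addr = new Node("CastX2P", longArg);   // long -> pointer cast
        Node wb = new Node("CacheWriteBack", control, memory, addr);
        memory = wb;   // later memory operations now depend on this node
        return wb;
    }
}
```

The single assignment to memory is the whole ordering trick: anything planted afterwards takes this node as its memory input, which is what forces the scheduler to keep the write-back in sequence.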
They have a bottom type which says, I'm of type memory. That means the back end knows they need to be processed as memory nodes; they'll get a memory input. And there's a similar one for the pre-sync node. Then the final bit of the equation: we need to tell the back end, when you see one of these nodes in the graph, translate it to machine code. Now, this node might have been planted in the actual unsafe method, and that graph might have been inlined into two or three callers, so we can get this machine code inlined into the generated code just as it would be for any other inlined Java code. What we've got is an input which is a memory address. It should be just a raw memory address; it's of type indirect, which means we don't want an operand with an offset or displacement, we want just an actual address for a cache line, and those asserts check that. And there's the dc instruction that flushes that line back to memory. There's a different back end for Intel which will generate the relevant Intel code. For the post-sync node, we have an encoding that plants a membar. For the pre-sync node, you don't need an encoding, because there's nothing needed; it's just empty. So basically that means you've now got a really efficient way of getting the actual instructions, the single instructions you really want executed, inlined right there, up through the whole call chain. Right, I'd better stop there. So the good news is we have a plan C. It's not necessarily the plan we'll stick with; I'm going to have a chat with Alan Bateman about this, and we may end up re-implementing it some other way. But it's a really good example of all the different steps involved and the sort of benefit you get. We did actually do some benchmark tests on this, and it's a very, very big improvement. I can't give you the actual figures, but this is way, way better than the native method version.
And the native method version is way, way better than using the file descriptor force for the pmem device, or calling out to JNI. So this can make a phenomenal improvement to your performance. Right. Thank you. I don't suppose we have time for questions? Yeah.