This talk is about the Memory Access API, a new API that we added as an incubating API to Java 14, just as the gate was coming crashing down — so basically on the very last date. The title of this talk is deliberately inflammatory. I hope that by the end of this talk I will convince you that the role of the new Memory Access API is not to completely replace the ByteBuffer API. You can decide whether you want to use the new API alongside ByteBuffers, or whether you want to replace ByteBuffer usage entirely with the new memory API — it's up to you. Hopefully you will have yet another tool in the toolbox that will help your work. So, the usual disclaimer: don't just believe what I say. There are a number of situations where people may want to reach for off-heap memory. Probably the primary one is to avoid all the costs associated with GC. Now we have Shenandoah, we have ZGC, so we have much better GCs than we did in the past, but there are still cases — for example, when you want to build a real-time application — where you may want to avoid GC pauses entirely. There are also other circumstances where using off-heap memory may be necessary: for example, when you want to share memory across multiple processes, or when you want to share memory with a native library. So it's not an accident that we landed on this API while working on Project Panama, which, as Mark showed before, is all about native interop. The de facto Java API for this kind of off-heap access is the ByteBuffer API. There are also other APIs hidden in the JDK: sun.misc.Unsafe is one of them. You can use it if you want — it's fast, but it's unsafe, so if the VM comes crashing down, it's your fault. What about ByteBuffers? Well, ByteBuffers were added in Java 1.4.
So they were part of the big push towards buffer-oriented input/output. They are a rich and stateful API, and one of the main drivers of ByteBuffer was to make it simple for you to write idiomatic I/O code. It has a lot of internal state that helps you prevent buffer overruns and underruns, and helps with things like charset encoding and decoding. Crucially, ByteBuffers can be allocated both on the Java heap and off the Java heap, so you can actually allocate a slice of off-heap memory and associate it with a ByteBuffer. Here is a very typical example of ByteBuffer usage: we want to read the contents of a file channel into a byte buffer, and then loop over all the characters that we've read from the buffer. When we allocate the buffer, it is empty at the beginning. Two notable things — we have quite a few variables here. There is a position, which is initially set to zero, and then there is a capacity, which is essentially how big the buffer is — in this case, 10 bytes. Then there is a limit, another mutable part of the ByteBuffer's state, which is initially set to the capacity. The first thing we have to do is read some bytes from the channel, which means we are actually writing into the ByteBuffer. So here we read some characters, and now we have to start reading them back into our application.
So the first thing we have to do is flip the buffer from writing mode into reading mode, which means the position will be set back to zero and the limit will be set to the position right after the last character that was read. Then I can start my loop and read the characters one by one, until eventually I end up in a state where the position is identical to the limit. At that point the hasRemaining predicate returns false, so I exit my loop, and then I have to get ready for yet another read from the file channel. To do that you call clear. What does clear do? It basically resets the state of the ByteBuffer to its initial state: the position goes back to zero, the limit goes back to the capacity, and I can do another iteration. That's basically how you work with buffers. This buffer is allocated on the heap; if you wanted to use an off-heap buffer instead, you would change a single line of code and use the allocateDirect method instead of allocate. This is called a direct buffer, and it is associated with off-heap memory. With direct buffers we actually have a new weapon as developers, because we can write code that accesses off-heap memory. Access to off-heap memory with ByteBuffer is quite efficient, because at the end of the day ByteBuffer is implemented on top of Unsafe, so we still get the advantage of all the C2 data-movement intrinsics that we have. The access is also safe because, as we've seen, ByteBuffers have these concepts of capacity, limit and position, so every access is checked against the buffer's boundaries — otherwise we get an exception. But how good are ByteBuffers if we want to write general off-heap programs? Well, let's look at some numbers.
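Before the numbers, here is the flip/clear read loop just described as a self-contained snippet. To keep it runnable without a real file, I'm reading from an in-memory channel — any ReadableByteChannel behaves the same way as a FileChannel for this purpose:

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class BufferLoop {
    static String readAll() throws Exception {
        // stand-in for a FileChannel: any ReadableByteChannel works the same way here
        ReadableByteChannel channel = Channels.newChannel(
                new ByteArrayInputStream("hello, byte buffers".getBytes("US-ASCII")));

        ByteBuffer buffer = ByteBuffer.allocate(10);   // position = 0, limit = capacity = 10
        StringBuilder out = new StringBuilder();

        while (channel.read(buffer) != -1) {           // writes into the buffer
            buffer.flip();                             // position -> 0, limit -> end of data
            while (buffer.hasRemaining()) {            // stop when position == limit
                out.append((char) buffer.get());       // relative get advances the position
            }
            buffer.clear();                            // position -> 0, limit -> capacity
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAll());                 // prints: hello, byte buffers
    }
}
```

Swapping `ByteBuffer.allocate(10)` for `ByteBuffer.allocateDirect(10)` is the one-line change mentioned above that moves the buffer's storage off-heap.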
I have a benchmark which essentially allocates a 400-byte slab of memory and then writes 100 ints into that slab. The benchmark has been deliberately cherry-picked, because I think it is characteristic of what happens a lot when you do native interop, which is something we care a lot about in Panama: you allocate a small buffer of memory, you fill it, you pass a pointer to this memory to some native function, and then you free the memory after the function returns. So this is maybe not a use case that comes up a lot when you're doing I/O, but it is something you do quite typically when working with native libraries. If you use Unsafe you get a certain throughput — nine operations per microsecond. That's fine. Let's try to replace this code with the ByteBuffer API, which is a supported API. We can see that the throughput is almost 9x slower compared to Unsafe. This is due to the extra safety that the ByteBuffer API provides, but it's also due to a number of extra factors. Here we can see at least two factors hindering performance. The first is that I'm using the relative positioning scheme: I do a putInt, relying on that mutable position field being incremented on every access, and that slows things down a little. The second, and more important, thing is that every direct ByteBuffer has to be registered with a GC Cleaner, so that the off-heap memory is deallocated once the GC can prove the ByteBuffer is no longer referenced by anything in the application. The GC has to do a lot of work here, and this work shows up in the benchmark. In fact, we can change the benchmark a little: first, use the absolute putInt method; second, and more importantly, use the Unsafe invokeCleaner method, which allows us to free the memory explicitly without relying on the GC Cleaner.
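Those two tweaks — absolute addressing and an explicit invokeCleaner call — can be sketched like this. The reflective `theUnsafe` lookup is the usual workaround for getting hold of Unsafe outside the JDK; none of this is supported API, so take it as an illustration of the benchmark's shape rather than recommended code:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class DirectBufferFree {
    static int fillAndSum() throws Exception {
        ByteBuffer buf = ByteBuffer.allocateDirect(400);

        // absolute addressing: no mutable position field is updated on each access
        for (int i = 0; i < 100; i++) {
            buf.putInt(i * 4, i);
        }
        int sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += buf.getInt(i * 4);
        }

        // free the off-heap memory eagerly instead of waiting for the GC Cleaner;
        // the buffer must never be touched again after this call
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        ((Unsafe) f.get(null)).invokeCleaner(buf);

        return sum;                                    // 0 + 1 + ... + 99 = 4950
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fillAndSum());
    }
}
```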
We see that the performance improves a little. It's not as fast as Unsafe, but it's a little bit better. To be fair to this benchmark, and to ByteBuffer in general: Unsafe doesn't zero memory, ByteBuffer does. I'm not allocating a very big chunk of memory here, so zeroing is not affecting performance too much, but there is still an extra cost when using ByteBuffer. And let's look at what happens in memory. With Unsafe, of course, the GC is basically not working: all the accesses are off-heap, and that's what you would expect. But with the first ByteBuffer example we wrote — the naive ByteBuffer usage — the GC was actually spinning for five seconds during this benchmark, which is quite a lot considering you wanted to use off-heap memory to get rid of the GC in the first place. With the third benchmark, the new version, things get a little more under control and the GC time goes back to zero. But still the performance is not quite as good as it could be, and the biggest problem is that ByteBuffer.allocateDirect — that first invocation — is quite heavy: the ByteBuffer has to be registered with a Cleaner even if we are not using the Cleaner, and there is also quite a lot of state to track how much off-heap memory we are using. There is a limit, so there are a couple of atomic instructions to check whether we are allocating too much. This is quite expensive, and it shows up in this particular allocation-intensive benchmark. So where does this leave the ByteBuffer API? Is it a bad API? No, it's not a bad API. It's just that, here,
I think we are trying to use it in a way that ByteBuffer wasn't meant to be used in the beginning. Direct buffers work very well if, for example, you allocate one very big buffer and then keep sharing it; and because the cost of I/O operations typically dominates every other cost, all the stuff I showed you before doesn't really matter. Unfortunately, though, ByteBuffers fail to scale when you consider more general use cases, because you have no way to deterministically release the memory. You either rely on the GC, or you use some unsafe operation — but you still pay a lot up front to allocate the buffer. Then you have the 2-gigabyte limit, which is starting to hurt, especially now that we have support for mapping persistent-memory files. Persistent memory can easily be bigger than 2 gigabytes, and we have no way to address it using the ByteBuffer API, because all the indices we can specify are ints. And then there are limitations in the expressiveness of this API when it comes to accessing memory, because you can only choose between sequential access — essentially one int at a time — or an absolute addressing scheme where I have to pass the offset all the time. There is no support for structural access: if I have a struct in memory, there's no way for me to say "I want to access that particular field"; I have to work out the offsets manually in order to get to that location in memory. So we think that, rather than investing more in the ByteBuffer API, the time has come to build a new memory API from the ground up. And of course this new API will be interoperable with ByteBuffer.
So you don't have to throw away all your code. But as I was discussing with Paul last week, we think that ByteBuffers have reached their functional capacity. Some of these limitations — such as the 2-gigabyte limit, or deterministic deallocation — are very hard to fix in the current ByteBuffer API. It would require a pretty big redesign of the entire API, which is probably not going to be very compatible, so it's probably better to start from scratch and design a new API. Here is what happens when ByteBuffer fails to meet expectations: Netty is a big client of ByteBuffer — it allocates a lot of buffers — and starting from version 4 they rolled their own version of ByteBuffer, called ByteBuf, no pun intended. It's based on a different allocation scheme: they have a specialized allocator which reuses memory — an allocation pool, essentially a jemalloc implementation written in Java — and with this they were able to get a lot more scalability out of their buffer infrastructure. This is unfortunately something we cannot support in Java today, so people have to reach for different abstractions. We'd like that code to come back to the JDK eventually — or at least that's the hope. So, enter the Memory Access API. It's a new API, and it's a safe API. The goal — and we will see this later in more detail — is absolutely no VM crashes: you should never get a VM crash while trying to access off-heap memory using this API. It is as safe as ByteBuffers are. There are three key abstractions. The first is called MemorySegment, which is just a contiguous region of memory — some bytes somewhere. They can be on-heap or off-heap.
The API is actually neutral as to where the bytes are stored. Then we have addresses, which are essentially offsets into segments — you can think of an address as a long that points to some location inside the segment. And then we have memory layouts, which are optional descriptions of the contents of memory. You can decide to use them or you can ignore them, but we will see what the advantages are of attaching a memory layout to a segment. If you look in the Javadoc of this API, you will find no method called getInt or putInt — nothing of the kind — and we received some questions about this when the request for review went out. It's not an omission because we forgot about them; it's because there are plenty of ways to get to the data of a memory segment. You can, for example, take a memory segment, map it to a ByteBuffer, and then use the good old ByteBuffer API to get ints and floats and longs, and never look at segments ever again. Or, if you want to reach a lower level and go down the VarHandle rabbit hole, you can create VarHandles that are able to dereference memory using memory addresses — and this is actually a good option that I will explore in the final part of this talk. So the main idea here is that we don't need to reinvent the wheel; we don't need to add a lot of accessors for our memory.
We can just leverage the good APIs that we already have. This is what a segment looks like. Let's imagine we have an array of struct points, where a point has two int coordinates — a very simple thing. We can imagine this array's memory being flattened so that all the coordinates are spatially consecutive: x0, y0, up to x4, y4. This segment has natural spatial bounds: it starts from a base address, which points to x0, and then there is a limit address, which is the maximum address associated with this segment — actually, it's the address of the first byte that is outside the segment. As long as an access occurs within the segment, everything is fine. If I have an address, I can add an offset to it and obtain a new address. For example, if I add 16 to the base address, I obtain a new address that, instead of pointing to x0, points to x2. And if I have a segment, I can also slice it. This is similar to the slice operation that ByteBuffer also provides: I can specify a new start address and a new length, and I get a sub-segment contained within the original segment. Nothing too fancy here. The main thing to notice, maybe, is that this API is immutable: all of the bounds you see here are immutable, and whenever a new address is created you actually create a new instance with a new offset, so nothing will actually
So nothing will actually Mutating memory which will hopefully enable for better situ optimization in the future the goal of The big goal of the segment API API and the big bet is that we want an API that is able to do the deterministic Delocation which means whenever you are sure that your memory is no longer going to be used You should be able to explicitly free it and the way this is done is that You you essentially have a segment use it and when you are done you close the segment of Course with power comes responsibility if you forget to close your segment and the segment goes out of scope you have a memory leak because now you have some memory of eep and That is not being clear To help with that memory segments implement the auto closeable interface So you can use a memory segment with a try with a source construct Hopefully that will reduce the occurrences where these leaks will occur Other things we could do in order to improve on these is to do something similar to what neti has done Which is to add a debugging kind of mode where we actually register a cleaner and we keep track of when a segment goes out of scope and The closed method is not being called So the way you work with segment as I said you don't need to do a lot an awful lot You can just allocate your segment of the right size So here if we want to allocate a segment that is big enough to contain our abstract of point array We have to well do a little bit of computation. 
There are four bytes for each int, there are two ints for each point, and there are five points in the array. Then I can derive a ByteBuffer from the segment and pretend the segment doesn't even exist: I just use the ByteBuffer API to put the ints for the x-coordinate and the y-coordinate in a loop, and then, at the end of the try-with-resources — when I close the brace — a close operation happens and the memory associated with the segment is actually released. So, did I gain anything by doing this round trip between memory segments and ByteBuffer? Actually, quite a bit, because I got rid of that expensive allocateDirect operation. The ByteBuffer we are creating now is just a view of the segment's memory, so it's a much cheaper operation. We also have deterministic deallocation at the end, so we no longer have to rely on the garbage collector to go in and free the memory — we can say when the memory needs to be freed. If you write this code, you get the same performance you could get with the benchmark I showed you last, the one that was using an unsafe method to clean the memory. Actually, this should go even faster, because you are paying a lot less for the initial allocation with the native memory segment. There are problems in this code, though — I'm not going to lie. For example, we have to compute the size of the memory that we want to allocate manually, and then there are all these offsets and constants spread all over the code. This is very fragile: if I change the coordinates from int to long — for example, probably on a 64-bit machine — this example is no longer going to work. So how can we make this code a little bit more robust?
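To make the problem concrete, this is roughly what that fragile version looks like written against the incubating jdk.incubator.foreign API as it shipped in Java 14. The names changed in later releases, so treat this as a sketch rather than code for a current JDK:

```java
// struct Point { int x; int y; } points[5]
// size = 4 bytes per int * 2 ints per point * 5 points -- computed by hand, fragile
try (MemorySegment segment = MemorySegment.allocateNative(4 * 2 * 5)) {
    ByteBuffer bb = segment.asByteBuffer();  // cheap view of the segment, no allocateDirect
    for (int i = 0; i < 5; i++) {
        bb.putInt(i * 8,     i);             // offset of x_i, hand-computed
        bb.putInt(i * 8 + 4, i);             // offset of y_i, hand-computed
    }
}   // close() runs here: the off-heap memory is released deterministically
```

Every numeric constant in there — the element size, the stride, the field offsets — silently breaks if the struct changes, which is exactly the fragility being described.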
Well, our idea was to introduce an abstraction called memory layouts, and the goal of this abstraction is to replace that comment I showed in the previous slide — the thing at the top — with an actual object creation. So you can create an object which specifies what the layout of this array of structs is. The advantage of doing that is that once you have an object, you can derive all sorts of important information from it: for example, how big your layout is, or what the alignment of some of the components in the layout is. And since layouts compose — you can nest layouts inside other layouts — you can use layout paths to ask tricky questions, such as: what is the offset of the field y inside a point? That would normally have been a hand-written constant, and now you can actually ask the API for it. You can imagine that when working with more complex structs, this will actually be useful. So the big bet here is that by having more declarative code, there will be fewer places for bugs to hide. This is how we model the point struct using a layout. Of course we have to start from the outside: we create a sequence layout — that's what we call it — so you have to specify a size, which is 5 in this case, and then you have to specify what goes inside the sequence. In this case we have a struct, so we just call the struct layout factory, and then we have to specify the fields of the struct: two 32-bit ints — I'm assuming the ints here map to 32-bit values — and I can even attach names to the fields, so that I can then perform queries on this particular layout. Here I've done a little bit of a simplification: in reality, if you try it out on the Java 14 code, the factory for the value bits will also take an endianness argument, because of course we have to specify whether you are big-endian or little-endian — but it didn't fit on the slide. So that's what it is. So, let's say that
we have this big layout here, which represents my array, and I want to compute the offset of the y field inside a point. How can I do that? Well, there is a method on the layout object which is called offset — not a very imaginative name, maybe. You have to pass a path that enables the method to find the field you want the offset for, starting from the outer layout. So here we start from the sequence layout. The first thing I have to do is choose an element of the sequence — let's say we pick element number zero, because that's the one with the smallest offset. Then, inside the sequence, I have to choose which of the two struct fields I want to compute the offset for — in this case, the y field. By doing this, I have specified a path from the sequence down to the y field, and now I can ask for the offset. As you can see, I have been able to obtain the offset without writing any numbers; I'm essentially just querying the API. And that's exactly what we are doing now in this rewritten example. We got rid of the comment at the beginning of the example that we had before, and we replaced it with an actual memory layout instantiation. So now we have an object that describes the layout of the things we are going to work on. Then, in the middle part of this slide, I'm able to use the layout to derive constants such as how big a point is, or what the offset of the y field inside the struct point is, and I can use all this inside my loop to get rid of all the numeric constants. Most importantly, I can use the layout directly in the allocate call for the memory segment, which means I don't have to work out the size by hand — I just delegate to the layout API to do the right thing. So this is much easier to read; we gain quite a lot in terms of expressiveness.
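The layout definition and the offset query just described look roughly like this with the Java 14 incubator names (again a sketch — the exact factory names and the endianness argument are as I remember them from that release, and they changed later):

```java
// sequence of 5 structs, each struct holding two named 32-bit values
SequenceLayout points = MemoryLayout.ofSequence(5,
    MemoryLayout.ofStruct(
        MemoryLayout.ofValueBits(32, ByteOrder.nativeOrder()).withName("x"),
        MemoryLayout.ofValueBits(32, ByteOrder.nativeOrder()).withName("y")));

// path: element 0 of the sequence, then the field named "y" inside that struct
long yOffset = points.offset(
    PathElement.sequenceElement(0),
    PathElement.groupElement("y"));
// the offset of y inside the first point, derived from the layout --
// no hand-written numeric constant anywhere
```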
There is still something that I don't quite like, though. If we go inside the loop, we see that there are two calls to the ByteBuffer putInt method, and looking at this code it's pretty hard to tell that one call is meant to set one field of the struct and the other call is meant to set a different field. It's only the offset computation that gives that information away. So that's yet another place where bugs can hide. Can we do a little bit better? Can we improve on that? The idea that we got — maybe a crazy idea — was to introduce a new breed of VarHandles called memory access handles. For those of you who are not familiar with the VarHandle API: you can think of a VarHandle as a java.lang.reflect.Field on steroids, in the sense that a reflective Field gives you access to Java fields, while a VarHandle gives you reflective access to Java fields but also to more kinds of variables, such as array elements or ByteBuffer elements. So it felt natural to also provide a new kind of VarHandle that is able to give you access to, for example, off-heap memory, by taking a memory address as a coordinate. The big gain that you get with this API is that, number one, you get all the atomic operations that the VarHandle API supports.
So let's say that the ByteBuffer API is not enough for you: a plain, simple get is not good enough, you want an atomic get or something like that, or you want memory fencing because you're working with multiple threads. Then you need to reach for the VarHandle API — that's probably the best API to do this kind of stuff with. The second bonus point is that if you are using memory layouts, you don't have to do anything particularly fancy to get this VarHandle: you can just ask the layout API, "give me the VarHandle for accessing this field", and you basically get it. If you want to see how these VarHandles work: there is a factory inside the memory access API that allows you to construct the VarHandles by hand. Typically you won't have to do that, because, as I said before, you will derive the VarHandles from the layouts — but let's say you want to go through the process and create the VarHandles bit by bit. When you create a memory access VarHandle, the first thing you have to specify is a carrier type: you have to tell the VarHandle what Java type you want to come out of, for example, a get operation. In this case we want to read the values as ints, because the values are four bytes, so we are going to pass int.class to the VarHandle factory. We get back a VarHandle such that, for example, if I give it the base address of this segment, it gives me back the value of the x0 coordinate. Can I do more? Yes, of course. If I want to read y0 rather than x0, I can combine my previous VarHandle with an offset: I essentially take the address that comes in, attach an extra offset to it — move it — and then read at the resulting address. So if I pass the base address, I can get y0 out. But can I do something fancier — can I access all the y coordinates in this array?
It actually turns out I can: I can construct a strided VarHandle. I pass in the stride — in this case, the stride is of course the size of the point — and I get back a VarHandle that takes an extra coordinate: not just a memory address, but also a long, where the long is a logical index which basically says which point I want to get the y from. So if I do a get with an index of zero, I get y0; if I specify one as the index, I get y1; and so forth. Now, constructing VarHandles like that may be a little painful, so we integrated this VarHandle machinery with the memory layout API. As you can see in the middle section of this slide, we can derive all the VarHandles for accessing x and y with two simple calls to the layout API. There is a varHandle method: you specify a carrier type, then you specify a path down to the element that you want to access, and so you can construct, in two lines, a VarHandle for the x element and a VarHandle for the y element. And now, inside the loop, you can see that I'm using the x VarHandle for setting the x elements and the y VarHandle for accessing the y elements. So now the code is more explicit, and if I change anything in the layout up there, everything will just flow — I won't need to update this code at the bottom ever again. So, let's switch gears a little and talk about safety. As I said at the beginning, this is a safe API.
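One aside before safety: the strided-access pattern just described is not unique to the new API. The ByteBuffer view VarHandles that have been in the JDK since Java 9 have the same shape — an access coordinate plus a byte offset — so the idea can be tried on any recent JDK. This is an illustration of the pattern using those existing handles, not the new memory-access handles themselves:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class StridedAccess {
    // views the buffer's bytes as ints; coordinates are (ByteBuffer, byte offset)
    static final VarHandle INT =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    public static int[] ys() {
        ByteBuffer points = ByteBuffer.allocateDirect(5 * 2 * 4); // 5 x {int x; int y;}
        for (int i = 0; i < 5; i++) {
            INT.set(points, i * 8,     i * 10);      // x_i at stride 8
            INT.set(points, i * 8 + 4, i * 10 + 1);  // y_i at stride 8, offset 4
        }
        int[] ys = new int[5];
        for (int i = 0; i < 5; i++) {
            ys[i] = (int) INT.get(points, i * 8 + 4); // strided read of every y
        }
        return ys;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(ys())); // [1, 11, 21, 31, 41]
    }
}
```

The new memory-access VarHandles do the same job, but with a memory address as the coordinate and with the stride and offset baked in by the combinators or derived from a layout path.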
So one of the main goals is to avoid any kind of VM crash. It is beyond the scope of this API to prevent silly user mistakes, such as writing an int and then reading the same four-byte value back as a float — that's not something we want to protect you against. But there are a couple of conditions that we do want to protect you against, such as accessing memory out of bounds — which, if the memory is off-heap, can result in a crash — and accessing memory after the memory has already been freed. The second problem is a particularly nasty one, especially when you consider multiple threads accessing memory at the same time, because you can have one thread doing the access and another thread doing the release. So how the heck do we make this safe? It's actually pretty tricky. You could lock everything, but that basically kills the performance. Instead, what we decided to do was to enforce, by default, a strong confinement model, so that whenever you create a segment, the segment is confined to the particular thread that created it. Only that thread has access to the memory associated with the segment. Any other thread that wants to join in can, but it has to do an explicit operation called acquire. This acquire call creates a view that is specific to that second thread, and you can only close the original segment after all the acquired views are gone. So we still have deterministic deallocation, even in the presence of multiple threads — but if you are working with multiple threads, you have to be explicit about who is accessing what. So how does this translate in terms of performance? This was the best result
we could squeeze out of the ByteBuffer API — we had to cheat a little by using Unsafe — and these are the numbers coming out of the memory segment API today. They are still not as good as the Unsafe numbers, but they are a little better than the ByteBuffer ones. The main contributor to this number, I think, is the fact that allocation got a lot less expensive compared to ByteBuffer.allocateDirect. But there are other things too: all the bounds are final variables, so C2 can hoist them out of loops. There is still a bit of a difference, and we have to keep in mind that we are also zeroing memory with the memory segment API, so we can never go quite as fast as Unsafe here. But we are also trying to look ahead a little. We don't want to provide you with something that just looks a little better than ByteBuffer; we want to give you something that is actually more scalable, as I was mentioning at the beginning. So we are working on a different allocator. Jim Laskey is doing a bunch of work on a new allocator called the quota-based allocator, and the numbers so far — and I'm kind of teasing you a little here — are pretty impressive, in the sense that after some experiments plugging this new allocator into the memory segment API, I was able to reach better performance than Unsafe, even though I was still zeroing memory. This allocator does a lot of the tricks that the Netty allocator probably does, but there are a couple of tricks that I think are new. For example, instead of committing memory eagerly, we pre-reserve a big chunk of memory — like 4 gigabytes — and then we only commit when the client requests memory, and doing this helps performance quite a bit. But the big saving is that we don't need a native call to do a malloc each time we allocate a new segment, or each time we have to free it, because the allocator is able to recycle the memory segments that are handed out and
released. So this is where the future of this API is heading, and I think if we deliver something like this, then maybe — or at least hopefully — some of the alternative APIs to ByteBuffer will disappear over time, or maybe people who are going to write new APIs will decide to stay on the main JDK API. How hard was it to get here? Well, I'm going to be honest: it was a little bit hard. Andrew did some benchmarks of this API and found some issues. There were indeed some issues; we fixed some of them in Java 14 already. For example, HotSpot was very conservative with respect to memory barriers: every time there was an unsafe access, it would immediately add barriers after the call. Now C2 behaves a little better and removes barriers when the access is provably off-heap, and this also improved the performance of the base ByteBuffer API — so that's actually a good result. Thread-confinement checks were not very well treated by C2, mostly because Thread.currentThread was not perceived as a constant by C2. So we did some work to fix that, and now performance is a little better — although we had to disable this optimization on the Loom branch, because it of course creates all sorts of havoc with fibers. But we are not done.
There are a lot of issues around escape analysis. This API, as I said, is immutable: every time you call baseAddress you create a new instance, every time you add an offset to an address you create a new instance, and sometimes these instances get in the middle and perturb some of the C2 optimizations — C2 is not always able to see through some of the allocations. There is also another problem, and this is probably the main problem with this API: the API accepts longs as indices. This is good, because it gives the API more room to grow, but at the same time we are running into some bottlenecks with C2, in the sense that C2 is optimized to remove bounds checks on loops that work on ints. As soon as you step out of ints, you are in a heap of trouble: bounds-check elimination no longer works, loops are no longer unrolled, you don't get vectorization — any of that. So right now we are doing some heroics to try to generate the right kind of code, but we think that the right approach, longer term, is to fix this big performance gap and at least let C2 see whether a segment is bigger than 2 gigabytes or not — and if it is not, fall back to the logic and optimizations that we already have. In other words, there's more work to be done here. So I think the memory access API is a great alternative to the ByteBuffer API — or a great complement. It is a fully immutable API, so over time it should lend itself to better C2 optimizations. There is deterministic deallocation, which you didn't have with ByteBuffer, and that makes quite a difference in terms of GC load. The addressing scheme is not limited to 2 gigabytes, which also makes a difference if you are using persistent memory or things like that. And there is the ability to do structural access with VarHandles and memory layouts. So I think it's a very compelling alternative to ByteBuffer if what you want to do is off-heap access. The memory access API is safe.
It's like a safe ByteBuffer, so it's a good, safe replacement for the Unsafe API. There are spatial and temporal checks on every access, and there is a robust ownership model which allows you to remain safe even if you are working with multiple threads, while still retaining deterministic deallocation. So where does it all fit in the bigger Panama picture? I'm not going to talk about JNI, but I just wanted to give you a taste of how it all fits together. As Mark showed earlier, in Panama we want to give you tools so that you can start from a header file, do some work, and derive a set of Java bindings. Initially the idea was that these bindings would be interfaces with some annotations, and then there would be a runtime component that reads the annotations and generates some code on the fly. We actually realized that there's no need for that — we only needed two pieces. One is the memory access API piece, which gives you a bunch of VarHandles for accessing memory: for example, struct fields at particular offsets, and things like that. The second piece is a foreign function API, which we are probably going to deliver in 15, also as an incubating API, and which allows you to map foreign functions to method handles. On top of these two pieces you are able to create a very low-level set of bindings, and then you can also build on top of those if you want: you can add plugins to the basic tool that we will provide and generate higher-level bindings. But the low-level bindings, as Mark showed before, are not so bad — we are generating static wrappers, so they are relatively usable. So, as I said at the beginning of this talk, this API is actually available in Java 14. My recommendation is to try it out and report back — performance problems, usability issues, or whatever you can find. The next steps, of course, are to round off the performance work.
We know that we have to do better here. We also want to finish the work on the allocator, because we think there's a lot of room for improvement there. And we have to polish and finalize the API: right now it is an incubating API, and there are probably methods whose names need polishing, or whatever. Then we have to integrate this API into the overarching Panama story. You can follow the progress on panama-dev; but if you are more familiar with core-libs, you can also report issues there. I will be looking at both. So, thank you.