My name is Erwin de Souza. I'm a JIT compiler developer at IBM and a committer on the OpenJ9 project, and I'm here to talk to you about an exciting project we've been working on called JIT Compilation as a Service. But first, a slide to make the lawyers happy.

So what is OpenJ9? You might have already heard about it in the previous talks, but it is an independent implementation of a Java virtual machine, and you create a complete JDK by combining it with the class libraries from OpenJDK. It actually came from the IBM J9 VM, and in fact the IBM SDK is currently built with OpenJ9. The project I'm going to talk about is prototyped using OpenJ9.

These are some things you might already know about a JIT, or just-in-time, compiler. JIT compilation takes CPU time and memory from the application in order to compile and optimize code, and the idea is that if we run long enough, we usually break even. While this has been true for big monolithic applications, it is less so for smaller, distributed, shorter-running cloud applications, because JVMs there sometimes need to operate in small spaces. If you take a monolithic application that has been broken down into services, which is what you want to do if you want to run in a cloud, each of these services might have its own JVM, in which case each might have its own JIT, and so there's a lot of duplicated work happening across these services.
In addition, now that the services are smaller, you might decide, hey, we don't need that much space; let's take a machine, say 8 cores and 8 GB, and split it up into virtual machines. In that case you're now running a JVM in a constrained environment. So not only do you have duplicated work (we're compiling the same methods over and over again, for example methods in the String class or in HashMap), you're also limited by the amount of space you have on these machines. So the question we asked was: what if we could JIT compile out of process, on a different machine or on a cluster of machines, and have the JIT own its CPU and memory resources, serve multiple applications, and share code between them? The answer we came up with was JIT as a service.

The idea is to treat JIT compilation like any other cloud service: we ramp resources up and down based on load, where load here means compilation requests. By doing so we can containerize it, manage it intelligently using Kubernetes or something similar, and share the results across components and across applications. What we're hoping for is better resource control for compilation and better scalability. Because the JIT traditionally runs in process, it has to steal resources from the application, and it can only steal so many; if it took everything, you wouldn't get any real work done. By moving it away from the JVM, we have the option of scaling it as much as we need to based on its requirements. We also get amortization of the cost of compilation, because we can share the code across different applications: it only has to be compiled once. We get improved reliability for the application, because if the compiler crashes, it doesn't bring down the entire application; it crashes somewhere else. And we can also now have better provisioning of resources for the
application itself: users don't have to worry about whether they need to size their containers to account for memory needed by the compiler.

Of course, there were concerns as well. Network latency is a big one: traditionally, compilation occurs at a time when the application is sensitive to ramp-up, by which I mean the time it takes to reach peak throughput, but now that we're moving it off process, onto a different machine, we're at the mercy of the network. Security is obviously another big concern: we're now communicating on a channel that was previously completely internal to the JVM, but now we're speaking outside of it. There are also network reliability issues: what happens when you're in the middle of a compilation and the network breaks down? What is the fallback plan? And of course the last one was: would all this work actually pay off? Would offloading to a different machine actually help?

So where are we right now? The current prototype is available on GitHub; I put the two links up there for you, and they're also at the very end if you don't get a chance to take a picture now. It currently supports almost all the optimizations done by the OpenJ9 JIT. It communicates fully over the network using raw sockets and protobufs, it's currently a single-server, multi-client model, and it also supports OpenSSL. It can run all the OpenJ9 test suites as well as many enterprise benchmarks.

So let's get into this. At a very high level, it is a client-server model, as I mentioned, with bi-directional communication. A compilation begins when the compilation request is sent to the server, and it ends when the server returns the compiled body back to the client. During the compilation, the server can make an arbitrary number of queries to get information: class info, profile info, hierarchy info. The reason it needs
to do this is that we've only moved the compiler off process; we haven't moved the entire JVM environment. The compiler has no notion of what is valid in this particular JVM's environment, so in order to make optimization decisions, it needs to query the JVM, the client, about it.

Let's go a bit deeper. There are eight steps involved, and I'll go through each briefly, but I want you to notice that steps one, two, and eight are essentially the same as what happens with an in-process JIT: we request a compilation, it gets processed, it gets compiled, and then we install it in the code cache. From the JVM's point of view, nothing has really changed; the JVM doesn't need to know whether the JIT is in process or out of process. The separation happens at the JVM-JIT interface. In step one, we generate a compilation request, which gets put in the compilation queue. In step two, the request is processed by the compilation thread on the client; "compilation thread" here is a bit of a misnomer, because it's not actually doing the compilation, it just facilitates it. Once it processes the request, it sends the remote request to the server, where a socket listener thread is listening for these requests. When the listener receives such a request, it places it on the compilation queue on the server side, and it places not just the request but also the socket descriptor, so that when a compilation thread on the server needs to make queries against the client, it doesn't need to go through the listener thread; it can communicate directly with the client. In step five, a compilation thread on the server picks up the request and starts compiling, and as I mentioned, it will need to make runtime queries every so often. Finally, when it's done compiling, it returns the compiled body to the client, at which point we relocate it, install it, and the JVM now has a compiled body that it can run.
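The eight-step flow described above can be sketched as a toy request/response loop. This is purely illustrative: the real prototype speaks protobufs over raw sockets, while this sketch uses length-prefixed JSON, and every name in it is made up rather than taken from OpenJ9.

```python
import json
import socket
import threading

def send_msg(sock, obj):
    """Length-prefixed JSON framing (a stand-in for the real protobuf messages)."""
    data = json.dumps(obj).encode()
    sock.sendall(len(data).to_bytes(4, "big") + data)

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        buf += sock.recv(n - len(buf))
    return buf

def recv_msg(sock):
    n = int.from_bytes(recv_exact(sock, 4), "big")
    return json.loads(recv_exact(sock, n))

def serve_one(listener):
    """JIT server side: receive a request, query the client for runtime
    information mid-compilation, then return a 'compiled body'."""
    conn, _ = listener.accept()
    with conn:
        req = recv_msg(conn)                              # remote request arrives
        send_msg(conn, {"query": "class_info",            # runtime query back to client
                        "class": req["class"]})
        info = recv_msg(conn)                             # client's answer
        send_msg(conn, {"compiled_body":                  # return the compiled body
                        f"code[{req['class']}.{req['method']}/{info['answer']}]"})

def client_compile(port, klass, method):
    """Client compilation thread: forwards the request and answers server queries."""
    with socket.create_connection(("localhost", port)) as sock:
        send_msg(sock, {"class": klass, "method": method})
        while True:
            msg = recv_msg(sock)
            if "query" in msg:                            # serve the runtime query
                send_msg(sock, {"answer": f"info({msg['class']})"})
            else:
                return msg["compiled_body"]               # relocate/install would go here

listener = socket.socket()
listener.bind(("localhost", 0))
listener.listen(1)
port = listener.getsockname()[1]
t = threading.Thread(target=serve_one, args=(listener,))
t.start()
body = client_compile(port, "java/lang/String", "hashCode")
t.join()
listener.close()
print(body)
```

The structural point the sketch mirrors is the bi-directional part: the server turns around in the middle of a compilation and queries the client for runtime information before it ever returns the compiled body.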
And so that's the general process of how it all works. How does it perform? I'm going to go through four different areas of performance that we measured: throughput, memory footprint, performance in a constrained environment, and performance in the cloud, because after all, this is a service. The setup: we ran with two x86 Linux machines connected with a direct Ethernet cable. On machine one we had the actual JIT server, MongoDB, and JMeter, each running in its own Docker container. On machine two we had AcmeAir, a Java EE benchmark, running on top of the Liberty app server. AcmeAir is a benchmark that simulates a flight reservation system, and Liberty is a dynamic, composable application server runtime.

Here's the graph we see for throughput over time. The blue line is the baseline, OpenJ9 with the in-process JIT, and the orange line is OpenJ9 using JIT as a service. As you can see, throughput is about the same; ramp-up is a little worse because of the latency that naturally occurs when you're communicating over a network. By ramp-up I mean the time it takes to get from start to peak, or steady-state, throughput. So we don't have to sacrifice peak throughput, and the network latency is tolerable and can be mitigated or optimized.

Next we looked at memory consumption; specifically, we measured resident set size. The blue line, again the baseline OpenJ9, shows all the spikes that occur because of the memory needed for compilation, whereas the orange line is much more stable because we don't have the compilation overhead: the memory is dominated by the application and the heap. The takeaway is that the user doesn't need to worry about the JIT; they can just size their containers for their application and move on. Next we ran throughput in a
constrained environment, and by this I mean we changed the AcmeAir container to run with only 64 MB of RAM and half a CPU. The green line here is the baseline OpenJ9 and the blue line is OpenJ9 with JIT as a service, and you can see that the JIT-as-a-service run performs much better. The reason the baseline J9 run performed so badly is that it only has 64 MB of RAM, so as the application's heap fills up, the compiler has less and less memory to work with, to the point where it can't compile anything and we're stuck interpreting. The JIT-as-a-service run doesn't have any of those problems, because all of the compilation has been offloaded; it just gets the compiled body and can move on. We included HotSpot here because we wanted to verify to ourselves that OpenJ9 wasn't just performing badly; it basically shows what a tough environment a constrained one like this is for a traditional JVM. If you're curious, those spikes are GC pauses.

Finally, we ran JIT as a service on IBM Cloud Private, which I'll refer to as ICP. ICP is a platform for developing and managing on-prem containerized applications. We had four worker nodes: one was OpenJ9 with the in-process JIT, another was running with the remote JIT server, we also had the actual JIT server, and then we had MongoDB. Here is the performance we see for that run. In the top left we have throughput over time, and again throughput is about the same here, but ramp-up is much worse. The reason is that unlike the other experiment, where the two machines were connected with a direct Ethernet link and the ping time was about 0.2 milliseconds, here you're completely at the mercy of the network, so the latency is much more obvious. In the bottom left you see the memory usage, and again it looks about the same as the previous slide, where the JIT-as-a-service line is much smoother and
represents essentially the heap, or whatever of the heap is in the resident set size. On the top right you see the CPU usage of both the in-process JIT and the JIT-as-a-service client. There's a big spike at the beginning of the baseline, because that's when all the compilation occurs; you don't have that big spike with the JIT-as-a-service client. You do, however, have more CPU usage than one would expect, and the reason is all the network communication. It was kind of a shock to us that network communication took up that much CPU, but it's something we're now looking into mitigating. At the bottom right you see the CPU usage of the server, which is more spread out, partly because of the network latency, and it's also higher than it should be because, along with the compilation, it's suffering from the same network CPU usage problem.

So that's where we are with the JIT-as-a-service model right now. Where are we going with it? First, we want to work on sharing compilations between clients. The prototype as it stands is a one-to-one model: if you have two applications connecting to the server, it will compile the same method twice. That's not something that needs to be done; it may still help in a constrained environment, but you don't get the amortization benefits. We want to finish implementing the remaining optimizations, so that the JIT-as-a-service run performs equivalently to the baseline run. We want to experiment with mixing local and remote compilations as a mitigation for what happens if the network goes down; maybe a compilation takes too long, or the network and the server are so bogged down with requests that you don't want to wait for your compiled body, and it's better to just compile it locally. We want to improve latency, of course, partly to improve ramp-up and partly to deal with the CPU resources.
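One way to picture the local/remote mix just mentioned: the client could put a deadline on the remote request and fall back to its in-process compiler when the server or network is too slow. This is a hypothetical sketch, not OpenJ9 code; the deadline value and all function names here are invented, and real network calls are simulated with a sleep.

```python
import concurrent.futures
import time

REMOTE_DEADLINE_SECS = 0.05  # hypothetical budget before giving up on the server

def compile_remotely(method, delay):
    """Stand-in for the real socket round trip; `delay` simulates network latency."""
    time.sleep(delay)
    return f"remote-body({method})"

def compile_locally(method):
    """Stand-in for the in-process JIT fallback."""
    return f"local-body({method})"

def compile_with_fallback(method, network_delay):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(compile_remotely, method, network_delay)
        try:
            return future.result(timeout=REMOTE_DEADLINE_SECS)
        except concurrent.futures.TimeoutError:
            # Server or network too slow: don't wait, compile in process instead.
            return compile_locally(method)

fast = compile_with_fallback("String.hashCode", network_delay=0.0)  # server answers in time
slow = compile_with_fallback("HashMap.get", network_delay=0.5)      # deadline missed
```

A real implementation would also have to decide what to do with the remote result if it eventually arrives after the local fallback has already been installed; this sketch simply ignores it.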
Also, the experiments we did before were run without encryption. The moment you enable encryption, as the prototype stands right now, every single compilation request will do the handshake, which means we degrade ramp-up even more. So that's another thing we want to look into: maybe we can cache the handshake once it's done, so that we don't have to keep doing it over and over again once we've established a connection that we trust. And finally, we want to merge this back into the OpenJ9 master. Currently it sits in Eclipse OpenJ9, but in its own branch; a lot of work has been done to separate it and have it run independently, but eventually we do want to move it back so that it's one big happy project.

Here are some links, as promised. The top two links are the source code; you can get it and take a look. It is very much in active development and an area where we're more than happy to get contributions and suggestions, or whatever ideas you might have; feel free to message us on the OpenJ9 Slack channel. I included the OpenJ9 website so you can get instructions on how to join the Slack channel; the link is way too big to put here. And finally, if you're interested in a demo, which I was not brave enough to do, one of my coworkers did a demo at Code One, so you can follow the YouTube link and take a look at how all of this works. If you have any other questions, feel free to ping me or the OpenJ9 Slack channel, because people are actively listening. That's it; thank you for your time. I'll take any questions.

Q: Thanks, this is interesting. How many back calls do you usually do for a compilation?

A: Because I don't directly work on this part, I don't know the exact numbers. I do know it's a lot, and they have been working on reducing it quite a bit, because in the beginning it was just ridiculous. I don't have exact numbers, though, but if you either email me or message on the
Slack channel, the people who are working on it will have the answers for you; either I'll forward the request, or you can ask on the OpenJ9 Slack directly. I just don't have it offhand.

Q: Could you give some indication of what data you're passing in the initial request, and what type of data is in the back calls?

A: I'm not sure of all the details, obviously, but we do pass the bytecodes, as well as some other metadata; I'm not sure exactly what is sent at the beginning, but for sure the bytecodes, and I think the bytecodes of the entire class of the method, in case it needs to inline other things. When we send the result back, we send metadata about how to relocate it in the new space, as well as stack maps, inline tables, GC maps, all the information the client needs so it can install the body properly and so that when a GC occurs, it knows what's going on. Profiling data we request in between: the profiling happens on the client, and normally we have an IProfiler and the compiler just says, IProfiler, give me data. Now that's been separated over an API, so the compiler still just asks for IProfiler data; it doesn't really know the request goes over the network. The client gives it back, and it does so with protobufs so that it can maintain the structures.

Q: Did you look at security issues for the compiler itself? Because now the code can trigger buffer overflows in the compiler.

A: Security is obviously going to be a concern. Right now it's still very much in the prototyping stage; we've looked into the security of just the communication, so you don't have problems there. But in terms of the server, the compiler, you're right: eventually more time will have to be spent on making sure we're mitigating every possible issue.
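To make the earlier answer about payloads concrete, here's a rough sketch of what the two messages might carry: bytecodes (plus context for inlining) on the way out, and the compiled body plus installation metadata (relocation records, stack maps, inline tables, GC maps) on the way back. The field names are illustrative only, not the actual protobuf schema used by the prototype.

```python
from dataclasses import dataclass, field

@dataclass
class CompilationRequest:
    """Client -> server (illustrative fields, not the real OpenJ9 schema)."""
    class_name: str
    method_name: str
    bytecodes: bytes            # bytecodes of the method (and its class, for inlining)
    opt_level: str = "warm"     # hypothetical optimization-level hint

@dataclass
class CompiledBody:
    """Server -> client: everything needed to relocate and install the code."""
    machine_code: bytes
    relocation_records: list = field(default_factory=list)  # how to fix up addresses
    stack_maps: list = field(default_factory=list)          # for walking stack frames
    inline_table: list = field(default_factory=list)        # what got inlined where
    gc_maps: list = field(default_factory=list)             # so the GC knows live references

# Example round trip, with made-up byte strings in place of real code:
req = CompilationRequest("java/util/HashMap", "get", bytecodes=b"\x2a\xb6\x00\x10")
body = CompiledBody(machine_code=b"\x55\x48\x89\xe5")
```

The point is that the response is not just machine code: without the accompanying metadata, the client could not relocate the body into its code cache or keep it safe across GC cycles.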
We don't want clients to be able to send code, or bytecodes, that can disrupt the server; you have to have some notion that the client is a trusted client. But it's still a prototype, so we haven't gotten that far into the specifics. You're right that security is very much a concern, and it's not going anywhere, especially when you go distributed.

Q: I have two quick questions. One is, have you considered the use case of using it locally? You're very much focusing on over-the-network, but it seems to me that with the current way of having containers, you could have one container which is the compiler and then many application containers.

A: You definitely could. It's still talking over the network, but the whole idea was to move it out of process, so you're right: if you change the communication protocol, you could easily have one container and talk to the other ones that way. I guess this is a more general problem: if your containers are running on one provisioned set of resources, maybe it's better to move the compiler somewhere else, but you could definitely do it that way too.

Q: And my second question: now that you have the compiler completely separate from the VM, have you considered changing the compiler a lot, or plugging in other compilers?

A: Because this is done in OpenJ9, and the compiler-JVM interface is quite tight, you would need to reimplement that interface. As long as you implement the JVM-JIT interface, I suppose theoretically you could try to plug in a different compiler if you had one; you would just need to talk to the OpenJ9 VM in the same way and know how to interpret the data in the same way. But theoretically, yes, you're right, it could be done.

Q: Have you looked at checksums, and are you
signing the code that you're sending for each unit of work?

A: Right now, I'm actually not sure about the answer to that, because it's still very much a prototype; we just have a network call, and I guess the network call could be augmented to do what you're saying, but at the moment we don't. We're still trying to get it to a better standard than it is right now; we're still improving it. If you want a proper answer, just email me or ask on the Slack channel, and the people who know more about it will be able to answer you for sure.

Q: Would it make any sense to replace the on-demand compilation with ahead-of-time compilation? Because then, when you start the app, you could technically send the whole application to the cloud to compile.

A: OpenJ9 does have AOT, after all, but one of the big problems with AOT is that if you specialize too much, you lose the ability to share. Say you have profiling information from one run, and the code compiled from it happened to work; it may not hold in a second run, in which case you can't share that code anymore. With this approach, you get a one-to-one mapping for each client. Of course, when it comes time to share code, you're going to have to deal with the same problem, where you need to figure out what the best answer is, but that is another open area. This is, again, a general problem for constrained environments: maybe you can't afford to pre-populate your cache on the machines that are running the service, or you're running in a cloud environment where you don't know where these things are going to be and where the cache is going to sit; machines are just generated for you, they just pop up, so
in that case it's harder for you to say, I'm going to put the cache over here. It almost becomes a case of having a thing that will do it for you. And at the end of the day, with the sharing, you can share on the client side or on the server side; what we're going to do is still an open question, and any suggestions you have are more than welcome. If you have any ideas, we're more than happy to listen to what you have to say.

All right, thank you very much.