Thank you, Mark, who gave an admirable introduction to some of the motivation behind Project Loom. I'll go over that a little, just to add a bit more, but I'll also talk about how this stuff actually works.

One of the marvellous things about Java; well, there were two marvellous things about Java in its early incarnation. Firstly, there was essentially no innovation in Java. There was nothing new. It was just a practical synthesis of a whole bunch of stuff that already existed in other programming languages, many of which were just research languages. The genius of Java was to come up with a synthesis of some of the best ideas in computer science in a practical language that could be used by normal human beings. That was a tremendous achievement. And one of the things that Java did with concurrency and multi-threading was to get a portable definition of what a thread actually was, one that didn't really depend on any particular hardware implementation, which was a considerable achievement at the time.

But, as Mark said, we're 25 years down the road and the cracks are beginning to show. Servers in particular, but all sorts of Java applications, are running a lot of threads, and manual use of threads, i.e. using java.lang.Thread by hand, is clumsy and error-prone; you can end up with a whole load of threads hanging around not doing much. Threads are also very heavyweight, because they're built on kernel lightweight processes, and a kernel lightweight process in Linux is not really our idea of what "lightweight" means: even on a 32-bit system it can allocate a megabyte of address space just for the thread's stack. So what we actually need, again as Mark said, is some more lightweight representation.
We don't want the kernel to have to preempt threads; we want to do it in user space, because those of you who have measured it will know that on any kind of Unix-y, Linux-y system it takes an ungodly long time to get into the kernel and back out again, usually about a microsecond or so, and preempting threads is quite expensive. I should give Ron credit at this point: these are a couple of the slides that I stole from him, so thank you. I'm sure you recognize them. And Java threads don't need a lot of this kernel machinery.

So, again as Mark said, programmers have responded to this by using reactive programming. You would like lots and lots of little tasks, and the tasks all run asynchronously and they don't block. Whenever you need to block, what you actually do is send a message out: your code sends a message to the database server saying "can I have a lookup for this, please?", and the database server then sends a message to somebody else, who in turn responds to the result of the query. There are people who like doing this stuff, but it's quite difficult to write, it's very difficult to understand, and debugging is hilarious. This was done a while back, in the Smalltalk-80 user interfaces 40 years ago, where the system would be forever sending you messages which you would have to handle somehow. It's difficult to do, despite being pretty efficient. We don't need to have so many threads; we just schedule everything onto thread pools.
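To make that asynchronous style concrete, here is a minimal sketch in modern Java using CompletableFuture; the lookup function and its names are hypothetical stand-ins for the database round-trip described above.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncStyle {
    // Hypothetical stand-in for a remote database lookup: in the reactive
    // style you never block waiting for the reply, you attach a callback.
    static CompletableFuture<String> lookup(String key) {
        return CompletableFuture.supplyAsync(() -> "value-for-" + key);
    }

    public static void main(String[] args) {
        // Instead of blocking, chain a continuation that runs whenever
        // the "database" responds.
        CompletableFuture<Integer> result =
                lookup("user:42").thenApply(String::length);

        // Only for the demo do we finally block to observe the result.
        System.out.println(result.join());   // 17
    }
}
```

The callback chain is efficient, but as the talk says, the control flow is inverted: the logic lives in the continuations, not in straight-line code.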
Now, if your thread pool is the same size as the number of actual cores in your physical machine, that works beautifully, because the operating system never needs to preempt any threads. Yes, I've got a bit ahead of myself. Oh yes, and unit tests are difficult to maintain in this style.

So why not reduce the weight of thread instances? Threads don't need very much context, certainly an awful lot less than kernel threads do, and we should be able to get the total footprint of a thread down to a few hundred bytes. This has been done in the past; there were systems going back as far as the 1960s, I believe, with super-lightweight threads doing this sort of thing, so there is a fair bit of prior art here. But how can we bring this to Java?

So yes, there are OS threads: heavyweight, megabytes; I've said all of that. Here's what we can do: we can get user-space threads down to a few hundred bytes. The important thing here is that the stack for a user-space thread is just going to be a few hundred bytes, whereas the smallest stack that a kernel thread can possibly have is one kernel page, and those are allocated lazily as you grow your stack. Clearly, if you can fit several thread stacks into a single page, which you certainly can (pages are typically 4K, but they may be as big as 64K), you really are winning big time.

But thread locals and Thread.currentThread() are used all over the place, so some parts of the API are no longer a match for this lightweight way of working. The Thread API could be cleaned up by removing long-deprecated methods, but as we all know, it's extremely difficult to remove anything from Java. So we really need a better abstraction for all of this. Here we are: virtual threads. Let's implement just what we need, let's switch between threads in user space, and let's only keep as much stack as we need. Now, we won't be able to block in native code,
because that requires a full kernel task switch, but there is a way around this.

There have been a few changes in the project recently, but most particularly, a virtual thread is now a subclass of java.lang.Thread. When a virtual thread is running, it is mounted on a carrier thread; this is very important. Your carrier thread will just be a thread out of your usual thread pool, and when we switch virtual threads we have to dismount the virtual thread from its carrier thread. (These slides are in a slightly funny order.) The problem is that you've got your stack, which is sitting on a carrier thread; this is the actual stack that's been provided to us by the kernel, and the stack is the record of execution of your Java program. What we actually need to do is detach the stack frames from this stack, move them into the Java heap somewhere, and then mount another virtual thread onto the same carrier thread. And to do that, we have to copy the stack.

Now, when I first saw this I was appalled. You mean every time we dismount a thread, for any blocking call at all, we're having to copy the entire stack around? How can this possibly make any sense? Shouldn't we just allocate stacks for virtual threads on the heap instead? It's the obvious other way of doing it, rather than allocating them in the stack space. It sounds like copying the stack is going to be a fabulously expensive operation, but it actually isn't, and there are a couple of reasons for this. One is that our computers are tremendously good at bulk copying. Anybody who's spent a while observing the programs that actually run on our computers every day will have observed that they spend most of their time just moving stuff around. That's the nature of computers; that's the nature of how they're used. And therefore the people who design the computers we use have gone to extraordinary lengths to make just moving stuff around very fast, particularly with caches, and
particularly because accessing dynamic RAM sequentially is very, very quick, with prefetching and so on. But we don't actually have to copy the entire stack when a virtual thread is unmounted. All we actually have to copy are the frames that have been altered since the last time we unmounted the virtual thread. The details of precisely how this works are kind of gnarly (the technique is called return barriers, thank you, if you want to give it a name), but at that point you're starting to see why this is a cheap operation.

The other thing to observe is that Java stack frames are small, and they're small because you don't have local strings, you don't have local arrays, you don't have any of this stuff. All that's in a Java stack frame are your local variables, and your local variables are always either scalars or references to an object somewhere else. So the stacks are small, and copying them on and off when we unmount is pretty cheap.

But we can't get away with that with native code: we can't unmount stacks with native frames. The reasons for this are quite complicated, but the problem is that native stacks often contain pointers that point into the stack frame itself, and we'd have to do some really fancy footwork of relocating the stack frame if we wanted to unmount a virtual thread from a stack over here and copy it onto a carrier thread over there.

All right, so let's say we've got an unmounted thread over here somewhere. What do we do about object pointers? We know that saved on the stack there will be a whole bunch of object pointers, and the garbage collector (this is Java) is moving stuff around all the time. How will the garbage collector be able to cope with that? If you look at the structure of this class here, Continuation (a continuation is basically just the running context of a Java program; a virtual thread is composed of its continuation plus a bit more stuff), when we save the stack we just copy it into this
Java array here, which is just an array of ints. But the garbage collector is going to want to continue to run, and it's going to move objects around, which is going to invalidate some of the pointers in that int array, the copy of the stack. What we used to do was scan the whole stack, find all of the words in the stack that were in fact object pointers, or oops (ordinary object pointers), copy those into a separate object array, and expose that to the garbage collector; then, when we remounted the virtual thread, it would copy them all back. Now, the problem with this is that actually finding out which words in the stack are object pointers and which words are just integers is really quite an expensive operation. You have to trawl through the metadata of all the methods that are on the stack. Granted, the result is just a bitmap (each word is either an object pointer or it's not), but all that scanning and unscanning was really quite painful, considerably more painful, I have to say, than the business of just copying the data into and out of the array. But we have a new algorithm, which I think Ron implemented two weeks ago or something, where the garbage collector can actually scan what's in there, as long as it stays in the new generation. If the virtual thread gets promoted into the old generation, I think, then we have to do the whole thing of scanning the stack and fixing up the pointers and so on. I think that's Ron nodding. Good enough.

Right, okay: synchronized blocks. Those of you unfortunate enough to have actually written a Java VM at the very, very low level will know that the way synchronized blocks work is some very, very hairy handwritten assembly code, which makes assumptions that this really is running on the native stack and that you can block and call into the operating system and so on. We can't do anything about this with virtual threads: if you actually say synchronized, and the synchronized block
has to block, then you are going to block the carrier thread, which you really, really don't want to do, because now you've got one fewer thread that you can use to do some work. But people are more and more these days using the locks from java.util.concurrent rather than plain synchronized blocks, and these work perfectly well with Loom: virtual threads simply unmount if they block on them. So that works fine, and we've had to go through the Java I/O library replacing these synchronized blocks; it only has to be done once. And Thread.yield() hands off to Continuation.yield(), unmounting the virtual thread. Good.

Now, the next bit is called "possible futures" here; that's not really right. The first one is structured concurrency, which I think is definitely going to happen. The second one is scope locals, which may or may not happen, but I want to talk about it because it's mine, because I did it, and I think it's interesting.

Okay, so let's think a bit about structured programming. In traditional structured programming, all your control structures have an in at the top and an out at the bottom; you can reason about programs much more easily if you use structured programming, and everything nests nicely. When you think about what's actually going on with threaded programming, it is the most gloriously unstructured way of programming you can possibly imagine. Not only is it not one-in, one-out: there's one in and there are many outs, and threads spawn over here and start running over there and then send messages to each other. And with Project Loom you're going to have tens of thousands or hundreds of thousands of threads, which we can do because they're only a few hundred bytes each, so we somehow have to find a way to constrain that complexity in such a way that we can predict how a program is going to work, analyze it, and so on.
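Going back to the locking point: here is a minimal sketch of using a java.util.concurrent lock from many virtual threads (assuming a JDK with virtual threads available; on current JDKs the entry point is Thread.ofVirtual()). Blocking on the lock parks the virtual thread, unmounting it from its carrier, instead of pinning the carrier the way a blocking synchronized block would.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockDemo {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static int counter = 0;

    static void increment() {
        LOCK.lock();      // blocking here unmounts the virtual thread;
        try {             // the carrier thread is free to run others
            counter++;
        } finally {
            LOCK.unlock();
        }
    }

    static int getCounter() { return counter; }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[100];
        for (int i = 0; i < threads.length; i++)
            threads[i] = Thread.ofVirtual().start(LockDemo::increment);
        for (Thread t : threads) t.join();
        System.out.println(counter);   // 100
    }
}
```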
So simply firing off thousands and thousands of threads and passing messages back and forth probably isn't going to work all that well. Yes: all-spaghetti Fortran programs have got nothing on this.

So here we are: structured concurrency. This is the idea, and it's a very, very simple idea: if you have a thread and it splits into a whole bunch of other threads, then we want to have a join at the bottom when all of the other threads terminate, and we carry on. This is structured. And what's more, if the threads have just done some purely functional computation for you and they all join together at the end, what you've actually got there is a function. You can analyze it, you can reason about it. It has no side effects, but you've been able to use concurrency to make it more efficient.

So here is an example of a structured concurrency construct. This is your executor service here; you submit two tasks to execute, and this is a try-with-resources, so when you get to the end they both join, and it won't terminate until they're both finished. The executor's submit returns a future that can be queried for the result of whatever computation you were doing there. I don't handle it here, but you would need to assign it to something. Error handling and cancellation work much better, because everybody joins, and therefore it's the responsibility of the join point to handle anything that went wrong in any of the threads you just spawned. Thread cancellation is pretty cool as well, because you can either cancel one of the child threads, or you can cancel the parent thread, in which case it all gets propagated, and so on. Now, this doesn't require virtual threads: structured concurrency works perfectly well with any kind of thread, but it's very nice for this kind of work. And I've been looking at how far back this goes, and I'm fairly sure that the Burroughs large systems of the 1960s worked this way. So yeah, it's good.
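The construct on the slide can be approximated on a current JDK like this. A sketch: Executors.newVirtualThreadPerTaskExecutor() and the auto-closing ExecutorService shipped after this talk, but the shape is the same, with close() at the end of the try block waiting for both tasks.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StructuredDemo {
    static int[] runTasks() throws Exception {
        try (ExecutorService e = Executors.newVirtualThreadPerTaskExecutor()) {
            // submit() returns a Future that can be queried for the result;
            // here we do assign them, as the talk recommends.
            Future<Integer> f1 = e.submit(() -> 6 * 7);
            Future<Integer> f2 = e.submit(() -> "loom".length());
            return new int[] { f1.get(), f2.get() };
        }   // try-with-resources: nothing escapes until both tasks are done
    }

    public static void main(String[] args) throws Exception {
        int[] r = runTasks();
        System.out.println(r[0] + ", " + r[1]);   // 42, 4
    }
}
```

The one-in, one-out structure is visible in the source: both forks and the implicit join live inside a single block.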
Okay, now thread locals. Java's ThreadLocal is kind of heavyweight; it's slow, and all the rest of it. Now I'm going to try to open a link. Let's see. Here's one I opened earlier. This is an analysis that I did a couple of years ago of what actually happens when you say ThreadLocal.get(), and there's a tradition in my talks at FOSDEM: none of them is complete without some assembly language. Okay, thank you. This is what ThreadLocal.get() actually does. It does a whole bunch of reads from thread metadata: read threadLocals, the thread-local table, the length field, the thread-local hash code; look it up in a table; do the garbage-collector magic, because it's a weak reference; and now we have our thread local and we're done. So ThreadLocal.get() is 12 field loads and five conditional branches. This is not, by any standard, a lightweight operation.

So this was a couple of years ago, and the question is: can we actually do any better than this? I kind of hope we can. So this is the proposal for scope locals, and the idea here is to do something rather similar to structured concurrency. You will bind the scope local at some point in your program's nesting. It will then be visible to everything you call, via a get on the scope local, and it will disappear when you exit the same scope. Scope locals will also be inherited by your child threads in your structured concurrency, and it makes sense for them to be inherited. It doesn't make very much sense for these scope locals to be mutable. I don't think it makes very much sense for thread locals to be mutable either, frankly, but I don't think anybody understood that when they were done 15 years ago or whatever it was. Okay, and here's what it looks like. You would launch your child tasks, but you would bind a value to your integer there; it executes foo from bar, and then in foo you would say MY.get(). So it's like a thread local, but we've made it better by making it less powerful.
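As a sketch of what that slide is showing, the binding looks something like this. The names here (ScopeLocal, where, run) are reconstructed from the proposal and were not a shipped API at the time; the idea later evolved into java.lang.ScopedValue.

```java
// Sketch only: illustrative names from the scope-locals proposal.
static final ScopeLocal<Integer> MY = ScopeLocal.newInstance();

void bar() {
    // Bind MY to 42 for the dynamic extent of this run(); the binding is
    // immutable and disappears when run() returns. Child threads spawned
    // under structured concurrency inherit it.
    ScopeLocal.where(MY, 42).run(() -> foo());
}

void foo() {
    System.out.println(MY.get());   // 42: bound by our caller
}
```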
This is a crucial observation about a lot of this stuff: we've got interfaces and constructs that are tremendously powerful, but in many ways they are too powerful, and the sheer versatility of these constructs gets in the way of doing really efficient implementations, and it also gets in the way of the programmer being able to reason about invariants and so on.

So, scope locals: a fixed-size cache is the magic. What I've done is that every carrier thread has a fixed-size cache of 16 entries, where the most recently used scope locals are stored, and from the thread we load a pointer to the scope-locals cache. C2 is plenty clever enough to hoist scope locals into registers, and that means the 12 loads and five conditional branches that I showed you for ThreadLocal can be drastically reduced: if scope locals are used in a loop, they will be hoisted into registers at the start of the loop, and they will stay there.

So this is what it looks like. You've got a carrier thread here, which has this 16-entry cache for scope locals; you've got virtual threads, and every one of the virtual threads has a scope-local hash map for its local bindings. When your program queries a scope local, it will search through the virtual thread and through its parents in the structured concurrency (essentially it looks like a cactus, if you can imagine one with multiple branches) and load the value into the cache, and the next time the cache is queried it will find the value of your scope local and lift it out, and that's very, very fast: with a cache hit for your scope local, the cost is basically just a couple of instructions. This works because scope locals are immutable, and this is absolutely crucial. Immutability is fantastic, because once you've guaranteed that something is immutable, you can copy it, you can cache it, you can do all of these wonderful things.
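To pin down the semantics (not the real implementation, which lives in the VM with the per-carrier cache described above), here is a toy model of dynamically scoped bindings; every name in it is illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a scope local: a binding is visible only for the duration
// of the where(...) call that established it. The real implementation is
// in the VM, with the 16-entry per-carrier-thread cache described above;
// this toy (ironically) uses a ThreadLocal just to get a per-thread map.
public class ToyScopeLocal<T> {
    private static final ThreadLocal<Map<ToyScopeLocal<?>, Object>> BINDINGS =
            ThreadLocal.withInitial(HashMap::new);

    public void where(T value, Runnable body) {
        Map<ToyScopeLocal<?>, Object> m = BINDINGS.get();
        Object saved = m.put(this, value);
        try {
            body.run();                       // binding visible in here
        } finally {
            if (saved == null) m.remove(this);
            else m.put(this, saved);          // restore any outer binding
        }
    }

    @SuppressWarnings("unchecked")
    public T get() {
        return (T) BINDINGS.get().get(this);
    }

    public static void main(String[] args) {
        ToyScopeLocal<Integer> x = new ToyScopeLocal<>();
        x.where(42, () -> System.out.println(x.get()));   // 42
        System.out.println(x.get());                       // null: unbound
    }
}
```

Note that there is no set() at all: rebinding is only possible by nesting another where(), which is exactly the "less powerful, therefore faster and easier to reason about" trade described in the talk.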
So, by making thread locals less powerful, giving them a fixed lifetime, and doing it in the simplest possible way, what we've got is tremendously improved performance. Effectively, the cost of a scope-local get is about the same as the cost of loading a field from an object. And I'm done.