Good evening everyone. Thanks for coming. My name is Petr Chalupa. I'm with Oracle Labs and I work on a project called JRuby+Truffle there. Let me first just make sure that you understand that JRuby+Truffle is just a research project at Oracle Labs. It has nothing to do with Oracle products, and any opinions expressed in this talk are my own.

So let me start by introducing the two projects which we'll touch on in this presentation. The first one is called concurrent-ruby. It's basically an unopinionated collection of high-level and low-level concurrent abstractions, where unopinionated means that we are not trying to force any solution on you: you can choose the solution which works best for your particular problem. We support MRI, JRuby, and Rubinius, and we're also working on support for JRuby+Truffle right now. This gem is already used in Rails 5, Sucker Punch, Dynflow, and many more projects, and we've just released the 1.0 version, so it's a great milestone for us. From the high-level abstractions we have, for example, chainable futures, Go-inspired channels, actors, Clojure-inspired agents, software transactional memory, and more. We have atomic references, thread-safe data structures, and some low-level synchronization primitives like CountDownLatch, Semaphore, CyclicBarrier, and so on, which help you to create other concurrent abstractions more easily.

The second project, which was already mentioned, is JRuby+Truffle, an experimental backend in JRuby. It's part of the same repository, so it's open source. It's basically an AST interpreter, but with the key property that it's a self-optimizing abstract syntax tree interpreter. That means that as the interpreter executes your code, it can profile which branches were taken or which types it saw, and based on that the nodes can specialize. After some time, when we can assume that this tree of nodes is stable, we can feed it to the Graal compiler.
Graal can then produce highly optimized machine code for us. If some of these optimistic assumptions fail, we can invalidate the compiled code which was relying on those assumptions, go back to the interpreter at that point, let it run some more, specialize again, and compile again. So this is basically the gist of why JRuby+Truffle is quite fast, and we will see some of the results.

JRuby+Truffle supports the whole of Ruby without any restrictions. We also support the tricky parts like debugging, set_trace_func, and ObjectSpace. You don't need any options to turn them on; they are always on, and they don't have any overhead if you are not using them. We are at about 90% compatibility based on the Ruby specs. As I mentioned, it's part of the same repository as JRuby, so it's also part of the distribution already: if you add the -X+T option, you can already try some micro-benchmarks or very small projects, and we already run some of the gems which are listed at the bottom.

So now let's move to the main topic of this talk, which is implementing a concurrent abstraction. Let me start by talking a little bit about why concurrency is actually difficult. The main reason is that processors and compilers are free to change the order in which your code is actually executed, as long as they keep sequential consistency and you don't tell the compiler that you don't want some pieces to be reordered. You might think that it would be good to just prohibit this behavior, but it is actually very desirable, because the compiler and the processor can do lots of optimizations thanks to these reorderings, so you cannot just forbid it. The result, though, is that another thread may see very strange values: on thread A the code is actually executed in a different order than we wrote it, so thread B sees the writes to memory in a different order.
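Going back to the self-optimizing interpreter for a moment, the specialize-then-deoptimize cycle can be caricatured in plain Ruby. This is only a toy sketch under assumed names; the real Truffle framework works at the Java level and Graal compiles the stable tree. The node below profiles its operand types on first execution, takes a fast path while that assumption holds, and falls back to a generic path when it fails:

```ruby
# Toy self-specializing "add" AST node (vastly simplified caricature).
class AddNode
  attr_reader :state

  def initialize
    @state = :uninitialized
  end

  def execute(a, b)
    case @state
    when :uninitialized
      # first execution: profile the operand types and specialize
      @state = (a.is_a?(Integer) && b.is_a?(Integer)) ? :integers : :generic
      execute(a, b)
    when :integers
      if a.is_a?(Integer) && b.is_a?(Integer)
        a + b               # fast path a compiler could turn into a machine add
      else
        @state = :generic   # assumption failed: "deoptimize" to the generic path
        execute(a, b)
      end
    when :generic
      a + b                 # generic, always-correct path
    end
  end
end
```

After the first integer addition the node stays on the integer fast path; feeding it floats invalidates the assumption and it permanently falls back to the generic path, mirroring the invalidate-and-recompile cycle described above.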
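The write-reordering hazard itself can be sketched like this. The class and names are assumed for illustration, and the broken interleaving cannot be reproduced deterministically on MRI because of the GVL:

```ruby
class Box
  attr_reader :value

  def initialize(value)
    @value = value   # write 1: store the field
  end
end

box = Box.new(21)    # write 2: publish the reference

# The compiler or CPU may legally reorder the two writes as seen from
# another thread: that thread could observe `box` already non-nil while
# box.value is still nil, so `box.value * 2` would raise NoMethodError
# there. On the writing thread itself the result is always 42.
box.value * 2
```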
So, to be able to reason somehow about this reordering, we have to consider all the possible valid orders in which our code can be executed. And to construct these orders, we need some kind of framework to figure them out. We call it a memory model, and I will talk about that a little bit later. Of course, this has a greater impact on JRuby+Truffle, because JRuby+Truffle is able to optimize your Ruby code into many fewer instructions, so you can actually observe these reordering effects more often.

So let me show you an example of what can happen. In this example, we have a simple class with just one field, value, and we will be assigning a new instance of this class to a local variable. So we have here two writes to memory: this is the first one, and here is the second one. And we are reading them both here on line 17. If the compiler tries to optimize this, you can see that it doesn't matter in which order these two writes are executed, so it can actually switch them. But if that happens, on this read you can actually observe that the instance was already visible but the value wasn't, because the order was switched, so this can actually raise an exception.

So now let's move to the actual implementation. For this talk, I picked a simple abstraction called a future, which is just a reference to some computation which was not done yet. It has a very simple API: you can create a new future; you can fulfill it with some value; you can pick the value up with the value method, which blocks if the future was not fulfilled yet; or you can check whether the future was completed or not, which is a non-blocking operation.

So let's look at an example of how this can be useful. With a future, we can build simple background processing. At the top there is just a small helper which prints timestamped output, and here we have the background processing itself.
We build it by creating one shared queue for the jobs. Then we have two workers here, which are basically just two threads that loop, popping pairs of a job and a future off the work queue. A worker computes the result and fulfills the future with it. At the end, it just prints the result together with the time when it was computed. With that, we can create this simple async helper, which allows you to execute Ruby blocks of code asynchronously. This method returns immediately, and it returns a future instance. You can then query the future and call value, which actually blocks your thread until the job has been computed.

So then we create an array of five jobs here, which will just multiply the index by two. This call returns immediately with five futures; this line is just to check that the array actually contains futures; and at the end, this will block until all of the values are computed. We can quickly run it. You can see that it takes some time to compute, because I put a sleep in there just to slow it down, and at the end it prints all of the results.

So now we can look at the first implementation of the future. For this first one, we use just the tools which are already present in Ruby's standard library: a mutex and a condition variable, where a mutex is basically a lock, and a condition variable allows you to block threads until some condition is met. In our case, that condition is the future becoming fulfilled. We also use an instance variable to store the current value of the future. The lock gives you basically two things. The first one is the critical section: you can call the synchronize method here, and this ensures that only one thread at a time can enter this block of code. It also has another property: when one thread makes some changes in this section, another thread entering the same section always sees all the changes made by the first thread.
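The pattern just described can be sketched with only the standard library. Here each job's future is stood in for by a SizedQueue of size one: pushing the result fulfills it, and popping blocks until it is fulfilled. The names (JOBS, async) are assumptions for illustration, not from the slides:

```ruby
JOBS = Queue.new

# two workers popping [job, result] pairs off the shared queue
WORKERS = Array.new(2) do
  Thread.new do
    loop do
      job, result = JOBS.pop
      result.push(job.call)   # "fulfill" the stand-in future
    end
  end
end

# returns immediately with something future-like the caller can block on
def async(&job)
  result = SizedQueue.new(1)
  JOBS.push([job, result])
  result
end

futures = Array.new(5) { |i| async { i * 2 } }
futures.map(&:pop)            # blocks until every job is computed
```

Because each stand-in future is its own queue, mapping over them in order yields the results in submission order even though the two workers may finish jobs in any order.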
With that, we can implement the first future. Let's start with the complete? method. We first read the current value, which we have to protect with synchronize, and then we can compare it against pending. In the value implementation, we again have to use the critical section to make this atomic, because we have to make sure that we put the thread to sleep here only when the future is not completed. So if the future is completed, we just return the value; otherwise we continue here, and the thread is blocked until it's woken up by the broadcast call on the condition variable, the same condition variable, in the fulfill implementation. That's the reason why you need these critical sections here.

So now let's look at how this performs. Across the different implementations, the scale is in seconds, and it's five million operations for each of the complete? and value microbenchmarks, and two and a half million operations for the fulfill part. This is trying to simulate that you usually read the value more often than you fulfill it, since it is fulfilled only once.

So now let's think about how we could improve the performance. We'll start by looking at the MRI implementation, and the first observation is that synchronization, going through the critical section, is actually expensive, so we will try to avoid it. For that, we use the fact that MRI has the GVL, the global VM lock. If you look into the source code, you can find that there is a C mutex in gvl_release and gvl_acquire. This C mutex has the same properties as the Ruby Mutex I was just talking about, which basically means that when one thread releases the GVL and another acquires it, the second thread always sees all the changes made by the first thread. This implies that MRI instance variables are effectively volatile, in the Java sense. Volatile means that if you write to a volatile variable, all readers will immediately see the current value.
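A minimal sketch of this first mutex-and-condition-variable future; the class name and sentinel are assumptions, and the real concurrent-ruby classes differ:

```ruby
class MutexFuture
  PENDING = Object.new   # sentinel for "not fulfilled yet"

  def initialize
    @mutex     = Mutex.new
    @condition = ConditionVariable.new
    @value     = PENDING
  end

  def complete?
    @mutex.synchronize { !@value.equal?(PENDING) }   # read under the lock
  end

  def value
    @mutex.synchronize do
      # check-and-sleep must be one atomic step; looping also guards
      # against spurious wake-ups
      @condition.wait(@mutex) while @value.equal?(PENDING)
      @value
    end
  end

  def fulfill(value)
    @mutex.synchronize do
      raise 'already fulfilled' unless @value.equal?(PENDING)
      @value = value
      @condition.broadcast   # wake every thread blocked in #value
    end
    self
  end
end
```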
If you read from a volatile variable, you cannot get a stale or old cached value. And I have to warn you that this is actually undocumented behavior, even though a lot of Ruby code and libraries depend on it, intentionally or unintentionally.

So let's look at the implementation specific to MRI and its GVL. It's pretty similar: we again need a mutex and a condition variable. But here in the complete? implementation, we now don't have to protect reading from the instance variable, because we know this is a volatile read, which means we always get the most up-to-date value. So we can just read the value, compare it with pending, and that's it. In the value implementation, we can avoid going through the synchronized block by first checking whether the future is complete, and if it is, we return immediately. If the future is not fulfilled, we still have to go through the slow path, and there we have to re-check, because the future could have been completed between this check and entering the critical section here. So we re-check whether the future is complete. The fulfill method is the same as before. You may ask why its check was not moved out in the same way as in the value method: it's because this is actually an exceptional path, which means that if you moved it out, all of the correct calls to fulfill would pay the price of checking twice.

Okay, so now we can look at the performance improvement. As you can see, the value and complete? microbenchmark parts are much, much better. So let's look at what we could do for JRuby, JRuby+Truffle, and Rubinius. For JRuby+Truffle, we will use the Rubinius implementation, because JRuby+Truffle also implements some of the Rubinius APIs, so it will work there too. On these three implementations, instance variables are not volatile, and method calls are not protected in any way either, including initialize.
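The MRI-specific variant with the lock-free fast path can be sketched as below. It relies on the undocumented, effectively-volatile behavior of instance variables under the GVL described above, so the fast path is only safe on MRI; the name is an assumption:

```ruby
class GVLFuture
  PENDING = Object.new

  def initialize
    @mutex     = Mutex.new
    @condition = ConditionVariable.new
    @value     = PENDING
  end

  def complete?
    !@value.equal?(PENDING)     # plain read, no lock: "volatile" under the GVL
  end

  def value
    return @value if complete?  # fast path avoids the critical section
    @mutex.synchronize do
      # re-check: fulfill may have run between the fast path and the lock
      @condition.wait(@mutex) while @value.equal?(PENDING)
      @value
    end
  end

  def fulfill(value)
    @mutex.synchronize do
      # no early check: this is the exceptional path, tested only once here
      raise 'already fulfilled' unless @value.equal?(PENDING)
      @value = value
      @condition.broadcast
    end
    self
  end
end
```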
This actually means that if you remember the first simple mutex future implementation, its constructor was not correct: if you recall the first reordering example I showed, where you can observe an object with uninitialized instance variables, you can get exactly that here. So we need to fix it somehow, and we can do it with final instance variables, meaning variables which, by convention, we assign only once, in the initializer. We then somehow ensure that the assignment always happens before the new instance is published and shared with other threads, so it cannot happen that other threads see those final variables uninitialized.

So let's move to the example, starting with the Rubinius implementation. Again we need a lock and a condition variable for the slow path. And to store the value, because as we said the instance variable is not volatile there, we use an atomic reference, which basically holds a single variable with volatile semantics. To protect these variables from being seen uninitialized by other threads, we insert a full memory barrier here, which tells compilers and processors not to reorder anything across it, in either direction. Now that we are sure this is never seen uninitialized, we can look at the rest of the implementation. In the complete? method we just read the value with volatile semantics, which is done with the get method; the rest is the same. The value implementation is again the same algorithm: we first check whether the value is already set, and if it is, we return immediately; otherwise we go through the critical section and block the thread here until it's woken up by this broadcast here.

The JRuby implementation has the same shape; we just have to swap the Rubinius-specific parts for JRuby-specific parts. For storing the value, we will use AtomicReference from the java.util.concurrent.atomic package.
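The atomic references used on Rubinius and JRuby are platform-specific, so here is a portable mutex-backed stand-in with the interface the talk relies on (get, set, compare_and_set); the class name is an assumption, and native implementations provide the same interface with volatile semantics and no lock:

```ruby
# Mutex-backed stand-in for an atomic reference with volatile semantics.
class SimpleAtomicReference
  def initialize(value = nil)
    @mutex = Mutex.new
    @value = value
  end

  def get
    @mutex.synchronize { @value }
  end

  def set(value)
    @mutex.synchronize { @value = value }
  end

  # Atomically set to new_value only if the current value is still
  # `expected` (by identity); returns true when the swap happened.
  def compare_and_set(expected, new_value)
    @mutex.synchronize do
      return false unless @value.equal?(expected)
      @value = new_value
      true
    end
  end
end
```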
To insert a full memory barrier, we use a full-fence utility from the JRuby class, which does the same thing. We will not be using a mutex and condition variable here, because Java objects have these utilities built into themselves, and through JRuby we can use those methods. So instead of having a lock here and calling synchronize on it, we will use the actual Java object representing this future, which you can get with the JRuby.reference method. We will use wait here to block the thread if the future was not fulfilled, and we will use notify_all to wake up all the waiting threads in the fulfill method.

So now we can look at how this performs. As you can see, the specific implementations improved things on all of the Ruby implementations. For JRuby it looks pretty good, I think. But now we have a problem: we have three different implementations, and this is quite error-prone. Imagine doing that for all of the abstractions in concurrent-ruby; it's not really maintainable. To solve this problem, we need some kind of layer which solves these problems for you, so you can write your abstraction just once against this layer and it will work on all of the implementations. For that, we need a memory model and memory-model extensions.

A memory model is basically the framework which allows you to reason about how your program behaves in a concurrent environment, about all the possible orders it can be executed in. We've constructed one; it's still a work in progress, but it has already helped us reason about our abstractions in concurrent-ruby and make them correct. The way it is constructed is that we took the behavior of all of the Ruby implementations and combined them together. For example, if one implementation has volatile instance variables and another doesn't, then we say in the memory model that they are not volatile, because then we can make sure, with the help of the extensions, that code written against this model will work on all of the implementations.
One of the things this memory model defines is, for example, whether variables are atomic, volatile, or serializable. As you can see, instance variables are only atomic and serializable, so this still doesn't solve the initialization issue; you still have that problem. For that, we have some extensions in concurrent-ruby. There is Concurrent::Synchronization::Object, which provides three methods. The first one is the class method safe_initialization!, which marks the class and all of its children as having safe initialization, which allows you to construct final fields there. Then there is attr_volatile, which creates a volatile reader and writer for an instance variable for you. It cannot actually change the behavior of the instance variable itself, so you have to go through these readers and writers. And there is also attr_atomic, which, besides the volatile reader and writer, also creates some atomic methods like compare-and-set and swap; I'll talk about those a little more later.

This is implemented by providing different implementations for each of the Ruby runtimes, or even different versions. This is quite flexible: we write our abstractions once, and we can then evolve this layer for new versions. For example, if one of the implementations decides to support volatile instance variables natively, we can just use that and drop our Java extensions, or whatever we have for the given platform.

So now we can look at an example of writing against this layer. We have another future implementation, and we start by inheriting from Object in the Concurrent::Synchronization namespace, and we mark it with safe_initialization!, so we can be sure that these two variables are visible by the time a new instance of this class is shared. We also need one volatile field to store the current value; it needs to be volatile so that here we always read the most current value.
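As a rough illustration of the reader/writer pair that attr_volatile generates, here is a stand-in that defines the accessors behind a mutex. This is only a sketch: the real gem emits platform-specific volatile accessors with no lock, and the module and variable names here are assumptions:

```ruby
# Rough stand-in for the shape of attr_volatile: it defines a reader and a
# writer for the named attribute, backed here by a per-object mutex.
module AttrVolatileStandIn
  def attr_volatile(*names)
    names.each do |name|
      ivar = :"@volatile_#{name}"
      define_method(name) do
        @__volatile_mutex.synchronize { instance_variable_get(ivar) }
      end
      define_method(:"#{name}=") do |value|
        @__volatile_mutex.synchronize { instance_variable_set(ivar, value) }
      end
    end
  end
end

class VolatileExample
  extend AttrVolatileStandIn
  attr_volatile :value   # generates #value and #value=

  def initialize
    @__volatile_mutex = Mutex.new
    self.value = :pending   # must go through the generated writer
  end
end
```

The important design point survives the simplification: all access goes through the generated methods, so the layer can swap in whatever mechanism the platform needs.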
The rest of it is pretty much the same as before. Again, we return early if the future was already completed; otherwise we go into the critical section and block the thread here until it's completed. The fulfill part is again the same; we just have to use the writer here, which was created for us by attr_volatile.

But now let's look at how this performs, because if we have this abstraction layer, it's important that it performs pretty much the same. It adds some overhead, but it's almost identical for JRuby+Truffle, which is very good: JRuby+Truffle is able to optimize away all the abstractions provided by this layer. For Rubinius, value and complete? are good; there is some issue with the fulfill part. JRuby is good. On MRI there is a little overhead, because MRI doesn't do method inlining, so we actually do more method calls here. So it has a slight overhead, but I still think it's worth it.

Now we can improve the final part we left in there, which is the fulfill method: it still always goes through the critical section. For that we will need attr_atomic, which creates the volatile reader and writer but also some atomic operations like compare-and-set and swap. With compare-and-set, you supply what you think the value is, and if that still matches the current value in the field, it is set to the new value, and you get back a boolean telling you whether it was actually set or not. The way this is usually used is that you construct a loop where you first read the current value from the field, then you compute a new value, and then you try to set it with the atomic compare-and-set operation at the end. If it succeeds, you break out of the loop; if it fails, you repeat. If there is contention, say three threads trying to do the same thing, then one of them succeeds and the rest repeat until they succeed too.
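The read, compute, compare-and-set retry loop just described can be demonstrated on a shared counter. CasCell here is a mutex-backed stand-in for an atomic cell (an assumed name for illustration); real implementations use a hardware compare-and-swap instead of a lock:

```ruby
class CasCell
  def initialize(value)
    @mutex = Mutex.new
    @value = value
  end

  def get
    @mutex.synchronize { @value }
  end

  def compare_and_set(expected, new_value)
    @mutex.synchronize do
      return false unless @value == expected
      @value = new_value
      true
    end
  end
end

COUNTER = CasCell.new(0)

threads = Array.new(3) do
  Thread.new do
    1_000.times do
      loop do
        current = COUNTER.get                                   # 1. read the current value
        break if COUNTER.compare_and_set(current, current + 1)  # 2. try to publish current + 1
        # 3. another thread won the race; repeat
      end
    end
  end
end
threads.each(&:join)
COUNTER.get   # every increment survives the contention: 3 * 1_000
```

A plain `@value += 1` without the loop would lose increments under contention; the CAS retry makes each increment either take effect or be retried.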
These operations are essential for building lock-free abstractions, and we can use them to get rid of the last critical section, in the fulfill method. So let's look at it. This is slightly more complicated, and the trick is that instead of having a single condition variable on which we block all the threads, we construct a list of the threads which are blocked on this future. To represent that, we have this node, which holds the thread it represents, and it also has a flag to confirm that the thread was woken up successfully. In the future class we again need safe_initialization!, and we need attr_atomic to store the current value and also the head of the list. In the initialize method we set the head to nil, because there is obviously no thread blocked on this future yet, and then we set the value to pending.

The complete? implementation is again much the same: we just read the current value with volatile semantics and compare it against pending. The value method differs. The first part is the same: we quickly check whether the future is complete, and if so, we return. But if the future is not completed, we don't go through any critical section; instead we try, in a loop, to insert a new node representing the current thread into the list. This is the loop I was talking about: I read the current head here, I construct a new value, and I try to compare-and-set the new head until it succeeds. After that, we block this thread until the future is complete. When it's woken up, we just return the current value as it was read here. We also have to make sure that our node's awake flag is set to true. The way this works is that the fulfill method wakes up only the first thread in the list, and this thread, when it wakes up, wakes up the next thread. So where fulfill previously woke up all the threads, now they wake each other in a chain.
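A simplified, portable sketch of this waiter-list design is below. CasRef is a mutex-backed stand-in for the atomic references (real code would use attr_atomic or hardware CAS), all names are assumptions, and edge cases the production code handles (timeouts, interrupts) are ignored:

```ruby
class CasRef
  def initialize(value)
    @mutex = Mutex.new
    @value = value
  end

  def get
    @mutex.synchronize { @value }
  end

  def compare_and_set(expected, new_value)
    @mutex.synchronize do
      return false unless @value.equal?(expected)
      @value = new_value
      true
    end
  end
end

# One node per blocked thread; `awake` confirms the thread really woke up.
WaiterNode = Struct.new(:thread, :next_node, :awake)

class LockFreeFuture
  PENDING = Object.new

  def initialize
    @value = CasRef.new(PENDING)
    @head  = CasRef.new(nil)     # stack of waiting threads, newest first
  end

  def complete?
    !@value.get.equal?(PENDING)
  end

  def value
    v = @value.get
    return v unless v.equal?(PENDING)   # fast path
    node = WaiterNode.new(Thread.current, nil, false)
    loop do                             # CAS loop: push our node on the stack
      node.next_node = @head.get
      break if @head.compare_and_set(node.next_node, node)
    end
    Thread.stop until complete?         # re-check before every sleep
    node.awake = true                   # confirm we are running again
    wake(node.next_node)                # chain: wake the next waiter
    @value.get
  end

  def fulfill(value)
    raise 'already fulfilled' unless @value.compare_and_set(PENDING, value)
    wake(@head.get)                     # wake only the first waiter
    self
  end

  private

  # Spin until the node's thread confirms it is awake (nil means list end).
  def wake(node)
    return if node.nil?
    until node.awake
      break unless node.thread.alive?   # thread died before confirming; give up
      begin
        node.thread.wakeup
      rescue ThreadError
        # raced with the thread finishing
      end
      Thread.pass
    end
  end
end
```

Note the lost-wakeup protection: a waiter re-checks complete? before every Thread.stop, and a waker spins until the waiter sets its awake flag, so a wakeup delivered before the thread actually stops is simply retried.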
Because of that, we need to wake up here the previous head which was read there, which is the next node in the list. The fulfill implementation avoids the critical section by doing a compare-and-set on the atomic value: we check that the value is still pending, and if it is, it is set to the new value and we wake up the first node in the list of waiting threads. Otherwise, if the operation was not successful, we know somebody is trying to fulfill the future more than once, and we can raise an exception here. And this is just a helper for waking up the threads: first it checks that there actually was a node; if there wasn't, it just means we are at the end of the list. After that, there is again a loop which checks whether the thread was successfully woken; if it was, it just breaks out of the loop, otherwise it keeps trying to wake the thread up.

So now we can see how this improves the times for the fulfill benchmark: it's better on all of the implementations, because we've eliminated the path through the critical section. Now we can look at the final comparison of all the implementations, and you can notice how fast JRuby+Truffle actually is: it's more than 15 times faster than MRI in this case, for the last implementation I just showed.

So, in conclusion: if you need a concurrent tool, look first in the concurrent-ruby gem, because a lot is implemented there already, so it's probable that you'll find what you need. Otherwise, please let us know; we can cooperate on adding a new abstraction, or we can add one for you if we have the time. If you are writing a new concurrent abstraction, try to use this layer, because it gives you portability across all of the Ruby implementations, and you also won't have to care about new versions and so on, because we will obviously be maintaining it. And keep an eye on JRuby+Truffle, because I think we can all look forward to the performance it will bring to the Ruby world.
Here are some links if you want to find out more about concurrent-ruby or Truffle, and if you want to follow the news, you can follow me on Twitter. That's it from me; thank you for listening, and if you have any questions I will be happy to answer them.

Sorry, the question was what volatile means in this presentation, since it means something else in C. Volatile in C, if I understand correctly, means that the value can be changed externally, usually by hardware. What I actually meant is volatile in the Java sense: it's a visibility guarantee, that you cannot get stale reads and so on. You can also use it for publication, because when you write to a volatile variable and another thread then reads that value, it is guaranteed to also see all the changes which led up to the value written into the variable.

Sorry, the question was whether I have worked with other languages which deal with this differently. Clojure, for example, deals with it by making all values immutable; there are just a few special reference types which can hold changing values, and these are protected. What was the other language? Elixir is based on Erlang, and Erlang, as I understand it, has actors, and all of the state in each actor is isolated from the other actors. A message going from one actor to another is actually copied, so you don't have to deal with this, because you don't really share objects at all.

Okay, thank you again, and if you want to talk to me about this later, please find me.