This is a talk about race conditions in Erlang programs and how to conquer them. It turned out to be more generic than Erlang programs, as I realized after writing it down, but that's the title, so thank you for being here.

What I'm going to talk about, in brief, is what concurrency errors are and why they matter, and what people usually do to detect concurrency errors. Then I will explain systematic concurrency testing and present a tool that I have developed, called Concuerror, and I will talk about how this tool has been used industrially, with two examples.

In case we have not met, my name is Stavros, I am Greek and I live in Sweden. I have been doing Erlang, and practically only Erlang, since 2011, because I was in academia. After leaving academia I moved to industry, and that was a very clear path for me. I developed the tool during that work. I have also contributed to Erlang and its tools: I did the parallel version of Dialyzer. I am now working in Stockholm as a senior developer, consultant and trainer for Erlang Solutions.

Let's go on with the agenda. Concurrency errors are the category of errors where sometimes you run your program and it is fine, and sometimes you run your program and it fails. Symptoms of concurrency errors are clearly not unique to Erlang.
If you have been doing asynchronous programming in Java and other languages, you might have run into a case where you try to use a variable, a reference, before it has been initialized or after it has been freed by somebody else, and you were not expecting it. That is a very typical kind of error. Then there are atomicity violations: you are doing x = x + 1 in an environment where other processes might be doing the same. One process is doing x = x + 1 and another process is doing x = x + 1, and you might end up with x plus one and not plus two, as you might be expecting, because of the order of the reads and the writes back. Another very common category of concurrency errors is deadlocks: you have several processes that should be collaborating, but each one is holding some locks and not yielding them, the other one is holding the rest of them, and they cannot collaborate.

The way concurrency errors look schematically, if you are running a program, is a bit like this. You have a multi-threaded application, a multi-process application, whatever you want to call it. You run it and it gets scheduled in a particular way: this process does something, another process does something at nearly the same time, and you have this interleaving of operations, either because you have multiple cores or because you have concurrency in the sense that one process is yielding to the other so that they cooperate. Then they check some invariants and everything looks fine.
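The lost update described above (two processes each doing x = x + 1) can be reproduced by enumerating every interleaving of their read and write steps. This is a minimal Python sketch, not something shown in the talk: the process names P and Q, the two-step read/write model, and the helper functions are assumptions made for illustration.

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of (process, operation) steps on a shared x."""
    x = 0
    local = {}  # value each process read and has not yet written back
    for proc, op in schedule:
        if op == "read":
            local[proc] = x
        else:  # "write": store the stale local copy plus one
            x = local[proc] + 1
    return x


# Hypothetical processes P and Q, each performing x = x + 1 as read-then-write.
p_steps = [("P", "read"), ("P", "write")]
q_steps = [("Q", "read"), ("Q", "write")]

results = [run(s) for s in interleavings(p_steps, q_steps)]
print(results)
```

Only the schedules in which one process finishes both of its steps before the other starts give 2; every schedule in which both reads happen before the first write loses an update.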
However, if you were to run the program again, it just might be the case that instead of the nice, strict interleaving that you got before, you get this process doing this step as before, then changing over to the other process, which does something and then does more than before instead of yielding back here. Then you have this ordering of events, and what is important here is that in this execution it just so happened, randomly, because of the schedulers, that this event happened before this one. So you run your tests, everything is fine, and then you try to deploy your code and everything is failing.

A different way I have seen exactly the same idea expressed is this: you have developed your application, you have run your tests, and then you go to production. Let's see how it goes. You are the developer in the middle: here is my test, I'm good from there, I'm good from there, I'm good from there too. OK, let's deploy. What happened? Seriously, I have been testing this, I have been playing with it, I know this. What's going on?

What about Erlang?
Erlang is a programming language that famously shares nothing. You have processes, you have isolation, you have your own heaps, nobody is touching anybody else's stuff. Clearly you are in a much saner environment. And yet, if you want to do anything useful with a concurrent program, then you must collaborate, you must share: you must send some results back, you must write to some location, you must do something.

For example, in Erlang you have message passing. If you were at the workshop you might remember, or you might know if you know Erlang, that the order in which you receive messages depends on their delivery. If you have a lot of processes trying to communicate with a single process, then the order in which messages are delivered to the mailbox can affect how they are received, because it is always the oldest message that is picked out of the mailbox.

Now, as much as people don't talk about them, Erlang also has some global data structures that are shared. One example is the registry, where you assign names to processes so that you can refer to them after they have been spawned. There, processes can try to assign the same name to different processes, or they can try to assign different names to the same process. This is a form of sharing, and there are errors there. Another one is the most straightforward shared memory in Erlang applications, the Erlang Term Storage: the ETS tables that you read and write as if they were database tables, a key-value store shared between Erlang processes. And there are a lot of other ones.

Now let's try to see how a race condition might look in an Erlang program. Let's pick the Erlang registry as an example, and if you have never heard anything about it before, here is a short summary. By the way, feel free to interrupt with questions at any point. In the Erlang registry you are allowed to map an atom, which is a symbolic constant, to a process identifier, and then you can use this atom to send messages to the process that
corresponds to that name. When a process dies, when it terminates, when it exits, this association is automatically removed, and that is how you don't get a lingering name in the registry. There are two kinds of errors that are interesting in this example. First, if you are using a name to send a message and that name is not registered right now, either because it has not been registered yet or because the process has died, that's an error. Second, if you are trying to register a name for a process that is no longer running, that's also an error: you have to be registering a live process.

In Erlang, because of the scheduling and the soft real-time guarantees, and the way they are achieved, preemptions can happen at any time. Now, in small systems they won't happen at just any time: you will execute a long series of operations and you will never see this kind of interleaving. But if you have heavy load, if you have a lot of operations, whatever, preemptions can really happen at any time.

So let's see what happens if we have a program like this. This code is executed by the main process, your first process, and what it does is spawn a secondary process that will execute this code here, which waits for a message and, after a short timeout has expired, just gives up and goes away. After the main process has spawned this process, it will try to register it with the name child, and then it will try to send a message to this process. I wanted to make the previous slide a bit more cryptic by hiding the error. Some of you might have already seen what's wrong here, but let's try to just run this program and see what happens.

Let's see. I am a developer, I have my code here. This is my code as it was on the slide; I have placed it inside a module, no differences, I hope you believe me. Then I go into Erlang, I compile my FC 2019 example, and I run my function, which
is called test, and my result is ok. And ok. I run it again, and I run it again, I do my iterative development, everything is fine. So this program is fine, let's try to deploy it. But instead of trying to deploy it for real, what I will do is try to simulate an error. Does anybody see what the problem is here? I didn't give too much away, perhaps. Yes: what happens if there is an interruption there? What happens if I get scheduled out? Let's see, we can simulate that by adding a sleep. My indentation is horrible, you are right. And now I am running my program.

Notice that what I did to the program shouldn't really matter. Preemptions can happen at any time, including between these two operations that I had before. This is not a special point in my program; I could have been scheduled out anyway. I just added the sleep there so that I can really trigger this execution. Imagine now that I am a developer, I have deployed my code and I see it crashing. Maybe I have some logging, maybe I have some idea of what is going on and why it is triggered. I try to test it locally, and this is probably what I would try to do: add some yields here and there, sleeps, whatever you want to call them, and try to control the scheduling in a way that will really make it crash, in order to debug it.

OK, how does one really go about detecting, debugging, fixing, doing something about it? Usually people try to do more when they are trying to find this kind of error. They spawn more processes, they generate load for these processes, they keep the system busy, they do a kind of stress testing. The assumption there is that if you load the system sufficiently, then interesting schedules will arise naturally: you will drive your system into weird states, you will see crashes, and if you are using Erlang you will see restarts. Then you get some information that, OK, this condition can happen, let's try to fix it. Maybe you can do that, and it's a very valid way to do
it: you spawn a lot of threads, you run them all together, they will be scheduled in who knows what way, but you can detect these things in this way.

Another way, instead of doing more, is doing less: not really caring about race conditions, not really caring about extreme scheduling scenarios. Every time something is failing in tests like this, which you know are a bit flaky, you just hit "Restart job" (this is from Travis, by the way). People do it all the time. If you know that there is a race condition there and it will be painful to fix, it's like: it doesn't matter, it will never happen. That's also a valid way.

Now, if you want to do it right and you want to really debug it, what can you do? Well, you can start tracing your code. You see that this program fails in production, you have an idea about what is going on, and you can start tracing, adding some interrupts here and there. The problem with concurrency errors is that by making these little changes you will affect the scheduling yourself. Unless you are really good at guessing, or you really know what's going on, trying to debug it will change the behavior of your code. This is why these errors are called Heisenbugs: you know that they are there, but when you try to find where they are, they are not there. So what are you going to do as a developer? You clearly need a better approach.

One way to approach this better is to do concurrency testing, and the plain flavor of concurrency testing is randomized concurrency testing. The idea there is to really enforce different schedulings, but in a controlled way. For example, if you are using Erlang and you know about QuickCheck, there is a tool there called PULSE that can do that for you: after you configure it correctly, it can schedule your code to exercise different schedules, and what you get back is the trace that leads to each one of these errors. Once you get a trace and
you have a QuickCheck-flavored tool, you can shrink it, you can pinpoint exactly where the problem is. And if you are interested in this kind of thing, with this kind of approach and some math you can get probabilistic guarantees: you can say that there are no errors with this level of confidence, under some assumptions. So you do your randomized testing, you run your code, and your randomized testing will probably find everything. But what happens if the randomized testing doesn't find any errors? What can you do in that case? Are there errors? Do you still care about them?

If you do care about this kind of error, what you can do is something called systematic concurrency testing, instead of randomized concurrency testing. The idea in systematic concurrency testing is to explore all the possible schedulings, instead of a random set of schedulings, and to do this in a systematic way: you enumerate them, and if you don't find any errors, then there are no errors to be found and you are good to go.

The most naive way to describe how systematic concurrency testing works is the following. Assume that you only have a single scheduler where you can run your threads, and using that one scheduler, pick any scheduling of your program and run it. Now, if you have been to Prolog talks, you have heard about the idea of backtracking: going back to the latest point where you had a choice among things to do, and doing something else there. In exactly the same fashion, in systematic concurrency testing you choose an arbitrary scheduling for your first run through your code. At some point you had multiple threads to schedule and you picked one of them, for whatever reason. You backtrack to that point, you make a different choice, and you continue until the end again. Then you backtrack and you continue until the end, and you backtrack and you continue until the end, until you either find
an error, and then you have exactly the trace that led you there, or you have explored all the possible scheduling choices, and then you have explored your entire search space: there are no other possible schedulings and you are done.

But the thing is that all these approaches are test-oriented. You should know in what kind of scenario a concurrency error can occur; you should be able to simulate the rough conditions. So you need a test, and if you are expecting this to terminate you need a finite test: you need a server and two clients, and that's it. It's good to do it in that way, and we will see why in a while.

If you go back to the example that we had before, this looks like this schematically. You have your threads, and now I'm going to schedule them all on one scheduler, making an arbitrary choice at every point. First I get an operation from the red thread, then an operation from the purple thread, and red and purple and red and purple, and I check my results and everything is fine. This is one scheduling, an arbitrary scheduling, and I would backtrack to the...
My animation went too far. Good, now it didn't. I will go back to the latest point where I had a choice, and it was this point actually, because here I had both the last operation of the red and the purple. But if we go back to the previous example, I will get back to this point (and changing the animations on your slides at the last moment doesn't help), and instead of doing these things I will do these things. So instead of going down this path I will go down this path, and the alternative choice is between these two. I will continue going down and up and down and up until I have explored everything.

If you want to play with such a thing in Erlang, you can look at Concuerror. Concuerror is a tool for systematic concurrency testing, which runs a test under all possible schedulings. It can detect abnormal process exits and deadlocks, and it reports the error ("this process crashed", or "at the end of the scheduling these two processes are waiting for messages, so they are in a deadlock") together with the corresponding trace.

Now, if we go back to our example, and we have this program here again, and we remove this line, because that was the original code that we had, we can just run Concuerror on this file. If the gods of demos are with us, what happens is that we get a printout. Is it kind of visible? Let me make this a little bit larger. It doesn't matter much. Here is the invocation line. The program will start, it will print a lot of things that I thought were helpful for the first time somebody uses it, and at the end it will tell you that it found an error after exploring 4 out of 4 interleavings. Let's see what the error is. There are a lot of options. Concuerror says that in interleaving number 4, at step 5, a process, the main process, which is symbolically named P, exited abnormally with a badarg in erlang:register, and it shows you exactly the stack trace, lines and everything that you are interested in. Then it gives you an event trace, which contains all the
Erlang operations that matter from a race condition perspective, and it shows you exactly how the program got there. So you see that the first thing that happened in this program (can I highlight? yes, I can highlight) is that P spawned the child process, which is named P1. Then P1 was scheduled in, so the second thing that happened was that its timeout expired and it exited normally. Then P was scheduled back in and it tried to run register with child and P1, and because P1 at this point was dead, which is an error, an exception was raised and P exited abnormally. That's what stopped this exploration.

This was in one go, and you didn't change your code, your Erlang code. It matters a lot that it is pure Erlang code. The exploration is done, and you get a message that Concuerror stopped at the first error; if you want to find all of them you can continue. There are a lot of chatty messages for people who are not familiar with this. By the way, if anybody is wondering, I would have got exactly the same if I had left the other operation in. I run it again on the FC example, and now in the report you can see that the error happened at step 6, because the timeout that I had added in P expired before we went to the other thread.

Any questions so far?

Yes. Yes, 2 to the power of n. Keep that in mind, it's in the next slide. Other question? I will get there in the next slide, really. Yes, back there.

Right. The treatment of time in Concuerror is that if there is a timeout, I consider it possible at any point. I'm not being fair; Concuerror is not fair, it's not trying to do this kind of thing. There is a way to say that timeouts larger than 5 seconds are impossible, and it can respect that, but other than that, no, it doesn't do this kind of thing.

Another question? Is there a way to print the scheduling for each test?
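The backtracking exploration described above can be sketched on a small model of the demo program. This is a Python sketch under stated assumptions, not Concuerror's implementation and not the code from the slide: the state fields, the event strings, and the choice that the send uses the registered name are all assumptions, and the model is much coarser than the tool, so the number of schedules it visits is not expected to match the tool's report.

```python
# Model: P spawns P1, registers it as `child`, then sends to the name.
# P1 receives a message if one is waiting, otherwise times out; it exits either way.
INITIAL = {"parent_pc": 0, "child": "unborn", "registered": False,
           "mailbox": 0, "error": None}


def enabled(state):
    """Processes that can take a step in this state."""
    procs = []
    if state["error"] is None and state["parent_pc"] < 3:
        procs.append("P")
    if state["child"] == "alive":
        procs.append("P1")
    return procs


def step(state, proc):
    """Return (new_state, event) after proc takes one step; state is not mutated."""
    s = dict(state)
    if proc == "P1":
        event = "P1 receives message" if s["mailbox"] else "P1 times out"
        s["child"], s["registered"] = "dead", False  # exit removes the name
        return s, event + " and exits"
    pc = s["parent_pc"]
    s["parent_pc"] = pc + 1
    if pc == 0:
        s["child"] = "alive"
        return s, "P spawns P1"
    if pc == 1:
        if s["child"] == "alive":
            s["registered"] = True
            return s, "P registers P1 as child"
        s["error"] = "badarg in register"
        return s, "P: register fails (P1 is dead)"
    if s["registered"]:
        s["mailbox"] += 1
        return s, "P sends message to child"
    s["error"] = "badarg in send"
    return s, "P: send fails (name child is not registered)"


def explore(state, trace, results):
    """Depth-first search: try every enabled process at every choice point."""
    choices = enabled(state)
    if not choices:
        results.append((trace, state["error"]))
        return
    for proc in choices:
        next_state, event = step(state, proc)
        explore(next_state, trace + [event], results)  # returning here backtracks


results = []
explore(INITIAL, [], results)
for trace, error in results:
    print(error or "ok", "<-", " / ".join(trace))
```

In this model the schedule where P1 times out right after being spawned ends with the failed register, which is the same shape as the trace read out in the demo.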
Yes. You mean even the correct ones? Yes. The easiest way to do that is to tell Concuerror to explore everything.

So a really expected question here is: OK, you have two threads, they have some preemption points, isn't that going to explode? The answer is that systematic doesn't mean stupid. If you really try to explore all the schedulings, you will very quickly find out that there are too many. So what Concuerror really does is try to figure out which events are really interfering, which are the things that can give different results, so that each scheduling it explores is really different. It does this with state-of-the-art techniques, and the technique that it is using is called partial order reduction. The idea there is that you detect interferences between events in a scheduling: you see that if these two were to happen in a different order, a different result could arise. The other order is not always a problem: in the message-sending case, for example, it was fine to schedule them in the other order, but in this one it wasn't. You explore additional schedulings as needed, and you avoid schedulings that are equivalent. You focus only on the things that can really change the behavior of your program, and you do this dynamically.

Now, one might say again that this is Erlang, nothing is shared, what could possibly go wrong in Erlang? What should we be careful about? We already saw the registry as a very concrete example, and I talked about the mailbox and the ETS tables as well. One of the things that I had in my dissertation was a study of all the possible race conditions in Erlang, and it turns out that there are plenty more cases that one can be bitten by. Some of them are kind of exotic, like the group leader of a process, which has nothing to do with consensus or anything else;
it just determines where this process sends its output. Or the generation of unique references, and things like that. But also whether a process is alive, or in which order a signal arrives and makes a process exit: there are such cases in Erlang as well.

But clearly, OK, we know Erlang is a very simple language. It has a lot of primitives that you use, and then you build upon them: you build OTP, you use OTP code, you might be restarted, you are safe, you are fine. However, even in OTP, and that's a win for Concuerror, you might have a program like this. We are implementing a server, and you might have decided that if you want to stop your server, you are going to send it a stop request and this will make the server stop. This is your Erlang behaviour syntax, that is, you only write the code that is specific to your server, and this says that if I call the server with the atom stop, then the server will stop. This stop is unrelated to this one; this one means "server, please exit now".

Now, say that server was registered. Because this is a call, you will get a reply back, you will get an ok. You call the server with stop, you get an ok back, and at that point you might assume that the server has exited. But no, the server has to do cleanups. So if you were to immediately try to start the same server again, giving it the same name, you would run into the same kind of registry concurrency error: the old server has not fully exited, so the name is still taken, and the new one cannot start because the name is taken, so it will fail.

This was reported a while ago, and the reaction was like "yeah, but this kind of case doesn't matter much". That was on the mailing list. In the next release somebody added gen_server:stop, which does this synchronously: instead of calling with stop, you can now use stop. And you can do this in Erlang in the same way: you can set a
monitor, and when the server really exits you know that your server has stopped and you are safe.

A more recent win, from December 12, is about supervisors. It turns out that there is a case, when you have supervisors supervising other supervisors, where if you get your scheduling precisely wrong, the children of one child of a supervisor can stay alive. These are again exotic cases. This is OTP; this has been battle-tested since the 80's, when other people were thinking about I don't know what, but still. That's why I made a comment in my tutorial the other day that race conditions in Erlang exist, but they are hard.

So this is all nice and academic. How are people really using this in industry? What matters there? OTP in industry will get you really, really far: if you structure your code correctly, you will have your restarts, it will be fine, you don't really need to care about concurrency errors. However, sometimes you have more complex problems to solve. You don't just need a node, you need a distributed system that will stay up. You are trying to prototype ideas, you are trying to make simple protocols that do something for you. You can try to implement a protocol that has been written by an academic; they have proven its correctness for some cases, and may not have proven it for all cases, because those were strange cases and only an engineer could think of what happens there. An engineer in industry can also try to sit down and prove correctness for something, or try to have some fun instead.

One of these cases (this is their own animation, completely) happened in 2016, when Scott Fritchie, at that time working at VMware, was trying to see the failures and errors of a distributed algorithm. He was working on that, and at some point he tweeted: "I thought that this works, but DPOR with Concuerror found an invalid case." The important thing that you should notice on this slide is this one: this is my
avatar on Twitter, which means that I saw this, and when you see a user, you run.

So what was he trying to do? There is an algorithm called chain replication that gives you redundancy in a database. The idea is that you have a chain of servers and you want the same data to be on all servers. A simple way to achieve that is to do the writes at the head of the chain; they are then propagated downwards, and the last server in the chain tells you that the write is now completed. You do the reads at the tail, so only once a new operation has reached the tail is it fully committed. In this way you can get correctness and replication and everything that you need.

What Scott was trying to do was implement a variant of this where the client is doing the writing, instead of the servers themselves. He was thinking about what happens if you want to add a new server to this chain. The assumption was that if you add it at the tail, wait a while for values to flow in, and then continue, you are safe. So he wrote down a little recovery scenario: you have a new server, you copy your data to it, you add it to the chain. If you try to do this at the same time as two clients are writing two different values to the same key, and at the same time a third client tries to read the key twice, then you might run into problems with whether you see all updates, and whether you see the updates in the same order.

So, if you try to add it beyond the tail: what Scott actually did was sit down with some friends of his at a whiteboard, discuss a little, spend a while, and prove that this cannot work. There is an error there; it takes 17 steps. I have a link there if you want to see the error. So people could do this by hand. OK, what happens if you put it at the head of everything else? Now, Scott at this point had already spent a day thinking about it, and he knew Erlang and he knew about Concuerror. So what happens if you put it at the head? This
happens: you think that it's working, it doesn't, you find an error. The final choice that he tried was to put it in the middle, to insert it somewhere in between, so it's not touched by reads, it's not touched by writes, it should be OK. But is it really OK? Who knows. Let's try to see it in the model that we write with Concuerror.

Indeed, if you check the scenario where you put your new server at the tail, you can see that there is a bug, and you can find the bug very quickly if you are using bounded exploration, which doesn't explore absolutely everything, even with the constraints that Concuerror is imposing. Once you know that there is a bug, it's usually very easy to find it. The second case was even simpler: one could claim that the bug was found by the tool much faster than the engineers could sit down and reason about it. Fine. However, when you put it in the middle, you cannot find a bug with bounded exploration, and unbounded exploration takes more than 750 hours, doesn't find a bug, and there is more to go. 750 hours, by the way, is the time between when you decide to start the experiment and the deadline for the paper; it's chosen very scientifically.

So what do you do then? Well, you submit the paper anyway, and then you spend a while working on the problem. You figure out that if you really take a look at your model and think about what kinds of races there are, you can refine the model and focus on better cases. If you refine the model a little and remove some scenarios because they don't matter, then with the first method you find your bug much more quickly, and that's good. More importantly, you can verify this: if you put the node in the middle, you could verify it in 48 hours back then, or in 4.8 hours after you spend a while out of academia and really profile your code and see what you are doing in your functions. So fprof is a great tool in Erlang; please
use it whenever you find bottlenecks anywhere, for any reason.

Now, another case. You have other companies that are using Erlang heavily, and these companies really have good Erlang systems. You want to redesign a part of these systems and make it better, so you rearchitect, you change the order of things, you simplify, you merge, and you want to know that what you have written now is better: it has fewer race conditions, it has better failure modes. However, this is a distributed Erlang application, and Concuerror only handles single-node programs so far, because the time to develop other features was not available. So what can you do? Well, you build a model again, this time of Erlang distribution. You say somehow that this process is special and all these processes belong to this node. Because Erlang distribution is transparent, this is not very far from reality: sending messages across nodes or within a node is the same. This process governs this node, and then you can bring nodes up and down if you wish to model this kind of failure. Then you model the connections between nodes, you put some proxy processes here and there, and you can really make a model of Erlang distribution, and get another paper, at the Erlang Workshop 2018, which is very recent. With this you can simulate your algorithm there, and you gain more confidence that what you are doing is good.

Any questions so far?
Wrapping up. Concuerror is a tool that you can play with. You can download it from there; it's open source, it's developed openly, pull requests exist. I have my own fork where I have my work list of crazy ideas. If you care about the techniques for stateless model checking that Concuerror is using and the research behind it, you can find my dissertation there, and the summary is more or less approachable for programmers. I have some copies of the dissertation with me, in case somebody really has to fix this now. Other than that, try to play with Concuerror. I encourage you to, because you will really understand how tricky race conditions are, you will have fun modeling problems and seeing how different constructs in Erlang behave, and these things usually map to real concurrency and distribution failures. I claim that in this way concurrency testing is easy, because you just write your code, you point a tool at it, and if you pay attention and keep your model reasonable, you can even verify things. And that's great. So that's all that I had. The link again, for the 20th time. Thank you very much.

Yes? What happens is that it's a mixed situation. This kind of tool exists for the JVM too: there's a tool called Pathfinder, I think, that does this kind of thing for Java programs. It's just the case that people rarely go into this level of detail, and I have really tried to make Concuerror very easy to use, whereas other tools are, I would claim, harder, because you need to do instrumentation. Erlang has a large but relatively clear set of primitives that you need to be aware of, whereas in the JVM any kind of synchronization variable that you have has to be instrumented, and you have to be interrupting, and you have to do this kind of thing. Also, another reason why it's not on the JVM is that I like Erlang more than I like Java.

Other question? OK. In that case, in the code example here, the idea is that if the PID here is dead, you can still send messages to it and it will be fine. It's an error if you
are using a name for it that doesn't exist. So if you have a server that is registered under a name, and it dies, and it comes back again and is again registered under that name, then there are many race conditions there. You might send a message just before the server unregisters: the message will be eaten by the old server and lost. You might send a message after the server has unregistered: that's an error, because the name is not there. You might send a message after the new server is up but before it's registered. Concuerror will explore all these possibilities; there is a race between unregister, register, and send with a name. The accepted solution here is to have the register statement in here, in an initialization phase, and after this has happened you send a message back, and that's when others try to send messages to it, and maybe have a monitor there as well. "Just use OTP" is what the solution is.

One more? Yes, yes. OK, maybe we can take that. You mean this one? It's a case where you have a supervisor that wants to kill its children, which you have told to exit, and one of the children is also a supervisor. So you tell that supervisor to exit, and supervisors are nice to each other: one says "please exit", and the supervisor below will do that, but it will first ask its first child to exit, and this might take a while. The way supervisors exit is that they unlink from their children and then send them a signal to kill them, and this might take a while. When it gets the confirmation, it might unlink from the second child, and at that moment the parent supervisor is like "you took too long, die". Now this dies, anything else dies, and the unlinked child stays alive.

According to the OTP documentation, these are not cases that you will run into, unless somebody has told you "I think there is an error with supervisors", and I sit down and I write a model, and I'm like: I didn't find your error,
but I found another error. That's it. And you tell people, and they say: yeah, we have seen processes being alive after we kill the supervisor, and we wondered why. That's why.

I claim they can be: they share memory in ETS tables, they share the registry, and they share their mailboxes in a way, because anybody can write to anybody else's mailbox. No, but Erlang has a clear view of where things are shared, and that's why you can make tools like this.

Happy to. Thank you very much.