Yeah, thank you. I hope it will be fun. This is kind of a new presentation setup for me: everything runs in a Jupyter notebook, so I'm going to present quite some code. I tried to keep it concise, and it's okay if you can't see the whole screen: I also have some bullet points and I will talk you through it. My hope is to make this available a bit later as a blog post with a bit more description, so we can come back and follow up on it.
I'm working for Agus, a Dutch company operating in the agricultural and financial tech sector, where we are developing a supply chain measurement platform. Without going into too much detail, I will use this experience as the context for the examples in this talk.
The heart of a supply chain is the goods: how they are processed by chain participants and how they move from one chain participant to the next. So it stands to reason that a supply chain measurement platform tracks the goods as they move through the chain. And when they are not moving, the goods are stored somewhere: they have a location. That's the concept I will use to illustrate race conditions here.
So how can I represent a location, in a nutshell? Well, with Python. I will focus on a few attributes here: I will give it a name, and an owner, which is a chain participant, or just a reference to a chain participant here. A location can also have child locations or, seen from the other angle, a location can have a parent. So we can have a hierarchy of locations: a generic location, say a big warehouse space, in which you can have smaller locations. I'm also adding an identifier field, so that locations can reference each other without having to load everything in memory.
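The notebook code is not part of this transcript. As a minimal sketch, assuming a dataclass and hypothetical field names (`name`, `owner`, `parent_id`, `id`), the location model described above might look like this:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Location:
    """A place where goods are stored (field names are assumptions)."""

    name: str
    owner: str  # reference to a chain participant, here simply its name
    parent_id: Optional[str] = None  # identifier of the parent location, if any
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


# A small hierarchy: the freezer only holds the identifier of its parent.
warehouse = Location(name="Main warehouse", owner="ACME")
freezer = Location(name="Freezer", owner="ACME", parent_id=warehouse.id)
```

Storing only `parent_id` rather than the parent object is what lets one location be loaded without pulling in the whole hierarchy.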
So we only work with what we need here. That's a model for one location, but obviously we want to work with more than one, so we also need to store them. I like the Python interfaces, and the dictionary is one of my favorites, so I'll use that abstraction to handle my store of locations. I can retrieve a location by its identifier, insert or update one with the __setitem__ of the dictionary interface, and also remove one by identifier. So imagine the store of locations as just a dictionary of locations, to which we add a name index so that we can look up by name. I will come back to that in a bit; it's just there to support an invariant on unique names.
That's an implementation of a repository, and we don't really have to look at it: it's basically using a dictionary to store locations and piggybacking on that to also keep an index on names. That's all it is.
Throughout the demo, I'm going to use a singleton to instantiate and store some locations. I will not use the singleton directly, though: I will hide it behind a context manager so that we have a nice abstraction to work with, especially if we had enough time to replace the repository with a wrapper around a database. I had slides on that, but that's a bit too long. So we keep the nice interface of a context manager: we get a repository instance and then we work with it as if it were a dictionary of locations.
So that's locations, and a bundle of locations, but I don't want to just work with that.
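The repository just described could be sketched as follows. This is an assumption-laden illustration, not the notebook's code: the class name `LocationRepository`, the `find_by_name` helper and the `location_repository` context manager are hypothetical, and the index is keyed on `(owner, name)` because the invariant is about names being unique per owner.

```python
from collections.abc import MutableMapping
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Location:
    id: str
    name: str
    owner: str


class LocationRepository(MutableMapping):
    """Dictionary of locations plus an index on (owner, name)."""

    def __init__(self):
        self._by_id = {}      # location id -> Location
        self._key_by_id = {}  # location id -> (owner, name) as last stored
        self._id_by_key = {}  # (owner, name) -> location id

    def _unindex(self, location_id):
        # Forget the name under which this location was last stored.
        key = self._key_by_id.pop(location_id, None)
        if key is not None and self._id_by_key.get(key) == location_id:
            del self._id_by_key[key]

    def __getitem__(self, location_id):
        return self._by_id[location_id]

    def __setitem__(self, location_id, location):
        self._unindex(location_id)
        key = (location.owner, location.name)
        self._by_id[location_id] = location
        self._key_by_id[location_id] = key
        self._id_by_key[key] = location_id

    def __delitem__(self, location_id):
        del self._by_id[location_id]
        self._unindex(location_id)

    def __iter__(self):
        return iter(self._by_id)

    def __len__(self):
        return len(self._by_id)

    def find_by_name(self, owner, name):
        location_id = self._id_by_key.get((owner, name))
        return None if location_id is None else self._by_id[location_id]


_REPOSITORY = LocationRepository()  # the singleton, never used directly


@contextmanager
def location_repository():
    # A database-backed version could open and commit a transaction here.
    yield _REPOSITORY


with location_repository() as locations:
    locations["w1"] = Location(id="w1", name="Main warehouse", owner="ACME")
    found = locations.find_by_name("ACME", "Main warehouse")
```

Note that nothing in this sketch is synchronized, which is exactly the property the rest of the talk relies on.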
That's a bit too low level, let's say. So what is the API I want to expose here? What are the business functions I want to expose as a back-end API, if we were doing that here?
Briefly, the read-only API is how to get a location's name and how to retrieve a location's parent. It exemplifies the concept of the repository I just mentioned: we acquire a repository, and once we are within the context manager we access the repository of locations as a plain dictionary. So nothing should be too surprising here. We get a location and return the attribute that matters: the name for the first function and the parent ID for the second.
Obviously, read-only is a bit limited. We still need to mutate data, so we also have a create location function. Again, you don't need to look at the whole code. What matters is that, within the context of the repository, we first validate the invariant: is the location name I want to use actually unique in the repository? If not, I raise an exception and fail. If it is, that's fine: I move on, create the location and store it in the repository.
The next mutation, renaming, is pretty similar. We first have an invariant check, which looks slightly different because we don't have the owner in context and have to read it from the store, but otherwise it's the same idea. Then the logic is: we update the location and put it back in the repository to commit the change.
As a quick example that might be a bit easier to grasp: how do we use this API then?
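Before looking at usage, here is a rough sketch of the four functions just described. It assumes a plain dictionary behind the context manager and a linear scan instead of a name index; the function names, the `InvariantError` exception and the company name are all hypothetical.

```python
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


class InvariantError(Exception):
    """Raised when a location name is already taken for an owner."""


@dataclass
class Location:
    name: str
    owner: str
    parent_id: Optional[str] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


_LOCATIONS = {}  # stand-in for the repository singleton


@contextmanager
def location_repository():
    yield _LOCATIONS


def _name_taken(locations, owner, name):
    return any(l.owner == owner and l.name == name for l in locations.values())


def get_location_name(location_id):
    with location_repository() as locations:
        return locations[location_id].name


def get_location_parent(location_id):
    with location_repository() as locations:
        return locations[location_id].parent_id


def create_location(name, owner, parent_id=None):
    with location_repository() as locations:
        if _name_taken(locations, owner, name):  # invariant check
            raise InvariantError(f"{name!r} already exists for {owner!r}")
        location = Location(name, owner, parent_id)
        locations[location.id] = location
        return location.id


def rename_location(location_id, new_name):
    with location_repository() as locations:
        location = locations[location_id]  # the owner is read from the store
        if _name_taken(locations, location.owner, new_name):
            raise InvariantError(f"{new_name!r} already exists")
        location.name = new_name
        locations[location.id] = location  # put it back to commit the change


warehouse = create_location("Main warehouse", "ACME")
freezer = create_location("Freezer", "ACME", parent_id=warehouse)
shelf = create_location("Shelf 1", "ACME", parent_id=freezer)
grandparent_name = get_location_name(get_location_parent(get_location_parent(shelf)))
```

The last line nests calls to walk up the hierarchy, which only works because each function acquires and leaves the repository on its own.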
Well, we simply create locations. Here we pass the location name first and then the owner, the chain participant, and we can also pass a parent afterwards to get a hierarchical relationship. So the main warehouse is the parent of a freezer and of a storage area, and then we also have shelves nested under those.
And then we can use our API: we can get the parent, which might be non-existent, for instance for the root location, and we can also nest calls to retrieve, for instance, the name of the grandparent of a location. That's all it is.
So it's a bit of a higher-level API. Imagine we had a web API: this would be the controller part. We don't deal directly with the entities; we deal with business functions in which we encode our invariants, here the uniqueness of a location name.
Right, so this was a quick setup to set the scene for the further examples. But I need to introduce a few concepts before that: we will look at concurrency, race conditions and critical sections.
First, concurrency is the concept of interleaving multiple tasks on one processing unit. That is, each task gets executed for a short period of time before moving on to another task, and another task, and so on until all tasks complete. In Python, while the global interpreter lock is still a thing, all the threads executed in one Python process run concurrently.
So it is something that you have every time you use threads. The concept is very similar in asyncio with coroutines, because you have one event loop in which you give each coroutine a bit of processing time, one after the other. That's different from parallelism, which is the execution of programs on different processing units, where we actually have real parallel execution at the same time. That's not for this talk; it's a different set of problems.
A quick example of concurrency with threads. What matters here is that we create threads, and what they run is a counting loop: they sleep a bit to simulate more work than they're actually doing, and then they print who they are and where they are in the loop. We start both threads together, so they're now running concurrently, and we wait for completion with the join method. What we can see in the log is that the output of one thread, the one here which ends with 84, is interleaved with the output of the other thread. We don't see the first thread run everything and then the second one run; that would not be concurrent, that would be sequential. So this is just how those two threads take turns in execution.
Next, the race condition, kind of the crux of this talk. Formally, a race condition is non-deterministic behavior of a given program: which code path was followed, or what the outcome is, which data made it first and which data made it last. Race conditions are caused by the timing or sequence of events, meaning which bits of tasks get executed at what time and in what order.
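Going back to the counting threads for a moment, a minimal sketch of that demo could look like the following. The function name `count`, the labels and the sleep duration are assumptions; the exact interleaving of the printed lines varies from run to run.

```python
import threading
import time

log = []  # shared list, only used to inspect the order afterwards


def count(label, steps=3):
    for step in range(steps):
        time.sleep(0.01)  # simulate a bit more work than we actually do
        line = f"{label}: step {step}"
        log.append(line)
        print(line)


threads = [
    threading.Thread(target=count, args=(f"thread-{n}",)) for n in (1, 2)
]
for thread in threads:
    thread.start()  # both threads now run concurrently
for thread in threads:
    thread.join()   # wait for completion
```

If the two threads ran sequentially, all lines of `thread-1` would appear before any line of `thread-2`; with the sleeps in place the lines typically alternate.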
So I will also show an example here. Again, we use one common location and two threads. We start them concurrently, and the idea is that both threads try to rename the same location with a different value. The problem is that we just don't know what the final name of the location is, because that's the thing with a race condition: either one of the two threads can finish first. We don't know which one, because we don't have any synchronization when we rename the location. And that's our problem.
But a race condition cannot just happen anywhere at random, right? It has to be somewhere, and we call that somewhere a critical section. That's the part of the program where we access a shared resource in a concurrent manner. If it's only about concurrent reads, we don't have any problem, because you can always just read the value: whether you do that at the same time or within 10 seconds of each other doesn't matter, as long as you don't have writes or side effects. It becomes interesting when we have at least one mutation that happens concurrently to other calls, because that's where we can have a race condition. It's not systematic, and that's the whole exercise here: how to trigger a race condition so that we can then fix it and not have that problem hanging over our heads when we are in production.
So let's apply those concepts to the API functions I just defined. I will only focus on one function, but the definitions obviously apply to create location as well. This is the code I showed earlier for renaming a location, and I'm just going to look at a few parts. The first mark is the shared resource, which is the repository. The repository, if you remember, is just a plain dictionary, so there is no synchronization mechanism in place.
It's a shared resource because if we have several threads accessing the repository, they share it without saying, hey, I'm going to use that now, so please let me be, and when I'm done I'll tell you and then you can go on. That's what we are missing, and that's why we have a critical section once we are in the scope of the repository: from mark one to mark three. That's where we might find race conditions, and we actually have two here.
One is when we first read from the repository to check whether an existing location has the same name. Again, we don't know who else is accessing the repository and doing something with it. So what we read at mark two might not be the value that is actually stored by the time we evaluate the guard, because it's possible that we were paused after reading the value, then somebody changed the value in the store, and what we end up with is a stale value.
It's the same issue at the third mark: if I'm overwriting a location name and somebody else is doing it at the same time, I am not aware of it. Whoever does it last wins, right? If I wrote first, then my value gets overwritten and I'm not aware of it. So we're basically losing data here.
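The stale read can be illustrated without any thread at all, by writing out by hand one unlucky interleaving of two callers. This is only a sketch with hypothetical data: locations are plain dictionaries, and the "marks" refer to the read and write steps described above.

```python
locations = {
    "a": {"owner": "ACME", "name": "A"},
    "b": {"owner": "ACME", "name": "B"},
}


def name_taken(owner, name):
    return any(
        loc["owner"] == owner and loc["name"] == name
        for loc in locations.values()
    )


# Mark 2, for both callers: each reads before the other has written.
a_sees_conflict = name_taken("ACME", "C")
b_sees_conflict = name_taken("ACME", "C")

# Mark 3, for both callers: each acts on its own, by now stale, read.
if not a_sees_conflict:
    locations["a"]["name"] = "C"
if not b_sees_conflict:
    locations["b"]["name"] = "C"

names = [loc["name"] for loc in locations.values()]
```

Both guards pass, so both locations end up with the same name and the uniqueness invariant is broken, even though each caller checked it.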
So we have a static example, the one I showed you in the function, but that's not really helpful. Let's just try to run the code and see what it looks like in terms of output. Again, I will use two threads and one location, and they will call the rename location function at the same time with different values. The point is then to understand what value we have in the end: who wins the race and who wins the renaming. Because if you lose the race, that means you are the last one to write the value, and so in the end you get to say what the location name is. So let's see.
I'm basically copying the same code I had earlier to illustrate what a race condition is, so bear with me as I walk you through it. I'm going to put pieces of code in functions so that we can replay them. What we do is define two threads that both apply the same function, by default rename location, with different values: "thread one was here" or "thread two was here". That way, when we look at the final value of the location, we can identify who was the last one to write. Again, we start the threads concurrently, let them run, and eventually retrieve the value that was written on the location.
We can run that on a new location and, yes, okay, sure, we get the value of the name of the location. But that's just one run. We don't see that it could be another value; it's always just one. So we need the concurrent aspect here.
We need more runs to see that we get different values. So I'm introducing an orchestrator here that plays out the concurrent renaming of the location a hundred times, and then we look at the results: out of a hundred times, who won and could write their own name in the location name? If we apply this, we get a hundred times the same thing we had before: 100 times out of 100, the same thread left its mark in the location name.
That's not what I want to see, because it makes me confident that my code doesn't have a race condition. I mean, look, there's only one value, so nothing can go wrong in production, eh? And no, that's actually not true. The problem is that within our tests we are not exercising the race condition. We are just hoping it will happen, which is obviously not the right way to test.
So what we want to do for now is force the computer a bit to help us and trigger the race condition every now and then. In a minute we will look at a better way to do that and exercise the race condition ourselves. I'm defining a decorator here, that is, a function that I apply to a function to change its behavior. Whatever the logic of the decorated function is, I will first do some busy work, just to keep the processor busy for a bit and let the operating system switch between the two threads, so that the two threads really run concurrently. Otherwise it just runs too fast and we don't see anything happening.
Now if I run my test run again, I finally get two different values.
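A sketch of this orchestrator and of the busy-work decorator, under stated assumptions: the location is reduced to a single dictionary, and the names `run_once`, `run_many`, `with_busy_work` and the iteration count are all made up for illustration. How much busy work is needed to see both outcomes depends on the machine, which is the point made above, so no particular tally is guaranteed.

```python
import functools
import threading
from collections import Counter

location = {"name": "initial"}  # the one shared location


def rename(new_name):
    location["name"] = new_name


def with_busy_work(func, iterations=20_000):
    """Decorator: burn some CPU before calling func, to invite a thread switch."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        total = 0
        for n in range(iterations):  # busy work, the result is irrelevant
            total += n
        return func(*args, **kwargs)

    return wrapper


def run_once(task):
    threads = [
        threading.Thread(target=task, args=(f"thread {n} was here",))
        for n in (1, 2)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return location["name"]  # whoever wrote last


def run_many(task, runs=100):
    # Tally which thread left its mark across many concurrent runs.
    return Counter(run_once(task) for _ in range(runs))


print(run_many(rename))                  # usually a single winner
print(run_many(with_busy_work(rename)))  # may show both, machine dependent
```

The decorator is applied as a plain function call here so that the undecorated and decorated versions can be compared side by side.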
So that's great: now I can see that sometimes thread one finished first, or last, and sometimes thread two did, with the values here. That's better, because I can see that the race condition is a real thing. If I call rename location twice concurrently, I don't know what value I end up with, which is obviously not a good place to be in.
What we want is to be able to understand when and how many times the threads get interrupted, so that I can understand who will eventually win the race or, rather, how we can get rid of the race condition in the first place. Here it was trial and error to find how much busy work I need so that execution switches between the two threads, and that obviously doesn't scale well. It just worked on my machine. You know, the trademark: works on my machine.
So the problem is that we really need a way of triggering the race condition programmatically, every single time. We will look at this now. I need a few more definitions first, briefly, and then we can look at the tools we have in Python to do that.
The whole crux, the whole problem here, is that we have concurrent access to a shared resource, the repository. How can we protect that resource? How can we enforce that only one person can access it?
Or rather, that only one thread accesses or mutates things at a time? We have two ways of doing it, which I call implicit concurrency management and explicit concurrency management.
The implicit way is when we can delegate to a library or to a lower level. For instance, you delegate to the database, and the database can lock the whole table for you, or just the row you want to modify. You don't have to think about the actual locking mechanism, the actual guard around the shared resource. You just work within this framework, knowing that when you want to mutate this data it will be protected: access will be serialized and only one thread will mutate it at a time. That's very general purpose. It's a great abstraction because you don't have to worry about it; you just let the system do it for you. Most of the time it's the right approach: just delegate. But sometimes it's not so nice, especially if you really want to squeeze performance out of it, right?
Then locking the whole resource, for instance, to guard against two threads possibly accessing it at the same time would be too damaging in terms of performance. In that case you will want to take matters into your own hands and use some specific concepts to protect the shared resource against multiple accesses. We call these synchronization primitives. They offer you a really tailor-made approach that, hopefully, you can make more performant than the general approach, but it takes a bit of care and craft to do that.
Of these two kinds of concurrency management, the first would be easier to show with an actual database implementation of the repository, which is what I mentioned earlier, the SQLite repository. But that's a bit too much for one talk. So we'll focus only on the explicit management, and I will just briefly say something about implicit concurrency management later.
So what kind of tools do we have in the Python standard library to help us with that? We don't actually need anything external.
We can just use what is built in. The synchronization primitives I mentioned, the tools in the toolbox, are just Python objects in Python, and most of them can be used as context managers for easy scoping. I think I will be a bit quick on those. They are staples of computer science and not specific to Python. What is specific to Python is just the API, but it's pretty much normalized across all of them.
First we have semaphores, which are one of the oldest synchronization primitives. They count how many things there are in a pool, in an atomic fashion. You can try to acquire something from the pool, and if there is nothing left you block, unless you have a timeout. Another thread can then release something back into the pool to give access to more things in there. That's the general concept of semaphores. We also have a related concept, the bounded semaphore, which is more for assertion and, let's say, debugging purposes: it says that you cannot go past a specific number of things in your pool, otherwise you get an error. That's more of a safety check than anything else.
The lock is a very basic concept: you can see it as a binary semaphore. You have either zero or one thing that you can get from your pool, and that gives you exclusive access to a shared resource. It can also be called a mutex, from mutual exclusion, but in some other programming languages that has a few more properties.
Those we are not looking at here. The typical use case is to protect a singleton against, for instance, double creation. When you initialize your singleton, you want to make sure you've got only one, so you acquire a lock to guard access to it, and nobody else can do anything with it until you release the lock.
A related concept is the re-entrant lock, or recursive lock, which just means that whoever owns the lock can acquire it again without blocking, and so on and so forth. Of course, the principle is that you release the lock as many times as you acquired it; otherwise you end up with a lock that is never freed and your process might just hang. It's useful when you have recursive function calls in which you need a lock: the first time you actually take the lock, and the other times you kind of re-acquire it for free as you go down your recursion tree.
The event can be seen as a signal that one thread broadcasts to many threads that are waiting for it. Imagine it as flag propagation. The condition is linked to a lock, an event and a predicate. I'm not going to go into too much detail now; it's just a thing, a nice thing in Python. And barriers, which were not available in asyncio before Python 3.11 but are in threading, can be seen as a checkpoint: you know how many threads you want to reach that barrier, and once everybody is there you let them all go.
So let's apply those primitives to our race condition example, which is, again, changing the location name from two threads concurrently. Yeah, just pushing on a bit with that.
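As a quick tour of the primitives listed above, here is a small sketch using only the standard `threading` module. The variable names and the tiny worker function are illustrative assumptions; the calls themselves (`Lock`, `RLock`, `Semaphore.acquire(timeout=...)`, `BoundedSemaphore`, `Event`, `Barrier`) are the standard library API.

```python
import threading

# Lock: a binary semaphore, usable as a context manager.
lock = threading.Lock()
with lock:
    locked_inside = lock.locked()

# RLock: the owning thread can acquire it again without blocking.
rlock = threading.RLock()
with rlock:
    with rlock:
        reentered = True

# Semaphore: a pool of two; the third acquire times out and returns False.
pool = threading.Semaphore(2)
pool.acquire()
pool.acquire()
got_third = pool.acquire(timeout=0.01)
pool.release()  # give one item back to the pool

# BoundedSemaphore: releasing more than was acquired is an error.
bounded = threading.BoundedSemaphore(1)
try:
    bounded.release()
    over_released = False
except ValueError:
    over_released = True

# Barrier and Event: wait until everybody is there, then wait for the flag.
barrier = threading.Barrier(3)  # two workers plus the main thread
go = threading.Event()
arrived = []


def worker(number):
    barrier.wait()  # checkpoint: all three parties must arrive
    go.wait()       # blocked until the main thread sets the flag
    arrived.append(number)


threads = [threading.Thread(target=worker, args=(n,)) for n in (1, 2)]
for thread in threads:
    thread.start()
barrier.wait()  # the main thread is the third party
go.set()        # broadcast the signal to every waiting thread
for thread in threads:
    thread.join()
```

The barrier and the event at the end are the two primitives the reproduction of the race condition relies on.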
Oh, no, sorry. So, yeah, what I want to do now is actually trigger the race condition programmatically. I don't want busy work and hundreds of iterations. I want one iteration in which I know which thread will run into the race condition. Then I can make sure that both threads execute through the race condition and cause an error, and once I have a reproducible test I can fix that in the system. Here we'll use the uniqueness of a location name as the invariant that might be broken by a race condition when we rename locations from two different places.
So the example is, again: we have two locations and we try to rename them to the same name concurrently. That's the code here. The way this is going to work is that we first start both tasks, renaming A and renaming B, concurrently. We let them execute, but we stop them before they actually persist the change to the repository. That means they have both passed the invariant check, so both are allowed to write to the repository, but we stop them just before they write. Once we are there, we can decide whom to let go on and write to the repository. I chose the repository update because that's the race condition I'm looking at.
The expected outcome with the current code base is that both locations end up with the same name, because they both passed the invariant check and were both able to rename, which is of course not what we want in the end. But that's the test: we want to reproduce the race condition and make sure that we actually have a bug in our system.
So for the steps I just mentioned, I will briefly pick the right synchronization primitive for each. First comes a barrier, to get everyone up to speed. We create the two threads.
We make sure that they are initialized and running, or rather almost running, and once we are there we can let them begin the task, which is to rename the location. Remember that I want to do this programmatically, so I want to be in charge of when the threads actually start. Potentially I could have more setup in between the two steps here.
Once they begin, they carry out the invariant check and change the location in memory, and I don't want them to persist yet. I want them to be ready to do that, because again I want to decide which of the threads will actually write the location to the repository, so that I know what to expect in terms of failure. Here the update to the repository is just the __setitem__ method of the dictionary, so that's where I add another primitive, which is another event per thread. That way I can tell either thread one or thread two to proceed, once I have set the flag.
Finally, the last two steps are letting them proceed, with the event I just mentioned, so that they write to the repository, and then waiting for them by joining the thread, which actually also uses a lock internally. So we have a barrier, a few events and then a lock in this example. We could also use other primitives, and then the semantics are a bit different, but they work pretty similarly, and some of them are interchangeable here.
So let me now transfer those steps to code. The first task is pretty similar to the second task. We start by waiting for everybody to be set up. So that's the barrier, right?
We need to wait. Then, once everybody is set up, we go on to the next block, which waits for the test runner to tell us: okay, now you can actually start your logic, check the invariant and try to update. Once we've got that signal, we call rename A and rename B, which rename both locations to the same name.
Then the mutation seam. What really matters here is that I'm patching the __setitem__ of the repository. That means I'm inserting a new blocking step, which is actually an event per thread, before I allow them to call the real __setitem__ on the repository. The exact logic is here; the intent is clear.
If we wrap all of this together, we end up with this long piece of code. We first define the objects we are going to work with, the synchronization primitives. We initialize both threads with their own logic function that we defined above, and we pass them the right primitives. We start both of them, and then we wait for them to be actually ready to do work. That's the barrier that all three threads, the runner and the two task threads, are going to wait on. Once we are there, we know that both threads are ready, and we can signal them to go on and actually try the invariant check. So, after potentially a bit more initialization, that's step two: we let them know that they can start. Now both threads are concurrently checking the invariant and getting the change ready in memory, and then they stop again before they can persist to the repository. If we didn't stop them, if we let them go further, then we would actually, potentially, be in the race condition, right?
Because then, if thread 1 did write, if it executed everything before thread 2, then thread 2 would read from the repository a value which is already thread 1's value, and we would not trigger our race condition. So step 3 is to make sure that all threads have passed the invariant check and were able to perform the change in memory.
That's when we move to the final two steps. We first let task one run, which is the first thread, renaming A to C, and we wait for it to complete by joining the thread. Then we do the same for the second thread and location B.
We can now execute this against the repository. I'm calling it and I don't see an exception, so I guess that's good news. But if I look at the actual names that were written for location A and location B, I can see that they have the same name for the same owner. So my invariant was broken, and that's exactly what I wanted to see: I could reproduce the race condition programmatically, step by step, and now I know that I have a bug and need to take care of it. Just to make sure that I'm not saying anything stupid: if I try to use the same name again, then yes, I get an invariant check error. So my invariant check is working; that's not the problem here.
All right, so I will give you a couple of solutions as to how we can solve this, now that we can reproduce it. And it's just a few lines every time.
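Looking back at the reproduction just walked through, here is a minimal sketch of the five steps. It makes several assumptions that differ from the notebook: instead of patching `__setitem__`, it uses a hypothetical `GatedRepository` dict subclass as the seam; locations are frozen dataclasses so that the in-memory change stays invisible until it is written; and an extra "reached" event per thread tells the runner that a write is pending.

```python
import threading
from dataclasses import dataclass, replace


class InvariantError(Exception):
    """Raised when a location name is already taken for an owner."""


@dataclass(frozen=True)
class Location:
    id: str
    owner: str
    name: str


class GatedRepository(dict):
    """Dict whose writes can be paused per thread (the mutation seam)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reached = {}  # thread name -> Event set when a write is pending
        self.proceed = {}  # thread name -> Event that releases the write

    def gate(self, thread_name):
        self.reached[thread_name] = threading.Event()
        self.proceed[thread_name] = threading.Event()

    def __setitem__(self, key, value):
        name = threading.current_thread().name
        if name in self.proceed:
            self.reached[name].set()   # the invariant check has passed
            self.proceed[name].wait()  # hold the write until released
        super().__setitem__(key, value)


def rename_location(repo, location_id, new_name):
    location = repo[location_id]
    if any(l.owner == location.owner and l.name == new_name for l in repo.values()):
        raise InvariantError(new_name)
    repo[location_id] = replace(location, name=new_name)


def task(repo, location_id, new_name, barrier, begin, errors):
    barrier.wait()  # step 1: everybody is set up
    begin.wait()    # step 2: the runner says go
    try:
        rename_location(repo, location_id, new_name)
    except InvariantError as exc:
        errors.append(exc)


def reproduce():
    repo = GatedRepository(
        a=Location("a", "ACME", "A"), b=Location("b", "ACME", "B")
    )
    barrier, begin, errors = threading.Barrier(3), threading.Event(), []
    threads = []
    for location_id in ("a", "b"):
        name = f"rename-{location_id}"
        repo.gate(name)
        threads.append(threading.Thread(
            target=task, name=name,
            args=(repo, location_id, "C", barrier, begin, errors),
        ))
    for thread in threads:
        thread.start()
    barrier.wait()  # step 1, runner side
    begin.set()     # step 2
    for thread in threads:  # step 3: both passed the invariant check
        assert repo.reached[thread.name].wait(timeout=5)
    for thread in threads:  # steps 4 and 5: release one write at a time
        repo.proceed[thread.name].set()
        thread.join()
    return repo, errors


repo, errors = reproduce()
```

With the unsynchronized `rename_location` above, no error is recorded and both locations end up named "C" on every run, which is the failing test the talk is after.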
That's the nice thing. The first solution, which could actually be seen as implicit concurrency management, is to lock the whole repository every time we try to access any location in it. Basically we say: I have exclusive access to the repository now, so I can do whatever I want, and nobody will see what I do until I leave the scope. Then some other thread, only one, can take hold of the repository and go on.
If I had a database, I could use an exclusive connection mode. In this case, I just lock my singleton. A lock, remember, is just a binary semaphore, so only one thread can hold the lock at a time. The others wait until the lock is free, and then only one of them gets it. Once I have the lock, I have access to the repository, which means that in my functions, when I am in the scope of the repository context manager, access is now synchronized. I am the only one accessing the repository, which means that all the code after that is not a critical section anymore. I cannot have a race condition on locations because nobody else can do anything with locations besides me.
So that's one way to solve it. It's a bit brutal, but it works, and unless you are looking for more performance, you know, simple is really better.
You can just stick with this. Now, I would have a problem if I ran my test again as is. I will just leave the code in there, but basically I need to change how I exercise my race condition: since I have a lock, I cannot have both threads inside the rename location function at the same time, because only one holds the lock, so only one can execute the invariant check at a time. We need to account for that, either by introducing a few more synchronization primitives, which we don't do here, or by adding a bit of a timeout to let the other thread do its work. But obviously this value of one second, where I give a thread some time to do something, is again the same problem we had before: it's not scalable, it just runs on my machine. It's only a quick way to make the test work with a global lock.
And indeed, if I now reset my locations and try again to access the in-memory locations from two threads, I get an error in one of the threads, which is what I wanted to see, because it means that one thread is aware that it failed to carry out the operation. I didn't change anything in my test; I just added the lock. So that's good: I could fix it. And if I look at the values of both locations, yep, indeed only one was changed, in this case location A, because it belongs to the first thread that I released.
We can have a few more solutions, but I will only mention them briefly. We could lock only the mutation operations instead of locking the whole repository. That would be a bit more focused. Remember that a race condition is only a problem if you have at least one mutation, right? If you only have read access, you don't have any problem, because you don't have any side effects.
You don't have any changes. So by locking only the mutations, we can narrow the scope a bit and hopefully let the read accesses be more performant and not blocked when I want to carry out one mutation. That's the nice thing: it's only about a subset of calls, hopefully not the main ones. And that's just one example of how I would model it here: I would add a flag to the repository context saying, I'm about to do a mutation in this context, so please get a mutation lock for me, and then I would carry out the work.
Alternatively, we could go even finer grained and come up with a lock per location name, because that's the invariant we want to check. We are very much into handcrafted, tailor-made solutions here: I want this invariant to not fall into a critical section, so that no race condition can happen there, and so I design the minimal lock that frees that invariant from the critical section. It's a bit more involved, because you have to be really precise about which name you want to lock. But the intent is much clearer; it's more explicit. I'm saying: I'm about to mutate only this location name. In the back end I have to do a bit more work, because I have more locks to maintain and potentially more exceptions to handle. But from the point of view of the code using it, it's much clearer. It's not left to the whims of the database and whichever isolation mode you are in.
So, to conclude. You don't always have problems: if you don't have concurrent mutations, you don't have problems. You can have as many concurrent accesses as you want as long as they are only reads. And it's not because you identify a race condition, or see a critical section, that it will actually happen, and that's the problem I wanted to alleviate here. It's hard to reproduce.
It's hard to test if you don't even spot it in the first place. You might well be unaware of it from your test suite, because unless you add enough busy work, as I showed you, or use a more programmatic approach, it will likely not show. And you might just end up one day in production realizing that you have a bug, with no idea where it comes from, no data to investigate, no forensics. And that's the best, well, no, that's the worst place to be in.
As much as you can, I would advise delegating to lower levels: let the database handle locks at row level or table level. If that's not enough, or if you want more performance, double-check your architecture. The example here was obviously a bit contrived, and we could have come up with an alternative way of modeling things that does not depend so heavily on the repository being accessed by one thread at a time. And when all of this is not enough, just exercise the race conditions yourselves. Make sure that you can reproduce them if you see them, that you have a failing test, and then you can fix it and keep that test in your suite. Don't leave it to chance.
And that's basically it for my talk. Thank you.