Welcome back to 353. So we had a nice detour last lecture. We talked about sockets. I will not ask you questions about sockets or anything. That was just a fun lecture to set you up for a networking and distributed systems course, so it wasn't really anything new, aside from just looking at the boring system calls. Doesn't really matter. So today we're getting back on track with our regularly scheduled programming. So, oops. Remember before we had this example where we created eight threads, we had them all execute this run method, and in the run method they were all incrementing a global counter. While we would expect that eight threads that each increment 10,000 times would give a final result of 80,000, when we ran it we got values like 61,063, 68-something, so on and so forth. It was just kind of random numbers. So today we get to explain exactly why this happened and how to actually fix it the quote unquote right way. Because before, we fixed this by just making a local variable in the run function that every thread incremented independently, and then we returned the value to the main thread, and the main thread was the only thing that added all the results together. But in general, we might not be able to do that. So, what actually happened is called a data race, and it is a very important term to know. Data races occur when sharing data between multiple threads, because then you can have concurrent accesses. So this definition you'll definitely need to know: a data race is when two concurrent accesses are made to the same variable and at least one of them is a write. That is the condition to have a data race and to get that problem where we have some type of inconsistency like we saw before. If we're having concurrent accesses and we're only reading, then it's fine. We're just reading the same value.
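As a reminder, the racy version looked something like this sketch (the thread count and iteration count match the example from the lecture; the function and variable names are made up):

```c
#include <pthread.h>

/* A sketch of the racy example: 8 threads, 10,000 increments each.
 * The counter is shared by all threads and ++counter is NOT atomic,
 * so this has a data race (intentionally). */
#define NUM_THREADS 8
#define ITERATIONS  10000

static int counter = 0; /* shared between all threads */

static void *run(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        ++counter; /* read-modify-write: not one indivisible step */
    }
    return NULL;
}

int run_racy_experiment(void) {
    pthread_t threads[NUM_THREADS];
    counter = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, run, NULL);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    /* anything from lost updates up to 80,000; rarely exactly 80,000 */
    return counter;
}
```

Lost updates can only make the total smaller, never larger, which is why the observed values were always at or below 80,000.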
Doesn't really matter, order doesn't really matter in that case. But if we have two concurrent accesses and one of them is a write, well, we might see some inconsistent view of memory. So to actually talk about that, or to actually fix it, we need to talk about what atomic operations are, and this kind of gets to the hardware level of what either happens or doesn't happen. So atomic operations, like atomic particles (although I guess with weird physics that isn't quite true), are for us indivisible: they either happen or they don't, there's no in between. That means we can't be preempted in the middle of an atomic operation. Like I said, it either happens or it doesn't. Between two atomic instructions you may be preempted, so the scheduler might switch in between two things, and that may or may not cause you problems, but if something is atomic, it will either all happen or all not happen. So with integers there are some atomic operations. Like with that increment there, there is actually an atomic version of increment we could have used, and that would have gotten us consistent results, but for bigger pieces of code that might have interactions between different variables, we have to have a more general solution. But again, to illustrate this, we can dive into a bit of a spoiler of your compiler course, if you're interested at all in compilers. When compilers take your C code, before they generate machine code there's usually some type of intermediate representation they use to analyze your code and help generate the machine code. So most compilers generate intermediate code: C gets compiled to some intermediate code, and then that intermediate code gets compiled to machine code. The form GCC uses is a kind of three address code, and it's mostly used for analysis and optimization.
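As a quick aside before the compiler stuff: the atomic version of increment mentioned above could look something like this sketch using C11's `<stdatomic.h>` (`atomic_fetch_add` is the real API; the surrounding names and counts are assumptions matching the lecture example):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 8
#define ITERATIONS  10000

static atomic_int atomic_counter = 0;

static void *run_atomic(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        /* an indivisible read-modify-write: no preemption "in between" */
        atomic_fetch_add(&atomic_counter, 1);
    }
    return NULL;
}

int run_atomic_experiment(void) {
    pthread_t threads[NUM_THREADS];
    atomic_store(&atomic_counter, 0);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, run_atomic, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&atomic_counter); /* always 80,000 */
}
```

This only works because the whole increment is a single atomic operation; as the lecture says, it doesn't generalize to bigger critical sections involving multiple variables.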
It's a very standard form where only one operation happens at a time. It only allows statements, and each statement is just one fundamental operation, and for us we can assume that each of these fundamental operations is atomic. It's useful to use them to reason about data races, and in general, since it's just a slightly higher level assembly, it's a bit easier to read. So all the statements are super, super simple. This is as complicated as it gets: it will just assign a result of applying a single operator to up to two operands. I could also just do a unary operator, so just an operator and one operand, but this is as complicated as it's going to get. So like some register equals one plus two, and that's it. So the three address code used by GCC has a stupid name. It's called GIMPLE. Don't ask me why, they like making stupid names. If you want to see the GIMPLE representation when you compile your code, there are compiler flags for it if you're at all interested: you can pass -fdump-tree-gimple and it will spit out all the GIMPLE it generates, and if you want to see all the intermediate representations generated by GCC, you can pass -fdump-tree-all and it'll show you a whole bunch of information. And like I said, it's a bit easier to reason about your code this way than with low level assembly, especially now that we commonly have both ARM and x86 processors and you don't really want to have to know both of those. So for example, in that data race example, when we had ++count, we would argue about it in terms of what the compiler does. The compiler will create some space in memory to actually hold that global variable (not on the stack, since it's a global), and then it will reason about it like, okay, we need a pointer to that global variable.
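Roughly, ++count decomposes into these three steps (a C sketch of the three-address form; d1 and d2 stand in for the temporaries, and p_count stands in for a pointer to the global):

```c
/* ++count broken into three-address-style steps: each line is one
 * fundamental operation, and each is atomic on its own, but the
 * sequence as a whole is NOT atomic. */
int increment(int *p_count) {
    int d1 = *p_count;  /* memory read of the global */
    int d2 = d1 + 1;    /* increment in a temporary register */
    *p_count = d2;      /* memory write back to the global */
    return d2;
}
```

A preemption can land between any two of those lines, which is exactly where the trouble comes from.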
And here is the GIMPLE that would be generated. GIMPLE doesn't have the concept of physical registers: it assumes the machine has an infinite number of registers, and then as part of the analysis to generate machine code, it will do register allocation. But in terms of GIMPLE, it just assumes we have an infinite number of registers, and we never do more than one assignment per statement. So here, I just take some temporary register D1, and this is a dereference of the pointer to count. So this is just a memory read: I'm reading the value at that address. Then after I read the value at that address into a register, this is the increment step: I take that register, add one to it, and assign it to a new temporary register D2. And finally I go ahead and update the global variable, so I do a memory write: I write to the location whose address p_count holds, and I write the value of D2, which is my incremented value. So assuming I have two threads that execute this at once, and initially the value at p_count equals zero, what are all the possible final values at that memory address? Well, let's say we have thread one and thread two, and just assume both threads are doing ++count. If both threads are doing ++count, they're essentially doing D1 equals a read of whatever is at count, then D2 equals D1 plus one, and then a write of that result. Both of these threads would be doing that. So in order to argue about correctness, you have to assume that preemptions can happen between any of these lines, so all combinations could possibly happen. Within a thread, if the value at p_count is initially zero, the thread is going to execute one statement, then the next, then the next, and we'll assume the compiler won't reorder these statements, but we don't know anything about the interleaving between the threads. So what might happen is thread one executes first.
So it does this memory read. In this case, it would read into its own register that the value at p_count is zero. And now if we kept executing thread one, well, then it would go ahead and increment the value and store it in D2. Oh, I destroyed that. All right, that's ugly. So D2 would be equal to one, and then it would go ahead and write out the value and update the value at p_count to one. And now if we context switch over to thread two, so let's draw a little arrow here, now we start executing thread two. What's going to happen is, well, it will read the current value at p_count, so it would read one into D1, and then it would increment it, so D2 would be equal to two, and then it would write it out, and the value at p_count would be equal to two, kind of like we expect. But since these are threads, we don't know the order between the threads at all. We have to assume that anything could happen. So, let's erase everything. What might happen is thread one executes first. It reads the current value at p_count into D1, which is zero, and then we immediately context switch over to thread two, and it does the same operation. It would read the current value at p_count into its own independent register, and it would read the value zero, since we haven't updated it in actual memory yet. And at this point we're screwed. It doesn't matter what executes next. If I switch back to T1, it'll set D2 equal to one, and then say I switch right back to thread two, well, it's going to update its register also, to D2 equals one. Doesn't really matter. The next thing they're both going to do is write out that value of one, so they'll both update memory to one, which they can both see. So any questions about that? So pretty much the only way we get two as the final result out of this is if one thread just executes those three statements in a row before switching to the other one.
Otherwise we're probably going to get this result where something weird happens in the middle. So it makes it really hard to analyze data races, because you have to assume every possibility. In this case, the only things we have to care about are the accesses to the memory that both threads are using, so the only things we care about are the reads and the writes to that global variable. Within a thread they'll always be in the same order: within thread one, it will always do the read R1 and then the write W1, and for thread two I'll call them R2 and W2. So assuming we have no ordering between the threads, we have to go over every possible interleaving of the two threads. I can either start with thread one. If I start with thread one and it does its read, okay, now I have to argue what happens after that. After the read, I could either continue executing thread one, which would be the write, and then switch to thread two and do its read and its write, and I'll get the final count as two. Otherwise, if thread one reads first, thread two could read next, and at that point either thread could write next: I could switch back to thread one, or I could keep going with thread two, and in either of these cases the value at p_count ends up equal to one. Then similarly for the cases starting with thread two: in one case, thread two executes all the way and then thread one executes all the way, so we get the expected value. Otherwise thread two reads first, we switch back to thread one, it reads, and then we're screwed in either case. So because we have ordering within each thread, it's not four factorial (24) interleavings: each read always has to happen before its own thread's write, which cuts it down to 4!/(2!·2!) = 6, so we only have to argue about six cases. Yep. So some compilers can reorder things if it deems that it can do it.
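To check the six-case argument, we can brute-force it: simulate every legal ordering of R1, W1, R2, W2 on a shared value starting at zero (a sketch; the encoding of the operations is made up):

```c
/* Simulate the six legal interleavings of two threads each doing
 *   d = *count; *count = d + 1;
 * where each thread's read must precede its own write. */
enum { R1, W1, R2, W2 };

static const int orders[6][4] = {
    {R1, W1, R2, W2},  /* thread 1 runs to completion first -> 2 */
    {R1, R2, W1, W2},  /* both read 0 before either writes   -> 1 */
    {R1, R2, W2, W1},  /*                                    -> 1 */
    {R2, R1, W1, W2},  /*                                    -> 1 */
    {R2, R1, W2, W1},  /*                                    -> 1 */
    {R2, W2, R1, W1},  /* thread 2 runs to completion first -> 2 */
};

int simulate(const int *order) {
    int shared = 0, d1 = 0, d2 = 0; /* d1/d2: each thread's register */
    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
            case R1: d1 = shared;     break; /* thread 1 reads  */
            case W1: shared = d1 + 1; break; /* thread 1 writes */
            case R2: d2 = shared;     break; /* thread 2 reads  */
            case W2: shared = d2 + 1; break; /* thread 2 writes */
        }
    }
    return shared;
}
```

Only the two "run to completion" orderings give two; the other four lose an update and give one, exactly as argued above.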
So yeah, there are weird interactions with the compiler here that are a whole other course, where essentially if you lie to the compiler and it reorders stuff, and then things happen to you in completely different orders, it's your fault, you will never, ever, ever debug it, and you'll probably yell at your compiler even though you lied to it. But the compiler probably wouldn't reorder anything to do with this example, because it's a single variable. If I was using two variables, though, it could reorder the accesses, like whether I access X first versus Y first. That doesn't really matter in a single threaded application, but if you're sharing them between threads, it might actually matter. So this would quickly get more complicated the more operations we have. Yeah. Are each of these six cases equally likely in practice? No, because the scheduler has time slices, right? And these are fundamental operations, so they're really, really fast. So probably not. And likely the reason we're seeing such large numbers in that data race example is that one thread reads, gets context switched, another thread does a whole bunch of operations, and then the first one comes back and overwrites, say, the last 10,000 increments the other one did. That's probably why we get these results, but it depends on the scheduler, depends how fast your computer is, depends on a whole bunch of stuff; you can't really guarantee it. In practice, whether it's slow or really, really fast, you might get away with it for a while, and when I ran that a few times, what we saw is that if I had fewer threads or I was doing fewer increments, it was the same result all the time and it looked like there was no issue. Like, oops.
So I think when I changed this from 10,000 increments to just 10, it was 80 every single time, even though the issue was still there; I just got lucky. But the underlying issue is still there. All right, so in order to solve this problem, there is some mechanism we can use to prevent some concurrency from happening. If we did switch between threads, we can prevent the other thread from running until we switch back to the original thread and it makes progress. What that's called is a mutex. Mutex is short for mutual exclusion. They're used to make sure that only one thread is running a certain piece of code at a time, and we can create them statically or dynamically. If you want to create one as, say, a global variable, you can just write global or local variable equals PTHREAD_MUTEX_INITIALIZER; they're still in the pthread library. Otherwise, if I want to create one dynamically, just like with threads, there's also a pthread_mutex_init that looks pretty much the same: you give it the address of the mutex and then some attributes, which we won't use, so we just give it NULL. If you want to set attributes, you have to use the dynamic version, but we'll just use the default mutexes. There are different kinds of mutexes that behave slightly differently, but for this course we're just going to use the plain old mutex with all the default options. So how they work is that mutexes essentially have two functions we can use: a lock and an unlock. If we have some code and then we have pthread_mutex_lock, the code that comes directly after the lock can only be executed by the thread that actually calls this lock function and returns from it. So if this thread goes ahead and locks mutex M1, it's kind of like an actual key.
Like, I don't know, probably more common here, you know those bathroom things where they give you a key and then you unlock the door and go in and no one else can go in after you? Same idea here, it's kind of like that bathroom key. If you get it, you lock the door, you bring the key with you, and then only that thread can execute that protected code, because I guess it's in the bathroom. When it's done, it unlocks, and then another thread can go ahead and acquire that lock and run the protected code, but only one thread can execute it at a time. If another thread tries to, well, it would essentially be waiting for the key to become available again. Everything between the lock and the unlock call is what we call the critical section, because only one thread can execute it at a time. You might also just call it protected. There are going to be some cases where there are deadlocks if we have multiple mutexes, which we will get into later, so don't worry about it now, but essentially that problem is: if you have a key that I need and I have a key that you need, how do we make any progress? And you can't. Also, there's another variation of lock. Lock is a blocking call that will wait until the lock is available and you can get it. There's also pthread_mutex_trylock, which will just tell you whether or not you successfully acquired the lock; it's the non-blocking version. And yeah, then there was a question: if a thread has super high priority and tries to lock while another thread holds the lock, can it run? In this case, thread priorities don't matter. Priorities only matter for scheduling; even if you're a higher priority process, that only matters to the scheduler. It doesn't matter here: if another thread has the lock, you don't have the lock. So if we went back to our code, let's try and fix it.
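Sketched out, the two ways to create a mutex plus the lock calls look like this (the pthread names are the real API; the function wrapping them is made up):

```c
#include <pthread.h>

/* static initialization, e.g. for a global */
pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;

/* dynamic initialization, with the default attributes (NULL) */
pthread_mutex_t m2;

int demo_mutex_api(void) {
    pthread_mutex_init(&m2, NULL);

    pthread_mutex_lock(&m1);   /* blocks until m1 is available */
    /* ... critical section ... */
    pthread_mutex_unlock(&m1);

    /* the non-blocking version: returns 0 on success, or a non-zero
     * error (EBUSY) if another thread already holds the lock */
    if (pthread_mutex_trylock(&m2) == 0) {
        pthread_mutex_unlock(&m2);
    }

    pthread_mutex_destroy(&m2);
    return 0;
}
```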
So here we go with the exact same code, but the only difference is that I create another global variable just called mutex, because, I don't know, I have very creative names. I initialize it, and then I can use it here, around the little piece of code that was causing all my data races. That code had concurrent accesses, because this memory location for counter is a global variable, so every single thread can access it: that's one condition for a data race. And because it's ++counter, at least one of the accesses is a write (it does a read and then a write). So we have a data race involving counter. If I want to prevent that data race, well, I can put a lock call right before it, and then I'm guaranteed that only one thread can acquire this mutex at a time. The first thread that makes it here will go ahead and acquire the mutex. Now let's say we had that same bad ordering as before; let's go back here. This is where our ordering went bad, right? We had thread one do its read, then we switched to thread two and it read too. So now, oops, turned that into a box somehow, my bad, we essentially have a mutex lock as the first thing that each thread does. So if we got the same ordering between the threads: okay, thread one executes first, and the mutex is initially unlocked, so there's a key just sitting around. If thread one executes first, the first line it's going to execute is mutex lock. You can assume that the lock operation itself is atomic; it handles all that stuff for you. So thread one would acquire the lock. It has acquired the mutex. Then it could go ahead, read from p_count, and update its register. But now if we switch over to thread two, well, the first thing thread two is going to do is try to acquire the lock, but it can't, because thread one has the lock.
So it would just sit there, blocked at this lock call. It can't make any progress. So this read cannot happen: it cannot do its memory read because it can't get past the lock call. It'll either have to go to sleep or yield or something like that. Yeah. So that's a good question. The question is: hey, if we have, say, seven other threads waiting for this lock, who gets it when thread one's done with it? That is a problem we will get to in the next lecture, but it is a real problem, because it's kind of like scheduling: maybe you want it to be fair, maybe you want it to be fast and you don't really care who gets it. Yeah. Yeah, it locks that mutex, yes. And yeah, if, say, somewhere else in the code I was also incrementing counter, I could have another lock and unlock with that same mutex somewhere else. And that's what I would have to do if I also modified counter somewhere else, because I need to make sure I have no concurrent accesses where at least one is a write. They don't have to be the same operation; as long as there's a memory write involved, that's it. Even if I had somewhere else where I was just reading counter, well, I technically also have to protect accesses to that, because somewhere else another thread could be writing to it, right? So I would have to protect the read as well, even though by itself it's just a read, because a write is possible somewhere else, and I'd protect it, in this case, probably with the same mutex.
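Sketched out, protecting both the write site and the read site with the same mutex might look like this (the names are made up; only the pthread calls are the real API):

```c
#include <pthread.h>

static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

/* the write site: a read-modify-write, protected */
void increment_counter(void) {
    pthread_mutex_lock(&counter_mutex);
    ++counter;
    pthread_mutex_unlock(&counter_mutex);
}

/* a read site elsewhere in the code: also protected, with the SAME
 * mutex, because a write could be happening concurrently */
int read_counter(void) {
    pthread_mutex_lock(&counter_mutex);
    int value = counter;
    pthread_mutex_unlock(&counter_mutex);
    return value;
}
```

Using the same mutex at every access to counter is what removes the data race; a second, different mutex around the read would not help, since both locks could be held at once.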
So I'd have to do something like that just to make sure I get a consistent value. In this simple case it might not matter, since maybe I can't argue that it should be one value or the other if the read lands in the middle, but for more complicated cases, where updating something takes several steps to be consistent, if you read without the lock you might catch it halfway through an update, and it's corrupted, and then you read garbage and your program crashes, or you let someone exploit you, or whatever. Yeah, if you don't unlock it, what happens? Okay, yeah, we can just try that. So if we don't unlock it, what do you think is going to happen? I assume I know what's going to happen, but let's see. I would argue that the first thread that gets the lock would do ++counter, so counter would go from zero to one, and then it would loop back and try to lock again, but it already has the lock, and you can't get the lock from yourself if you already have it. So all the other threads will be waiting for you to unlock, and you are also waiting for yourself to unlock, which is probably not good. I imagine if I run this, it will just simply die. Yeah, so if I did this, the first thread that acquires the lock would increase counter from zero to one and then return, but it technically still has the lock, so no other thread can make progress, and then that thread's just gone. But that mutex is in global memory, and it records that some thread has the lock; locks you hold don't get automatically unlocked if you just terminate the thread or something like that. The lucky thing about this is that it's a bit easier than malloc and free, because, well, you should be able to see the lock and the unlock, and they should always be paired like this. Yeah, yeah, that's just part of the library.
So the mutex is basically a big struct, and the initializer just populates it with all the default fields; in the next lecture, we'll get into what the implementation of this would look like. Yeah: can another thread just unlock the mutex and be like, yeah, screw you? The answer to that is you can do it, but in C, what's the magical phrase they always say when they throw up their hands and go, no, you shouldn't do that? It is undefined behavior. So only the thread that has acquired the lock is allowed to unlock the lock; otherwise, it's technically undefined behavior. And I think the best implementation of undefined behavior I've seen is one that will actually just delete all your files if you invoke undefined behavior, and technically it's allowed to do that, and then suddenly you get a little more careful about it. So again: only the thread that has acquired the lock is allowed to unlock it. Otherwise, undefined behavior. But I mean, nothing's preventing you from trying it; just like double frees and use-after-frees and all that fun stuff, you're probably going to have a bad time. And yeah, we'll go over when we would use the non-blocking version of this. In general, it will always look like this: lock and unlock. We'll only see trylock in later lectures. All right, so we actually didn't see this work yet. Here, now that I have the lock and the unlock, only one thread at a time can do this read followed by write. So no matter how many times I run this, boom, it's 80,000 every single time, because I don't have a data race anymore: only one thread at a time can execute this statement, which is actually three operations, and it executes all three operations one after another. So this is how to properly solve that problem. This will always be consistent; I don't have any data races. Yeah, yeah, so that's a good point.
Isn't this just a whole lot slower? So in this example, I mean, it's a bit silly, because incrementing a variable is not something you would normally need to protect this way, but we can see how much slower it is. The racy version finishes pretty much instantly, and the fixed version is slower by like four times or something like that. Yeah, it's slower than plain increments, but it's fast enough, and the great thing is, well, if this was calculating your grades, like I said, you'd probably rather have it slower and correct than fast and fairly useless. For pretty much all software, you want to prevent data races all the time; otherwise you're going to get unexpected results. There are cases of so-called benign data races, where it's like, yeah, this data race doesn't really matter, it doesn't actually affect the correctness of my program, but in general you have to make a very, very, very (how many verys can I say?) convincing case that it doesn't matter before you can actually allow data races to happen. Okay. So like I said before, a critical section means only one thread can execute there at a time, and it has the following properties. There is safety, which is mutual exclusion: only one thread can be in the critical section at once. Then we have liveness: if multiple threads reach that critical section, a.k.a. that lock call, it must be the case that one of them proceeds. And if two of them could proceed past that lock call, well, we'd be in the same boat as before: we could have a data race. The critical section also shouldn't have to depend on any outside threads. We'll see this later, so don't worry about it yet, but this is about cases like deadlock, where no thread can make progress anymore. And then there also needs to be a property of bounded waiting, a.k.a., like with scheduling, there should be no starvation.
So if a thread is waiting at a lock call, it must eventually proceed. It can't be like scheduling, where it might just sit there forever. Yeah, eventually means the same as for scheduling: it has to be possible for it to make progress at some point. And generally for that we'd probably want some ordering, so you get past the lock call in the same order you arrived at it, and it's not possible to just get unluckily shoved to the back of the line forever. Other properties we want: your critical sections should have minimal overhead, meaning you're not just wasting time synchronizing, you're trying to get things done, so you want them to be efficient. Whenever you're waiting on a lock, you don't want to consume resources, because if you're stuck at a lock call and can't make any progress, there's no point burning CPU cycles when you know you can't make progress anyway. You also want your locks to be fair, so each thread waits approximately the same amount of time, and you want them to be simple: easy to use, hard to misuse. Similar to libraries, you have multiple layers of synchronization in your program. There are hardware provided low level atomic operations, which we'll get into in the next lecture; those are hardware instructions guaranteed to be atomic. On top of those, you can build the high level synchronization primitives, and mutexes are an example of one of these. And then using things like mutexes, you can build your properly synchronized application, which means your application does not have data races and is always correct and all that fun stuff. But again, data races are the types of bugs that live for like seven years in the kernel, because they are very, very, very hard to actually detect and solve. We'll see some tools to help us with that too. So back in the day, this was actually really easy.
It turns out multiprocessors, having multiple cores, make this way worse, because parallelism kind of looks like concurrency, so it gets even worse. But assuming you only have a single core on your system, your implementation could be as simple as this. If your only source of concurrency is interrupts, because I have a single CPU core that just executes instructions one after another with no other concurrency, then all I have to do to implement my critical section is make my lock call disable interrupts. I just get rid of concurrency, and then suddenly I don't have any data races. So at the lock call I disable interrupts, and at the unlock call I re-enable interrupts. That just completely disables concurrency: it would ignore things like signals and hardware interrupts and everything like that. But this only works if you have a single CPU core. It obviously doesn't work if you have multiple processors, because just because you disabled interrupts on your core does not mean another core has disabled interrupts, and the cores also run in parallel. So instead we can try to implement a lock ourselves in software, and this might be one implementation. Let's assume I represent the state of my lock with just an integer, so maybe I just have an integer somewhere in memory, and these are my functions. To initialize it, I just write the value zero to it; I'll use zero to indicate that the lock is currently available, it's unlocked. Then in my lock call, my implementation just checks: it infinitely loops while the value of the lock is one, which indicates that another thread has it. So it just spins while the value is one, waiting for it to turn back to zero; there's a semicolon here, so it just loops over and over again.
So whenever some other thread unlocks it and turns the value back to zero, then I can go ahead and change the value to one to indicate that I now have the lock. And then my unlock call can just change the value back to zero. So let's look at this implementation for a little bit and see if we can think of any issues with it. Remember our properties. One very important property is that if two threads call lock, only one of them should be able to make progress; only one of them should make it past lock. And two, it should be efficient: if one thread actually has the lock and another thread is waiting for it, the waiter shouldn't just burn CPU cycles. Does this implementation have either or both of those issues? Take a look at it and see. Yeah. Yeah, so one thread definitely burns CPU cycles while another thread has the lock, because it's just an infinite loop; it'll just burn CPU cycles reading over and over again. What about the other issue? Can two threads call lock and both proceed? Yeah. Yeah, in this case we could have a data race within our lock call itself. So you're arguing that it's possible for two threads to call lock and both to progress. For data races, you pretty much just have to come up with a scenario where you get the result you do not want. So let us assume the initial value of the lock is zero: it's initially unlocked. And then, let's use purple, I have thread one and thread two, and they are both doing, oops, it's still purple, they're both doing the lock call that's essentially just: while the value of the lock is equal to one, loop infinitely; then set the value to one to indicate they have the lock. So what we would like to happen is thread one starts executing first. It reads the value of the lock; it's currently zero, and zero does not equal one.
So it would just go ahead and update the value of the lock to one to indicate it has the lock. And now if thread two comes along and calls this function, it's just going to be stuck in this while loop. It can't make any progress. So initially it looks like, yeah, maybe: two threads called lock, only one made it by. That seems okay. But is there a case where both threads make it by? More concretely, what interleaving of the two threads causes something very bad to happen here? Yeah, yeah, yeah. So what could happen is thread one executes first, and it reads the value of the lock, and let's say it reads zero. Okay, so it's unlocked. Oh no, we context switch over to thread two. Well, thread two does the same thing: it reads the value of the lock, and guess what, we haven't updated the value yet, so the value is zero. Now we're screwed. It doesn't matter which thread executes next: they both read the value zero, so they're both going to break out of the while loop, they're both going to write the value one to the lock, and they're both going to make it past the lock call. So essentially they both called lock and they both made it through, so we just wasted our time doing that, right? Any questions about that being a bad thing, that this clearly doesn't work? All right, any arguments that this does work? Okay, good. So the issue with this is it's not safe: both threads can be in the critical section at once, because there is a scenario where they both call lock and they both proceed. It's also not efficient, even when it is correct, because the thread that's waiting for the lock is constantly wasting CPU cycles just reading over and over again. We will fix all these issues in the next lecture. So just remember, I'm pulling for you, we're all in this together.
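Written out in C, the broken lock from the slides is roughly this (a sketch; the names are made up):

```c
/* A broken spinlock: 0 = unlocked, 1 = locked.
 * Broken because the read in the while loop and the write below it
 * are two separate steps: two threads can both read 0, both exit the
 * loop, and both "acquire" the lock. It also spins, burning CPU. */
void bad_lock_init(int *lock) {
    *lock = 0;              /* 0 means available */
}

void bad_lock(int *lock) {
    while (*lock == 1)
        ;                   /* spin, wasting CPU, until it looks free */
    *lock = 1;              /* too late: another thread may have seen 0 too */
}

void bad_unlock(int *lock) {
    *lock = 0;              /* hand the key back */
}
```

With only one thread the state transitions look right, which is exactly why the bug is sneaky: the failure only shows up under a particular interleaving of two threads inside bad_lock.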