Okay, apparently we're still figuring out where classes are, and that's not good. All right, so today we are going to talk further about threads and their implementation, which should help you understand Lab 2.

Quiz one: I still don't know what it looks like. In a lot of ways we're ahead of the other class, at least two lectures ahead of them, minus some other stuff. If you look at Ashwin's slides, there are some questions at the end, and the short answer part will probably look like those. I looked through them, and I think we can answer pretty much all of them except for anything that has to do with an MMU. We don't have to know that, because that's how you actually implement virtual memory, and we'll see the entire implementation when we talk about virtual memory. The rest is multiple choice or true/false, and it will probably be conceptual stuff: system calls, threads, processes, what's an address space, stuff like that. Some thread material will be on it, but not in super implementation detail. Most of what we're going to talk about today we've pretty much already covered with processes, like context switching and all that, so that's stuff you should know. Yeah, it's online, so how am I going to stop you? It's open internet. I'm sure the TAs aren't going to watch you or do anything, so I don't know how on earth you would stop someone.

So let's talk about thread implementation. There are a few multithreading models you can take when you're implementing threads, and the question pretty much comes down to where we implement them: in user space or in kernel space. What you'll be implementing in Lab 2 are user threads. They are a construct that exists only in user space. The kernel doesn't know about them, hence the name user-space threads. So the kernel doesn't treat your process any differently.
As far as the kernel is concerned, you have a single process that has a single thread, and what you do with that single thread is completely up to you. If you want to put threads on top of it, which is what you'll be doing in Lab 2, you are free to do that.

Kernel threads, on the other hand, are implemented completely in kernel space, so the kernel actually knows about them. The kernel manages everything for you: it allocates the stack for the threads, which is something you'll be doing yourself in Lab 2, and it takes care of all the context switching, because essentially it can reuse all the process context-switching code. It's the same thing, except it has to do even less. It just has to swap the registers, which also swaps the stack, and that's it. It doesn't have to change the address space or anything like that, because we all know by now that threads live in the same address space as their process. So yep, it's pretty much the registers, and one of those registers is the stack pointer. Every thread needs its own independent stack so it can execute normal C code independently of the other threads. Other than that, threads share memory, so they can all see the entire heap.

No matter where threads are implemented, there's going to be a thread table somewhere. It's similar to a process table, except each entry holds less information. A thread entry holds the registers and the stack, which is what a thread is. A thread is like a subset of a process, because the process entry also has the address space information, all the memory, the open files, everything related to a running process. The thread table could be in user space or kernel space, depending on where your thread support actually lives. For you, it will be in user space, but there are kernel threads in the Linux kernel, so there is some section of the kernel dedicated to storing information about threads.
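To make the thread table versus process table comparison a bit more concrete, here is a rough sketch in C. Everything in it is made up for illustration: the struct names, the field choices, and the table sizes are assumptions, and it is not how the Linux kernel or the lab code lays things out. The only point is that a thread entry is a small subset of what a process entry has to carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 16   /* assumed table size, for illustration only */
#define NUM_REGS    16   /* assumed number of saved registers */

enum thread_state { T_UNUSED = 0, T_READY, T_RUNNING, T_BLOCKED, T_ZOMBIE };

/* Hypothetical thread table entry: just what is private to one thread. */
struct thread_entry {
    int               tid;
    enum thread_state state;
    uint64_t          regs[NUM_REGS];  /* saved registers, stack pointer included */
    void             *stack_base;      /* this thread's own stack */
    size_t            stack_size;
};

/* Hypothetical process table entry: everything the threads share,
 * plus the table of threads living inside the process. */
struct process_entry {
    int                 pid;
    void               *address_space;  /* stand-in for memory-mapping info */
    int                 open_files[8];  /* stand-in for a file descriptor table */
    struct thread_entry threads[MAX_THREADS];
};

/* Claim the first unused slot for a new thread.
 * Returns the new thread id, or -1 if the table is full. */
int thread_table_add(struct process_entry *p, void *stack_base, size_t stack_size) {
    assert(p != NULL);
    for (int i = 0; i < MAX_THREADS; i++) {
        struct thread_entry *t = &p->threads[i];
        if (t->state == T_UNUSED) {
            t->tid = i;
            t->state = T_READY;
            t->stack_base = stack_base;
            t->stack_size = stack_size;
            return i;
        }
    }
    return -1;
}
```

A zero-initialized `struct process_entry` starts with every slot unused, so the first two calls to `thread_table_add` hand out thread ids 0 and 1.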
Unfortunately for you, user threads also need some type of runtime system to determine scheduling. Thankfully, at least in Lab 2, it's going to be quite easy, because you're just going to have a queue of threads that want to run, and you serve them in order.

But yeah, a process table is just all the information associated with each process. It's a big structure. There was a slide that pointed to where it is in the kernel tree. It's called task_struct, and it contains all the information about a running process: all the registers, scheduling information, memory information, open files. Is that the process control block? Yes, to a certain degree. The process table is just a bunch of process control blocks. That's how you keep track of all the running processes on your machine.

Okay, so if you have user threads, you have to determine your own scheduling because you can't rely on the kernel. It will schedule your process, but you have to determine which of your user threads runs at any particular instant. In both models, though, threads live within a process.

One benefit of user-level threads is that you can avoid system calls, which was especially a benefit back in the day when you only had a single CPU anyway. You want to avoid system calls because it's actually quite slow to transition from user space to kernel space, have the kernel do some stuff, and then transition back from kernel space to user space. With pure user-level threads, you don't have to make any system calls to manage threads, so they are very fast to create and destroy, and no context switches happen within the kernel. You of course have to do your own context switching, which again is Lab 2. The drawback is that, according to the kernel, your process only has one thread.
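The queue of threads served in order, mentioned above, can be sketched as a small ring buffer of thread ids. This is only an illustration under assumptions: the names `run_queue`, `rq_push_back`, `rq_pop_front`, and `rq_yield` are hypothetical and are not the lab's API, the capacity is arbitrary, and no real context switch happens here. It only models who runs next.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64   /* assumed capacity, for illustration only */

/* Hypothetical FIFO of runnable thread ids. */
struct run_queue {
    int tids[QUEUE_CAP];
    int head;    /* index of the thread that runs next */
    int count;   /* how many threads are queued */
};

/* Add a thread at the back of the line. Returns false if the queue is full. */
bool rq_push_back(struct run_queue *q, int tid) {
    assert(q != NULL);
    if (q->count == QUEUE_CAP) return false;
    q->tids[(q->head + q->count) % QUEUE_CAP] = tid;
    q->count++;
    return true;
}

/* Take the thread at the front of the line. Returns -1 if nobody is waiting. */
int rq_pop_front(struct run_queue *q) {
    assert(q != NULL);
    if (q->count == 0) return -1;
    int tid = q->tids[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return tid;
}

/* Model of a cooperative yield: the current thread goes to the back,
 * and whoever is at the front becomes the thread to run. */
int rq_yield(struct run_queue *q, int current) {
    if (!rq_push_back(q, current)) return current;  /* full: keep running */
    return rq_pop_front(q);
}
```

With threads 1 and 2 queued and thread 1 picked first, repeated calls to `rq_yield` alternate between 2 and 1. If the queue is empty when a thread yields, the same thread simply gets picked again.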
So if your currently running user-level thread makes a system call that blocks, your entire process is put to sleep while the kernel resolves that system call, and none of your other threads are able to run. With user-level threads alone you will not have any parallelism whatsoever, because the kernel will only ever execute a single thread. You can do things concurrently with just a single kernel-level thread, but you can't do anything in parallel without the kernel's support.

Kernel-level threads are going to be slower to create, because creation involves a system call. But the kernel actually knows about them and knows they are all independent streams of execution. So if one of your threads blocks or is in a system call, another thread can be scheduled to run while the first one waits. And again, if you have eight cores on your machine, for example, and you have eight threads, the kernel can schedule them all in parallel because it knows about them.

All threading libraries will have at least some user code, though. The pthread library that we saw before has some library code, and we'll kind of see what it does, and through the magic of strace we'll see what system call pthread actually uses. The thread library maps user-level threads to kernel-level threads, and there are different ways of breaking this down. What you'll have in Lab 2 is called many-to-one: many user-level threads map to one kernel thread, which means your library exists completely in user space. Again, as far as the kernel is concerned, you have one running process, which of course has just one thread. In the one-to-one model, one user-level thread corresponds to one kernel-level thread.
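As a small illustration of the eight cores, eight threads point, here is a sketch that asks the system how many cores are online and creates one pthread per core. The function name `run_one_thread_per_core` and the empty `worker` are made up for this example. `_SC_NPROCESSORS_ONLN` is a common extension on Linux, macOS, and the BSDs rather than something POSIX guarantees, so treat that part as an assumption about the platform. Whether the threads really run in parallel is up to the kernel's scheduler.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Placeholder thread body; a real program would do useful work here. */
static void *worker(void *arg) {
    (void)arg;
    return NULL;
}

/* Create one thread per online core, wait for all of them, and
 * return how many threads were created (-1 if allocation failed). */
long run_one_thread_per_core(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* extension, see note above */
    if (ncores < 1) ncores = 1;

    pthread_t *threads = malloc((size_t)ncores * sizeof *threads);
    if (threads == NULL) return -1;

    long created = 0;
    for (long i = 0; i < ncores; i++) {
        if (pthread_create(&threads[i], NULL, worker, NULL) != 0) break;
        created++;
    }
    for (long i = 0; i < created; i++) {
        int rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
        (void)rc;
    }
    free(threads);
    return created;
}
```

On a machine with eight cores this would typically create and join eight threads. Build with `-pthread` on toolchains that need it.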
That means the kernel handles everything: it does the context switching for you and it does the scheduling for you, but of course you're at the mercy of the kernel as to when your thread actually runs. And then there's many-to-many, which is kind of weird and is something they did in the past. It's not really used now, it's a bit complicated, and we'll see a better way to do that. Many-to-many just means that many user threads can be mapped to many kernel threads. It's not exactly a one-to-one correspondence and it's not exactly a many-to-one correspondence; it's somewhere in between.

So many-to-one is a pure user-space implementation. Again, this is just Lab 2, so there will be no system calls, and you will be managing everything yourself. It's going to be fast, and it's going to be portable, because it is written purely in something like C, which is basically just portable assembly, and you're not depending on any system calls. You can reuse it on Windows, on Mac, on Linux, it doesn't matter. It's a pure C implementation of threads, and it's just a library. The drawback, again, is that if one thread blocks, the entire process blocks and no other thread can make any progress. It also can't do anything in parallel because, again, the kernel only sees a single thread in your process. Back in the day, this may have been good when we only had a single core on our machine, but everyone probably has at least four cores now, so this probably isn't a good thing to do anymore, although you still get to implement it.

One-to-one just uses the kernel thread implementation. There's just a thin wrapper around the system calls to make them easier to use. We'll see that the system call is kind of weird looking, and the library still has to allocate some stack space.
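On the point that stack space still has to be allocated for every thread: the pthread library normally does this for you, but you can see the same idea from the outside by supplying the stack yourself with `pthread_attr_setstack`. The sketch below is only an illustration under assumptions. The 1 MiB size, the page alignment, and the helper names are choices made for this example, not what the library does internally. The thread records the address of one of its locals so the caller can check that it really ran on the supplied memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define STACK_BYTES ((size_t)1 << 20)   /* 1 MiB, an assumed size */

/* Report where one of this thread's locals lives, as an integer. */
static void *report_stack_address(void *arg) {
    int local = 0;   /* lives on this thread's stack */
    *(uintptr_t *)arg = (uintptr_t)&local;
    return NULL;
}

/* Returns 1 if the thread ran on the stack we allocated,
 * 0 if it did not, and -1 if setup failed. */
int thread_ran_on_our_stack(void) {
    void *stack = NULL;
    long page = sysconf(_SC_PAGESIZE);
    if (page < 1 || posix_memalign(&stack, (size_t)page, STACK_BYTES) != 0)
        return -1;

    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) { free(stack); return -1; }
    if (pthread_attr_setstack(&attr, stack, STACK_BYTES) != 0) {
        pthread_attr_destroy(&attr);
        free(stack);
        return -1;
    }

    uintptr_t seen = 0;
    pthread_t t;
    int rc = pthread_create(&t, &attr, report_stack_address, &seen);
    pthread_attr_destroy(&attr);
    if (rc != 0) { free(stack); return -1; }

    rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;

    uintptr_t lo = (uintptr_t)stack;
    int inside = (seen >= lo && seen < lo + STACK_BYTES);
    free(stack);   /* only safe once the thread has been joined */
    return inside;
}
```

The stack is freed only after `pthread_join` returns, because until then the thread may still be using it.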
The pthread implementation is going to allocate a stack for the new kernel thread and then essentially tell the kernel where that address is, and then the kernel takes care of all the scheduling and the context switching, which means swapping the registers and everything like that. This is the way to exploit the full parallelism of your machine, because the kernel has full knowledge of everything and can schedule all the threads to run as it wants. But again, this has a slower system call interface, and in very, very niche scenarios you might want to control the scheduling yourself, say if you have threads with higher priority, or you know one always needs to happen before another. You essentially give up the freedom of writing your own scheduler by using the kernel scheduler, but thankfully for all of us, the kernel developers are quite smart and the scheduler is quite good. This is the implementation people typically use, and for Linux with pthreads, this is the actual implementation.

Many-to-many is a hybrid approach. The idea is that I might want a lot of user-level threads, more of them than kernel-level threads. Say I only want to ensure that I have parallelism, and each kernel thread is really, really expensive to create. Then I'll just create as many kernel threads as the most parallelism I can get out of my machine. If I only have eight cores, I'll create eight kernel threads, and then I can put something like 10,000 user-level threads on top of them, since those are cheap. The idea is to get the best of both worlds: get the most out of multiple CPUs and also reduce the number of system calls. However, this leads to a more complicated threading library, and depending on your mapping luck, you may block other threads by accident, since again the kernel doesn't know about them.
So you still have that same issue. If you have tons and tons of user-level threads, and you map a hundred of them onto a single kernel thread, and they all happen to contain blocking system calls, the kernel can't schedule between them, because again it can only schedule what it knows about. It makes things very, very complicated, and we'll see a technique that uses this idea but is a lot easier to implement.

Threads also get to complicate the kernel and make our lives difficult. The question is: how should fork work if I have a process with multiple threads in it? It's kind of weird. If it's an exact copy, should it copy all of the threads? That would get out of hand very, very quickly, and it would also make a fork bomb with threads very, very effective. The way the Linux kernel deals with this is that whenever fork happens, it creates a new process like fork always does, and that process contains only a single thread: a copy of the thread that called fork. That avoids the explosion of threads, but there's a potential issue. Fork is just a snapshot at that point in time, and if multiple threads are running, they might be playing with memory and be halfway through some operation. So after the fork, when you have two independent processes and one of them only has a single thread, the new process's memory might be in some weird undefined state because other threads were monkeying around with it at the time of the fork.

So the question is: what if I created 10 threads, and within the run function of each of those 10 threads there was a fork? If that happens, this rule still applies. It would create 10 new processes, and each of those new processes would have a single thread, which would be a copy of whichever thread called fork. Good question, and we could fork bomb ourselves even faster if we just had a loop inside of that. Yeah, so that is a really hard thing.
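The rule that the child gets only the thread that called fork can be observed with a small experiment. This is a sketch under assumptions: the names `ticker`, `ticks`, and `child_has_only_forking_thread` are invented for the example, and the 1 ms and 50 ms delays are arbitrary. A background thread keeps bumping a counter. After the fork, the child samples its copy of the counter twice, 50 ms apart. Since the ticker thread was not copied into the child, nothing there changes the counter, so both samples match.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_int  ticks;   /* bumped only by the ticker thread */
static atomic_bool stop;    /* tells the ticker thread to finish */

static void nap_ms(long ms) {
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void *ticker(void *arg) {
    (void)arg;
    while (!atomic_load(&stop)) {
        atomic_fetch_add(&ticks, 1);
        nap_ms(1);
    }
    return NULL;
}

/* Returns 1 if the counter stayed frozen in the child (no ticker there),
 * 0 if it kept changing, and -1 on error. */
int child_has_only_forking_thread(void) {
    pthread_t t;
    atomic_store(&stop, false);
    if (pthread_create(&t, NULL, ticker, NULL) != 0) return -1;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only this thread exists here, so ticks cannot change. */
        int before = atomic_load(&ticks);
        nap_ms(50);
        int after = atomic_load(&ticks);
        _exit(before == after ? 0 : 1);
    }

    int status = 0;
    if (pid > 0) waitpid(pid, &status, 0);

    atomic_store(&stop, true);
    int rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;

    if (pid < 0 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status) == 0;
}
```

The child sticks to `nanosleep` and `_exit`, both async-signal-safe, which is the cautious thing to do after forking from a multithreaded process, for exactly the inconsistent-state reason described above.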
There's also something that we won't cover in this course: if you ever find yourself forking in a multithreaded program, you're going to find weird inconsistent state, but you can use this thing called pthread_atfork to control what happens when a fork occurs, so that you can leave your memory in a consistent state before the fork and the new process has at least some consistent view of memory. That's not going to be covered in this course and you don't have to know about it. It's kind of just a fun thing. I've never had to use it, thank God, because that would be awful.

Oh yeah. So that new process would have a single thread, which could be inside our run function, right? Returning from run normally implicitly calls pthread_exit, or you could call pthread_exit yourself. Because it's the only thread now, calling pthread_exit is going to end the process, and by default the exit status is zero. But in the new child process you could of course just call exit yourself and return whatever exit code you want.

Okay, signals are another fun complication. If I have a process with, say, eight threads and a signal gets sent to it, what happens? What Linux will do is just kind of pick a random thread and have that thread handle the signal. This makes concurrency kind of hard, because you have to consider any thread being interrupted at any time to run your signal handler code. You don't know if it's the main thread, you don't know if it's a thread you created, you have absolutely no control over it, and this gets very hairy. Yep, so it would pick a running thread if there's one running, or one that could run if others are blocked. So the question is: is there any way to politely suggest that it gives a signal to the main thread?
Not as far as I know. There are other ways around it that will come way later, so you can kind of force the issue, but I don't think you can do it through the default signal stuff. You don't have to worry about that until Lab 3, where you actually have to deal with signals. The nice thing about signals for you in Lab 3 is that you have user-level threads, so you only have a single process, and you know that your one process is going to be the thing that gets the signal. So it's a bit nicer for you.

Okay, so instead of having many-to-many, one common technique is to use something called a thread pool. If the goal of many-to-many was just to avoid creation cost, well, I don't have to support many-to-many in my threading library. I could have a thread pool, which just creates a certain number of threads and a queue of tasks. Like I said before, maybe you want to get the most parallelism you can, so you would create as many threads as there are CPUs in the system. Then as requests come in, you have a thread wake up, give it some work to do, and as soon as it's done that work it goes back to sleep. You don't actually destroy it. It just goes back to sleep, and when there's a new task it goes ahead and does that task. So you only create your threads once, you reuse them over and over again, and they sleep when there's no work to do. That's one very common technique, and you might encounter it in one of the labs way later in the course.

But since you'll be implementing many-to-one, our process life cycle still applies, because essentially your thread is just the virtual CPU, the actual execution. So this is the state diagram that all of your threads will be using. And of course, since these are all user-level threads running inside a process, the same thing also happens below this at the process level too:
if a thread makes a system call, then the process will be blocked, and your threading library will have no idea that a system call is being executed. It would just consider that thread as running.

As for the code you have to implement in Lab 2: you'll have to write a function called thread create, which creates a thread, allocates some stack space, and initializes some structure that will hold all the registers so you can context switch. You can have a double-ended queue that is full of threads that are waiting, that is, threads that can execute, and whenever you're done creating a new thread you put it at the end of that queue. Then your scheduler, which is not going to be terribly sophisticated, is responsible for taking threads from waiting to running. The easiest thing to do is to pick the thread at the front of the queue. You can think of it as a line at Disney World or something like that: whoever is at the front gets to ride next, so that would be the thread to run.

Remember we talked about cooperative scheduling for processes. You'll have the same thing in Lab 2: cooperative threads. The only way for a thread to give up the CPU is to explicitly call thread yield, and that transitions the thread from running to waiting. It essentially puts it back at the end of the line, and then your threading library picks the next thread from the front and starts running that.

Yep, so it depends where the context switch is happening. In our case we're doing user-level threads, so you're going to have to do the context switch yourself. But there will also be another context switch underneath that, because the process the kernel knows about will be context switched in and out too. Yeah, so you're essentially just
creating fake threads, right? Thankfully you don't have to take the CPU away from any thread in Lab 2. Threads have to relinquish it themselves by calling thread yield. Of course, any thread can just not call thread yield until it is done, in which case it would hog the entire CPU.

The other transition you'll have to make is from running to blocked. You implement something called thread sleep, which basically makes the thread not runnable anymore, so it can't be transitioned to the running state. Eventually some other thread will call thread wake up, which puts it back into the waiting queue at the end of the line again. You'll also have to implement thread exit, of course. The running thread can call thread exit, which essentially turns it into a zombie thread, and it is terminated at that point. I don't think you have to implement a thread join. As far as I know, there's a thread destroy but no join.

Okay, so like I said, the scheduler can just be some simple round-robin thing. You create a queue and always run the thread at the front of the queue, and when a thread yields or a new thread is created, it's always added to the back of the queue. You'll have to do the context switch, and again, you'll have to save the registers. The first hurdle you'll probably encounter is saving the program counter. You have to monkey with it a little bit and really think about it, because if you save the program counter exactly as it is, then whenever you restore it, execution resumes right before the point where the thread saved itself, and it will just save itself again, and you get into this very nasty loop. So that's one thing you'll have to be careful of. And again, these are cooperative threads, so they have to play nice. In Lab 3 you'll do the preemptive version, where your threading library will actually steal the CPU away from a thread and force it to
context switch. Yeah, you'll be saving the registers and creating a stack.

Okay, so that's a good question. The question is: when a thread is running, what's the difference between sleep and yield? The difference is that if you yield, the thread just goes back to the waiting queue and it could be run again immediately. It's still eligible to run. If you call sleep, it's in this blocked state, and blocked just means it's not eligible to run, so even if the CPU is idle it still doesn't run. So of course you can brick your process: if every thread goes to sleep and they're all blocked, then you can't make any progress anymore, because no thread will ever wake the others up.

All right, so the next question is: if a thread puts itself to sleep, how does it get unblocked? The answer is that another thread has to wake it up, because a sleeping thread can't run, so it can't wake itself back up.

Okay, well, let's get into our next fun complication then, review threads a bit more, and just play around with them. We'll create a program that spawns eight threads, and each thread will increment a value 10,000 times. If the initial value of that variable is zero, what should the end result be?
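For reference, the program just described can be sketched roughly like this. It is an approximation and not the exact lecture code: the name `racy_total` is made up, and the increment is deliberately written as a separate read followed by a write on an atomic variable. That keeps the sketch well-defined C while still letting one thread's write land on top of another's, so whatever total comes out is something the C standard allows rather than undefined behaviour.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_THREADS 8
#define INCREMENTS  10000

static atomic_int counter;   /* shared by every thread */

/* Each thread bumps the shared counter INCREMENTS times. */
static void *run(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        int seen = atomic_load(&counter);   /* step 1: read the shared value */
        atomic_store(&counter, seen + 1);   /* step 2: write back seen + 1   */
    }
    return NULL;
}

/* Create all the threads, then join all of them, then report the total. */
int racy_total(void) {
    pthread_t threads[NUM_THREADS];
    atomic_store(&counter, 0);

    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, run, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
        (void)rc;
    }
    return atomic_load(&counter);
}
```

Calling `racy_total()` several times and printing the result is the experiment to try. The total can never exceed `NUM_THREADS * INCREMENTS`, because every write stores one more than a value that was already in the counter.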
80,000, right? So we'd expect to see 80,000. Let's go ahead and practice some more with threads and see an actual bigger join example.

So here's my main. In main I create an array of pthread_t. Again, processes and threads are kind of analogous to each other, so if it's easier, you can think of a pthread_t as just a thread ID, just like you'd think of the return value of fork as a process ID. I declare an array for all my threads that has space for num threads, which in this case is eight, and then I have a for loop that creates the eight threads and gives each call the address of its array slot. The second argument is NULL, so they're going to be joinable threads, because that is the default. We're going to have them all run this run function, and we're not going to pass an argument because we don't care.

Yep, so that's a good point. He brought up that in total, right after this point, I will have nine threads, because when the process is created there's a thread that runs main. Main comes here and creates eight more threads, so in total I will have nine threads. That is such a good question that it'll probably be on the quiz, if the other class knows that by now, which they might not.

Okay, so anyway, let's look at the run function. Oh, first, to start off: who knows what the static means? And what's the difference between this and a global variable? Basically, without getting into compiler nonsense, this line is a global variable, and if you put static in front of it, it still behaves as a global variable, but you can only access it from this C file and not from another C file. So it's kind of like a global variable that other people can't monkey around with, and you're reasonably sure that it won't change out from under you. Okay, well,
to make that even worse: int i, that is a stack variable, right? It's just allocated on the stack. Who wants to tell me what that means? What do they teach you in the C course? All right, they taught us nothing in the C course. In this course, knowing what's on the stack, what's in the heap, and what's a global variable is kind of important, especially when we're in the same address space and can monkey around with things. So everyone agrees this is on the stack. Now, annoyingly, if I put static in front of it, that essentially makes it a global variable. The only distinction is that you can only access that global variable from within that function, but otherwise it behaves exactly like a global variable. So if I did this, it would exist for the entirety of the program, but because we now have threads, each thread would use the same variable, since it's a global variable, and we'd have a disaster on our hands.

Yep, so global variables are set up when you load your program. As part of the information in the ELF file, there's a data section that says, you know, x should go in this spot of memory, and that doesn't change throughout, so the variable lives as long as the process lives. It is neither on the stack nor in the heap. It's just in some predefined memory space, and that's it.

So the question is: if some thread sets this variable to 10 and then another thread tries to read it, what will it get? It will get 10, because it is a global variable. It has one single memory address, the address never changes, and because all the threads are in the same address space, they all see the same value. And yeah, in this case, with static int i, I can only access this i within this run function, within this scope. Okay, so someone should have taught you that in first year. Maybe it didn't really matter back then, but now it really matters, right,
because if I change it to static, every thread can see that i, and if every thread is going through this loop, i is not going to increment sequentially. It's just going to be random values all the time. The threads are going to fight with each other, and it's going to be terrible. All right, so in this case I did int i. Counter, though, I want to be a global variable, because we're going to see the fight.

Okay, so after I create all the threads, I join them, because I want to wait for all of them to finish and I actually care about when they finish. After they all finish, I can print my counter and be sure that all my threads have finished execution. This is a case where you would want joinable threads. What would probably happen if I just had detached threads? Yeah, if I just have detached threads here and then print counter, main is going to rip through it: it creates all the threads, immediately prints the counter, and then exits the process, which kills all the threads. So the counter would likely be some low value, because some of the other threads might not have even run yet before main exits and the process dies.

Yep, so pthread_join is essentially the same thing as waitpid. It waits for that thread to die, and the main thread is blocked until that particular thread dies.

All right, can anyone tell me why I have two for loops? Why don't I just create and join in the same loop? Anyone want to guess why I don't do something like that? Yeah, so in this case I want to give the kernel the opportunity to parallelize this as much as possible. If I have something like this, it's going to create a single thread and wait for it to finish, create the next thread and wait for it to finish, and so on, so essentially only one thing is
executing at a time, so I basically just made a single-threaded program with a lot more steps. If I do it this way instead, I create all the threads first, so at this point in my program I have nine threads: the main thread and the eight other threads that I'm hoping the kernel will schedule nicely for me. Then they start getting cleaned up at pthread_join, and after the join loop there would only be one thread left, the main thread.

So the question is: what if thread zero took way longer to exit than all the other threads? Well, the first time through this loop, pthread_join waits on thread zero, so even if the other threads finish before it, main has to wait to join that specific thread first. The others would all be zombies, right? So we'd just have seven zombie threads until the first thread finishes, and then we join all of them.

Yep, so the question is: is there an equivalent of wait, where I just say I don't care which thread finishes first, just wake me up whenever any thread finishes? The answer is, by default, no. There's a non-blocking version of join, so you can probe a thread's state to see if it's a zombie yet. You could do that, but you'd have to poll and ask every thread over and over again. As far as I know, there's no equivalent of a plain wait on anything.

Okay, so now we get to run this. If I run it, I'm expecting 80,000, and of course when I run it: 77,000. Not bad. I can run it a few times: 72, 57, yikes, 60, 66, 50. Oh, we're getting worse. Hey, we're bug-free, right? Works for me. So this again is why programming with threads is more difficult. Every time through, except for the times I get insanely lucky, I essentially just get a bunch of random values. Hey, so one question is: is this a way that programs create random values? The answer is no, they do something much smarter than this. Okay, can anyone think of ways I
could fix this, even knowing only what we know now, if I want this to still run in parallel? Yeah, if I don't have parallelism or concurrency, things are easy, because it's back to our old life, right?

Oh yeah, so if this wasn't such a ridiculous example, that's one good way to do it. You could make counter a local variable, so it's unique to every thread, and they could all increment their own. Say it was computing something better than that. Then as their return value they'd have to malloc some space to hold their integer, so run would return a pointer to an int. Within the join we could capture the return value, and since we wrote the thread function, we know it's pointing to an integer, so we could just add everything up from each individual thread in the main thread. That way we wouldn't have any issue. So that is one good way to do it, and we'll see more ways next lecture. Any other questions about this?

Okay, we've got time, so let's return some values. For my fix I can say int local counter. This is a very ridiculous thing to do, but bear with me. I'll change counter to local counter, so now it's on the stack and it's independent for each thread. Then I'm incrementing local counter, and at the end of the loop I'm done. I can return an integer pointer, because I can just cast it to a void pointer. I'll call it ret. If I malloc, I'll be nice and say sizeof int, so now that pointer points into the heap, and if I return it, it's still going to be valid. Oops, why is the compiler complaining? Oh yeah, a semicolon, and the return. Okay, so at that point it's going to malloc and return. If we instead returned the address of local counter, like we did last time, that's very, very bad, because as soon as the thread's done and that function returns, it's out of scope,
If you return a pointer to a local like that, it points at invalid memory, some kind of trash memory, so we have to use malloc. So we'd malloc some space on the stack, or sorry, yes, on the heap. No, that's good, feel free to shout at me if I say something stupid. So we allocate some memory on the heap, and then we write the value of our local counter there. Yep? No, malloc always just allocates on the heap. Luckily for you, the authors of malloc made it so that if multiple threads are calling malloc, they won't fight over each other. If a thread calls malloc, it's going to get a unique block of memory, whatever thread called it. Because we have eight threads, there are going to be eight malloc calls that all give unique addresses, and then each thread just returns its one address, right? Okay, so now we need to play with the return value. So in join we of course have to allocate; we just have to allocate some space on the stack to hold our return value, and then we can give join the address of that return value. I know it's an int, and then I probably have to cast to void star star. Yeah, whoops, I need malloc. Yeah, so I haven't added them up yet, I just had to include the malloc header. All right, so that gets it. So ret is now a pointer, and then within here I can do counter plus equals dereference ret and add them all up, right? That is a good point, and since I am such a nice person, I should also free the memory. So that should be what we have. Oh Jesus, what did I do? Yeah, I need to cast the void pointer to an integer. Yeah, no, it's an integer pointer. What did I do? Is it consistent? Yikes. Huh, well, that was a crappy example. Yes, because it has to be a pointer to a pointer, so join can write the value there. What did I do? No, it's local counter. Yeah, it should. That's really weird. Oh, of course, local counter is not initialized. Whew, there we go. Okay, I'm not that bad at programming, I'm just forgetful. So 80,000 every single time. Okay, whoever caught
that, very very good. Yeah, so I didn't initialize local counter, so it was just some random garbage. Oh, okay, yeah, so the question is: the one without malloc, why did I get such random values? And I would have gotten the random values even in the case where I only had a single CPU. So if I go back, assume I had the racy code, so plus plus counter. It's deceptive. That's a global variable, so that's actually three operations. It has to load that global variable from memory into a register or something like that, then it's going to increment that, and that's going to be local to a thread, and then it's going to have to write that value back out to memory. So you can even get the wrong value if you have two threads and each one only adds to it once. Because there's concurrency, threads can get interrupted at any time and you can't control it. So the initial value is zero. One thread can go ahead and load that value from memory into a register, which would be zero. Then it could context switch to another thread. That value hasn't changed yet, so it would load it from memory and get zero. Then at that point either thread could execute; they would both increment it to one, and then they would both write one to that memory location. So instead of what we would expect, where it writes a one and then a two, it would instead write a one and then a one. So that's the crux of our problem, and we'll get into that next lecture, unless I have to do quiz review. Yeah, so just remember: processes and kernel threads enable parallelization. Processes can have multiple threads. We saw a bunch of implementations that are probably valid for the quiz. The operating system gets some more complexity, and now we have synchronization issues, but hey, we've shown how to solve it at least one way. So just remember, I'm pulling for you, we're all in this together.
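The interleaving described above can be replayed by hand in ordinary single-threaded C. This is only a sketch: `counter_mem`, `reg_a`, `reg_b`, and both function names are hypothetical stand-ins for the shared global and each thread's private register, and the "context switch" is just the order the statements are written in, not real scheduling.

```c
/* Stand-in for the global counter that lives in shared memory. */
static int counter_mem;

/* Replays the bad schedule: both "threads" load before either one stores. */
int lost_update_demo(void) {
    counter_mem = 0;
    int reg_a;               /* thread A's private register */
    int reg_b;               /* thread B's private register */

    reg_a = counter_mem;     /* A: load, sees 0 */
    /* context switch to B before A has written anything back */
    reg_b = counter_mem;     /* B: load, also sees 0 */
    reg_a = reg_a + 1;       /* A: increment its own register to 1 */
    reg_b = reg_b + 1;       /* B: increment its own register to 1 */
    counter_mem = reg_a;     /* A: store 1 */
    counter_mem = reg_b;     /* B: store 1 again, so A's update is lost */
    return counter_mem;
}

/* Replays the schedule we expected: A finishes all three steps before B starts. */
int no_interleave_demo(void) {
    counter_mem = 0;
    int reg_a;
    int reg_b;

    reg_a = counter_mem;     /* A: load 0 */
    reg_a = reg_a + 1;       /* A: increment to 1 */
    counter_mem = reg_a;     /* A: store 1 */
    reg_b = counter_mem;     /* B: load 1 */
    reg_b = reg_b + 1;       /* B: increment to 2 */
    counter_mem = reg_b;     /* B: store 2 */
    return counter_mem;
}
```

The first function ends with 1 in memory and the second with 2, even though both run the same six steps. Only the order differs, and with real threads that order is up to the scheduler.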