All right, good afternoon, or whatever time it is. Today's lecture might be a bit scuffed: my preview screen is a giant red cube, so I can't see my own screen and I can't answer anything on Discord. Cool. So today we're going to talk about thread implementation, which will help you a bit with Lab 3. This week should be sufficient: this lecture is the broad overview, then we'll get into some nuts-and-bolts things tomorrow, and we'll finish up virtual memory tomorrow, and that'll be it. So you can start Lab 3 right now. One part of Lab 3 is to write your own little test case, and if you want an answer to "what should this value be at this point?", I'll run it with my solution and you'll get to know. So you can at least think about it and write that without actually implementing anything. It really sucks that I can't see my own screen. Okay, so last time we got into a weird situation where it printed "in run" twice, and I had to rack my brain over it, because that shouldn't happen. Unlike what most people think ("oh yeah, the system's broken, it's not my code"), this time it clearly isn't my code: I think this is actually a bug in the C standard library, because it never happened before I updated. You can get an idea of what happened here, and we'll go into more detail today: there are two writes from the same process. One thread's write of "in run" gets interrupted but actually succeeds, and then the other thread, which has a different number, also calls write for "in run". Everything is in the same memory space, so there's clearly some problem here that was newly introduced. I'm not going to fix it, but that should not happen; it's a bug in the C standard library, which is a new one. So you should never see that: from what we did yesterday, you should just see "in run" then "in main", or just "in main"; you shouldn't see "in run" printed twice. But it happens that the main thread is actually printing "in run", which shouldn't be possible. So that's cool. All right, we're going to discuss why things like that might happen today. There are a few different multithreading models, so the question is: where do we implement threads? There are two types of threads, user threads and kernel threads, and we're going to see both in this course. User threads live completely in user space; the kernel doesn't care about them and doesn't treat your process any differently. Back in first year, when you had a program, it started executing main. We now know that's one thread executing main, and it just goes through sequentially until it's eventually done, calls exit, or returns from main, and that's the end of the process.
So if you implement threads in user space, it's still the same thing according to the kernel: you have one single thread of execution. But with user threads, you're causing that thread to jump around, and you're writing all of that yourself; that's what you are doing in Lab 3. The kernel doesn't know what you're doing, and Valgrind doesn't really know what you're doing either. One of the things a thread has is a stack, so I provided you a new-stack function that registers the stack with Valgrind; otherwise you're going to get like 10,000 pages of errors, think you have memory leaks and really bad things going on, and find it impossible to debug. Yep. Yeah, so in Lab 3, only one thread will be running at a time. Sorry? Yes, in this case it would of course be concurrent and not parallel, because there's only one real thread running. It does make things easier to program if you want to do a lot of things concurrently, and if you switch fast enough, it looks parallel. Back in the day, if you only had one CPU core, that was your only option, and it does make some problems easier to express: if you just have threads and want to work on little things here and there, threads are good even on a single core. What's more useful, though, is kernel threads. That means the kernel knows about the threads, you can have more than one, and hence you can have parallelism, although that's up to the kernel. With kernel threads, the kernel manages everything for you: you don't have to allocate a stack for your threads, you don't have to allocate any bookkeeping information, it's all done for you. You also get parallelism, because the kernel can schedule threads just like processes, across multiple CPUs or however it wants; with user threads, the kernel only knows about one thread and can only schedule one. No matter what, threads are going to require a thread table of some sort, similar to the process table we saw before, but much more limited. It lives in user space or kernel space depending on whether you're doing user threads or kernel threads. If you're doing user threads, which you are, you also can't use the kernel's scheduling; you have to do your own scheduling, because you have multiple threads. And I see a wide-eyed face, but what's the easiest way to schedule something that we saw?
First come, first served. Luckily, that's your scheduling for the lab. First come, first served is just a queue: take things from the front, put them at the back. It's an easy way to schedule something, so you'll have to do that; in the lab it's first in, first out. So, in both of these models, your process can contain multiple threads. With user threads, you want to try to avoid system calls, or you just let one thread block everything. If you have user threads, the kernel doesn't know anything special about them, so they're really fast to create and destroy: you don't do a system call to create them, you just allocate a stack, some space to store registers, maybe some space to store status codes, things like that. And the context switches aren't heavy; all a context switch involves is swapping some registers, and that's it. The big drawback is that if one thread blocks, say one of your user threads calls something like write or read, the kernel doesn't know about any of your other user threads, so the whole process gets blocked and you can't execute any of the other threads you might want to run. That clearly isn't good, because one of the points of threads is to juggle multiple things: if one thing is waiting on I/O, ideally you'd just start executing something else. But with pure user threads, the kernel doesn't know any better; it thinks you have one thread and it's blocked, so it can't do anything. The opposite holds for kernel-level threads. The kernel knows about them, but it's going to be slower: it involves at least a system call, and even if it does the same thing, system calls are much slower because the CPU has to transition from user space to kernel space. And we now know the situation is probably more complicated than that, because the kernel is using memory and your process is using memory; your process has page tables, and the kernel has to manage its own page tables and find a way to swap those in without screwing up your program, and all sorts of fun stuff. So context switching is actually fairly involved. All of these threading libraries run in user mode anyway; the question is just whether the kernel knows about the threads. The thread library maps user threads to kernel threads, and there are two main models. There's many-to-one, which means all of your user threads map to exactly one kernel thread, that main thread, and the kernel only sees one process with one thread, and that's it. The way to properly support kernel threads is a threading library that's one-to-one: every thread you create in the library maps directly to one kernel thread, so for every thread you create, the kernel actually knows about it, and the kernel handles everything aside from a little bookkeeping in the library that says, hey, which threads do you have?
And then there are other models if you want to try to pick the best of both worlds: many-to-many, where you map multiple user threads onto the same kernel thread to get rid of the system-call overhead when you don't need it. The idea is: if I only have an eight-core machine, only eight things can run in parallel, so if kernel threads are really expensive to create, I'll just create eight of them and put thousands of user threads on top. It turns out that's actually hard to do. So, for Lab 3: many-to-one is the pure user-space implementation. It's going to be fast, and it's going to be portable, because what you're writing is just C code. C is essentially a portable assembler, whether we like it or not, and your library doesn't depend on the operating system, just on the standard C library; anywhere there's a standard C library, you could potentially use your threading library. The drawbacks, again: the kernel doesn't know about the threads, so if one thread blocks, everything blocks, and you can't execute anything in parallel; it's all concurrent, because the kernel only knows about one thread. You can switch between the things you're doing (remember, switching is concurrency), but you can't do anything in parallel. With one-to-one, we're using kernel threads, and there's just a thin wrapper around the system calls to make them easier to use, providing that layer of abstraction we all know and love, so we don't have to do the system calls ourselves. This way, you can exploit the full parallelism of your machine: it doesn't matter whether you have an eight-core machine, a four-core machine, or a beefy server with 128 cores; as long as you have a good kernel, it will try to run everything you throw at it in parallel, as fast as possible. But you do have the slower system call interface, and, if you call this a drawback, you lose some control. You're at the mercy of the scheduler: you don't know exactly what it's going to do, and it could change from day to day, from update to update. If you write your own library and control the scheduler, you can make it do whatever you want; you could say, hey, give this thread most of the run time. But typically one-to-one is the implementation actually used, and it's the implementation for pthreads on Linux: every pthread you create is a kernel thread, and they can all run in parallel, or of course concurrently. Many-to-many is the hybrid approach. The idea, again, is more user-level threads than kernel-level threads, since kernel threads are more expensive to create. You cap the number of kernel threads at the number that can run in parallel, or maybe a few more, to give some leeway in case one gets blocked for whatever reason. That way, you get the most out of your multiple CPUs and reduce the number of system calls. But you can imagine this leads to a complicated thread library, because you still have the fundamental problem: if one of the kernel threads blocks, it's blocked, and no other user thread mapped to that kernel thread can execute. You might even get super unlucky, where all the user threads scheduled on the free kernel threads can't actually execute, because they're all waiting on the one that's blocked.
If you want to solve that problem, you get into the scheduling idea of stealing work from another kernel thread, but again, this is in user space, so you can't really do that; it just gets super complicated, and basically you're stealing ideas from the kernel anyway. So you may as well give up and use the kernel implementation instead. This was trendy in, like, the 80s or 90s, and then people gave up because it produced way too many problems. If you think debugging now is bad, think about debugging that: "oh yeah, sometimes my whole program just halts if this thread happens to get scheduled on this kernel thread." It's a complete disaster. If you thought Lab 2 was bad, real-world debugging of that would be an absolute nightmare. If I really hated you, I would say go implement one of these, and then I'd see you all here again next year. Luckily, you are not doing that. Now, even having kernel threads complicates the kernel a lot, because remember, all the threads run within one process, within the same address space. So we get a lot of problems, and there are questions about how threads interact with the system calls we already saw. If we have a process with eight threads, and one of the threads calls fork, what the hell should happen? There are lots of things that could happen. You could create an exact copy of the process including all the threads: one of the threads called fork, but you have no idea what state all the other threads are in or what they're executing, and that could get really out of hand, especially if you have thousands of threads. What Linux does, if you fork in a process that has multiple kernel threads, is copy only the thread that called fork into the new process. Your new process again has exactly one thread, a copy of whichever one called fork. The rest applies as usual: it gets its own independent address space, which looks like a copy at the time of the fork, but that's it; the new process has a single thread. Now, if your new forked process calls pthread_exit, it's like the detached-thread scenario from before: if all the threads in a process call pthread_exit, the process exits. Since our new process has only one thread, if that thread calls pthread_exit, it exits the process. That's the caveat, at least as far as I can tell from the documentation, because as you also know, the documentation is sometimes ambiguous and kind of crappy; but as far as I can tell, that's true. Then there's another complication: if one thread calls fork, you don't know what all the other threads are doing. Say you have eight; seven threads could be monkeying with memory at the time of the fork, and you don't know whether memory will be in a consistent state afterwards. If you want to control that (which is a nightmare to debug as well, and I've never seen it used), there's a function called pthread_atfork that registers functions to run whenever a thread calls fork; you can use it to try to stop all the other threads and make memory consistent, so the fork, you know, isn't forked.
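For reference, pthread_atfork looks something like this. A minimal sketch; the handler names and what they print are made up for illustration:

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* prepare runs in the calling thread just before fork; parent runs in the
   parent just after fork; child runs in the new child process, which
   contains only the single thread that called fork. */
static void prepare(void) { puts("before fork: quiesce shared state here"); }
static void parent(void)  { puts("after fork, in parent: release it"); }
static void child(void)   { puts("after fork, in child: reset it"); }

int main(void) {
    pthread_atfork(prepare, parent, child);
    if (fork() == 0) {
        _exit(0);  /* the child leaves immediately in this sketch */
    }
    return 0;
}
```

The typical (painful) use is to grab every lock in prepare and release them in parent and child, so the child's copy of memory isn't frozen mid-update.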
But we're not going to cover that in this course; you should just know that it's a real big complication and pretty much impossible to get right. Then there's another complication: signals, since we all love signals. It was bad enough when we had one process with one thread. Now, what happens if you send a signal to a process with eight threads? Which thread should receive the signal? The main thread? All of them? Whichever one is least busy? What should it be? Well, Linux just makes your life difficult. If you have multiple threads, a signal arrives on exactly one of them, it's effectively random which one, it can change from signal to signal, and the signal handler runs in that thread. That makes your life a lot more difficult too, because now everything's running concurrently: multiple signal handlers could be running in parallel in different threads, depending on how unlucky you get, and you can see how this quickly spirals out of control. People didn't even like signals with a single thread; with multiple threads it's even worse. But thankfully you're done with signals now, so we can rejoice. All right, there's also another technique you might see that tries to get the benefits of a many-to-many thread library without actually doing a full-blown complicated library, and that's called a thread pool. The main goal is to make creating work really cheap. The technique: create a pool with a set number of kernel threads when you start your process, never destroy them, and keep reusing them. The idea is that you have a queue of tasks, and all those kernel threads stay alive for the entire duration of the process; all they do is wait for a task to come out of the queue. Whichever thread isn't currently doing anything pops something off the queue, does the work, and when it's done goes back for the next thing. So you only create the threads once. This works really well if you want parallelism but you're trying to parallelize lots of really, really small operations, where the cost of creating a thread might exceed the cost of the operation itself: you pay the cost of creating threads once and reuse them over and over, and when there's no work, you put them to sleep. Yep. So it's kind of like a mini library, but it's pretty simple: reuse kernel threads and stick a queue in front of them. Tons of different people implement thread pools (I think there's one in the C++ standard library now), so there are many implementations, but the idea is simple enough that you could implement it yourself; a sketch follows below.
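Here's a minimal sketch of that idea, assuming pthreads plus a mutex and condition variable (synchronization we'll only cover after reading week); all of the names here are made up:

```c
#include <pthread.h>
#include <stdlib.h>

/* A task is just a function pointer plus an argument. */
struct task {
    void (*run)(void *);
    void *arg;
    struct task *next;
};

static struct task *queue_head = NULL, *queue_tail = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t have_work = PTHREAD_COND_INITIALIZER;

/* Each pool thread loops forever: sleep until a task arrives, run it, repeat. */
static void *worker(void *arg) {
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (queue_head == NULL) {
            pthread_cond_wait(&have_work, &lock);  /* sleep while idle */
        }
        struct task *t = queue_head;
        queue_head = t->next;
        if (queue_head == NULL) { queue_tail = NULL; }
        pthread_mutex_unlock(&lock);
        t->run(t->arg);  /* do the actual work outside the lock */
        free(t);
    }
    return NULL;
}

/* Create the kernel threads once, up front, and never destroy them. */
void pool_start(int nthreads) {
    for (int i = 0; i < nthreads; ++i) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        pthread_detach(tid);
    }
}

/* Submitting work is just an enqueue plus a wakeup; no thread creation. */
void pool_submit(void (*run)(void *), void *arg) {
    struct task *t = malloc(sizeof *t);
    t->run = run;
    t->arg = arg;
    t->next = NULL;
    pthread_mutex_lock(&lock);
    if (queue_tail) { queue_tail->next = t; } else { queue_head = t; }
    queue_tail = t;
    pthread_cond_signal(&have_work);
    pthread_mutex_unlock(&lock);
}
```

Since pool_submit never creates a thread, handing out thousands of tiny tasks never pays the thread-creation cost again.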
But so, back to the lab: this is what you are implementing, many-to-one. Remember that process state diagram we saw in lecture, what was it, four? Well, it's the same idea for threads. Threads have the same lifetime; they just share the same address space. So there's the same state diagram you can keep in your head, or maybe track as part of your thread control block, which is just the bookkeeping information associated with your thread. This is the lifetime of a thread. When you call wut_create (and I'm proud of myself for that name, it's hilarious; it doesn't take much to amuse me), that creates a thread, bringing it from the created state to the waiting state. You could also just call waiting "ready": the thread is able to get scheduled and execute. The waiting state holds all the threads that are currently able to execute, and you would likely implement it as your FIFO queue, just a giant queue. We might go over a queue library you can use, so you don't have to implement a linked list again; we'll do that tomorrow, after we finish virtual memory, and then we'll do Lab 3 stuff. Eventually, when a thread gets scheduled to run, when you pick it from the front of your FIFO queue, it goes from ready (or waiting) to running. And remember, because you are implementing a pure user-thread library, only one of your threads can execute at a time; you're essentially sharing the same kernel thread over and over again. When a thread is running, the model for this lab is cooperative user threads: the only way the CPU is taken away from a thread is if it consents, by calling something called yield. If it calls yield, it gives up its turn, goes to the back of the ready queue (or the waiting queue, whatever you want to call it), and the next thread runs. This continues until eventually a thread calls wut_exit, which terminates it. But, like processes, as soon as a thread terminates it can't clean up all of its own resources yet, so it has to stick around in the terminated state; it's essentially a zombie thread at that point, and it has to wait for someone to join on it. Now, what happens on a join? The running thread might call join, and if it joins a valid thread, it gets blocked so it can't execute anymore; make sure it doesn't go into your ready queue, because it's waiting on some thread to actually terminate. (Or the target might already be terminated, in which case the joiner just reads the status value and goes back into the ready queue.) If it needs to wait, it goes into the blocked state, where it can't execute anymore. Then, when the thread it's waiting on actually terminates, you can free all the resources of the thread you waited on: it's no longer in use and can't execute anymore, so you can deallocate its stack and everything to do with it. Then you read its status code and return from join by putting yourself back into the ready queue, and eventually you get scheduled again. All right, any quick questions about that, or clarifications, for people who have already read Lab 3? That's the basic idea.
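One hypothetical way to write that state machine down as a thread control block in C; these names are illustrative, not the lab's required API:

```c
/* States matching the lifetime described above. */
enum thread_state {
    STATE_CREATED,    /* allocated, not yet in the queue                */
    STATE_WAITING,    /* a.k.a. ready: sitting in the FIFO ready queue  */
    STATE_RUNNING,    /* the single thread currently executing          */
    STATE_BLOCKED,    /* joined a live thread; not in the ready queue   */
    STATE_TERMINATED  /* exited; a zombie until someone joins it        */
};

/* Per-thread bookkeeping kept in the thread table. */
struct thread_control_block {
    int id;
    enum thread_state state;
    void *stack;      /* allocated per thread and registered with Valgrind */
    int status;       /* exit status, read by whoever joins this thread    */
    int waiting_for;  /* id of the thread this one joined, if blocked      */
};
```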
Okay, some people might not have read it yet, which is fine. Your scheduler can just be round-robin: a straight FIFO queue. You just create a queue or a list, run the thread at the front, and when it yields, throw it to the back and pick up the next thread from the front, and that's it. In this lab you'll also have to do the context switch, and remember, that's just saving and swapping registers. Thankfully there's a library for that, which will make your life much easier, so you won't have to play with the registers yourself. We might look at what you'd have to do if you had to implement that library, because it's a bit more difficult, but there's a library I'll show you how to use tomorrow, if you haven't read the lab yet, that actually swaps all the registers for you. And this is going to be, again, like I said, cooperative threads, so they have to be nice. The next step would be preemptive threads, but I probably won't do a lab for that, because once you've done this, the way to turn your cooperative threads into preemptive threads is essentially to force threads to call yield. If you have a mechanism to force threads to call yield, you can easily make them preemptive. Most of the heavy lifting is the user threads with cooperative scheduling, and because you control everything, everything is FIFO, and only one thread runs at a time, everything has a deterministic order, so it should be significantly easier to debug. Now, this is going to be our next complication, the one that carries us past reading week. This will not be on the midterm, but it gives us an idea of the complications threads bring. Let's create a little program. All it's going to do is create eight threads, and each thread does something fairly simple: increment the same variable 10,000 times. So the question is: what should the final value of that variable be? If the initial value is zero and we have eight threads, each incrementing it 10,000 times, what should the final value be? Yeah? 80,000, right? Yep. Yeah, and that's the complication: what happens when two threads try to increment at the same time? If I gave this to you, you would probably say it should be 80,000; that would make sense, eight threads, 10,000 times, eight times 10,000 is 80,000. So let's go ahead and look at it, because I guarantee that's not what's going to happen, because that's never what happens. Oops, I didn't open it. So, in here, let's read the code (it's sketched below). It actually looks a little similar to how you'll use your library, except your library is a lot simpler. We allocate a bunch of threads on the stack, an array of num_threads, in this case eight, then declare an i, and then a for loop that executes eight times and creates all eight threads. Remember what the signature looks like: pthread_create's first argument is a pointer to the thread (thankfully, in your library, all you do is give it a function, so yours is a lot simpler); then some attributes that we don't use; then the function to run in the thread, which is all you do in Lab 3: when you create a thread, it has a function that runs when the thread starts executing. And then an argument, which I'm not going to give it, because I don't care.
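Reconstructed from the walkthrough, the demo looks roughly like this; the exact names in the lecture's file may differ:

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8

static int counter = 0;  /* shared global, starts at zero */

/* Each thread increments the shared counter 10,000 times. */
static void *run(void *arg) {
    (void) arg;
    for (int i = 0; i < 10000; ++i) {
        ++counter;  /* not atomic: it's a load, an increment, and a store */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i) {
        pthread_create(&threads[i], NULL, run, NULL);
    }
    for (int i = 0; i < NUM_THREADS; ++i) {
        pthread_join(threads[i], NULL);  /* wait for all eight to finish */
    }
    printf("counter = %d\n", counter);   /* "should" be 80000 */
    return 0;
}
```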
And then there's our main thread. It creates eight threads, and if it wants to properly wait for all of them to finish, and also clean up all their resources, it should essentially call wait, which in the thread case is called join. So we create eight threads, they start executing whatever the hell they feel like, and we wait for them all to finish by calling join on all eight of them. Then we print the value of counter, and counter is just a global variable that starts at zero. And this is what each thread runs: each thread declares its own i, and in its for loop it executes 10,000 times, and all it does is increment the counter. That's it. So we think it should be 80,000; let's go ahead and test that. If I execute it: 75,614, which is kind of close. 71,000; okay, I'm worse. 76, 55 (wow, that was terrible), 64, 68; we're getting worse, eh. So this is what makes things hard, too. This is a case where you would commonly see the number be wrong, but even with a very high likelihood of being wrong, sometimes it's right. If that happened to be my first run and I said "yeah, I'm done testing, everything works perfectly", that's probably not going to fly. I got 80,000 once, but then I ran it again and got 42, which is awful. This would be a really bad idea if it were something computing your grades: sorry, you get a 60% now, I don't know what happened. Yeah. That is what we will solve when we come back from the midterm, but any questions about that, or any guesses as to why it's happening? Yeah. So, if you didn't hear that: basically, the problem is with ++counter. counter is a global variable, so if you actually disassemble this program, or remember assembly (or maybe don't remember assembly), here's what ++counter actually does. counter lives at some memory address; let's call it, I don't know, 0xBEEF. So it's some address, that's where counter lives, and there are four bytes there, which start at zero. When you use ++counter, what actually happens on each CPU core is this: it reads the value at that address from memory into some register. And remember, all the registers are local to a core; memory is shared, but registers are personal. So CPU 1 reads it into its register, say register 1, and the next thing the CPU does is increment register 1: if it read a zero into register 1, incrementing makes it a one. Then it writes that back out to memory at whatever that address was. Now we have to keep in mind what is specific to a thread and what is not. That's what happens on, call it, thread 1. But because things happen concurrently, even if you only have one CPU core swapping between the threads, thread 2 wants to do the same thing: it reads the value at 0xBEEF into its own register 1, increments its register 1, and writes its register 1 back to 0xBEEF.
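Spelled out in C, the three steps hiding inside ++counter look something like this sketch, where reg plays the role of a per-thread CPU register:

```c
static int counter = 0;  /* shared: one location in memory ("0xBEEF") */

/* What ++counter actually does on the CPU, as three separate steps. */
static void increment(void) {
    int reg = counter;  /* 1. load the shared value into a register    */
    reg = reg + 1;      /* 2. increment the register, which is private */
    counter = reg;      /* 3. store the register back to memory        */
}
/* A thread switched out between steps 1 and 3 will later store a stale
   value, overwriting any increments other threads made in the meantime. */
```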
So if these are the only two threads, and they each just increment once, then because things happen concurrently and can switch back and forth, here's what might actually happen. Thread 1 starts executing, reads from memory, and reads the initial value of zero. (Someone needs to answer the phone.) All right, so thread 1 reads the initial value of zero, and at this point, before we even increment, we might swap over to thread 2. If we swap over to thread 2 and it reads the memory location, it also reads a zero. And at this point, it doesn't matter what happens next. Thread 2 can go ahead and continue executing: it read that zero into its register, which is personal to it and won't change; it increments that register from a zero to a one; and it writes the value one to 0xBEEF. So now it looks like counter equals one. Then we might switch back to thread 1, and it can finish executing. It does the same thing: it had read a zero into its register before we swapped away, so when we swap back, its register still has the value zero in it. So it also increments that register to a one, and writes one to 0xBEEF, which just writes counter equals one again. So instead of one thread writing one and then the other thread writing two, they both end up writing one, because one of them saw a stale value. Yep? Nope. Yeah, so the problem here is that I can't control when any switches happen, and if you can't control them, you have to argue about all of them for correctness. Yep. Yeah, so what we would like is to prevent switching at the wrong moment: if something has done its read, it should be able to finish its write before it switches. We don't know how to do that yet. It is possible, but you have to identify situations like this and actually prevent them, and we'll see how to prevent it after reading week. So there is a way. But note: the first thing you want is for switching to happen at all, because if switching can happen, things can also run in parallel; and if I have a mechanism to stop this from happening, then those parts can't run at the same time either, which might not be great. Oh yeah, that's a good point; are you trying to connect that back to the numbers we saw here? Yeah. So, about those numbers: if you were being interrupted that often, it seems like it would be almost impossible to ever get 80,000. If there were a 50-50 shot of being interrupted after each read, no way do I win that coin flip 80,000 times. What's more likely happening is that each thread just reads, increments, writes, over and over, and then a switch happens to another thread that already holds a value that's way, way stale. So what could happen, going back here: say threads run for a long time, and we're at the stage where the counter is at 20,000, and this thread reads 20,000 and gets interrupted right here, so it still needs to do the rest of these steps. And say we only have a single core, just to make things easier. This other thread could then execute: read 20,000, write 20,001, and, because it probably has a big time slice, get all the way up to 30,000, maybe even finish, and eventually write out 30,000.
Then it switches back to the original thread, which still has 20,000 in its register, and it just obliterates everything. That's also why we see gigantic variance in the numbers: each thread is doing a lot of work, and one unlucky switch wipes out a whole bunch of increments, which is why fewer of them stick, and why sometimes I see 80,000. Let's see; there are a lot of interesting things we can ask here. What about volatile? What if we make counter static volatile, like that? Who knows what volatile means, or is it just one of those C keywords? Kind of. So let's see if a magical C keyword will fix us. Nope. Yeah, C doesn't have that keyword; what you're describing is called atomic, and we'll get into atomic stuff next time. But in this case, volatile does nothing. I hopefully structured Lab 3 in a way that you never have to use volatile, because it's kind of a mess and people don't know what it means. So here's what volatile means (this kind of gets into compilers). Say I have a normal program: int x = 2, then y = x + 3, then z = x + 6. Every time I access x, I'm reading from memory, but in a normal program I expect that if I set x to 2 and then read it for y, it's still 2, and if I read it again for z, it's still 2. So I don't actually need to read it from memory again; I can reuse the value I already read to compute both of these. What volatile does is force a memory read every time you access the variable, because what volatile is meant for is memory that changes without you doing anything. That's what it's for: if you have hardware, like an Arduino or something like that, you might be looking at some memory that something else changes. The assumption that if I read x as 2 and read it again it'll still be 2 holds for normal programs, but not for anything you don't control, and that's what volatile is for. Forcing a memory read when you don't actually need it will make your program slower: if I were to optimize this, I'd only need to read x once, but with volatile I need to read it twice. So if you don't know what you're doing with volatile, you'll probably just make things slower, and unless you're doing an Arduino thing (and please, not Lab 3), you don't need to use it. I think with the things I'm letting you do in Lab 3, you don't need it. I guess I didn't need to explain it, but I did. All right, any other questions? I think we have some extra time. Tomorrow we'll wrap up the virtual memory lecture I left off on, and we can talk about Lab 3, so if you want to ask anything about Lab 3 or get a good idea of how to do it, we'll go over that tomorrow. We're in this together.
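That x, y, z example, written out as a small compilable sketch; only the variable names come from the lecture, the rest is illustrative:

```c
#include <stdio.h>

volatile int x;  /* volatile: x may change behind the compiler's back,
                    so every access must be a real memory read */

int main(void) {
    x = 2;
    int y = x + 3;  /* one actual load of x from memory */
    int z = x + 6;  /* another actual load; without volatile the compiler
                       could reuse the value from the first read */
    printf("y = %d, z = %d\n", y, z);
    return 0;
}
```

Note that volatile only forces the reads; it does nothing to make the load-increment-store of ++counter atomic, which is why it doesn't fix the counter demo.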