[The opening several minutes of the recording are unintelligible.] …These are fairly good ways to obfuscate the flow of your program, and they make it extremely difficult to test and debug. Most errors related to multi-threading are, probably, in the synchronisation. They depend on the timing, so they only show up if the scheduling happens to fall a certain way, which typically only happens under heavy load — load which you don't see in testing. [unintelligible] …With multi-core and multi-processor machines becoming the standard — even in laptops now — you need to make use of those processors. Servers, of course, already tend to have multiple processors, and to use those you need multiple threads. You could use multiple processes as well, but inter-thread communication is a lot cheaper. Multiple processes can share memory, but then you have much the same problems as with threads, which share all their memory. Also, if you use shared memory between processes, the shared region can have a different address in each process, and that causes its own problems:
you can't put pointers in it, and you can't use C++ polymorphic objects, because they contain a pointer to their vtable. So in many cases you need to use threads.

So, how do we structure applications into threads? The easy ways, which I imagine everyone's tried: in a server program you have one kernel thread for each client, which is very easy to do. One problem with that is it uses lots of virtual address space. On Linux, each thread by default gets 8 megabytes of stack space. It won't necessarily use all of that, but it's reserved, and there are all the other resources associated with a thread as well. If you've got a thousand clients simultaneously, you've got a thousand threads, and there's a huge amount of scheduling and synchronisation overhead between them, so you don't really make good use of the processor resources if you have a lot of clients. With a few clients it works out just fine.

In interactive applications, you might use one thread per task — a UI thread, a thread for network IO, a thread for file IO, and so on. But if you do it that way, you'll usually find that one of the tasks you've divided the program into does the bulk of the processing. The others are mostly blocked on IO, so on a multi-processor machine one thread may run flat out doing the intensive processing while the other processors sit idle. That's pretty wasteful.

If you really want to use all the available resources in a server application, you need a thread pool, with one thread assigned to each processor. You create one thread for each processor in the system, and hopefully the scheduler assigns them properly — you can maybe hint with processor affinity to make that work better. But, of course, that's hard to write.
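The thread-pool idea described above can be sketched very minimally with POSIX threads. This is an illustrative sketch, not any particular library's API — the names (`pool_t`, `pool_run`, `NWORKERS`) are invented, and a real pool would use a proper task queue rather than a shared counter:

```c
/* Minimal fixed-size thread-pool sketch using POSIX threads.
   NWORKERS stands in for "one thread per processor". */
#include <pthread.h>
#include <stddef.h>

#define NWORKERS 4

typedef struct {
    pthread_mutex_t lock;
    long next_task;      /* index of the next task to hand out */
    long ntasks;         /* total number of tasks */
    long result;         /* shared accumulator, protected by lock */
} pool_t;

static void *worker(void *arg)
{
    pool_t *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        if (p->next_task >= p->ntasks) {      /* no work left: exit */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        long task = p->next_task++;           /* claim one task */
        pthread_mutex_unlock(&p->lock);

        long r = task + 1;                    /* "process" the task */

        pthread_mutex_lock(&p->lock);
        p->result += r;                       /* publish the result */
        pthread_mutex_unlock(&p->lock);
    }
}

/* Run ntasks tasks across NWORKERS threads; returns the sum 1..ntasks. */
long pool_run(long ntasks)
{
    pool_t p = { PTHREAD_MUTEX_INITIALIZER, 0, ntasks, 0 };
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, &p);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return p.result;
}
```

Note that even this toy version already shows the complications the talk mentions: the workers contend on one mutex, and nothing here deals with blocking IO.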
You then have to have your own scheduler to assign jobs to those threads, and it doesn't necessarily interact well with other applications, because if you have multiple applications each trying to use all the processors, they compete and you're back to relying on the kernel scheduler. Finally, if you do it that way, all your IO needs to be non-blocking; or, if you have to use blocking IO, you can keep extra threads in the pool, release one when a running thread is about to block, and at some point return a thread to the pool rather than putting a new job onto it, so that you end up with one runnable thread per processor. It's all really rather complicated, and I don't have a good answer to it.

For interactive applications, whatever processor-intensive work there is — if there is any; that's not always the case — really needs to be parallelised. So, again, one thread per processor to do whatever it is: image manipulation, video processing. For high-performance computing and number crunching there are frameworks that make this somewhat easier, but I don't believe those carry over well to other kinds of application.

Synchronisation. Obviously threads need to be synchronised so that they don't trample over each other's changes to shared memory. If you don't do it in the right places, your program breaks. But if you do it all over the place — a mutex for every single little piece of data — your threads are likely to spend an awful lot of time waiting for each other. If your threads are mostly blocked, you're not using the processor time that's available, and the program doesn't make good use of the processors. For synchronisation you probably know about mutexes, locks and condition variables, which are reasonable building blocks, but at a pretty low level.
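The lock-ordering problem discussed next has a classic illustration: two threads taking the same pair of mutexes in opposite orders. A common fix, sketched here with invented names (`account_t`, `transfer`), is to impose a single global acquisition order — here simply "lower address first":

```c
/* Deadlock avoidance by global lock ordering (sketch). */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    long balance;
} account_t;

/* Always lock the lower-addressed account first, so two threads
   transferring in opposite directions can never deadlock. */
void transfer(account_t *from, account_t *to, long amount)
{
    account_t *first  = from < to ? from : to;
    account_t *second = from < to ? to : from;
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}

static account_t g_a = { PTHREAD_MUTEX_INITIALIZER, 100 };
static account_t g_b = { PTHREAD_MUTEX_INITIALIZER, 100 };

static void *churn_ab(void *arg)
{ (void)arg; for (int i = 0; i < 10000; i++) transfer(&g_a, &g_b, 1); return NULL; }
static void *churn_ba(void *arg)
{ (void)arg; for (int i = 0; i < 10000; i++) transfer(&g_b, &g_a, 1); return NULL; }

/* Two threads transfer in opposite directions; without the ordering
   above this would frequently deadlock.  Returns the conserved total. */
long transfer_demo(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, churn_ab, NULL);
    pthread_create(&t2, NULL, churn_ba, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return g_a.balance + g_b.balance;
}
```

As the talk notes, this only works when every piece of code — including libraries and their callbacks — participates in the same ordering, which is exactly what's hard to arrange.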
One of the problems with locks is deadlock: two threads that each need to acquire multiple locks at the same time. If they acquire them in different orders, you can get deadlock, so you need to establish a lock ordering — but how do you do that across a whole program? Say you're writing a library that needs to be thread-safe; it will have mutexes protecting its internal state, and maybe it calls back into client code while holding a mutex. Now the client's mutexes need to participate in a lock ordering the library doesn't even know about. It's very difficult to compose multiple blocks of code in this way.

One thing that is being quite seriously researched is lockless algorithms and data structures. You may know that those are increasingly used in the kernel. There's a technique called read-copy-update, which allows you to avoid using a mutex when updating certain shared structures — in the kernel, obviously; the kernel is effectively a massively multi-threaded system, with everything shared between all processors. Lockless techniques can help performance and avoid deadlock, but they're very, very hard to write correctly — strictly for experts.

One commonly used lockless technique is double-checked locking. This is a technique for, hopefully, speeding up lazy initialisation of a pointer to a singleton object: you check the variable before taking the lock. The trouble is, it's broken. It just doesn't work — and it was advocated for years by people regarded as experts at the time.

Earlier on I mentioned language semantics. We tend to assume, although we should know better, that the operations of our programs — reading and writing variables, synchronising, doing IO — are done in the same order as they are written. But of course that doesn't really happen: compiler optimisations reorder them to make the program run faster.
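The double-checked locking pattern the talk calls broken looks like this in C with pthreads (names like `get_instance` are invented for illustration). The comments mark exactly where it breaks — the point of the example is the pattern, not a recommendation:

```c
/* Double-checked locking as commonly written — BROKEN, as the talk says. */
#include <pthread.h>
#include <stdlib.h>

typedef struct { int ready; } thing_t;

static thing_t *instance;                    /* lazily initialised singleton */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

thing_t *get_instance(void)
{
    if (instance == NULL) {                  /* 1st check, WITHOUT the lock:
                                                this read races with the write
                                                below, and nothing stops the
                                                compiler or processor from
                                                reordering around it */
        pthread_mutex_lock(&init_lock);
        if (instance == NULL) {              /* 2nd check, under the lock */
            thing_t *t = malloc(sizeof *t);
            t->ready = 1;
            instance = t;                    /* the pointer store may become
                                                visible BEFORE the store to
                                                t->ready — another thread can
                                                see a non-NULL, uninitialised
                                                object */
        }
        pthread_mutex_unlock(&init_lock);
    }
    return instance;
}
```

Run single-threaded it behaves fine, which is precisely why the bug survived review for years; the failure needs the right interleaving and the right hardware.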
This includes eliminating and duplicating memory accesses, in some cases by caching variables in registers. Also, some processors don't support writes of some sizes. For example, on an early Alpha at least, you couldn't write a single byte: you had to read the whole word, modify the one byte within it, and write back the whole word. So if two variables shared a word, writing one of them also wrote the other. That can introduce race conditions into a program that apparently doesn't have any.

Processors also reorder reads and writes depending on where the data sits in the memory system; they don't necessarily perform them in program order. There are some constraints, but reads are the bigger problem: if you read a variable that has to come from main memory and then read a variable that's in cache, the cached read can actually complete earlier. This is all OK in a single-threaded program, although it does cause some problems for signal handlers. That's why it's allowed — why compilers and processors get away with it. But, as I said, it introduces race conditions into multi-threaded programs, and the same goes for processes that use shared memory with other processes.

The usual synchronisation primitives such as mutexes do deal with processor reordering, but they don't necessarily deal with compiler reordering. In general, because the mutex lock and unlock functions aren't seen by the compiler, it can't assume anything about what goes on inside them, so it effectively has to flush variables to memory around those calls. But there are some cases — such as where the lock and unlock are conditional — where that doesn't actually happen.
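The word-sharing problem described for the early Alpha can be sketched like this. The struct and function names are invented; on hardware with byte stores (x86 and later Alphas) this code is fine, which is the point — the race only existed where the compiler had to emit a read-modify-write of the containing word:

```c
/* Two adjacent byte-sized variables that may share one machine word.
   Each thread writes ONLY its own byte.  On processors without byte
   stores, the compiler must read the whole word, modify one byte, and
   write the whole word back — so the two threads can silently clobber
   each other's byte: a race the source code apparently doesn't contain. */
#include <pthread.h>

static struct { char a; char b; } shared;   /* adjacent, likely one word */

static void *bump_a(void *arg)
{ (void)arg; for (int i = 0; i < 100; i++) shared.a++; return NULL; }
static void *bump_b(void *arg)
{ (void)arg; for (int i = 0; i < 100; i++) shared.b++; return NULL; }

/* Returns 1 if both counters survived intact (true on byte-store CPUs). */
int word_sharing_demo(void)
{
    pthread_t ta, tb;
    shared.a = shared.b = 0;
    pthread_create(&ta, NULL, bump_a, NULL);
    pthread_create(&tb, NULL, bump_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return shared.a == 100 && shared.b == 100;
}
```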
In order for a language to support multi-threading properly, it needs to have a memory model that defines exactly what reordering is possible — what the compiler and the processor may do to reorder operations — and how you, as a programmer, can limit it. In general, there isn't one. Java is the notable exception: Java does have a memory model that covers multiple threads, but I don't know about you — Java's not my favourite language. The C and C++ languages are currently stuck with a single-threaded abstract machine, so if you use multiple threads you'll normally be relying on the POSIX threads semantics — but bad luck, POSIX speaks in terms of memory locations, not in terms of variables or other elements of the language semantics, so it doesn't help much. There is some active work on a multi-threaded memory model for the C++ language. C#, which I guess we'll see more of on Linux thanks to Mono, does have a multi-threaded memory model, but it's a bit vague, so you're on shaky territory there. Higher-level languages — I'm afraid I don't know. There may or may not be a problem; it probably depends on whatever language was used to implement them.

So why am I talking about this at a Debian conference? If you're maintaining a program that uses multi-threading, which more and more programs do, you are likely to see, I think, increasing numbers of bug reports due to synchronisation errors in those programs as people move to multi-processor machines. You may also see problems resulting from the language semantics — or rather the lack of them. Debian, obviously, supports multiple architectures, and they don't all have the same rules for processor reordering, so you might find some problems appear on some architectures and not others. You've got to watch out for that. So I don't have a lot of good news, I'm afraid.
There are some links here for further reading, which — along with the rest of this talk — are on my homepage. So I think we've got time for questions and discussion now.

Q: I didn't understand your point about out-of-order execution. When I think of C or C++, I think of sequence points at places like semicolons, the && operator and the || operator. Where is my conception wrong?

A: Well, "sequence points" is something of a misnomer — they aren't points in the program, and that's possibly where the confusion comes from. Sequence points are restrictions on ordering. They say that this must happen before that — but the compiler is only required to behave as if things had happened in that order. So there's the rule that if you write to a variable twice without a sequence point between the two writes — without any ordering between them — then the behaviour is undefined. If you have a sequence point between them, that's fine; but the compiler might see that there's no read between those two writes and simply eliminate the first one. It might be that you've arranged the program so that another thread reads between them — the compiler doesn't have to know about that, and doesn't.

Q: And the only way to fix that would be to declare every one of those variables volatile, wouldn't it?

A: Volatile is itself only loosely specified; that's another problem.

Q: I think the whole execution model, the whole abstract machine model for C and C++, is centred around the word volatile. A compiler conforms to the specification if the compiled program performs all volatile accesses in the same order as an unoptimised program would, and it may leave out non-volatile accesses at will — [the recording is unclear here, but the point seems to be that] it can keep data in registers rather than memory, as long as it isn't volatile.
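The dead-store elimination described in the answer can be shown in a few lines. The names here are invented for illustration; `flag` stands for a variable some other thread is polling:

```c
/* The two stores to `flag` are separated by a sequence point, so the
   single-threaded semantics are perfectly defined.  But no read
   intervenes between them, so the compiler is entitled to delete the
   first store entirely — a polling thread may never observe the value 1. */
static int flag;                 /* imagined to be polled by another thread */

int publish_progress(void)
{
    flag = 1;                    /* "phase 1 done" — may be eliminated! */
    /* ... work the other thread was meant to observe progress of ... */
    flag = 2;                    /* only this store is guaranteed to survive */
    return flag;
}
```

Within one thread the function is indistinguishable with or without the first store, which is exactly why the optimisation is legal.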
Q: So you can use volatile accesses to enforce ordering?

A: Well, no, you can't. Volatile accesses are ordered with respect to each other, but that says nothing about the non-volatile accesses that happen to lie between them — those can still be moved.

Q: True. But if you use a mutex, then you're guaranteed that the mutex operations are ordered with respect to volatile — and I think even non-volatile — accesses.

A: A mutex acquire and release should, in theory, prevent any access from moving past them. But, as I said, there's an odd case where the acquire and release are conditional — I'll see if I can dig up an example for you later. It has been observed that some compilers will do an optimisation, caching a variable in a register across the conditional lock, that introduces a race condition there — and what the compiler is doing is perfectly legal. Whether this can ever occur when the acquire and release are unconditional is a good question; I don't know.

Q: I have a multi-threaded Perl program. The production version runs on a multi-processor system, gets a lot of load, and occasionally has bugs that I can't reproduce on my single-processor test system.

A: No — you generally won't reproduce them on a single-processor system. Part of it is just the slower processor, but mostly it's this: I don't have exact numbers to hand, but you get multi-threading bugs that almost never appear on a single-processor system, because only one thread runs at a time, so a race condition is only exposed if a context switch happens in exactly those few critical cycles. If you have multiple threads genuinely running in parallel, you'll hit it after a few minutes — maybe a bit longer in your case — but that's the sort of thing you see.
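The conditional-lock case mentioned in the answer is essentially the example from Hans Boehm's "Threads Cannot Be Implemented as a Library". A hedged sketch (the function and flag names are invented):

```c
/* The "conditional lock" hazard: because the lock and unlock are taken
   only on one branch, a compiler analysing the single-threaded semantics
   may hoist `x` into a register across the whole function.  That is
   legal for a single thread, but it turns the locked branch into an
   unprotected read-modify-write of the shared counter. */
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long x;                    /* shared counter */

long update(int threadsafe)
{
    if (threadsafe) pthread_mutex_lock(&m);
    x++;          /* may be compiled as: reg = x; ... reg++; ... x = reg;
                     spanning the conditional lock/unlock */
    if (threadsafe) pthread_mutex_unlock(&m);
    return x;
}
```

The transformed code is observably identical for one thread, so the compiler is within its rights — which is the talk's point about needing a real multi-threaded memory model rather than library conventions.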
Q: I'd hope for a bit more awareness of the existence of threads in Linux generally. Every time you look at documentation for a library, it's all written for the single-threaded case, and then there's: "if you're multi-threaded it's kind of messy, but just lock everything with one big mutex" — and that's basically it. And the library code you actually have is often a little different from that; you're left guessing. Then there are cancellation points — this is a cancellation point, that isn't — and you end up reading man pages and specifications like a lawyer. It's quite a project.

A: My advice on cancellation would be: don't use it. Don't use thread cancellation; use a condition variable and check it. Much safer. But then, OK, there are still signals. Signals and threads together are a really horrible case — it's kind of a mad hack. pthreads supports them, after a fashion, but in many cases you have to bend it into something that works 90% of the time.

Q: I'd like to see — well, I guess it's just wishful thinking — examples of how to be more thread-friendly in, say, the GTK libraries and so on [the recording is unclear here], so that people can feel comfortable using threads and don't feel like a freak for doing it. I guess that's not a very concrete request, but I wanted to get it out.

A: No, I don't have any terribly good advice on this. I can't point you to a guide, or to collections of code examples — say, a GTK tutorial showing exactly where to put the locking. Really, you want to share information between threads as little as possible. You want the ability to share, or to transfer ownership, between threads, but to actually share as little as you can, so that once a thread has…
…sole ownership of something, it doesn't need to use mutexes on it at all; it never has to worry about blocking to access it. So if you can possibly arrange for only one thread to have access to some piece of state, that may be better than letting every thread access the state directly and having to put mutexes everywhere.

Q: Well, formally, you need to declare everything that's ever shared between two threads as volatile, to force it to actually be written out.

A: Volatile doesn't solve that problem. Volatile is defined in terms of the single-threaded abstract machine, and it may or may not do what you need in a multi-threaded system. In particular, in most implementations of C and C++, volatile does not imply a processor memory barrier.

Q: But the compiler cannot tell whether a subroutine call will perform volatile accesses. So any volatile accesses have to have happened before a subroutine call — such as a call into the pthread library to acquire a mutex. So why—

A: Yes, but as I said, there are optimisations where that isn't true — the case of the conditional lock.

Q: A volatile access forces the compiler to actually emit the instructions to write the variable out, and the mutex acquire usually includes a memory barrier, so the access will really have happened by that point. It should be safe.

A: It probably is, yes. But you can't afford to make everything volatile.

Q: Yeah, that's a problem. We'd need better compilers that would understand, if you have a volatile buffer, or volatile data — do compilers do anything sensible in such cases?

A: No.

Q: Do you need to fix the language, then?

A: No, I think it's really a matter of implementation.
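The "transfer ownership instead of sharing" advice above can be sketched as a tiny one-slot mailbox. All names (`mailbox_t`, `mailbox_put`, `mailbox_take`) are invented for illustration; the producer hands a pointer over and never touches it again, and after `mailbox_take` the consumer owns the object outright and needs no further locking on it:

```c
/* One-slot ownership-transfer mailbox (sketch). */
#include <pthread.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    void *item;                          /* NULL means "empty" */
} mailbox_t;

void mailbox_init(mailbox_t *mb)
{
    pthread_mutex_init(&mb->lock, NULL);
    pthread_cond_init(&mb->cond, NULL);
    mb->item = NULL;
}

/* Give ownership of p away; blocks while the box is full. */
void mailbox_put(mailbox_t *mb, void *p)
{
    pthread_mutex_lock(&mb->lock);
    while (mb->item != NULL)
        pthread_cond_wait(&mb->cond, &mb->lock);
    mb->item = p;
    pthread_cond_broadcast(&mb->cond);
    pthread_mutex_unlock(&mb->lock);
}

/* Take ownership of the stored item; blocks while the box is empty. */
void *mailbox_take(mailbox_t *mb)
{
    pthread_mutex_lock(&mb->lock);
    while (mb->item == NULL)
        pthread_cond_wait(&mb->cond, &mb->lock);
    void *p = mb->item;
    mb->item = NULL;
    pthread_cond_broadcast(&mb->cond);
    pthread_mutex_unlock(&mb->lock);
    return p;
}

static void *consume(void *arg) { return mailbox_take(arg); }

/* Producer thread hands an int to a consumer thread; returns its value. */
int mailbox_demo(void)
{
    static mailbox_t mb;
    static int value = 42;
    mailbox_init(&mb);
    pthread_t t;
    void *got;
    pthread_create(&t, NULL, consume, &mb);
    mailbox_put(&mb, &value);
    pthread_join(t, &got);
    return *(int *)got;
}
```

Only the mailbox itself needs a mutex; the data passed through it never does, which is the whole point of the design.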
If you have shared volatile objects and a working copy that's not volatile, you can work on the copy — which can be cached in registers as much as you like — and just copy it over to the volatile area at the end. The compiler should be able to sort that out. That's an implementation issue.

Q: What about the memory model of scripting languages — Perl, Python, Ruby? Do you know anything about those?

A: I don't really know. Python — the global interpreter lock in Python only allows one thread to run in the interpreter at a time. If I remember correctly, if you have multiple threads, one of them is running in the interpreter loop and the others are either blocked or busy in extension code. So as long as a library — an extension — is thread-safe, insofar as that can be said, you probably don't have too much trouble from the Python code itself, so there isn't much of a problem at the level of memory models. I don't know for sure.

Q: I think it means that Python code will only ever run on one thread, one CPU, at a time. So if you have many CPUs, only one of them is ever running Python. The global interpreter lock is basically a way of keeping the interpreter itself from breaking.

A: I think extension code can release the lock and run on separate threads, right?

Q: Yes. So if one thread is running Python code in the interpreter and another thread comes back from running an extension, it will block on the interpreter lock. But if you have functions in extensions that do long-running pieces of work, then you may be able to make use of multiple processors.

Q: At work we have a rather large Python testing framework; there we use multiple processes.

A: More questions or discussion points?

Q: I can give another perspective.
At a previous job we were implementing a purely functional programming language with multi-threading support — purely functional meaning it is impossible to change a value. Which means that, basically, locking is not necessary: you do not need to lock data, because it never changes. You just create more data — and the old data has to go away eventually, which is the problem you do have to deal with. Most functional programming languages have a garbage collector, so that's where the difficulty moves. Our garbage collector was thread-safe; I don't know exactly how it was implemented — it became a PhD thesis. Not mine.

Q: On the same multi-threaded Perl program I mentioned earlier: I was using a Perl library that didn't document whether it was thread-safe, and it turned out not to be. The easiest fix was to change my implementation to have multiple threads controlling multiple processes — each process has one thread doing the IO handling for that one process. I ended up working out how to do, basically, a subroutine call into another thread. That way one thread handles all the IO to a given file, so there are no issues with multiple threads doing IO to the same file at the same time. If anyone wants to see how I did that, I can show them offline some time. This is for the bug tracking system, by the way — the spam scanning. Basically, one SpamAssassin process was not able to keep up with the amount of email the bug tracking system gets, so I had to rewrite it to run more than one SpamAssassin. It's mainly network delay.

A: Right, so you have threads blocking all the time. That's unpleasant. Is there any way you could turn it into an event loop, so that processing continues when a response comes back?

Q: Well, like I said, SpamAssassin is basically a Perl library.
There are programs that call into parts of it, but the back end is actually a Perl library, and we'd written this program in Perl around it rather than rewriting 90% of it. Each thread blocks on the IO to the process running a single SpamAssassin; as a message gets processed, that unblocks the thread, which then handles the result. Of course, I have a lot of locks and so forth to make sure the threads don't stomp on each other. I run up to 20 SpamAssassin processes at a time.

A: What hardware do you run this on?

Q: A dual hyper-threaded Xeon system — so two or four processors, depending on how you count. We've got a lot of mail arriving, and much of the time goes on the IO.

Host: Is that it? Ladies and gentlemen, thank you very much. I'm not quite sure what the next talk is, but it's probably tomorrow. Thank you very much. Bye-bye.