In this session, we'll be talking a little bit about both Linux and Android. We'll look at a couple of places where Android is missing some very key support for multicore, which is an interesting problem given how many phones and tablets these days are switching to multicore processors running Android. When you get into the actual Android framework, you realize that it really doesn't support multicore very well. The underlying mechanisms inside of Android, the Native Development Kit (NDK), and the Bionic libraries themselves do support multicore, but unfortunately that support has never been moved up into the Android framework itself. So from an Android perspective, if you go to create a thread, you have no control over where it runs. It just runs wherever it happens to run. And as you'll see in this session, that's not what you want. You really want to be able to reach out and control a lot of these things, especially if you're trying to move legacy code.

So let's go ahead and get started. My name's Mike Anderson. I'm Chief Scientist for The PTR Group. I've been in the embedded systems business a long time; I got my first job writing code on an 8080 back in 1977. I've been working with embedded Linux since about 2000, and with Android since about 2008. So I've had the opportunity to see a lot of equipment, a lot of components, things that work, and things that don't. And one of the key problems we're running into now is that today it's very difficult to actually buy single-core processors. Most of the manufacturers have realized that if they want to move up in the value chain and sell more product, the way they do that is to sell you on the concept of multicore.

So we'll talk a little bit about the motivations for multicore migration, and we'll compare the Linux and Android threading models to the threading models in other operating systems. This is really where we start to see the key differences between the way Linux works and the way operating systems like VxWorks or Windows behave, and it's all about how the scheduler schedules threads onto CPU cores. We'll talk a little bit about logical versus temporal correctness, and then about how we have to rethink our architecture in order to address the issues associated with migrating to a multicore platform.

Now, there are a lot of things we could talk about, but unfortunately we only have 50 minutes. So we're not going to be talking about instruction-level parallelism. We're not going to do anything with SSE, SSE4, NEON, or any of those really cool things. We're not going to talk about OpenMP, or superscalar out-of-order execution. We won't talk about SMT, simultaneous multithreading, the hyper-threaded cores. And we're not going to get into much detail on any of the SIMD instruction sets. Each of those is a very worthy topic that could easily take an hour by itself; unfortunately, we just don't have the time. We're going to focus on one thing primarily, and that is what happens to your code when you allow threads to start migrating around on the CPU cores.

So we're pretty much all aware at this point of the motivations for multicore. Lower thermal envelope, for one.
That's a significant one. If we can cut the voltage to the CPU, say from 1.5 volts down to 1 volt, then we can significantly reduce the power consumption of the platform, and that also changes its thermal envelope. So we get lower thermal envelopes, lower power consumption, and, in theory, the ability to scale our code across multiple execution units.

However, there are several gotchas in all of that. One is that each core is now clocked slower. For instance, the first Intel dual-core platforms: where you had a 2.6 gigahertz Pentium, the Pentium D, the dual-core version, dropped back to around 1.8 gigahertz. The reason was that you now have basically twice the number of transistors in the same physical footprint, and you just can't run them at the same speed; otherwise you generate way too much heat, and the silicon becomes a molten slag pile. And that's not what you want. So they would typically clock the cores slower. We'll also see a problem with cache misses and process migration, and that's going to be the primary focus of what we talk about here. I have a couple of demonstrations that actually show the impact of these concepts: what happens when you allow things to migrate around, and what happens when you allow cache misses to occur.

Now, a lot of today's code, especially code coming from the world of real-time operating systems, is single-threaded. Or, if it does run multiple threads, it's really using the idea of threads at different priorities to ensure that one thread executes to completion before another thread gets a chance to run. A single-threaded application, one with only one main thread, really can't take advantage of additional cores. As a matter of fact, we saw this in the Windows world when the dual-core Pentiums were first introduced. People were running games, and they said, oh, cool, I'll buy this dual-core laptop and my games will be twice as fast. What they found was that their games ran a lot slower on the dual-core because of the lower clock rate. So there was a lot of disappointment, and Intel had a significant problem on their hands, because they couldn't convince people that buying a multicore processor bought them anything: the application was slower. That was because most of those applications were single-threaded.

Now, Intel did one thing to try and combat that: they added a feature called Turbo Boost. What Turbo Boost does is look at the thermal envelope of the part and say, if I've got four cores and a good thermal envelope, then I'm going to idle a couple of the cores and boost the clock speed of the other two. So with Turbo Boost we see them marketing, well, it's really a 2 gigahertz processor, but it could be as much as a 3.6 gigahertz processor. Well, that's only if it's able to idle the other cores; otherwise, the thermal envelope is such that you can't do that. And not all operating systems support it. So it really is an open question whether or not you're going to get better performance on a multicore. Now, if we talk about multi-threaded code, it's got multiple simultaneous execution paths.
Now, what this means is that in an operating system like VxWorks, for instance, we would typically have multiple tasks, each with its own unique priority assigned to it. And the way the scheduler works in VxWorks, highest priority always wins. The highest-priority task runs to completion before the next-highest-priority task gets a chance to execute. You can do a lot to protect global variables that way, because there's no contention if the task that accesses the global variable has the highest priority. Unfortunately, that doesn't work in a multicore system.

Now, when we talk about algorithms, about how we write code, we have to think in terms of the scalability of the code. There's a nice little formula for this: Amdahl's law. It says that the actual scalability of the algorithm is the ratio between how long it takes to run on a single processor and how long it takes to run on a multiprocessor. In a perfect world, you'd have 100% scalability, which means if I go from one processor to two processors, it's twice as fast; if I go to four processors, it's four times as fast. Unfortunately, that never happens. And the reason is bus contention, and resource contention around semaphores and the other synchronization primitives that always get in the way. Bus contention is a tough one; you really can't do much about that. The synchronization primitives, those you can tweak, and we'll talk about how you do that in this session.

So if we take a look at the threading model used by Linux and Android, there is, of course, the one Linux kernel, and the Linux kernel switched to a new scheduler with 2.6.23. Prior to 2.6.23, it had what was called the O(1) scheduler, which guaranteed constant-time dispatch regardless of the number of threads you had. Unfortunately, there was a teeny, tiny little hole in that one: it was based on how much CPU time you had used, and whether you had exhausted your entire time slice. Things that exhausted their entire time slice were marked as CPU hogs, and things that gave up the CPU before the time slice expired were marked as I/O-bound. So CPU-bound jobs got smaller and smaller time quanta, and I/O-bound jobs got bigger and bigger time quanta, up to the point where you could actually get about 800 milliseconds allocated to your thread as long as you always gave up the CPU before your time slice expired. Well, 800 milliseconds is a long time. So they found that, hey, this really wasn't all that fair, and they replaced it in 2.6.23 with the Completely Fair Scheduler.

Now, the Completely Fair Scheduler uses a red-black tree to do its scheduling. A red-black tree is a self-balancing binary tree, so the depth of the tree never grows beyond O(log n) in the number of elements inside it. That's a good thing. It gives us constant dispatch time: we always know exactly which thread is going to get the CPU next, because it's always the leftmost node of the red-black tree. But it's O(log n) to insert things into the tree, so insertion takes longer than removal does. Fortunately, we're removing things from the tree a lot more often than we're inserting them, so it more or less balances out. Now, that's for things that are executing at priority zero.
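One quick look back at that scalability formula, since it bounds everything else in this talk. What's on the slide is the standard statement of Amdahl's law; in symbols (my notation, not the slide's, with p the fraction of the program that can run in parallel and n the number of cores):

    \[
      S(n) \;=\; \frac{T_1}{T_n} \;=\; \frac{1}{(1 - p) + p/n}
    \]

Plug in p = 0.9 and n = 4 and you get S = 1/(0.1 + 0.225), which is about 3.1, not 4. That residual serial fraction is exactly the bus contention and lock contention just described.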
So that's the Completely Fair Scheduler handling everything at priority zero. What happens when you execute at priorities above priority zero? Well, first of all, only root can get to those priorities. Those are the real-time policies, SCHED_FIFO and SCHED_RR; SCHED_OTHER, also called SCHED_NORMAL, is the stuff that runs at priority zero. And what runs at priority zero? Well, everything. Compilers, editors, everything runs at priority zero. Anything that you purposefully schedule at a priority above zero, you have to be root to do, and you have to change the scheduling policy when you create the thread.

Now, in Linux, and in Android as it turns out, the scheduler is a one-to-one scheduler. There are a couple of different models for thread scheduling. One of them is the M:N model, which says: when I schedule, I schedule processes instead of threads. That's the model that was used by Windows XP. You schedule a process, like Microsoft Word, and then once you're in the process, the threads duke it out amongst themselves to figure out who gets the CPU time slice. That's not the way Linux does it. Linux uses a one-to-one scheduling model where each thread is independently schedulable. And what is a process? A process is a thread called main. As far as the scheduler is concerned, every process is just another task_struct, and if that process happens to have multiple threads inside it, then we have multiple task_structs that are all considered equally when the scheduler runs.

So every time the scheduler runs on a CPU core, it picks the highest-priority thread ready to run at that time on that core, and that thread wins. Like the RTOSes, Linux really uses a highest-priority-always-wins model. Whether that's the highest priority off the red-black tree, which happens to be the leftmost node, or the highest of the static priorities from zero to 99, whichever is highest wins on each core. There is only one scheduler, and it runs on all the cores. When does the scheduler run? After every interrupt service routine: keyboard, mouse, Ethernet, whatever. And after a lot of functions in user space, and several things that happen in kernel space: mutex operations, reader-writer locks, spin locks. All of these things can cause rescheduling to happen.

When we talk about this one-to-one threading model, we have to realize that each of the threads inside a process shares the same mm_struct pointer, and therefore shares the same address space. As far as the threads are concerned, sibling threads in the same process see the same memory. And therein lies the problem, because now each of these threads can see the exact same data structures at the exact same time, potentially from separate CPU cores. And that's where we get into this concept of a race condition. On a uniprocessor, the scheduler would say: hey, a priority 50 thread outweighs a priority zero thread; priority 50 wins, runs to completion, gives up the CPU, and then priority zero gets a chance to run. Maybe, if priority 50 didn't reschedule itself. On a multicore, however, it is the highest-priority thread ready to run at the time the scheduler executes on each core. So we could have a priority 50 thread running on CPU core zero and a priority zero thread running on CPU core one simultaneously.
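As an aside, here's roughly what purposefully scheduling a thread above priority zero looks like with the Pthreads API. This is a minimal sketch; the worker function and the choice of priority 50 are placeholders of mine, and error handling is omitted:

    #include <pthread.h>
    #include <sched.h>

    static void *worker(void *arg) { /* ... real-time work ... */ return NULL; }

    int spawn_rt_thread(pthread_t *tid)
    {
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = 50 };

        pthread_attr_init(&attr);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);
        /* without this, the thread silently inherits the creator's policy */
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        return pthread_create(tid, &attr, worker, NULL);
    }

Run it as root (or with the appropriate rlimits), or pthread_create() will fail with EPERM.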
And that simultaneous execution is where our problem starts when we're trying to move legacy code that depended on priority-based execution over to a multicore system. Priority-based execution still sort of works on one CPU, but now we've got multiple CPUs, and the priorities get mixed up, because it's the highest priority that's ready to run at the time the scheduler executes on each core. And because these things can now run simultaneously on multiple cores, we get race conditions.

So what exactly is a race condition? Well, a race condition can be a little difficult to understand in some cases. What we have is programmatically correct code: it does the right thing, it just doesn't do it at the right time. We have something that is logically correct but not temporally correct, which means that when we start moving to multicore, we have to think in a different dimension. It's no longer just, does it do the right thing? It's, does it do the right thing at the right time with the right data? So it adds an interesting complication. Race conditions are technically a violation of temporal correctness, and the result is often referred to as livelock. It's not a deadlock; the threads keep running. That's the problem. If it were a deadlock, you'd know it. With a livelock, the threads keep running, generating incorrect answers, and you don't know it until the robot spins around and spot-welds somebody behind it, something weird like that.

All right, so where's the contention? What's the problem? What are we trying to solve here? Well, in most cases, race conditions come about because of contention over shared data structures. This is particularly troublesome inside a process, because all the sibling threads share the same address space. They can pass data back and forth between themselves simply by passing addresses: pass me a pointer, I've got it, I can access this data. Now, that's not the case between processes. Between processes, the MMU sets up a barrier: because of the separate local address spaces, one application can't point into another process's address space and muck around with its data. It's not that it isn't possible — you can, in fact, punch holes between the address spaces — but you have to go out of your way to do it. It's not something that happens automatically.

Now, this contention over resources doesn't generally happen on a uniprocessor, because we usually have a piece of code that runs to completion before a second piece of code gets to run, and therefore there's no contention. There's only contention if two pieces of code try to access the same thing at the same time. What this implies is that there's a critical region of code, some sequence of events that has to be mutually excluded from other pieces of code running at the same time. It also means that we're going to have some serialization around certain places in memory, and these places are referred to as memory hotspots. The more threads we have trying to touch the same locations in memory, the more likely we are to have cache contention and CPU contention, and therefore to end up with a race condition around that shared global piece of data.

So how do we go about detecting a race condition? Well, you could try static detection, and there certainly are some tools out there that attempt it: Klocwork Insight, Coverity Prevent, and the like.
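Before we get into how those tools work, here's the shared-global problem boiled down to its essence. This is my own minimal example, not code from the talk: two sibling threads hammering one global counter with no exclusion.

    #include <pthread.h>
    #include <stdio.h>

    static long counter;                    /* shared by both sibling threads */

    static void *bump(void *arg)
    {
        for (int i = 0; i < 1000000; i++)
            counter++;                      /* load/add/store: not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, bump, NULL);
        pthread_create(&b, NULL, bump, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* logically 2000000; on a multicore box, almost always less */
        printf("counter = %ld\n", counter);
        return 0;
    }

Every line is logically correct on its own; it's the interleaving in time that produces the wrong answer.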
These are commercial products that try to detect race conditions statically by looking at your code. The problem is that static detection of a race condition is actually an NP-hard problem. And for those of you who are math-challenged, NP-hard problems are the same sorts of problems they build encryption algorithms on: they're not meant to be broken within, oh, three or four hundred years. So if you have an NP-hard problem, you basically have something that can't be solved easily. And that's really what you have if you try to do static detection of a race condition, because a race condition is all about time. Without the time domain, we can't adequately represent the problem.

So there are heuristic detection techniques. These are tools that kind of look at the code and say, you know, there could be a problem here, I'll go ahead and flag it. That's really what most of these commercial tools do. They have rules of thumb, and they say: if I have two threads accessing the same data at the same time, that could be a race condition, so I'll flag it. Now, is it actually a race condition? Well, we only know that at runtime. So the problem with race conditions is that we can technically only detect them after they've occurred, and unfortunately most of our systems don't have the ability to run backwards in time to undo one. Although there are some systems I've seen that can do that, which is really kind of cool — but that's neither here nor there from a Linux perspective at this point. So dynamic detection is really the only way to detect that a race condition has occurred. But that means the problem has already happened, and you can't fix it after the fact.

So one of the key things about race conditions is understanding what causes them, and avoiding them in the first place. When people ask me, how do you do debugging? — the answer is really simple: don't put the bugs in in the first place, and you don't have to get them out later. That's the situation we have with race conditions. If you understand the things that cause race conditions, then you can take steps to keep them from ever happening. And the first thing you do is look for the use of global data. Is there a piece of data, or a data structure, that's being shared between multiple threads at the same time? If there is, that's one of the things we want to avoid.

So how do you do that? Well, one approach is to simply eliminate all global data. That's easier said than done. We could use the stack: stacks are unique on a per-thread basis in Linux, so if we pass things on the stack, we can have reasonable assurance that we're not going to run into contention over them. That's probably a good thing. Or we could use thread-local storage. In Linux, we can actually allocate thread-local storage and then manipulate the individual variables in a thread-specific way. Unfortunately, all of these approaches require you to rewrite your algorithms in some way, to significantly rethink the way your code operates. And when you're trying to move a million lines of legacy code from a uniprocessor to a multiprocessor, that's not what you want to have to tell your management. Oh, we need to completely redesign for multicore? They don't want to hear that. So what can we do to get past these problems with a minimum amount of headaches?
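One quick illustration of that thread-local storage option before we move on. This is a minimal sketch using GCC's __thread keyword; pthread_key_create() and friends are the portable spelling, and the variable names here are mine:

    #include <pthread.h>
    #include <stdio.h>

    static __thread int scratch;        /* one private copy per thread */

    static void *worker(void *arg)
    {
        scratch = (int)(long)arg;       /* no lock needed: nothing is shared */
        printf("thread %d sees scratch = %d\n", (int)(long)arg, scratch);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Each thread reads back its own value, because there's no shared location to contend over.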
So, really, contention arises from threads on separate cores. If all the threads are locked to a single core, then we've effectively reduced it to the uniprocessor problem. And by reducing it to the uniprocessor problem, priority preemption works again, and we've eliminated the problem. This is referred to as containment. Unfortunately, when you do containment, it means we've just taken the reason we bought multiple cores — to run multiple threads at the same time — and eliminated the ability to execute on those multiple cores. We've just killed the reason we bought multicore in the first place. So this is a significant problem.

And as the slide notes here, while Linux has support for thread affinity, the ability to control the assignment of threads to cores, that support only partially exists in Android. The scheduler certainly supports it. Bionic libc, as of ICS, supports it. The NDK, as of ICS, supports it. But the Java framework, even as of Jelly Bean, still does not support assigning threads to particular cores, which is referred to as setting a thread's affinity.

Now, for containment to work, we really have to take advantage of priorities. And these are the problems with containment. First, you've just reduced the problem back to a uniprocessor, and since multicore processors tend to run at lower clock speeds than uniprocessors do, you've slowed the application down by moving it to a multicore system. That's, again, not what you want to tell management: you just spent all this money on fancy new hardware, and gee whiz, the application runs slower on the new hardware than it did before.

And the requirements here can be kind of subtle. In the way the Linux scheduler works, even at priority zero we have 40 different sub-priorities. These are called the nice levels. With the nice levels, you're still at static priority zero, but you now have a time slice associated with you. So there is a very subtle window here where one thread executing at priority zero has its time slice expire while it's accessing the data structure, and another thread comes in right behind it and accesses the same data structure. Now, this isn't going to happen very often, but that's the really subtle thing about race conditions: they don't happen very often, but they only have to happen once for the plane to fall out of the sky. So we have to be aware of this potential for the scheduler to do exactly what it's supposed to do — switch a thread out because its time slice has expired, and switch in another thread that happens to access the same data structures. On a fast machine, it might take years before this problem happens. And you may be off at another job by then, and it's somebody else's problem to figure out that there was really a problem, which may not necessarily be a bad thing, depending on how you look at it. You could always be called back as a consultant and make big bucks.
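To make the affinity point concrete before we go on: this is roughly what pinning a thread looks like from C, which is exactly the control that the Android Java framework doesn't expose. A minimal sketch; glibc also offers pthread_setaffinity_np(), and the function name here is mine:

    #define _GNU_SOURCE          /* for CPU_ZERO / CPU_SET */
    #include <sched.h>

    /* Pin the calling thread to one core -- the "containment" approach. */
    int contain_on_cpu(int cpu)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        /* pid 0 means "the calling thread" to the kernel */
        return sched_setaffinity(0, sizeof(mask), &mask);
    }

Call contain_on_cpu(0) from every thread, and you've traded your multicore back for a uniprocessor, with all the trade-offs just described.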
All right, so what about encapsulation? We could create a class that encapsulates access to the data structure, but unless the operating system for some reason automatically wraps the execution of classes in some sort of synchronization primitive, we're still basically in the same boat. I've seen a lot of folks who think: well, if accessing this data structure is a problem, I'll put it in a class, I'll write a method, and the method will access the data structure. Unless that method actually does mutual exclusion, you haven't changed the problem. It's the same problem. You've simply put it inside a class wrapper, and it hasn't solved anything — but it certainly is much more object-oriented in the way that it fails.

All right, so what's the easy way to solve this problem? Well, the most common technique is to use mutual exclusion semaphores. Mutual exclusion semaphores have the advantage that, in the way Linux implements them, they execute as futexes, fast user-space mutexes. And the futex is adaptive in nature. One of the cool things about the futex is that it doesn't try to take the mutex and then immediately go to sleep if it's not available. What it does is actually spin for a few cycles to see whether the mutex becomes available soon. The idea, in all of these cases where we're dealing with a critical region of code, is to keep the critical region short: lock, access, unlock, as quickly as you can. Same thing Steven Rostedt was talking about earlier with spin locks: go in, lock it, access it, unlock it, as quickly as you can. So with mutual exclusion semaphores, because they don't immediately sleep, if I have a thread that's just about to give up the mutex when another thread tries to take it, the other thread doesn't go to sleep. That's a good thing. Futexes also have support for dealing with priority inversion, and they have the concept of ownership.

There are two basic types of semaphores that we run into in user space. One is the binary semaphore, which is used for synchronization. The other is the mutual exclusion semaphore, which is used for mutual exclusion — go figure. Now, that concept of ownership is a really important thing, and we'll come back to it in just a minute.

For those of you who may not be familiar with the priority inversion problem: we have three threads here, low, medium, and high priority. The low-priority thread takes the mutex and continues execution. It then gets preempted by a medium-priority thread. Well, that's exactly what's supposed to happen; everything is operating exactly the way it should. The medium-priority thread then gets preempted by a high-priority thread. That high-priority thread does exactly what it's supposed to do, but it runs along and tries to take the exact same semaphore the low-priority thread already owns. The semaphore is unavailable, which causes the high-priority thread to block. But now, because the medium-priority thread is still sitting out there ready to run, the medium-priority thread gets the CPU back and continues execution — while the high-priority thread waits on a lock held by a low-priority thread that can't run. If we don't know how long that medium-priority thread is going to run, this is called unbounded priority inversion. And this is a significant problem. It is not a bug in the operating system. It is an error in your understanding of how things interleave in time. Unfortunately, it happens so often that they've come up with a way to fix the problem in the operating system, and that's called the priority inheritance semaphore. So with the priority inheritance semaphore, which is what the futex, the fast user-space mutex, gives you, we have the same sequence of events: we take the semaphore here, we get preempted. This is all fine and good.
We run the high-priority thread, and it tries to take the semaphore. But as soon as it sees the semaphore is taken, the kernel looks at who owns it and says: oh, wait a minute, I've got a priority inversion problem here. So I will raise the low-priority thread up to the priority of the highest-priority thread waiting for the semaphore. It executes at that priority until it gives the semaphore back, at which point it drops back to its original priority. The high-priority thread then runs to completion and gives its semaphore back, and then the medium-priority thread gets its chance to run. And this is a great solution. It actually solves a lot of problems. And it's relatively new in Linux; it's only been around for a few years. The lack of it was certainly one of the major reasons why a lot of embedded systems people refused to use Linux for a long, long time. Now that the support is there, it solves a lot of problems.

By the way, one of the best-known instances of this happening was on Mars. It turned out that there was a problem contending over some flash memory space on one of the rovers, and it was causing the rover to go offline for long periods of time. So they fixed it by uploading — it wasn't Linux, by the way, this was VxWorks — by doing an interplanetary FTP upload of a patch. And then they found the highest presidentially appointed NASA official at JPL to push the reset button, because they figured if the president had appointed him, they couldn't fire him. And unfortunately, the real big problem is that the round trip to Mars and back is something like 20 minutes. So you press the reset button and you don't know: did you just brick it, or is the thing going to come back? And it actually didn't come back after 20 minutes; they had neglected to take into account some additional initialization time. Well, in any case, it did work in the end.

So what are the characteristics of mutexes? On the plus side, when you try to take a mutex and it's not available, you block, which causes you to give up the CPU and give somebody else a chance to run. Giving somebody else a chance to run is a good thing. It means we're actually doing the time-sharing we're supposed to be doing and keeping the CPUs busy. Unfortunately, giving somebody else a chance to run means a context switch. And that context switch, if we happen to be running low on translation lookaside buffers (TLBs), which we almost always are, means we're going to have to flush the caches. And when we flush the caches, we take a huge performance hit. Realize that a cache miss in the L2 cache on a typical processor — going from L2 cache to physical RAM — is going to cost you about 200 clock cycles. And that's 200 clock cycles you will never get back. So every time you have one of those cache misses, you're going to take a significant hit. And therein lies an interesting problem for multicore called false sharing, which we'll talk about a little bit later.
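Backing up to priority inheritance for a second: with the Pthreads API you can ask for it explicitly when you initialize the mutex. A minimal sketch, with error handling omitted:

    #include <pthread.h>

    static pthread_mutex_t lock;

    int init_pi_mutex(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* boost the owner's priority while higher-priority waiters exist */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        return pthread_mutex_init(&lock, &attr);
    }

On Linux, this rides on the kernel's PI-futex support just described.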
Now, the Pthreads API doesn't just support mutex semaphores; it also supports this thing called a spin lock. You don't see spin locks used very often in user space, thank goodness, because what they do is burn CPU cycles like crazy. But they do exist, and you could use them if you had a priori knowledge that the resource you were contending over was going to be available soon. How soon is soon? That depends. So spin locks might produce better performance, but they might not, because remember, you're burning CPU cycles, and those CPU cycles could just as easily be used someplace else. The overall amount of work that gets done when you use spin locks could be significantly less than if you had used mutual exclusion semaphores.

Another technique, of course, is to use message queues. With a message queue, we basically decouple the source and the sink of the data. The source writes into the message queue, the message queue has an N-deep buffer, and the sink gets to consume at its own rate. So what message queues do is introduce asynchronicity into the system: I can produce data at one rate and consume data at a different rate, and we rely on the message queue to keep the buffers straight and make sure everything is delivered in first-in, first-out order. Unfortunately, message queues require two copies — one into the queue, one out of the queue. We can mitigate that if we use a shared memory segment and write into the message queue just the address of the message we're trying to send, and its size. That works even when we have variable-length messages. If we do that, we can use message queues to decouple sources and sinks, and we can actually get quite a bit of execution performance out of it, simply by decoupling and making things more asynchronous.

The thing I mentioned a little bit earlier, and the one thing you have to be careful of, is the use of binary semaphores. Binary semaphores look an awful lot like mutual exclusion semaphores. However, binary semaphores do not have a concept of ownership, and because they don't have a concept of ownership, recursive acquisition of a binary semaphore is a deadlock. If I'm in function A and I take the semaphore, and then I call library B, which also tries to take the same semaphore, that causes a deadlock with a binary semaphore. It does not cause a deadlock with a mutual exclusion semaphore. So you have to know exactly which semaphore you're using and why you're using it. If you're trying to keep contention away, that's mutual exclusion: use a mutex semaphore. If you're trying to say, wait here until this event happens, that's synchronization: that's where binary semaphores are used. Different programming patterns for different types of semaphores.
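Here's that ownership difference in code. A sketch with my own function names; note that on Linux you have to ask for recursive behavior explicitly, because a default pthread mutex would deadlock here just like the binary semaphore does:

    #include <pthread.h>
    #include <semaphore.h>

    static pthread_mutex_t lock;
    static sem_t bsem;                  /* binary semaphore: no owner */

    static void library_b(void)
    {
        pthread_mutex_lock(&lock);      /* same owner relocking: fine */
        /* ... touch the shared data ... */
        pthread_mutex_unlock(&lock);
        /* sem_wait(&bsem) here would deadlock: the semaphore has no
           idea that the caller already "owns" it */
    }

    static void function_a(void)
    {
        pthread_mutex_lock(&lock);
        library_b();                    /* recursive path into the lock */
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&lock, &attr);
        sem_init(&bsem, 0, 1);
        function_a();
        return 0;
    }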
So what are our guidelines here? When developing an application, the first thing you want to do is identify all the activities that can run in parallel — things that don't depend on each other — and identify the data flow through the application. I have a piece of data; it starts in this interrupt service routine, moves to this device driver, comes into my application; my application has a processing pipeline: this thing has to touch it, then this thing, then this thing, then I light an LED. Whatever that processing flow may be, identify it, and identify which activities have to share the same data. This is a concept known as data flow programming, where we look at the data flows and try to understand what the transformations on the data are and how they're related to each other in time. We identify the correct sequencing of activities — I have to do A, then B, then C; I can't change the order, or I introduce a potential race condition. Then I identify the relative importance of the various tasks and assign priorities accordingly.

Now, we might have to go in later and adjust those priorities based on what we see in execution. But don't assume that priorities will preclude race conditions, because they won't. Remember, because of the way the scheduler works, priority zero could be running at the exact same time that priority 50 is running on a different core. So priorities do not guarantee that I'm going to run to completion. When you're designing your threads, the basic rule is: keep them as separate as possible. Don't share data unless you absolutely have to, and then only share it in a very regulated way, so that you know exactly what the interaction is.

Also, try to keep data used by different threads on separate cache lines. Now, this is a really subtle problem. If I happen to have an array where x[0] is being accessed by core zero and x[1] is being accessed by core one, and those two variables fall on the same cache line, then every access from either CPU — even though the accesses don't overlap — forces cache flushes, an effect called false sharing. And false sharing is a really subtle problem. It causes a ping-ponging effect between the processor caches, and it can cost you as much as 600 or 700% in performance, just because you chose poorly in the layout of your data.

So the effective use of multicore processors does require some thought on your part. You can't simply take a piece of code that's been running fine on a uniprocessor, drop it on a multicore processor, and expect it to be faster. In fact, it's almost guaranteed that it won't be. And the chances are very high that if you then add multiple threads to take advantage of the multiple cores, and you don't pay attention to what you're doing, you will introduce a race condition. We can fix this using mutual exclusion semaphores, and we can use processor affinity to optimize the code.

So let me show you a couple of examples here. Oops, that's the wrong one. There we go. All right, so first of all, I have a piece of code here called mutex. This code has two threads, a looper thread and a conflict thread. The conflict thread — oh, sorry, it's down here a little bit further — looks at two locations in memory, value1 and value2. The main thread simply updates value1 and then updates value2. The conflict thread here says: hey, if these two things are not equal to each other, then I have a problem — because at this point in time, they must be equal to each other. They may not be equal at any other point in time, but as long as they're equal right now, we're good. And what does the other piece of code look like? Very simple: value1 = counter; value2 = counter; counter++. Now, when we're trying to figure out what the critical region of code is here, what do we need to protect? Do we need to worry about the counter increment? Knowing what we know about the conflict thread, do we even care about counter? No. The only things we're interested in are value1 and value2. So what we're saying is that the only way this code can fail is if somehow the code gets preempted right between those two assignments. Now, on the x86, this is about six assembly language instructions.
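Here's roughly what that pair of threads looks like. This is my reconstruction from the description, not the actual demo source:

    #include <stdio.h>

    static volatile long value1, value2;    /* the two watched locations */

    static void *looper(void *arg)
    {
        long counter = 0;
        for (;;) {
            value1 = counter;               /* a preemption landing between */
            value2 = counter;               /* these two stores is the race */
            counter++;
        }
    }

    static void *conflict(void *arg)
    {
        for (;;) {
            long a = value1, b = value2;
            if (a != b)                     /* right now they must be equal */
                fprintf(stderr, "mismatch: %ld != %ld\n", a, b);
        }
    }

The window is just those two stores — about six instructions' worth of exposure.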
What are the odds that you would be able to preempt precisely between six assembly language instructions? I don't know — maybe not very good. But maybe it's a lot more often than we think. Every single one of those messages is an error: hey, these values aren't the same. And the thing is, look what happens. Let me just Ctrl-C this. Look at what happens: I can go for 500,000 iterations and the problem doesn't happen. If I'm testing in my lab, well gosh, I could run a five- or six-hour test and it would never show up. It only happens if the preemption lands in just the right place at just the right time. Yeah — oh, no, no, no, it doesn't matter. It does not matter, because it's only at the moment of the comparison, if value1 equals value2, that it matters. So, given the printf here, we may in fact see multiple differences in this case. It can absolutely confuse things, absolutely.

So one of the things we can do with this piece of code — it does all kinds of cool things — is enable containment and run again. Here I've got both the looper thread and the conflict thread locked to core zero. And this appears to work. The problem is, there's still a subtle issue here: if the looper thread happens to exhaust its CPU time slice exactly within those six instructions, and the conflict thread comes in before those six instructions have finished, we still have a race condition. This doesn't look like it has a problem, but in fact it still does. Ah — there we go, we got one. All right, now, this machine is a quad-core with two hyper-threads per core. That was really fortunate; I've seen this run for hours before one of those popped out, so it was really cool that it happened so quickly.

Now, on the other hand, we could say: well, let's do priorities, and see if we can solve the problem using SCHED_FIFO priorities. No — sorry, simply setting the priorities doesn't solve the problem. Containment didn't solve the problem. What if we set affinities and lock the threads to different CPU cores? Ah, no, sorry, still doesn't solve the problem. So the only way to solve the problem in this particular piece of code is to turn on the semaphores. Once we turn on semaphores, the problem never happens; it basically just runs happily. And in this case, we're not locking the threads to any particular processor cores; we're allowing them to migrate around. So if you look at my CPU meter — oh, I just happened to position it over something bright, didn't I? Here, let's switch. It's hard to see, but the threads are migrating around: there they were on that core, now they've just popped over to core one, they were down here on core six — they're moving around all the time. That's something that's kind of endemic to this kind of problem.

Now, I've just about run out of time here, but let me show you one other thing for those of you who are curious. I'll show you what happens when you have false sharing across caches. We're going to run a program here that has two threads accessing different locations in memory on the same cache lines, from different cores. So — excuse me — how did I get them on the same cache line? Ah, that is the key question now, isn't it?
And if you send me an email, I'll be happy to send you the code, and you can see how it's done. So we're starting here with two accesses — this one happens to be at offsets zero and one, there's one and two — and you see the elapsed time is about 12 seconds. These are two separate threads running on the same core, accessing the same cache, and it's taking about 12 seconds to run through several hundred thousand iterations. Now, this offset steps up — it goes up to 1,024, something like that — and as soon as it skips through all of this and actually gets to running on multiple cores, we'll see the effect of the cache lines. Running on multiple cores, instead of taking 12 seconds, it takes six seconds once we cross the cache-line boundary. Crossing the cache-line boundary means one core is now executing on one cache line and the other core on a completely separate cache line, and suddenly you see up to a 600% improvement in performance, just from crossing that boundary.

So — well, we still have a few minutes here. This one's almost finished with the same-core runs. As soon as it drops to multiple cores, the time gets cut in half. I do happen to be running a conservative power governor right now, so I think it's running at about 1,200 megahertz on this processor. But as soon as it crosses this next boundary here at 1,024, we jump to multicore. And once we're on multicore, you'll see a significant difference in the performance, time being what it is. While we're waiting for this to happen — any questions? Anybody still awake? Am I still awake? Oh yes, I'm still awake.

Ah, here we go, we go to multiple cores now. First one — come on. Okay, so this one is slower. But that's because we're now ping-ponging back and forth between the caches. As soon as we get to the cache-line size, you'll suddenly see a huge difference. And the cache-line size on this particular processor happens to be 64 bytes. Now, how would you know that? Well, you look in /proc/cpuinfo, and it reports the cache line size — at least on most processors it does. Otherwise, you have to — shudder — read the data sheets for the processor to figure out what the cache line size is. Now, like I say, as soon as this crosses the 64-byte boundary, you'll see it change.

So, can I answer any questions for anybody? It was all crystal clear? Everybody got that? I'm glad. Yes? ... Oh no, no, of course — it's all open source. Just send me an email: Mike at The PTR Group. Of course, these charts will be posted; they should already be posted, and if they're not, they will be shortly. My email's in the charts. I'll be happy to send you a copy of both the mutex code — and, of course, the magic makefile that builds it — and this false sharing code.

Oh, here we go. So here we see it dropped, simply because now this piece of data is on one cache line and this piece of data is on a different cache line. And depending on what you're doing — because I'm running some other code here, it's not as big a jump as I would have hoped — but still, it's six seconds instead of 14 seconds. That's fairly significant. And it's just a question of understanding how the data has to be aligned, which may mean you have to be wasteful with memory in user space in order to guarantee that the data structures land on different cache lines.
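Concretely, the "wasteful" layout looks something like this. A sketch — the structure and field names are mine, and the 64 comes from the cache-line size reported in /proc/cpuinfo:

    /* Give each thread's hot data its own 64-byte cache line so that two
       cores never false-share a line. We burn some RAM to buy back speed. */
    struct per_thread {
        long counter;                       /* the actively updated field */
        char pad[64 - sizeof(long)];        /* pad out to a full line */
    } __attribute__((aligned(64)));         /* GCC/Clang alignment */

    static struct per_thread stats[2];      /* stats[0] and stats[1] now
                                               live on separate lines */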
And by being a little bit wasteful with that cache-line alignment, you get big performance increases from multicore. So that's all I have. If there are no more questions, I appreciate your time. Thank you very much.