Thank you so much. Good morning. My name is David Liu, and I'm a Python technical consultant engineer for Intel. Today I'll be talking about addressing multi-threading and multi-processing in transparent and Pythonic methods. As a general overview of this talk: I'm going to state what the current state of concurrency and parallelism is in the industry, talk a little bit about nested parallelism and oversubscription and what those problems are, discuss composable methods and thread control, then how some of the packages that address these issues work under the hood, what it means to have a Pythonic style, and the future of Pythonic style for parallelism. One of the things I like to point out is that the Python language itself has had a lot of luck in attracting good talent, many of the best people for addressing concurrency and multi-processing, and you can see that listed here. We're one of the few languages that encompass all of these frameworks. Over the years, the progression of frameworks has included general threading, multi-processing, and task-parallel workflows, and the large number of packages now in this space helps fill out the Python ecosystem, such that we have a lot of options when we choose to go for parallelism. We're one of the few languages to have that. From 2008 to 2017, you can see just how many packages we have, and a few of them have actually been talked about at this conference. So again, the options in this space are very good compared to other ecosystems, and the majority of them do a very good job of playing nicely with the global interpreter lock. If you were expecting this talk to get rid of the global interpreter lock, this is probably not the talk for you.
But we do a very good job by using distributed or vectorization techniques, or by working nicely with the GIL, and that's one of the biggest benefits of the Python ecosystem and the packages included in it. For the more domain-specific areas, one can rely on high-performance C libraries to do that type of work for you, to harness parallelism and threading. SciPy and NumPy do a great job of this: when you make a NumPy call, under the hood it's calling a C library, which does the majority of the data-parallel work required to get the job done quickly. With that said, one of the recent trends in the industry is increasing core and thread counts, which are becoming more commonplace in the server space and even in your laptop. Because of that, nested parallelism and oversubscription are now quite possible in the kernels you're running. Some of you may be asking, well, what exactly is that? We'll go into it in a little bit, but let's first talk about the GIL, because this topic gets talked about a lot. The GIL has been complained about by many people in this space, and many efforts have been made to remove it; there have been a few talks at PyCon in the last few years attempting it, and some very valiant efforts too. But as it stands, what the GIL provides us is relatively important, and it's hard to ignore: the read-write safety of Python objects, and predictable behavior. The language really wasn't written to be thread-safe, and the guarantees that you get with types and everything else come from the guarantees that the GIL provides you.
In addition to that, when you're developing your own modules and extensions, that type of expectation on the developer is a very hard one: to say, if I'm developing a framework, I now have to expect that it's not going to be single-threaded, that other people may be accessing my objects. That's extremely hard to test, so passing that burden onto developers is not a great idea either. Again, that's why the GIL provides something that allows you to easily work on and create extensions for Python. And because the GIL provides that safety and we have so many good frameworks, it's kind of a non-issue today. Many frameworks have found a way to cleanly step around the GIL, and SciPy and NumPy are great examples of this. You basically send a command like numpy.dot or something similar; it gets dispatched to the BLAS API, using Intel's Math Kernel Library (MKL) or OpenBLAS depending on your implementation; that gets vectorized and parallelized on the CPU, completely transparently to you. NumPy and SciPy do an amazing job of this, and that's one example of cleanly stepping around the GIL by understanding what the data flow is. There are a lot of other frameworks that utilize this type of vectorization: numexpr and Cython both do this vectorization work for you while allowing you to stay within the Python layer. Multiprocessing frameworks, included in Python's standard library as of Python 3, have great ways of escaping via separate processes: not just stepping around the GIL via vectorization, but escaping into separate processes, which can themselves contain threading. And that's where some of the oversubscription problems can happen. Generally, exiting the GIL with a C library is the most Pythonic-ish way of doing things.
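As a minimal sketch of that dispatch path (the array sizes here are arbitrary, just for illustration), a single NumPy call hands the heavy lifting to the BLAS backend, which releases the GIL and may thread internally:

```python
import numpy as np

# One high-level call from the Python layer; the matrix multiply is
# executed by the BLAS backend (MKL or OpenBLAS, depending on the build),
# which releases the GIL and may use many threads internally.
a = np.random.rand(200, 300)
b = np.random.rand(300, 100)
c = a.dot(b)  # dispatched to the C/BLAS layer, transparent to the caller
```

Nothing about the threading is visible at the call site; that transparency is exactly what the talk means by cleanly stepping around the GIL.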
This has been talked about by a lot of people in the numerical space: if you understand the abstraction of your computational flow, you can write a library that does this type of work, wrap it in Python, and that is essentially the most Pythonic way of operating. This composition of abstracted flows, which you can also achieve by splitting off into multiple processes, can be a cleaner way of escaping the GIL. It's very rare to absolutely need the language itself to be thread-safe; there are very few instances where we would ever really need that, and I think the loss of Python's advantages would probably be the main detractor if we started doing that. So let's break the space up into three main areas. With this Venn diagram of application-level parallelism, single-threaded concurrency, and data-parallelism focus, we can split up the majority of the frameworks in this space, categorize them, and see which areas overlap. You can see the area that has been talked about a lot, with Trio and Tornado or Celery, or anything that lies within concurrent.futures. That's another big thing: when Python came down and said concurrent.futures is the API we want to support, that was huge, because a lot of these frameworks are now designing towards it. You can see that area is the single-threaded concurrency one, and when most people think they need parallelism, they most likely just need concurrency. When you get to application-level parallelism, you see multiprocessing, joblib, or similar frameworks in that space; Dask also encompasses part of it. When you get to the data-parallelism focus, you see the packages we talked about in the numerical space (NumPy, SciPy, Numba, Cython, numexpr) all sitting in that area, because they understand that data parallelism is the area they want to focus on.
By abstracting that call, you can exit the GIL, do that data-parallel work, and return all of it back to the Python layer. Then, in areas where you need both single-threaded concurrency and data parallelism, you get things like mpi4py, or some really unusual combinations of concurrency and data parallelism that lie in that overlap. And the center area: maybe it's a unicorn, maybe it's mpi4py; obviously that's also a little harder to work with. Hopefully this gives you an understanding of what the different areas encompass. What I want to do with this, though, is focus down on two specific areas today: application-level parallelism and data-parallelism focus. This is where a lot of the final frontier has been sitting. If we expand that into three areas (Python multi-processing, Python multi-threading, and data-parallelism focus), you can now see where some of these frameworks lie. Dask is right in the middle; it's one of those genuinely interesting frameworks. If we were in the US and Matthew Rocklin were here, he'd be very happy, as he's one of the main maintainers of Dask. That being said, now that we understand the space I want to talk about today, the area at the intersection is where nested parallelism and oversubscription can occur. When you start mixing these different libraries, multiprocessing with NumPy, or Numba with other elements composed on top, or multi-threading, this is the area where oversubscription and nested parallelism can occur. So you may ask, what does that actually look like? The answer is that it can look like relatively benign code.
For many of you, this may look like a very simple thing you would run into if you were just developing with NumPy, or just trying to scale a little bit. Here we have a NumPy call with random, we create a multiprocessing ThreadPool, and then we do a pool.map on a NumPy call. Well, what have you actually done here? The problem is you've now created composable nested parallelism without even knowing it, and that's where it can get really scary, because you can have threads being spawned in a nested parallel process, and if you put this on a larger compute system, it can go out of control. You go from pthreads, to Python threads, to the threads inside NumPy, and that nesting can create nearly double, or even a quadratic number of, threads if you're not careful, depending on what's available on the system. You go from a relatively known set of threads (okay, this one calls NumPy, I know what's going on) to, once you call it under multiprocessing, a tangled mess of threads, because now one thread can spawn a bunch of others, relatively uncapped, since no rules are being passed down. So what problems does that give you? You essentially get oversubscription: you have many more threads than are actually mapped to the CPU, with no rules controlling them. With that many threads you get direct OS overhead for switching threads out, the CPU cache becomes cold, you take a performance hit, and you're going to say, well, this actually ran faster on my laptop. How did that happen?
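As a minimal sketch of the kind of innocuous-looking code being described (the pool and matrix sizes here are placeholders, not the ones from the talk):

```python
import numpy as np
from multiprocessing.pool import ThreadPool

# A Python-level thread pool: four worker threads.
pool = ThreadPool(4)
data = [np.random.rand(64, 64) for _ in range(8)]

# pool.map fans the NumPy calls out across the Python threads; each
# np.linalg.qr call may in turn use every core via the BLAS/LAPACK
# backend underneath, so the thread counts multiply: nested parallelism.
results = pool.map(np.linalg.qr, data)
pool.close()
pool.join()
```

Nothing in this code asks for oversubscription; it emerges from the composition of the two threading layers, which is exactly why it is easy to write by accident.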
It's an invisible impact if you're not used to it: threads are waiting on other threads to return, and there are just far too many threads for the actual logical cores. Now, a lot of the popular frameworks that have this problem have solved it in a relatively simple-ish way. It's not the cleanest way: they lock the number of threads to one for that specific process, which is an okay-ish solution, but it doesn't scale well. They'll set OMP_NUM_THREADS, they'll set the block time lower, but it's not always the cleanest approach. scikit-learn definitely has this (if you use grid search you'll see it), and PyTorch and TensorFlow all exhibit this problem, because the type of composable parallelism they use to deliver machine learning or other forms of work runs into exactly this issue. I'll talk more about SMP when we get to it, but SMP is one of the packages that addresses this. So now let's talk about the composability modules that help address this space. One of them is TBB4py; it's included with our Intel Distribution for Python, it's free, and it's a Python C extension for managing nested parallelism using a dynamic task scheduler. If you use the version of scikit-learn in our Intel Distribution for Python, we actually use TBB under the hood for some of those operations. When you look at what it's providing, and this is the focus today, it is a dynamic task scheduler. So if you have dynamically mapped tasks, and some occasionally finish a lot faster than others, it's able to put those threads back into the pool and let you spawn new tasks even if the work is unbalanced.
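The coarse workaround mentioned above usually looks something like this. The caps have to be set before the numerical library is first imported, since most BLAS/OpenMP runtimes read them once at load time (variable names beyond OMP_NUM_THREADS are backend-specific):

```python
import os

# The blunt fix frameworks apply to avoid oversubscription: pin the
# native threading layers to one thread per process. These must be set
# before NumPy (and its BLAS backend) is first imported in this process.
os.environ["OMP_NUM_THREADS"] = "1"   # OpenMP-based backends
os.environ["MKL_NUM_THREADS"] = "1"   # Intel MKL specifically

# Any subsequent `import numpy` in this process will pick these caps up.
```

This works, but as the talk notes it doesn't scale well: every process is capped to one thread even when the machine has cores to spare.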
So it handles unbalanced work relatively well, and it instantiates via monkey patching of Python's pools, enabling the TBB threading layer to be swapped in underneath MKL, so no code changes are required on your part. Another package we use in this space is static multiprocessing, or SMP. It's a pure Python package that manages nested parallelism through coarse-grained static settings. What that means is it tries to augment your parallelism by taking the rules defined by your parallelism settings and environment variables and passing them down to the inherited processes, controlling oversubscription via that method. So it handles workloads that are a little more structured. It also instantiates via monkey patching, and it uses affinity masks and OpenMP settings to statically define and allocate resources to avoid those excessive threads. Now, if we return to the earlier example, you can see that these two packages can address the issue we have here. With nested parallelism, how does that actually work? TBB approaches it like this: you have your application, you have your OpenMP threading, and you have separate but uncorrelated OpenMP parallel regions; what happens is they map too many software threads and compete for all the logical processors. Running under the TBB module essentially says: this is the pool that's defined, and threads can be dynamically allocated and released so that everything operates within that pool.
So it tries to keep threads mapped to the logical processors while keeping a hold on oversubscription, because if one region starts spawning five or ten threads while another spawns one, you can see where the problem occurs; whereas here, if a region wants to spawn, it still pulls from the same pool and is still mapped to an actual logical processor. Now, SMP does it in a completely different way. With the same problem as before, it takes the thread pool implementation and propagates the affinity mask and settings down to each of the individually spawned processes. You're essentially augmenting your MKL or BLAS threading so that the augmented settings are passed down to each of the threads created from those processes. One of the advantages here is that you can actually mix the types of threading; it can handle both kinds of OpenMP threading in this case, which is relatively powerful. So one of the things I'm going to do here is show you a small demo of what it looks like when you start having oversubscription. I'll be running this on one of our relatively large two-socket servers to show you what that looks like, and then show how these frameworks address the problem. Right here, I'll show you what kind of setup this is: with hyper-threading, it has about 88 logical cores; it's a two-socket Xeon machine. This will take a bit of a while, so let's hope the SSH connection holds up today. What this code is actually running, and I'll show it here, is a relatively benign piece of code: we have a for loop, we have a thread pool map from multiprocessing, and we have a NumPy call inside of it.
One thing you can see from this is that it's a relatively small amount of code, code we might have written ourselves, that can actually cause this problem. If I ran this on my laptop, it wouldn't be too bad, but running it on a system with that many cores and threads, it's going to take a while. This example repeats three times and displays the time each repetition took to complete. The first one took 39 seconds, so now I have to burn off another two rounds of 39 seconds while I'm talking, to let this complete. While we're letting that run: essentially, we have our data, which is created by numpy.random; we have our thread pool, created through the multiprocessing pool; we have a loop of three, which runs three times and times each iteration (it's a relatively simple call); and then a QR factorization over the data in that range. Let's see if we've... okay, we've hit the second one now, so we just have to burn off another 39 seconds. Again, this is what we showed on that earlier slide: you're hitting oversubscription, because it says, oh, I have all these threads I can address, and then multiprocessing maps onto those threads, going, hey, look at all the threads I can create, and it just creates as many as it can. That's where you can get in trouble: you've written your application, it works great on your laptop, you scale it to your server, you're on your production machine, and... why is it so much slower? So one of the things you can do now is run it like this, and what TBB is going to do is say: I'm going to set my pool size.
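The demo being described can be sketched roughly like this; the matrix count and sizes here are guesses, scaled down from whatever ran on the 88-logical-core machine, but the shape of the code is the same:

```python
import time
import numpy as np
from multiprocessing.pool import ThreadPool

# A pool of Python worker threads, each calling into NumPy's QR
# factorization; the BLAS/LAPACK layer underneath may thread as well,
# which is what oversubscribes a large machine.
pool = ThreadPool(4)
data = [np.random.rand(100, 100) for _ in range(16)]

timings = []
for _ in range(3):                      # three timed repetitions
    t0 = time.time()
    pool.map(np.linalg.qr, data)
    timings.append(time.time() - t0)

pool.close()
pool.join()
print(timings)
```

The rest of the demo runs this same script unmodified under the composability modules, along the lines of `python -m tbb script.py` and `python -m smp script.py` (invocation details as I recall them from the packages' usage; check each package's help output).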
I think there are some defaults, and you can actually look at what those are by running the TBB module with `--help`; it'll show you the default sizes, and you can set that dynamic pool size. If I start running this, I'm probably not going to have enough time to finish my discussion here before it just decides to clean itself up. And yeah, there you go. When we talk about combating oversubscription, quantifying the problem that it and nested parallelism can cause is very evident now. Something as simple as the demo I just showed you, TBB can just handle, and that's relatively Pythonic: you run your script under TBB with no code changes. I made zero code changes to this thing and it did that. SMP handles it in a slightly different way. If we run it under SMP now, it takes those settings, that augmented style of parallelism, and completes relatively quickly. So here you can see it accomplishes the handling of nested parallelism and oversubscription in a completely different way, but it still addresses it, and in a relatively simple way, by letting you run under SMP without making any code changes. Okay, now that we've seen this demo, I think it's time to step back and talk about the industry again. In Python's ecosystem of concurrency and parallelism, the concurrency and async areas are very rich with packages. There are a lot of packages in that space; we've done a lot of work with concurrent.futures, and it helps solve the needs of the majority of Python users. But when we look at the areas of true parallelism and data parallelism, it's a strong area, but its focus has been relatively small in comparison to the concurrency and async offerings.
That's why, when we look at the packages in this space, there really hasn't been much activity shown, and we're now trying to make headway in what is one of the final frontiers of parallelism in Python. Most ways of achieving parallelism in this area rely on vectorization frameworks, or on multiprocessing or distributed methods. I think that begs the question of how you do it in a semi-Pythonic way. So I'm going to introduce this slightly silly idea of "Pythonic-ish". I'm not saying it's truly Pythonic, because that's a whole different discussion, but let's talk about Pythonic-ish. What makes something Pythonic-ish? Relatively few code changes: you might have a small number of changes, maybe modifying the current behavior of a framework to fit your needs, so as to prevent a massive rewrite. Is it directly in the Python standard library? Is it writable from the Python layer, or do I have to drop into a lower-level language like C to utilize it? Is the interface easy to understand, and does it keep you in the Python layer without dropping to an intermediate representation? I think that poses the question: how close can we get? If we look through this lens at TBB4py, it meets quite a few of these criteria, but two of them aren't met: it's not directly in the Python standard library, and it's not writable from the Python layer. On the other hand, you don't have many code changes, and you're not modifying much of a framework's current behavior to make it work for you. It has a relatively easy interface: you saw that I just ran the script under the TBB module with no code changes, and you can set options with command-line arguments if you need to. And it keeps you in the Python layer and doesn't drop to an intermediate representation.
Looking through the same lens at SMP: it requires relatively few code changes and doesn't modify the current behavior of any framework. One of the interesting parts is that it is somewhat writable from the Python layer, because it does have an API you can use; but you can also use it without that, running it just like I did, under the SMP module, letting it pass the settings down. It's relatively easy to understand and keeps you in the Python layer. The other thing to add here is that SMP is written completely in Python; you can look at it on our GitHub, it's a pure Python package. So from the standpoint of integrating it into a solution or into other people's frameworks, it's relatively simple. It's still not in the standard library, but it's maybe a little closer, accomplishing things in a different way. I think this then poses four final questions. First, how realistic is it to have a firm requirement for a pure Python implementation? TBB4py is not a pure Python implementation, but SMP is; and again, we're talking in the light of addressing nested parallelism and oversubscription. Second, what is the best way to modify your Python code? Is it monkey patching, or a new framework? How do we want to address that space when we want our Python code to operate under augmented threading? Third, at what level should the parallelism be controlled? Should we control it at the module-call level, or when calling from our own source code? Where should we be doing that? And fourth, can an interface be agreed upon to operate on that parallelism? concurrent.futures did that relatively well; can we do the same? So let's answer the first two questions.
Now, with the demos shown for TBB4py and SMP: how realistic is it to have a firm requirement for a pure Python implementation? I would say it's not required, but it's highly recommended. We can see, from the uptake of the packages we've released, that people trend towards the pure Python variants. There are also limits to what you can do from the pure Python layer, but maybe that's something vendors can work out with the PSF and the core developers. What's the best way to modify your Python code: monkey patching, or a new framework? It seems monkey patching is the new normal in this space. We're seeing a lot of examples where monkey patching is becoming the de facto standard for packages that augment other packages' behavior; we see it in scikit-learn, we see it elsewhere. So that seems to be the new normal, and it seems to be okay. Then there's the question of at what level this parallelism should be controlled. The Python layer, maybe? It sort of can be controlled from there, but the challenge you start finding is that it needs directives for how additional layers can compose with it, and some type of composing directive might be useful in that space. Can an interface be agreed upon to operate on that parallelism? I think the jury's still out on that one, because with every iteration of these packages we learn something new: something that works, and something that doesn't. It seems like the Python community is still in that phase.
I urge you, if you're in this space, to continue pushing and seeing what makes sense. We're still very, very young in this space in knowing what is Pythonic, what's the best way to operate on and augment your threading behavior, and how to keep it scaling when you actually deploy to your production cluster or something similar. But with SMP, we do get a slightly clearer picture of what it could look like. So, now that we've talked about all of this, I think you can see that TBB4py and SMP attempt to address the Pythonic-ish criteria I set out and augment the way you do multi-threading and multi-processing, in a way that doesn't require you to modify much of your code. I would still say it's best to leave the two forms, multi-processing and multi-threading, at their existing levels, and not change too much of how we interact with them; at least from the Python and C levels, try to keep them at their respective layers. Multi-threading is domain-specific: as I've discussed, when you do something data-parallel, you typically know the domain you're operating in, and that seems to be the best guide. You have a lot of options for staying in Python, or you can drop down to C if you need it. NumPy has decided it wants to be in C, while other frameworks let you stay within the Python layer without having to build against some C-based library; Numba, numexpr, and Cython are great examples of this. One thought is: what if you had some type of directive to say, at this point I only want, say, 20% of the threads to be spawned during this one section? That might be better. You could express that as an annotation in the code, but doesn't that just sound like `#pragma omp`?
So I think that then poses the question: what is Pythonic at that point? If we're writing things that literally look like C, is that really useful? Are we complicating the language? Maybe that's how we'll achieve composable parallelism when we start combining these layers; I think that's a great open question. Augmenting the threading behavior seems to be the more useful approach based on the experiments we've run, but that also means putting the bulk of the responsibility on the users themselves. If you're a framework designer, how you choose to do your threading is really your choice, and that is a relatively heavy responsibility. Not as heavy as expecting your code to be completely thread-safe at every point, but still a high bar. And threading in general for numerical work has a lot of well-known frameworks. The thing is, if you try to remove the GIL or do anything similar, you're going to lose the ability to use a plain Python object, and then you'll need stricter typing. That poses the question: why are you using Python in that instance at all? So, to summarize and wrap up: the Python ecosystem has a critical mass of good frameworks, which we walked through today, that look to address multi-threading and multi-processing. For those of you working on this, keep pushing and seeing what the limits are. Today's demonstration showed what we're trying to do in our space, and we encourage you to contribute, or to find and propose other ways of doing so. Thank you, and with that, I'm open for Q&A. Thanks very much. Does anyone have any questions? Hey, great talk, thanks. You were asking about a good interface for integrating parallelism into systems that require it in Python.
You're probably aware of joblib. Yes. So how does that interact? Is your stuff running below joblib, or do you directly integrate with it somehow? That's a great question. If we step back a little and look at where joblib sits within this picture: when you have joblib, and then you're calling things from joblib, those are two different layers of parallelism, and that's what I was talking about. Because there's no real communication between what the joblib requirements are and the NumPy call that you put into joblib, that's where the oversubscription can occur. I think joblib does a very good job, no pun intended, of separating those tasks out and making it easy to define and compose the jobs you need to do in a task-parallel format. Its biggest comparison, I think, would be Dask, and both do a very good job in that space, but we still run into the composable parallelism problem: we have something that's clearly application-level parallelism and something that's clearly data or task parallelism, and that link, the way we've defined Python, either needs to be kept separate or we need a way of interlinking it without breaking the APIs as they were defined. I think we lose the abstraction capability if we try to bring that layer down too much. Thanks. Thanks, anyone else? Yeah. You mentioned CPU-bound parallelism, where it's very clear that you can know about the oversubscription, but there's also the async/await pattern, which is basically for I/O-bound processes. Right. In the end, it's very hard to determine the better approach, because you don't know if the CPU is starving while you're doing all the I/O waits. How would you approach that?
So I mean, one of the things I do as a consultant is work with customers that have those styles of problems. And to determine whether we have a CPU-bound or IO-bound problem, and what's actually the issue, we typically use a Python profiler. One of the products we use is Intel VTune Amplifier. So we'll take a look at what's going on from a code-profiling perspective and then look at the behavior of the code in that space. Sometimes it takes a bit of static analysis, or looking at IO saturation with the tools, to be able to determine that. But you're right, it's actually a very, very hard thing to detect. Even with the open source profilers in the Python space, it's very hard to know what's going on if you're given no tools. Thank you. Hello. Good talk. Thanks. Two small questions. Is SMP developed by Intel as well? Yes. And which one was the first to be developed? TBB was the first to be developed. Threading Building Blocks has been around for quite some time; I think it became open source fairly recently, but it's the one with the longer legacy. And then SMP was developed, I think, about a year ago to address this space. Because one of the systems that we used had 63-plus cores per socket, and when you scale out across a lot of them, we started seeing that this problem existed. So that's why we developed it a little later in the game. OK. And what is TBB4py doing in C? It's operating with the TBB runtime, actually. So it's one of the libraries that we ship, and it operates directly with that dynamic library. Basically, if you download any of our packages like NumPy or scikit-learn that utilize it, it'll download the runtime and interact with that library at runtime. Yeah. Thank you very much. Yeah. Thanks. Anybody else with a question?
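Short of a full profiler like VTune, one rough first-pass heuristic for the CPU-bound versus IO-bound question is to compare CPU time against wall-clock time: a CPU-bound section keeps the two close together, while an IO-bound section leaves the CPU mostly idle as the wall clock runs. This sketch (the 0.5 threshold is an arbitrary assumption, and a real diagnosis would still use profiling tools) illustrates the idea with the standard library:

```python
# Heuristic: ratio of CPU time to wall time over a code section.
# Near 1.0 (or above, with threads) suggests CPU-bound work; near
# 0.0 suggests the section spends its time waiting (IO-style).
import time

def classify(fn, *args):
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn(*args)
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    ratio = cpu / wall if wall else 0.0
    return "cpu-bound?" if ratio > 0.5 else "io-bound?"

def busy():      # burns CPU in pure Python
    sum(i * i for i in range(2_000_000))

def waiting():   # sleeps, standing in for an IO wait
    time.sleep(0.5)

print(classify(busy))     # cpu-bound?
print(classify(waiting))  # io-bound?
```

This only classifies a whole section after the fact; it cannot tell you, as the questioner points out, whether the CPU was starving *during* the IO waits, which is why the answer falls back on real profilers and IO-saturation tooling.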
Hi there. Great talk. I'd like to ask: what if I run a few programs under TBB or SMP? What would happen? Will they understand between them that oversubscription could happen? Or should I concentrate everything in one program to avoid problems? That is a great question. The way both of these packages work, and the way most tools that accomplish that type of control have to work, is that you have to start from a single Python process. If you think about how joblib and Dask do it, they go from something that's multi-processing down to something that's threading based upon those processes, but it's all started from the same one. If you start them as different Python processes, not started from the same one, that's problematic, because they don't see each other, so they're going to have different pools. Whereas if you start everything from the same process, it's going to have the same pool, and it'll be able to handle oversubscription better. So it's better to start from a single Python process, if possible. I don't know if I got the difference between TBB and SMP. What are the use cases of one over the other? So one of the things is that TBB handles dynamic types of threading better. Say you have something that usually returns within 10 seconds, but has a chance of returning in a second. With TBB, it'll be able to say, OK, this one ended quickly, we're going to put this one back in the pool and let another come out. SMP handles it a different way, which is to pass down the settings for the number of threads that can be spawned from each process. So say I have OMP_NUM_THREADS equal to 1 or 2. It's going to say, OK, for this process, it's going to be 2; for this one, it's going to be 1. So it's controlling it by passing those settings down.
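The settings-passing idea that SMP automates can be hand-rolled: divide the core budget among worker processes by exporting `OMP_NUM_THREADS` (and friends) in each worker before its numerical library initializes. The variable names are the standard knobs read by OpenMP and MKL; the 4-core budget here is an assumption for the example, not something SMP hard-codes.

```python
# Hand-rolled version of the idea behind SMP: each worker process is
# given an equal share of the core budget via environment variables,
# so the workers together never spawn more threads than cores.
import os
from multiprocessing import Pool

CORES = 4    # assumed machine budget for this sketch
WORKERS = 2

def init_worker():
    per_worker = str(max(1, CORES // WORKERS))
    # Threaded libraries read these at initialization time, so they
    # must be set before the first NumPy/MKL call in the worker.
    os.environ["OMP_NUM_THREADS"] = per_worker
    os.environ["MKL_NUM_THREADS"] = per_worker

def work(_):
    return os.environ.get("OMP_NUM_THREADS")

if __name__ == "__main__":
    with Pool(WORKERS, initializer=init_worker) as pool:
        print(pool.map(work, range(WORKERS)))  # ['2', '2']
```

The limitation the next answer gets at is visible here: the split is static, so it suits evenly structured work, while unbalanced workloads are better served by TBB's dynamic pool.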
And that's how it controls it, by not letting anything go outside those settings. But it's better for structured work. Because if you think about it, if the work is structured and those settings are passed down, the threads stay essentially semi-pinned to the processors rather than jumping between processors all the time, which would give you cache issues. So symmetric, structured work is generally better for SMP, and more dynamic types of parallelism, where tasks have the chance of returning early or being unbalanced, are better for TBB. Thank you. Thanks. Any more? Can we use these packages without the Intel versions of NumPy, scikit-learn, and so on? I mean, you can download the packages themselves. So you can download TBB4py as a standalone and just run it with your own stack. Again, this goes back to the point about being Pythonic-ish, right? You can download both of these packages independently of our distribution on our conda channel; just use the -c intel channel flag if you're using conda to look up these packages. Thank you. Yeah. One more. Hi. Can you say something about platform compatibility? I guess it runs on Linux, but what about BSD, Windows, OpenSolaris, and so on? Great question. TBB runs on all platforms right now: Mac, Windows, and a majority of flavors of Linux as well. SMP right now is Linux-only because of some of the items we're using, but we're looking to see what other options we have in that space. It's just currently only on Linux for the time being. Any other questions? Anyone in the back? No? All right. Well, we'll thank our speaker, David, again. Thanks.