Welcome to the Performance Python talk about numerical algorithms. My name is Yves Hilpisch; I'm founder and managing director of The Python Quants. As the name suggests, we are mainly doing work in the financial industry, so my examples will be primarily financial, but I think they apply to many, many other areas. If you're not from finance, you won't have trouble translating what I present today to your specific domain.

Before I come to the talk, a few words about me and us. As I said, I'm a founder and managing partner of The Python Quants. I'm also a lecturer for mathematical finance at Saarland University, and I'm co-organizer of a couple of conferences and organizer of a couple of meetups, all centered around Python and quant topics. I've written the book Python for Finance, which will come out at O'Reilly this autumn; it's already out as an e-book, and I will show it later on. Another book, Derivatives Analytics with Python, will be published by Wiley Finance, next year probably. Apart from Python and finance, I like to do martial arts.

This is the book, and today's talk is based on chapter 9 of the Python for Finance O'Reilly book. As I said, it's already out as an early-release e-book, and the printed version will probably come out at, let's say, mid November. I'm finished with my editing, so I hope O'Reilly will come up with the printed, final version pretty soon. There's also a course out right now, on Quantshub, which covers the topics that I present today. It's completely online based; maybe you want to have a look. When you come from the finance industry, I think the benefits are the highest.
What we are otherwise doing at the moment is mainly working on what we call the Python Quant Platform. We want to provide a web-based infrastructure for quants working with Python and applying all the nice things that I present today. I will show it quickly later on, maybe with a couple of examples. We have integrated the IPython Notebook there; you have an IPython shell, easy file management, and editing, so everything you want and need in addition. We also provide our proprietary analytics suites, DEXISION and DX Analytics, on this platform.

That's enough about us; now about the talk. What is this talk about, actually? When it comes to performance-critical applications, two things should always be checked from my point of view: are we using the right implementation paradigm (sometimes this boils down to what is typically called idioms), and are we using the right performance libraries? Many of you might have heard the prejudice that Python is slow. Of course Python can be slow, but I think if you're doing it right, Python can be pretty fast. One of the major means in this regard are the performance libraries that are available in the Python world, and I can only briefly touch upon those that are listed here. Yesterday, for example, there was the talk by Stefan about Cython. About any topic that you see here you could have complete talks or even complete tutorials for a day, or even for a week for some, so it's a complex topic. But my talk is more about showing what can be done. The main approach will be to say: this is what we had before, it was a little bit slow; then we applied this and that, and afterwards we see these improvements. We don't go that much behind the scenes, and we don't do any profiling during this talk. But you will see that in many, many cases, when it comes to numerics, Python and these libraries can help in improving performance substantially.

Let me come to the first topic: Python paradigms and performance. As I said, what I call a paradigm here in this context
is usually called an idiom. For example, this is just a helper function that you see here; don't try to get the details. It's a function I will use regularly, and I've provided it in the slides so that you can use it afterwards as well. It's just a function to compare different implementation approaches a little bit more systematically, and to compare performance a little bit more rigorously; there's nothing special about it.

Let me come to the first use case: a mathematical expression that doesn't make too much sense, with a square root, an absolute value, transcendental functions in there, and a couple of other things happening. You might encounter these or similar expressions in many, many areas. As I mentioned before, in finance and mathematical finance you have these; you have them in physics; and, you name it, in almost any science as of today you find such or similar numerical expressions. We can implement this pretty easily as a Python function, as you see here. It's a single equation, and we translate it mainly into a single-line function; nothing special about that.

What we want to do, however, is to apply this particular function to a larger array, a list object in the first case with 500,000 numbers. This is usually where the computational burden comes in: you have a huge data set and you want to apply this expression to the whole of it. It's not that the single equation is complex; in the end it's the mass of the computations that typically makes it slow. So, to have something to work with, we generate the list object simply using a range with 500,000 numbers. What we then do is implement another function which uses our original function f, implementing the numerical expression, with simple looping. This is the first implementation out of six that I present.
So this is a pretty simple, straightforward function with a for loop in there. We have another list object, we calculate the single values and append the results to that list object, and the function then returns our list with the results.

The second paradigm, or idiom if you like, is to use a list comprehension. Actually the same thing is happening as before, but it's a much more compact way to write it. We generate, in a single line of code, the result list by iterating over our list object a and collecting the values that the function f returns. A little bit more compact, maybe better readable; if you're a Python coder you might prefer this one.

We can also do it with eval. This is quite flexible, although I wouldn't suggest doing it in this case; we will see this will be the slowest one. But it is very flexible. We are working, for example, with classes where we value derivatives, and derivatives can have quite complex payoffs; you can describe these in a string format, which makes it pretty flexible to provide different payoffs for these classes. That is one area where we use it, but typically when we use it, the expression has to be evaluated only once. Here, however, you might notice that the expression is evaluated per single iteration of the list comprehension. As we will see, this is a very compute-intensive, or interpreter-intensive, way to do it, evaluating the expression every time we iterate, and it will make it pretty slow.

Of course, if you're working in numerics or science, you will be used to the vectorization approach of NumPy.
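The three pure-Python variants just described (explicit loop, list comprehension, eval) can be sketched as follows; the exact expression on the slides may differ, this version with a square root, an absolute value, and trigonometric functions is an illustrative assumption:

```python
from math import cos, sin

def f(x):
    # example expression: square root (** 0.5), absolute value,
    # and transcendental functions
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)

a = list(range(500000))  # 500,000 numbers to apply f to

def f1(a):
    # paradigm 1: explicit for loop, appending to a result list
    res = []
    for x in a:
        res.append(f(x))
    return res

def f2(a):
    # paradigm 2: list comprehension -- same thing, more compact
    return [f(x) for x in a]

def f3(a):
    # paradigm 3: eval on a string expression -- flexible, but the
    # string is (re-)evaluated on every single iteration, hence slow
    ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
    return [eval(ex) for x in a]
```

All three return the same list of results; only the idiom, and with it the performance, differs.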
What we can do is implement the same thing, this time on a NumPy ndarray object, which is of course especially designed to handle such data sets and data structures. With a single line of code, using vectorized expressions, we can accomplish the same thing. So now we are working on NumPy ndarray objects and using NumPy universal functions to calculate what we are interested in. The syntax is pretty similar to the list comprehension, but in the end we would hope for a speed improvement, because NumPy is especially designed for exactly these kinds of operations.

Then we can also use a dedicated, specific library called numexpr, for numerical expressions. Here we again provide the whole expression as a string object, but in this case the expression is compiled only once and then reused afterwards. Here again we are working on NumPy ndarray objects; numexpr is especially designed to work on them. So in this case we would hopefully also see some kind of improvement, because it's a dedicated, specialized library to attack exactly these kinds of problems. You might have noticed that in the first numexpr example I have set the number of threads to one, to have a benchmark value; we are only using one thread, one core, in that case. Here I increase the number of threads to four. So if you have a four-core machine, you would expect an improvement here, but what kind of improvement? Let us have a look.

In summary, we have six different paradigms or idioms used with Python, and in the end each of them delivers the same result.
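The two ndarray-based variants might look like this in code; numexpr is treated as optional here, since it may not be installed, and the expression itself is the same illustrative assumption as before:

```python
import numpy as np

try:
    import numexpr as ne
    HAVE_NE = True
except ImportError:        # numexpr may not be installed
    HAVE_NE = False

a = np.arange(500000)      # ndarray instead of a list object

def f4(a):
    # NumPy vectorization: universal functions applied
    # to the whole ndarray object at once
    return np.abs(np.cos(a)) ** 0.5 + np.sin(2 + 3 * a)

def f5(a, threads=1):
    # numexpr: the expression string is compiled only once;
    # the number of threads is set explicitly
    ne.set_num_threads(threads)
    return ne.evaluate('abs(cos(a)) ** 0.5 + sin(2 + 3 * a)')

r4 = f4(a)
if HAVE_NE:
    r5 = f5(a, threads=1)  # single-threaded benchmark value
    r6 = f5(a, threads=4)  # multi-threaded, e.g. on a four-core machine
```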
As is often the case, when you see people coming from other languages to Python, new to Python and not knowing all the idioms, they're probably using those that they are used to from other languages, like C or C++, you name it. Sometimes this can be a pitfall, in the sense that they're using the wrong paradigm, the wrong idiom. But let's have a look at what the differences are. Now our comparison function comes into play, and we have a clear winner: the multi-threaded version, f6, the last one, where we're using multiple threads to evaluate the numerical expression on the ndarray object. Then we have the single-threaded numexpr one, which is the second fastest, and the third one is the NumPy version; the Python ones follow after that. Given the list comprehension, for example, we have a 28 times increase in performance using the multi-threaded numexpr version. And f3, as I mentioned before, was the one which uses the built-in eval function of Python; against it you see a speed-up in total of 900 times. These numbers can vary, of course, depending on the number of threads you're using and so forth, but the message, I think, should be clear: we have in Python many, many ways to attack the very same problem, and all the ways will yield the same results, but there might be considerable performance improvements when going the right route, avoiding pitfalls, and especially avoiding implementations that are per se compute-intensive.
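The comparison function itself is nothing special; a minimal version, assuming it simply wraps timeit and reports times relative to the fastest candidate (the name perf_comp_data follows the book, the exact signature and output format are assumptions), might look like:

```python
from timeit import repeat

def perf_comp_data(func_list, data_list, rep=3, number=1):
    """Time each function (given by name) on its data set (given by
    name) and print results relative to the fastest one."""
    res = {}
    for name, data in zip(func_list, data_list):
        stmt = '%s(%s)' % (name, data)
        # globals() makes the names visible to the timed statement
        res[name] = min(repeat(stmt, repeat=rep, number=number,
                               globals=globals()))
    best = min(res.values())
    for name in sorted(res, key=res.get):
        print('function: %s, time: %9.5f s, relative: %6.1f'
              % (name, res[name], res[name] / best))
    return res

# tiny demonstration with two implementations of the same task
data = list(range(10000))

def g1(a):
    res = []
    for x in a:
        res.append(x * 2)
    return res

def g2(a):
    return [x * 2 for x in a]

results = perf_comp_data(['g1', 'g2'], ['data', 'data'])
```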
This is, for example, where profiling would come into play, and I don't present it here; as I said, my approach is more like: this is before, then we do something, we compare, and this is what it looks like afterwards. Profiling would of course have revealed that eval is a very time-consuming function and that most time is spent there, for example with f3 in this type of implementation.

Let me come, just briefly, to a rather subtle thing. We've seen that the numerical algorithms implemented based on NumPy ndarray objects, be it directly by the use of NumPy universal functions or by using numexpr, have been the fastest. But there are certain circumstances, and I encountered this quite a while ago, where at first I didn't know what was going on; later on it became pretty clear. So from my point of view it's worth mentioning: depending on the specific algorithm that you're using, memory layout can indeed play a role when it comes to performance.

Let me start with a typical NumPy ndarray object, which we instantiate by providing the dtype, here in this case float64, and here the order, the memory layout, comes into play. We have two choices with NumPy: 'C' for C-like layout and 'F' for Fortran-like layout. In this case you don't see any difference; there's nothing special, you see just the numbers printed out. But don't get confused, because this is just a graphical representation of the data that is stored. With an array like this, one can explain what memory layout is all about. When we have the C-like, row-wise storage, providing the order 'C', this means that the ones, the twos, and the threes are stored next to each other. So this is how it is laid out in memory.
Memory is a one-dimensional thing; each value is stored at a unique location, so we don't have two-dimensional or n-dimensional structures to store data into. It's a linear thing, and we have to decide how to put multi-dimensional things into memory. This is how it is stored when you use the order 'C'. Using the order 'F', we have column-wise storage, which means that the one, two, three of each column are stored next to each other.

Working with such small ndarray objects doesn't make that much of a difference, but when you come to larger ones, and in particular to asymmetric ones like this one, with 3 times 1.5 million elements in there, then you can expect some differences in performance. We instantiate two different ndarray objects here, one with the order 'C' and the other one with 'F', just to compare. When we now start calculating sums, for example, with the C order you already see a difference when calculating the sums over the different axes. NumPy is of course aware of the axes; with list objects, when you construct two-dimensional things out of them, there is no such awareness, no attribute for the axis. But here we can calculate the sum row-wise or column-wise, if you like, and you see there's quite a difference, like a 50 percent difference between the two axes, just for the performance of calculating sums. The one delivers back a one-dimensional result with 1.5 million elements, the other one returns a result with only three elements, and of course the numerical operations run differently over memory in the two cases. For standard deviations you observe the same thing: according to the results here, going along axis 1, which with zero-based numbering means the second axis, is much, much faster than the other way around. So if you have such a problem to implement, it might be worth considering whether it really makes more sense to have a 3 times 1.5 million array or a 1.5 million times 3 array; you will see considerable performance improvements going the one way or the other, depending on what exactly you're interested in when it comes to the calculations.

The sums with the F-ordered ndarray object, you see, are actually both slower; they are absolutely slower than the C-order operations. But you see different relative performances: in this case doing the sum along axis 0, the first axis, is faster relative to the other axis. The same, well, not really, but it is pretty close, holds true for the standard deviations. The absolute disadvantage might be due to the fact that C is the default and the C side of NumPy is a little bit better maintained, or more important. But once you have to interact with the Fortran world, for example, and are required, so to say, to work with the F order, it might again make sense to consider the question: is 3 times 1.5 million better, or 1.5 million times 3? In certain cases you will see huge differences.

Let me come to another approach. I think all the approaches that I present today are, in a certain sense, low-hanging fruits; there's typically not that much involved, no redesign of the algorithm itself, for example. I don't do any redesigns of algorithms; I'm always sticking to the same problem, the same implementation, and then showing what we can do. Sometimes you will see that using different libraries requires you to rewrite something, but it's not about, for example, the parallelization of a certain algorithm itself. What I present now is more like: given a certain implementation of an algorithm, what can I do with parallel computing?
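Before moving on: the memory-layout effect described above can be reproduced with a short script like the following (array sizes as on the slides; the timing numbers will of course differ by machine):

```python
import numpy as np
from timeit import timeit

# the same asymmetric data in both memory layouts:
# 3 rows times 1.5 million columns
x = np.random.standard_normal((3, 1500000))
C = np.array(x, order='C')  # row-major storage, the NumPy default
F = np.array(x, order='F')  # column-major, Fortran-like storage

for name, arr in [('C', C), ('F', F)]:
    t0 = timeit(lambda: arr.sum(axis=0), number=10)  # 1.5m-element result
    t1 = timeit(lambda: arr.sum(axis=1), number=10)  # 3-element result
    print('%s order: sum axis=0 %.4f s, sum axis=1 %.4f s'
          % (name, t0, t1))
```

The results are numerically identical either way; only the traversal over memory, and with it the performance, differs.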
As I already announced before, I'm used to using financial examples, and here is a Monte Carlo problem which is about the simulation of the Black-Scholes-Merton stochastic differential equation. It's a standard geometric Brownian motion, which is also applied in many, many other areas, in physics and so forth. What we want to do is simulate this model and value a European call option. Don't worry about the details; the point is only that this is usually a very compute-intensive algorithm to work with, one that typically benefits from parallelization.

But first, the implementation of the algorithm. What I do here is already, I wouldn't say an optimized implementation, but at least I'm using NumPy and vectorization approaches to be faster than, for example, the typical Python looping that we have also seen as an alternative before. I could have done this with plain Python as well, but that isn't the point here; I want to stick with this NumPy implementation and see what we can do when we parallelize the code. You see the import statement here within the function, because when we use IPython.parallel, which I will do here, the whole thing will be pickled, and we have to import within the function to get everything to the single workers.

First, as a benchmark of course, a sequential calculation. This example is only about calling the same function a couple of times, parameterizing it by different strike prices; again, you can replace this with any similar function from your specific area. What we're doing here is indeed just looping over the different strikes we are interested in and collecting the results that we get back from the function. Nothing special: a simple loop, collecting results, and we are finished. We do it for a hundred option valuations, and we get back the list of strikes and the results from our function; in this case the hundred calculations take 11.4 seconds.

Here are the results visualized, so that you get a feel, going over the strikes. For a European call option, the higher the strike, the lower the value; this is what we would expect, so obviously the function works pretty well. Now the parallel calculation. There are many alternatives; I've seen Celery already, and I know that there will be a couple of talks about the alternatives. But IPython.parallel, as I said, is usually a low-hanging fruit: many people are working with the IPython Notebook these days, and it's very well integrated there. We can just import the Client class from IPython.parallel and instantiate the client. In the background, using for example the IPython Notebook dashboard, I should have already fired up either a local cluster or, when really working in the cloud or with cloud-based services, a remote one. You can have quite large clusters; the largest ones I've heard of were about 512 nodes. IPython.parallel is known to be not that stable when it comes to, say, a thousand nodes, so it doesn't really scale beyond a certain point. But for people doing research, for example, or for smaller applications, it's a pretty efficient way.

What I'm doing here, once I have a client given my profile, my local cluster for example, is to generate a load-balanced view. The code that I need to do the same as in the sequential calculation is almost the same; there are just two differences worth mentioning. In this case I don't directly call the function; I rather asynchronously apply my function, given the parameterization, to my view. I append the result objects and have to wait until all the results have been returned; otherwise the whole thing would break down. So these are only two lines, if you like, that are added to the code, and this is not even in the algorithm; this is just how I collect the results.
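For reference, a sequential version of this setup might look as follows; the function name and the model parameters follow the book's example, and the path and step counts are kept small here, so treat the exact values as assumptions:

```python
import numpy as np

def bsm_mcs_valuation(strike, M=50, I=5000):
    """Monte Carlo estimator for a European call option under the
    Black-Scholes-Merton model (geometric Brownian motion)."""
    import numpy as np  # imported inside the function on purpose:
                        # this way it still works when the function is
                        # pickled and shipped to IPython.parallel workers
    S0, T, r, vola = 100., 1.0, 0.05, 0.2  # illustrative parameters
    dt = T / M
    rand = np.random.standard_normal((M + 1, I))
    S = np.zeros((M + 1, I))
    S[0] = S0
    for t in range(1, M + 1):
        # exact discretization scheme of geometric Brownian motion
        S[t] = S[t - 1] * np.exp((r - 0.5 * vola ** 2) * dt
                                 + vola * np.sqrt(dt) * rand[t])
    # discounted average of the call payoff at maturity
    return np.exp(-r * T) * np.mean(np.maximum(S[-1] - strike, 0))

# sequential benchmark: loop over 100 strikes, collect the results
strikes = np.linspace(80, 120, 100)
option_values = [bsm_mcs_valuation(K) for K in strikes]
```

The parallel version replaces the loop body by something like `view.apply_async(bsm_mcs_valuation, K)` on a load-balanced view and reads off each job's result once all jobs are done.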
So there's not that much overhead. Compared with the sequential implementation, we might have three or four new lines in total, and one line of code has been changed to implement the different approach.

And the parallel execution: now I'm a little bit surprised, what, does it take 29 seconds? Well, the wall time is not the right one to look at here. I've been looking at the wall time, but the total time for the execution was five seconds in this case, because we have used multiple cores. So it speeds up by a factor; where have we started, let me go back: 11.4 seconds, and we end up here on this machine at five seconds total time.

To have a little bit more rigorous comparison, I come back to my performance comparison function. But here you might have noticed that implementing this approach leads to different result objects. We don't get back only the number; what we get back is the whole set of results plus the metadata that the single jobs are returning. Having a look at the metadata, for example, you get back much more information, like when the job was completed, when it was submitted, and so forth. But we're mainly interested in the result, of course, and this can be retrieved via an attribute of this results object.
Here's the attribute, and in the end I can, with another loop, collect all the single results from the parallel application of the algorithm, just to compare the sequential and the parallel calculation. Of course there are numerical differences, because we're working with a numerical algorithm which implements a simulation, so we would expect numerical errors or differences even when doing the very same thing in parallel or sequentially. But what we are most interested in is the performance comparison, and to this end we have used the comparison function already. You see that working on a machine with four cores leads to a speed-up of roughly 3.1. Of course you have overhead for distributing the jobs, for collecting the results, and so forth, but in the end you will see that this approach typically scales almost linearly in the number of cores, not in the number of threads; hyper-threading doesn't bring that much for these kinds of operations. You will usually see almost linear scaling in the number of workers. For example, working with another server, an eight-core one, we have used these approaches and you see speed-ups of seven point something. And again, not that much overhead involved.
We haven't changed the algorithm at all, and by investing maybe an hour of work you might improve your numerical computations considerably.

If you're only working locally and are not interested in spreading the parallelization over whole clusters, then there is of course the built-in multiprocessing module. IPython.parallel scales over small to medium-sized clusters, but sometimes it's helpful to parallelize code on local machines, too. I don't know the percentages, but most machines as of today, even the smallest notebooks, have multiple cores, and even using two cores might already lead to significant speed-ups. When you now think of a larger desktop machine with four or eight cores, you will also see significant improvements, and again the fruits are low-hanging in this case as well.

So we import multiprocessing as mp, and our example algorithm here is again a Monte Carlo simulation. This one doesn't do the valuation, but it does the simulation part of the same thing, so there's not that much of a difference. We have a different parameterization here, but in the end it's the same core algorithm that we use to compare the performance. What this does is give us back simulated paths; in our case these will be stock prices, but many, many things in the real world, in physics and so forth, are simulated that way. Brownian motion was invented, so to say, in the first place for describing the movement of particles in water, so it comes from physics, but the finance guys have adopted many of the approaches used in physics. So we are simulating paths over time.

What we now do here is a, let's say, more rigorous comparison: we change the number of processes that we use for the multiprocessing implementation in a test series, implemented on a notebook with a four-core i7. We use the following parameters: 10,000 paths that we simulate, 10 time steps, and we want to do 32 simulations, which translates to the number of tasks that have to be accomplished. So it's a simple loop over a list object starting from 1 and ending at 8; we start with a single process and end with 8 processes. You see there's not that much code involved; it's actually pretty comparable to the IPython.parallel example. We just have to define our pool of processes, our pool of workers, and then we map, here in this case.
There are different approaches, I must say, but here we map our function to a set of input parameters; it works pretty much the same as the map statement from functional programming in Python. So we map our function to our set of parameters, say, well, please go ahead, and in the end we wait for everything to finish and append the times that it takes for the single runs.

But as always a picture says more than a thousand words, and you see here: for the 32 simulations we start, with one process, at a time approaching almost 0.7 seconds, and we come down to something like 0.15 seconds. You see that going beyond the core count doesn't bring that much in this case; at around four or five, actually at five in this particular case, we have the lowest execution time. But the benefits are pretty high here; again it scales almost linearly with the number of cores available, not with the number of threads, for our 32 Monte Carlo simulations. And as you have seen, it's mainly two lines of code that accomplish the whole trick.

Let me come to another approach. So far we haven't really touched the code of the implementation; for the last two examples we have just taken an implementation and parallelized it. But more often than not you first want to try to optimize what is actually implemented, and one very efficient approach is dynamic compiling. There's a library available called Numba, an open-source, NumPy-aware optimizing compiler for Python code, which is developed and maintained by Continuum Analytics, and it uses the LLVM compiler infrastructure. In a couple of application areas this makes for really efficient collecting of the benefits, the low-hanging fruits that I've been mentioning so often. It is sometimes really surprising, because there is not that much effort, not that much overhead involved, but usually you can expect, given a certain type of problem, really high speed-ups.

First an introductory example, before I come to a more realistic, real-world example. The example is only about counting the number of loop iterations, but counting in a slightly complex fashion: we have the transcendental function cosine in there and then calculate a logarithm. In the end this nested loop structure doesn't do anything else but count the number of iterations. But we know that looping on the Python level typically is expensive in terms of performance and time spent, and we see it here: when we parameterize this looping structure with 5,000 and 5,000, it takes about 10.4 seconds to execute. It shouldn't come as a surprise; we have a loop with 25 million iterations in this case. So, the benchmark to remember: 10.4 seconds.

We can of course use a NumPy vectorized approach to accomplish the same result. It wouldn't make sense, actually, to only count loops like this, but there are typical numerical and financial algorithms that are based on nested loops and that you can easily vectorize with NumPy, so this is a very general and very powerful approach. But we will also see what the negative consequences are in this case. Again, the function is pretty compact in the sense that we just instantiate our ndarray object, which is symmetric in this particular case, and we just do the calculation: we apply the logarithm and the cosine function and then do the summing over the resulting array object. It's always the same, always coming up with a one, but nevertheless it's compute-intensive, and we see there's already a huge speed-up: the execution time is below one second by using the vectorized approach. NumPy, as we know, is mainly implemented in C. What we are doing here is delegating the costly looping on the Python level to NumPy, and NumPy does it at the speed of C code, which is a little bit faster, as we see; actually, we have a speed-up of more than 10 times.

But there is one drawback: instantiating such a huge array leads to memory requirements. Here we see we need an ndarray object which in the end consumes 200 megabytes of memory, and that's not a nice feature: you have an algorithm which doesn't need to consume any memory, and here, using NumPy vectorization, leads to a memory burn of 200 megabytes. Now think of larger problems, and you will certainly find some where memory doesn't suffice in the end. So this is nice because it's faster, but it consumes lots of memory; if memory is not an issue, you might go that route.

But there is an alternative, and this is Numba, which I mentioned before. Again, the overhead is minimal in this case: I just import the library, usually abbreviated nb, and call the jit function for just-in-time compiling. Strictly speaking it's not exactly just in time; it's not compiled at runtime.
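Putting the three variants of the counting example next to each other (here with a smaller 500 x 500 parameterization, and with Numba treated as optional so that the sketch also runs without it):

```python
from math import cos, log
import numpy as np

try:
    from numba import jit          # Numba may not be installed;
except ImportError:
    jit = lambda func: func        # fall back to a no-op "compiler"

def f_py(I, J):
    # pure Python: nested loops that just count iterations,
    # with some transcendental work (cos, log) thrown in
    res = 0.
    for i in range(I):
        for j in range(J):
            res += cos(log(1.))    # cos(log(1)) == 1
    return res

def f_np(I, J):
    # NumPy vectorization: fast, but instantiates a full I x J array
    # (for I = J = 5000 in float64 that is 200 MB of memory)
    a = np.ones((I, J), dtype=np.float64)
    return np.cos(np.log(a)).sum()

f_nb = jit(f_py)   # compiled at call time, not ahead of time

I, J = 500, 500
r_py = f_py(I, J)
f_nb(10, 10)       # the first call triggers the (slow) compilation
r_nb = f_nb(I, J)  # the second call runs at compiled speed
r_np = f_np(I, J)
```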
It's compiled at call time, actually, but it's called jit here anyway. And I don't do anything to the Python function at all; I just leave the Python function as it is, f_py, and generate a compiled version of it using Numba. Executing this, we see that when I call it for the first time, it's still not that fast, because, as I said, it's compiled at call time, so for the first call there is quite an overhead involved. But when I call it for the second time, you see this is then ridiculously fast compared with the Python implementation. Here we get speeds comparable to C implementations, to optimized C implementations, because Numba uses the LLVM infrastructure, and on the LLVM level there are all these optimizing compilers that compile optimally to the given hardware at hand. This works as well, as I will show later on with a different example, on the GPU. So here we see huge improvements in speed-up, and again I can only stress the point: there's not that much effort involved. It's just applying jit to the original Python function, and you see huge, huge speed-ups with this implementation. So it might be worth considering Numba when you have a similar problem, with nested loops and so forth. And the beauty, which comes on top, is that the Numba implementation is as memory-efficient as the original one: there's no need to instantiate an ndarray object consuming 200 megabytes or even more. So the beauty of the memory efficiency remains, and you get these huge improvements by just compiling with Numba.

Binomial option pricing is a very popular, very important numerical algorithm in the financial world, so let's see if it works with that as well. Don't worry about the details again.
It's just a parameterization of the model. What we have to do here is simulate something, then we calculate some inner values of an option, and then we do a discounting. So we have a three-step procedure, if you like, and the three steps are illustrated here; I can make it maybe a little bit smaller. Again, the code is not that important, but there are two points worth noting. The first is that I do the whole thing based on NumPy arrays. So I do, if you like, Python looping, but based on NumPy arrays. I'm not working on lists with Python loops; I have my NumPy ndarray objects, and I do Python looping here over my arrays. And you see we have three nested loops to implement this when I go the looping route. This is not to say that you should do that, by no means, but I will show the effect of going different routes afterwards. So just remember: looping over NumPy ndarray objects, and we have three nested loops, and by now we should know that looping on the Python level is costly. What does costly mean in this case? The execution for a given number of time steps takes 3.07 seconds. Actually, this binomial option pricing algorithm solves the same problem that we were attacking before with the Monte Carlo simulation, so we can compare the results. And you see that the Monte Carlo simulation, which is usually considered to be the most expensive one when it comes to the computational power needed, is even faster in this case. It's not that exact, I must say; there are numerical errors in there. But it's three seconds for the binomial option pricing model here, compared to the 82 milliseconds for our Monte Carlo simulation from before. And you see that we get similar results from the two numerical methods; that is actually the point, the two algorithms solve the same problem in a sense. As a first improvement, again, we can go the NumPy vectorization route.
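The three-step loop-based procedure, and the vectorized variant just mentioned, can be sketched as follows; the parameterization is illustrative, not the slide's:

```python
import math
import numpy as np

# loop-based binomial (Cox-Ross-Rubinstein) pricing of a European call;
# the parameter values are illustrative, not the slide's
S0, K, T, r, sigma, M = 100.0, 100.0, 1.0, 0.05, 0.2, 500

dt = T / M                            # length of one time step
df = math.exp(-r * dt)                # discount factor per step
u = math.exp(sigma * math.sqrt(dt))   # up movement
d = 1 / u                             # down movement
q = (math.exp(r * dt) - d) / (u - d)  # risk-neutral up probability

# step 1: build ("simulate") the index level tree, looping over NumPy arrays
S = np.zeros((M + 1, M + 1))
for j in range(M + 1):
    for i in range(j + 1):
        S[i, j] = S0 * u ** (j - i) * d ** i

# step 2: inner values of the call option at maturity
V = np.zeros((M + 1, M + 1))
for i in range(M + 1):
    V[i, M] = max(S[i, M] - K, 0.0)

# step 3: step backwards through the tree, discounting as we go
for j in range(M - 1, -1, -1):
    for i in range(j + 1):
        V[i, j] = df * (q * V[i, j + 1] + (1 - q) * V[i + 1, j + 1])

# vectorized variant: the loops over the tree's nodes become array
# operations, leaving a single Python loop over the time steps
k = np.arange(M + 1)
Vv = np.maximum(S0 * u ** (M - k) * d ** k - K, 0.0)
for _ in range(M):
    Vv = df * (q * Vv[:-1] + (1 - q) * Vv[1:])

print(round(V[0, 0], 4), round(Vv[0], 4))  # the two agree
```

With M = 500 time steps, both versions land close to the Black-Scholes value of about 10.45 for this parameterization.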
I said, well, I don't touch the algorithms themselves, and I wouldn't say that I touch the algorithm here. This is just using different idioms, different paradigms, to implement the same algorithm in Python. And here, of course, we can again make use of NumPy vectorization. Is it two minutes left? Oh, okay. We can do the NumPy vectorization, and what you see from the vectorization is that it is again already much, much faster. But we can now apply the jit from Numba and get back a machine-code compiled version of our Python one, and then you see that we again get a speedup of three times. Comparing this more rigorously, you see here: the Numba version is 54 times faster than the pure Python one in this case, and three times faster than the NumPy version. Let me skip through a couple of slides; there is static compiling with Cython as well. At this point, before I forget it: if you go to my Twitter feed, @dyjh, I have tweeted the links to two presentations, actually: this one, and I also have a slide presentation, so you can read through all the details. I might do it right now, going to twitter.com: it is @dyjh, and I have tweeted the links there. So I'm not able to present everything, but there you find the links to the presentations on my Twitter feed. Static compiling with Cython works similarly; here are examples where you also get huge improvements. I skip through that in order to have a couple of minutes left for questions. But again, if we do a performance comparison in this regard, for example here working with floats (if you have a look at it, there's no need to work with floats, but still, having this rigorous performance comparison for the algorithm), you see I have an implementation using Cython and another one with Numba, and in this case they're actually pretty similar when it comes to performance. So with Cython you usually have to touch the code and you have to do static declarations and so forth, but with
Numba, sometimes (I don't say always, don't get me wrong), you can get the same speedups just by using the just-in-time compiling approach of jit. The actual last topic is the generation of random numbers on GPUs. I want to spend the last minutes on that, because this might be useful in many, many circumstances, and it is usually considered a very hard thing to get the power of GPUs included in your work. What I'm using here is NumbaPro, which is a commercial library of Continuum Analytics, kind of the sister or brother library of Numba, and what I use are the native libraries that are provided with CUDA in order to generate random numbers. There are not that many specialties included: we just generate random numbers which are stored in a two-dimensional array. Here's the code for the CUDA function. CUDA only gives back a one-dimensional array, so we have to reshape it afterwards, but this is straightforward. What I do is compare the performance for different sizes of arrays for which we want to get standard normally distributed random numbers back. I skip the first slide, because I have implemented a rigorous comparison, and what we see here in this one chart almost says it all: if you just want to generate a few random numbers, so to say, then the CPU might be the better choice, because there is overhead involved when you're moving data and code from the CPU to the GPU. But once you reach a certain size of the random number set, you see the increase in the time needed on the CPU, and you see that there's hardly any increase in the time needed on the CUDA device to generate the random numbers. So again, the message: if you have only a small set of random numbers, don't go to the GPU, there's too much overhead involved; remain on the CPU. But
again, if you're working with really large random number sets (the largest one that I'm generating here is 400 megabytes in size per random number set), then you see that the CUDA approach, well, yeah, of course, pretty much outperforms the NumPy in-memory version on the CPU. So again, it's only a couple of lines of code: it's a single library that you call, and you get all the benefits from that. And you see there's a huge speed advantage of the CUDA device over the NumPy one. The last thing I just want to mention is: how about I/O? Python is not only good when it comes to numerical operations; as I had it in my abstract, Python is also pretty good when you want to harness the power of today's I/O hardware. Usually it's pretty hard to get to the speed limit of the hardware, but here, working with an example, an array object of 800 megabytes, we just natively save that. You can also use PyTables and the HDF5 format and a couple of other things, but it's already built into NumPy that you can save your arrays to disk. You see this runs almost at the speed that the hardware allows. Here, writing on a MacBook with an SSD drive, you see, for the 800 megabytes, that it is much, much faster to save to and load from the SSD than it is to generate the data in memory. The in-memory generation of this 800-megabyte array, with the memory allocation and the calculation of the random numbers, takes 5.3 seconds, but on this machine it takes only two seconds to write it and two seconds to read it. So you see how fast you can be with Python, and there's no performance trickery involved.
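The built-in array I/O referred to here is np.save and np.load; a small self-contained sketch (the talk's array is 800 megabytes, a smaller one is used so it runs anywhere):

```python
import os
import tempfile
import numpy as np

# native NumPy array I/O in the binary .npy format
a = np.random.standard_normal((1000, 1000))  # about 8 MB as float64

path = os.path.join(tempfile.mkdtemp(), 'rand.npy')
np.save(path, a)    # write the array to disk
b = np.load(path)   # read it back

assert (a == b).all()  # the round trip is lossless
os.remove(path)
```

On a fast SSD, timing these two calls against the generation of the random numbers themselves reproduces the comparison made in the talk.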
This is just batteries included, and Python typically makes it pretty efficient and pretty easy to harness the power of the available hardware as of today. And this brings me to the end. Thank you for your attention, and sorry for standing between you and lunch.

You can just ask a question; I think I can hear you, and I can repeat it. Of course, of course, yeah. What I was saying is that for this particular algorithm we have data parallelism and code parallelism, but this is the most simple scenario you can think of. Of course, I'm pretty aware of that. Yeah, of course, it's one of the standard protocols that are used. We haven't used it ourselves, actually, but of course there are bindings and many application examples and pretty good use cases for that.

Okay, so hi. Thanks for the talk, love it. Small question: suppose you were doing a huge time series analysis. It's not in this scope, but obviously that's something that's kind of hard to do in parallel. There are algorithms that work very nicely in parallel, and there are others that don't. What's your gist on doing things with algorithms that don't work nicely in parallel? Besides compiling, what are some tricks you maybe use? This is just a question out of curiosity.

So, yeah, I understood: time series analysis. That is of course one of the things we are concerned with all day in finance, so it's one of my major topics, so to say, but I didn't quite get the question in the end.

So there are algorithms that you can really easily approach in parallel, and there are algorithms where this is not so easy. I can imagine that certain very advanced time series models, say ARIMA-type models, don't parallelize that nicely. What's your gist on this: if the problem is hard to parallelize, what's the best tactic to approach it?
I mean, of course, not everything is well suited to be parallelized. What we're using, for example: we're working heavily with least-squares Monte Carlo, where you need the cross-section. With the usual Monte Carlo you would say: a hundred thousand paths I can parallelize into two times fifty thousand paths. It's the same with time series analysis: you could say, I have a hundred thousand observations, I can run my algorithm on the first fifty thousand and on the second fifty thousand. That's one approach, but not every algorithm is well suited for that, because you need the whole history, or whatever is built up; you need the cross-section of the information in order to have your algorithm produce the results it's supposed to. So usually, the approach of using parallelization for an unoptimized algorithm is not the way to go. What you would do in this case, when you say, well, I don't have an algorithm that can be easily parallelized, is in any case go for the optimization of your algorithm, by using Cython and everything. But it's not only Cython; among the libraries, what I haven't mentioned is Theano, for example. Have a look at PyMC3:
This makes heavy use of Theano and its just-in-time compiling, where your objects, your classes, are dynamically optimized on the fly for the given problem at hand, using just-in-time compiling (or call-time compiling, which is a slight difference). But this is typically, I think, the approach that you would take: let's first optimize the algorithms with any means that we have available. But I agree, not every algorithm lends itself to being parallelized. Then again, if you have two time series to analyze, you can start thinking about it again. That was my point, actually: many similar tasks, and this is of course the trivial case for parallelization.

So if you're starting off with serial Python code, I think it's pretty obvious that using these parallel tools will make it go faster. But I think lots of people believe that Python is not what you should be using if you want efficient parallelization, because of the additional overhead, because you have to use multiprocessing, because of the GIL, and because Python is obviously a higher-level language. So I'm not talking about benchmarks, but about real-world applications, actual applications that you have written: what do you say to people who would be tempted to stick with C++ to squeeze out that last bit of performance? Do you find that Python is sufficient for the everyday needs of these kinds of simulations, so far?
We haven't ever gone the route of going to C++ or C, not for our client projects, nor for the things that we've been implementing ourselves. We're using, for some things, the multiprocessing module, for example for DX Analytics, our library, where we have simple, easy scaling and parallelization, at least on a single machine. This is typically where things run for us: larger machines with multiple cores and huge memory, that's the scenario that I have in mind. I mean, I can understand people that have issues, say with scaling over clusters and so forth. But of course you have this spectrum, and Cython is the very good example here, where you can say: even using Cython, I can decide whether I have 90% Python and 10% C, so to say, or something that looks like 90% C and a little bit of Python, and anything in between. And this is the beauty of Python: you usually don't have these either-or decisions. You can even, after profiling, say: well, this is the real bottleneck of the whole thing, let's optimize exactly that part. I recently met, during our Python for Quant Finance meetup in London, a guy who said he had again done something in assembler. I don't know if this is the right thing to do, but still, you have the flexibility, and that's the beauty of Python: it's not either-or, it integrates pretty well. And of course C and C++ are the two worlds that Python interacts with natively. So whenever you say, well, I have this approach and I can do this better with C++, then why not go that route? Many people are doing that, and in the financial industry it's kind of a standard to do so. But for the things we have been doing, and I can only say it again, for our own stuff and for the stuff that we have been implementing for our clients, we've always stuck to the Python world, of course using performance libraries and all that. Under the hood this comes down to C and other things, but not on the top level where we've been implementing things.