Thank you very much for the introduction. Good afternoon, everyone. My name is Ranjit and I work for Julia Computing. I'm going to be talking to you about multi-threading Julia, and most of this work was done with Kiran Pamnany from Intel Labs. So let's get started.

Before anything else, we should ask: why multi-core? As you can see in this by-now-familiar graph, CPU clock speeds have flattened out at around 3 GHz, and chip manufacturers are instead putting more and more cores on a single chip. So the responsibility has been transferred to the programmer: you have to think about your application and write it in parallel so that it can use all the cores on your chip.

This is essentially what's happening in high-performance computing today. Most HPC codes are moving towards hybrid MPI + X models, where MPI is the Message Passing Interface. You have an application and you have a cluster: the application is run by many processes sitting on different nodes, and since each node is a multi-core machine, those processes are themselves threaded, using a shared-memory model within the node. This gives you much better performance, but it is also prone to programmer error, and you'll see why.

So how do you divide work among threads? There are a couple of ideas around this. One is called task parallelism, which essentially means you have different threads executing different pieces of code. Unfortunately this is a challenge in terms of correctness: how do you verify that your final answer is correct? Task parallelism is notorious for Heisenbugs, bugs that show up on some runs and not at all on others.

In Julia we have implemented data parallelism. Data parallelism essentially means you have a large block of data that you divide into chunks, and threads operate on each of those chunks. Each thread executes the same code, but everything happens in parallel.

So where does Julia fit into all this? If you take a look at this graph, there are a bunch of bars, but the key takeaway is that each bar stands for the relative difference in performance between parallel code and naive serial code. There's a 50x factor there, which essentially means that to get optimal performance on a large parallel system, a lot of architecture-specific hand tuning is needed, and it increases as the number of cores increases. That has unfortunately become the industry norm, so it takes a lot of time to write a decent parallel program that gives you good performance. This is where Julia fits in: we're going to try to make that simple.

Julia has a multi-threading infrastructure which currently lives on a threading branch on GitHub and will be merged into the 0.5 master as soon as all the conflicts are resolved. There is support for locks and atomics. To compile this branch you need to set a flag, JULIA_ENABLE_THREADING, in one of the header files, and then you can start using threading.

So what exactly is the threading model? It revolves around one macro: @threads. Essentially, what we are implementing here is loop parallelism. Let's say you have an iteration space.
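As a rough sketch of that loop-parallelism model, here is what a threaded loop looks like in a current Julia build with threading enabled; the exact macro syntax on the experimental branch described in the talk may differ slightly, and the function and array names here are only illustrative.

```julia
using Base.Threads   # provides @threads, nthreads, threadid

# Fill an array in parallel: the iteration space 1:length(a) is split across
# the available threads, and each thread runs the same loop body on its chunk.
function fill_squares!(a)
    @threads for i in 1:length(a)
        a[i] = i^2          # iterations are independent of one another
    end
    return a
end

a = zeros(Int, 1_000_000)
fill_squares!(a)
```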
You split this iteration space between your threads and everything runs in parallel, just as in the example code over there. You can do this in a number of ways: for example, you can split an iteration space like k = 1 to n, or you can put a bunch of code inside a begin/end block, or you can have all your threads call a function f.

So what exactly is happening here? First there is a broadcast operation, and at the end there is a barrier. The @threads macro just expands to a ccall; ccall is the way you interface with C functions in Julia, it's how we call a C function from within Julia. What essentially happens is that every thread gets the function f and its arguments, and all the threads invoke f simultaneously. As soon as a thread is done it waits on a barrier so that all the other threads can catch up, and then your code continues after that point.

Now there are a number of things to consider while doing multi-threading in Julia. Firstly, there should be enough parallelism in the problem. What if there isn't? Think about this: the chunk of data you are operating on in parallel should be big enough. If it is too small and you keep throwing threads at it, you run into something called oversubscription, and the broadcast and barrier latency associated with launching your threads overpowers any speedup you get. These are the things you have to think about.

There is another thing to consider, called thread safety. Thread safety essentially means that you manipulate your shared data structures, the data structures all your threads operate on, in a safe manner: multiple threads shouldn't be writing to the same location at the same time, that particular bit must be serialized, and so on and so forth. In the case of @threads, to give you a very specific example, your second iteration shouldn't depend on your first. If it did, those two operations would be inherently serial and you shouldn't be using threads at all.

So we decided to test this out on a few workloads: does the thing actually work? This is a summary graph of all the workloads we ran. Julia threading is the purple bar, the fourth one, and as you can see it is pretty effective compared to all the others. The applications here: one is a 3D Laplace solver, another is a Monte Carlo simulation, and another is a lattice Boltzmann model, a fluid-dynamics model. These simulations were run on a Haswell server with 18 cores per socket, and an NVIDIA K80 GPU (Kepler architecture).

So let's look into these. Julia threading is the last-but-one column and Julia single-threaded is the last column. As you can see in this particular diagram, Julia threading is pretty effective, and for every workload I'm going to show you something called a scaling graph, which plots performance versus the number of threads you use. Here you see it scales to about 5 or 6 threads, and this happens for the reasons I told you earlier: even though there are 18 cores, and you'd honestly love to see an ideal scaling graph going all the way up, due to these latencies it plateaus at about 5 or 6 threads.
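Before going through the remaining workloads, here is a small sketch to make the thread-safety point above concrete; it is not from the talk's slides, and the function names are invented. It contrasts a racy shared counter with one protected by the atomics support mentioned earlier.

```julia
using Base.Threads

# Not thread-safe: every thread read-modify-writes the same variable,
# so increments can be lost.
function racy_count(n)
    total = 0
    @threads for i in 1:n
        total += 1            # data race on the shared counter
    end
    return total
end

# Thread-safe: the shared counter is an atomic, so each increment is
# serialized at that one memory location.
function atomic_count(n)
    total = Atomic{Int}(0)
    @threads for i in 1:n
        atomic_add!(total, 1)
    end
    return total[]
end

atomic_count(1_000_000)   # always 1_000_000; racy_count(1_000_000) may fall short
```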
The next workload, though, has a pretty good scaling graph, with good scaling up to 18 cores. All these workloads have been benchmarked against MATLAB and GPU-accelerated MATLAB: this particular column stands for GPU-accelerated MATLAB, and that number is about 2.96 seconds, while Julia with threading is about 3.01 seconds. So we're very close, which shows that with your existing infrastructure you can do a whole lot with the multiple cores on your machine.

This is the final workload. As you can see, again Julia threading is pretty effective compared to the others; apologies that we couldn't get the corresponding MATLAB number, but Julia threading in this case as well scales to about 5 or 6 threads and then plateaus out.

There are other workloads we have run whose results are preliminary. We actually show this one to demonstrate the effectiveness of Julia's new garbage collector, which of course Julia master has now: earlier we had a few issues, and the new collector gave us a speedup out to about 5 threads. This is a 2D wave equation; it's actually a showcase workload from MATLAB's website for GPU acceleration, which is why GPU acceleration is very effective here. That's partly because the problem is very FFT-intensive, and cuFFT, CUDA's FFT library, is highly optimized. We are calling FFTW, which is what gets called on the CPU as well, and Julia threading still has a ways to go to catch up with that.

So let me see if I can show you a little demo at this point. Is the font visible? I have implemented a neoclassical growth model, which is a model in macroeconomics that is solved in an iterative manner, just to show you how simple threading is. I'm going to start Julia and run this model for you. Let's time it: as you can see, it reports the number of iterations, I think it's about 250-something iterations, and it takes about 18 seconds. Now let's see what happens with threading: all you need to do is add a couple of words, the @threads macro, to the main loop. Let's compile and run this again, and you can see it's basically faster, a good 2x speedup. So there you have it, folks: threading is really simple.
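The growth-model code itself isn't shown in the talk, but a hedged sketch of the kind of change involved might look like the following; the function names and the update rule are invented stand-ins for an iterative solver, not the actual neoclassical growth model.

```julia
using Base.Threads

# Stand-in for one sweep of an iterative solver: each entry is updated from
# the previous iterate only, so loop iterations are independent and the only
# change needed to thread the sweep is the @threads prefix.
function sweep!(vnew, v)
    @threads for i in eachindex(v)
        vnew[i] = 0.5 * (v[i] + sqrt(abs(v[i])))
    end
    return vnew
end

function solve(n; iters = 250)
    v    = rand(n)
    vnew = similar(v)
    for _ in 1:iters
        sweep!(vnew, v)
        v, vnew = vnew, v        # swap buffers between iterations
    end
    return v
end

@time solve(1_000_000)           # rerun with more threads and compare timings
```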
Regarding the question of how many threads this ran with: this particular run used 4 threads, and you can just check that; it's a 2x speedup, so it could scale further. This is a 16-core machine, and I haven't tried increasing the thread count to see how it scales. On the question of why it stops scaling: yes, it does depend on the number of cores you have, and it's important to match the number of threads to the number of cores; if you go beyond that, you probably won't see any more speedup. On the question of architectures: for now this is just for x86, since the work is being done in collaboration with Intel. And on the comparison with MATLAB: yes, in some cases I showed you a number for MATLAB accelerated on the GPU, and Julia could be GPU-accelerated as well; just on the CPU we are getting close to MATLAB on the GPU, so with a GPU Julia could do better than that. Thank you.
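A hedged note on the thread-count question: in current Julia releases (which may differ from the experimental branch described in the talk), the thread count is set before startup via the JULIA_NUM_THREADS environment variable and can be queried at runtime, for example:

```julia
# Launch Julia with, e.g.,  JULIA_NUM_THREADS=4 julia  and then check:
using Base.Threads

println("running with ", nthreads(), " thread(s)")

@threads for i in 1:nthreads()
    println("hello from thread ", threadid())
end
```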