Are you ready now? Yes. Okay, good. So we can start. We continue with the next presentation, by Jan Verschelde, on a topic that is not just mathematics; there is more to it, but let him explain what it is about.

Thank you very much. It is always a pleasure to come here. I am actually half mathematician, half computer scientist: I am in the Department of Mathematics, Statistics and Computer Science, but I got my degree from Belgium, so that is one of the prime motivations for being here, and I have always learned a great deal by coming. I will talk about high-level parallel programming and how good it really is for shared-memory parallelism. My primary interest is still in solving polynomial systems, but I picked one target problem that I can use as a running example. We have very good experience with the basic work-crew model, but we are also experimenting with different models of load balancing, and I will briefly mention the work-stealing method.

This is probably the most mathematical slide, so the mathematics comes at the beginning. We have a bipartite graph: on one side the unmarried men, on the other side the unmarried women, and there is an edge between a pair when the man and the woman like each other and want to marry. The problem is to count all the perfect matchings: you want to connect every unmarried man with an unmarried woman, and count in how many different ways you can do that. That is the graph interpretation. The data structure is a matrix of zeros and ones: there is a zero if there is no match, so if the first man and the first woman do not like each other, there is a zero; if the first man likes the second woman, and it also goes the other way around, then there is a one. In computer science this is called the marriage problem.
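The talk's code is in Ada, but the counting problem itself fits in a few lines; here is a minimal, illustrative Python sketch (my own naming, not the speaker's code) that counts the perfect matchings of a 0/1 adjacency matrix by trying all permutations, i.e. it computes the permanent straight from its definition:

```python
from itertools import permutations

def count_perfect_matchings(A):
    """Count the perfect matchings of the bipartite graph given by
    its 0/1 adjacency matrix A: this is the permanent of A."""
    n = len(A)
    total = 0
    for p in permutations(range(n)):
        # a permutation contributes 1 only if every man i is
        # connected to the woman p[i] he is matched with
        prod = 1
        for i in range(n):
            prod *= A[i][p[i]]
        total += prod
    return total

A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
print(count_perfect_matchings(A))  # 2
```

For this 3-by-3 example, where nobody is matched to their own index, the two matchings are the two derangements of three elements.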
It's also called the job assignment problem: another version is that you have a set of workers on one side and a set of jobs that need to be done on the other, and there is an edge when a worker is capable of doing a particular job. Why is this problem so interesting? Well, first of all, the algorithm is really, really straightforward. In high school you must have computed determinants; what I want to do is actually easier, because you don't have the sign pattern: if you computed a determinant, you had to remember the signs of the permutations, and here you don't. It is a simple row expansion. You have two ones on the first row, so you reduce the problem to two simpler problems; then the first matrix has two ones on its first row, the second matrix gives three factors, and it goes on like that. This is the permanent. So it is a very simple algorithm to code up, but here comes the surprise: it is really, really hard to compute. Already for matrices of dimension 17 you have to wait three minutes. This problem is what theoretical computer science calls #P-hard: if you want the exact number, there is essentially no better algorithm than to go through all the permutations.

So you have a matrix of relatively small size; here I generate the matrices at random, just by flipping coins. You have a regular data structure, but the structure of your computation is still unpredictable, because there are a lot of zeros sitting in there: some factors will compute fast, other ones will require more time. The third thing is that this is a computation that you can parallelize very easily. Here you see a 10-by-10 matrix, and at the right are the beginnings of the permutations. We have a two and a one at the upper right corner: from the first row I selected the one in the second column.
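The sign-free row expansion described here can be sketched as follows; this is an illustrative Python version, not the Ada code from the talk. It is the determinant's cofactor expansion minus the alternating signs:

```python
def permanent(A):
    """Permanent by expansion along the first row: like the
    determinant's cofactor expansion, but without the signs."""
    n = len(A)
    if n == 0:
        return 1  # empty product: base case of the recursion
    total = 0
    for j in range(n):
        if A[0][j] == 0:
            continue  # zeros prune the expansion
        # delete the first row and column j, recurse on the minor
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += A[0][j] * permanent(minor)
    return total

# the 3-by-3 matrix with no matches on the diagonal: permanent 2
assert permanent([[0, 1, 1], [1, 0, 1], [1, 1, 0]]) == 2
```

The zeros prune whole branches of the recursion, which is exactly why the work per subproblem is unpredictable, as the talk notes.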
And from the second row I selected the first column: so two, one. I have a 10-by-10 matrix, and the remaining factor is an 8-by-8 permanent, and that 8-by-8 permanent can be computed independently of the ten other ones. The next one is two, three: again you select the second column and then the third column, and again you have an 8-by-8 permanent that can be computed independently. So here a 10-by-10 permanent is reduced to eleven 8-by-8 permanents. By a simple partial row expansion you can generate as many jobs as you want.

So how do you code this up? Well, you all have a parallel computer; they actually don't make serial computers anymore. You all have multiple cores, and the cores all have access to the memory. In the work-crew model we initialize a queue of jobs: as in the previous example with eleven jobs, I have a queue of eleven items, and the items are the starts of the permutations. The queue, the data structure, has a semaphore. When a task is idle it looks for the next job: it has to request the semaphore and wait if the semaphore is held by another task; once it has the semaphore, it takes the job, increases the job counter, and continues computing. So that is one, simple, way of load balancing.

We have been looking into another strategy: instead of one single queue, every task has its own queue, a double-ended queue, and that double-ended queue is used as a stack by the task that owns it. But underutilized tasks can start stealing when their own deque is empty. That is the idea; it has been around for a long time, and it works very well in situations where you have backtracking searches.
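The job generation by partial row expansion can be sketched like this (again illustrative Python; `generate_jobs` and its signature are my own naming, not from the talk). Expanding the first two rows of an n-by-n matrix yields one independent job, an (n-2)-by-(n-2) permanent, for every choice of distinct columns with ones in those rows:

```python
from itertools import permutations

def permanent_bf(A):
    """Reference permanent of a 0/1 matrix, straight from the definition."""
    n = len(A)
    return sum(all(A[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

def generate_jobs(A, rows=2):
    """Expand the first `rows` rows of A; each surviving choice of
    distinct columns becomes one job: the start of the permutation
    plus the minor whose permanent can be computed independently."""
    n = len(A)
    jobs = []

    def expand(r, cols):
        if r == rows:
            minor = [[A[i][j] for j in range(n) if j not in cols]
                     for i in range(rows, n)]
            jobs.append((tuple(cols), minor))
            return
        for j in range(n):
            if j not in cols and A[r][j] == 1:
                expand(r + 1, cols + [j])

    expand(0, [])
    return jobs

A = [[1, 1, 0, 1],
     [0, 1, 1, 1],
     [1, 0, 1, 1],
     [1, 1, 1, 0]]
jobs = generate_jobs(A, 2)
# the jobs are independent: their permanents sum to the full permanent
assert sum(permanent_bf(m) for _, m in jobs) == permanent_bf(A)
```

Expanding more rows gives more, smaller jobs, which is the granularity knob discussed later in the talk.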
The permanent, you can see, is also a backtracking search, where in one direction you may go very far and have to compute a lot, and in another direction you cannot generate that many jobs.

How do we do this in Ada? This is kind of a cartoony slide, but there is not much more to it. To launch a set of workers, you define a procedure that is generic: it takes another procedure, called Job, as a parameter, and that Job defines whatever you want the workers to do. It has two arguments: the identification number of the worker and the total number of workers. The implementation of the multitasking that starts all these jobs is shown here; this is a procedure that I use over and over again. Job defines its own memory: you have to make sure that every task has its own memory, as local variables, so that there are no memory conflicts. Once that is set up, it all works quite well. So that is the first idea.

The second idea, the semaphores, I actually took from the AdaCore Gems. To synchronize the taking of the jobs from the queue, you have a package with one single protected variable that acts as a semaphore. You have a simple array of pointers to whatever the jobs are, and that array is protected by the semaphore. You also have a global variable, which is simply an array of factors: in this application, where the permanent of, say, a 10-by-10 matrix is split into several 8-by-8 permanents, you have as many accumulators as you have tasks, and every task updates its own variable. At the end, when all the threads have finished, you sum the factors. Dynamic load balancing works quite well this way.

Here are some results; this is when the fun starts. This was done on this laptop.
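The work-crew model just described can be sketched in Python as below; this is a simplified stand-in for the Ada version, with `queue.Queue`'s internal lock playing the role of the protected variable, and one accumulator per task summed after all threads have joined (the names are mine, not the talk's):

```python
import threading
from queue import Queue, Empty

def work_crew(jobs, compute, ntasks):
    """Work-crew model: one shared job queue; each idle task takes
    the next job; one partial result per task, summed at the end."""
    q = Queue()
    for job in jobs:
        q.put(job)
    partial = [0] * ntasks  # one accumulator per task: no conflicts

    def worker(tid):
        while True:
            try:
                # the queue's internal lock plays the semaphore's role
                job = q.get_nowait()
            except Empty:
                return  # no jobs left: this worker is done
            partial[tid] += compute(job)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(ntasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)  # summed only when all threads have finished

# toy usage: sum the squares of 1..10 with four workers
print(work_crew(list(range(1, 11)), lambda x: x * x, 4))  # 385
```

In the permanent application, each job would be the start of a permutation and `compute` would evaluate the corresponding minor's permanent.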
So it has two cores that support hyperthreading, so you can launch four tasks. I am working with Boolean matrices; you can also define this for integer matrices. What I am counting here are all the contributing permutations. This is a 16-by-16 matrix, so the count could go all the way up to 16 factorial, but there are a lot of zeros in there: I generate the matrices at random, with a coin flip for every entry. So sometimes the counts are very small, other times very large. I can also play with the granularity: I can expand the first two rows, but I can also expand the first three rows. At the very beginning, before all the tasks are started, the queue of jobs is built, so there is some room to play there. If you generate fewer jobs, you can very quickly fill the queue and start all your tasks, but then you may have too few jobs for full parallelization; you can also generate more jobs. With two cores you sometimes get the speedup that you may expect. This is two tasks, this is four tasks. With two tasks not everything is fully occupied; with four tasks you actually have five threads running: if you run this, you see on your performance monitor that there are five threads active. That might be a little bit too much, so you can also play with three tasks, and you can expand the first three rows or the first two rows. The result of this experiment is that with very little code, and already on an interesting application, you can gather a lot of information and sometimes hit very nice speedups. Any questions at this point? Feel free to interrupt.

Now what if you have a real workstation? I also have a 44-core machine on my desk, and then you can play with doubling the number of cores each time. It also supports hyperthreading, so I went all the way up to 64 tasks. And you can also play with the number of jobs.
This was now on the same matrix. I should have said that in the previous slides I always generated different matrices, which is why you see the fluctuation. Here we take the same matrix; I went to dimension 17, but I did not want to wait too much longer. You can see that with relatively few jobs it goes well in the beginning but not so well at the end; you are better off generating more jobs. So with 44 cores I get to a speedup of about 30, with quite basic code.

What we are now investigating is the application of work stealing. This is still work in progress, more an implementation plan than something that is actually working, so I will mainly say what we are thinking about. Instead of having one single queue, we will have an array of double-ended queues. We will need two semaphores for every double-ended queue, because it could happen that a queue collapses to a single element, and you do not want the task that owns the queue to have to fight with the tasks that try to steal its jobs. That is one thing that has to be implemented; the other is the work-stealing algorithm itself.

With the single queue, we expect that we lose some speedup because of the initial start-up time: every program has a serial component that cannot be parallelized, and that is the main limit on your speedup (Amdahl's law). With work stealing there is no start-up phase anymore, because every task will build its own queue of jobs: you can launch the tasks immediately, and they build their own queues on a very specific schedule. You can linearize the permutations, count them 1, 2, 3, and distribute them among the tasks by that count: every task runs through all the permutations, but it only solves those factors whose index, taken modulo p, with p the number of tasks, equals its own task identification number.
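The work-stealing plan can be sketched as follows; this is a speculative Python illustration of the design being described, not working code from the talk. It uses one lock per deque where the plan calls for two semaphores, fills each task's deque with the jobs whose index modulo p equals the task id (so there is no serial start-up phase), and lets an idle task steal from the next tasks in id order:

```python
import threading
from collections import deque

def work_steal(jobs, compute, ntasks):
    """Work-stealing sketch: each task owns a deque, filled by the
    modular rule job_index % ntasks == task_id; an idle task steals
    from the other end of another task's deque."""
    dq = [deque() for _ in range(ntasks)]
    locks = [threading.Lock() for _ in range(ntasks)]  # plan: 2 semaphores each
    for i, job in enumerate(jobs):
        dq[i % ntasks].append(job)  # every task builds its own queue
    partial = [0] * ntasks

    def worker(tid):
        while True:
            job = None
            with locks[tid]:
                if dq[tid]:
                    job = dq[tid].pop()  # own deque is used as a stack
            if job is None:
                # steal from the next tasks, in order of their id numbers
                for k in range(1, ntasks):
                    victim = (tid + k) % ntasks
                    with locks[victim]:
                        if dq[victim]:
                            job = dq[victim].popleft()  # steal the other end
                            break
                if job is None:
                    return  # all deques empty: nothing left to do
            partial[tid] += compute(job)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(ntasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)
```

Owner and thieves work on opposite ends of the deque, which is why the plan reserves two semaphores per deque: they only contend when the deque shrinks to a single element.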
In this way they all have their own specific recipe for knowing which factors they have to do. In the work stealing, a task then steals from the next task: not always from the first one, but from the next one in the order of the identification numbers.

My last two slides are more advertisements. I have been working on my software for quite a while; the code that I showed today is available on GitHub, and I am still trying to maintain it. Multitasking for shared memory is really a good way of exploring parallel algorithms. In case you wondered why you might need permanents in polynomial system solving: there are applications out there. Perhaps you have seen the movie about the mathematician Nash; the number of totally mixed Nash equilibria is actually a permanent, so in game theory these things come up. I once taught a course where there was an undergraduate in economics, and Nash equilibria are already taught at the undergraduate level, so it is quite common. We have already worked with work stealing in a polyhedral context; when I typed up the abstract I was thinking more about polyhedral cones, but I had not yet realized that permanents are much, much nicer for a 20-minute talk. So I thank you for your attention. If there are any questions, let me know.

So the question is whether I saw [inaudible] side effect. I am not sure if I understand... oh yes, so the time is up.