Our next speaker is Jan Verschelde; I do not know how to pronounce it the English way. He comes from the USA, from the University of Illinois at Chicago. Thank you very much.

I would like to thank the organizers for having me here again. This is a talk on a mathematical application of Ada, in particular of multitasking. I am the only one in this session with a subtitle, and I am glad that did not scare you off; I am glad you came.

Here is my outline. The main point of my talk is that Ada works extremely well for writing shared memory parallel programs, and the whole point is that you really get to do what you want to do. There is some mathematics in there; I will use the mathematics mainly to indicate how you can tune your application. Computers are getting faster and faster, but that is actually creating a problem for us: it is not that we lack problems that run too slowly, but making computers work efficiently is quite the challenge these days.

Here is the motivation, shown with a picture, although the picture is somewhat misleading. What we are really looking at are polynomial equations: two polynomials in three unknowns, which define a space curve. The space curve is this figure eight, slightly bent, and you are positioned at the top, at the point (0, 0, 2), which satisfies the two equations. You know one point on the curve and you want to continue; you want to see what lies ahead. So you compute power series expansions. The red one is the trivial one: if you take only one term in the expansion, you think the curve is a kind of parabola. But as you take more and more terms in the power series expansion, already the next one is very interesting: you see the crossing point, and that is where the interesting things start to happen. As you take more and more terms in your power series, that crossing point starts to approximate the real crossing point. The picture is misleading because we only see the equations, and the crossing point is typically where your power series no longer converges; but here the expansions do allow you to say something about the behavior of the curve.

The point of the talk is to compute this efficiently, and we run Newton's method all the time. It is essentially the Newton's method you have seen in high school. With power series you can work symbolically; if you had to do this by hand, you would compute term by term. What you need to do is evaluate polynomials over and over again, you also need all the partial derivatives, and then there is a linear system to solve. Those are the three things you have to do: evaluate, differentiate, and solve a linear system. And it converges quadratically, so Newton's method is a very promising method; a minimal sketch of one step follows below.

But there are a lot of computational difficulties. If you do this by hand, you quickly give up; even for one variable it can be hard, and our curves live in any dimension. Then there is the degree of the curves. I should have pointed this out earlier: this example is a quartic, a curve of degree four, coming from two equations of degree two. With three quadratic equations in four variables, the degree is eight; with four quadratic equations in five variables, the degree is 16.
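As a minimal sketch in my own notation, not taken verbatim from the slides, one step of Newton's method on power series truncated at degree d combines exactly those three operations:

```latex
% F is the polynomial system, x^{(k)}(t) the current truncated power series,
% and J_F the matrix of all partial derivatives (the Jacobian).
% Evaluate and differentiate, then solve a linear system for the update:
J_F\bigl(x^{(k)}(t)\bigr)\,\Delta x(t) = -F\bigl(x^{(k)}(t)\bigr) \pmod{t^{d}},
\qquad
x^{(k+1)}(t) = x^{(k)}(t) + \Delta x(t).
% The quoted degrees of the curves follow the product of the degrees of the
% equations: n quadratic equations define a curve of degree 2^n.
```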
You can see that it grows very quickly: every time you increase the dimension by one, the complexity of your problem doubles. With ten quadratic equations in eleven variables, you may have a curve of degree about one thousand. And even if you fix the dimension, say I am not going past ten, you still have the power series, and they go on forever; there is no limit there. Some power series converge rapidly, and then we are lucky, but some power series converge very, very slowly, so we may not know how far we need to go. That was one of the motivations for this study: we want something that runs fast and allows us to compute to very large degrees. And even if you say you will only compute eight terms, there is the multiprecision arithmetic you may need: at high degrees the roundoff starts to creep in, and double precision is unlikely to be sufficient. These are the three motivations we have for using parallel computations.

The code was developed on three different platforms. The laptop I have here with me sits roughly in the middle between the other two configurations. First of all, the obvious point: you cannot get single-core processors anymore; all processors are multicore. The other point I would like to make is that while it looks very appealing to have a big workstation sitting on your desk, keeping it fully occupied is another matter. That workstation is also the machine serving our web server; perhaps I should have said at the beginning that we also run a web server, so this machine hosts the web interface to our software. It has 44 cores, and I get the best results with 88 threads. But I will restrict myself to the middle configuration.

And here is the Ada code, the Ada code that I have been using for a very long time now, and that other people have used as well. I saw a recent paper published in Composite Structures on laminate design. If you want to know where polynomial systems occur in practical applications: if you have to design a robot arm, for example, the arm can take several possible configurations. You can already see this with my elbow: you want to reach a certain position, and then you want to compute all possible angles of your arm that reach that position, without twisting your elbow. The polynomial equations express that the lengths of your mechanism have to stay fixed. So the methods I am using are very well known in mechanical design.

My users typically just download the binary version. With the option -t it uses multithreading, and it runs this very simple procedure: it launches tasks, and every task has a unique identification number. It is a generic procedure; what you see here is the body of that generic procedure, and the generic parameter is the procedure Job. Job is what I provide, and based on its identification number I can tell each task specifically what to do. So the tasks work through job queues; a sketch of such a launcher follows below.

If you want to write multitasking code, there are two main issues to consider. The first one is memory. Every task has its own stack, but they all share the same heap, so if you allocate and deallocate, you had better do that outside the task routines.
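As an illustration only, here is a minimal sketch of what such a generic launcher could look like in Ada. The names (Run_Tasks, Worker, Job) are hypothetical and this is not the actual library code; it only shows the pattern of handing each task a unique identification number and letting the generic parameter Job decide what that task does.

```ada
--  Minimal sketch (hypothetical names, not the actual library code):
--  a generic procedure that launches Nb_Tasks tasks; each task gets a
--  unique identification number and calls the formal procedure Job.

generic
   with procedure Job (Id, Nb_Tasks : in Positive);
procedure Run_Tasks (Nb_Tasks : in Positive);

procedure Run_Tasks (Nb_Tasks : in Positive) is

   task type Worker is
      entry Start (Id : in Positive);   --  receives the identification number
   end Worker;

   task body Worker is
      My_Id : Positive;
   begin
      accept Start (Id : in Positive) do
         My_Id := Id;
      end Start;
      Job (My_Id, Nb_Tasks);   --  the job decides what to do from its id
   end Worker;

   Workers : array (1 .. Nb_Tasks) of Worker;

begin
   for I in Workers'Range loop
      Workers (I).Start (I);   --  hand out the identification numbers
   end loop;
end Run_Tasks;   --  the procedure waits here until all tasks have finished
```

With this pattern, synchronizing amounts to letting the procedure return and calling it again for the next batch of jobs, which matches the stop-and-relaunch approach described in a moment.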
In particular, everything you do with these power series requires auxiliary vectors, and you have to allocate those outside the tasks and pass the pointers to the auxiliary data structures in the arguments of the jobs.

The second thing you have to worry about is the granularity of your computations: there should be enough parallelism. Typically you have a limited number of tasks, but those tasks need sufficiently many jobs to do. Sometimes you also need to synchronize. Synchronization means that there is a point where all the tasks have to wait before they continue. The easiest way is to stop the tasks and then relaunch them, and that is how we do synchronization right now. Those are the two main issues. I actually have three implementations of my power series library already: first you get the functional correctness right, but then you have to think about it in a whole different way for shared memory parallel computation. And of course there are always other issues; I will point some of them out.

Okay, so what do we do? We evaluate and differentiate. This is something I only learned when I was working on the parallelism. You may have seen rules to differentiate, to compute all partial derivatives. The nice thing is that for a product of n variables, doing this symbolically is an n-squared operation, but it can actually be done in time linear in the number of variables: you accumulate forward products of the variables and backward products, and each partial derivative is then one forward product times one backward product. You can see it on this simple product of variables: on the left are the names of the variables, the kind of funny names, and on the right you see all the star operations. The point is that the star is not multiplication of numbers: we multiply power series with each other, and the coefficients are typically multiprecision numbers, double doubles or quad doubles. So there is a lot of arithmetic overhead, and you want to save on the number of multiplications. The bottom line is that this saving is what makes everything run very well in our parallel computations. And there is a straightforward parallelism: a polynomial, as the name says, has many, many monomials, and all the monomials can be computed independently. So that is the first mathematical idea.

The second idea is that you could work with a matrix of power series and invert that, which is all very fun, but what you should actually do is look at a power series whose coefficients are matrices. I worked out the simplest example here. We are going to solve a linear system; the matrix entries are the partial derivatives we have computed, so they are power series again. If you linearize it, you arrive at a block triangular system. And you can invert power series: a polynomial has no inverse that is again a polynomial, but it does have an inverse power series. That is kind of the cool thing, and this slide shows it. Rather than computing a full inverse, you are really after the updates in your Newton's method. You have to invert only the matrix A0, and once you have dealt with A0, you adjust the right-hand sides. Here is where the pipelining comes in: you compute the first update, the Δx0, and then you update the right-hand sides, the b1 and the b2, and those two updates can happen independently.
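In formulas, a minimal sketch of that block triangular structure, in my own notation for the quantities just mentioned (A0, Δx0, b1, b2):

```latex
% Truncate the matrix, the right-hand side, and the update at degree 2:
A(t) = A_0 + A_1 t + A_2 t^2, \quad
b(t) = b_0 + b_1 t + b_2 t^2, \quad
\Delta x(t) = \Delta x_0 + \Delta x_1 t + \Delta x_2 t^2.
% Matching powers of t in A(t)\,\Delta x(t) = b(t) gives the block
% lower triangular system:
A_0 \Delta x_0 = b_0, \qquad
A_0 \Delta x_1 = b_1 - A_1 \Delta x_0, \qquad
A_0 \Delta x_2 = b_2 - A_1 \Delta x_1 - A_2 \Delta x_0.
```

Only A0 has to be inverted; once Δx0 is known, the corrections to b1 and b2 do not depend on each other.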
And this is where your parallelism comes in: you can have two tasks working independently on that second step. With this very coarse granularity you get a maximal speedup of two, no matter how large your matrices are. Of course, once the matrices get larger, you had better use a multitasked QR factorization; we have played with that as well. That is kind of the bottleneck at this point, and it is still work in progress.

These are timings that I did yesterday in my hotel room on this laptop, trying to see how far I can go with the degrees of the truncated power series. This was done in double arithmetic, on a ten-dimensional benchmark system; if you are really curious, you can find out via my website what the polynomials look like. With degree 16, I doubled the number of tasks in each step, and at best I get a speedup close to four; the polynomial system is very mild as far as nonlinearities go. You can read these columns from top to bottom, but I also like to read them diagonally. If you go from degree 32 to 64, you double the size, but these multiplications of power series are not a linear operation, so you get almost a tenfold overhead. If you then use your 16 tasks, that is kind of the sweet spot: with 16 tasks you still get a little bit more speedup, and the time then only roughly doubles, from 2.3 to 5.2 seconds. This is ongoing work. On this laptop the fan starts blowing at degree 64, and I do not want to exhaust the poor laptop. On a bigger server it is actually much more challenging, because there the double precision already starts to deteriorate a little, and we are still working on getting the precision fixed. But as far as the parallelism goes, it is fairly simple to implement, and you can focus on the mathematical difficulties; whenever you have an application, you can focus on what really matters. We have five minutes left for questions. Thank you very much.

The question is how we actually get the best speedup with 88 threads. The processors indeed support hyperthreading, but then I do not get 88; I think the best I get is close to 60 somehow. And that is really for the polynomial evaluation, where every monomial can be evaluated independently.

Second question: how much use is made of the vector instructions? That is a very good question, and I do not know. I do know that when I do my linear algebra column-wise, like Fortran does, I get better performance, but the performance deteriorates when I use complex numbers. With floats, column-wise, the compiler can do the vectorization correctly, but I still have to figure out how the compiler handles complex numbers. And then there are also complex double doubles and complex quad doubles, so that is another challenge, yes. Thank you. Thank you.

We have a few minutes before you leave. I remind you that all of this has been organized by our session organizer, who unfortunately is not here. For those who were not here at the beginning: he is injured and could not be here to carry out his duties. But he did a terrific job organizing things, getting the speakers together, and getting the room again. And I got the message that he is watching us on the live stream. So please, a big round of applause for him.