So thank you very much. I'm pleased to be here today; the last time we met here was 10 years ago. I'd like to talk about work done mostly by my recently graduated students, one who finished last month and the other a year ago, and I'm going to take some liberty with the word "scaling" in the theme of this conference. What I'm going to do is talk mainly about high-performance computing. The motivation, of course, remains turbulence, and I would first of all like to acknowledge computer time provided by the National Science Foundation in the U.S. through the machine called Blue Waters, and also, in the current year, time awarded by the Department of Energy.

So I think scaling is something that all of us in turbulence learned about when we were students. At first it seemed rather abstract to me: what does it mean? I'm going to talk about three particular meanings of it today. One is scale similarity: when we look at a simulation, we always ask how things scale with the Reynolds number, which regime we are in, the inertial range and so on, and if we are talking about boundary layers, we speak of inner scaling, outer scaling, and so forth. The other two meanings are in the sense of computing. The first is strong scaling: if I give you a computer with twice as many cores, can you solve the same problem in half the time? In other words, how much does the time to solution decrease as we increase the CPU count by a certain factor? The idea is that, given a bigger computer, we can solve the same problem faster. Of course, that's not the only thing we want to do; after all, don't we want to simulate higher Reynolds numbers, higher Schmidt numbers, bigger problems in general? So we also want to talk about weak scaling. Weak scaling asks: if we increase the problem size, how much more resource do we need? Basically, we want to use a bigger computer to do a bigger calculation; we don't want to stay at the same resolution level forever, we want to go up. And in doing this, we find that to give better answers to the first kind of question, we need to figure out what to do about the other two types of scaling from a computing point of view.

I'm going to talk briefly about two different algorithms. One is tracking hundreds of millions of fluid particles, which is the Lagrangian viewpoint. As we know, this is very important for the study of dispersion, and there are good reasons why we need so many particles; that certainly increases the problem size. For example, we recently published a paper on backward relative dispersion: instead of starting from a particle at a certain location and asking where it is going to go, we pick this particle and that particle and ask where they came from, and it turns out that a very large number of particles is needed for that kind of work. The second topic is passive scalars, squarely within the theme of turbulent mixing. I'm going to focus on the regime of low diffusivity, high Schmidt number, which is very common in liquids; the most important application, perhaps, is in the ocean. The difficulty is that the scalar fluctuations persist down to the so-called Batchelor scale, which is much smaller than the Kolmogorov scale. So frequently we are forced into a compromise: do we keep the Reynolds number high and live with a modest Schmidt number, or do we hold the Reynolds number fixed and focus on increasing the Schmidt number?
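To pin down the two computing senses of scaling introduced above, here is a minimal formal statement (not from the talk itself), with T(N, P) denoting the wall-clock time for a problem of size N on P cores:

$$ S_{\text{strong}}(P) \;=\; \frac{T(N, P_0)}{T(N, P)} \quad (\text{ideally } P/P_0), \qquad E_{\text{weak}}(P) \;=\; \frac{T(N_0, P_0)}{T\!\left(N_0\,P/P_0,\; P\right)} \quad (\text{ideally } 1). $$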
So first, let's talk about particle tracking. The equation of motion, as we know, is very simple: the rate of change of position is the velocity, and the velocity is obtained by taking the Eulerian velocity field and evaluating it at the instantaneous particle position. In an experiment, I think people effectively still use this equation, but they do it the other way around: typically there is some mechanism to sense the positions of a certain number of tracers, and by differencing those positions between two instants of time one can infer the velocity. In DNS, however, we actually use equation (2) directly: we first perform an Eulerian DNS, then specify initial particle positions and integrate forward in time, and at every time instant we evaluate that equality by interpolation. We need high order of accuracy as well as differentiability, especially if we want to differentiate the interpolated velocity as a function of time. These are the basics for fluid particles. Of course, similar ideas arise for inertial particles and Brownian particles, whose equations of motion carry some additional terms on the right-hand side of equation (1). And the particle count is driven mostly by sampling considerations; I will get to that later on.

So let's take a basic look at the challenge. We want to simulate isotropic turbulence, and most of the time we prefer pseudo-spectral methods because the scheme is proven to be of high accuracy. Say we have N-cubed grid points; we have to divide the solution domain among different processes. With the arrangement on the left, we can take the FFT in x, because an FFT is a global operation that requires a complete line of data in core. What if we want to take the FFT in another direction? Then we have to repartition the data as suggested here; we have to do a transpose. The challenge of the transpose is that it requires all-to-all communication: every parallel process sends information to every other one and at the same time receives information from all the others. There are some ways in which we can make this work more efficiently. One is to arrange the processes in a so-called two-dimensional processor grid, with rows and columns, so that the transposes are carried out within rows and within columns. And if we choose the number of rows to be the same as the number of nodes... by the way, what is a node? Essentially, most high-performance computing platforms now consist of multi-core processors; for the machine we call Blue Waters, the number of cores is 32 per node. Because the cores on a node sit on essentially the same piece of hardware, if some communication has to take place among them it is like talking to the next person sitting in the same row as you, which is quicker and easier than talking to people seated on the other side of the room. On Blue Waters we also have remote memory addressing, which I will talk about a little bit later.
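To recap the two relations from the start of this part, the Lagrangian equation of motion and the evaluation of the particle velocity from the Eulerian field (which is what the interpolation step below provides):

$$ \frac{d\mathbf{x}^{+}}{dt} \;=\; \mathbf{u}^{+}(t), \qquad \mathbf{u}^{+}(t) \;=\; \mathbf{u}\big(\mathbf{x}^{+}(t),\, t\big), $$

where $\mathbf{x}^{+}$ is the particle position and $\mathbf{u}$ the Eulerian velocity field; the numbering as equations (1) and (2) follows the slides.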
So this brings us to cubic spline interpolation. We prefer cubic splines because the evaluations are quick and the result is a smooth approximation. There are three operations, as suggested by this formula. First, we have to calculate these coefficients e_pqr, the three-dimensional spline coefficients; the interpolant is written like a basis-function representation, and we solve a system of simultaneous equations to obtain the cubic spline coefficients. Next, at the current time instant we know where the particles are, we know their x, y, z coordinates, so we can evaluate the one-dimensional basis functions, these b, c, and d, at the particle position; the primed arguments are essentially normalized positions, that is, how far into a grid interval the particle sits. And then we have to perform the triple summation over 64 contributions.

Now there are some challenges with that. Depending on where the particle is, some of the terms going into the summation are going to reside on different processors, so we have to somehow combine partial results over the parallel processes. Also, in order to obtain the differentiability properties of cubic splines, for N grid points in one direction we are going to have N plus 3 spline coefficients; how do we divide that work efficiently over the processors? And when we talk about very large numbers of particles, memory requirements dictate that we divide those N_p particles over a certain number of parallel processes, and how we do that is an important question. One could say that each parallel process should keep track of the same set of particles at every time instant. But think of a subdomain: if at the beginning a particle is inside this subdomain, then one parallel process may hold all of its information, but over time a particle from inside is going to wander up and down, it is going to go everywhere, so the original host process will not be able to keep track of it efficiently anymore. If we are not careful, the communication will essentially overwhelm the entire calculation. So what we do instead is divide up the particles based on their instantaneous positions. At every time instant we figure out where each particle is, in which subdomain it is located, and we turn over responsibility for that particle to that particular parallel process. We are essentially mapping the particles to the MPI processes based on instantaneous position. Now what about a particle close to a subdomain boundary? Some information will have to come from another processor, but it will be an immediate neighbor, and communication between adjacent subdomains is relatively fast compared to communication between subdomains that are far apart. We also want to be able to declare global memory. In high-performance computing we talk about one-sided communication. That is to say, if I send somebody an email, I don't have to wait for that person to read it; by some prior agreement, that person has already given me access to a certain part of his or her memory, so once I send the message over, it is automatically there, and that is a lot faster.
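To make the interpolation formula above concrete, here is a minimal serial sketch of the 64-term summation for a single particle. The function and array names and the index conventions are illustrative only; in the parallel code, some of these coefficients live on neighboring processes, so partial sums are combined (or the remote coefficients are fetched one-sidedly, as just described) rather than everything being local as shown here.

```python
import numpy as np

def interpolate_component(e, bx, by, bz, i, j, k):
    """Cubic-spline interpolation of one velocity component at one particle.

    e          : 3-D array of spline coefficients for this component
                 (assumed already computed from the Eulerian field)
    bx, by, bz : the four 1-D basis-function values in each direction,
                 evaluated at the particle's normalized position
    i, j, k    : indices locating the particle's grid cell
    """
    value = 0.0
    for p in range(4):          # 4 x 4 x 4 = 64 contributions
        for q in range(4):
            for r in range(4):
                value += e[i + p, j + q, k + r] * bx[p] * by[q] * bz[r]
    return value
```

A production code would of course reuse the same 64 coefficients for all three velocity components and vectorize over many particles at once.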
Unlike many other particle codes, we do not use ghost layers. A ghost layer is the idea of giving each process a copy of some data from its neighbors; but, as we show in a paper that we hope will appear before the end of the year, for this problem the use of ghost layers would be extremely memory-intensive, so we really do not want to do that. I said the particle can move to another host process; how far it can go depends on the time step, and the Courant number, considered as a limit on the time step, means that no particle can ever move more than one grid spacing per step.

This is some pseudo code. I am going to skip quite a bit of it, but basically, for every particle, and there are many of them, we make some decisions; first we go ahead and evaluate the basis functions. Then we loop over two indices, not three, because of the shape of our subdomains: in one direction each process holds everything, so we do not have to worry about a particle crossing a boundary to the left or to the right, only front and back and top and bottom. And then there is this particular syntax with the square brackets, the coarray notation in Fortran: it declares data that every parallel process can access directly, and that makes the overall calculation somewhat faster.

So what is the performance we got? This is a plot of the time taken, measured in seconds, against the number of parallel processes, with three clusters of data points for 2048-cubed, 4096-cubed, and 8192-cubed grids. In every case we have several particle counts, 16 million, 64 million, 256 million, and you can see that the cost is roughly proportional to the particle count. Of course, this is not the time taken for the DNS calculation itself; it is just the time taken to go through the loop on the previous slide. If the strong scaling were perfect, then on a log-log plot the time would be inversely proportional to the process count, a slope of minus one. One could say the result is a little surprising; it certainly surprised me the first time I saw it: the strong scaling is actually best at 8192 cubed. Most of the time we expect the scaling to get worse and worse as the number of processors increases, but in this case that is not quite true; the code happens to be more efficient at the large problem size than at the more modest ones. That requires some analysis, and we can think about it in terms of the communication time.

The most general case is this: each subdomain is a pencil of grid cells, the full N in one direction, and N divided by the row count and by the column count in the other two directions. Communication is governed by the cross-sectional area of the boundary of the subdomain that separates it from its close neighbors, the four side faces of this rectangle, which is why we have terms like these here. And cubic splines involve two grid points on each side of the particle, because of the cubic polynomial. So the probability that some communication will be required for a given particle can be obtained from this formula; that is where the factor of four comes from, essentially the near-boundary volume divided by the subdomain volume, and multiplying by the number of particles per processor, N_p over P, where P is the number of processors, gives the expected amount of communication.
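A possible reconstruction of that estimate (my reading of the argument, not a formula quoted verbatim from the slide): for a pencil of size $N \times (N/P_r) \times (N/P_c)$, with cubic splines needing data up to two grid points beyond the particle on either side, the fraction of the subdomain close enough to a side face to need off-process data is roughly

$$ f_{\mathrm{comm}} \;\approx\; \frac{4}{N/P_r} \;+\; \frac{4}{N/P_c}, \qquad \text{expected remote work per step} \;\approx\; f_{\mathrm{comm}}\,\frac{N_p}{P}, $$

with $P = P_r P_c$ the total number of processes and $N_p$ the total particle count.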
That estimate does not always apply, though. What if N over P_c is one, so that instead of a rectangle the cross-section is just a single grid line? Then every process will always require information from its neighbors, and in that case we find that the communication time is simply proportional to the number of particles and inversely proportional to the number of processors, which actually means we get essentially perfect strong scaling. That is part of the explanation of the results we have. Also, this kind of coarray transfer is especially efficient when the messages are small, which is somewhat different from more common considerations.

What about particle migration? When a particle migrates from one subdomain to another, the control of, the responsibility for, updating its position is transferred to the new host MPI process, and there are some send and receive operations going on with the neighbors. How many neighbors? Eight of them: if this is the cross-section of the subdomain, there is essentially one on each side and at each corner, which is why we have eight neighbors. There may also be a slight degree of load imbalance, in that at any time some processes may have more work to do than others, but we are helped by the fact that we are simulating homogeneous turbulence with non-inertial particles: fluid particles are not supposed to accumulate anywhere systematically over time, so the degree of imbalance is really small. There are some factors that influence the amount of particle migration: the Courant number limits how far a particle can go, there will be more migration when the subdomain is very thin in one direction, and we acknowledge that this approach may not be as efficient for inertial or Brownian particles, which can jump over more than one grid spacing.

Here are some simulation costs, for a 4096-cubed and an 8192-cubed grid, with the number of particles going up by a factor of four. If we look at the largest problem size, the particle loop is becoming a smaller fraction of the total cost, and its scalability actually improves at that largest problem size. The number of particles is still much smaller than the number of grid points, though. And we are using one-sided communication through Fortran coarrays.

Now let me turn to the second part of my talk: mixing, in terms of passive scalars. We are all familiar with this equation; on the right-hand side I have taken the mean-gradient term, a uniform imposed mean gradient, to be the source of the scalar fluctuations. I myself arrived here only on Wednesday night, and I know that my friend Professor Gotoh has given a talk on some of the physics of high-Schmidt-number mixing, and Professor Srinivasan, in his one-hour lecture on Wednesday, also mentioned some of these concepts. So I am going to skip ahead to the numerical issues. Here I am showing a figure also discussed by Professor Gotoh; the idea is that the scalar spectrum follows very different power laws depending on which Schmidt-number regime we are in, and DNS results can be compared against all of these regimes. What is the most difficult thing to do? I think it is still to capture the viscous-convective range, because that requires a very high Schmidt number, and a very high Schmidt number requires that we make the grid spacing much smaller compared to what we use for the velocity field.
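For reference, the scalar-fluctuation equation just described and the Batchelor scale mentioned at the beginning can be written as follows, taking the uniform mean gradient of magnitude G along the $x_3$ direction for concreteness (the choice of direction is mine, not from the slides):

$$ \frac{\partial \theta}{\partial t} + \mathbf{u}\cdot\nabla\theta \;=\; \kappa\,\nabla^{2}\theta \;-\; G\,u_{3}, \qquad \eta_{B} \;=\; \eta\,Sc^{-1/2}, \quad Sc = \nu/\kappa, $$

where $\theta$ is the scalar fluctuation, $\kappa$ the molecular diffusivity, $\eta$ the Kolmogorov scale, and $\eta_B$ the Batchelor scale, which is what forces the finer grid for the scalar.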
Here I quote, with an admitted bias, some papers that I have written myself on what has been done at high Schmidt number, just to emphasize the difficulty. Every time we halve the grid spacing, doubling the number of grid points in each direction, the Courant number restriction means that we have to double the number of time steps as well, so every time we do that we pay a 16-fold increase in computational cost. And how much increase in Schmidt number do we get for it? The square root of 16, which is 4. So how do we go from Schmidt number 1 to, let's say, 512? That is a long way to go, and what we have to do is get the best bargain that we can. Let's say that we insist on a velocity field that has at least some inertial range: the Taylor-scale Reynolds number of 8 in my earlier papers was definitely not going to do it, and 38 is not quite enough either. We would like a Schmidt number as high as possible, comparable to that of salinity in the ocean, which is about 700. We want to develop the best algorithm for doing that, and of course get access to the best machine. Along the way there have actually been some advances in high-performance computing that we have been able to make.

So how do we do this? We have a problem with very disparate resolution requirements, so we can carry the velocity field on a coarse grid and the scalar on a fine grid. That is not a new idea. The scalar is passive and is advected by the velocity, so we have to get the velocity over from the coarse grid to the fine grid, an interpolation, and we would like it to have as high an accuracy as possible without very heavy communication. Three-dimensional FFTs can give us the high accuracy, but they will not satisfy the second requirement. So, following an idea from Professor Gotoh's group, we use a hybrid algorithm: combined compact finite differences to obtain the derivatives of the scalar in the advection-diffusion equation, for which the communication requirements turn out to be much smaller. Then we want to take it to a big computer and make it scale well, using so-called multi-threading, and we have also made some progress in using graphics processing units.

First, we have what we call a dual-communicator algorithm. The two grids occupy the same physical domain, but we set the ratio of scalar grid points to velocity grid points in each direction to, say, 8, so the workload on the fine grid is obviously much heavier. We use two separate groups of processors. One group, few in number, takes care of the velocity field, and we can choose its size so that the coarse grid is not divided too finely. For the scalar field, on the fine grid, we have another set of processors; most of the processors will be working on the scalar. Since the scalar is passive, the velocity information has to go from one communicator to the other, from one group of processors to the other, in what is essentially a one-way transfer. And one of the ideas that comes out of this is that some operations carried out by one group of MPI processes and some carried out by the other group can overlap with each other; they can essentially be done at the same time, and that is the beginning of the idea of making the calculation faster.
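As a minimal sketch of the dual-communicator idea, assuming mpi4py, with the group sizes, array shapes, and the simple broadcast-style transfer chosen purely for illustration (the production code is in Fortran, and the actual transfer is between matched subdomains and overlapped with computation):

```python
from mpi4py import MPI
import numpy as np

world = MPI.COMM_WORLD
rank, size = world.Get_rank(), world.Get_size()

# Split the ranks into a small velocity group and a large scalar group.
# The 1:8 split here is illustrative only.
n_vel = max(1, size // 8)
color = 0 if rank < n_vel else 1        # 0 = velocity group, 1 = scalar group
group = world.Split(color, key=rank)    # each group gets its own communicator

u_slab = np.empty((16, 16, 16))         # stand-in for a block of velocity data
if color == 0:
    # ... advance the velocity field on the coarse grid, using 'group' ...
    u_slab[:] = np.random.rand(16, 16, 16)

# One-way transfer of velocity data from the velocity group to the scalar group,
# shown here as a broadcast from world rank 0 for brevity.
world.Bcast(u_slab, root=0)

if color == 1:
    # ... interpolate u_slab onto the fine grid and advance the scalar, using 'group' ...
    pass
```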
Here is an informal flow chart: on the left-hand side are all the operations on the velocity field, and on the right-hand side all the operations on the scalar field. The advection-diffusion equation requires first and second derivatives of the scalar, and this is where we use the CCD, the combined compact difference, "combined" in the sense that we are getting the first and second derivatives at the same time. There is a transfer operation, where the velocity field has to go over to the scalar side, and we have to determine the time step size as well. What kinds of operations are actually required? The CCD routines require the solution of a block tridiagonal system of equations, block in the sense of two-by-two because we are getting two derivatives at the same time, and it turns out that, by an algorithm developed by another scientist in Japan, the solution of this system can be obtained without any transposes. Unlike the two-dimensional decomposition for the velocity field, where we have to transpose this way and that way and so forth, we do not have to do that anymore, and that saves a substantial amount of time. We do have some communication of the so-called ghost layers that come in with finite-difference schemes, and we do have to pack the messages: we collect data from different parts of memory, send it off, receive data from the neighbors, and unpack it into the appropriate positions.

On CPU-only machines, what can we do to make the calculation faster? First, if communication cannot be avoided, can it be done in a non-blocking manner? Many of us are capable of, say, listening to somebody talk while typing on the computer, doing something else at the same time. That is what we call non-blocking communication, and it is fine as long as we are not operating on information that is still in the process of coming in. We can also carry out the computation for one direction while communicating for another, because the derivatives in different directions are essentially decoupled from each other; a small sketch of this overlap idea is given below. We use OpenMP, in case you are not familiar with it, a very common interface for shared-memory computing: if we have one node with 32 cores, how can we make those 32 cores function as one single unit? We can also dedicate one thread to communication while all the others are computing, and that is the idea behind this kind of thread-level parallelism. We have a paper with the details.

What about the performance of the code? Here we have data points at 1024, 2048, 4096, and 8192 cubed, and we have also tested 16384 cubed. We can see that the scalability is essentially perfect at the smaller problem sizes; at the larger problem sizes we start to have some difficulty, with some of these data points now falling off the dashed lines that we have drawn. Nevertheless, our production problem size is here. This is just for the CCD routines: we started at 0.936 seconds and brought it down to 0.616 seconds, an improvement of about 40 percent or so, and the strong scaling is pretty good if we look at just the data points marked by the solid symbols. The CCD routines account for a relatively high percentage of the cost, but what about the DNS code as a whole? The corresponding solid symbols there are also very good; that is the best that we could do.
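As a minimal illustration of the non-blocking overlap mentioned above, here is a sketch in mpi4py with purely illustrative sizes: the ghost-layer exchange is posted, interior points (which need no remote data) are updated while the messages are in flight, and only then are the ghost values used.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # periodic neighbors

n, g = 64, 2                        # local points and ghost-layer depth (illustrative)
f = np.random.rand(n + 2 * g)       # 1-D stand-in for one line of scalar data plus ghosts

# Post non-blocking exchanges of the ghost layers ...
reqs = [comm.Isend(f[g:2 * g], dest=left),     # my leftmost interior -> left neighbor
        comm.Isend(f[n:n + g], dest=right),    # my rightmost interior -> right neighbor
        comm.Irecv(f[n + g:], source=right),   # fill my right ghost layer
        comm.Irecv(f[:g], source=left)]        # fill my left ghost layer

# ... and, while the messages are in flight, compute on interior points,
# which do not need any remote data.
d_interior = np.gradient(f[2 * g:n])

MPI.Request.Waitall(reqs)           # ghosts have arrived; now finish the near-boundary points
```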
But the interpolation, the idea of interpolating the velocity field from the coarse grid to the fine grid, does take considerable time, about 25 percent. Just in the last 24 hours, though, one of my co-authors, Matthew Clay, told me he has found a way of making that part faster; I don't have the new numbers yet, but things are going to get better.

Okay, so what is the future going to look like? I don't have a lot to say about GPUs, so this is the second-to-last slide, as you can see. In the U.S., for the last few years, the fastest machine available, to the academic community at least, has been a 27-petaflop Cray XK7, whose nodes are all fitted with GPUs. One unspoken rule for getting an allocation on this machine is that you have to show you can use the GPUs; otherwise, no matter how excellent the science, it will be very difficult. We have figured that out, and by next year there will be a machine five times faster than this one, an IBM Power9 system with six GPUs on each node, versus just one now. Hardware is heading toward the exascale with the Department of Energy in 2020-something, where the last digit may be two or three. The community is also moving toward many-core machines; there is the Knights Landing platform by Intel, with around 70 cores and four hardware threads per core, and sometimes it takes a substantial amount of effort to get our codes to run efficiently on these machines. Essentially, the GPUs give us the promise of fast computation at low power. The question, of course, is what the software is really going to look like. I should say that on that 27-petaflop machine, if you have a code that sustains barely one petaflop, you are considered to be doing relatively well. So there is a substantial software challenge, and we have to use algorithms that either communicate less or communicate faster.

So, the last slide. I hope all of you will agree that turbulence and turbulent mixing, all kinds of turbulence, not just the simple kinds that I have been working on, are essentially grand-challenge problems in high-performance computing. We do face, as a matter of fact, fierce competition for these resources from those who predict natural disasters, from those who work on biomedical applications, on earthquakes in California as well as in Japan. So we have to be very competitive, and we cannot do so without actively responding to changes in the high-performance computing landscape; otherwise those advances will essentially leave us behind. We also have a big problem with how to store the data. We can generate petabytes of data, and honestly, where to keep it all so that it is there when I need it is a question I do not have a good answer to. But if we can show that there is a wider community who can use the data to do useful things, then maybe there will be some kind of breakthrough. As for the future science of turbulence, I think we all want to go toward larger problems, which may be geometrically more complex or physically more complex; think of active scalars, an atmospheric boundary layer, and so forth. I also think we have to look at resolution effects more rigorously than before. And finally, we need to build a sustainable computational laboratory.
By that I do not mean a computer room, but essentially a virtual laboratory, if you will: all the data sets. How can we sustain them? How do we make sure they will not be lost, so that the wider community can actually benefit from these simulations? Thank you very much.