We have been looking at different ways to accelerate the convergence of a program. I use convergence in a very generic sense; when I say accelerate, I actually mean that I want my answer back as quickly as possible, so you have to take the whole picture into account. We used the Laplace equation as an example for the multigrid method. The Laplace equation is a linear equation, and as a consequence the equation we got for the correction was identical to the original equation. So what I am going to do is show you a possible way to apply multigrid to the 1D Euler equations; in fact it is a variation of an algorithm that was actually developed here on campus, basically a direct application of the so-called correction scheme we have been looking at so far to the Euler equations. We will do that, but first, after yesterday's class I got a question, so let me address it, because it is important. We transfer the residual from the finest grid to a coarse grid whose spacing is double the fine-grid spacing; the number of grid points is halved. So we go through a hierarchy: h, 2h, 4h, 8h. Going down, it is always the residual that is transferred; I want you to remember this, because when we do the Euler equations there can be some confusion. We are not going to make the corrections on the coarse grid: the corrections will always be applied on the finer grid, that is the idea, and the residual is always passed to the coarser grid. So the obvious question is: why go from h to 2h? Why not go from h to 4h, or h to 8h, directly, and skip the intermediate steps? We talked about having a V cycle or a W cycle or something of that sort, and yes, you could have different kinds of cycles. Earlier I had suggested that we want to stay down at the coarse end of the hierarchy because the number of work units, the term we used, is smaller there. And there is a logic to why we still go one level at a time. If you go directly from h to 8h, the grid spacing is much, much larger and the number of grid points is much smaller, so the highest frequency that can be represented on the 8h grid is one-eighth of the frequency that can be represented on the h grid. We are using the fact that when I go from h to 2h, if this spectrum represents the frequency range you can represent on the grid h, then on the grid 2h the largest wavenumber you can represent drops to n/2.
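To put numbers on that statement (in my notation, not the lecturer's): on a uniform grid of spacing h the shortest resolvable wavelength is 2h, so the largest representable wavenumber is k_max = π/h, the Nyquist limit. Doubling the spacing to 2h halves this to π/(2h); in terms of the mode index on an n-point grid, the cutoff drops from n to n/2, and on the 8h grid all the way down to n/8.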
What I want to show you is this: if you have a residual that is, say, uniformly distributed across the spectrum initially, you do enough iterations on the fine grid to eliminate the wavenumbers in the upper half, and then you can transfer the residual to the coarser grid, whose cutoff is n/2. If you wanted to go straight down to n/8, you would have to do a lot more work on the fine grid, either many more iterations or more residual smoothing, to push the residual content down below n/8 first. We do not want to do that much work on the fine grid; that is why this hierarchy really works. You could ask whether there is some intermediate grid, say going from h to 1.5h, and you can think in terms of fractions like that, but it turns out that h, 2h, 4h, 8h seems to be the best way to do it. Now, just to continue on this note since I have drawn this figure: what are the issues when you implement this? One way would be: you do some iterations with the equation on the fine grid, possibly some residual smoothing, and you shrink the residual spectrum so that it is confined below n/2; then you transfer to the coarse grid. What happens if you do not do that? From the demo that we did, do you remember what happens if you take a frequency higher than what the grid can represent? It folds over. So any error component close to n will actually alias to a component close to 0, and close to 0 is very bad: components near zero wavenumber decay at the slowest rate, on all the grids in the hierarchy. On the other hand, close to n/2 is not bad. So now I am talking about how much smoothing effort you really need. If you smooth only down to three-quarters n, where will 3n/4 fold over to? About n/4. So content between n/2 and 3n/4 folds into the higher frequencies of the coarse grid, which are going to decay quickly anyway. You can see that you actually do not need to make that much effort on the finest grid: make enough effort that you definitely clear everything above 3n/4, maybe leave yourself a little margin, then transfer. Yes, there is aliasing, you are doing something slightly bad, but whatever contamination occurs lands in the rapidly decaying part of the coarse-grid spectrum; if you are willing to live with it, it is going to decay very quickly on the 2h grid, and you are doing less work per sweep there anyway.
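As a quick numerical check of that fold-over (my own illustration, not from the lecture): take the mode at index 3n/4 on an n-point grid and restrict it by injection to every other point; on the coarse grid it is exactly the mode at index n/4.

```python
import numpy as np

n = 16
j = np.arange(n)
k = 3 * n // 4                        # fine-grid mode at index 3n/4

fine = np.cos(2 * np.pi * k * j / n)  # representable on the h grid
coarse = fine[::2]                    # injection onto the 2h grid

jc = np.arange(n // 2)
alias = np.cos(2 * np.pi * (n // 4) * jc / (n // 2))  # mode n/4 on 2h grid

print(np.allclose(coarse, alias))     # True: 3n/4 has folded over to n/4
```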
Of course, the proper way to do it is to do sufficient iterations that you essentially eliminate the residual content in the top half of the spectrum, so that whatever you transfer down carries no sampling-related issue, no aliasing. Is that fine? So there may be a temptation to go directly from h to 4h, but that would require a lot more work on the fine grid, and that is the reason we are not going to do it. Now, how do we apply this to nonlinear equations? As I said, I am going to look at one particular way in which we have done it. We recognize the fact that we do not really solve nonlinear equations: we invariably end up linearizing, unless the equation has some quadratic or otherwise simple form that we can handle directly. And the most familiar linear form you have right now is the delta form. Recall that we have already derived the delta form of the one-dimensional Euler equations. It is of the form S ΔQ = −Δt R, where S is some matrix; I choose the symbol S because I do not think I have used it anywhere else. If you allow me I could absorb the Δt into R as well, but there is no reason to cut corners; we will leave it as −Δt R. Now, if you look at this equation, it looks like the correction equation from the multigrid discussion. There, E_h was the correction to the current candidate solution, just as ΔQ here is the correction to the current candidate solution. We will still focus only on the steady state, because we know there are ways to apply the same algorithms to transients; looking purely at the steady state, ΔQ is just like that correction, and the right-hand side is the residual in both cases.
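To spell out the correspondence being drawn (the notation is mine, matching the earlier Laplace discussion): for the linear problem the coarse-grid equation was A_{2h} E_{2h} = R_{2h}, an equation for the correction driven by the residual; the delta form S ΔQ = −Δt R has exactly the same shape, with ΔQ playing the role of the correction E and −Δt R playing the role of the residual.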
Now, there are algorithms out there that keep the original equation nonlinear on the coarse grids, and in that case an extra term shows up; you can go look that up, I am not going to spend time on it. What we have basically done is recognize that since I am going to linearize the equation anyway, the linearized equation looks the same as the correction equation. So we take this residual, all of these quantities being on the grid h, and transfer the residual; the important thing is that it is the residual R that we transfer. And because this is ΔQ and you are always advancing in time, besides the residual you may also have to transfer Q_h to Q_2h. For this restriction we simply use the injection operator, which means we just take the value at the corresponding point; we do not do a transfer by averaging and so on. So you transfer Q_h to Q_2h, you transfer R_h to R_2h, and you solve on the coarse grid; and in between, before doing the transfer, just as we did earlier, you do whatever residual smoothing is required. So what would the algorithm be? You march in time for a few time steps on the finest grid; for the delta form you compute the residual at every time step anyway. You do some residual smoothing, transfer that residual from R_h to R_2h, and transfer Q using injection. Now take a few time steps on the 2h grid; if you want, compute the residual there, smooth that residual, transfer it to a still coarser grid, and on that coarser grid repeat the same process. When you come back up, you transfer the correction back, from Q_2h to Q_h, as a ΔQ. I am giving you a very simplistic algorithm; there are various wrinkles you can add to this. The correction going back up is the critical thing I want you to remember, and here is where you can get confused: I had pointed out to you that this same equation can also be looked upon as S applied to ΔQ_implicit giving ΔQ_explicit. If you recollect, if S were the identity matrix and you were using, for instance, central differences for R, that would be FTCS; so the right-hand side in a sense represents the correction you would get from an explicit scheme, and the delta form in general can be looked upon as an operator acting on the implicit increment to give you the increment of the explicit scheme. We have done this before, and if you look at it from that perspective it can get a bit confusing. So when you are doing multigrid, it is better to look at it this way and remember: it is always the residual that is transferred from the fine grid to the coarse grid, and it is always the correction that is transferred from the coarse grid back to the fine grid. And you are always going to evaluate the residual on the finest grid: when you finally say yes, I have solved the problem, the last few time steps you take before you declare convergence will always be on the finest grid, so the residual you quote and the solution you end up with are both on the finest grid. That is very important. Are there any questions? Fine. So this is as far as multigrid is concerned.
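Before moving on, a schematic of the cycle just described, as I read it; every helper here (march, residual, smooth_residual, inject, prolong) is a hypothetical stand-in for pieces of the single-grid code, and this is a sketch of the recipe, not the actual algorithm developed on campus:

```python
def cycle(Q, R_forcing, level, coarsest):
    """One correction-scheme pass for the delta form S dQ = -dt R.

    Going down, it is always the residual that is transferred (by
    injection); coming back up, it is always the correction."""
    Q = march(Q, R_forcing, level, steps=5)      # a few time steps on this grid
    if level < coarsest:
        R = smooth_residual(residual(Q, level) + R_forcing, level)
        Q2 = inject(Q)                           # restrict Q by injection
        R2 = inject(R)                           # restrict the residual
        Q2_new = cycle(Q2, R2, level + 1, coarsest)
        Q = Q + prolong(Q2_new - Q2)             # correction applied on the finer grid
        Q = march(Q, R_forcing, level, steps=2)  # a few more steps after correcting
    return Q

# On the finest grid: Q = cycle(Q, 0.0, level=0, coarsest=3); the residual
# used to declare convergence is always evaluated at level 0.
```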
What I am going to do now is quickly point out other ways by which you can get your answer back fast. I am not talking now about directly algorithmic acceleration, since we have covered that; on top of the algorithm, I will just say a few words about parallel computing, parallel programming. All of us nowadays do parallel, or concurrent, programming routinely: concurrent as in the program is running bits and pieces simultaneously, which is true. You just get on to the net, you check your email or you browse a web page: your local computer is running some part of it, a computer far away is running some part of it, and in fact there are lots of computers in between running all sorts of things for you, all occurring simultaneously, so that if you are, for example, watching a streaming video, you see it streaming in a sequential, serial fashion with nothing jumping, whereas there are actually multiple computers along the way doing the work. And in your own computer there may be a graphics processor running the display, and so on. So lots of parallel programs are running even now; the minute you click a button on the browser, something is happening there and something is happening here, and you are running a program in parallel. The only thing we have to see is how to bring that advantage to our CFD codes. First you have to get a little flavor; as I said, I am only going to give you a flavor of the underlying ideas, we are not going to spend a lot of time on this. Parallelism can be classified at different levels; normally you talk in terms of the coarseness or the fineness of the grain. Fine-grained would be, for instance, Laplace's equation: if you write φ_pq at iteration n+1 as one quarter of the sum of its four neighbors, a fine-grained decomposition would be recognizing that, on a given CPU, two of those neighbors can be added up while the other two are being added. That is very fine-grained. Actually, on the CPU itself lots of things occur in parallel: it is possible that while one number is being loaded, an address is being calculated, even as earlier arithmetic is being done; there are finer grains of parallelism going on inside the CPU that you are, fortunately, saved from having to look at. But it is possible, say on a two-core or four-core machine, to set things up so that these additions take place on one core and those additions on another: a given phrase or sentence of the computation is broken up into small pieces. That is one way to look at it, fine-grained.
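A small illustration of how that fine grain is typically exposed in practice (my sketch, using NumPy whole-array operations, which the library and compiler are free to map onto the CPU's 4-wide or 8-wide vector units):

```python
import numpy as np

phi = np.random.rand(66, 66)   # 64x64 interior plus a boundary ring

# Each neighbor sum below is a whole-array operation; the hardware can
# process it several elements at a time -- the fine-grained parallelism
# described above, with no explicit threading in sight.
phi_new = phi.copy()
phi_new[1:-1, 1:-1] = 0.25 * (phi[2:, 1:-1] + phi[:-2, 1:-1] +
                              phi[1:-1, 2:] + phi[1:-1, :-2])
```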
A slightly coarser grain would be, well, let me swing over to a very coarse grain; I may regret that. Let me just get a colored chalk. You remember yesterday I was talking about the checkerboard pattern; so what I will do is mark out points in a checkerboard pattern. I hope that works; maybe I have chosen too many grid points, it looks like a mess, but I do not think I have missed any point. The idea here is this: each orange x is surrounded by four unmarked squares, and if you solve the Laplace equation, the orange x depends only on those four. The key critical point is that any one of these orange-x values does not depend on any of the other orange-x values: you just look at φ_{p+1,q}, φ_{p−1,q}, φ_{p,q+1}, φ_{p,q−1}; it depends on the neighboring values, not on other orange x's. So all the grid points marked with an orange x are independent of each other, and they can be calculated simultaneously, because one value does not depend on another. Likewise, all the grid points marked with a white square do not depend on any other white square; they depend only on orange-x neighbors. If I take this white square here, it depends only on its four neighbors, which are all orange x; it does not depend on another white square anywhere. So all the white squares can also be calculated simultaneously, independently of each other; I am not going to fill them all in, because there are a lot of white squares too. So in a sense we have partitioned the problem and found some element of parallelism: when I say they are independent of each other, that means they can be done simultaneously. If you were to store these as vectors, the white vector would depend on the orange vector, and you can write a vector equation; if you have a computer that allows vector operations, the sum-and-average operation here is exactly of that kind, so you could compute all the white squares in one shot as a vector operation, then all the orange x's in another shot, as a parallel operation. That of course assumes your computer allows parallel vector operations. Even on the desktop that you have, it is actually possible: you may have to choose 4 of these at a time, because the hardware supports a maximum vector of 4, or 8 at a time because it supports a vector of 8, or whatever it is.
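A minimal sketch of that checkerboard (red-black) update with NumPy; the code is my own illustration, and the point is that each color is one independent vector operation:

```python
import numpy as np

def checkerboard_sweep(phi):
    """One sweep of Laplace averaging in checkerboard order.
    Points of the same color do not depend on each other, so each
    color can be updated simultaneously, in one shot."""
    for color in (0, 1):                      # 0: 'orange x', 1: 'white square'
        i, j = np.indices(phi.shape)
        mask = ((i + j) % 2 == color)
        mask[0, :] = mask[-1, :] = mask[:, 0] = mask[:, -1] = False  # keep boundary
        avg = np.zeros_like(phi)
        avg[1:-1, 1:-1] = 0.25 * (phi[2:, 1:-1] + phi[:-2, 1:-1] +
                                  phi[1:-1, 2:] + phi[1:-1, :-2])
        phi[mask] = avg[mask]                 # whole color updated at once
    return phi
```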
Your graphics processor, for example, mostly does 4×4 transformations; all the matrices are 4×4 floating-point operations, so it is possible for you to say: I can do 4×4 vector operations, and I will map this problem in a parallel fashion onto 4×4 vector operations. That is the other way to do it. Now, what we have done here is what is known as domain decomposition: I have decomposed the domain on which the problem is defined into two sets. I could decompose it into as many sets as I want, and each of the orange-x sets can itself be divided further into 4 or 8 or whatever, knowing that they can all run in parallel. For example, if any CPU that I have can do vectors of 8 and I have 10 CPUs, then you can break the orange set up into 10 pieces of 8 and they can all run in parallel; the fact that they are independent of each other is what lets them run in parallel, that is the important part. And if you get what is known as linear speed-up, then if you use 2 CPUs it will be twice as fast, and with 4 CPUs, 4 times as fast. Sometimes, for various reasons, you can get super-linear speed-up, but that is rare; most of the time you get either linear or something less than linear. Is that fine? Now, another way to partition. Since we are talking about domain decomposition: as I said, one way of looking for parallelism is to see that the operator and the operations can be decomposed; here, the parallelism we seek comes from the domain. So you could break the domain up along a grid line: I divide the domain into two, and the grid values to the right and the grid values to the left are mostly independent of each other. There are different ways to do this; for the number of grid lines I have drawn, you could of course draw the dividing line in between, which would make the partition easy, but you have to decide how you want to associate points with partitions. So let us just say for now that you have associated all of these points as one set: you are going to solve this problem on one CPU and the other problem on another CPU, that is the idea. Then what happens at the interface? If I am going to take the average at a boundary grid point, I need a value that lives on the adjacent CPU. So if this side is solved on CPU A and that side on CPU B, then at B's boundary grid points I need values from A, and at A's boundary grid points I need values from B. Now, one of the first observations we make is that these responsibilities have to be very clear: B will read values from A, but B will never write to them, never change them; they are computed by A, not by B.
The values at those grid points are the responsibility of A alone. There is no overlap: there is no grid point at which both CPUs apply the governing equation; I want to be very clear about that. If, on A, you want to compute the value at a boundary grid point, you see from the averaging process for Laplace's equation that you need something that belongs to B. So when you are doing this parallel computation, A and B need to transfer data; we are back to this idea of transferring. And as far as each side is concerned, this looks suspiciously like applying boundary conditions; I say suspiciously because in your program you can implement the transfer as though it were a boundary condition, and there is a certain ease to that. So, along the border, however you decide to split it: here, for instance, I will deliberately draw the dividing line in between the grid lines instead of along one. If you had 4 CPUs, either a quad-core machine or actually 4 CPUs, you could split the domain into 4 pieces and the same thing happens. But now you need a little organization: you need to know where the data comes from. If the bottom two pieces are A and B and the top two are C and D, then when you are computing a boundary of D you need to know whether the data comes from B or from C; they will have to trade. And you can see the trend as I break this up into smaller and smaller pieces: the volume of computation on any one CPU is going to shrink, while the amount of data that you trade is going to increase. So you have to strike a balance. It does not make sense to say: I have 1024 CPUs and 1024 grid points, let me distribute one grid point to each CPU; then all they will be doing is trading. You have to make sure that each CPU does a reasonable amount of computation before it trades. And again we are back to this issue: if I am interested only in the steady state, maybe I take a few time steps before I trade; why should I trade at every time step when I am not looking for the transient? If I am looking for a transient, yes, I must update these boundaries at every time step; but if I am not, I take 5 or 10 time steps and then I trade. How often you trade really depends on how expensive the trade is, and that brings us to the next two levels of coarseness or fineness of the computation.
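A sketch of that bookkeeping, with the two subdomains held in one process purely for illustration; the extra "ghost" column in each block stands for the values read from the neighbor, the trade happens only every few steps as suggested above, and all the names are mine:

```python
import numpy as np

def relax(block):
    """One Jacobi sweep; the outermost rows/columns are either physical
    boundary or ghost values owned by the neighboring subdomain."""
    block[1:-1, 1:-1] = 0.25 * (block[2:, 1:-1] + block[:-2, 1:-1] +
                                block[1:-1, 2:] + block[1:-1, :-2])

A = np.zeros((32, 18))   # left half: 16 owned columns + boundary + 1 ghost
B = np.zeros((32, 18))   # right half, mirror image
trade_every = 10         # steady state only: no need to trade every step

for step in range(1000):
    relax(A)
    relax(B)
    if step % trade_every == 0:
        A[:, -1] = B[:, 1]    # A reads B's first owned column into its ghost
        B[:, 0] = A[:, -2]    # B reads A's last owned column; neither side
                              # ever writes into cells the other owns
```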
The two models that you will hear about are shared memory and distributed memory. In a shared-memory model, typically a dual-core or quad-core or 8-core box, the CPUs sit on the same memory bus: the memory is shared, all the CPUs see the same memory, in which case the trade is not really that expensive. There are other issues in parallel programming that you have to worry about, and there may be people who get upset that I do not talk about race conditions and all that, but I will mention them as we go along. The point is that in a shared-memory model there is essentially no explicit cost to trading, so you could trade every time step if you wanted; there could be a potential problem with each trade, which we will talk about, and there can be hidden costs, bus contention for instance, lots of homework the CPU may have to do in order to get what it wants, but no cost as far as your program is concerned. The general idea is still to keep the trade down to as little as possible. The other possibility is a distributed-memory system or model; as I said, all of this is just to give you a flavor of what it is all about. Here you would have computers that each have a CPU and their own memory, all connected together in some fashion; the tightness or looseness of the coupling depends on how fast that connection is. Typically the kind of thing we do is hook up a rack of machines to a regular network. Of course, the individual boxes may have 8 cores or 4 cores in them, which means they may be shared-memory internally and connected on a network externally. In that case the trade becomes a little more expensive, and what you have to remember is that the amount of computation each machine does before the trade has to be a little larger for the parallelism to actually be worthwhile. So each machine does the averaging, or the Euler equations or Navier-Stokes equations or whatever it is, on its own piece, and at the boundaries you trade the data that is required to be traded. Is that fine?
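In the distributed-memory case the trade is an explicit message. A minimal sketch using mpi4py (my choice; the lecture names no library), with each rank owning one vertical strip of the domain plus ghost columns:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

u = np.zeros((32, 18))            # my strip: 16 owned columns + 2 ghosts
recv = np.empty(32)

for step in range(1000):
    # relax(u) here, as in the previous sketch ...
    if step % 10 == 0:            # the trade is now a message on the network
        if rank + 1 < size:       # swap edge columns with the right neighbor
            comm.Sendrecv(u[:, -2].copy(), dest=rank + 1,
                          recvbuf=recv, source=rank + 1)
            u[:, -1] = recv
        if rank - 1 >= 0:         # and with the left neighbor
            comm.Sendrecv(u[:, 1].copy(), dest=rank - 1,
                          recvbuf=recv, source=rank - 1)
            u[:, 0] = recv
```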
So what are the possible issues? One possibility, especially in the shared-memory system, is that there is a piece of memory that you are reading from while, simultaneously, something else is writing into it. As I have already mentioned, if you take this boundary point, C has the authority to write into it; none of the other CPUs has that authority, and they are not going to do it. So you are not going to see two writes happening to the same memory: two processes, two threads, two parts of your program writing to the same location. I have sort of inadvertently used the word thread already; I did not mean to. This is like having a thread of execution running here, another process executing there, a thread running there; they are called threads of execution. When you think about your program running, when you are debugging your program, what are you doing? You are actually following the thread of execution: wait a minute, here I am averaging, I am setting it here, then I am going to subtract and find the residual; you are following a path. When you are doing parallel computing on four CPUs there are four threads of execution, especially in the shared-memory model, where you often use what are known as threads; it actually is a technical term. So the thread C has the right to write into its boundary location, it can write there, whereas D will never write there. So you will never have a write conflict: two threads writing to the same memory location is not going to happen, that is what I am trying to say. But you can have D reading from that memory while C is writing into it, and that could be a problem. So when you are trading, it is always possible for one thread to get ahead of the other; this condition is called a race condition, and it is actually possible for you to read something inconsistent, garbage: neither the old value nor the new value but some in-between value. So though we are guaranteed that only C will do writes where C has responsibility, and the others will only do reads of the relevant parts, there is still that issue. So in parallel computing you will hear the term mutual exclusion: you do not want the reads and the writes to be happening simultaneously, you want them mutually excluded. There are ways to do this: every time the values on the boundaries are being written, your programming language may allow you to lock that memory, so that nothing else can read or write it, saying I am doing something to it, nobody else can touch it; or you can implement those ideas of locks yourself. As I said, all of this is just to give you a flavor and to introduce these terms; these are things that you can go and look up. So you will hear about terms like locks, and about semaphores; both, as it happens, you can see on a train journey, because they come from the railways.
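A tiny shared-memory illustration (mine, using Python's threading module) of guarding the boundary trade with a lock, so that a reader can never see a half-written update:

```python
import threading

boundary = [0.0] * 32             # interface values owned by thread C
lock = threading.Lock()

def writer_C(new_values):
    with lock:                    # C acquires the lock before writing;
        boundary[:] = new_values  # no reader can run inside this block

def reader_D():
    with lock:                    # D takes the same lock before reading,
        return list(boundary)     # so it sees old values or new, never a mix
```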
Semaphores are those signal arms with lights that hang out at various positions beside the track: when the arm is up, the driver of the locomotive knows he has to stop, and from far off the shape is such that you know to start slowing down, because a train needs time to slow down. So you have a segment of track, and what are you looking for when you have trains on a railroad track? Mutual exclusion, literally, otherwise you have a collision; and in parallel computing, too, when two threads try to touch the same thing it is called a collision. To avoid this collision you set up flags, semaphores, and locks. In the old days, and you can still see it in some railway stations here, there would be a physical lock, a token: you would be given the token, and holding it gave you the privilege of being on that segment of track. When you get to the next railway station you give up the token; there is a little basket-like thing there and you throw it in. That is how it was done in the 1800s. Unless you have the token you cannot enter that segment of track, and as long as you have it nobody else has it, so you know that there is no one else on that segment. It is the same idea here: think of memory as a railroad track, with your thread of execution going through it; you just want to make sure that two of these threads do not collide. So you have to have some mechanism by which, if one thread is accessing some memory, another thread is not going to come and collide with it: the thread that is going to access that memory gets a lock for that memory. There are lots of implementations and lots of issues involved with respect to parallel computing; I am not saying that you can just take your program, split it up into pieces, and it is going to work. There is some element of effort involved. But since I am constraining myself to CFD, for the most part the inherent parallelism that we have is not bad: if you divide up the domain, things work; if you divide up the domain and you are careful, if you just make sure that these read-write conflicts do not occur, there is no problem. At the other extreme, the simplest extreme of parallelism, is what is called embarrassingly parallel. A simple, obviously trivial, example: let us say you wanted to try out your program for various CFL values. In theory you could take 16 CPUs and on each one of them run your program with one of 16 different CFL values. People are smiling; it is obviously embarrassingly parallel.
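A sketch of that kind of parameter sweep; the names are mine, and run_solver is a hypothetical stand-in for whatever your single-run program is:

```python
from multiprocessing import Pool

def run_solver(cfl):
    # stand-in: call your actual solver here and return whatever you
    # want to compare across runs (e.g. steps to converge)
    return cfl

if __name__ == "__main__":
    cfls = [0.1 * (k + 1) for k in range(16)]    # 16 CFL (or omega) values
    with Pool(processes=16) as pool:
        results = pool.map(run_solver, cfls)     # one run per worker;
                                                 # no run depends on another
```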
If sigma is a parameter, you discretize in sigma; or if you are running SOR for different values of omega, where omega can be anything between 0 and 2, you discretize in omega and give different sets of omega values to different CPUs. Does that make sense? I am posing it that way so I can show it as though it were a parallel program, but people smile because it is obviously embarrassingly parallel: one run does not depend at all on another. So you see that parallelism depends not only on the machine but also on the level of coupling in your code. If the coupling is so tight that disengaging pieces becomes hard, then that becomes an issue: how much effort do you have to put in in order to calculate whatever values are required? Let me give you, for instance, the difference between Gauss-Seidel and Jacobi iteration. Come back to the average of φ_{p+1,q}, φ_{p,q+1}, φ_{p−1,q}, φ_{p,q−1}; I had chosen Jacobi earlier for a particular reason. If it is Gauss-Seidel, now I have a problem: some of those neighbors are at iteration n+1, so if I am starting off at the bottom-left corner, I cannot really calculate a point until I have worked my way up to it. I mean, I can, there are ways by which you can do it, but what I am trying to say is that if I stick to this scheme as written and I say I want to parallelize it, then I have a problem, because it is very tightly coupled: I cannot do point p+1 at iteration n+1 unless I have done point p at n+1, because when I want to update p+1 at n+1, the value I need is the one I am calculating right now and do not yet have; and I cannot do p,q unless I have done p−1. Is that fine? So this is tightly coupled. So although Gauss-Seidel converges faster than Jacobi, Jacobi is easier to parallelize than Gauss-Seidel. Now you have to make your decision: which way shall I go? It is not as though every algorithm parallelizes equally well; some algorithms are a little easier to parallelize than others, and it depends on what your overwhelming need is. It is possible that just running the serial code is faster than running the code in parallel, because of some inherent coupling like this. On the other hand, if you say it does not matter, I do not care, there is a kind of pipelined parallelism that you can still try to run. Just to give you an idea: it is like a fugue; I am sure you have heard fugues, kids usually do it with Row, Row, Row Your Boat. You start the first sweep here: average, average, average, and you work your way along. Once the first process has finished going through a row at n+1, all the terms required to take that row to n+2 are there, so a second process can then start, doing n+2 and trailing behind the first. And by the time the second one comes here, the first one has gone one row above that, so you can start a third one; there is a sketch of this pipelined schedule below.
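A serial simulation (my own construction of the verbal picture above) of that pipelined, fugue-like schedule for Gauss-Seidel on Laplace's equation: sweep s, producing iterate n+s, may update a row as soon as the sweep ahead of it has finished the row above. At each tick the inner loop touches rows whose dependencies are already satisfied, so in a real code each sweep could be a separate thread trailing the one ahead:

```python
import numpy as np

def gs_row(phi, r):
    # in-place Gauss-Seidel update of interior row r, left to right
    for c in range(1, phi.shape[1] - 1):
        phi[r, c] = 0.25 * (phi[r + 1, c] + phi[r - 1, c] +
                            phi[r, c + 1] + phi[r, c - 1])

def pipelined_gs(phi, n_sweeps):
    """Equivalent to n_sweeps sequential Gauss-Seidel sweeps, but scheduled
    so that several sweeps are in flight at once: at tick t, sweep s works
    on row t - 2s + 1, always two rows behind sweep s-1."""
    nrows = phi.shape[0] - 2
    for t in range(nrows + 2 * n_sweeps):
        for s in range(n_sweeps):        # these row updates are independent
            r = t - 2 * s + 1            # and could run on separate CPUs
            if 1 <= r <= nrows:
                gs_row(phi, r)
```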
So it is just a matter of looking at it: you can throw your arms up in the air and say, oh my God, it is tightly coupled, I cannot do anything with it; but actually, if you sit down and think about it for a while, you may be able to figure out a way around it. Is that okay? So that is as far as acceleration goes. And as I said, please remember: these days parallel hardware is not that expensive in absolute terms, but it is relatively expensive, because you need processors. If the multigrid method gives you a factor of 10 or a factor of 100 speed-up, then to get the same factor of 100 with parallel processing you need at least 100 CPUs; there is a difference. Parallelism is like a sledgehammer, an expensive solution. So try out all the others first: try SOR, try preconditioning, there are lots of other algorithmic things that you can do. Make sure that you have done all of that, and if despite all that your program still requires parallelization, go ahead. In any case you should be able to write a program that runs in parallel on 2 CPUs, 4 CPUs, possibly 8, because if you go out and buy a machine now they are all dual-core and quad-core anyway; at the very least you will be able to use that. Is that fine? Okay. Thank you.