My name is Wayne Kelly, I'm a lecturer here at QUT, and I'm going to be talking about the problem of parallelization. I first want to define the scope of the particular part of that issue that I'm trying to address. I'll then explain what's wrong with current approaches for dealing with that particular sub-problem, and then describe a novel approach that I'm trying to follow to solve it in a different way, and a tool that I'm developing to support that technique.

So, parallelization: there are really two scenarios here if we want to write a parallel program. Sometimes we start off with no program at all, a green field, and we set out from the very beginning intending to write a parallel program. From day one we're thinking in terms of threads or processes; we're thinking in a parallel way. Many parallel applications are developed that way. But equally, there are other applications where we start off with a sequential program, either because it's legacy code that has existed for many years and we now want to parallelize it to improve its performance, or simply because it's an easier path: we first want to create a simple sequential program, get it developed and debugged, and then worry about optimizing and parallelizing it.

So let's assume we're starting with a sequential program and we have the task of parallelizing it. Within that scenario, we have two options. Sometimes we need to fundamentally change parts of the program, change the algorithms; the sequential algorithm is just not suited to parallelism, and we have to choose a fundamentally different data structure and a different way of doing things. That sort of algorithm change is beyond any sort of automated tool support; it really requires human intellect. So I'm not addressing the scenario where a fundamentally different algorithm is required. What I'm considering instead is where we go from a sequential program to a parallel program simply by exploiting so-called inherent parallelism.

Let me explain what I mean by inherent parallelism with some trivial examples. In the example on the left, you can see that the third statement depends on the first computation and the second computation having completed, but there are no dependencies between the first statement and the second one. So we could, if we wanted to, execute them in parallel. A more interesting example is perhaps a loop which the original program executes sequentially, one iteration at a time. But in this example, since all we're doing is initializing each element of an array, there's no reason we can't do all of that in parallel. I'm not fundamentally changing the algorithm; I'm just exploiting parallelism that existed in the original program. That's what I'm fundamentally trying to do.

Now, you might ask about those previous examples: well, you could parallelize those two assignment statements, but would that really lead to a speed-up? So let me first explain my overall view of the parallelism world. I view the process of parallelizing an application as three sequential steps. You first decide which parts of the program could run in parallel, if you so chose, and still get the same result as the original program. That requires analysis of dependencies, which I'll talk more about in the next few slides.
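Here's a rough Java sketch of the two trivial examples I just described (illustrative only, not the actual code on the slides): the first two assignments are independent of each other and could run in parallel, and the array-initialization loop has no inter-iteration dependencies, so all of its iterations could run at once.

```java
public class InherentParallelism {
    public static void main(String[] args) {
        // Independent statements: a and b are computed from different inputs,
        // so these two assignments could execute in parallel.
        int a = expensiveF(1);
        int b = expensiveG(2);
        int c = a + b;          // depends on BOTH a and b having been computed

        // Each iteration writes a different element and reads nothing written
        // by another iteration, so all iterations could run in parallel.
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i * i;
        }
        System.out.println(c + " " + data[999]);
    }

    static int expensiveF(int x) { return x + 10; }
    static int expensiveG(int x) { return x * 10; }
}
```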
Having decided which parts could be run in parallel, you then have to ask yourself which of those are actually worthwhile running in parallel. Depending on the architecture, the overheads of synchronization and the granularity may make it not worthwhile. So that requires performance analysis and profiling to decide which parts are worth parallelizing. Then finally, having decided which parts to parallelize, you actually need to do it: you need to create your threads, and you might do that manually, or you might use OpenMP or MPI or whatever your favourite threading language or libraries are. A lot of the talks we've heard today, the various languages for doing parallel programming, really address that third question: how do you actually express the parallelism? They don't really help you with the first question of how you take a sequential program and work out which parts are actually parallel and which parts aren't. My research is all about that first question.

Now, you might say, well, that's trivial: I wrote the program and I know whether this loop is parallel or not. Well, you may not have written the program, or it may be very large or complex, and even if you wrote it yourself, there may be some subtle little data structure used somewhere which does introduce a dependence between one iteration and another. So for a large, complex program it's very hard to convince yourself that what you're parallelizing is actually safe to parallelize. And all of these problems and issues apply equally whether it's some sort of automatic parallelizing compiler or tool trying to do the parallelization, or a human trying to do it manually. It's hard for both. Current compilers aren't smart enough to do all the fine-grained static analysis that's required, and humans don't do this very well either: it's very time-consuming, it requires great skill, and it's error-prone. So we really don't have a solution to this problem at the moment; there are problems on both sides.

So let's consider the sorts of dependence analysis we need to do to decide whether code can be run in parallel. There are two different types of dependencies. This is parallelism 101, and I'm sorry for boring those of you already familiar with these basic ideas, but we've got a broad audience here, so I thought I'd better define them first because the rest of the material relies on an understanding of data dependencies. We firstly have control dependencies. Here I have a loop, which might be a thousand iterations, that I'm trying to parallelize. But if we look inside the loop, we have this if statement which says: if some condition is true, then we want to break out of the loop early. So we can't just parallelize this loop of a thousand iterations, because we might find after five iterations that we need to stop and not do the other iterations at all, and doing them anyway would produce an incorrect result. So control dependencies can get in the way of parallelization. I do consider those, but it turns out the data dependencies are the harder ones to analyze. And we have three different types of data dependencies. Do you have a question? Yes.

[Audience] So on the control dependency, wouldn't it be a case of the ratio of the size of the array versus the processing? Because if you had enough processing, then it might be worthwhile to just put the output somewhere else. [Sure.] And then when you get done, copy it to where it's supposed to go.
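To make the control-dependence case concrete, here's a minimal Java sketch of the sort of loop I'm describing (a hypothetical example, not from the slides): the early break means the later iterations may never be meant to execute, so the loop can't simply be run in parallel as written.

```java
public class ControlDependence {
    // Find the index of the first negative value; -1 if there is none.
    static int findFirstNegative(int[] values) {
        int found = -1;
        for (int i = 0; i < values.length; i++) {
            if (values[i] < 0) {
                found = i;
                break;  // early exit: the remaining iterations must not run,
                        // so the loop can't simply be executed in parallel as written
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findFirstNegative(new int[] { 3, 7, -2, 5 }));  // prints 2
    }
}
```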
So there are workarounds for dealing with these control dependencies. The point is that, in their presence, you can't just trivially parallelize the code. The same is true of data dependencies: there may be ways of working around them, but it means that, in its current form at least, the code can't be trivially parallelized.

So, to the data dependencies. First there's flow dependence: in this case we've got some statement that writes some value and, later on, some instruction which reads it. We need to preserve that order, otherwise we won't be reading the correct value in the second statement. That's referred to as a flow dependence. Similarly, we have output dependencies, which result from a write followed by another write (those need to stay in the same order), and anti-dependencies, where we read some old value before it gets overwritten by some subsequent instruction. We need to preserve those ordering constraints, otherwise we might get a different result. And preserving them is sufficient: if we preserve all these dependencies, the program is guaranteed to produce the same result.

Now, that was trivial on the previous slide because we had very simple variables, A and B, just integer variables; you can tell A is different from B, and they contain simple values. When we get to arrays, as we might find in a scientific application that's typically run on these high-performance computers, we ask: can we parallelize the outer i loop? Can we parallelize the inner j loop? We have to ask ourselves whether there are any inter-iteration dependencies: is there any iteration of the loop that reads some value that was written by one of the earlier iterations? Automatic parallelizing compilers can address this sort of numeric kernel by asking: do there exist iterations i and j, for the reads and the writes, such that various constraints hold? You can solve this sort of problem. It's an NP-complete problem, but you can still solve it and get a yes or no answer, and there are automatic parallelizing compilers that use these sorts of techniques. Now, the fact that it's NP-complete is an issue, but that's not really the main problem. The main problem is that this sort of analysis only works when all of the loop bounds and index expressions are very simple, when they're affine. When they start to be more complicated expressions, the compiler can't analyze them any more. And if the compiler can't analyze them, it says: I give up, maybe there is a dependence, I'd better not parallelize this loop. It must be conservative.

It gets even worse when we move beyond Fortran scientific codes, which deal primarily with arrays, to modern object-oriented languages, C, C++, Java, where we have pointers and references. You might have two different methods being called here, one on object A, one on object B, but we don't really know whether A and B might be one and the same object. To actually track whether they may be aliases or not, we have to analyze the entire program, which is very complicated when we have complex inter-procedural data flow with references being passed all around the program. So it's complicated by pointers and references. Object orientation itself complicates things: when you call some method, a virtual method, you don't know which actual implementation will get called unless you know the subtype involved. And in component-oriented code, you might not even have the code.
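Here's a small, hypothetical Java sketch of that aliasing problem (the Account class and its methods are invented purely for illustration): whether the two calls can safely run in parallel depends entirely on whether a and b refer to the same object, and static analysis often cannot decide that.

```java
class Account {
    private int balance;
    void deposit(int amount) { balance += amount; }
    int getBalance()         { return balance; }
}

public class AliasingExample {
    // Can these two calls run in parallel? Only if a and b are different objects.
    static void update(Account a, Account b) {
        a.deposit(100);   // writes a.balance
        b.deposit(200);   // writes b.balance -- or the SAME field, if b == a
    }

    public static void main(String[] args) {
        Account x = new Account();
        update(x, x);     // here a and b alias: the two calls must stay ordered
        System.out.println(x.getBalance());  // prints 300
    }
}
```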
You call some method, but the actual implementation of that method may not even exist at this point; it may be developed by some third party and plugged in dynamically at runtime. So all of these things make it very hard to do static analysis that allows us to prove that a particular loop or piece of code can be run in parallel, which is why the majority of automatic parallelizing compiler techniques have been restricted to Fortran-type languages, where many of these complications don't exist. But even in the Fortran world the problems are not solved: we still can't parallelize many Fortran programs. The programs are too complicated to do precise enough analysis to prove that various parts of the code are parallelizable, even though they actually are.

When we do any sort of static analysis, if it's inexact it has to be conservative, which means we overestimate the dependencies that might actually exist, which means we underestimate the amount of parallelism that we can exploit. So for really large, complex object-oriented programs, if you try to do any sort of static analysis, most of the time it will say: well, there's a little loop here and a little loop there I can parallelize, but all of the really interesting stuff, the outer-level, coarse-grained parallelism, is just going to be too complicated to analyze. So it will say no, I can't parallelize it, and you get virtually no parallelization from an automatic tool these days. So the question is: how much parallelism are we missing? There's been a lot of work parallelizing scientific codes, Fortran codes for supercomputers, but what about your everyday business application written in Java? How much parallelism is in those sorts of applications that we're not exploiting today? We don't really have a good feel for that.

So I'm trying a different tack. Since we can never be precise and perfect with static analysis ahead of time, trying to predict what will happen, why don't we wait until run time and measure what actually does happen? Because at run time, all that uncertainty, about whether this pointer might be an alias of that pointer over there, or which method gets called, is gone: we know all of those answers. We know, for each instruction as it executes, exactly what memory locations it reads and writes. We know exactly what methods get called, which branches are taken, which exceptions get thrown. So the proposal is to instrument the code to not just do what it originally did, but also keep track of all of these memory reads and writes, so we actually measure which data dependencies really do exist, or materialize, at run time.

The basic idea: as the program runs, we've got some instruction A, an instance of a machine code or IL-level instruction. It executes and it reads or writes various memory locations. Let's say this instruction here writes this particular memory location. Later on as the program executes, we get to some other instruction B that happens to read that same memory location. If we record that fact at run time, we can infer that there was a data dependence from instruction A to instruction B. No doubt about it: it definitely exists, it showed up in this program run. Now, the observant among you will realize there are obvious problems with this approach. It may not find all the dependencies; it only works for this particular data input.
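Here's a minimal Java sketch of the kind of bookkeeping that instrumentation would do (the names DependenceRecorder, onWrite, onRead and reportDependence are my own, purely illustrative): every write records which instruction last touched an address, and every read that finds an earlier writer reports a flow dependence that definitely occurred in this run.

```java
import java.util.HashMap;
import java.util.Map;

public class DependenceRecorder {
    // Maps each memory address to the id of the instruction that last wrote it.
    private final Map<Long, Integer> lastWriter = new HashMap<>();

    // Called by the instrumentation whenever instruction 'instr' writes 'address'.
    void onWrite(long address, int instr) {
        lastWriter.put(address, instr);
    }

    // Called whenever instruction 'instr' reads 'address'.
    void onRead(long address, int instr) {
        Integer writer = lastWriter.get(address);
        if (writer != null) {
            reportDependence(writer, instr);   // a flow dependence definitely occurred
        }
    }

    void reportDependence(int from, int to) {
        System.out.println("flow dependence: " + from + " -> " + to);
    }

    public static void main(String[] args) {
        DependenceRecorder r = new DependenceRecorder();
        r.onWrite(0x1000L, 1);   // instruction 1 writes address 0x1000
        r.onRead(0x1000L, 2);    // instruction 2 reads it: dependence 1 -> 2
    }
}
```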
If we give the program different data, we may find some additional data dependencies. So we can certainly run the program time and time again, giving it different data sets, and we'll find more and more data dependencies, but we can never be certain we've found all of them. So I can't use this technique as a way of saying: based on the output from this tool, you can definitely parallelize this loop. What it can do is point me in that direction; it can tell the programmer, hey, this looks like a section of code that appears to be parallel. It can also tell the programmer, hey, this loop here definitely cannot be parallelized, at least in its current form (maybe by transformation you can change that). And it can tell you not just that it can't be parallelized, but precisely why: exactly what data dependencies exist at the moment that would need to be removed in order to facilitate parallelization.

It's interesting, therefore, to look at the relationship between static analysis and dynamic analysis. Static analysis gives you an overestimate of the data dependencies and an underestimate of the parallelism. For runtime analysis it's the other way around: we underestimate the dependencies and possibly overestimate the parallelism. The truth obviously lies somewhere in between, so if we use a combination of static and dynamic analysis, we can squeeze the true amount of parallelism in between.

Okay, so to support this kind of approach, and this is a very simplistic, crude description that leaves out all of the optimizations, the basic approach is to create a node for each instruction, as it executes, that might do a read or a write. We then keep track of two dictionaries, or hash tables if you like. One maps from each memory location to whichever instruction most recently wrote to that memory location; that allows us to compute flow and output dependencies. If I also want to compute anti-dependencies, I need to record not only who wrote the location most recently, but also a list of all the instructions that have read it since it was last written. And of course, if it's a garbage-collected language, we need to worry about that, because we're simply tracking raw memory addresses, but the garbage collector could come along and move an object from here to there. Logically it's still the same object, but the physical address has changed; or the object could be collected and that memory now means something else. So we need to be aware of that as well.

Now, if we want to determine whether a particular loop is parallelizable, we can't just have one big collection of instructions; we need to know that a given instruction belongs to a particular loop. So we create nodes not just for the instructions, but also for loop iterations and for method calls. What we're building at runtime is effectively a dynamic call graph. As the program runs, we might have method A calling method B; inside method B we've got a loop, and in loop iteration one, inside that loop, we call method C. Method C has a loop inside it, and inside the fifth iteration of that loop we execute some instruction which happens to write to this memory address here. Somewhere later in the program, some other instruction, executing in some different call chain, happens to read the value that was written over here.
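A rough Java sketch of the dynamic call tree just described (CallTreeNode and all its naming are invented for illustration): every executed method call, loop iteration and instruction instance gets a node with a link to its parent, so a low-level memory access can later be related back to the loops and calls that enclose it.

```java
import java.util.ArrayList;
import java.util.List;

// One node per executed method call, loop iteration, or instruction instance.
class CallTreeNode {
    enum Kind { METHOD_CALL, LOOP_ITERATION, INSTRUCTION }

    final Kind kind;
    final String label;               // e.g. "B()", "B loop, iter 1", "store x"
    final CallTreeNode parent;        // null for the root of the run
    final List<CallTreeNode> children = new ArrayList<>();

    CallTreeNode(Kind kind, String label, CallTreeNode parent) {
        this.kind = kind;
        this.label = label;
        this.parent = parent;
        if (parent != null) parent.children.add(this);
    }
}

public class CallTreeDemo {
    public static void main(String[] args) {
        // A() -> B() -> B's loop, iteration 1 -> C() -> C's loop, iteration 5 -> store
        CallTreeNode a     = new CallTreeNode(CallTreeNode.Kind.METHOD_CALL, "A()", null);
        CallTreeNode b     = new CallTreeNode(CallTreeNode.Kind.METHOD_CALL, "B()", a);
        CallTreeNode it1   = new CallTreeNode(CallTreeNode.Kind.LOOP_ITERATION, "B loop, iter 1", b);
        CallTreeNode c     = new CallTreeNode(CallTreeNode.Kind.METHOD_CALL, "C()", it1);
        CallTreeNode it5   = new CallTreeNode(CallTreeNode.Kind.LOOP_ITERATION, "C loop, iter 5", c);
        CallTreeNode store = new CallTreeNode(CallTreeNode.Kind.INSTRUCTION, "store x", it5);
        System.out.println(store.label + " sits under " + b.label);
    }
}
```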
So there's a dependence between those two instructions. In order to make this meaningful and present it to the programmer, we need to lift it back up: we follow this tree back up to the greatest common ancestor. From that we can infer that there's a data dependence from this loop iteration to this loop iteration, which tells me that this loop back in method B cannot be parallelized, because there's an inter-iteration dependence. So the actual reads and writes happen way down low in the call chain, but we lift them back up to find the level at which the dependence really exists, at a single method level, and we can then present that to the programmer. The programmer considers one loop at a time, and the tool either says, well, I couldn't find any dependencies between loop iterations, or, if it did, it visualizes them to the programmer via a graphical arrow. So it says there's a dependence from this statement here to this function call down here, and if I want to know precisely how it arose, I can click on the arrow and it will give me the two call chains corresponding to the source and the destination of the data dependence. We're trying to integrate this sort of visualization into an IDE in the tool that I'm developing, and also support for refactoring. If it tells me something isn't parallelizable, perhaps I can do some sort of renaming or various common refactorings which will eliminate dependencies, so those sorts of refactorings will be built into the IDE as well.

So what else can we do with this? We can determine which loops appear to be parallel, and for those that don't, we can tell precisely why. In the example I described before, we consider all types of data and control dependencies: flow, output and anti-dependencies. But if I consider only the flow dependencies, the write-followed-by-a-read, true, genuine dependencies, and leave out the output dependencies and the anti-dependencies, I get an even higher degree of potential parallelism. That's the parallelism I might potentially get if I can perform transformations that eliminate output and anti-dependencies: things like loop privatization and variable renaming. So by considering just the flow dependencies, I can measure a higher degree of parallelism.

Now, this gives me loop-level answers: this loop's parallel, this loop's not. But there are other types of parallelism, not just loop parallelism; there's parallelism you might get from recursion or from task-level parallelism. So if I want some measure of the total parallelism in the program, at least from a theoretical point of view, I can measure, in terms of instructions, the critical path. Say this instruction can't execute until that one has executed, which can't execute until that one has executed; if I take the longest such path through the program, that's how long it would take to execute on a machine with an infinite number of processors. Divide the total amount of work, the total number of instructions, by that critical path length, and that ratio gives me a theoretical measure of the amount of parallelism.

So what is this tool for? Well, it can certainly help guide programmers in parallelizing specific applications: they've got a program they're trying to parallelize, and this tool can help them with that.
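To make that critical-path measure concrete, here's a small illustrative Java sketch (the dependence-list representation is an assumption of mine): each executed instruction finishes one step after the latest of its dependence predecessors, the longest such chain is the critical path, and total work divided by that length gives the theoretical average parallelism.

```java
import java.util.List;

public class ParallelismMeasure {
    // deps.get(i) lists the instructions that instruction i depends on (each < i).
    // Returns total work / critical path length: the theoretical average parallelism.
    static double parallelism(List<int[]> deps) {
        int n = deps.size();
        int[] finish = new int[n];            // length of the longest chain ending at i
        int criticalPath = 0;
        for (int i = 0; i < n; i++) {
            int ready = 0;
            for (int d : deps.get(i)) {
                ready = Math.max(ready, finish[d]);
            }
            finish[i] = ready + 1;            // this instruction takes one step
            criticalPath = Math.max(criticalPath, finish[i]);
        }
        return (double) n / criticalPath;
    }

    public static void main(String[] args) {
        // Instructions 0 and 1 are independent; 2 depends on both; 3 depends on 2.
        List<int[]> deps = List.of(new int[0], new int[0], new int[]{0, 1}, new int[]{2});
        System.out.println(parallelism(deps));   // 4 instructions / critical path 3 ≈ 1.33
    }
}
```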
From a research point of view, I'm perhaps more interested in an alternative use, where I'm not trying to parallelize a particular application. Instead, I just pump in every open-source program I can find, Java, C++, whatever, general-purpose, all sorts of genres, and simply measure how much parallelism I typically find. Is it in loops? Is it in recursion? Is it map-reduce parallelism? Is it coarse-grained? Is it fine-grained? Does it fit the patterns we already know, or do we need to develop new analyses and new parallelization techniques? So I'm hoping to use this as a meta-research tool that not only helps parallelize existing programs, but measures the actual true parallelism in a whole range of different programs, which will help us better understand the task of parallelization generally.

We have a prototype implementation with a number of different front-ends and back-ends. The front-end is the part that does the instrumentation, the collecting of the data, and the second part, the back-end, is the visualization, the IDE, for presenting the information to the programmer and allowing the refactoring of the program. The first front-end we've developed is for the .NET platform and uses the Mono open-source .NET implementation. It's actually built into the JIT: we take the IL code and, in the JITting process, add the extra native code instructions that do the instrumentation. I've also started work on a front-end for instrumenting native code using the Intel Pin tool. Both of these instrumentation phases produce the same form of XML output file that records all the dependencies measured at runtime, and we can then feed that into some sort of IDE. We're working on a Visual Studio one at the moment, and potentially an Eclipse one in the future. That's it for me. If we've got time, I'm happy to take questions. Thank you very much.