In this presentation, we will cover static timing analysis. The development kit ships with a tool called the SPU timing tool, spu_timing. Once you have written your code, it shows you how many stalls there were, which branch operations were taken, where the delays were, and where the dependencies lie. It is a static analysis tool: it parses your code, looks at the assembly file, and gives you an analysis of how the instructions are interleaved. The workflow is straightforward: take a simple source file, say example.c, produce the assembly file, and then produce the annotated assembly listing. When you run the timing tool, a file with an .s.timing suffix appears in the current working directory; for example.c, that file is example.s.timing. On the command line you invoke spu_timing with a few options: a help option that displays usage, a verbose option, an option to specify the target architecture you are profiling for (the -m option), an option to give the output a different name if you do not want the default, and a running-count option that counts the number of cycles from the start of each instruction, followed by the input file. Now let's look at a source file and see how the tool is used. The function takes a scalar float alpha, a length n, and two arrays of vector float, x and y. spu_splats takes a scalar input value and produces a vector. So what will the vector a hold? alpha is a single float, right?
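Stepping back to the command-line workflow for a moment, a rough sketch might look like the following. The compiler driver name and the exact flag spellings here are assumptions recalled from the lecture and may differ between SDK versions:

```
spu-gcc -S example.c      # produce the assembly file example.s
spu_timing example.s      # write the annotated listing example.s.timing
                          # (add the running-count option to number each
                          #  instruction's start cycle)
```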
So this vector a holds four floating-point values. spu_splats takes the scalar value alpha, say 4.5, and replicates it across the whole vector, so inside a you get 4.5, 4.5, 4.5, and 4.5. Then you do a multiply-add: multiply a by x[i] and add y[i]. That is all this simple kernel does. Once the assembly code is produced, you get a typical assembly file for the C program, and alongside it the annotated listing, example.s.timing, for example.c. Now let's look at how to read this timing file. There are a few important columns. The first is the running count; the second shows the pipeline and dual-issue status; the third gives you a picture of the instruction pipeline; and then come the text and data sections and the instructions themselves. We will go into the details of each. The running count is a cumulative cycle count at the start of each instruction, so the difference between consecutive values tells you how many cycles each instruction accounts for; a shufb, for instance, shows up as two cycles there. By taking the difference across the loop body you get the loop's cost: in our example the loop body starts at cycle 6 and the next iteration starts at cycle 23, so the loop is 17 cycles. Again, this is a static analysis tool; there are other tools available that produce somewhat simpler output.
The second column shows the execution pipeline. We always keep talking about dual issue; how do we make sure dual issue is actually happening? The first sub-column of this column is a 0 or a 1 and tells you which pipeline the instruction goes to: a 0 means the even pipeline, a 1 means the odd pipeline. The second sub-column shows the dual-issue status, and if you look, some of the D's are capital letters and some are lowercase. A lowercase d means dual issue is possible, the instruction could be issued together with its partner in the other pipeline, but there is a dependency that will stop the dual issue at runtime. A capital D indicates that the pair really will be dual-issued. In other words, seeing a 0 with a capital D paired with a 1 with a capital D is a good sign: it means a dual issue into the even and odd pipelines will actually happen. Ideally, your second column should read 0D, 1D, 0D, 1D, and so on; that means your code is really efficient. Otherwise, you have to see where the dependencies are and rearrange and restructure the code so the dependencies are removed. Now let's look at the third column, which denotes the instruction's clock-cycle occupancy in the pipeline, shown as a sequence of digits.
For every clock cycle an instruction occupies, a digit 0 through 9 is displayed. A dash means the instruction is ready to be issued but cannot be, because of a dependency on a previous instruction; in other words, a stall. A steep slope in these digit sequences is good: it means few dependencies and efficient issue of instructions. A shallow slope means more stalls; it may still be acceptable, but it could be better. In the inner loop of our example, two places show dependency stalls: the load of y stalls one cycle (one dash) for the address increment, and the store of the resulting y stalls five cycles (five dashes) waiting for the FMA operation to complete. Dependency stalls like these can be eliminated by unrolling the loop. However, keep one thing in mind when unrolling: use distinct local variables in each unrolled copy of the body. If the copies share the same temporary, the compiler must assume the second copy needs the value from the first. It does not, but the shared name introduces false dependencies.
So even if you give the xlc compiler the -O3 optimization level, nothing much may happen until you make sure these false dependencies are not introduced in the code. Okay, profile markers. There is a header file, profile.h, which provides checkpoint macros named prof_cp followed by a number; you can pick any pair of consecutive numbers, say prof_cp0 and prof_cp1, and keep the loop in the middle. Let's look at an example before we go any further. Here is our loop, our vectorized code, and we want to see its performance number. We do not want to capture all the setup overhead: any time you profile code, the markers should go right before the main computation kernel starts and right after it ends. So in this code, we want to start profiling right before the loop and stop right after it: we place prof_cp0 where the for loop starts and prof_cp1 after it. prof_cp0 starts capturing the cycle information for all the loop iterations, and the tool then reports the number of cycles used, the number of stalls, and all that information in the timing output and in the statistics on the simulator. You can identify the markers in the assembly code because each one introduces an lnop instruction. There are some limitations: if you are writing handwritten assembly code, the tool may not work as well. It understands the assembly the compiler generates, because the compiler produced it; it is a bit like someone else not being able to follow your code.
So running the timing tool on handwritten assembly may not produce the same quality of results. The tool is most beneficial when you write the source in a high-level language, C or C++ or whatever, and profile the assembly file that the compiler itself generates. That covers static timing; let's move on to dynamic runtime analysis. How do we do that? The objective here is to see how to do runtime dynamic analysis and produce the output on the simulator in a user-friendly, understandable form. You instrument the code with the profile checkpoints we covered earlier: prof_clear, prof_start, and prof_stop. When you do that, the simulator produces an output reporting the number of cycles consumed by the loop. Depending on which SPU was used, it prints the SPU number, then the checkpoint (for example CP31), then the instruction counter: the total number of instructions including nops, with the count excluding nops in braces, and then the total number of cycles used for those instructions. In order to extract that kind of information, the simulator cannot just run in fast mode; you have to switch it to pipeline mode. In pipeline mode, the simulator records every single cycle.
These are the instruction classes. All the instructions we deal with, arithmetic, memory, and so on, fall under certain categories, and the even/odd pipeline assignment is based on that category. All single-precision floating-point operations are class FP6 and always go to the even pipeline. All branch instructions fall under the BR class and go to the odd pipeline. The nops, the loads and stores, and the branch hints fall under the LS and NOP/LNOP categories; every load and store takes six cycles and goes to the odd pipeline. All of this information is necessary when you are writing even a simple application: if you are doing an add, you know it will definitely go to the even pipeline because it is an arithmetic operation. So if there is an if-else statement, structure your code so that the branch sits right before or right after the computation; then the compiler can restructure the code and issue it into the two pipelines so it runs concurrently. The point of writing the code so that dual issue happens is to save cycles. It is all about cycles; we are going to reduce cycles at any cost, bottom line. The simulator provides these statistics: you can type a particular command at the command prompt and it will print the data (we will see that in the hands-on session), or you can use the GUI stat controls.
In other words, you run your application, say yesterday's hello_be example, go over to your SPU window, and issue the spu stats command; it prints this whole window, provided the simulator is set to pipeline mode. The simulator output looks like this. The first section gives you the complete summary: the total cycle count, the instruction count, and the CPI, the number of cycles per instruction. That is a very useful number. Actually, before I started Cell programming, what I mostly thought about was how to write object-oriented code and keep the coding standards good; I never really worried about performance until I started working on Cell, and then I realized a few neat things I could have applied back in my previous position. It is really instructive to write code, see exactly how badly it performs, and then go back, restructure it, and make it so much more efficient. In other words, if your cycle count is way too high and you know it is a small piece of code, obviously something is going wrong that is not visible to the naked eye, and you have all these tools to run on the code, observe its behavior, and improve it. Okay, so that is the full summary of the output; now let's go deeper. The first section gives the total number of cycles and the complete count of instructions executed by the program, and then the CPI: it is really good if it is less than one; usually it will be between one and two, and if it is above two, or three, or four, something is not right.
So the total cycle count is here, the instruction count is here, the CPI is here. Then there is the performance cycle count: where the total cycle count covers all instructions in the entire program, the performance cycle count covers just the region you are profiling and trying to optimize, that particular kernel. You can also get the performance nop count and the performance instruction count, and you can get branch and hint statistics. The total number of branch instructions here is 143,000-odd: you have actually taken that many branches, so no wonder the CPI over here is as high as 8.55. Branches not taken number 280. You can get all these statistics that determine the nature of your code, and if you used any branch hints, it gives you a summary of how many hints actually worked. There are also efficiency statistics: single-issue cycles, dual-issue cycles (how many instructions used the dual-issue pipelines), and nop cycles. It really is a summary of everything you could possibly want to know about your code.
Then there are prefetch miss stalls and dependency stalls; if there are dependency stalls, it gives you a complete count of them. The points to observe here are how many dual-issue cycles there were, how many prefetch miss stalls there were, and, most importantly, the dependency stalls. Reduce dependencies in the code; that is the key. Otherwise, no matter how good the code is, when you try to vectorize it, dependencies mean it cannot be parallelized. The compiler has immense capability to parallelize an application, but if the code is not well written, even the compiler cannot do anything, unless you want to write your own assembly code. There is also the hint target stall, the branch bypass stall, all of that information. This can be a useful presentation to refer back to when you go off from this class and write your own application code: look at the features available and see what they mean and how to use them. Okay, and then there is another command-line option available at the simulator prompt. Once the simulator is running, in that window where all the addresses and trace output are scrolling by, you can hit Ctrl-C to get the command prompt, or click stop on the GUI window, which halts the processor. Then you type the command, mysim spu N display statistics, and it prints a complete recap: the number of branch instructions that were taken, the number that were not taken, the number of hints, and the number of hint hits, so you can see how many hits there were for the hint instructions. And there is another option that's
available where you provide the SPU number as a parameter: when you hit Ctrl-C, you can see which SPU was used, zero or seven or four, so you pass that number in, enter the command, and get all the hint status data. If you are a really low-level person, you can also look at register usage, the register reads and the register writes; once you have written your application, that gives a very good idea of how many register reads and writes were done. There is also a simulator option that prints the event log: it shows a summary in a neat little box with the cycle counts, the instruction counts, and the DMA operation counts, how many gets there were, how many puts, how many translation faults. Remember there is no address translation on the SPU, so translation faults can only be on the PPE side. It also shows how many fetch groups and issue groups were issued. There is a control panel where more data gets printed during general execution, and a visualizer summary display where you can get the number of DMA gets and puts, the cycle and instruction counts, the channel stalls, everything; and if you look, there are eight of these windows, for the eight different SPUs, and when you bring it up, this is how the output looks. Some more information is available in the materials on your CD, but do try to go and check out the link on developerWorks: there is a really neat, small paper written about what performance to expect on Cell systems, and it is very user-friendly, very beneficial information.
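The simulator-prompt interaction described above might be sketched as follows. The command spelling and the SPU number are recalled from the lecture, so treat them as assumptions:

```
# Hit Ctrl-C in the simulator trace window (or click "stop" in the GUI)
# to halt the processor and get the command prompt, then:
mysim spu 0 display statistics    # 0 = the SPU number you saw in the trace
```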