Welcome to another edition of RCE. This is Brock Palen. Again, you can find us online at rce-cast.com, where there's a show nomination form. Also take a moment to head over to iTunes and give us a rating there, so that we're more visible and more people can find out about the things we're doing. You can also find all of our Twitter handles, blog accounts, and everything else off the rce-cast.com website, along with all the old back episodes. This is only the second time since the very first episode that I am going solo. Jeff unfortunately couldn't be here this time, but he will be back for the next show. We have with us today the group that is working on the PerfExpert performance optimization tool. Guys, take a moment to introduce yourselves.

Hi, I'm Jim Browne, and I'm probably the oldest living settler in high-performance computing. We have our two associates here, and here they go.

Hello, my name is Ashay Rane. I am currently a PhD student in the Department of Computer Science here at UT Austin. The way I became associated with this project is that prior to joining UTCS as a PhD student, I was working on the PerfExpert project, handling most of the research and development along with Dr. Jim Browne.

Hi, I'm Leonardo Fialho, and I'm the latest addition to Jim Browne's group, here to fill the gap Ashay left when he became a full-time student, and basically to work on the future of PerfExpert and the other tools we're building here.

And I need to mention one more person who is not here, who was the person who really founded the project with me. His name is Martin Burtscher, and he is now an Associate Professor of Computer Science at Texas State University. We started the project together several years ago.

Okay, so the project we're talking about today is PerfExpert. Can one of you give us a bit of background on exactly what PerfExpert is?
Well, PerfExpert is a simple, comprehensive performance tool. Performance optimization has four stages: measurement; analysis and diagnosis; recommendation of optimizations; and implementation of the optimizations. Right now PerfExpert automates the first three of those four phases. Our goal when we started was KISS: keep it simple, stupid. You invoke it with one line, and it gives you recommendations on what you need to do to improve your program.

So why do you think the PerfExpert project is better than the existing tools out there?

Well, what's better? Better is in the eye of the beholder. If you are an expert in performance evaluation and compilers and code structures, then there's no doubt that you can do very well with HPCToolkit or TAU or IPM or any of the other many tools. If you're a domain expert and you need help, PerfExpert is for you. The other tools give you measurements, associate the measurements with the code, and then quit. Ashay, would you like to add to that?

Yeah. As was apparent through many of our personal experiences using these performance tools, these tools are really good at measurement and can give you a ton of information, but you need to have some sort of background to understand what you should be looking for; otherwise you might just be lost in the myriad of information that's out there. So the thing that we really wanted to work on was building some kind of analysis engine on top of these measurement tools.
So given all of this data, instead of just showing it to the user as a series of numbers arranged in some row-and-column format, we wanted to find out if we could tell the user: out of all of this, this is what's really important, and convey it in a more intuitive way. One of the things we do in the analysis in PerfExpert is break down the overall performance into multiple categories, and when I say multiple, it's actually just a handful, just six of them. Through those categories we basically aim to tell the user: it is, say, your data accesses, or your floating-point execution, which is currently the bottleneck, and this is how you can fix it. So it is in the analysis that PerfExpert differs from many of these tools, and also in the recommendation process and the suggestions for optimization.

I'll add something about how we came to start this project. In August of 2008 I went to a meeting of National Science Foundation computing users called TeraGrid. There was a panel there where all the people who have performance tools, IPM and TAU and HPCToolkit, gave presentations. There were a hundred and some odd people in the room, and when the time for questions came, I asked the audience, not the panel: how many of you have used any of these tools? Three people raised their hands. I went around and asked why, and the answer was: they're too complicated; I don't know enough about architectures and compilers to use these tools. That set me off, and that's how it started.

So is PerfExpert actually another tool like PAPI that's reading hardware counters, or, as it sounds, are you trying to make something that's a little bit more approachable for the average domain scientist, as you referred to?
Well, actually it relies on counters from, for example, PAPI, or it could be VTune, using the framework of HPCToolkit. But what it tries to give the user is just a summary: okay, in this section of your code we have a problem, we have a bottleneck here, and we can work on that, and here's how, and we give some recommendations. It's as simple as that. The point is, we run the code, we take some measurements, we analyze them, and then say: okay, you can improve this section here and this other section there, with these recommendations. That's why it's really different from the other tools.

In other words, we address a different audience than the other tools. PAPI or HPCToolkit or TAU are wonderful and powerful systems; in fact we depend on them, we depend on them to do our measurements for us. What we do is add intelligence, human-like intelligence, knowledge of compilers, architectures, code segments, and things like that, to the measurements. So our added value is not measurement; our added value is knowledge. You can think of PerfExpert as a small, dedicated expert system which interprets performance measurements.

So are you getting information more at the function level, kind of like a gprof type thing, or what kind of information are you actually reading back from these tools?

So currently, in the PerfExpert release that's out there, we build on top of the measurements that are provided to us by HPCToolkit, and HPCToolkit gives us measurements at the level of functions and loops. So that's the level of granularity that we have for our measurements and analysis.

Okay, so how about we expand on this a little. Say I'm a domain scientist, and I've got my prototyped version of something and I'd really like to see why it's slow and try to improve it, so I run it through PerfExpert. What's the process for actually doing that?
So the first step would of course be to run your application with PerfExpert, or in fact, to go even a step earlier than that: when you install PerfExpert, certain things are done as part of the installation process to figure out details about your architecture, say the latencies of certain instructions or of your caches. So really, the whole measurement process starts at installation, though this is all handled transparently. When you actually run your application with PerfExpert, your application is run in a sort of controlled environment multiple times to read the values of these performance events: for instance, how many cache misses occurred, how many instructions were executed, how many branches were taken or mispredicted, and so on. All of these measurements are gathered into a single file, which is essentially produced by HPCToolkit. That's the measurement phase.

The next step is to run the analysis, and you do that using a special command in PerfExpert, which for your application will tell you the hotspots in your code. Along with these hotspots, as I said, it runs some analyses and reports the overall performance in terms of a certain number of categories, these six categories: data accesses, instruction accesses, data TLB, instruction TLB, floating-point execution, and branches. So it gathers these measurements and tells you, in a more intuitive format, what exactly the problem could be. Now, if you have additional information or are aware of the architectural details, you could decide to just take the analysis output and try to figure out what's wrong with the code and how you could improve it. But we go one step further, and in the next phase, which is the recommendation-generation phase, we match these PerfExpert outputs against a database of recommendations that we have built into PerfExpert, and
that's essentially the last stage, where we suggest certain changes that you could make to your application.

That's the detail; if you're a user, you don't really need to know all that detail, and you don't really want to try to interpret, or haven't the knowledge base to interpret, the data. What can happen is: you issue one command line that initiates your job script multiple times, and at the end you get recommendations out. So in effect, without modifying your program, without annotating it or recompiling it, just taking the existing binary and issuing one command line, you come out with recommendations for how to improve your code, where, and what to do.

So does that mean that PerfExpert has to understand the language itself? Like, does PerfExpert only support C, or only Fortran, or only run on Java? Does it care?

Actually, PerfExpert doesn't have to understand the language itself, because it relies on HPCToolkit to extract the structure of the application, and so we can find the bottlenecks in the loops, basically, and in functions and all that. So we don't have to understand the language, but we do have to correlate this information with the source code, which is what we use HPCToolkit for. We cannot, however, deal with interpreted languages; we have to have a binary. The bottom line is, we can deal with Fortran, C, C++, or any language for which there is a compiler that generates a binary which we can execute and measure. And it works with any compiler, by the way.

Okay, that was going to be the next thing I was going to ask. So it doesn't care about the compiler, it basically just needs a compiled language, and so one install of HPCToolkit and PerfExpert together can support all your necessary compiler combinations?

That's right.

So there's one thing that I need to add, and this falls into a little more detail, but the thing is, when it comes to these optimization
recommendations, we have these recommendations either in the form of code changes that you could make to your program, or in the form of compiler flags, because really, if you think about it, the compiler is the tool which should be generating the optimal code. So instead of you having to make these manual optimizations by hand, why not give some compiler options instead? Now, that said, the current database of recommendations for these optimization suggestions shows flags for the Intel compiler. We've thought about including flags for GCC as well, but as it turns out, for most scientific applications, people use the Intel compiler and they stick to that.

Oh, so this actually gives me an idea. You have regular code-change recommendations, you have compiler-flag recommendations; do you have anything like pragmas? A number of the compilers have pragmas, like to tell the compiler it's safe to vectorize this. Do you have anything like that included?

Not at the current time, but we do have some catchphrases to tell people that's something they should be doing. You want to add to that, Ashay?

Yeah, so one of the good things is that the person who designed the original database, which is Martin Burtscher, made it in such a way that it is completely extensible and easily modifiable. So, coming to think about it, adding these pragmas, indicating that certain loops are dependence-free and can be vectorized, I believe should be an easy extension to the current database that we have.

In fact, we will come back to this later, because we are in the process of doing some things that modify the source code automatically, and when we deal with issues we're going to come to later, like optimizing for GPUs, you have to introduce these pragmas into the code.

Okay, so this brings up another question. You guys are extracting information, or HPCToolkit is extracting information, from the
application being run, and you're analyzing it. Why does the compiler optimizer itself not catch these issues? If you're able to present an example of a code change, why doesn't the compiler just make it?

Well, remember two things. Number one is, the algorithm for determining the optimal code is exponentially difficult, and so compilers do not use a complete algorithm for determining optimal code; they have a set of patterns which they can recognize. If your code, and by the way these scientific users are very ingenious and inventive, does not happen to fall into a pattern they recognize, the compiler can't deal with it. The second thing is that compilers don't have the information we have. We have all this information about runtime; the compiler has only the information it can extract from the source code.

So that includes compilers with, like, profile-guided optimization?

Those compilers with profile-guided optimization don't have the opportunity to do the kinds of analyses that we have, and they're pretty uncommon; the standard commercial compilers don't do much of that.

Okay, so what are some of the most common recommendations you're finding with PerfExpert? Pretty much, do the problems you're finding for most performance bottlenecks fall into category X?

That category X would be memory accesses. As it turns out, many, many of the codes that we run with PerfExpert turn out to be bottlenecked on memory. You see, this actually goes back to why we decided what the focus of PerfExpert should be. Essentially, if you think about it, PerfExpert is addressing problems that you have regarding performance within a single node. So it looks at your floating-point execution, the efficiency of your memory accesses, branches, and so on. And as it actually happens to be the case, memory is still catching up, it's still lagging behind CPU speeds. Maybe in the years to come we might see some improvements in the form of 3D
stacked memory, but as of now, the way commercial systems are built, that's the primary bottleneck. And when you go to multi-threaded codes, the effect that using these individual threads has on the available memory bandwidth is quite extreme, in the sense that the amount of contention that occurs when you use multiple threads grows quite significantly. So for many of the codes that we work with, we find that as you scale to, say, four threads, six threads, eight threads, you start seeing memory as the primary bottleneck. In fact, if you look at the optimization database, the size of our database is actually proportional to what we are seeing in the field, so you'll find a lot of optimizations dealing with caches or prefetching in order to improve the memory performance.

Let me return for a minute to your previous question. We actually run the application about five or six times; each time we record four performance counters, which is all you can record in a single pass on many chips. So compilers just don't have the time, or at least users' compilers don't have the time to do that, or are not set up to do that. We just have a lot more information than they have.

Okay, so one thing you mentioned there was that you're basically measuring the performance of an application on a single node, and it's becoming more important to actually run multi-threaded so that you see the total memory contention when using all the cores on a single socket that are possibly sharing memory channels, right? What about distributed-memory applications? Does your database include things like MPI communication?

No, we don't do that yet. That was not a problem on the Ranger architecture, and we don't expect it to be a problem on the Stampede architecture. With modern architectures, the networks tend to be fast enough that the two problems come in terms of intra-node performance and in terms of I/O systems. Those are the places where there are
current bottlenecks, for the most part. But we have talked about extending PerfExpert, and perhaps we'll talk about that later.

So does PerfExpert not even support MPI applications currently?

Of course it supports MPI applications; that's what most people on Ranger use. People on Ranger used MPI for intra-node parallelism; you put 16 MPI processes on a node.

Yeah, so essentially, to clarify: PerfExpert does work with MPI applications. The only thing is that it will measure the performance of your intra-node execution only; it won't take into account your network usage in the form of these messages. That said, say for instance you have a very naive version of a matrix-multiply code which does all-to-all communication, and your communication itself is taking a lot of time. With the all-to-all communication becoming a hotspot, it will show up in the PerfExpert output, but it will be analyzed from the perspective of your intra-node performance. So, to summarize, PerfExpert does work with MPI applications. There is one change that you need to make to the script: basically, include the mpiexec or mpirun command in the PerfExpert job script to make it work with MPI, but that's pretty much the only change you have to make. It also works, of course, with OpenMP or pthreads or other intra-node parallelism.

Okay, so PerfExpert isn't going to choke on trying to measure performance counters for mpirun; it will actually get to the underlying executable being operated on?

Absolutely, yeah. And in fact you can extend that notion to just about any shell script which will ultimately invoke a binary. So if you have some wrappers around your binary which, say, set up the workflow or something of that sort, PerfExpert will still work; PerfExpert will still measure the correct portions of your binary instead of measuring the shell-script execution.

Okay, so you mentioned that
PerfExpert will actually run your application multiple times to accumulate different counters, because the CPUs can only measure so many counters at a time. What is the overhead of PerfExpert, then?

Well, first of all there is the overhead of the tool we are using to take the measurements, which is HPCToolkit, and that's one to five percent. And beyond that, we have to run multiple times; that's another limitation of current architectures. We can only measure four performance counters at a time, and so we have to run the application maybe five or six times to measure all the performance counters we need. If you could measure more performance counters at a time, we could reduce the number of runs. So to sum up, we have to run the code five or six times, and each run costs one to five percent.

That's actually not very much overhead at all compared to sampling-type performance counters and other approaches; that's actually really low.

That's right, which is why we love HPCToolkit. The HPCToolkit people at Rice, John Mellor-Crummey and his team, are our great friends, and we are their most widely used user.

I think I'm going to have to get them on the show next.

Hey, I would recommend that, and tell them that the PerfExpert team suggested it.

Will do. Okay, so how does PerfExpert present information? It presents these code recommendations, but does it actually give me a summary of how my application is doing, where it's waiting?

Well, it gives you a simple bar diagram telling you how well you're doing on each of the six categories, if you are a fairly sophisticated user and need to know that information. It has a bar diagram that goes from horrible to good, and if you get to good, you don't need to worry about anything; but if you're anywhere beyond fair, that means you have some issue
with that particular piece of code. And remember, PerfExpert ties its measurements to a particular loop or a particular function; each of its measurements is resolved to a loop or a function. So it will tell you, to begin with, and you can tell it how many functions you want, say, the top three functions that are causing problems, in which case it will give you that same output for those three functions and no more. If you don't want to look at that data, you don't have to; it will just tell you what it thinks you need to do to your code to make it work better.

So what if I have one of these codes where the heaviest function is a very, very large block of code? Can I then drill inside that function and look at loops and functions inside it?

Yeah, actually, that's essentially another reason why we really like HPCToolkit: it gives us this information at the granularity of loops. Now, realize that if you have some code which is running for a really long time, it is either because you have a tight loop in there, or because you're using recursion. Many of the codes that we deal with from the scientific community don't use recursion; they're actually using loops. And since HPCToolkit gives us measurements at the granularity of loops, even if it's a really big function, it will ultimately give us information as to which loop in the entire function is causing the slowdown or taking up a lot of time. So that way you can drill down into a smaller portion of the code. That said, you'll find a lot of people in the performance community advising, not really against it, but asking you to be a little careful once you try to go down to a level that's finer than loop granularity, because of the way these architectures work: it's hard to say that a particular instruction was the reason for your
performance bottleneck, because there can be delays in, say, when these counters get incremented, and so staying at the granularity of loops is a much better idea than going any deeper.

Yeah, you don't want to go down to the statement level; a loop nest is about the lowest level you really want to go to.

Okay, so it gives me little bar charts. You also mentioned that you like to keep this thing as simple as possible. Is this like a GUI interface, something that's point-and-click, but then my users have to have an X server installed, and I have to teach them how to do X forwarding and all that kind of stuff from the cluster head node? Or is it different from that, just log in with standard SSH?

SSH and text scripts; it'll come out as text. You can see it on your screen.

Just ASCII, ASCII art?

Yes, ASCII art. That's a good way to put it.

But no explaining to Windows desktop users how to get an X server and do all that stuff?

No, no way. We follow the KISS principle. Some of us, at least me, are sort of simple-minded.

One of the reasons not to have this visualization on the user's desktop is that we have to run the application on the target machine where you want to run this software. So maybe it could be interesting to do the analysis on a different computer just in terms of visualization, but you're already running your software on the target machine, and so we can put the output there, and that's enough.

So it seems like multiple things have been mentioned about the recommendations possibly being architecture-specific. Is there a way to run PerfExpert, well, one, is it really architecture-dependent, or is there a way to run it in a generic, best-practice mode?

Our recommendations are architecture-dependent. However, as Ashay mentioned before,
when you install PerfExpert, we automatically, as part of the installation script, run a set of micro-benchmarks that generate a characterization of the architecture in terms of about a dozen or fewer parameters that the user knows nothing about. They are just like black magic; they just appear, because the script generates them, and they characterize the architecture, and we use those. We are certainly not architecture-independent.

Yeah, so I think that needs to be elaborated slightly. A few of the things that we measure at installation time are latencies of certain instructions, or of certain components of your architecture, say your caches or the TLBs. We also measure, and this is actually in some ways more important than the latencies, which performance events are available on that particular architecture. Now, as it turns out, even if you consider a single vendor, say Intel, and you consider the performance events that are available in one generation of a processor and then look at a different generation, it's hard to say with confidence that you will have the same set of performance events supported across these different generations. So we do measure which events are available and find out which ones are of interest to us. And as Dr. Browne said, in these analysis outputs we basically have a bar chart. Now, I figured you might be interested in knowing this: the way we figure out whether your performance is good, bad, or horrible is based on the measurements that we obtained from these micro-benchmarks. During installation we measure the best-case performance that you could get on this machine, and so all of your analysis output is tailored to that particular machine. So if you have the analysis output on one particular
machine, then it's customized, essentially, to that architecture.

Okay, so you're telling me that PerfExpert isn't compiler-dependent, like, say, some other tools may be, where you need to build it with every compiler that you want people to be able to use it with. But if I have, say, one login environment and two sets of users, each with a different generation of hardware, then when I install I need to run it on each hardware type, and I need to have two installs, one for each group, for their different sets of hardware?

Yes. For example, at TACC we have three systems that are used at large scale: Ranger, which is about to go away, on AMD chips; Lonestar, which is on Westmere Intel chips; and Stampede, which is on Sandy Bridge chips. Now of course, every piece of software for each one of those machines is distinct anyway, so the compiler for the Westmere chips is different. PerfExpert is just like the compilers: you have to have a different one for each architecture.

Okay, so that actually brings up a point. PerfExpert is hosted on the TACC website, most documentation refers to TACC machines, you mentioned it started at a TeraGrid conference, and I assume it's going forward with all this XSEDE equipment like Stampede. Since it's all architecture-dependent and everything, does PerfExpert really only run on TACC systems, or is this something you can use anywhere?

Well, it just started here, but now we have more than 50 sites across the world using it. So if somebody wants to just try it, they can go to the TACC website, download it, and try it on their computer, and we can actually help you do that if you need. We have more than 50 sites today and it's growing; that's the point. You can use it on your machine.

Okay, I've actually already installed it; you guys have answered some of my questions, and some of these questions I've asked
have been loaded, based on my own experience. So one other thing: we've alluded to a number of things, and you guys have done a lot of this stuff. What recommendations would you make to somebody who wants to sit down and run their application with PerfExpert? Like, don't share a node with anybody else, keep the hardware type constant; what are the make-sure-you-do-these-things so that your data are correct?

The only thing you really need to do carefully is to design the input data set properly. We do provide instructions on this; I hope you found them, by the way. You need to have your data set sized so that, at whatever scale you run, the amount of data which is on a single node is representative of the volume of data that you will have in a production run. Does that make sense?

Yeah, yeah: keep the amount of data proportional on every processor, because for something like a matrix multiply, the number of memory requests per flop scales with the size of the data.

You got it. And what you want to do is, you typically don't want to run PerfExpert on a thousand nodes; you want to run PerfExpert on two nodes, 10 nodes, 16 nodes, because otherwise the data set begins to be very large and the processing time gets large. But it doesn't make any difference, since we're basically just optimizing intra-node: as long as you have a representative amount of data on a node, the recommendations that are made are valid.

Okay, so what if I'm a compiler manufacturer? You said right now it's mostly the Intel compiler you understand, but there are a couple of other compilers out there. What if I wanted to build a database and contribute it to the project? How would I go about doing that?

Well, you can download the database; it's just a piece of Java code. The first thing you have to do is make sure you've got the data, so that accurate references to the proper entries in the database are
made. So you'll have to know something about what you're doing; you can't just add to the database without adding to the measurements, or adding to the interpretation of the measurements, to make the recommendations valid and effective. But given that, we regularly add different optimizations to the database. We have just figured out how to automatically compute the tile sizes for loop tiling, for example, so we can now compute those automatically. So it's not a matter of just adding to the database, although you can just add to it; it's a matter of having the data and the rule set. There's a set of rules for interpreting the PerfExpert data, written in Java, which is interpreted, and it picks out which recommendations are indicated by particular patterns. For example, if you have a very large contribution to your performance bottleneck from TLB misses, then you almost certainly have to do a loop interchange, because you're taking very long strides. Does this make sense now?

Yeah, yeah. So what about the licensing of the code? If I'm a compiler manufacturer and I go through all this work for my hardware, and I think I've got things right and I want to contribute it back, is that possible?

We'll be happy to accept offers of collaboration from anybody, and if they want to take on the job of adding to the PerfExpert software, it's open source; they can do it.

So what is the license of PerfExpert?

It's currently released under LGPL version three, I believe. Yeah, we recently had a request from a company, and that's when we essentially went back to the drawing board and figured out what license we should have and how we should release it. So as things stand at the moment, I believe it's open to any sort of tweaking or tinkering from any commercial vendor. It's subject to the license that's approved by UT
and of the NSF. So if the vendor modifies it, anything that he takes from us has to continue to be released as open source. Okay, so let's talk a little bit about the future of PerfExpert. It sounds like you're constantly adding to this database. What are some of the big things you want to extend and add to PerfExpert in the future? I'm going to mention two of them, and then I'm going to pass the microphone to Ashay and Leonardo. The first one is that we have been working for quite a while now on extending our analyses and recommendations to graphical processing units, GPUs. We have done some work on that, and I'll ask Ashay to say a few words about that project. Sure. To give you the broader picture, we have these fantastic compute devices, these GPUs, which can do a lot of work in massively parallel ways. The question is, you have that hardware, but first you have to program it, which can be hard. And even before you actually get to programming it, because of the high amount of effort involved in converting your code from C, C++, or Fortran to use CUDA calls, you need to know which portions of the code you should be converting to use CUDA. And that's where we thought we could help with some of the analysis that we do with PerfExpert. We have the runtime information from these performance events, so the idea is: can we find specific characteristics of the source code and make an estimation as to how well a code segment will run on the GPU, or whether it just won't scale, or whether it won't run at all, or how hard it would be to convert? So in that sense we look at the floating-point execution efficiency, the data structures, and the scalability as well, and we try to make a
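The kind of loop characteristics that matter for such a suitability estimate can be illustrated with two small C loops. This is a hand-made sketch of the distinction, not PerfExpert's actual analysis:

```c
#include <stddef.h>

/* A promising GPU candidate: every iteration is independent, the
 * accesses are unit-stride, and the work is dense floating point.
 * Each iteration could naturally become one GPU thread. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* A poor candidate as written: each iteration depends on the previous
 * one (a loop-carried dependency), so the loop cannot be mapped naively
 * onto thousands of independent GPU threads. */
void prefix_sum(size_t n, float *x) {
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```

An analysis engine that can tell these two patterns apart from measured behavior can steer the programmer toward the segments where the CUDA conversion effort will actually pay off.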
recommendation about the particular code segment as to its suitability to run on the GPU. We've had a couple of papers along those lines. In fact, just like the GPUs, the other big hardware that we have here at TACC are these MICs, the Many Integrated Core architectures from Intel. In some ways you could say they're similar, because they also require that your code be massively parallelizable, so this process of recommending certain segments to run on GPUs could be changed slightly and run for the MICs. That's essentially what we've been working on in this context. I might mention that we are also working on recommending what pragmas to add. But the other big thing we're doing is that we have just about completed the first prototype implementation of a project which actually automatically applies the recommendations, and I'm going to pass this to Leonardo, because he's the one who's been doing that pioneering work. Yeah, actually, what we are prototyping here is to apply those recommendations after we get this information from PerfExpert. And maybe the information from PerfExpert is not, how can I say, enough; we need some more information, and we have another tool for that, which we call MACPO. From MACPO we can extract the different memory access patterns, and we use this information to parametrize the code transformations we want to apply. So we have the recommendation from PerfExpert and we parametrize it: for example, if you have to apply loop tiling, okay, what's the tile size we have to use to tile this loop section? And so we can finally apply all those recommendations automatically with the OptTran tool that we are prototyping now, and we can restart the optimization process with PerfExpert. So we have this kind of optimization loop, and after one round or two rounds we can find: okay, this is the situation, this is the source code where we can find the best
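The tile-size parametrization Leonardo mentions can be sketched with a tiled matrix transpose in C. The transformation shape is standard; the `TILE` value here is a hypothetical placeholder for the number the tool would derive from measured access patterns and the cache geometry:

```c
#include <stddef.h>

#define N 64

/* Naive transpose: the writes to out[j][i] stride through memory. */
void transpose_naive(const double in[N][N], double out[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            out[j][i] = in[i][j];
}

/* Tiled transpose: both arrays are touched one TILE x TILE block at a
 * time, so each block stays cache-resident while it is being used.
 * TILE is the parameter a tool would choose automatically. */
#define TILE 16
void transpose_tiled(const double in[N][N], double out[N][N]) {
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    out[j][i] = in[i][j];
}
```

The two versions are behaviorally identical, which is exactly what lets an automatic tool substitute one for the other and then re-measure with PerfExpert to close the optimization loop.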
performance for this architecture specifically. And of course we have three databases at this moment: the first database is for the recommendations from PerfExpert; the second is the kinds of code segments that we can recognize and what kind of transformation we can do over that code; and we can extend all those databases to support different optimizations on the original source code. So there are two things I'd like to summarize. Number one, our ultimate goal is that the user runs a single command line and gets back a program which is pretty closely optimized to a particular architecture, automatically. Now, this is going to take a long time. We currently have implemented just three commonly occurring optimizations, so don't hold your breath waiting for us to do this completely, but that's our goal, and that's where we want to go with PerfExpert in the next year or two. Now, I should mention there's one other tool, the MACPO tool that Leonardo alluded to, which was Ashay's contribution; he generated it just about on his own. It adds to the code specifications that you get from PerfExpert: it characterizes the execution behavior of each data structure in each code hotspot, and it is the source of most of the improvements and additions to the database that we are currently making. At some point you might want to talk to Ashay about MACPO. Now, one more thing: in the very long run I want to do an end-to-end system for optimizing memory transfer, starting with the disk and going all the way down to the CPUs, but, you know, I probably won't live that long; there's always another bottleneck. I love it. The other thing we are going to do, in the shorter run, is build up a library of optimizations for the Knights Corner MIC chips, because there the optimization set will be quite different from that for conventional multi-core chips. Okay, well, thanks a lot for
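One flavor of optimization that matters far more on wide-vector chips like Knights Corner than on conventional multi-cores is making loops vectorizable, often via a pragma hint. A minimal hedged sketch, assuming an OpenMP 4.0-capable compiler (a compiler without OpenMP support simply ignores the pragma, so the code stays portable):

```c
#include <stddef.h>

/* On MIC-class hardware, performance depends heavily on the 512-bit
 * vector units, so a recommendation might be to assert that a loop is
 * safe to vectorize.  The pragma is a hint: iterations carry no
 * dependencies, so the compiler may issue wide SIMD instructions. */
void scale(size_t n, double factor, double *v) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        v[i] *= factor;
}
```

This is only an illustration of the pragma-recommendation idea discussed earlier; the actual pragmas in PerfExpert's database may differ.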
your time, everybody. And where can people find PerfExpert and get more information? There is a website, which is www.tacc.utexas.edu/perfexpert. Okay, thanks again for your time. Thank you. Thank you very much, Brock. Bye-bye.