So yes, thank you for surviving this long and being here for my talk. I'm a computer scientist, so there will be precious little geoscience (well, none) in this talk. Moving ahead, because I know this is a common question, or rather assumption: I am at the University of Oregon, and there is also this thing called Oregon State, and there will be a quiz later. How do you tell the difference? This is the University of Oregon; that's not our stadium, but that's the mascot, Donald Duck there. And this is Oregon State. Now, I wasn't being mean to them; this is one of the first images that comes up when you Google their mascot, so I just used it. Okay, so Ducks versus Beavers, green versus orange, all that good stuff. And an acknowledgement of a few funding sources that have generously supported our work.

First I'm going to talk about the performance of software, and then I'm going to move on to what we do about developers. To set this up: I typically don't get to choose the software whose performance we will be optimizing. It comes from application collaborators, scientists in other domains who want to improve the performance of their code. It doesn't have to be massively parallel; it could be running on a single workstation, but they want it to work faster. We sometimes have multi-year collaborations that focus on this. The first goal is usually to figure out what you can realistically achieve: you can't take a code that currently runs in many hours on a single machine and hope that in a month it will be running in a few seconds on a supercomputer. So: set the goal, identify the problems that can be addressed, eliminate those problems, and then relax. The performance engineering cycle, as it's called among my CS friends, has a few obvious steps that I'll go over briefly.
As with anything whose performance you wish to improve, you first have to measure the current state of things, and then, based on those observations, create some model of what's going on. Because these are typically domain codes, for me "really understanding what's going on" means talking to the scientists about their performance at the moment, and about what we can and cannot do. I can't tell them, "oh yes, switch to a less accurate method, that will be faster," if the accuracy is something they care about. So this part, how we communicate the results, is pretty important. Do we understand each other's language? Absolutely not. So we spend a fair amount of time in that box, and then the cycle repeats, eliminating problems one by one; I will talk about what that means in the last part of this section.

So what do I mean by understanding? When you have a piece of code, you want to know how much improvement is possible. Your code already exists, you have spent a lot of time on this implementation, you maybe even have some architecture in mind that you wish to run on, and you want to figure out what you could possibly achieve. And if you want to be able to run on a future architecture, you may want to know in advance how well you would be doing on that new architecture. And when I say architecture, those things matter a lot.
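To make "what could I possibly achieve" concrete, here is a minimal sketch of one classic back-of-the-envelope performance model, a roofline-style bound. The machine numbers are made-up placeholders, not measurements from any system in the talk:

```python
# A back-of-the-envelope "roofline" bound: a kernel's attainable performance
# is limited either by the machine's peak compute rate or by how fast memory
# can feed it. The two machine constants below are invented for illustration.

PEAK_GFLOPS = 500.0   # hypothetical peak compute rate (GFLOP/s)
PEAK_GBS = 50.0       # hypothetical memory bandwidth (GB/s)

def attainable_gflops(flops, bytes_moved):
    """Upper bound on GFLOP/s for a kernel doing `flops` floating-point
    operations while moving `bytes_moved` bytes to/from memory."""
    intensity = flops / bytes_moved          # FLOPs per byte
    return min(PEAK_GFLOPS, PEAK_GBS * intensity)

# Example: x[i] += a * y[i] does 2 flops per element and moves 3 doubles
# (read x, read y, write x), i.e. 24 bytes per element.
bound = attainable_gflops(flops=2, bytes_moved=24)
print(bound)   # memory-bound: 50 * (2/24) ~ 4.17 GFLOP/s, far below peak
```

A model this crude already answers the question the talk raises: it tells you whether it is even worth chasing a 100x speedup on a given architecture before you touch the code.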
I was originally at a Department of Energy lab, Argonne National Laboratory, and of course with each new supercomputer you want to be prepared in advance to run on those machines. Even now, with new Intel Xeon generations and other accelerators coming up, you wish to be prepared for them before they actually come out. That never happens, by the way, but we think about it. We do this through modeling: some through measurement, and some through pen-and-paper mathematical models of performance, where you pick something like execution time and create a model for it that helps you characterize some behavior. We also use tools that look at the code itself to build a model of what's going on; think of them as compiler-like tools that scan the binary or the source code and build a model from that. So, those three approaches.

The most common one, measurement, is also known as empirical analysis. Let's see a show of hands: how many of you actually care about speeding up some code that you're currently working on? Okay, so there's quite a bit of interest; I should have asked this at the beginning, but that's okay. There are many, many ways to do this, and most of them are currently pretty painful. Our goal is to streamline the process for people who don't want to learn a myriad of tools, each with a steep learning curve and unknown trustworthiness. We're trying to generalize this analysis workflow for as many types of applications as possible and to collect empirical data, and by that I mean not just timing, not just "this function took seven seconds, that one five," but also memory performance and, if you're multi-threaded or otherwise parallel, how well your code is parallelized. All of those things can be automated, and that's what we are working on at the moment. I can't boast that we have a fully generic way to do this yet; once we establish a collaboration with somebody, we can implement such workflows and they can use them easily, but generality is the grand goal. I throw around a lot of tool names here, but we rely heavily on the Python data analysis tools, because there are so many and they're wonderful, so the learning curve should be pretty low. I mentioned Tau Commander; you may Google it, though it's not the first thing that comes up. It is built by ParaTools on top of TAU, and the TAU group is a big group that has been at the University of Oregon far longer than me. Tau Commander lets you actually program your analysis instead of clicking around a Java GUI developed by students, so it's really great.

Okay, so here's an example; I just picked one of the projects that we're collaborating with now. They have a medium-sized C++ code (the language doesn't matter too much, although some languages have more tools than others) that uses Intel TBB for parallelism, and we want to find out which of their functions scale the worst. It takes a fairly small amount of code in a Jupyter notebook to get to this kind of result, and that's going to be the pattern. The plot shows that this particular function, "map seed hits," is the worst-scaling one; as we discovered later, it wasn't parallelized at all, which explains why it doesn't scale. So that's something we started working on. Then, for that particular function, we can ask: what the heck is it doing? Yes, you can go read the code, but when the compiler is done with it, what is it really? One way of looking at that is the instruction mix.
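The "find the worst-scaling function from a notebook" step can be sketched in a few lines of pandas. The function names, timings, and data layout below are invented; real numbers would come from a profiler such as TAU:

```python
import pandas as pd

# Hypothetical per-function timings (seconds) at several thread counts, in
# the shape a profiler's scaling study might produce. Names and numbers are
# made up for illustration.
prof = pd.DataFrame({
    "function": ["map_hits"] * 3 + ["solve"] * 3,
    "threads":  [1, 2, 4, 1, 2, 4],
    "time":     [10.0, 9.8, 9.7, 8.0, 4.2, 2.3],
})

def parallel_efficiency(df):
    """Worst efficiency across thread counts: t(1) / (p * t(p)),
    where 1.0 means perfect scaling."""
    t1 = df.loc[df.threads == 1, "time"].iloc[0]
    return (t1 / (df.threads * df.time)).min()

worst = (prof.groupby("function")[["threads", "time"]]
             .apply(parallel_efficiency)
             .sort_values())
print(worst)   # map_hits comes out worst: ~0.26 efficiency at 4 threads
```

Sorting every function by its worst efficiency is exactly the kind of one-screen notebook analysis the talk describes: the non-scaling function floats straight to the top of the list.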
By instructions I mean the low-level, assembly-level instructions. As you know, memory access is slower than flops. Look at all these branches: those are conditionals, and they impede parallelism; if the compiler or a library is trying to run something in parallel, all these conditionals prevent it from doing a great job. There are a lot of memory operations going on and hardly any computing; these two little segments are all the computing there is. So no wonder this doesn't do so well: it is very memory-intensive and it has conditionals. Again, this is a quick way to get some information about a function. Then you can say, "but wait, all these memory loads and stores might be fine; the architecture may be doing a great job fetching my data as I need it." We can check that quite easily too, and see, here are my level-one cache misses. You can go very deep or not; that's up to the developers. Looking at scaling (threads are on the x-axis at the bottom), you can see what doesn't scale from the point of view of memory, and the conditionals are here too. Ideally these should be flat lines; they should not be going up. When you have the green line going up, which is the branch misprediction rate, that's not good; that is one thing that is not scaling. So this is a rough first analysis to tell you which problems you should be focusing on, and there are many, many more that can't all be included in one talk; this is just to give you a flavor.

Based on all the data we collect, we may decide to build a model. For example, this miss rate: is it actually related to execution time? Does it affect it at all, yes or no? We can check. We analyze with a little bit of linear regression, which is usually simple enough and works, and with more complex models when needed, including some I'll mention later. This lets us predict execution time on an unreasonable number of threads: suppose in the future I have a thousand threads, what's going to happen? Or suppose I reduce my miss rate by 50 percent, how much speedup am I going to get? Is it worth my time to redo my data structures, which is a big change? The cost of that effort is huge, so we want to evaluate it first.

Static models are weird; I don't expect you to relate to them very much, but they're pretty cool. They're still kind of new, so maybe not applicable to as many codes as we'd like yet. Suppose you have a simple function that does a very simple vector operation, like the one here. I apologize for the C++, but it could be in any language, because all we do is count things at the binary level; it doesn't matter where it came from. You're looking at the instruction mix, except you're not running anything; you're just looking at the generated binary code. Remember the pie chart from a couple of slides back: we get the same sort of data without running anything, and based on that you can try to determine other things. Here's an example of that same data in a different form. This is, believe it or not, the instruction mix: those four values, and the table (or matrix) entries represent transition probabilities, because you have if statements and for loops and such, where you take different paths through your code. I'll tell you later what we use this for; right now it probably looks very obscure, like "what the heck do you do with that," so hang on to that thought.

Okay, because next up is performance optimization. All this was just a flavor of some of the analysis we do; the goal is obviously to improve performance, so eventually we have to do that, even though it's so much fun playing with data that we don't want to stop. Performance optimization could be simple, but it's not.
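To make the transition-matrix idea less obscure: treat the instruction classes as states of a Markov chain. One thing such a matrix gives you (a toy illustration, not necessarily what the talk's tools compute with it) is the long-run instruction mix, without running the program. All numbers here are invented:

```python
import numpy as np

# Toy static model: rows/columns are instruction classes, and entry [i, j]
# is the probability that an instruction of class i is followed by one of
# class j, estimated from the binary's control flow rather than a run.
classes = ["load/store", "branch", "flop", "other"]
P = np.array([
    [0.60, 0.15, 0.15, 0.10],
    [0.50, 0.10, 0.20, 0.20],
    [0.40, 0.10, 0.40, 0.10],
    [0.50, 0.20, 0.10, 0.20],
])

# The long-run instruction mix is the stationary distribution of this
# Markov chain: a vector pi with pi @ P == pi. Power iteration finds it.
pi = np.full(len(classes), 1.0 / len(classes))
for _ in range(200):
    pi = pi @ P
print(dict(zip(classes, pi.round(3))))   # heavily load/store-dominated
```

The appeal is exactly what the talk describes: the same pie-chart-style data, derived from the binary alone, with no instrumented run required.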
The plan is, as I mentioned before: you figure out the problems, you fix them, you move on. The reality is pretty nasty. I'm sure you have your own war stories, but we had a period of two, three, four years where Intel just did not let us measure flops, and for scientific computing, floating-point operations are kind of essential, so not being able to know how many of them your program actually did was a little bit crippling. That's why we started those static tools, where we look at the binaries and try to estimate them, not manually, but without running. And then NVIDIA, of course; I don't have to explain. How many of you have a machine with an Intel or AMD CPU in it and an NVIDIA GPU? Right. And how many tools do you know that work with both of these in concert and harmony? You know one? Which one? Oh, very good. From the point of view of developing scientific software, however, the answer is zero, and in fact I think the vendors actively prevent it, because they're competing with each other. Analyzing and optimizing those things simultaneously is pretty much a no-go if you wait for the vendors to give you something. That gap is why we exist and have cool collaborations with a lot of different scientific domains.

So I'll talk about our view, my view I guess, because I tell the students and they do it, of the process we follow to optimize software in general; it's kind of high level. The first thing we do is check whether that time-consuming operation really needs to happen at all. Instead of blindly going and hacking loops, we will seriously ask, and they may laugh at us, "do you really need to be doing this?" And sometimes the answer is no. So the first step is to eliminate work that you shouldn't be
doing at all. That's pretty obvious, but it's funny that we didn't always start with that. The next step is just for the brave; it hardly ever happens in real life, but in some cases it does. If there is a way to wean people from Fortran or C++ or C or whatever they're using at the moment and are unhappy with, and introduce a higher-level notation that expresses what they need, plus a code-generation step that you can optimize automatically, that is ideal. I'll give an example with MATLAB-style notation: if you write A times B and B happens to be the inverse of A, you should not end up doing any multiplication at all. With that high-level knowledge the operation can be eliminated; if you encode this in C, you'll never detect such a thing and you will happily do an n-cubed operation, and I don't want to be the one optimizing that. It's an extreme example, but it does happen. Wherever you can encode things in a higher-level language where you can apply optimizations at that level, that's ideal. Obviously there's a huge implementation cost to domain-specific languages, but it may be worth it in the long term, because it gives you a single code base with multiple, highly optimized backends. Okay, so that's programming languages.

Next we look at algorithms. People still code straight out of Numerical Recipes; it happens every single year, I'm not joking. Somebody I work with actually needs an eigensolver, and they either found one on Math Overflow or in the Numerical Recipes book, which I wish would be burned for good, and they coded it up, and it's really hard to convince them that it's not great. You have to do a lot of work to prove that, by the way, there are much better algorithms nowadays than the ones people came up with 30 years ago.
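The A-times-inverse-of-A example can be sketched as a toy rewrite rule. This is a minimal illustration of why seeing the expression matters, not any real DSL's optimizer: an opaque C triple loop costs O(n^3) no matter what, while a tool that sees the expression can cancel the whole product.

```python
# Matrix expressions as nested tuples: ("mul", x, y), ("inv", x), or a
# matrix name string. A single rewrite rule cancels A * inv(A).

def simplify(expr):
    """Rewrite ("mul", A, ("inv", A)) (either order) -> ("identity",)."""
    if isinstance(expr, tuple) and expr[0] == "mul":
        left, right = simplify(expr[1]), simplify(expr[2])
        if right == ("inv", left) or left == ("inv", right):
            return ("identity",)          # no multiplication needed at all
        return ("mul", left, right)
    return expr

print(simplify(("mul", "A", ("inv", "A"))))   # -> ('identity',)
print(simplify(("mul", "A", "B")))            # unchanged: ('mul', 'A', 'B')
```

Once the code has been lowered to loops, this information is gone; that asymmetry is the whole argument for the high-level notation plus code generation the talk advocates.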
Sometimes, though, that's not the case: there just isn't a good algorithm, and then you have to invent one, so we work on new algorithms with people who are interested. This is not typical scientific computing; it's more dynamic graphs, which can arise anywhere. Hypergraphs are actually used in computational chemistry and such, and biology is heavy on graphs. The question is: how do you efficiently compute properties of graphs that keep changing? (We shy away from social networks, because we don't care.) Part of the work we do is, instead of just taking a familiar sequential graph algorithm, in this case single-source shortest paths, really rethinking what it is we're computing. A static algorithm computes the answer once and is done; here new edges and vertices keep arriving or being deleted, so how do you recompute in parallel, and fast? This slide shows some recent results, and I'm happy to do this for any domain, if I'm qualified to think about it.

And sometimes it's the opposite problem: it's not that there isn't an algorithm, it's that you have way too many algorithms, way too many methods available. This is the case in several areas, but this example is sparse linear system solution. There are dozens of methods that are theoretically equivalent: among iterative methods, the Krylov methods all have the same complexity in theory, but in practice they converge at very different rates, and for most of them, even with preconditioning, you cannot really predict which one will be best for your particular problem. So how do we choose? We've given up on thinking and use machine learning.
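To make the dynamic-graph point concrete before moving on: the textbook core of incremental shortest-path maintenance is to relax only what a change can affect, instead of recomputing everything. This is a toy sketch of that idea, not the parallel algorithms from the slide; graph, weights, and distances are invented:

```python
import heapq

def insert_edge(graph, dist, u, v, w):
    """After inserting edge (u, v, w), repair distances-from-source in
    `dist` by propagating improvements outward from v only."""
    graph.setdefault(u, []).append((v, w))
    if dist.get(u, float("inf")) + w < dist.get(v, float("inf")):
        heap = [(dist[u] + w, v)]
        while heap:
            d, x = heapq.heappop(heap)
            if d >= dist.get(x, float("inf")):
                continue                   # no improvement: stop here
            dist[x] = d
            for y, wy in graph.get(x, []):
                heapq.heappush(heap, (d + wy, y))

# Distances from source "s" computed earlier (assumed correct):
graph = {"s": [("a", 5)], "a": [("b", 1)]}
dist = {"s": 0, "a": 5, "b": 6}
insert_edge(graph, dist, "s", "a", 2)   # new shortcut: s->a now costs 2
print(dist)                              # {'s': 0, 'a': 2, 'b': 3}
```

Notice that only the affected subtree is touched; the rethinking the talk describes is about doing this kind of localized repair correctly, and in parallel, under streams of insertions and deletions.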
This is a diagram of the complicated machine learning workflow, completely automatic, that we use for one of the libraries we work with, PETSc. For a linear system, we compute properties of the matrix, and you have to compute them fast and cheaply, because that's overhead. Based on those, a machine learning model recommends which solver you should use for this particular system. How well does it work? Well, it depends. If you train your model with a more limited set of inputs from the same domain (not the same application, just generic, and by domain I don't even mean the science, but the numerical method, say the same sort of PDE solution), you'll get a better result. But even with a very diverse training data set, we've been able to get accuracy over 90 percent, which means that in over 90 percent of the cases when this thing recommends a method, that method is faster than what you would otherwise use. The speedups range from a few percent to seven-fold, and this is for parallel problems that can go to pretty large scale; we don't even count the cases where the default fails to solve the system at all, because the speedup is infinity then. Low-level optimizations I'm going to skip, because I know you want to see the human part, but I'm happy to talk about them later; there's a lot of detail there.

So, humans: how do we optimize humans? Right now, I think we're not at the optimizing-humans stage yet; we're still studying the humans. We're doing the measurement-plus-analysis part, leading into optimization. In this project we decided to go only on data; we don't talk to anyone. It's not that we don't trust them, we just don't want to have to deal with that. So we use only data that is already available.
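The "compute cheap matrix properties, then ask a model" step can be sketched like this. The feature set below is illustrative, chosen by me for the example; it is not the actual feature set of the workflow described in the talk. The extracted features would then be fed to a trained classifier:

```python
import numpy as np

def matrix_features(A):
    """A few cheap structural features of a matrix, of the kind one might
    feed to a solver-recommendation model."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    diag = np.abs(np.diag(A))
    offdiag_rowsum = np.abs(A).sum(axis=1) - diag
    return {
        "n": n,
        "nnz_frac": float(np.count_nonzero(A)) / A.size,
        "symmetric": bool(np.allclose(A, A.T)),
        # fraction of rows where the diagonal dominates the off-diagonals
        "diag_dom_frac": float(np.mean(diag >= offdiag_rowsum)),
    }

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(matrix_features(A))
# these features would go to a classifier trained on
# (features -> fastest solver) pairs
```

The design pressure the talk mentions is visible even here: every feature must be far cheaper to compute than the solve it is helping to choose, or the recommendation costs more than it saves.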
That means the code itself; the revision control system, which has a lot of metadata we use; issue trackers; and, the fun part, the developer communications, via mailing lists or issue discussions. Right now we're just studying different aspects of this, and I'll show you a few examples, and then maybe some ideas of what to do or what not to do. What metrics have we considered? This is a partial list, and we've added more in recent months, but you can guess: people doing things, commits, number of lines of code added, the usual, plus topics of discussion once you get to the natural-language part. You can compute a whole lot of stuff.

Here's a specific example for three projects. The x-axis depends on the project, but it covers at least several years, the whole lifetime of that project, and the curves are bugs and fixes: when a project uses an issue tracker, you submit a bug report, things happen, and you hope it eventually gets fixed and closed. So this shows cumulative bugs and fixes over time, plus plain issues, which is just the community saying something and the project responding, or not. This first one is interesting, because it doesn't look like a normal scenario, right? You keep opening new bugs, bugs, bugs, bugs, and then one day one guy (and we checked, it's one guy) wakes up, it's a good day, and he fixes them all. I don't know about you, but I'd be a little worried if I were on this project. Then there's more reasonable-looking stuff. And here, in these two, people are clearly not keeping up: issues are staying open, and the closing rate is not keeping up with the opening rate. You might conclude that they need more people on this project to deal with this kind of stuff.
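The cumulative opened-versus-closed curves from the slide are a few lines of pandas. The event log below is a made-up toy; real data would come from the issue tracker's API or export:

```python
import pandas as pd

# Toy issue-tracker event log (invented dates and events).
events = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-02-01", "2021-02-20",
                            "2021-03-10", "2021-04-02", "2021-06-15"]),
    "kind": ["opened", "opened", "closed", "opened", "closed", "closed"],
})

# Cumulative opened vs. closed over time: the same curves as in the slide.
curves = (events.assign(n=1)
                .pivot_table(index="date", columns="kind",
                             values="n", aggfunc="sum")
                .fillna(0)
                .cumsum())
curves["backlog"] = curves["opened"] - curves["closed"]
print(curves)
```

The worrying patterns from the talk are immediate in this form: a backlog that only ever grows, or grows for years and then drops to zero in a single day, stands out at a glance.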
Next, discussions, focusing on natural language processing of what people are talking about. These are three different scientific projects, and these are the hot topics of discussion. It's very standard analysis: bugs, obviously, enhancements, bug fixes, documentation, and so on, and you can watch it over time; this is just a snapshot of all communications. What do people email about a lot? This is the PETSc developer mailing list. The topics are along the x-axis, the number of emails is on the y-axis, and the size of a bubble is how long that conversation lasted; this covers many years, and the discussions in the big red circles dragged on for years. Apparently, strangely, "Chebyshev estimate" is up there (we haven't labeled all of them, obviously, but you can look them up), and "configure error" is by far the longest-lasting discussion topic. Bubble charts are not that useful, but they're fun to look at. This is a completely different project, and theirs include advice, like a "do not use OpenMP with" something; lessons learned.

All right, and then we did fun stuff like profanity analysis. We can do this for any git project, so we did it for Linux as well as other open-source projects on GitHub. We do have the actual words and the actual people and what they like best, but we're not showing that here. What is shown is not the profanity itself; these are the words in proximity to profane words, so these are the things people are cursing about. Obviously, they do not like fixing things. I have no idea who, wait, where's Jeff? Yeah, there; that vocabulary is very extensive. And this is PETSc, and apparently they curse about PETSc itself the most, which is very telling, and then some other topics. The vocabularies are censored, sorry, but this actually shows the humans and the words they use the most, and those can be looked up on demand by asking me.
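The "words in proximity to profanity" analysis boils down to a windowed word count over commit messages. Here is a minimal sketch; the profanity list, window size, and messages are sanitized placeholders, not the actual ones from the study:

```python
import re
from collections import Counter

PROFANITY = {"darn", "heck"}   # sanitized stand-ins for the real list
WINDOW = 3                     # words of context on each side

def swearing_context(messages):
    """Count the words appearing within WINDOW words of a profane word."""
    ctx = Counter()
    for msg in messages:
        words = re.findall(r"[a-z']+", msg.lower())
        for i, w in enumerate(words):
            if w in PROFANITY:
                lo, hi = max(0, i - WINDOW), i + WINDOW + 1
                ctx.update(x for x in words[lo:hi] if x not in PROFANITY)
    return ctx

msgs = ["darn build is broken again", "fix the heck out of this build"]
print(swearing_context(msgs).most_common(3))
```

Counting the neighbors rather than the profanity itself is exactly the trick from the talk: it surfaces what people are cursing *about* (fixing things, the build) while keeping the colorful vocabulary censored.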
This is Linux, and this is, I don't remember which, one of the scientific projects, but those are the words on the outside. There are very many, many curse words that Linux uses, and not so many from the PhDs, so you need to diversify your vocabulary, I think.

Then we look at collaborations, and there are two kinds of collaboration graphs that we create; I think those are really cool, though there's not enough time to talk about them. One shows how people work on the same resources, that is, the same files being touched by multiple people: if you commit to a file, we draw an edge between you and the file. It's hard to visualize, but you can run different analyses on this to show which parts of the code are in danger of becoming unmaintainable, because too few people or nobody touches them, or, vice versa, where people are fighting over the same piece of code. The other shows how people work with each other, and whether you have one central person, because if they get hit by a bus, or a train, I'm not transportation-biased, the project will be in trouble. Here are collaboration graphs for failed projects; they look a little different from the previous, successful ones, and we're trying to learn from that. Are developers more productive when they're unhappy? Preliminary answer: yes. I will not explain this further; you have to wait for the paper.

I think it's very hard to optimize productivity, but what we're making here are the first steps toward quantifying, as much as possible, the things we think are relevant, so that people can make the right choices in optimizing their own work. We won't tell them "don't do this" or "don't do that," but hopefully we provide enough information that if time is being wasted, they would know it, as opposed to not knowing it at all, which is the current state. And that's basically it: we consider both the application and the hardware performance.
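The developer-to-file collaboration graph and the bus-factor question can be sketched with plain dictionaries. The commit data below is invented, and "number of files touched" is only a crude stand-in for the centrality measures mentioned in the talk:

```python
from collections import defaultdict

# An edge means "this person committed to this file" (invented data).
commits = [
    ("alice", "solver.cpp"), ("alice", "io.cpp"), ("alice", "mesh.cpp"),
    ("bob", "solver.cpp"), ("carol", "io.cpp"),
]

files = defaultdict(set)   # file  -> set of developers who touched it
devs = defaultdict(set)    # developer -> set of files they touched
for dev, f in commits:
    files[f].add(dev)
    devs[dev].add(f)

# Bus-factor risk: files only one person has ever touched.
at_risk = sorted(f for f, who in files.items() if len(who) == 1)
# A crude centrality: one dominant node is the hit-by-a-bus scenario.
central = max(devs, key=lambda d: len(devs[d]))
print(at_risk, central)   # ['mesh.cpp'] alice
```

On a real repository the same bipartite structure, fed into proper centrality and connectivity measures, yields the danger signs from the talk: single-maintainer files, one indispensable person, and easily disconnected pieces.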
We'd definitely also like to talk about how to improve human happiness and productivity. So, thank you. Oh, before you ask questions, here's the quiz: which one is which? What is this? Come on, guys. Oregon State, very good. And the University of Oregon, yes. Thank you.

[Host] I thought that was really thought-provoking, so maybe we have time for a few questions.

[In answer to a question about the collaboration graphs] That's exactly what we're trying to determine. We see it when we play around with them, but we can't quantify it yet, so we're trying different metrics to do exactly that, to tell you that there is a specific difference. In some of them you have a single central node instead of multiple central ones: you compute the high-centrality nodes, and when you see there's only one, that's a problem, things like that. And there are a lot of what I call dandelions, very easily disconnected pieces of the software; if you have a lot of those, that's a problem. But I've not proven that yet, so we're trying to make it more robust.

[On a question about the domain-language idea] One thing to qualify is that the people who have ended up doing this in their domains are desperate; they're not going to get anything done outside of their project otherwise. So the language is developed internally, in the cases I'm involved in. And it's not necessarily a single project; a sub-community decides that they have had enough and that now they're going to do something about it. Some even do it without knowing what they're doing; they don't call it a domain-specific language, but that's what they end up having. Most of the longer-lasting projects have done something like that already. It's interesting, because the critical point for many of them is when you end up having to develop two or three versions of your code, for example something that runs on CUDA devices and something that runs on CPUs. When you would have to add yet another one, you say "no way," and now you're going to switch to code generators, or to libraries; and if you can't use libraries, then you don't really have much of a choice.

[Host] Thank you, Boyana. We are a little bit into our break, which was not your fault at all, but I thought this was a really interesting discussion, and part of your talk, so it was worth it.