I'm here to introduce a speaker, Ian Romanick, and he's got a talk on reducing memory usage. Thanks.

Hi, my name is Ian Romanick. I work in the Open Source Technology Center at Intel, where I lead the group that's working on the open source OpenGL driver for Intel graphics chips. I'm going to talk today about some work I did around the middle, kind of into the last half, of last year to reduce the memory usage of our GLSL shading language compiler. My slides are also already up; they're going to be available through the conference eventually, but I figured I'd beat them to the chase. The slides that are up there also include the notes I made for myself on them.

I'm going to talk a little bit about Mesa project background, the particular set of problems that we encountered, and then what was done to try to remedy the situation a bit.

Mesa is the open source OpenGL driver stack. A fairly large and important part of that is a hardware-agnostic shading language compiler front end and, for lack of a better term, a middle end that includes a bunch of hardware-independent optimizations and code transformations, and a bunch of other things like that, all of which happen before you end up generating hardware instructions that actually run on the GPU. Running wc -l as of yesterday says the hardware-agnostic code is about 60,000 lines of C++, including comments and license headers and all that sort of filler. Part of this includes a talloc clone that we use for memory management; we call it ralloc. I believe it was either Carl Worth or Eric Anholt who talked about the compiler architecture, and in particular the memory management system, at LCA, I think in 2011, but I could be off on that. The important bit, since I'm not going to go into the memory manager as a whole, is that most of our compiler stack uses ralloc as sort of a mark-and-sweep garbage collector.
So we have a lot of optimization passes and other kinds of code transformation passes that will create a new ralloc memory context, perform a bunch of operations on the IR tree, add some nodes, delete some nodes, move things around, and then make a pass through all of the reachable nodes in the tree, reparent them to the new context, and destroy the old context. Anything that's still attached to the old context automatically gets freed. This has a lot of advantages for us. It means we don't have to worry about going through and deleting everything in destructors, or writing a whole bunch of code to explicitly make sure we don't have any memory leaks; everything that's not reachable eventually gets destroyed. There's one little caveat to that that I'll get to a bit later.

We kind of developed this on the assumption that modern computers are big and memory is cheap, so let's just develop the code so that we can actually have a maintainable compiler, and if we run into problems we'll worry about them later. Well, we ran into problems. A lot of games have giant hulking piles of shaders. We've encountered a number of Unreal Engine 3 based games that have many tens of thousands of shaders, and we encountered a developer build of Dota 2 from Valve that, on our compiler stack, ate over 4 gigs of memory in the compiler at startup. The Dota 2 developer build is kind of special because it tries to compile every shader, and every variation of every shader, at startup, even the variations that won't be used on your particular hardware platform. The idea is: try to compile everything and make sure it all compiles, so that developers will notice "oh hey, there's a shader that's broken, that doesn't build, let's fix that" before it gets out into the wild. But it made the machine fall over.
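That mark-and-sweep pattern can be sketched roughly like this. This is a minimal illustration of the idea only, not Mesa's actual ralloc API; the names and the set-based bookkeeping are invented:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <unordered_set>

// Sketch of a ralloc-style hierarchical allocator: every allocation belongs
// to a context, and destroying the context frees everything still attached
// to it. "Stealing" reparents a live allocation into another context.
struct MemContext {
    std::unordered_set<void*> owned;

    ~MemContext() {
        for (void* p : owned)
            ::operator delete(p);   // anything never stolen dies here
    }

    void* alloc(std::size_t size) {
        void* p = ::operator new(size);
        owned.insert(p);
        return p;
    }

    // Reparent a live allocation into `dest` (Mesa calls this ralloc_steal).
    void steal_to(MemContext& dest, void* p) {
        if (owned.erase(p))
            dest.owned.insert(p);
    }
};
```

An optimization pass would allocate into an old context, steal every node it can still reach into a fresh context, and then destroy the old context, so anything unreachable is freed automatically.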
There are also a lot of DX games where, when you're running them in a virtual machine on a Linux host that punches through and translates the DX to OpenGL, we exceeded the sandbox capacity, which, depending on the VM you're using, is either a gig or two gigs, and we were blowing way past that on a bunch of modern, large DX games. So we clearly had to do something about our memory usage, because people running on other, closed source drivers weren't encountering these problems with these applications.

I basically approached this like an optimization problem. When you're optimizing an application for performance there's a pretty well known path: you go through and collect some data, figure out where all the time is going, or in the case of a memory problem where all the memory is going, look for some big fish and/or low hanging fruit, fix problems, and then lather, rinse, repeat until it's good enough, and then move on to the next thing you need to do, because there's always an infinity of infinities of work to do. It ends up being a little bit different with memory usage, because it's not as familiar a thing to tune for, right? If you're going to do performance work there are huge piles of tools that everyone knows about: sysprof, VTune, perf, all these things people have known about since forever for diagnosing and figuring out where performance problems are. It doesn't seem as well trodden for doing this with memory usage.
So I picked a couple of representative workloads. In this case I used an apitrace of a single frame of Dota 2, and I used all the shaders we know about in existence from a separate project called ShaderDB; I'll talk about both of these a little bit more in a moment. Then I collected a variety of kinds of data: counts of reachable nodes in the IR, peak memory usage, and how our data structures are using memory. I'll get to each of these in more detail shortly.

apitrace is a really, really useful application for recording traces of the OpenGL, and I guess DirectX, calls made by an application. You can basically run your graphics application inside apitrace, every GL call it makes gets recorded, and you get this sometimes really, really giant, and by really giant I mean several gigabytes, dump of "here's everything you did, ever." You can then trim that down to a small set of frames, or a single frame, and replay it back later. Almost always, when we get a bug report in Mesa about any of the drivers, "this frame is misrendered in this game," or "shadows don't show up," or whatever, the first thing we ask is "can you give us an apitrace?", because it's way easier to replay that small thing than to get the game, which we might have to go out and pay 50 bucks for. That's not so much of a problem for Intel, because we can go out and buy it, but someone working on Nouveau or one of those drivers may be a hobbyist who can't just run out and buy every game so they can reproduce a bug in it. Also, like all apps these days that force updates on you, the bug that appears in version X of the game might magically not be reproducible in X plus one, but the problem still exists in the driver and we'd like to fix it. The apitrace gives us that reproducibility.
The other project we used is ShaderDB. There are two ShaderDB repos, a public repo and a private repo, and between the two of them it has, as far as we know, every shader we've ever seen in anything. In the public repo we've imported all of the shaders from various open source projects, and from a small number of closed source projects that have explicitly given us permission to put their shaders in the repository. In our private repo is basically every shader from everything you can get for Linux on Steam. The last time I checked, which was right around the X Developer Summit last year, so October-ish, there were on the order of 50,000 shaders between the two repositories, so it's a pretty good corpus of real world shaders.

OK, so to collect data from this, one of the first things I did was instrument my code so that once the application has been run and gets to the point where all the shaders have been compiled, the driver just iterates through all the shaders and dumps out information about every reachable node in all of the IR. We can collect up counts: here's all of the shader IR, there's a million IR variables, 50,000 IR assignments, etc., and so we can gauge which types are eating up all the memory, and use that to direct the later optimization efforts. We make pretty heavy use of the visitor pattern, so adding support was really easy and took very little code to add all this logging, and it gave a pretty good idea of which IR nodes were most frequently used and represented the largest memory usage. That information is good, but it doesn't tell the whole story, and in some cases it can somewhat fib to you. I also needed to know the actual memory utilization of the application, and, as in all things memory related, Valgrind for the win. In particular, the Massif tool gives really good information
about exactly how much memory is used at any point in time by the application. There are two things about it that are important for this case. One, you really want to collect data for both 32-bit and 64-bit if you run in both environments. In our case most games that we care about are still 32-bit applications, so we need that data, but collecting data for 64-bit is also a good sanity check, and each has different alignment and padding rules, so something you do that improves memory usage on 64-bit might not make any difference on 32-bit, and vice versa. The other thing Massif was really good for is that it gave before and after data for putting in commit messages to justify individual changes.

I want to give a quick peek at what that looks like. Where's my mouse... there. When you have a trace you've collected from Massif, it generates an output file, and the important bit is you get these detailed snapshots of the memory usage at any point in time. For me, I cared about finding the peak memory usage. So right here I can see that after 40 billion instructions had gone by, I was using a total of 71 megabytes of heap; 66 megabytes of that was useful, and another 5.1-ish megabytes were padding and extra junk used by the allocation system. Let me get back to where I was.

Then, once I had found where the memory was going, and which particular data structures were responsible for the most usage, I could go through and try to micro-optimize them, essentially, and the pahole program was exceptionally useful for that. You can run it, and actually I have that ready too, so I can look at one of my data structures, and it will show exactly where there are bits of padding and unused areas of data, so you can see the wasted bits in your data structures and try to rearrange things to better utilize and plug those holes. The one thing you have to be aware of
with this is that sometimes it will lie to you a little bit when you're using C++. It will tell you things like, in a derived class, "oh, you have this giant hole here," when really that giant hole is all the space utilized by the base class. So it's kind of fibbing to you. It wasn't entirely obvious at first, and I wasted a couple of hours wondering why the compiler was doing something so terribly stupid before I realized it was just the tool kind of lying to me: there is a hole, but it's a hole that's full of useful stuff. And again, this is another case where you need to collect data for 32-bit and 64-bit, because the padding rules are different, so a data structure might be tightly packed on 32-bit but have a hole on 64-bit.

One of the first things we came across, even with our mark and sweep, was additional things that we could release earlier. The biggest one: the OpenGL Shading Language has a bunch of built-in variables, so every single shader has access to a bunch of these built-ins, but most applications, or most shaders, will only use one or two of them, and there will be four dozen extra variables just sitting around unused. Previously we weren't very aggressive about removing those, because we kind of didn't care. So I added a pass that would go through and delete these, and going by the count-the-nodes metric this made tremendous improvements in our memory usage. But then when we went back and sanity-checked those results using Valgrind, it said: no change at all. So what the heck is going on there? This is why you need two different kinds of metrics to double-check each other. We found that all those symbols were still reachable through the symbol table, and the symbol table was actually the memory context that owned them, so even though we had removed the declaration from the IR, the variable still existed. We had this kind of pseudo-leak. It wasn't really a leak, because once the owning context got destroyed all that memory would go away
but in the interim it should have been gone much sooner. This is a problem that can occur with actual garbage collectors too, where you have things that are reachable that shouldn't be reachable. When we fixed that, by basically destroying the symbol table and then reconstructing it after all the optimizations, the results from Valgrind matched the results from counting the actual nodes.

A lot of the work was micro-optimization work, just going through and repacking the structures and rearranging things, and it was really, really tedious: death by a thousand cuts. I think there were about 40 patches of "oh yeah, I swapped these two things," or "I made this thing an int16_t instead of an int," et cetera. That was fairly boring, so I won't talk about it anymore.

One thing I found kind of interesting is that there's a lot of data where you would typically want to allocate some memory for a thing, but you can maybe store it someplace else that isn't dynamically allocated. In the case of ralloc there's a lot of overhead added by the memory allocator on every allocation, because it has to track where that allocation is in the free or allocated list, has to track who the parent context is, and a whole bunch of other things, and every memory allocator has some kind of bookkeeping overhead. So even if you allocate one byte, the actual amount of memory used is considerably larger than that, and by taking some small things and not allocating them you can get more memory savings than it seems like you should have gotten just from looking at the sizes you were passing to the allocator. On the other side, this means you may have a time-space trade-off: you save some allocations, but then you have to do some additional work to determine whether or not a thing is dynamically allocated. This was one of the maybe few cases where we were glad we had used C++.
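To illustrate the kind of repacking those patches did, here is a hypothetical before/after of the sort pahole guides you toward; the fields are invented, not Mesa's real IR structs:

```cpp
#include <cassert>
#include <cstdint>

// Before: declaration order leaves holes on a typical 64-bit ABI.
struct padded_node {
    uint8_t  kind;      // 1 byte, then 7 bytes of padding before `data`
    uint64_t data;      // 8-byte aligned
    uint8_t  flag;      // 1 byte, then 3 bytes of padding before `count`
    uint32_t count;
};                      // commonly 24 bytes on 64-bit

// After: same fields, ordered largest-first, so the holes disappear.
struct packed_node {
    uint64_t data;
    uint32_t count;
    uint8_t  kind;
    uint8_t  flag;      // only trailing padding remains
};                      // commonly 16 bytes on 64-bit
```

This is exactly the case where 32-bit and 64-bit differ: with 4-byte alignment for the 8-byte member, the "before" layout wastes less, so you have to measure both.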
You can hide all of this in getter and setter functions. Because we were using ralloc we didn't have to do anything clever in destructors; we didn't have to check in the destructor, "should I free this or not?", depending on whether it was actually allocated, because if it was allocated it'll eventually go away when its context goes away. But we did have to be clever in clone methods, to not allocate a copy of something that wasn't itself allocated.

So where can you store things if you're not always going to dynamically allocate them? One place, and this made a few people cringe, is that you can store things in otherwise dead space. If you have a class hierarchy where some base class is going to end with some odd alignment, and then some derived class follows, then before that void* there's going to be some space, because it has to be padded out. You can't normally get at that space, but you know it's there and you know how big it is, so you can pad it out explicitly in your base class, and now you have some space: on 64-bit that's 7 bytes of stuff there. And if in your base class, instead of being a void*, that was a char*, and you've got a lot of strings, 7 bytes might be enough for most of your strings. Then in the derived class you could point that char* at the padding storage and store your string there, and not pay what I think is ralloc's 44 bytes of overhead for your short string. We had a bit of healthy debate about this in Mesa, and I think the general consensus, given how the rest of the compiler was architected, was no. So we didn't do this, partially because we came up with a different method that made it largely irrelevant.

What we went with is what, for lack of a better term, I'm going to call static flyweights for some really common data. For example, pretty much every C program you ever encounter has a variable named i, and probably one named j.
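As an aside, the dead-space trick just described, the one we decided against, might have looked something like this; all names are invented, and this is only a sketch of the idea:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Base class that explicitly claims the bytes the ABI would otherwise waste
// as padding before the pointer-aligned members of derived classes: on a
// 64-bit ABI, 7 bytes after the 1-byte tag. Short strings live in that
// space; only long ones need real storage. The setter hides which case
// you're in, which is why C++ accessors made this palatable at all.
struct node_base {
    uint8_t tag;
    char small_name[7];    // would-be padding, reclaimed as storage
    const char* name;      // points at small_name, or at external storage

    void set_name(const char* n) {
        if (std::strlen(n) < sizeof(small_name)) {
            std::strcpy(small_name, n);   // fits: no allocation at all
            name = small_name;
        } else {
            name = n;                     // real code would allocate a copy
        }
    }
    bool name_is_inline() const { return name == small_name; }
};
```

A clone method would have to check name_is_inline() and copy the inline bytes rather than blindly duplicating a pointer, which is the kind of cleverness the text mentions.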
If you've run out of ideas it probably has foo and bar and a bunch of other really common names. Well, GLSL is the same way: you can look at all of the GLSL shaders ever, and there's a bunch of name strings that come up really, really frequently. So we could store all those strings in a static table inside the driver itself, and when you encounter a variable that has one of those names, just point at that instead. Don't allocate anything; just point at this static thing.

The first step in doing that is figuring out which names you should actually put in, and it would be great if you had some giant corpus of shaders you could mine for these. Oh yeah, we've got that. So again I instrumented the driver: at the end of compiling and linking a shader it just runs through it and prints "var name" and then the name to standard error, and then I ran that through a really simple pipeline, and now I've got a count of how many times each variable name occurs. And actually, if you throw another sort -n on the end, it comes out in order, and you can see: aha, here are the top 500 names that every shader uses. It turns out that a bunch of the names are used thousands and thousands and thousands of times across all the known shaders. So if we put those names in the static table, then we don't need the clever dead-space hack, because most of the really commonly used names are short anyway.

This gets into time-space trade-off territory in a little bit of a bad way, because now you've got this table, I think I ended up with a thousand or so names, that represented 90% of the names used more than once in all of ShaderDB, and even doing an actual hash lookup on that every single time you create a variable comes with some cost, and I didn't necessarily want to pay that cost. The cost shows up at the point where, after you've done the hash calculation, you look into the table and do the strcmps.
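The static flyweight table, with exactly that strcmp cost, might look like this; the tiny table here is illustrative, while the real list was mined from ShaderDB and held on the order of a thousand names:

```cpp
#include <cassert>
#include <cstring>

// Common GLSL variable names stored once, statically, inside the driver.
// A variable whose name appears here points at the static copy and
// allocates nothing. (Illustrative entries only.)
static const char* const common_names[] = {
    "gl_Position", "gl_FragColor", "color", "normal", "uv", "i", "j",
};

// Return the static copy of `name` if we have one, else nullptr, in which
// case the caller allocates a copy exactly as before.
inline const char* find_static_name(const char* name) {
    for (const char* const s : common_names)
        if (std::strcmp(s, name) == 0)   // this is the cost being discussed
            return s;
    return nullptr;
}
```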
So, what I settled on: you pretty much can't get away from having the hash calculation, but I settled on using a really clever data structure called a Bloom filter, and this is basically the trivial reject. The way a Bloom filter works is you have a very large number of bits and some small number of hashes. You take your data, you run each of your hashes on it, and each gives you some number; so now you have n numbers, and you look in your bit set, and if all n of those bits are set, and if you've properly tuned n and m, you have an extremely high probability that the data you've hashed, that you're now looking up in the bit set, is actually in the set you care about. There are very formal methods for figuring out, based on your hashes and your data set, how many bits you should use, but that kind of gave me a headache; I'm not much of a formal methods kind of guy. So I did it experimentally, using all of the names in ShaderDB, and settled on 8192 bits, so one kilobyte for my bit set, with one explicit hash and a second implicit hash. Running over ShaderDB, out of about 6.7 million name lookups there were 161,000 names that hit in the Bloom filter, and out of those only 931 were not actually in the set, so it was much less than a 1% false positive rate, which seems pretty good. The hash calculation was extremely trivial, and it basically primed the data into the cache, so that when you then had to do the strcpy anyway to put the name into the allocation, the time taken for the hash calculation just disappeared in the noise. There are a couple of patches that are actually still out on the mailing list, these kind of got stuck behind another patch series, so you can actually see the really, really simple implementation of the Bloom filter and of the static lookup.
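A sketch of the Bloom-filter trivial reject under those parameters, with an assumed FNV-1a hash rather than Mesa's exact one, and with "one explicit hash plus a second implicit hash" modeled as two index fields of a single hash value:

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// 8192-bit (1 KB) Bloom filter used to reject names that are definitely
// not in the static table before paying for the real table lookup.
struct name_filter {
    std::bitset<8192> bits;

    static uint32_t hash(const char* s) {
        uint32_t h = 2166136261u;                 // FNV-1a, 32-bit
        for (; *s; ++s) {
            h ^= static_cast<uint8_t>(*s);
            h *= 16777619u;
        }
        return h;
    }

    void add(const char* s) {
        uint32_t h = hash(s);
        bits.set(h & 8191);            // first index: low 13 bits
        bits.set((h >> 13) & 8191);    // second index: next 13 bits
    }

    // false means "definitely not in the table"; true means "probably in
    // the table", and only then do we do the lookup and the strcmps.
    bool maybe_contains(const char* s) const {
        uint32_t h = hash(s);
        return bits.test(h & 8191) && bits.test((h >> 13) & 8191);
    }
};
```

Bloom filters never give false negatives, so a name that was added always passes; the tuning question is only how often a name that was never added sneaks through.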
So, the results out of the whole thing, for both 32 and 64 bit, on the Dota 2 trace: it started off at about 67 megabytes running on 32-bit and about 106 megabytes on 64-bit, and afterwards it was 65 and 92. So running that trace, which included the textures, the vertex models, and everything else involved in the trace, we cut 13% of the whole thing just by reducing the memory usage of the compiler, which was pretty significant. It also made the debug build of Dota 2, the one that compiled all the shaders ever, actually runnable: I think we were able to run it reasonably on a 2 gig machine, whereas before it was falling over on a 4 gig machine. So that was very nice. We still have a bunch of problems on the virtualized systems, and this is ongoing work; there are a bunch of other places where we can still trim some memory usage, but these were the results we had at that point.

One of the big areas we have left: one of the base classes of sort of the whole shebang is a linked list class, so everything can be in a linked list, even though most things will never actually be in a linked list. We're sort of dragging our feet on doing the transition away from having everything be a linked list node, where we'd just box the things that actually need to be in a linked list. It's really, really unpleasant to have to go through and make all of that change, so we're kind of "oh look, here's something way more interesting and more useful to work on, we'll go do that and do this other thing later, because yuck."

The other thing that came out of this that I found really useful: if you don't know about git rebase -i -x, you should, because it is maybe the best thing about git rebase ever. In an interactive rebase, after each commit gets applied, you can have some command run, and depending on the output of that command the rebase will either stop there or continue on. The common, and sort of intended, usage is that you're doing some giant rebase of your 150-patch series and you want to rebuild after
each patch, to make sure you haven't broken the build. What I used it for is that after each of my patches, I ran my apitrace under Valgrind, scraped the output, and this bit of text right here collected the before and after results and generated a big file with the short log message of each commit and then the data it had just scraped. So I did the giant rebase while I was at lunch or something, then came back and did another rebase and cut and pasted the before and after data into the commit messages. Then in your commits you have a nice "here's the data to justify doing this particular change," which I think more projects should require: before and after justification data in commit messages. It really is exceptionally helpful when someone comes back two years later, looks at some crazy optimization, and says "why did they do this?", and you've got some actual concrete justification: it helped this app in this particular way, and here's some data.

OK, so it looks like we'll probably actually be able to go off to lunch early; I don't think that will make anyone too terribly sad. Thank you for attending. Are there any questions? Wait for the mic.

Question: when you started reducing the memory usage, did you notice any actual runtime throughput improvements, due to better cache locality, those sorts of things?

You know, I didn't, but primarily because I wasn't measuring it, and certainly when I was running under Valgrind the traces took forever, because it adds heaps and heaps of overhead. I don't think there would have been a lot of improvement, though, just because of the way our compiler is structured: we have a lot of individual passes that each run over everything, so by the time you get to the next optimization pass, things that had been pulled into the cache early in the previous pass have been blown out of the cache already. So I
wouldn't expect to see a lot of benefit from that on non-trivial applications.

Question: so when you did these optimizations for memory, did they have any effect on the CPU usage?

Yeah, that was what he was asking, and other than a small amount of noise added by doing the hash lookups, I didn't notice anything. But like I said, I wasn't really measuring for that, and with most of the methods I was using to measure changes, the instrumentation added enough overhead that the instrumentation was all the time. Part of the reason why I used a very small trimmed trace of Dota 2 is that even that, on my fairly recent laptop, took about 7 minutes to run, just because Valgrind adds so much overhead. That was also why I automated collecting the data: you just start the whole process running and go home and go to bed or something, and when you come back it's surely done.

I think that's it. Thank you for speaking; as a show of gratitude we have a gift. Thank you.