Okay, so I think it's a good time to start. This talk is about a successful and a not-yet-successful optimization in Valgrind. The first question you might ask is how we classify the successful and the not successful. A colleague at my work said that the successful ones are those I am writing, and the not successful ones are those the others are writing. But no, that is not the classification I will use here, because I will speak about two optimizations that I did: one has been committed to the SVN repository, and the other one is not committed, not committed yet.

So, the contents: I will speak about two optimizations. The first is the generalization and optimization of the Massif execution tree concept. I don't know how many people here attended the talk this morning where I described the functionality of the xtree; here we will speak more about its internal implementation, and as we have seen this morning, this has been committed. Then I will speak about another optimization, which is to optimize the stack trace recording done in Valgrind; this one is not committed yet, and we will examine why.

So, the Massif execution tree, quick reminder: it maintains a heap profile, and this heap profile associates the allocation stack trace with the allocated memory size. For each allocation, what does Massif do? It has to get the stack trace, insert it in the xtree if this stack trace is not yet present, and then add the newly allocated size to the stack trace memory size. And because it has to be able, when we do a free operation, to retrieve this stack trace in order to decrease the memory attributed to it, it also adds the allocated block, together with the pointer to the stack trace stored in the xtree, to a hash table. This is done at malloc time so that, when we do the free operation, we can retrieve the stack trace that allocated the memory and decrease the bytes currently allocated by this stack trace. So for each deallocation, using this hash table, we subtract the size that has just been released from the stack trace memory size. This is, basically, the Massif algorithm as it is currently done.

Here I will explain it in more detail, working on this example: we have the function main, which is calling X, A, O, and then X again; you see X is calling Y, and Y is calling malloc. By the way, you see this memory will be leaked, but that's not the point here. There is a similar call chain with A calling B and B calling malloc, and then you have function O directly calling malloc. With such calls we end up with such an xtree. In Valgrind 3.12, the way the tree is represented is quite classical: you have a top node and then you have children, and a stack trace such as main calling A calling B calling malloc is stored in the tree with pointers in this direction, from the top down. But we also have to store pointers in the other direction: when we add a stack trace to the tree we add it top-down, but when we free the block which was allocated by that stack trace, we have to start from the bottom, in order to decrease the sizes of all the functions which are in the call stack. So this is a classical tree representation with pointers to children, but we also need back pointers: in Valgrind 3.12 we have back pointers in the tree, to go up and down in the tree.
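To make that flow concrete, here is a minimal C sketch of the 3.12 bookkeeping just described; all type and function names are illustrative, not the real identifiers from massif/ms_main.c:

```c
/* Sketch of the Massif 3.12 malloc/free bookkeeping described above.
   All names here are illustrative, not the real Valgrind identifiers. */
typedef struct XPt_ {
    struct XPt_*  parent;   /* back pointer, to walk up on free()       */
    unsigned long szB;      /* bytes currently allocated at/below here  */
    /* children handled as in the data structure shown next             */
} XPt;

typedef struct { XPt* xpt; unsigned long szB; } BlockInfo;

extern XPt*      xtree_insert_current_stacktrace(void); /* unwind+insert */
extern void      blocks_put(void* block, BlockInfo bi); /* block -> info */
extern BlockInfo blocks_remove(void* block);

void on_malloc(void* block, unsigned long szB) {
    XPt* xpt = xtree_insert_current_stacktrace();
    for (XPt* p = xpt; p != NULL; p = p->parent)
        p->szB += szB;                    /* propagate size up to the top */
    blocks_put(block, (BlockInfo){ xpt, szB });
}

void on_free(void* block) {
    BlockInfo bi = blocks_remove(block);
    for (XPt* p = bi.xpt; p != NULL; p = p->parent)
        p->szB -= bi.szB;                 /* same walk, using back pointers */
}
```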
So the data structure in 3.12 which stores this is shown here on the screen. We have the code address of such a node, of a certain call in the stack trace: for example, this might be a program counter inside main, or inside function A, or inside function B. In 3.12 terminology, such a node is an execution point, one of the small rectangles that I showed. We store here the size for this node; for a bottom execution point, the lowest level in the tree, it is the size for that precise stack trace. If I go back to the previous slide: these are bottom execution points, and for a non-bottom execution point like this one, the size is the sum of the sizes of the execution points which point back at it. Then we need the children: a pointer to an array of pointers to children. Because each time we add a new stack trace we might have to expand the number of children, this array is made bigger than needed and expanded from time to time: so we have the real number of children, and the maximum number of children we can add, because the size of the array is doubled each time it becomes too small. That's the data structure in 3.12.

Massif snapshots: how are they implemented in 3.12, based on this? Massif snapshots record the evolution of the allocated memory. We have summary snapshots, just the total memory size, and detailed snapshots: in 3.12, a detailed snapshot is a trimmed-down copy of the Massif xtree, this data structure with pointers and so on. When Massif 3.12 takes a detailed snapshot, it cuts bits and pieces off this tree; effectively, the non-significant stack traces are aggregated. If you use Massif, you know that by default it shows the significant stack traces with full details, but when a stack trace has allocated less than a certain percentage of the memory, Massif regroups all of that into what is called a non-significant stack trace. So when Massif takes a snapshot, it copies the xtree and then groups together the things which are not significant, in order to reduce the size of the snapshot.

So in 3.12 we have, in fact, a parallel data structure which is used to store the snapshots. Next to the execution point we have the snapshot execution point, which has things we recognize: the number of children and a pointer to an array of pointers to children. Here we no longer need the max number of children, because this is a snapshot: it will not grow anymore, it will not change anymore, so we do not need to oversize the array to avoid permanent reallocation like in the live tree. And then we have something here for the non-significant part: the snapshot execution point records the things which are insignificant. So in 3.12 we are using a classical tree representation with bidirectional pointers, plus a kind of parallel structure which is derived from the full xtree when taking a snapshot.

The Massif output, ms_print: you recognize here the production, and you see its structure matches the data structure we have seen; for example, we have here an entry "in one place, below massif's threshold (0.25%)". Basically this output is just produced from the snapshot xtree data structure that I explained on the previous slide.
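In code, the two parallel structures look roughly like this; a simplified sketch, with stand-ins for Valgrind's basic types, and field names only approximating those in massif/ms_main.c (the real snapshot node is a bit more involved):

```c
typedef unsigned long Addr;   /* stand-ins for Valgrind's basic types */
typedef unsigned long SizeT;
typedef unsigned int  UInt;

/* Live execution point (3.12): grows as new stack traces arrive. */
typedef struct XPt_ {
    Addr          ip;            /* code address of this frame            */
    struct XPt_*  parent;        /* back pointer: walk up on free()       */
    SizeT         szB;           /* bytes allocated at/below this point   */
    UInt          n_children;    /* children actually in use              */
    UInt          max_children;  /* allocated slots; doubled when full    */
    struct XPt_** children;      /* over-sized array, grown on demand     */
} XPt;

/* Snapshot execution point: frozen, so no max_children, but it carries
   the aggregated "insignificant" part.                                   */
typedef struct SXPt_ {
    Addr           ip;
    SizeT          szB;
    UInt           n_children;
    struct SXPt_** children;     /* exact-sized: snapshots never grow     */
    SizeT          insig_szB;    /* sum of the trimmed stack traces       */
    UInt           insig_n;      /*   below the significance threshold    */
} SXPt;
```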
So now, I have quickly described how it worked in 3.12; let me describe the objectives of the new implementation. Of course, the first objective was to still support Massif heap profiling: it would be a little bit sad if the new stuff did not support Massif. The other thing is that I also wanted heap profiling reports to be supported by other tools: I wanted to generalize the memory reporting. And a sub-objective of allowing other tools to produce memory reports was that if you don't use the new feature in, for example, Memcheck or Helgrind, there must be no CPU impact, or an ultra-minimal impact: if you don't use the new xtree functionality with Memcheck, Memcheck is not impacted.

We also wanted to generalize what is recorded: the new functionality should be usable for more data than only the allocated memory. As we have seen this morning, we can use it for the number of blocks, the total freed blocks and so on; when we look at memory reports, it can record more than just the allocated bytes. And it should also work for totally different types of data: for example, we can use the same data structure to represent leak information, or system call information, as we have seen with the experimental syscall profiling this morning. So an objective was to generalize the data structure so that it can record something else than a size: a Massif execution point only records a size, while the new implementation can record various things.

I also wanted to have more output formats: Massif 3.12 only produces the execution tree in the Massif output file format, and one objective was to output the same kind of data in the Callgrind format, readable by KCachegrind, because that format is not Massif-specific and is usable by other tools. In 3.12 the execution point data structure was specialized for Massif and was part of Massif; the new xtree module is part of coregrind and can be used by other tools and other modules. I also wanted to reduce the memory used by the data structure: if we want to use it for more things, in Memcheck, in Helgrind, or for example for syscalls, and record more data, it is nice to use less memory for this data structure. And the final objective is to achieve less memory and less CPU together with all these generalization objectives.

So how did I try to achieve this? The first observation is that many tools, like Memcheck and Helgrind, already record a stack trace for each allocation, and this stack trace is recorded as an ExeContext. An ExeContext is a stack trace, but stored in a hash table, and it has an ECU, an ExeContext unique id, which is always a multiple of four. Once an ExeContext is created, it is stored in this dictionary of ExeContexts, this hash table, and never removed: during a whole run we only ever add ExeContexts to the hash table, we never remove them, so the ECU is a durable identifier. The idea is to reuse the ExeContexts to store the xtree stack traces. This is what allows Memcheck and Helgrind to use the xtree without incurring additional cost during, let's say, normal operation: the ExeContexts are there, they are created in any case, and they can be reused for other things.
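For reference, this is roughly how a tool obtains an interned stack trace; a sketch based on the coregrind API in include/pub_tool_execontext.h (the exact signatures may differ slightly from what I show here):

```c
#include "pub_tool_basics.h"
#include "pub_tool_execontext.h"

void example(ThreadId tid) {
    /* Unwinds the client stack (or returns the already-interned copy)
       and stores it, once and forever, in coregrind's hash table.     */
    ExeContext* ec = VG_(record_ExeContext)(tid, 0 /*first_ip_delta*/);

    /* The unique id: a UInt, always a multiple of 4, never reused.    */
    UInt ecu = VG_(get_ECU_from_ExeContext)(ec);
    (void)ecu;
}
```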
Now, some observations about application behaviour which help to produce an optimization. Most applications are doing a lot of calls to malloc and free, and most applications exit only once. I don't mean that some applications exit twice; what I mean is that some applications never exit, but at most an application exits once. And the Massif reports, and more generally all the tree-based reports that we want to do, are produced at termination time. So, what is nice if you want to reduce the CPU usage is to decrease the CPU that is spent when we do malloc and free, and push this CPU as much as possible to the reporting side, at the time we do the reports, so at termination or on demand. The idea to build speed is: minimize the work for malloc and free, or more generally, minimize the work done when recording an event.

So, the first thing done to optimize the CPU: the new implementation only maintains the data for the leaves. Massif 3.12, each time it was allocating a block, was increasing the size of the leaf execution point and always propagating that up to the top. The new implementation does not do that: it only stores, associated with the ExeContext, the data which is being recorded; for example, it just increments the allocated bytes and the number of blocks for each malloc or free. The additions that we have to do in order to produce the Massif or KCachegrind report, the ones which say that main has allocated 20 megabytes because it calls f1, which has allocated 15 megabytes, and f2, which has allocated 5 megabytes: we don't do these additions when we do malloc and free, we do them when we produce the reports.

So how does the data structure look in 3.13? The basic idea is that we have an array of data. The first difference compared to the execution point that we have seen in the Massif 3.12 tree is that the data is stored in a simple array, which says: here is a number of allocated bytes, here is a number of allocated bytes, and so on. This is associated with the ExeContexts using two other arrays. You remember that an ExeContext has a unique id, which is an integer multiple of four. We have one array which stores pointers to ExeContexts, and we have another array used to retrieve the offset at which the data of an ExeContext is stored. Imagine, for example, an ExeContext whose ECU is 8; 8 is the unique identifier of this ExeContext; we divide it by four and obtain 2, and at index 2 of this array we record the offset into the arrays where we store the pointer to the ExeContext and the data. So the ECU divided by 4 gives the xecu: the xecu is a new unique ExeContext identifier, which is in fact the offset into this xtree data structure.

The typical work for an allocation using this data structure is: get a stack trace and search the ExeContext in the hash table, which is part of a standard module of coregrind. And the nice point is that Memcheck and Helgrind already have the ExeContext, because they already record it for each allocation, so there is no need to take a new stack trace: for Memcheck and Helgrind, no additional unwind is needed. Then we obtain the xecu, the offset into this data structure, by indexing the first array with the ECU of the ExeContext divided by four, and once we have retrieved this offset, we can just add the size that has just been allocated to the data stored there.
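Putting the three arrays together, here is a sketch of the layout and of the per-allocation lookup; everything here (the names, the XECU_NONE marker, the elided growth handling) is illustrative, not the real coregrind xtree code:

```c
#include "pub_tool_basics.h"
#include "pub_tool_execontext.h"

/* Sketch of the 3.13 xtree layout described above. */
typedef struct {
    UInt*        ecu2xecu;   /* indexed by ECU/4: gives an xecu (offset),
                                or an "unseen" marker; one UInt per ECU  */
    ExeContext** ec;         /* xecu -> ExeContext of that stack trace   */
    void*        data;       /* xecu -> tool data (bytes, blocks, ...)   */
    SizeT        dataSzB;    /* size of one tool data element            */
    UInt         n;          /* entries used; arrays grow by doubling    */
} XTree;

#define XECU_NONE ((UInt)-1)

/* Per-allocation work: no tree walk, just two array indexings. */
void* xtree_data_for(XTree* xt, ExeContext* ec) {
    UInt idx  = VG_(get_ECU_from_ExeContext)(ec) / 4;
    UInt xecu = xt->ecu2xecu[idx];       /* growth/range checks elided */
    if (xecu == XECU_NONE) {             /* first time: append an entry */
        xecu = xt->n++;
        xt->ecu2xecu[idx] = xecu;
        xt->ec[xecu] = ec;
    }
    return (char*)xt->data + (SizeT)xecu * xt->dataSzB;
}
```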
Of course, these arrays start with a small size and are expanded when we have new ExeContexts that need an entry: this reallocates this array, and similarly the other arrays are reallocated. Note that only the ExeContexts which are really used by an xtree get an entry in the two bigger arrays; this avoids these arrays becoming sparse, because they would otherwise contain a lot more data. The ECU-indexed array contains just a single integer per ECU, while the others contain a pointer and data whose size depends on what you are using the xtree for: for example the number of bytes, the number of blocks, and so on; and for a full xtree it might be a lot more data.

Based on this, implementing snapshots is very easy: the only thing we need to copy is the data array, because the array of ExeContexts can be shared between snapshots. That is a big change compared to Massif 3.12 and its two data structures: one data structure for, let's say, the live execution tree, which was full and contained all the stack traces, and then a detailed snapshot which copied the complete structure, truncating some bits and pieces in order to reduce the memory used by the snapshot to the relevant part; that truncation was needed to not use too much memory. Here, because the only thing we have to save is the data, a detailed snapshot does not have to truncate the, let's say, non-significant part: we can just copy the data array.

A snapshot, or the live xtree, can be output in Massif format or in Callgrind format, as I explained. The output in the Callgrind/KCachegrind format is very simple: we scan the ExeContexts in the xtree, we output each caller/callee pair, and we output the data. So producing the Callgrind format is really easy. Producing the Massif format is slightly more complex; here is the algorithm which is used to output an xtree in Massif format. The first thing is that we have to sort the xtree ExeContexts by their array of program counters. What we want is: at depth zero, gather together all the ExeContexts which have the same program counter at depth zero; then, inside a group of ExeContexts which have the same program counter at depth zero, sort again by the program counter at depth one; and so on. In other words, at level zero everything is sorted on the program counter, and then at level one, within each subgroup that has the same program counter, we again sort by program counter. In fact, this sort does not depend on a snapshot: the order of the ExeContexts will always be the same from one snapshot to another, so it is computed once, kept, and possibly updated when new ExeContexts appear.

Then, once this is sorted, we process the sorted ExeContexts at a given depth in subgroups which have the same IP, as I explained, and which have the same parent, and for each group we compute the value of its ExeContexts. By the way, this is the place where we do the additions: we only do the additions up the tree, if you want, when we produce a heap report. If you recall the tree, Massif 3.12 was doing add, add, add and subtract, subtract, subtract all the way up the tree on every malloc and free; the new implementation does a single add or subtract in one single place, and the summing up happens only at report time.
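As a sketch of that report-time-only aggregation, reusing the XTree layout from the sketch above; AllocData and the helper are hypothetical:

```c
/* Example payload recorded per stack trace (illustrative). */
typedef struct { SizeT allocated_szB; SizeT n_blocks; } AllocData;

/* Report-time aggregation (sketch): for a group of xecus that share the
   same IP prefix, the total is summed here, once per report, instead of
   being maintained on every malloc/free as Massif 3.12 did.            */
SizeT group_total(const XTree* xt, const UInt* xecus, UInt n) {
    SizeT total = 0;
    for (UInt i = 0; i < n; i++) {
        const AllocData* d = (const AllocData*)
            ((const char*)xt->data + (SizeT)xecus[i] * xt->dataSzB);
        total += d->allocated_szB;
    }
    return total;
}
```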
This is one of the things which makes it better for CPU. So, when we have, for each group, the value of all the ExeContexts which are members of this group, we sort the groups by total size, because a Massif report is sorted with the bigger group at the beginning and the groups which have allocated less after; and for each group sorted by total size, we output the function and the total, and then we recurse: we output the ExeContexts of the group, one depth further. There, I managed to put the algorithm on one slide, but if you look at the code, it does not fit on one page.

Some stats and measurements based on this. First, what is the size of the new xtree module? This has added about 1400 lines of C code, comments and so on included, and slightly more than 400 lines of .h, which describe the interface to the new xtree module. The size of the Massif code has decreased: 500 lines less. And what is interesting is to see how much code we have to write in order to have a tool use the new module: as an example, to have Helgrind be able to produce a memory report based on the new xtree, it is 42 lines of code. So you see, by making a module here, which is more code than we had before, we have it usable by other tools with a very small amount of additional code, and we have of course additional functionality, because it is more general.

Now some real-life performance measurements. This has not been measured on a very fast machine. I did the measurements with a graphical map application, which starts, loads a lot of data, then I did a refresh, and then it exits. It is doing 2.5 million calls to malloc, 1.5 million frees, and we have in total in this application 85,000 different allocation stack traces and 35,000 different free stack traces, with a starting size of 24 megabytes. This is the application once started, showing all the objects that have been loaded and put on the map; of course the users are usually not showing that many objects, but it was just to make it do a lot of work: these are all the waypoints, the navigation points, the routes flown by the flights, the airspaces, the control spaces and so on; let's say, the air traffic.

So what is the performance impact of the xtree? First, comparing the Massif tool in 3.12, where Massif has its own data structure, with 3.13, where Massif uses the xtree as its data structure: in total, for this application, we went from 327 megabytes of Valgrind memory to 151 megabytes, and CPU-wise this application was running in 28 seconds and it runs now in 30 seconds. I explained this morning that you can also produce xtree memory reports with Memcheck: --xtree-memory=allocs just gives the currently allocated memory. So what is the overhead of using --xtree-memory=allocs during the run? No impact, no additional impact, zero impact. And of course, at the end, we produce the xtree file: for this application it is a file of 40 megabytes, produced in about 0.5 seconds. Now, if we run Memcheck with --xtree-memory=full, which does more work, because it has to record a little bit more information, this has an impact during the run, but it is a negligible impact, because, as I have explained, the ExeContext
is already captured, and the only thing which has to be done is a little indexing in this array and then some additions. It was not even easy to measure what the real impact was, because the variation between runs was higher than the impact; it is well below one percent. The xtree file for --xtree-memory=full is bigger, because we have more information: it is 120 megabytes, produced in about 0.5 seconds. Note that the equivalent Massif file would be a lot bigger, because the Massif file format does not have any compression features.

So, the conclusion about the xtree: the xtree module is easy to use for different kinds of data, you can use it for memory and for other things, we have more tools that can produce heap profiles, and Massif is faster and uses less memory. So this is a successful optimization, and it was committed.

OK, now let's switch to the other optimization; it is not committed, and it is in Helgrind. The --history-level=full option tells Helgrind to record two stack traces for a race condition: it reports the current conflicting access and the previous access. Here is an example: this is the current access, and that one is easy to obtain, because when Helgrind detects a race, it just unwinds the stack; but it also has to show the stack trace of the previous access to this memory. And you can have a lot of these stack traces: if you have ten threads which are running, maybe a lot of them have accessed a lot of the memory, and so Helgrind has to record a lot of stack traces to show this; the idea was to optimize that.

As I explained, Helgrind historically records many read and write instructions. I say many, not all, because there are some optimizations in Helgrind for reads and writes within the same thread segment: basically, Helgrind has to see the coordination between threads, and between two synchronization points you are inside the same segment, and as long as you are inside the same segment Helgrind can optimize the recording and not necessarily record every access. So the set of recorded stack traces is already quite optimized, but it still costs. Julian did some measurements, I think about two years ago, to evaluate the cost of the unwind on x86-64: to unwind one level of a stack trace, for example with main calling A, B, C, to detect that C was called by B, that is unwinding one frame, and it costs about 220 instructions. Remember that we might have to record a stack trace for a lot of reads and a lot of writes. It means that to record a stack trace for one read, which is one instruction, if you record 8 frames, you might have to execute something like 2000 instructions, and that is only to do the unwind; you still have to store it somewhere. You see, in Helgrind this is really a heavy activity, and it is storing a lot of these stack traces.

So what is the cost of this --history-level=full? Again, I used my graphical application, which is a multi-threaded application, by the way. With --history-level=approx, which just records the segment start and segment end for the race report, because a stack trace then only has to be recorded when we have a synchronization event, it runs in 46 seconds. When we use --history-level=full, we have about 27 seconds of additional time: with this application, about 120 million stack traces of 8 frames have to be computed, and the measurement I did was that 20 seconds of this additional time is used to compute the stack traces, the unwinding, and 7 seconds is needed for the overhead of storing them somewhere in the specialized data structure. So we see that a lot of the cost of the history mechanism is
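The example function discussed next is essentially this; my reconstruction of the slide, in the spirit of the race example from the Helgrind manual:

```c
/* Reconstruction of the slide's example: a child thread does an
   unprotected increment, which Helgrind reports as a race.       */
static int x;              /* shared with the parent thread, no lock */

void* child_fn(void* arg) {
    x++;                   /* read x, add 1, write x back */
    return arg;
}
```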
unwinding, and a part of the cost is the storage.

So, recorded stack traces: let's take the example for which I have shown the race report. We have here x++, an unprotected modification. The question is: how many stack traces are recorded for this child function? We have one access; do you have an idea how many stack traces are recorded by Helgrind for this? I see that Julian is thinking... Not really? You might imagine that there is one access and so there will be one stack trace, but it is not so. Here is the code, and I have marked where a stack trace is recorded. This function is called, and when we call this function we write the return address on the stack; that is a tracked write, so we are recording a stack trace; yes, I think this is correct. Then we push RBP, so we write on the stack, and this one is not recorded, because, as I explained, there are some optimizations: Helgrind sees that we have got two writes at the same place in the same segment; there is a bunch of optimizations, which I do not fully understand, and so for this one we have a write but no recording. Here we have a write on the stack again, no recording. Here we are recording the read of x; then we add one to the register; then we write, and we record the write of x. Then we return, and here there is no recording because it is on the stack, but here, when we do the read from the stack, we again record the read. So for such a simple function, which is a few instructions, we have to do something like four recordings, each of 8 frames, times 220 instructions per frame. You can understand why Helgrind runs somewhat slower than native code.

Now, the creation of a stack trace costs many instructions, yet most instructions, in fact, only change the top instruction pointer. That is a nice observation, and that is the basic idea with which I have optimized the stack trace recording. The optimization idea is to cache the last computed stack trace; if the cached stack trace is valid, then the new stack trace is the cached stack trace, and we just replace the last instruction pointer of the stack trace by the current instruction pointer. Of course, that is only valid part of the time: the cached stack trace is invalidated by any instruction that changes the control flow, for example a call, a return, and so on.

So where do we store the "cached stack trace valid" flag? It is stored in a shadow register: Helgrind does not use the shadow registers, so I can use them, and I decided to use the stack pointer's shadow register. If it is 0, the cached stack trace is invalid; if it is 1, the cached stack trace is valid. It is set to 1 when the Helgrind runtime computes a fresh stack trace, and then, at instrumentation time, the instrumentation assigns 0 to the shadow stack pointer for any intermediate representation exit kind which is different from boring (Ijk_Boring). Boring means: just go to the next instruction; anything else is a kind of jump, and a kind of jump means that we cannot just replace the last instruction pointer, so we invalidate the cached stack trace. So that's the basic idea, very easy to understand. Has everybody understood? Yes? So it was easy to understand.
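A minimal sketch of the runtime half of this idea; the names are illustrative, and in the real patch the valid flag lives in the guest stack pointer's shadow register, per thread:

```c
typedef unsigned long Addr;        /* stand-in for Valgrind's Addr */
#define N_FRAMES 8

extern void full_unwind(Addr* ips, int n);  /* ~220 insns per frame */

static Addr cached_ips[N_FRAMES];  /* last computed stack trace     */
static int  cache_valid = 0;       /* really: the SP shadow register,
                                      reset to 0 by the instrumentation
                                      on any non-boring (jump) exit  */

void record_access_trace(Addr cur_ip, Addr out[N_FRAMES]) {
    if (cache_valid) {
        /* No control flow since the last unwind: only the top frame
           can have changed, so just patch it.                       */
        cached_ips[0] = cur_ip;
    } else {
        full_unwind(cached_ips, N_FRAMES);
        cache_valid = 1;           /* valid until the next call/ret/jump */
    }
    for (int i = 0; i < N_FRAMES; i++)
        out[i] = cached_ips[i];
}
```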
So what does the cached stack trace give for my application? We compute about 20 million fresh stack traces; we have 19 million uses of the cached stack trace where the instruction pointer is exactly the same, so it was a loop; and we have 78 million uses of the cached stack trace where we replace the last IP. And, the last line, the CPU time goes from 72.6 seconds to 58.8 seconds. So that's a really nice improvement.

It's a nice improvement, and it's an idea that looks simple; so why is it so complicated? I should have finished the talk here, so that I could then explain why it is so complicated... I see I have 15 minutes, so all right, let's see why it is so complicated.

Well, the first question is how this patch is tested. It is tested with the regression tests and with a self-check: the self-check compares, for each stack trace, a brand new freshly computed stack trace with the stack trace computed with this cached stack trace update, and if they are the same, I am happy; if they are not the same, then there is probably a bug in what I have done. The regression tests and the self-check together uncovered the cases which have complexified the idea.

So, first: the unwind is not always correct. What does that mean? The unwind algorithm can give a wrong result in some cases, for example in a function prologue, in 3.12 and still in 3.13, because I did not commit this. A wrong stack trace is then only shown to the user for a race condition detected in the prologue; but with this caching optimization idea, such a wrong stack trace can be used for the whole function: if the function is just a bunch of straight-line instructions like that, the wrong stack trace computed in the prologue will be reused for the rest of the function. A consequence of this is that we need to have more correct unwinding at each instruction. When testing this patch, I discovered some bugs, or some approximations, in the unwinding, and I made improvements to the unwinding algorithm: so thanks to this patch, we have improved the unwinding.

Another thing, which is very special: "inside an instruction". Really bizarre; what does that mean? Let's look at the intermediate representation before instrumentation. We have here the push instruction, and we see how it is translated: we take the base pointer and assign it to t0; we compute the stack pointer minus 8 and assign it to t1; we assign the new value to the stack pointer; then we store t0 at t1; and then we go to the next instruction. After instrumentation, each write has additional code, a helper call, which may have to do an unwind: for this store, we may have to unwind and compute a fresh stack trace, if it was not in the cache. And here we are, in fact, inside the instruction: we have already changed the stack pointer, but the instruction pointer still points at the push. We have unwinding information in the debug info, but the unwinding information describes what to do at this instruction or at that instruction; it does not describe what to do in the middle of an instruction being done. And that is somewhat annoying. Again, it does not really impact 3.12, because this kind of situation is unlikely to coincide with a race report; but if we reuse this stack trace for the rest of the function, fixing just the last IP, it might become wrong.

So for this, I have added a feature: I have added properties to instructions, in this case "fix up the stack pointer". We mark the push-like instructions with a fixup property, and then the generated helper code will automatically fix the stack pointer, redoing the addition which compensates for the fact of being in the middle of the instruction.
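Here is what that compensation amounts to, as a sketch; the IR lines paraphrase the slide, and the helper, get_guest_SP and unwind_from are made-up names:

```c
/* The slide's IR for "push %rbp" (paraphrased):
 *     t0 = GET(rbp)
 *     t1 = GET(rsp) - 8
 *     PUT(rsp) = t1          <-- SP already updated here...
 *     helper(store to t1)    <-- ...but IP still points at the push
 *     STORE(t1) = t0
 * The unwind info only describes instruction boundaries, so the helper
 * must first undo the half-done SP update.                             */
typedef unsigned long Addr;
#define N_FRAMES 8
extern Addr get_guest_SP(void);
extern void unwind_from(Addr sp, Addr* ips, int n);
extern Addr cached_ips[N_FRAMES];

void record_write_helper(Addr addr, long sp_fixup /* +8 for this push */) {
    Addr sp = get_guest_SP() + sp_fixup;  /* back to the boundary state */
    unwind_from(sp, cached_ips, N_FRAMES);
    /* ...then record the access at addr with this stack trace...       */
}
```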
So at instrumentation time, when an instruction has been marked fixup-SP, the generated code calls the same kind of helper, but indicating that the fixup has to be done.

Invalidating the cached stack trace is another problem. The simple invalidation instrumentation puts "cached stack trace valid = 0" at the end of the block for every exit kind different from boring. Well, the small problem is that some call instructions do not produce an exit statement, because some calls are somewhat inlined by the translation. For this, again, the idea is to mark the instruction with a "call" property, and then, at instrumentation time, if we see the call property, we add an invalidation. Here is an example where we have two properties added to an instruction, a call property and a fixup property, because this code suffers from the two problems I have described: so we are calling the fixup helper, and we have added the invalidation here.

There is also a self-check subtlety: sometimes the self-check finds a difference between the fresh stack trace and the one updated from the cached stack trace, and sometimes there is a bug, because Helgrind 3.12 computes this and the cached logic computes that, when in fact it is the fresh unwind which is incorrect. What I discovered here is, for example, the missing fixup in the middle of an instruction: Helgrind 3.12 also suffers from that, and for all the instructions where we would need a fixup, the stack trace computed by Helgrind was wrong; it was just not a big problem because it was rarely visible. The cached stack logic, in this case, produces a correct result, so in the self-check I have added some logic to detect the cases where the new implementation is better than the previous implementation. Now I have run the regression tests, and to my knowledge, on x86 and amd64 everything is okay: the self-check detects that all is okay, except for the known cases where the new implementation is better than the previous one. But all the other platforms are completely untested.

So, should I commit this? We would need to somewhat change VEX a little bit for the properties, and it is untested on the other platforms. Maybe I could commit it with a command-line option to enable this optimization; maybe, before committing, we should validate all the platforms; and maybe we should change the approach a lot, because maybe the property concept that I have introduced could be useful for Callgrind on the difficult platforms. Again, I don't know much in this area, but I understood that Callgrind has some difficulties on some platforms, and maybe using this kind of property might help Callgrind. And then maybe we should do something more general: maybe we need a tool-independent module which efficiently tracks all the stack changes, usable by, let's say, Helgrind, Callgrind and so on, rather than this rather specialized Helgrind optimization.

To conclude on the optimizations: optimizations can be based on very simple observations, both for the Massif stuff and for the Helgrind stuff, but the patches, however, are not that simple, as we have seen with this Helgrind one. So is it okay to add this complexity? For Massif, yes, because it was additional complexity, but usable by all tools, so there is extra benefit: we get an additional reusable module. But when we have various bits of logic which are partly in Callgrind, partly in Helgrind and so on, it is not that clear to me that this is nice, at least not without discussing it a little bit more. Callgrind, as we have explained, has some complex logic to track the stack, and with this patch Helgrind would have a
different, separate piece of logic to track the stack, which is not great. Are there any questions? Yes.

[Audience question.] Yes, it's an idea to look at; in fact, as far as I understand, that is what Callgrind tries to do: it tries to understand what is happening to the stack and maintain the stack trace on a permanent basis. But I don't know too well the problems that Callgrind has; maybe Julian knows more about the difficulty. On some platforms this is difficult; on the x86 platforms we could probably do this relatively easily. So yes, it's one of the possible ideas. Another idea, because these 8-frame stack traces are often the same except for the last entry: another idea I had was to store the common part only once, and have the stack trace be a pointer to that common part plus the last entry to replace. So yes, this type of approach might open the door to other alternatives; the question is whether we should invest more in this approach or invest in a nicer, more general model. But yes, what you suggest is clearly possible.

[Question about the next release.] There is no fixed plan, but releases are about one year apart, so I guess around September is the most likely. But Valgrind is very easy to compile: it does not have a lot of dependencies; you just need to do an SVN checkout and then you can compile the whole thing; it is not like trying to build something much bigger. So if you want to experiment with the SVN version, you are welcome to try it, and you can send questions to the mailing list. At my work, we usually use a recent version from SVN. So yes, there is the SVN version.