Right, so this is not just about a tool. It's mainly about optimizing the memory layout of C++ data structures. That's a bit different from what we usually do for memory optimization with tools like Massif or heaptrack, where we look at how many allocations we have, find the hotspots, and just allocate less. This is about how the data is put into the memory that we have already allocated, and how we can lay it out in a more compact way to squeeze out even more, or to make the layout more efficient.

Data structures here means things like structs or classes. And the bits that actually require runtime memory per instance are primarily their member variables. We don't care about normal methods; they don't cost anything per runtime instance. And we also don't care about static members. So it's really just about the members. To keep it easy, we will just look at simple structs in the examples.

So how is the memory layout actually done? Here are a few rules. They are slightly simplified, but close enough to cover about 90 percent of the use cases, and certainly most of what we find in KDE. In general, memory layout follows the declaration order of the member variables: they are just put one after another into the memory of the structure. Not really surprising. The next one is even less surprising: the size a member occupies is the size of its data type, who would have thought. But with the third one it actually gets interesting: member variables have to be aligned based on the alignment their type requires. For primitive types, like the built-in types, that is usually the same as the size. So an integer needs four bytes of space, and it needs to be four-byte aligned in memory. On 64-bit platforms, the biggest alignment you'll typically find is a pointer, which is 8-byte aligned. And for complex types, the alignment is the maximum of the alignments of their members.
Right, the only missing bit for C++ is inheritance: if we inherit from another struct or class, the memory layouts basically just follow each other. So first I have the base class, and right after it my derived class. If I have multiple base classes, they also just follow each other. That is fairly straightforward.

And then you have virtual inheritance. And that stuff is just crazy. You probably remember: virtual inheritance is used for shared base classes in a complicated multiple-inheritance scenario. If you inherit from two classes that both have the same base class, you would end up duplicating the actual data members of that common base class. In order to avoid that, you use virtual inheritance to merge them, in the classical diamond pattern. Most people don't use this. There's only one exception, and that is Solid. I don't see the guilty people here, but yeah, Solid still has a use of that. That's why it's still relevant for us to look at this. The problem with it is that the final memory layout is only known for the most derived class, because the placement of the virtual base depends on who actually inherits from it. But in most cases we can ignore that part.

And that's just virtual inheritance; it has nothing to do with virtual methods. Virtual methods also have a minor impact on the memory layout, in the sense that in the base class that first declares a virtual method, an invisible member is added for the virtual table pointer. You can imagine it as a void* member. That is always the first member, you can't move it, and the compiler just injects it there. But otherwise the layout is done the same way as for normal members.

So, lots of theory. Let's look at an actual example. We have a simple struct with a bool member: one byte in size, one-byte aligned. Then we have an integer member, four bytes in size, four-byte alignment, and another bool.
So the sum of the member sizes would be six bytes. But how big is it actually in memory? 12. Right. Because due to the alignment, we can't actually put everything right next to each other. The first bool messes up the alignment for the integer, so we get three bytes of padding with just nothing in there, unused. Then we have the integer and the second bool. And because the alignment of the whole struct needs to be the maximum alignment of its members, we have another three unused bytes at the end. And there you can already see how to improve this: if I just reorder the members to minimize the padding, I can cut it down by four bytes.

If I have a single instance of this, saving four bytes in my application is probably not going to make a difference. If I have 200,000 instances of this, I'm cutting down memory consumption by a third. And that's where it gets interesting.

So in order to find this, it would be useful to have tools that allow us to introspect data structures and tell us: OK, this is where you have padding, consider moving things around. There are two kinds of tools that need to know the memory layout: the compiler and the debugger. So looking at what those work with, we can probably find the information for us as well. GCC actually has a warning switch, -Wpadded, that will warn you about every bit of padding you have. I mean, that's a start, but it's extremely noisy. And it would also give you the warning for this example where we actually can't optimize anything anymore; we would need to change the whole data structure, because just by reordering we can't make this any smaller. So getting warnings for all of the unavoidable cases in a code base of KDE's scale is not really useful. Then there's a tool called pahole, from a set of tools called dwarves, that uses the debug information. For that you need a full debug build, and it extracts the memory layout out of there. That works fairly OK on C code, but it fails on inheritance, it fails on static members and all the C++-specific stuff.
And that's what actually got me into writing my own tool. Yeah? Just to add one more thing: if you try it on C++ code, it gets completely confused by static members, which have absolutely no impact on the memory layout. So it's a start, but for our use case it's not really usable.

That's why, as part of other tools in the binary area I was already working on anyway, I implemented a tool called elf-packcheck, part of the elf-dissector Git repository in KDE, that actually also supports most of the C++ scenarios. The only thing it can't handle yet is Solid: the virtual inheritance there gives arbitrarily wrong results. It's doable, but it's a lot of work, because you basically need to recreate the complete memory layout with the four different kinds of virtual tables and virtual table tables involved in constructing it, and then execute some DWARF expressions on that to find the right offsets. But for all the other cases it's actually producing useful results. And a couple of months ago Laurent got his hands on it, so if you run it now on the KDE code base, especially in PIM and some of the frameworks, you won't find that many problems anymore. In some other areas it might not have been applied yet.

But then, also related to the tools, you can verify this stuff with static asserts. That's useful if you did some careful optimization of a memory structure, and then the next guy comes in, just patches a bool into it, and makes it grow by 30%. So you can basically unit test certain assumptions about your data structure and get a compile error when somebody accidentally makes it larger, and then you can decide: OK, this is unavoidable, or: OK, let's try to squeeze it into something more compact.

So yeah, what you can do to avoid this unnecessary padding is basically reorder the member variables. The general rule of thumb there is: sort them by alignment. Sometimes you find the rule "sort them by size", which is not actually correct, but in practice is close enough.
But there are a few surprises to keep in mind, especially when working with C++, and that is the alignment of the base class. There's one common scenario in Qt-related code: the private classes of copy-on-write classes inherit QSharedData. QSharedData has a 4-byte alignment, while you most likely have member variables with an 8-byte alignment, like a QString or a pointer. So if you follow the basic rule of sorting by alignment, you get a 4-byte gap right after the base class that you are not using. Even Qt had cases of that, and still has some; I wasn't allowed to fix those. So if you have a few booleans at the end, you can move them to the beginning, fill that gap, and save 8 bytes in total.

When you do these kinds of optimizations, keep in mind that the memory layout is actually different on 32-bit and 64-bit, because pointers have different sizes and different alignments. And you of course don't want to optimize just for one case and make the other one worse. Usually, if you optimize for 64-bit, 32-bit will also be fine; the other way around isn't necessarily the case.

And then there's something that makes fixing those issues in KIO somewhat annoying: there are lots of member variables that are compile-time conditional. If you don't want to totally mess up the code or duplicate it, finding a reordering that works in all possible combinations can be a little tricky. But then, in general, you get to the point where it's a trade-off: is it really worth optimizing the 4 bytes out of this class compared to the maintainability, right?

And then we have the really fun stuff. One class that was showing up very frequently, and that is actually very high volume, is the QHash node. Slightly simplified, it has this layout: first an integer for the hash value, then the key, and then the value. Key and value depend on the template arguments.
So if you have a QHash of int to QString, the hash node is 16 bytes and has no padding. If you have a QHash of QString to int, so key and value just switched, you have 24 bytes, with a third of it lost to padding, which for such a class is actually unfortunate. There is a way of fixing that with a bit of template metaprogramming: based on the alignment of the various types, you have two different implementations where the members are swapped, and then with an enable_if you select the right one. That's unfortunately not binary compatible, so we can't fix it in Qt 5. But if you have such high-volume template classes, there are actually ways to decide between different memory layouts at compile time.

So yeah, with that approach we can reduce the waste caused by padding. We need less memory, and as a nice side effect we get better utilization of the CPU caches, so in general that also helps with overall performance. Unless you have really tricky cases like the hash node, the impact on code maintainability is actually fairly low; it's just swapping around a few member variables.

But is that really everything we can do? Is there more we can squeeze out of this? Of course there is. What we've looked at so far is basically just the byte-level layout that the compiler does. But if we take a step back and look at the information we actually want to store, and ignore the actual in-memory layout for a moment, we see that there is still a whole lot we can get out of it. I mean, the extreme case is a bool: it stores one bit of information, but it needs eight bits of storage. You can hardly make that less efficient. Enums are often another such example, right? If you have eight different enum values, by default they occupy 32 bits; if you don't use them in a flags configuration, three bits would actually be enough. Even in pointers we have that.
If you take a QObject pointer on a 64-bit system, it's eight-byte aligned, which means the three lowest bits are always zero. So conceptually, only 61 bits of the pointer carry actual information. In practice it's even less, because you don't have that much addressable memory. So, looking at it from a theoretical point of view, there's obviously more space we can fit stuff into. But for that we need to look at the sub-byte, bit-wise layout of the memory.

There is some support for this in the language with bit fields, where you can specify after the variable name how many bits should be used for it. So if you know that an integer only needs a smaller number of bits, because you don't have that large numbers, you can squeeze in some other stuff at the end. If you follow Qt, there have recently been a bunch of changes to actually get rid of these, as Marc found out that GCC generates invalid move constructors for bit fields. I mean, that is of course a compiler bug. But it's also worth looking at alternatives in case they're causing problems. The obvious alternative is to manually do the bit shifting and masking to get at the bits you want. That's usually hard to maintain and annoying to do, but it's kind of the ultimate option: you can arrange things in whatever way you want.

And then we also have some higher-level classes. If you think about the bool example: one bool is already bad, but you could have an array of bools, or a QList of bools, and then per entry you waste seven out of eight bits of your storage. So there are a few special-case classes in Qt, like QBitArray, that actually do the bit fiddling and store these as real one-bit entries. And std::vector<bool> is also specialized to store this in a much more compact way.

Yeah, and for enums, with C++11 we have the ability to actually change the storage type.
So if you know you'll only ever have 255 different values at most, you can actually specify that the enum should go into one byte. And of course you can combine that with bit fields to make it even more compact.

Unlike the reordering of the member variables, this actually has some CPU cost. Usually it's just a few bit operations, so it's not that big a deal compared to the performance you gain from better cache utilization. But you get to the point where it might actually make sense to measure the impact. And it certainly has an impact on code maintainability and readability. And there is another problem: one byte is the smallest unit a pointer or reference can address. If you now start to put two different variables into the same byte, which you can do with bit fields, you can't take the address of them anymore, and you can't pass them to any function that expects a real pointer or reference. So there you always end up copying bits around.

The elf-packcheck utility can actually, for some types, I think for booleans and for enums, already measure how many bits you actually need, and it can show you how many bits of your data structure are actually used. So this helps with finding places where you can optimize on this level.

And then there are a few more, I would say, dirty tricks that you should only use in case of emergency. First of all, you can disable the alignment rules. That works on some platforms like x86, with a certain performance impact. The compiler then doesn't care about alignment anymore, and everything is nicely packed directly together. This gets you really interesting runtime behavior on ARM, because there misaligned access actually crashes or triggers a CPU error. I found this in one or two places in the QML engine. Yeah, that is kind of the last resort.
Slightly less bad, but also somewhat shady, is actually using that pointer alignment gap. As I mentioned earlier, if you have an 8-byte aligned pointer, you have three bits that are always 0 and therefore unused. So with a bit of masking, and by always resetting those bits to 0 before you dereference the pointer, you can actually store some extra bits of information in there. Well, if you always remember to reset them to 0, that is; you need to be really careful with this kind of stuff. At least, luckily, Qt has some non-public classes in the QML engine, QFlagPointer and QBiPointer, that do this already as template classes. So you probably want to steal those rather than trying to implement it yourself. But this is also something I would only use if absolutely necessary, when you have some super-high-volume class where you would otherwise pay 8 bytes extra for one bit that you need to store somewhere. I mean, this is used in a few places in Qt, and it's tempting, right? 8 bytes of overhead for one bit that you need to squeeze in somewhere. But yeah, it has interesting side effects, especially if you forget to mask it out at some point.

Right, and then this all has its downsides as well. The memory layout is essentially what defines the application binary interface, so as soon as you change anything in there, you're binary incompatible. I mean, we are in the fortunate situation that most of the classes with a significant amount of member variables are actually private classes, and those we can reorder as we see fit. In applications it's of course also fine. But the QHash node example is something we can't actually fix before moving to Qt 6, and that is one of the really high-volume classes where this would actually make a difference. CPU cost is something I already mentioned. And there's also a downside to the improved cache utilization.
So if you have a data structure that is used simultaneously from multiple threads, and those threads are actually running on different CPU cores, you'll end up with a cache-line ping-pong effect that actually makes things a lot slower. In those cases it can actually be beneficial to move things further apart, so they end up on different cache lines. Marc has interesting benchmarks showing this. For most of the stuff we see in KDE that's not the case; this kind of heavy multi-threading is relatively rare. But if you work on that kind of stuff, that is a whole different area of problems you might run into, where the layout has totally different effects than in the single-threaded case.

Yeah, portability, we already mentioned it: ARM is far more fragile when you misalign memory. The more you use these dirty hacks, the harder your code gets to maintain. And if you reduce the storage size to the minimum, you run into problems with future extensions: say you need one extra flag, but there's no space left for it because you really minimized everything. Those are all things to consider before really going down this road.

So, yeah, in conclusion: it's fairly easy to avoid the unnecessary padding, the low-hanging fruit of memory waste that you might find in a few places. Once you get into high-volume classes, where you really have many, many instances, it starts to get really interesting to think about what you're storing in there, not so much from the implementation and technical point of view, but from the information-theory point of view: what is the actual content I need to store here, and how many bits do I conceptually need for that? And then, based on that, look for a memory layout that tries to minimize it. And of course, none of this is a replacement for actual memory profiling and trying to avoid allocating stuff in the first place. That's usually going to save you a lot more.
So this is only the step after you've done that, for those few classes where you really need a huge number of instances, and then you look at how they can be further compressed. Yeah, that's it.

So what's the most dramatic reduction in memory use you've managed to achieve with this?

I think the biggest ones we had were 30% to 50%, on really small structures. I mean, the example I had was 30%; the QHash node would be one of the cases where you actually save 30%. I think that is probably as extreme as it gets; you can never save more than 50%. Well, unless you can construct something, can you? Actually, if you store your booleans in a 64-bit integer as one-bit fields, you save a factor of 64. Yeah.

I meant an existing program whose total memory you've managed to reduce.

Well, in actual real-world code, in most of the high-volume classes apart from the hash node, you usually gain something like 8 bytes on a multi-hundred-byte structure. So it's not that much, just a few percent. We once did this with Volker's tool, finding these issues in KMail, probably two years ago or so. And there I think it saved like 20 or 30 MB from a total of 500 or so. So it was a few percent, but quite significant, and fairly simple to fix. I mean, there are two cases where it's worth looking at this: either you have an application where you know what your high-volume classes are, or you're working on a framework with classes that a user might plausibly be using in fairly high volume. For most classes you have so few instances that especially the more tricky optimizations are not worth it, from the effort point of view, from the maintainability point of view, and so on. But just reordering members, that for me became kind of a habit: when I write a class, I pay attention to putting the members in the right order.
So you at least avoid the unnecessary waste, the low-hanging fruit, and the rest is really for some high-volume cases. Thank you.