Before we get to the meat of this presentation, I want to set the scene a little bit and frame where we are in computing history. Let's go back to 1971, when Intel wrote computing history by releasing the first commercially produced microprocessor, the Intel 4004. That was significant precisely because it was the first commercially produced microprocessor. And Intel didn't stop there; as we all know, they kept producing more and more chips. Just in the three years between the 4004 and the 8080, they more than doubled the number of transistors in these chips. A few years after that, they produced one of the most famous chips ever made, the 8086, with 29,000 transistors.

You probably already see where I'm going with this. This is what one of the Intel co-founders, Gordon Moore, coined as Moore's law. Initially he said the number of transistors would double every year. Eventually he walked that back and said, okay, maybe only every two years. But throughout computing history so far, this has held true, more or less, until now.

Last year, Apple gave a really cool keynote when they were releasing the M2 chip, where they're using what is really a marketing term, "5 nanometer technology"; nothing in these chips is actually 5 nanometers. But they made a really interesting statement in that keynote: some of the components in these chips are now 12 nanometers wide. The reason this matters is that there are known physical limits on how small we think transistors can get, unless there's a completely new breakthrough and we stop producing silicon chips altogether. A single silicon atom is what we think is the theoretical physical limit of how small a single transistor can be. Following Moore's law, and assuming we can even still economically produce smaller and smaller chips, we're actually pretty close to the physical limits known to us. That means Moore's law, at least as we know it, is definitely ending.

However, my thesis is basically: long live Moore's law, although not in the way it was originally stated. The way I think computing history continues from this point onward is twofold. One, we'll continue to build faster systems by building specialized hardware. We're already seeing this in the AI space: there's a super interesting company called Groq (spelled G-R-O-Q) producing specialized chips for one very specific task, inference, for example for the Llama models. They're three times faster than anyone else using the highest-performance Nvidia chips at the moment. That's something I have no expertise in, so I chose not to work on that problem. The other thing I think will happen is that we'll actually use the hardware we have much more efficiently. That is where I think we are in computing history, and it's the thing we need to solve. The way we can solve it is by actually understanding where our resources are being spent, so that we can do something about it. Up until now, we didn't really have to care how much CPU our applications used, because resources were doubling every couple of years anyway. I'm Frederic. I've dedicated the last three years of my life to solving this problem, and probably the foreseeable future of my life as well.
You may know me through my work on Prometheus. I created the Prometheus Operator, and I worked for a very long time on a lot of the things that connect Kubernetes and Prometheus. I was architect for all things observability at Red Hat for a while. Databases, distributed systems, performance engineering: that's kind of my thing.

The reason we're here today is a livestream that we've been doing on a weekly basis, where we pick an open source project, probably a lot of the projects that you're using or that you've seen at this conference already. We try to run them; often we already run them ourselves in production, and if we don't, we try to find someone in the community who does, so that we can get real, representative profiling data from these projects to analyze, to improve, and then ultimately, hopefully, to merge those pull requests and make all the infrastructure in the world use less CPU. We've had a bunch of really awesome successes with this. We reduced the baseline of all Cilium installs worldwide by 4%. We've made some of the software we write 99% faster. We made containerd installs worldwide use about 4% less resources. We made kubectl diff ten times faster. A bunch of really cool wins that we've gotten in just a couple of episodes.

How did we do this? Like I said, we do it using profiling. Just to make sure we pick everyone up from zero: the way profiling works is that some number of times per second, we look at the current function call stack. If we see the same function call stack multiple times, we can build statistics and essentially infer that, statistically, we spend more time processing that function. The longer we do this, the better the statistical significance gets; the more representative it becomes, basically. We can then use that data for some really interesting analysis, aggregating it to understand where all of this time is being spent.

A typical Let's Profile episode (we've done just about 20 of these so far) looks something like this. We get some profiling data, like I said, either from someone in the community, like you, or from running the project ourselves. We then use this profiling data to figure out what we want to optimize, because the data describes actual production use. We write a benchmark for that particular function, for example, and then we try to optimize it. Most of the time we're successful, but sometimes not. It's completely unscripted, by the way; we only grab the profiling data beforehand, we don't try to optimize anything in advance. There have been episodes where we were not particularly successful, but I've already shown you a bunch of examples where we were.

Let me quickly give you an example of what this might look like. I'll pull up our demo instance of the Parca server here. This is an open source project we created at Polar Signals, which profiles all of your production infrastructure, all the time. Down here we can see where all of this CPU time is being spent, and then we can do things like look at all the profiling data for containerd-shim. From there we can dive in and figure out which function we want to optimize. Like I said, we write a benchmark and try to optimize that.
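As a concrete reference, one minimal way to expose this kind of sampling profile from a Go service is the standard library's net/http/pprof handler; the port here is an arbitrary choice:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// A profiler (or `go tool pprof`) can now fetch, for example,
	// /debug/pprof/profile?seconds=30: the runtime samples the
	// running call stacks roughly 100 times per second, and the
	// repeated stacks are exactly what builds those statistics.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Continuous profilers like Parca take the same idea further by sampling every process on a host via eBPF instead of one instrumented service.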
Typically, we try to publish the profiling data a day or so in advance and publicize it on Twitter, so the audience can participate. It's a fun format. That's our starting point. From there, we create a benchmark that might look something like this; this is an actual example from when we optimized containerd. Like I mentioned earlier, we create a benchmark and run it so that we have a baseline to start from, which gives us a quick feedback cycle to know whether we're actually improving, by how much, and so on.

Then we come to the actual optimizations. We've grouped the optimizations into a couple of categories. The first thing you always want to make sure of is that you're using the correct approach: the right data structures, the right algorithms. This is the highest level, it's always where you should start, and it's typically where you get the biggest wins. However, it also never hurts to look at the profiling data across your entire infrastructure, because there's always something we didn't anticipate while writing the code about how the software would actually behave. We've definitely seen our fair share of memory allocations that people never intended to be there; writing software is hard. After that, you want to look at removing allocations, then inlining, and eventually vectorizing your code. I'll go through each of these with examples.

The first one comes from one of our earlier episodes, where we were optimizing the Kubernetes kubelet. We figured out that one of the really resource-intensive things the kubelet does is repeatedly checking which volumes need to be mounted. That makes sense: we have a bunch of Kubernetes pods running on our hosts, and the kubelet needs to make sure that all the volumes those pods mount into containers are actually available. It turns out that even though there probably aren't a whole lot of volumes on any given node, it was using a data structure meant to store a lot of items. There's an opportunity to optimize here, because it typically deals with relatively few entries, maybe in the tens of thousands, but not in the hundreds of thousands or millions of entries. This is roughly what the code looks like, but that's not really the important part. The more important part is what the data structure looks like, conceptually, for maps in Go: you have an array of buckets, and then you follow a linked list (this is really a generic description of any hash map). However, because all the volumes are just file system paths, we can do much better with a data structure that's optimized for prefixes. These are called tries (spelled T-R-I-E), essentially prefix trees: the trie stores all of these shared prefixes, and a lookup just walks the tree. Walking this tree is way cheaper than iterating over a map in Go. This resulted in a 10% baseline improvement in the kubelet. That means every kubelet in the world uses about 10% fewer resources, which means less energy used and more room on our nodes for our actual workloads.
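To make the idea concrete, here is a minimal sketch of a path trie in Go. This illustrates the technique, not the actual kubelet patch; all names are made up:

```go
package pathtrie

import "strings"

// pathTrie stores filesystem paths one segment per level. Because
// paths share prefixes, a prefix query walks only the matching
// branch instead of scanning every key the way a map iteration does.
type pathTrie struct {
	children map[string]*pathTrie
	terminal bool // true if a stored path ends at this node
}

func newPathTrie() *pathTrie {
	return &pathTrie{children: map[string]*pathTrie{}}
}

func (t *pathTrie) Insert(path string) {
	node := t
	for _, seg := range strings.Split(strings.Trim(path, "/"), "/") {
		child, ok := node.children[seg]
		if !ok {
			child = newPathTrie()
			node.children[seg] = child
		}
		node = child
	}
	node.terminal = true
}

// HasPrefix reports whether any stored path starts with prefix.
func (t *pathTrie) HasPrefix(prefix string) bool {
	node := t
	for _, seg := range strings.Split(strings.Trim(prefix, "/"), "/") {
		child, ok := node.children[seg]
		if !ok {
			return false
		}
		node = child
	}
	return true
}
```

And a benchmark skeleton of the kind described above, which gives a baseline and a fast feedback cycle while optimizing (run with `go test -bench=. -benchmem`):

```go
package pathtrie

import "testing"

func BenchmarkHasPrefix(b *testing.B) {
	trie := newPathTrie()
	trie.Insert("/var/lib/kubelet/pods/some-uid/volumes/kubernetes.io~secret/token")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		trie.HasPrefix("/var/lib/kubelet/pods/some-uid")
	}
}
```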
The next thing we keep seeing over and over again, in many episodes, is memory allocations; more generically, we can talk about this in terms of escape analysis. In a garbage-collected language like Go, the compiler decides at compilation time, more or less, where memory will be allocated: can it live on the stack, or do we have to go to main memory and allocate on the heap? Heap allocation is an expensive operation. To amortize it, one good strategy is to pre-allocate all the memory you're going to need, so you don't make tiny, expensive allocations all over the place; you do one big allocation instead and save all of those small ones.

One way to do this is to allocate a piece of memory once and keep reusing the same buffer. What's problematic about that is that in a multi-threaded environment, it's not thread safe. In that case, you want something like a buffer pool: every time you need a buffer, you take one out of the pool, do your work, and put it back. This still avoids doing a brand-new allocation every time you need a buffer. This is essentially the optimization we used for containerd; again, it made every containerd install on the planet use about 4% less CPU. Again, more room for the actual applications on the nodes we're paying the cloud providers good money for.

The next one is inlining. This is a compiler optimization where the compiler decides it's not worth doing an entire function call, and instead includes the callee's executable code directly in the calling function. This is a surprisingly effective optimization, because we skip all the work of setting up the stack and returning to the code we left off from. In this case, I forced the compiler not to inline the add function: it sets up the stack over here, then calls the add function, the add function does its thing and returns, and all of this back and forth is super expensive. When the compiler decides to inline, you can see over here that all of this is just part of the main function: there was no need to set up a stack, no need to return from the other function. If all of this happens in a hot loop, it can be a very effective optimization.

Let me give you an example of where this can be super effective. In this case, I have a function that accepts some interface and calls a method on it, and we do this a million times. Obviously this is just conceptual. What has to happen here is something called dynamic dispatch: the program needs to figure out which implementation of the interface to call before it can call it, and that prevents inlining. So we have to do dynamic dispatch a million times, instead of being able to inline the call within this hot loop. If you're doing something like this, there can be a huge saving if you can just skip that. How can we do that?
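To make both ideas above concrete, here are two short sketches. First, the buffer-pool pattern in Go using sync.Pool; this shows the shape of the technique, not the actual containerd patch:

```go
package bufpool

import (
	"bytes"
	"sync"
)

// bufPool hands out reusable buffers, so hot paths don't allocate a
// fresh buffer (and create garbage-collector work) on every call.
// Unlike a single shared buffer, the pool is safe for concurrent use.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(data []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold previous contents
	defer bufPool.Put(buf)

	buf.Write(data)
	buf.WriteByte('\n')
	return buf.String() // copies, so the buffer is safe to reuse
}
```

Second, the dynamic-dispatch situation just described; the Shaper interface and its implementation are purely illustrative:

```go
package dispatch

// Shaper stands in for any interface called inside a hot loop.
type Shaper interface{ Area() float64 }

type Square struct{ side float64 }

func (s Square) Area() float64 { return s.side * s.side }

// sumAreas calls through the interface: every iteration must look up
// the concrete Area implementation (dynamic dispatch), which also
// prevents the compiler from inlining the call.
func sumAreas(shapes []Shaper) float64 {
	total := 0.0
	for _, s := range shapes {
		total += s.Area()
	}
	return total
}

// sumSquares makes the same call on the concrete type: no dispatch,
// and the compiler is free to inline s.Area() into the loop.
func sumSquares(squares []Square) float64 {
	total := 0.0
	for _, s := range squares {
		total += s.Area()
	}
	return total
}
```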
Well, there's a really cool feature that recently landed in the Go compiler, though just about any compiler out there can do something like this: you can feed the compiler profiling information, so that it knows that the most common implementation behind the interface call in this hot loop is one specific type. It's essentially the same as introducing a type switch in Go, checking first for the implementation we see most often. And that allows the compiler to inline again. In this particular case, we could save 66% of CPU time simply because we gave our compiler profiling information. We didn't actually change any code here; we only profiled in production and gave our compiler representative profiling data for our code.

And once you've done all of these things, that's when you can think about vectorizing your code. I'm not going to cover this in much detail, because it would be five talks by itself, but long story short: hardware these days can execute instructions that perform multiple operations in a single instruction cycle. So we can squeeze quite a bit more performance out of existing hardware today simply by doing multiple things per cycle. If you're interested in this kind of thing, I recommend checking out Daniel Lemire's blog. He blogs about this kind of stuff all the time, also in relation to Go, and he covers vectorization in a very general sense as well.

So, going back: this is our cheat sheet for profiling and optimizing just about anything. You profile, you benchmark, you optimize. The way you're most successful at optimizing is: first, make sure you're using the right approach, the right data structures, the right algorithms. Only then move on to avoiding allocations, make sure inlining is performed where it makes sense, and then, very last, vectorize your code. We've been doing this on our own code for a very long time, and even now we still see 25, 30, sometimes 50 percent improvements on a weekly or bi-weekly basis, simply because we're always doing this, always have this data, and can immediately jump into our production data and see where all the CPU time is being spent. So now you also have the tools to deal with the end of Moore's law. Long live Moore's law, and let's profile. Please subscribe to our YouTube channel and tell us what we should profile next. Thank you.
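For reference, the profile-guided optimization workflow described above looks roughly like this with the Go toolchain (Go 1.21 or newer; the paths and the profiling endpoint are assumptions for this sketch):

```sh
# Fetch a representative CPU profile from production, e.g. from a
# service exposing net/http/pprof:
curl -o cpu.pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

# Put it where the toolchain auto-detects it: a file named
# default.pgo next to the main package.
mv cpu.pprof ./cmd/myapp/default.pgo

# -pgo=auto (the default since Go 1.21) picks up default.pgo and uses
# it to guide optimizations such as inlining and devirtualizing hot
# interface calls like the one in the earlier example.
go build -pgo=auto ./cmd/myapp
```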
I think we have a couple of minutes for questions.

Thanks for the talk. Can you share which versions your fixes landed in, for the kubelet, containerd, and so on? Or do you have it somewhere in the slides? That would be very helpful.

So the kubelet patches have not been merged yet. The containerd ones have been merged, but I don't know exactly in which version. I believe it's rolled out in the most recent GKE version, 1.28 I think, where we didn't see this anymore. But yeah, it takes a while for this stuff to land in production. So do it as early as possible, so that we can keep getting more of those resources for ourselves to spend, and not on these infrastructure components.

Hello, thanks for the talk, it was very nice. I have a question: how are you checking how escape analysis works in Go? Do you have any techniques, and how do you approach it? Because if you look at everything, it may be hard. Any tips and tricks you want to share?

Sorry, where are you sitting? Oh, okay, sorry. Can you repeat the question one more time?

Sure. Basically, I want to ask how you approach analyzing Go's escape analysis. Do you have any tips and tricks? Because in my experience there can be a lot of noise; how do you narrow it down to just the things you're interested in?

Yeah, great question. One, there's actually a compiler flag you can pass; I don't know it off the top of my head, but if you search for it you'll find it, and it basically prints out all the decisions the Go compiler makes in relation to escape analysis. The way you need to think about it is that the Go compiler has to decide: is this going to fit on the stack? If the answer is no, or it can't predict the size of the thing, then it goes on the heap. That's conceptually how I mostly think about it, and then I use the tooling to tell me that this thing went onto the heap because the compiler thinks it might be too large.

Okay, but the thing is that when you use it, you get the output for the whole application, right? It's a very big file. So my question is, how do you filter out only the part you want to optimize?

Yeah, fair enough. There's no real magic there: I try to build the smallest benchmark I possibly can, so that it also outputs as little as possible.

Thanks.
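The compiler flag alluded to in that answer is the gc compiler's -m diagnostic, which prints inlining and escape-analysis decisions; the package path below is only an example:

```sh
# Print the compiler's optimization decisions, including which values
# escape to the heap:
go build -gcflags='-m' ./...

# Repeating -m adds detail; scoping the build to one small package
# (or a minimal benchmark) keeps the output manageable, matching the
# "smallest possible benchmark" advice above:
go build -gcflags='-m -m' ./internal/hotpath
```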
Thank you, great talk. My question is that you mentioned PGO: is there any best practice for how we can integrate PGO into our pipeline, maybe CI, and improve performance over time across releases?

Yeah, great question. The Go implementation of PGO is completely state of the art, and one of the really cool things the Go compiler can do that many other PGO implementations struggle with is source stability; more specifically, it doesn't require source stability. In other implementations, say in LLVM, you have to compile your code, profile it, and then recompile exactly the same code you profiled. That can be complicated, because if you actually want production profiling data, those two requirements conflict. In Go, this is not a requirement: the compiler basically searches for which code hasn't changed and tries to apply PGO to all of those places, acknowledging that most of the time your patches and code changes are relatively small. So PGO still ends up being applied to just about everything, something like 95% of all of your code. This means you can take a weekly snapshot of all of your production profiling data, put it in an S3 bucket or wherever, download it in your CI/CD pipeline, and compile with it. A weekly snapshot is actually sufficient to optimize most of the code paths for that week.

Okay, thank you.

I wanted to build on the PGO question. How do you think open source projects can use it as part of their pipelines? How do we know which production is the best production to profile?

Great question. The cool thing about PGO is that most of the time it isn't evaluating the profiling data itself; it just wants to understand which code paths have actually been taken. So actually the best thing we can do as a community is collect all of our profiling data together and give it to, I know you're an etcd maintainer, give it to the etcd maintainers, and then etcd can be built with PGO for the entire community, covering all the code paths that actually exist in reality. And the cool thing about PGO is that you can basically only win. The worst thing that can happen is that the binary gets a tiny bit larger, but it doesn't hurt to optimize the code paths that are actually taken in reality. So as a community, we can only win. What we should do is pool all of our profiling data together and give it back to the maintainers of all these projects.

Awesome, thanks. So when will there be a Let's Profile for etcd?

Come join us next week.