Hello, everyone. My name is Ben Sless, and this is Dressed-Up Performance with Clojure. A bit about me: I'm a backend engineer at AppsFlyer, working on large-scale systems which ingest billions of events every day, and I love Clojure. Most of our backend is written in Clojure, and you might ask yourself: can Clojure remain viable at such a large scale? Spoiler: yes, but you need to change your mindset a bit and adopt some behaviors and practices which will help you solve performance-oriented problems.

Every discussion about performance should start with a bit of a meta-discussion about performance, beginning with Donald Knuth's immortal words, "premature optimization is the root of all evil," which we should put in context: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." What is our critical 3%? That is our infrastructure. Those are our libraries and our frameworks. You are a developer. You want to provide a good API for users. You want to provide readable code for developers, both yourself and your colleagues. You're working on a library or a framework, and you want it to be good: you want to provide not only a good API, but a performant one.

This talk I consider a spiritual successor to the Naked Performance talk Tommi Reiman gave at ClojuTRE two years ago. In it he presents three big ideas. One of them is little things: little things which accumulate, and on your hot path they have a significant effect on your performance precisely because they accumulate. The second big idea is data and compilers, and the third is algorithms. I'm going to focus mostly on the first one, but we'll see later how it connects to the other two. After that talk, I wanted to do some tinkering, because there was a very interesting idea in it: Tommi mentioned, for example, that get-in on a map and an unrolled implementation of get-in have significant performance differences. And boy, what a difference: five times faster is a dramatic improvement, and that's nothing to sneeze at. There's certainly something there (a minimal sketch of that comparison follows below), but we need some new methodologies, some new approaches, to even understand how to take advantage of it and how to squeeze the most out of Clojure here. And we need to get into the right mindset to solve those performance problems.

But where is your problem? Most likely it's not in Clojure, right? I'm going to assume that your problems are first in your architecture. Maybe you're blocking, maybe you're calling multiple services and waiting for responses. Maybe your algorithm is just not right, or you're not using the correct data structures. And that's all I have to say about that, because this is not what the talk is about.

To solve the low-level performance problems, you need to embrace some things. You need to embrace your runtime, which means you need to embrace the JVM first. You need to understand that the JVM is a bytecode interpreter with a just-in-time compiler, and that the just-in-time compiler can turn that bytecode all the way down to assembly. To do that, it needs to see predictable behavior. What does predictable mean for the JVM? The JIT compiler optimizes hotspots, which are the places in your code you hit frequently. For example, it can inline small methods. And predictable things are things like monomorphism and branch prediction.
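Here is the sketch promised above: a minimal Criterium comparison of get-in against an unrolled chain of gets. This is my reconstruction of the kind of experiment behind those numbers, not the talk's exact code; results will vary by JVM and hardware.

```clojure
(require '[criterium.core :as cc])

(def m {:a {:b {:c 1}}})

;; Idiomatic version: get-in iterates over the key sequence.
(cc/quick-bench (get-in m [:a :b :c]))

;; Unrolled version: a direct chain of get calls, no iteration.
(cc/quick-bench (-> m (get :a) (get :b) (get :c)))
```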
So what is monomorphism? When you dispatch through an interface to a single implementation, that call site is monomorphic. Branch prediction: if there's an if statement but you always take the first branch, that's quite predictable. Unpredictable is the exact opposite: code which is very polymorphic, where from a single interface you dispatch to a variety of implementations. Allocation is unpredictable, for a variety of reasons we'll talk about later. And there are negated assumptions, things like array bounds checks or branch predictions: if the JVM can't know how they will behave, it can't optimize them away.

Another thing we need to embrace and understand is the hardware. Modern hardware, modern CPUs, like contiguous things. What is contiguous? Arrays are contiguous: things that fit in cache. And Clojure's data structures actually take advantage of this; you can see that they're implemented with very wide backing arrays, at a width which fits nicely into cache lines. The opposite of that would be cache misses. What causes cache misses? Loading things from main memory. And when do we need to do that? Memory is loaded in pages, and if the object you're loading is not in the cached page, you have to go fetch another one, evicting what's currently in cache. A linked list makes you do exactly that: you always point to the next node in the list, you have to go find it, and most likely it's not in your cache. And every new object you allocate is allocated on the heap, and you need to load it.

Another thing you need to embrace is Clojure. That might sound ridiculous to you. This is a Clojure conference, this is a Clojure talk; of course I embrace Clojure. But you need to roll up your sleeves and get right in there. You need to understand the APIs. You need to understand the implementation. You need to not be afraid to read the Clojure code or the Java code. You need to understand why some APIs are fast and some are slow, and to understand when and why allocations happen.

Once you do all of that, it's time to profile. Always profile first. And not just profile: profile living applications. You don't want to test your applications in a synthetic environment. You have a variety of profiling tools. VisualVM and the Java Flight Recorder are very useful and can give you lots of information, but I'm going to focus a bit on flame graphs today.

Now, what are flame graphs? A flame graph is a visual representation where the horizontal axis shows the percentage of CPU time you spend doing things, and the vertical axis is a representation of your call stack. If f calls g1 and g2, g1 and g2 are stacked on top of f; if g1 calls h, h is stacked on top of g1. And we cannot spend more time in g1 and g2 than we spend in f; that wouldn't make sense. So you can see that their stacks together are slightly narrower than f, and the same for h.

This is a real flame graph. I produced it by stress testing http-kit with Reitit and jsonista, the entire Metosin web stack, let's say. Each one of the vertical stacks we can see is laid down on top of a white block, which is a function entry point. We can look at them and ask: what's taking so much CPU? Because each one of those stacks is 10 to 15 percent of our CPU. What's going on there? Where is my CPU going? Stack number one: keywordizing keys. Stack number two: parsing the query params in the request. Stack number three: building the request map inside http-kit.
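As a reference point, here's a minimal sketch of producing a flame graph like this from the REPL, assuming a recent clj-async-profiler (discussed more below); run-stress-test! is a hypothetical placeholder for whatever drives load against your application.

```clojure
(require '[clj-async-profiler.core :as prof])

;; Profile a representative workload.
;; run-stress-test! is a placeholder for code that exercises
;; your real application, not a function from the library.
(prof/profile
  (dotimes [_ 100000]
    (run-stress-test!)))

;; Browse the generated flame graphs at http://localhost:8080
(prof/serve-ui 8080)
```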
All three of those findings were turned into actionable conclusions: pull requests to the respective repositories, which produced performance improvements.

Once you find the pain points you want to fix, you can start benchmarking, and your workhorse for benchmarking is Criterium, like we saw previously. You can use it a lot during the exploration phase: you can try a variety of implementations, compare them, and try to understand why there are such differences between them. I used it for my pull requests. You can test a variety of inputs and see how the results change based on the input. And we can see here, for example, a rough two-and-a-half to three times improvement.

Another tool is clj-async-profiler. Let's profile the cases we saw previously. With get-in, we can see that we pass through a function called reduce1, which iterates over the sequence of keys, and we need to understand what reduce1 is and why it's slow. If we unroll it into a series of gets, we only pass through Clojure's RT/get before we hit the valAt method. If we dispatch through keywords, we go through the Keyword's invoke method before we get to the valAt implemented by the map. And if we invoke the map itself with the keyword, we see there's almost no indirection at all. What's going on here? We'll see a bit later.

clj-async-profiler is a very useful tool. You can use it interactively, and you can embed it in your application. But read the documentation: make sure you set it up correctly, and of course understand the results, meaning how to read a flame graph. Setting it up correctly is all in the documentation, but it mainly involves setting the JVM flags correctly and making sure the kernel will let you see profiling events.

Another very useful tool is jmh-clojure. When you need the most accurate results, this is your go-to tool. You can set up elaborate scenarios and elaborate test matrices, but it requires some learning.

It's important to construct good test cases. Don't test your code in isolation, and make sure your test cases are not synthetic. You need to understand how the JVM optimizes code, and when you understand how the JIT behaves, you know you need to make sure your function doesn't see just one type: not just one type of collection, one type of input argument. Those things matter. Make sure that in a single test case the same function sees a variety of inputs, because that affects how the just-in-time compiler behaves.

Now for some of my findings. This gets into the territory of things you can apply: not just tools, but conclusions. First, the big little things. They have a performance overhead which stretches from bad to terrible. Terrible, for example, is satisfies?. The rest are mostly just bad. Now, you might say: well, merge, that's just 200 nanoseconds, why should I care about that? When your performance budget is under 600 nanoseconds, 200 nanoseconds make a huge difference. These matter on your hot path, not when you're loading a config file. And there are some fast little things, mostly reduce, reduce-kv, and transduce, which goes through reduce. They are fast because, like I said previously, vectors and maps are backed by arrays, and reduce and reduce-kv iterate directly over those arrays, which is very friendly when it comes to cache behavior. Plenty of my findings are taken from clj-fast, which is a library I maintain and develop.
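As a rough illustration of why those reduce-based paths are fast, here's a hedged Criterium sketch comparing seq-based iteration over a map with reduce-kv. The exact gap depends on map size and JVM, but the reduce-kv version avoids allocating an intermediate seq:

```clojure
(require '[criterium.core :as cc])

(def m (zipmap (range 16) (range 16)))

;; Seq-based iteration: allocates a seq and walks it node by node.
(cc/quick-bench
  (loop [kvs (seq m) acc 0]
    (if kvs
      (let [e (first kvs)]
        (recur (next kvs) (+ acc (key e) (val e))))
      acc)))

;; reduce-kv: iterates directly over the map's backing structure,
;; with no intermediate seq allocation.
(cc/quick-bench
  (reduce-kv (fn [acc k v] (+ acc k v)) 0 m))
```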
clj-fast implements all of those things heuristically, and I'll talk a bit about that later, too.

One finding is that the JIT makes a huge difference: array clone versus Arrays/copyOf is three times faster without the JIT. The JIT should be turned on, and it is on by default; just make sure of that.

Another is the cost of iteration. Compared here is the throughput of get-in versus the inline version of get-in, and we can see the inline one is about two to three times faster; here you can see the throughput change in percentages. We can also see the same for assoc, for example, which can take a variadic number of arguments: here the relative speedup is 50 to 60 percent. And we see similar behavior for select-keys, where it's even more dramatic: we can get three to four times speedups.

Another thing I found is that indirection has a cost, and it is measurable. Why is invoking a map faster than calling get? Two reasons. One is that when we call clojure.lang.RT/get, we go through an instance check, an if, and a cast, and only then do we dispatch to valAt. Alternatively, when we invoke a map, we dispatch to valAt directly. What's important to understand here is that RT/get might see a variety of types, not necessarily an ILookup, which can mean the if statement can't be inlined or elided because its outcome can't be predicted, and then the call to valAt won't be inlined either. On the other hand, when we dispatch by invoking the map, we go to the concrete implementation directly: there's no indirection and no code that can't be optimized. We can apply this knowledge to write a small macro which takes its first argument and invokes it on the second. By eliminating the indirection we get more speed, and now we understand the results we saw in the flame graph: we go from the invoke call directly to valAt, with no indirection and no overhead.

Another thing is the cost of allocation and iteration. Take =, for example: when it gets more than two arguments, it iterates over them. The rest function allocates a sequence, that sequence has to be loaded into memory, and every call to next allocates a new object. So you both churn allocations and constantly have to load new objects into cache. Now, the JVM is garbage collected and can handle that pretty well, but not optimally, and that's what's important to understand here.

So what can you do? ABC: Always Be Compiling. What does that mean? It means you need to do whatever you can at compile time, and to turn yourself into a little human compiler. What does a compiler do for you? It does loop unrolling. It does type specialization. It does partial application. We'll see now how you can do some of those things yourself.

For example, how can we unroll loops? Multiple arities. Take assoc: it takes a key and a value, and it might take a further sequence of keys and values, in which case it iterates over them. But we can give it extra arities for one pair, two pairs, three pairs, which unroll the implementation: then we dispatch directly to the correct arity instead of iterating. The same can be done with a macro: we take the sequence of keys and values and unroll it into a chain of assoc calls. Same with get-in: we take the sequence of keys and unroll it into a chain of get calls.
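Here's a minimal sketch of that kind of unrolling macro, in the spirit of clj-fast but not its exact implementation; it requires the keys to be a literal vector known at compile time:

```clojure
;; Unrolls (fast-get-in m [:a :b :c]) into a chain of get calls.
;; The key vector must be a compile-time literal.
(defmacro fast-get-in [m ks]
  (assert (vector? ks) "keys must be a literal vector")
  `(-> ~m ~@(map (fn [k] `(get ~k)) ks)))

;; Expands to (get (get (get {...} :a) :b) :c):
;; no key seq is allocated and no loop runs at runtime.
(fast-get-in {:a {:b {:c 1}}} [:a :b :c])
;; => 1
```

clj-fast's actual macros handle more cases than this; the sketch only shows the shape of the technique.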
We can also define everything static. For example, if we have a function that maps a composition of two functions over its input, we can define that composition once, statically. This has two advantages. One is that we don't allocate a new composed object every time we call comp. The second is that the static object can keep being optimized: every time we hit foo, we hit its invoke method, so instead of the JIT optimizing a throwaway object a little and then discarding that work, the one static object gets optimized across many more invocations.

Another aspect of ABC is Always Be Closing. What's important to understand here is that on the JVM, compile time is all the time, so we close over everything we know as soon as we know it. Take get-in, for example. We don't know the number of keys until we get them; when we do get them, we can count them, take them apart, and unroll the calls to get. But what if we know the keys before we know the map? Then we take only the keys now and take the map later: we return a function that will take the map and already contains the unrolled behavior. We close over the unrolled behavior inside the function we return.

Where can we use that? Here's an example from a library called O'Doyle Rules. We can see it calls get-in on key sequences that aren't known statically, and likewise update-in and assoc-in. So those are cases, both where we know the keys and where we don't, in which we can unroll those functions. And this is the bread and butter of what the library does: if we can speed that up, the library gets faster.

Another example of closing over everything we know is double dispatch: multimethods. They take a dispatch function, dispatch to the implementation via a dispatch table, then apply it. We take an argument, apply the dispatch function, get a dispatch value, get the function associated with that dispatch value, and apply it. Let's work backwards, starting from the fully unrolled implementation: is it the first key? Then the first function. Otherwise, the second key, the second function, et cetera. Once we have the fully unrolled implementation in mind, we need a sort of current/rest representation: the current key, and the rest of the behavior. Here we have a representation of that: is the dispatch value equal to the current key? If so, apply the current function; otherwise, call a function representing the rest of the behavior, which is currently unspecified. Then we take everything we don't know and turn it into an argument to a function. Let's assume we know nothing: not the dispatch value, the argument, the rest function, the current function, or the current key. But we actually do know the dispatch table in advance, so we can bind those arguments first. And the keen-eyed among you might recognize that this function can be passed to reduce-kv: we can reduce over the dispatch table and compose a function which implements the unrolled behavior. That is how you unroll a dispatch table.

Now, there's another thing we can eliminate here: indirection. The equality function is highly polymorphic; it reflects on the type of its arguments to dispatch to the correct implementation. But we already know each key's type. So let's assume there is a function which takes the key and returns an optimized function that checks whether a value is equal to that key. We can dispatch via the type: if it's a string, dispatch to equals. And voilà, we have closed over that as well.
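Here's a hedged sketch of that dispatch-table unrolling. This is my reconstruction of the idea, not code from the talk's slides; unroll-table and its arguments are illustrative names. reduce-kv composes one "current key, rest of the behavior" closure per table entry:

```clojure
;; Turn a dispatch table (map of dispatch value -> handler)
;; into a single composed function, closing over each entry.
(defn unroll-table [table default]
  (reduce-kv
    (fn [rest-fn k f]
      ;; One current-key/rest-of-behavior layer.
      (fn [dispatch-val arg]
        (if (= dispatch-val k)
          (f arg)
          (rest-fn dispatch-val arg))))
    default
    table))

(def handle
  (unroll-table {:inc inc, :dec dec}
                (fn [_ arg] arg)))  ; fallback when nothing matches

(handle :inc 1) ;; => 2
(handle :dec 1) ;; => 0
```

Each layer closes over its own key and handler, so the composed function never consults a table at runtime.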
Now, this is not a pie-in-the-sky principle; it is used right now inside Clojure itself. This is the transducer implementation of map. What does it do? It closes over the known things when they become known. You start with the mapping function, you later take the reducing function, and only then do you get the function you apply to the sequence of inputs to accumulate a result. And you can carry those closed-over values around. If you're familiar with the concept of operator fusion from reactive streams, it's exactly the same thing.

And this principle writ large is the idea of data and compilers. You can create data DSLs and use them to emit very complicated and involved hierarchies of classes or functions which implement the desired behaviors. Libraries like Malli and Reitit do just that. But if you think about it, what is a sequence of keys passed to get-in, if not a sequence of instructions for how to get? So those ideas actually converge; ours are just more ad hoc, specialized implementations of the same idea. And that's how you turn yourself into a human compiler.

When you do these things, you make life much easier for the JIT compiler and the CPU. Your type specialization makes call sites less polymorphic. Composed functions are nested classes, and the JIT compiler can inline them. When you go through reduce or other implementations which iterate directly over the backing arrays, you make the CPU itself happier.

And that leaves us facing the future. What can we do there? We can hope for stronger inlining support to be added, for example, to clojure.core. Currently we have the :inline-arities metadata, but we could perhaps have an inline predicate which examines the arguments themselves, so we can check whether they are something we can unroll. We could have multi-pass compilation that does all the things I've shown you heuristically: it would do them for us, and we wouldn't have to do them manually. And we could have some more specialized code paths, for example in the implementation of equality. You can look at it; it's quite interesting, but it is not the most optimal code path at the moment.

Thank you all. My name is Ben Sless, and this is my Twitter handle. You can find me on Twitter or on the Clojurians Slack. Good luck, enjoy the conference, and happy hacking!