I decided to give an overview of what to do when you have a performance problem in your Clojure project: how to detect the problem and how to improve performance. That is what this talk is about. It's not super deep; it's mostly an overview of the possible solutions so you can investigate more on your own. Okay, so suppose you have a Clojure system in production and you have to improve performance. The first instinct is to find the bottleneck, but before that: don't jump straight into finding the bottleneck and optimizing. It is super fun to optimize and tune code and extract milliseconds, and if it's a hobby project that's fine, because you're doing it for fun anyway. But if you're doing it professionally, it's very likely that you should not be optimizing at all. Even if you know that some part of your code is not optimal, you probably bring more value to your business by implementing some other feature instead of taking an endpoint that almost never runs from 30 milliseconds down to 20 milliseconds. So first, make sure that the optimization is really worth the effort. Also, sometimes the code is already super fast, so even if you make it 10 times faster, it won't be noticeably faster, depending on the application. Okay, so suppose you really do have a problem and you've decided to optimize. You have to find the code that is using the biggest chunk of your CPU, the biggest part of the time. That process is called profiling. There are tools for that: one Clojure-specific tool that I've used a couple of times before, which I think is pretty useful and easy to use, and general JVM tools like VisualVM and YourKit. I think YourKit is free to use for open source projects but paid for commercial ones; for VisualVM I don't know if it's the same.
Timbre is free to use. Timbre was a logging library that also included some profiling features, and some time ago it was split into two different libraries, which is why I put the parentheses: the profiling part now lives in a new library, Tufte. For benchmarking, when you have just one snippet of code and want to know how long it takes, you can use one of two things. There is a very convenient macro, time, in the Clojure standard library. You just wrap your block of code in a call to time, and it returns the same value, so it won't affect the result of your program, but it prints the elapsed time to standard out. The problem is that it is very inconsistent, not because time is wrong, but because of many peculiarities of the JVM. For example, the JIT, the just-in-time compiler, can make the second run of your code much faster than the first. So if you run with time twice in a row, or if you run once, think it's slow, make a change, and run again, you might say, ah, it's much faster, but that might not be because of your change. It might be because some other function invoked by your program was optimized by the just-in-time compiler. There are so many pitfalls that someone decided to create the library Criterium to avoid all of them, so I really recommend using that. Only use time because it's super convenient and you just want a rough estimate of how long something takes. Okay, now suppose you have found the part of the code that you want to optimize. What do you do? There are Clojure-specific things to do, and there are more general principles from software engineering that apply to any language. First of all, the first thing you should think about is whether there is a better algorithm to solve the problem.
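As a minimal sketch of the two approaches just described (assuming Criterium is on the classpath, e.g. the `criterium "0.4.6"` dependency):

```clojure
;; Quick timing with clojure.core/time: convenient, but a single
;; run is noisy (JIT warm-up, GC pauses can skew it).
(time (reduce + (range 1000000)))
;; prints "Elapsed time: ... msecs" and returns 499999500000

;; Criterium runs the expression many times, with warm-up, and
;; reports statistics instead of one noisy measurement.
(require '[criterium.core :refer [quick-bench]])
(quick-bench (reduce + (range 1000000)))
```

quick-bench prints the sample count, mean execution time, and variance, which is why it takes much longer than a single `time` call.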
There is no micro-optimization that will make a bigger impact than improving the big-O complexity: if you have n cubed or n to the fourth and you reduce it to n squared, that speeds everything up far more than any lower-level optimization. I'm sure you can come up with exceptions to this, and that's another thing that's very tricky about optimization: I will make a series of recommendations of changes that can make code run faster, and they won't always make it faster. So you should always change and profile again to see whether it really helped; sometimes the result is counter-intuitive. Even if you have an algorithm with good complexity, make sure that what you implemented really is that algorithm. Sometimes you know an algorithm solves the problem, you open the Wikipedia page, you see the complexity, and you think that's what your code implements, but due to some mistake, that's not what is running. I've seen this happen more than once. So make sure you're not traversing a sequence multiple times by mistake, and make sure you're using the right data structure: if you want to look up by index, use a vector rather than repeatedly walking down a sequence to get the nth element. Only after you have thought through these algorithmic changes, if nothing solves your problem, do you start to think about more specific implementation details, more Clojure-specific things. Usually we write iteration with map, reduce, and filter, or with for, the list comprehension. They are very readable and very nice to use, but they are often not the fastest way to run the code. So in the part of the code you have decided to optimize, you could experiment: if you're using map/reduce/filter, try for, and vice versa. Sometimes the difference is significant, like 20% or so. You can also try loop/recur. With a loop, you won't generate intermediate sequences.
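The alternatives above can be sketched on one toy problem, summing the squares of the even numbers below n (which form wins varies, so measure each):

```clojure
;; 1. Idiomatic sequence pipeline: readable, but builds lazy
;;    intermediate sequences.
(defn sum-sq-seq [n]
  (reduce + (map #(* % %) (filter even? (range n)))))

;; 2. for comprehension: sometimes faster, sometimes not.
(defn sum-sq-for [n]
  (reduce + (for [x (range n) :when (even? x)] (* x x))))

;; 3. loop/recur: no intermediate sequences, no laziness overhead.
(defn sum-sq-loop [n]
  (loop [x 0, acc 0]
    (if (< x n)
      (recur (inc x) (if (even? x) (+ acc (* x x)) acc))
      acc)))
```

All three return the same result (e.g. 120 for n = 10); only their allocation behavior differs, which is exactly what profiling will show.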
You won't have laziness, so it might be much faster because of that. If you do a plain recursion, you grow the stack, so it could cause a stack overflow, or it could take longer simply because of all that allocation. If you make it a tail call, tail recursion, either with loop/recur or by using recur to call the same function, then it doesn't grow the stack and it should run faster. If you have something like mutual recursion, loop/recur won't work; you have to use trampoline, which I won't cover in detail now, but the idea is that instead of returning the value, you change each function to return a zero-argument function that computes the value. It's not really hard to make this change, and then you won't consume the stack, so that might help. If none of that solves the problem, you can consider other data structures; you can use transients. Does anybody know what they are? Has anybody used them in production? Okay, so I'll explain briefly. Transients are a different kind of data structure that is not immutable like the usual Clojure ones. When you make a change, with assoc! or conj!, it doesn't preserve the old version: it gives you a new one with the change you made, but the old one is destroyed, so it's effectively changing in place, which kind of defeats the purpose of Clojure, so you have to use it carefully. Since it changes in place, it is much faster, but don't pass it around: don't write a function that expects a transient as an argument or returns one. So if, for example, you are generating one huge vector and want to prepopulate it, you can write one function that receives no arguments, or just the size of the vector, and returns the vector.
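The trampoline idea described above can be sketched with the classic mutually recursive even/odd pair (toy example; in real code you would just use even?):

```clojure
;; Direct mutual recursion grows the stack:
(declare my-odd?)
(defn my-even? [n] (if (zero? n) true  (my-odd? (dec n))))
(defn my-odd?  [n] (if (zero? n) false (my-even? (dec n))))
;; (my-even? 1000000) would throw StackOverflowError

;; Trampolined version: instead of calling the other function,
;; each function returns a zero-argument function (a thunk) that
;; would make that call. trampoline keeps invoking thunks until
;; it gets a non-function result, so the stack stays constant.
(declare odd*)
(defn even* [n] (if (zero? n) true  #(odd* (dec n))))
(defn odd*  [n] (if (zero? n) false #(even* (dec n))))

(trampoline even* 1000000) ;; => true, no stack overflow
```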
If you only create the transient inside that function and then convert it back to the persistent version before returning, then it's fine, because the danger is self-contained: only inside that function can something go wrong, and you're not changing how the rest of the project sees the data structure. Another thing you can do, since Java interop is so easy, is use the Java data structures, which mutate in place. Use them with care, but sometimes it's what you have to do; they're faster. We all love immutable data structures, they're super convenient, but there's a price we pay to use them. Depending on the problem domain, you might make use of parallelism. The first thing to think of is pmap. It's not super powerful and doesn't help that often, but the change is so easy: you type one character more, and sometimes your program becomes much faster because it can use all the cores of your processor. So keep it in mind, because when it's applicable it's really worth it. You can use reducers, which also help with running in parallel: if you have a normal reduce, depending on the properties of your problem, you can convert it to reducers. You can start normal JVM threads through interop and coordinate them manually if pmap doesn't fit your use case. You can use agents, which are not as manual as instantiating threads yourself, but they are also a bit tricky; they can introduce bugs in your code, so again, use them carefully. Okay, and another trick: you should avoid JVM reflection. Reflection is when you look up the method of an object that you want to call by name: you pass a string to a Java method to retrieve the method object, and then you invoke it with some arguments. This lookup by name is much, much slower than calling the method directly.
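The containment pattern described above can be sketched like this (the function name is made up for illustration):

```clojure
;; Builds {0 0, 1 1, ..., (n-1) (n-1)} using a transient
;; internally. The transient never escapes: it is created here,
;; mutated with assoc!, and frozen with persistent! before
;; returning, so callers only ever see an ordinary persistent map.
(defn zero-to-n-map [n]
  (persistent!
    (reduce (fn [m i] (assoc! m i i))
            (transient {})
            (range n))))

(zero-to-n-map 3) ;; => {0 0, 1 1, 2 2}
```

Note that assoc! (with the bang) must be used on transients, and its return value must be threaded through, just like with the persistent assoc.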
In our case, in Clojure, this usually happens when you're doing interop, an (.someMethod obj) call: when the compiler can't infer the object's type, the call is compiled into this lookup by name. If you call (set! *warn-on-reflection* true), Clojure will print a warning every time it emits a reflective lookup, so you will know you're doing something suboptimal. Fixing it is not hard, but take a look at the documentation. When you detect reflection, you provide type hints: you mark an argument, or the return value of a function, as being of a certain type, and you do that with metadata. The metadata must go on the symbol, not on the value itself, so it's a bit tricky, but it's easy once you learn how to do it. Just make sure you're putting the metadata on the argument symbol, not the value. I can show you an example if we have time. If you're doing number crunching, again it's specific to the problem domain, but there are many things you can try, and since they are not trivial, there are many libraries that help with them. One thing you can do without libraries is avoid boxed objects: in Java you would use the primitive lowercase int instead of the object Integer with a capital I. In Clojure there are functions that help with this; for example, int-array returns an array of primitive ints for you instead of an array of objects. As for libraries: core.matrix helps if you're doing matrix operations, matrix multiplication, and it helps with avoiding boxed objects. There are also libraries that can run on the GPU, the graphics card, via CUDA and OpenCL. You can take a look at them, but again, it's not for every use case; only a very small subset of problems can run on the GPU. Another thing you can do is always keep in mind that you can move one level down in the hierarchy of abstractions.
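A minimal sketch of the warning-and-hint workflow just described (using a toy .toUpperCase call as the assumed example):

```clojure
;; Ask the compiler to warn whenever it emits a reflective call.
(set! *warn-on-reflection* true)

;; Without a hint, the compiler can't resolve .toUpperCase at
;; compile time, so this emits a reflective call and a warning:
(defn shout [s] (.toUpperCase s))

;; The ^String metadata goes on the argument *symbol*, not on a
;; value; now the compiler resolves the method directly:
(defn shout [^String s] (.toUpperCase s))

(shout "fast") ;; => "FAST"
```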
So we write most of our business logic in Clojure because it's nice and convenient, but when you want to optimize, you can write a library in Java, build a jar, and call one function from that jar, since calling it from Clojure is easy, and it can run much faster because the compiler can optimize it. If that is not enough, you can use JNA or JNI, which are closer to running C code from the JVM. So there is a chain of levels, and you can go down a few of them. I've never had to use JNA or JNI in a Clojure project; personally, I think that if you have to go that far down, if you have a use case where you really need the very best performance, you probably shouldn't have started from Clojure to begin with. And that's it for the slides. There's some source code if you want, like the reflection warnings. The problem is that I can't see from here; can I change this? Can you read the code in the editor? Okay. So to use the Tufte library, the profiling library that was extracted from the logging library, you just require it and call one setup function at the beginning. Then if you have some block of code, like this range here, you wrap the parts you care about in a call to the p macro. Then, where you want to measure, you call profile with an options map. It prints the number of calls to each of these things: you give each one an identifier, and it tells you the number of calls and the minimum, maximum, and average time. So you can do this on your server in production without changing anything, since it still returns the same value. You can make this change in production in some parts of the code, or in some simulation of production; it doesn't matter. The factorial example is what I mentioned briefly.
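Since the screen is hard to read, the Tufte setup just described looks roughly like this (a sketch assuming the com.taoensso/tufte dependency; the handler-setup API has varied between versions):

```clojure
(require '[taoensso.tufte :as tufte])

;; Send accumulated profiling stats to stdout:
(tufte/add-basic-println-handler! {})

(defn work []
  ;; p tags a form with an identifier so it shows up in the stats.
  (tufte/p :sum   (reduce + (range 1000000)))
  (tufte/p :sort  (sort (shuffle (range 10000)))))

;; profile returns the body's value unchanged and reports a table
;; of call counts and min/max/mean times per p identifier:
(tufte/profile {} (work))
```

Because profile is transparent to the return value, it can be left wrapped around production code paths while you collect statistics.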
So, the factorial of a number: if the number is 0 or 1, return 1; otherwise return the number times the factorial of n minus 1. This is the basic recursive way to write it. If you want to know its performance, you can just wrap the call. If I just call it, I get the result, right? And if I call it with quick-bench around it, it will take a while; it is running, you can see here at the bottom. quick-bench, which comes from Criterium, runs my function many times, which is why it takes long, anywhere from about ten seconds to a minute depending on your function. Okay, while it's running, let's look at the other examples. Good, it finished: the evaluation count, the number of calls, the samples, the execution time, the mean. I usually just look at the mean, but some people want to know how much it varies between the minimum and the maximum. The idea was to show that if you use an accumulator and convert it to a tail call, it doesn't take as long to run, but we can skip that since it took longer than I wanted. Here's the example of how to create unboxed arrays. If you take this range and call to-array on it, it creates, and this is the JVM's representation of the type, an array of Objects. If you call into-array on the same range instead of to-array, it returns an array of Longs, a typed Long[] rather than Object[]. And if you use int-array, it returns an array of primitive ints, not longs or objects. In theory the primitive versions should be faster, but as I said, you should experiment. Now reflection. Suppose we have one big string here; the contents don't matter, it's just a string. And we have one function that gets the bytes from this string.
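The accumulator version that got skipped can be sketched like this (a reconstruction of the idea, not the exact demo code):

```clojure
;; Naive recursion: each call waits on the multiplication, so the
;; stack grows linearly with n.
(defn fact [n]
  (if (<= n 1) 1 (*' n (fact (dec n)))))

;; Accumulator version: the recursive step is in tail position, so
;; recur reuses the same stack frame instead of growing the stack.
;; (*' is the auto-promoting multiply, to avoid long overflow.)
(defn fact-acc [n]
  (loop [n n, acc 1]
    (if (<= n 1) acc (recur (dec n) (*' acc n)))))

(= (fact 20) (fact-acc 20)) ;; => true
```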
So if I call it on some string, it returns a byte array. But this .getBytes is the reflective call. What it's doing is similar to this: if you take the value, reflect on it with clojure.reflect, and get the members, that is what Clojure is doing under the hood when you call .getBytes. Look at getBytes here: you see at the bottom that there are five results, and they have different signatures. This one's return type is a byte array, and it has no parameters: it takes just the string and returns the byte array. But there is another implementation of this method that receives a byte array as a parameter and returns void, nothing. So how does it know which one to call? In this case it can tell just by the number of parameters, but implementations could differ only in the types of the parameters, so a reflective call has to go over all of these implementations to figure out which one matches the types of the arguments you pass. That's why it is slow. If you set *warn-on-reflection* to true, then when you define this function you get the reflection warning: reference to field getBytes can't be resolved. It doesn't know which of those methods it should call. How do you solve that? If you define the function with ^String metadata on the argument, the Clojure compiler now knows it is a String. You see that when I define it this way, I don't get the warning, so there's no reflective call. I ran a benchmark earlier: the first version, with reflection, took 11 microseconds, and without reflection it took half the time. Considering it is a very small change, even if sometimes it's hard to know exactly where to make it, the improvement is significant. It's a good trick.
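For readers who can't see the editor, the demo just described can be sketched like this:

```clojure
;; What reflection must search through: String has several
;; getBytes overloads with different signatures.
(require '[clojure.reflect :as r])
(filter #(= 'getBytes (:name %)) (:members (r/reflect "")))

;; Reflective version: the compiler doesn't know the type of s,
;; so with *warn-on-reflection* on it warns that the reference
;; to getBytes can't be resolved.
(defn to-bytes [s] (.getBytes s))

;; Hinted version: ^String metadata on the argument symbol lets
;; the compiler pick the overload at compile time. No warning.
(defn to-bytes [^String s] (.getBytes s))

(count (to-bytes "abc")) ;; => 3
```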
I have the pmap examples I mentioned. You have a basic function that is expensive to compute; in my example I just put a Thread/sleep of one second to pretend it takes one second to compute something. And you want to run it, say, five times and collect all the results. If you do it with map, and now using the time macro as I said before, it takes five seconds, because it runs sequentially: one second for each of the five calls. If I insert just one character, so now it is pmap, and run again, it takes one second for all five, because I have more than five cores. And here are the transients I mentioned. Suppose we run this reduce: you can see we are building a map from each number to itself, for all the numbers in a sequence, so zero to zero, one to one, and so on. This is one way you could write it, and I chose it because it maps directly to the transient version. Here we are assoc-ing into the persistent immutable map. Instead of that, you call transient on the empty map, change assoc to assoc!, the destructive version that modifies its argument, and at the end, when everything has run, you call persistent! to convert back to the persistent map. The first version, the normal way to write it, takes 91 milliseconds, and the version with transients takes about a third of the time, so sometimes it's a good improvement. Before writing it this way, you might try for comprehensions or map/filter/reduce, but I think this is the fastest of those. And that's it. Any questions? Okay, so I think we're done.