Hi, I'm quite excited to be here. Today I will be talking about performance optimization using code as data in a language called Clojure. A quick show of hands: how many of you know about Clojure? Okay. And how many of you are familiar with Java or the JVM? Cool, so we have the right audience. I work at a company called Concur. Concur does expense and travel management, is reasonably big now, and is part of SAP. It is a 23-year-old enterprise company where we use Clojure; we have a team of about twelve people who have been doing Clojure for the last two years, and we have had reasonable success with it. Today's talk is centered on some of the performance benefits we found by using code as data in our work. I also wrote a book on Clojure performance, and I write some open source code on GitHub.

Before we begin, let us look at some of the vocabulary we will be using in this talk. Whenever we talk about performance, we also have to talk about how we measure it, because there can be no performance discussion without some unit of measurement, and the process of measuring things and finding out the details is known as profiling — performance profiling. There are different kinds of profiling, and we will get into those in a bit. Besides profiling, the two important terms are latency and throughput. Latency is basically the time it takes to execute something. There are distinctions between latency, execution time, waiting time, and response time, but for all practical purposes we will assume here that latency means the total response time. When talking about latency it is easy to talk about a single measurement — one execution — so latency could be some number of milliseconds or seconds. But when a system runs for a long time, the latency varies from time to time depending on circumstances such as system load. So to characterize the latency of a system we usually talk in terms of the median latency, the mean latency, and the 99th-percentile latency, because latency samples form a distribution: if you build a histogram, you will find that most of the latency numbers are concentrated in the lower intervals. Latency also affects the throughput we get. Throughput is the number of operations per unit of time, and just like latency, throughput may vary across different time windows. To truly observe the throughput we are getting, we have to look at the system load at that time and whether the throughput can be sustained over a long period. Say you have an app with a memory leak: you will see the throughput slowly drop as more and more memory leaks, until the app becomes totally unusable. So that was the throughput at that system load. While talking about profiling we also need a few more terms, such as benchmarking. Benchmarking means running code purely to test its performance, not for actual production usage. The counterpart to benchmarking is collecting performance metrics in different scenarios, for example in QA or in production — that is a different kind of profiling. So profiling can be done in production as well as in development.
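To make that concrete, here is a small sketch (the sample numbers and helper names are illustrative, not from the talk) of why the mean alone can mislead and why we also look at the median and the 99th percentile:

```clojure
;; Characterizing latency as a distribution rather than a single number.
;; percentile uses a simplified nearest-rank style index — good enough
;; for illustration.
(defn percentile [samples p]
  (let [sorted (vec (sort samples))
        idx    (int (Math/ceil (* (/ p 100.0) (dec (count sorted)))))]
    (nth sorted idx)))

(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(def latencies-ms [12 14 13 15 11 13 240 14 12 13])  ; one slow outlier

(mean latencies-ms)            ;=> 35.7  (skewed upward by the outlier)
(percentile latencies-ms 50)   ;=> 13    (median: the typical experience)
(percentile latencies-ms 99)   ;=> 240   (tail latency)
```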
And while benchmarking, we sometimes, in fact, often do something known as comparative benchmarking. Let us say that we had a software deployed at version A, and we then deployed version B and we found that the performance either improved or dropped. So in this case, when we do a comparative benchmarking, just to figure out that how much of loss or benefit we are getting, the A would be the baseline and B would be the new, new metric that we are getting. And while testing performance, we also have to see that what kind of load that we can generate. We will shortly talk about how to generate load and see how to get the kind of performance in a representative scenario. And normally, so the profiling tools that we have today, they are of two kinds, sampling profilers and pressing profilers. Sampling profilers run by running your code many times, thousands of times and pick a couple of samples from those. Now, what is the reason that we have to run the code so many times? So it turns out that the system cannot achieve a stable running state until we continuously run the function because of the simple fact that our computers are more of a, you know, you know, stochastic systems where things are, you know, they have probabilistic characteristics about the kind of performance that they exhibit. So sampling profilers work by getting your program to a stable way of running things and then pick up samples. And the other kind is called pressing profilers where the running code is instrumented in a way that each point-to-point execution is measured. And that can actually slow down your system by a great deal. So, so, pressing profilers are to find out the actual time that something takes instead of sampling some of them and just to find hot, hot spots. So these days, there are certain tools that also do selective tracing, the profiling that you can do in production as well, via instrumentation. So we'll begin with microbenchmarking. How do we do this in enclosure? So in closure, the closure is a list, and being a list, it can create a code as data, and by that, I mean that in various ways, when we use macros or when we use eval, we can, in fact, pass arbitrary code that can be executed in or manipulated in a whatever way before that can be processed. So in closure, there is a library called Platerium, which has a very scientific way of doing microbenchmarking. For example, in this example, we can see that we are running a very simple small code for sleeping for one second, and that is benchmarked. So the way Platerium runs is, it runs a lot of times with your code, and it waits for the JVM to warm up completely before it can show you the true, true stable runtime performance. And then the kind of numbers that we get are the mean and the standard deviation, and the lower and the upper-upper quantiles. Though an interesting fact in software performance is that the distribution of the latency is rarely a Gaussian distribution. It is not really the kind of distribution that you see in statistics textbooks. But for the sake of communication and terminology, we still have tools that still communicate some numbers to us in terms of things that can relate to normal distribution. So the Platerium works, by the way, of a macro called Bench. Bench can take an arbitrary code that can execute, and macro can actually treat code as data. Therefore, it can take any code that it can execute, and it simply, simply evaluates that many, many times. 
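As a rough sketch of what the slide shows (assuming the criterium library is on the classpath; this is not the exact slide code):

```clojure
;; Microbenchmarking with Criterium.
(require '[criterium.core :as crit])

;; bench is a macro: it receives the expression as data, runs warm-up
;; iterations so the JVM can JIT-compile, then samples the stable runs
;; and reports mean, standard deviation and quantiles. With a 1-second
;; body this takes a while to finish, as the talk notes.
(crit/bench (Thread/sleep 1000))

;; quick-bench is a faster, less rigorous variant, handy at the REPL.
(crit/quick-bench (reduce + (range 1000)))
```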
But benchmarking code like this is not enough, because microbenchmarking with Criterium happens in a single thread: your code runs in one thread while the rest of the machine sits idle. That is not the scenario you will have in production, where the system is loaded with many threads and there is a lot of contention — and none of that contention exists when you microbenchmark with Criterium. So the picture you see while microbenchmarking can be very untrue to what happens when the same code hits production. For practical use of microbenchmarking you have to do comparative benchmarking, where you can see, say, a baseline and a new use case side by side. In this example we are using a library called Citius. Citius uses Criterium underneath, but it arranges the results so the two cases are easy to compare. The two pieces of code are not run at the same time — they are benchmarked sequentially — but they are reported side by side. In the first case — I'm not sure if everybody can see this — it says the second use case is 237.92% faster than the first one. But this is only when the thread count is one, which is the default scenario. To find out how the code will perform in production, you have to simulate the same kind of load you would have in production. Citius can simulate load by running the same code concurrently across many threads. In the second case you can see it running at a thread count of 40, meaning the same code is running simultaneously on 40 threads, and the performance difference we see is astounding: in the first case the mean execution time is 844 nanoseconds, and in the second it is 14 microseconds — easily a 15-fold difference. How did the code suddenly become so slow? Because most of your machine's resources — CPU, the L1 and L2 caches, memory — are being heavily contended by the different threads, and in production you have a similar scenario. No wonder code that microbenchmarks one way gives you a much higher response time in production. So part of the solution is to do load simulation like this while microbenchmarking, and see for yourself how the latency really behaves under load. Citius is an extension of Criterium, so we are doing code as data again here, because these are macros.

Talking specifically about code as data, there are two mechanisms we can use. The first is macros, which we have been talking about so far. The interesting thing about macros is that they are compile-time only: a macro is passed some arguments, the arguments can be code, and that code can be read, understood, and manipulated in various ways before the macro emits code back. So macros are basically functions that accept code and return code. The other way to do code as data — and it is rarely used in most programs — is eval. Clojure being a Lisp, it has a way to accept code as data and evaluate it at will.
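Since the exact Citius invocation is not reproduced here, the following library-agnostic sketch illustrates the load-simulation idea — the same expression run on 1 thread versus 40 — using plain futures; this is not the Citius API, just the concept:

```clojure
;; Run the same thunk concurrently on several threads and compare the
;; per-call latency with the single-threaded baseline.
(defn simulate-load
  "Run thunk f n-calls times on each of n-threads threads; return the
  mean per-call latency in nanoseconds observed under contention."
  [n-threads n-calls f]
  (let [worker  (fn []
                  (let [start (System/nanoTime)]
                    (dotimes [_ n-calls] (f))
                    (/ (- (System/nanoTime) start) (double n-calls))))
        results (doall (repeatedly n-threads #(future (worker))))]
    (/ (reduce + (map deref results)) (double n-threads))))

;; Single-thread baseline vs. 40 contending threads:
(simulate-load 1  100000 #(str "foo" "bar" "baz"))
(simulate-load 40 100000 #(str "foo" "bar" "baz"))
```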
But most Clojure programs do not use eval, because eval does not have much practical use — though we'll see in a while how eval can be used to improve performance. While adopting these two techniques there are certain challenges, and we have to make the trade-off knowing what those challenges are. First of all, debugging macros — debugging generated code — is extremely hard, because you do not have a clear idea of what code has been generated. You either have to guess, or you have to inspect the emitted code, and often that is not obvious, because it is not code you wrote: you used a macro and it emitted something, or eval created something you do not know about. So a stack trace may not make sense when something goes wrong there. Macros and eval are also very hard to compose, because they manipulate code in various ways and what they return is another block of code — more data — and those things are not easy to compose, because you do not know what they will return. Besides this, there is one important point that may impact the performance you get: the JVM can improve the performance of your code at runtime by detecting functions that are called repeatedly, and if they are small enough it inlines them into your code. So the JVM can inline certain small functions; but if you are using macros and your code becomes bloated as a result of the manipulation, it may no longer be inlined. This is a very real challenge, so for whatever benchmarks I show in this presentation, you should do your own benchmarking with your own use case, just to verify you are not being impacted by these things.

The first use case we will talk about is faster string concatenation using macros. The way this solution started forming was when I saw a post by an angry user on the Clojure mailing list, who simply complained that Clojure is slower than Java. I asked that user for an example, and he showed a piece of code that was using Clojure's str function to concatenate strings a couple of million times, alongside Java code that was doing string + string + string. I quickly realized that the main difference between the two approaches was that Clojure's str is a function, so every time you use it you pay for the invocation of that function. When you write string + string + string in Java, the code the Java compiler emits creates a StringBuilder and appends everything right there — all of your string concatenation in Java is inlined right away. That makes a lot of difference, and we don't have any such thing in Clojure core. So I set out to create a library called Stringer, which began with this use case. It now has a couple of functions — not functions, actually, they are macros — to do similar things as in Java. strcat is a macro that can be used in place of Clojure core's str function, and it concatenates tokens the same way Java's inline + operator does. Similarly, strdel is another macro that does delimited string concatenation. They both emit in-place StringBuilder manipulation.
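A simplified sketch of the technique (not the actual Stringer implementation) shows how a strcat-style macro can see its arguments as data and emit inline StringBuilder appends:

```clojure
;; The macro receives its arguments unevaluated, at compile time, and
;; emits a chain of StringBuilder appends — the same shape of code the
;; Java compiler produces for string + string + string.
(defmacro my-strcat
  "Concatenate the given expressions via an inline StringBuilder."
  [& args]
  (let [sb (gensym "sb")]
    `(let [~sb (StringBuilder.)]
       ~@(for [arg args]
           `(.append ~sb ~arg))
       (.toString ~sb))))

;; Usage — expands to a single StringBuilder chain, no per-token
;; function-call overhead:
(my-strcat "order-" 42 "/" "confirmed")
;; => "order-42/confirmed"
```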
Of course, they are macros and not functions, so you cannot use them the same way you use str — str can be passed around like a function, but strcat and strdel cannot be. At the same time, while they are not functions, they give you a tremendous performance benefit. Let us see. This is a latency chart where I simulate load with 40 threads. The red bar is what you see when you use the str function, and the small blue one — can you see that? — is strcat. There are different use cases, and in most of them you will find a tremendous difference when you have many small tokens, which represents most of the use cases we have in real programs. The actual numbers will vary based on the hardware and software you have, but the relative numbers will not vary very much. At Concur we were profiling certain things and found a couple of bottlenecks in some heavy string concatenation; when we switched to strcat, the bottlenecks simply disappeared. So this is speaking from using this macro in production. Similarly with strdel, we see a comparable performance difference between str and clojure.string/join versus strdel. So currently, in our best practices at Concur, we normally favor strcat and strdel over the functions we find in Clojure; we of course benchmark things case by case as well, but for most things we have found these two macros to be hugely beneficial. They are both macros, and they are doing code as data.

Another use case in string concatenation is formatted strings. We all know Clojure has a function called format, which is like printf: you give a format string and a couple of parameters, and the format string can contain formatting specifiers such as %s, %d, and so on. The way format works is that it delegates to Java's String.format, and all of that happens at runtime: the entire format string is parsed at runtime. It may be possible to memoize some of the parsed format strings, but at least the first time through, the whole format string is parsed. Stringer has a macro called strfmt that tries to do a similar thing without being a function. It expects the first argument to be a format string that is available at compile time; that means it knows the format string at compile time, and by going through it, it knows what format specifiers are embedded in it and can emit code just like the inline StringBuilder concatenation, instead of calling Java's String.format. The performance benefits are again very large — the relative difference is huge — and interestingly, this is even faster than Java's String.format: you can benchmark String.format in Java side by side with strfmt, and you instantly see a huge jump. At Concur, about 15% to 20% of our code is logging, and when we log we do a lot of string manipulation. Just by using strcat and strfmt in logging code, our latency has dropped by quite a bit. There is one more use case I'll talk about here, and it is not about macros at all.
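Again as a toy sketch rather than Stringer's real code, a strfmt-style macro that handles only %s might parse the literal format string at expansion time like this:

```clojure
;; Because the format string is a compile-time literal, the macro can
;; split it on %s specifiers once, at expansion time, and emit plain
;; StringBuilder appends — nothing is parsed at runtime.
(require '[clojure.string :as str])

(defmacro my-strfmt
  [fmt & args]
  (assert (string? fmt) "format string must be a compile-time string literal")
  (let [parts  (str/split fmt #"%s" -1)      ; literal chunks around each %s
        _      (assert (= (count args) (dec (count parts)))
                       "argument count must match the number of %s specifiers")
        sb     (gensym "sb")
        pieces (cons (first parts)
                     (mapcat vector args (rest parts)))]
    `(let [~sb (StringBuilder.)]
       ~@(for [p pieces :when (not= "" p)]
           `(.append ~sb ~p))
       (.toString ~sb))))

;; Usage: expands to appends of "user ", user, " has ", n, " items".
(let [user "alice" n 3]
  (my-strfmt "user %s has %s items" user n))
;; => "user alice has 3 items"
```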
And this is just a reminder that not every optimization has to be backed by code as data: you can also choose cheaper abstractions, cheaper data structures, to achieve performance. strtbl is an alternative to clojure.pprint/print-table, which prints out a table. We were actually using print-table in production, for a report about latency that we generate at runtime, and we saw that using strtbl gives us a great deal of benefit — the differences, again, are quite tremendous. Just to remind you, this is not code as data; this simply uses arrays instead of the other ways of processing the data, and just by using a cheaper abstraction, a cheaper data structure, we get this kind of performance.

The second area of use cases is logging. Here I won't show any performance charts; rather, I'll talk about how some of the logging abstractions built in Clojure save CPU time. The first library is Clojure's contrib library tools.logging. It has loggers for the various levels, such as info, warn, error, and so on. If you have seen logging code in Java, whenever you do debug logging you write: if logger.isDebugEnabled(), then do certain things. That is because whenever we log, we are also computing certain values to put into the message, and those values must be computed before the log call can be made. In Clojure, since we have macros that can read the code ahead of time, what tools.logging does is check whether the level (say debug) is enabled and wrap your code so that the status is checked first and your expressions are evaluated only when that level is enabled. That completely removes the need for the programmer to check the status beforehand. It does pose a challenge for libraries that want to extend tools.logging — for example, a library called Cambium, which extends tools.logging to provide context support. Logs today are not merely strings: we want to log events, and events can be maps of various attributes; the logs we write at Concur normally have twenty to twenty-five key-value pairs, and they are maps. So we need support for contextual logging, and Cambium lets you do that — but extending the macros in tools.logging is hard. The way Cambium does it is that it has macros that emit macros, in order to become composable. This is something quite unusual that you would normally not find programs doing, but when it is required, it can be a solution.

One more use case I'll talk about is routing on the web, and this use case does not use macros; rather, it uses eval. We will see briefly how that works.
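The idea behind the level-aware logging macros can be sketched like this (a simplified illustration, not the tools.logging implementation):

```clojure
;; Because a macro receives the message expression as data, it can wrap it
;; so the potentially expensive message is only computed when the level is
;; enabled — the programmer never writes the isDebugEnabled check by hand.
(def ^:dynamic *debug-enabled?* false)   ; stand-in for the real logger's check

(defmacro debug-log
  "Evaluate and print msg-expr only if debug logging is enabled."
  [msg-expr]
  `(when *debug-enabled?*
     (println "DEBUG:" ~msg-expr)))

(defn expensive-report []                ; placeholder for a costly computation
  (Thread/sleep 500)
  {:status :ok})

;; expensive-report is never called here, because debug is disabled:
(debug-log (pr-str (expensive-report)))

;; It runs only when the level is on:
(binding [*debug-enabled?* true]
  (debug-log (pr-str (expensive-report))))
```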
So let us say we have a bunch of routes. Why are we even talking about routes? Because the fundamental abstraction in Clojure's web stack is Ring, and the Ring spec says a handler is a function of arity 1 that accepts a request map and processes it. But in real web apps we have many routes that we need to respond to. The way we currently solve that problem is to put one more library on top of Ring — for example Compojure, or Liberator, and so on — which gives you a way to express your routes, and those routes are combined into one Ring handler function. The library I'm talking about here, called Calfpath, emerged because in our production systems we had about 20 middleware functions stacked on top of the default Ring handler function. After 20 middleware, those things slow your app down somewhat; and not only do they slow it down, everything also becomes a lot more complex, and things that could have been done later have to be shifted up because of the way middleware works. Calfpath lets you express your routes as a vector of maps with certain attributes, and those can be dispatched upon based on the Ring request. The first implementation we had simply iterated through the routes vector: a straightforward implementation that picks the URI or the method or whatever match is required and goes through the list one by one. That gave us performance we were not happy with, compared to something else we had before. So we thought about how to get close to the performance we had prior to that, and we turned to a technique called loop unrolling. Loop unrolling means that instead of a loop that goes through each element of some data, you have equivalent code that directly picks up each element and executes it — you convert your iteration into straight-line code. That bloats your code, but often the resulting code is faster. The way we do it in Calfpath is to read the routes data ahead of time and generate code at runtime: we emit code that can be evaluated later, and by generating the code we wrap the entire generated form in a function, and that function behaves like a Ring handler function.
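A simplified sketch of that eval-based approach (not Calfpath's actual implementation) might generate an unrolled cond from the routes vector and eval it into a Ring handler:

```clojure
;; Handlers are referenced as vars so they embed cleanly in the generated code.
(defn post-order  [request] {:status 201 :body "created"})
(defn list-orders [request] {:status 200 :body "orders"})

(def routes
  [{:uri "/orders" :method :post :handler #'post-order}
   {:uri "/orders" :method :get  :handler #'list-orders}])

(defn make-unrolled-handler
  "Generate one cond clause per route (no iteration at request time),
  then eval the generated form into a handler function."
  [routes]
  (let [request (gensym "request")
        clauses (mapcat (fn [{:keys [uri method handler]}]
                          [`(and (= ~uri (:uri ~request))
                                 (= ~method (:request-method ~request)))
                           `(~handler ~request)])
                        routes)]
    (eval `(fn [~request]
             (cond ~@clauses
                   :else {:status 404 :body "not found"})))))

;; Built once at initialization (eval is expensive), used at request time:
(def handler (make-unrolled-handler routes))
(handler {:uri "/orders" :request-method :get})
;; => {:status 200, :body "orders"}
```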
And this is done during application initialization, because our app is divided into initialization time and runtime, and these kinds of things can be done only at initialization time, because eval is quite expensive at runtime. So you use eval to emit code and generate the artifact that you are going to use at runtime: we use eval at initialization time, and the function it returns is what executes at runtime. The performance difference we got is again quite interesting. On the left-hand side, the one in red is Compojure, and the one in blue is a library called Clout, on which Compojure is based. The remaining three — green, yellow, and magenta — are from Calfpath. The green one is the version we were using earlier, which has really tight performance because it is a macro. The yellow one is the version that iterates through the data structure, and the magenta one is the loop-unrolled version. The difference that matters here is the one between the yellow and the magenta: you can see that the performance of the loop-unrolled version is quite close to the macro version in green. So the loop-unrolled route implementation is equivalent in performance to a hand-rolled macro version, which is quite performant but does not use the routes-as-data abstraction. Once again, this slide is generated from numbers based on 40 concurrent threads, and the difference is quite significant.

Finally, I'll talk about a thing called latency breakup charts. When you run profilers on your code, they cannot directly tell you where the latency in a request goes; rather, they tell you which code is consuming the CPU, because they constantly sample your running code and capture the places where the CPU is being used. If you are doing some kind of I/O, the I/O will not even figure in the profiler reports. So we have something known as Espejito, a library that lets you instrument your code to find the latency differences between different layers in the code. It uses a thread-local metrics context to store the metrics data across the various layers. And again, it uses macros, so this is code as data once more.
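The instrumentation idea can be sketched roughly like this (an illustration, not Espejito's actual API):

```clojure
;; A macro wraps each layer, measures its elapsed time, and records it in a
;; dynamic (thread-local) context so a latency breakup can be reported later.
(def ^:dynamic *latency-context* nil)

(defmacro measure
  "Evaluate body, recording its elapsed time under layer-name."
  [layer-name & body]
  `(let [start#  (System/nanoTime)
         result# (do ~@body)]
     (when *latency-context*
       (swap! *latency-context* conj
              {:layer  ~layer-name
               :millis (/ (- (System/nanoTime) start#) 1e6)}))
     result#))

(defn fetch-items []
  (measure "db.fetch" (Thread/sleep 20) [:item-1 :item-2]))

(defn post-order []
  (measure "web.post-order"
    (let [items (measure "fetch-items" (fetch-items))]
      (measure "queue.put" (Thread/sleep 5) {:queued items}))))

;; Bind a fresh context per request, run the layer stack, then report:
(binding [*latency-context* (atom [])]
  (post-order)
  @*latency-context*)
;; => a vector of {:layer ... :millis ...} entries, innermost layers first
```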
And the kind of result you get is something like this. It begins with the web layer that is posting the order, and that calls, say, two functions: one fetches the items, which in turn calls db.fetch, and the other posts something into a queue. It gives you a complete breakdown of where the latency was spent, and it tells you exactly where the bottleneck is, because the individual column tells you what percentage of the total latency is due to that particular layer. We have some additional code at Concur to work with Espejito, and we use this in production: we instrument all of the interesting code so that it works the same way but also collects information about the latency cost. We also have a use case where we collect the total latency taken and report the entire chart to our logs whenever it crosses a certain threshold. This has been quite a lifesaver for us for doing different kinds of profiling, because while chasing sub-millisecond response times we have found it to be a very useful tool. The benchmarks I have run and the graphs I have shown use this hardware — it is a physical machine — and this software stack: Java running in server VM mode with a two-gigabyte heap size, using Clojure 1.8, which is the latest stable release. And that is pretty much it, so I'll be happy to take any questions, if there are any.

So the question is whether code as data is only relevant for macros and eval, and not for functions being passed around, right? The answer is that when you pass a function, the function is an object. It is completely opaque; you cannot see anything inside it. So it is not data: code that you cannot inspect inside is not data, and it cannot be manipulated the way code as data can. For example, a macro such as strfmt can look at the parameter it was given. Or, to give another example, the instrumentation we did with Espejito is a macro, and in a macro you can really inspect what is in the code, and that can be manipulated better than functions can, because functions are opaque — they can be observed only at the topmost level, and you cannot really see inside.

I did not get the question. So with eval, you don't actually pass functions around, you pass code. Functions cannot be evaled the way they can in some other languages, because functions are invoked — they are executed.
They are not evaled; code is evaled — code as data can be evaluated. That is the fundamental difference in the notion between eval and invoke. The next question is whether the loop-unrolling optimization in Calfpath depends on Calfpath itself or on some runtime feature of the JVM. It is a function in Calfpath that does it: it takes a look at the routes data structure you pass, and it constructs the various fragments of code that implement the loop unrolling. That emits very tight code — the kind you would otherwise write as a hand-rolled function — so it is similar in performance to what you would get by coding it by hand, and that is what creates the difference in performance. Can you say it again? Yes, Citius can be used for Java as well, because Java code can be called from within Clojure: you can write code that calls Java, and that would be the same bytecode and therefore the same performance. So that's all, I guess. Thank you, and if there are any more questions, I'll be around, so you can meet me and talk to me about these things. Thank you.