So hi to everybody. This is the London Clojure Meetup. Today we have the pleasure of having Justin with us. Justin has been developing in Clojure for over 10 years, and he's passionate about microbenchmarks, JIT compilers, and dynamic languages. He's been working on jmh-clojure, a library built on the OpenJDK JMH library, which is the Java Microbenchmark Harness. And he's going to tell us more about why this is a good choice for microbenchmarking.

I don't know if you have any experience with microbenchmarking. For me, it's terribly frustrating that you run a benchmark, and then, without any change, you repeat the same benchmark and get completely different results. There is actually a paper on this, from Mytkowicz, I think, in 2009, showing that memory layout alone can bias results by up to 40%. And that's layout alone; things like L1 and L2 cache effects can influence your performance benchmarks, and on top of that there is all the statistical error you can introduce. There is also another researcher, Emery Berger, who gave a nice presentation at Strange Loop showing that even the Linux username can influence your performance benchmark. They were running the same binary on the same machine with the same test, the same configuration, the same settings, just two different usernames, and they were seeing different performance. They actually built a tool for the C++ compiler to mitigate this problem. So JMH does quite a lot of internal things and statistical measures to avoid all of this, and Justin will tell us a lot about how it works, how to use it, and how to get the best out of the library.

So I will now pass it to Justin so he can start his presentation. We'll have questions at the end; Justin is going to take them then, so please collect your questions so that we can have a discussion as well. Justin, you can start sharing your screen.

Okay, hi, my name is Justin and I'm here to talk to you today about my library for benchmarking on the JVM in Clojure. First off, I do want to say thank you again to Bruno for inviting me. This is my first time giving a talk of any kind, but I've been a language and performance enthusiast for many years. Thanks for having me, and I'm happy to be here.

So, a couple of things to get out of the way up front. This talk is not going to be about teaching you how to write fast code; this library is a great tool to have available if you wish to do so, however. Second, benchmarking and performance tuning is one of the more complicated topics on the JVM, and since I have no way of knowing the general level of experience of the audience here, I'm going to assume just a casual familiarity with the subject. And third, to save some syllables, I'm going to use the terms benchmark and test interchangeably, so don't be confused.

So, this is a little bit about me and how I got to this point. A few years ago, I was writing a compiler in Clojure. It would take data that described a superset of regular expressions and construct optimized parsing code. For the compiler, I was using another library that I wrote for functional bytecode composition. At one point, I had multiple different buffer implementations. For example, there was one that used raw byte arrays, there was one that used a custom immutable type that I wrote, and so on. And I was benchmarking them against each other for different workloads.
This is how I usually work when I'm benchmarking: by testing different implementations and algorithms against each other. As I expected, the byte array version was generally the fastest, but I just wanted to know by how much, since I was going to pick just one implementation in the end to use. Inexplicably, however, sometimes an implementation would just slow down by a significant amount, seemingly at random.

I was doing the testing via Criterium. I'm sure most of you have heard of this library. It's a great, well-written library; it's the go-to library for most people when benchmarking Clojure code. And I want to say up front that this talk is not about bashing Criterium or any other library. But in my case, I would benchmark the parser with the different implementations, and I started to notice that the results would come out different depending on what order I ran them in, how many times I ran them, and so forth, when nothing else had changed in the code, kind of like what Bruno was talking about before. Whether in a REPL or from a script, the results would very rarely be as consistent as I expected. So I examined everything I could think of at the time. I checked for configuration issues, I profiled, and this was on a dedicated box with nothing else really running on it. It got quite maddening, until I finally realized I could get better consistency if I ran each individual test separately, in a new JVM process each time. The timings still weren't as stable as I would have liked, but at least it was something. I knew basically how much faster certain implementations were relative to each other and, better than that, I never saw the random slowdowns again.

So at that point, I decided to learn more about benchmarking on the JVM, so this kind of thing wouldn't come back to bite me in the future. But I also started to wonder what alternatives might be available to the existing libraries I had been using. This is a quote from a talk given in 2018 by Gil Tene, a person that I think many would agree is one of the foremost experts on JVMs. To my knowledge, the general consensus is that if you're doing serious benchmarking or microbenchmarking on the JVM, JMH, the Java Microbenchmark Harness, is the tool you should consider using. jmh-clojure is a functional, data-oriented interface to JMH. I hope in this talk to give you some reasons why someone like Gil Tene would recommend it.

I'm going to touch a bit later on why those earlier results were the way they were. But for now, I'm going to look at a benchmark I created for this talk. I focused on making this as easily digestible as possible. It's not meant to be a realistic example; it's for demonstration purposes, meant to shed some light on a larger issue. The results were taken from multiple runs of the entire benchmark suite, using the latest versions of each library. I'm showing only a sampling of the runs here; this sample is representative of the whole, however. I'll be creating a GitHub repository for this talk that will have all the code, data, and materials for this and all the subsequent examples. They're all runnable, so feel free to try them yourself if you want to compare with my results, which is always a good idea. Anyway, this took many hours on my hardware, which is why none of the benchmarks in this talk are going to be run in real time, due to time constraints and not wanting to overload my aging machine while I'm streaming here. Benchmarking correctly on the JVM takes time. Okay.
We have a simple protocol. It essentially just does what Clojure's nth function does, except that it's extensible. It should be pretty self-explanatory. If you're not familiar with the Indexed interface, classes like PersistentVector implement it to provide access by index. You can see at the bottom some basic examples. Pretty simple.

Now, we're going to run some tests using inputs of different types and sizes. First, we'll be using Criterium. This is an excerpt of the runner script; I'm not going to show the entire code that generates these results for this first example. Again, this will be in the talk repository. We're simply going to create a vector or a string of a given size and access a value by index in the middle. I should say, too, if you've never done serious benchmarking on the JVM before, note that both libraries will run for a warm-up period, to get the optimizer to kick in before doing the measurements.

Okay, here are some results. These are measured in average time, lower being better. We see here that even when extending existing types like CharSequence, protocols are pretty quick. Under the hood, protocols use what's known as a polymorphic inline cache, a simple optimization: when the protocol function is invoked, if the current type matches the most recently seen cache entry, it avoids the overhead of a table lookup, and this fast path amounts to just a couple of comparisons. Anyway, we'll get back to these results in a second.

In jmh-clojure, we usually write our benchmark specifications as data, not as code. There are benefits to doing this, but for now you're going to have to take my word for it, so don't worry about the format of this just yet. Real quick, you will notice that the functions to be tested are given as namespaced symbols, and their argument names correlate with the state keys below, and so on. I'll explain all this in detail in just a little bit.

Okay, so now the excerpt of the code that uses this data. Importantly, here we're not passing an in-memory function object to be benchmarked, like with Criterium. We are providing the data from our file, along with some options. I'll explain why this is later. The options I'm not showing here just configure JMH to report results in a comparable format. And for anyone who is familiar with JMH, note that I have configured it to be as fair as possible to Criterium.

So again, some results. We're not going to dwell on these exact numbers for the time being, I'll get back to them in a second, but right away we can see that Criterium is telling us that string access is slightly faster than vector access, while JMH is telling us that for small vectors they're basically identical. Okay. The one thing we might be able to derive from this so far is that once a Clojure vector gets to a certain size, here 100,000 elements, it no longer maintains constant time complexity comparable to a string, due to it being a tree, or a trie depending on how you want to pronounce it, which is expected. Before I continue, keep in mind I'm not showing all the data points in these examples; this is just the primary result, to keep it simple. I have preserved the whole dataset, however, so things like standard deviation, percentiles, et cetera are available.
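To ground this example, here is a rough sketch of what a protocol like the one described might look like, extended to an existing type like CharSequence. The names here are illustrative, not the talk's actual code, which is in its repository:

    (defprotocol ValueAt
      "Extensible positional access, like clojure.core/nth."
      (value-at [coll idx] "Return the element of coll at idx."))

    (extend-protocol ValueAt
      clojure.lang.Indexed               ;; vectors, etc.
      (value-at [coll idx] (.nth coll (int idx)))
      CharSequence                       ;; strings, string builders, etc.
      (value-at [s idx] (.charAt s (int idx))))

    ;; basic examples
    (value-at [1 2 3] 1)  ;=> 2
    (value-at "abc" 1)    ;=> \b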
So what happens if we interleave the implementations? Before, we had string, string, vector, vector; now we're doing string, vector, string, vector. So with Criterium again: here the first table is the previous result. Notice the arrow here on the second string row. For the interleaved result, we see something interesting, and this is a trend across all 10 runs. The second string test here, in the second table, is slower than the first and is now around as slow as indexing the vector. Notice the arrow next to the row again. Now, this is actually somewhat misleading. If we instead interleave with the vector test first, like vector, string, vector, string, displayed here in the third table, we see that the vector result is now faster than the previous one and the string measurements that follow it. I can talk a bit more later about why this is happening, if anyone is curious, but I'm fairly certain that Clojure's protocols are not at fault for the slowdown here. Anyway, the actual numbers are not really what is important here. I have a relatively ancient machine here; results on newer machines will be faster, et cetera. All I'm trying to show is the inconsistency with the previous results. And, oh no.

So now with JMH. Again, the first table is the previous result. Unlike Criterium, when benchmarking, JMH utilizes separate Java processes, which it calls forks. Running functions directly as code in the same process, like you would with Criterium, is also supported, but it's not ideal for some important reasons; in fact, JMH goes so far as to warn you against it if you try. We can see in the interleaved results that they are more or less identical to the first result. Actually, if you look at the complete JMH dataset, the results are almost like clockwork. Across the full runs, each type and size combination result is usually within one to two nanoseconds, a lot of the time less, while the Criterium dataset is overall more varied.

And I do want to say, some of you might be thinking right now: we're measuring nanoseconds here, does this really matter that much? Well, first, when writing code to be as fast as possible, nanoseconds actually do matter quite a bit. Nowadays on the JVM, reading array elements in a predictable linear fashion can report sub-nanosecond element access times, due to vectorization performed by the JIT, for example. But secondly, this is just an example. For real benchmarks that are measuring something useful, the effect can and will be more pronounced. Again, the exact numbers are not really the point; they'll be different on different hardware. This is about being able to measure consistently and produce stable data.

Here's a pertinent quote from the JMH docs: JVMs are notoriously good at profile-guided optimizations. This is bad for benchmarks because different tests can mix their profiles together and then render the uniformly bad code for every test. Forking each test can help evade this issue. And forking is just one of the many things JMH does for you to increase consistency and accuracy.

I should say, too, that Criterium isn't doing anything quote-unquote wrong here, and I'm definitely not trying to defame it. Here, it's just at the mercy of the underlying JIT. A lot of times you will definitely have multiple implementations running side by side in a codebase, with an interface or similar, and you want your test to reflect this bimorphic or megamorphic environment. And again, JMH supports this, as previously mentioned. However, when comparing implementations with the intent to choose one over the other, it can be counterproductive, and forking isolates you from this type of problem.
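For the curious, since the benchmarks are described as data rather than in-memory functions, kicking off a forked run from code looks roughly like this. The jmh.core/run entry point is the library's; the specific option keys shown (:fork for the fork count, :status for the JMH log, both mentioned in this talk) should be treated as illustrative:

    (require '[jmh.core :as jmh]
             '[clojure.edn :as edn])

    ;; read the benchmark specification from the data file
    (def env (edn/read-string (slurp "benchmarks.edn")))

    ;; each benchmark runs in its own forked JVM process;
    ;; :fork controls how many forks are used per benchmark
    (jmh/run env {:fork 1, :status true})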
And real quick, I will mention that these problems are maybe broader than you might think. Here we have some similar-sounding issues regarding Haskell's Criterion library, which was an inspiration for Criterium. These are some excerpts from comments on some open issues, the first of which is several years old. We've seen the title of the first one: it talks about benchmarks being influenced by the existence or absence of other benchmarks. And in the second, an excerpt from a comment talks about the solution being to run tests in a separate process. I don't know about you, but this was all very interesting to me: similar-sounding problems, all without a JIT.

So now we're going to get to the main part of the talk, where I'm going to start with the most basic example and then gradually show off more of JMH's and jmh-clojure's features. Note that this can't be a full JMH tutorial; if I tried to cover all the features, this would be a much longer talk. But the javadocs are very well written and tutorial-like, so if you want to learn more, read those. That's what I did.

First, we're going to quickly set up lein-jmh, a simple front-end that I wrote with some convenience features. Keep in mind the library works bare-bones too; the plugin is not a requirement. Anyway, in our project file, we're going to make sure to clear the default JVM arguments added by Leiningen in the :jmh profile, since these prevent some important compiler optimizations which give us the fastest possible code.

This first example is going to show you one of the pitfalls of benchmarking on the JVM. Again, I'm not able to assume the level of familiarity with the subject, but this is an interesting example nonetheless, and I want to get this out of the way right up front: these types of problems exist even when using JMH. We're going to attempt to measure a function that does nothing, as a baseline. We're going to try to answer the question of how much overhead JMH is imposing, by comparing against some subsequent tests. Do keep in mind that these types of benchmarks are very artificial. Some of the problems we're going to see here won't come up with functions that do real work, but they're important things to be aware of nonetheless.

So we have a function in our core namespace from before. It's very exciting: it's empty. Now, in our benchmark data file, we have edn data specifying our empty function as our sole benchmark. And to run it, we're telling the plugin to show us just the score; you can see here :only and the keys we're interested in from the result maps. Score is JMH terminology for the primary result of the benchmark. Let me get my mouse out of the way. We're only using one fork; fewer forks equal quicker benchmarks, with the trade-off being less confidence in the result. One will be fine for our purposes here. And JMH is going to use throughput mode by default. It measures operations per time unit, seconds if we don't explicitly tell it otherwise. In this case, operations means function calls.

Now the result. JMH informs us that we can do over 200 million nothings per second. Okay, we'll keep this in mind. So let's compare to doing a trivial operation, like adding two primitives. Don't worry about where the arguments to our function are coming from for the time being. Real quick, I do want to note that jmh-clojure is aware of the primitive metadata and will compile the benchmark using one of Clojure's special primitive interface types, so no boxing will occur; that would influence the results. I'm going to show the benchmark data file later, after I explain a few more things. What we're interested in here is the results.
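For reference, a minimal sketch of this setup, assuming a hypothetical demo.core namespace. The ^:replace trick for clearing Leiningen's default JVM args is standard Leiningen mechanics; the bare-symbol benchmark form relies on the normalization the talk describes later, where a symbol becomes a map with the function under the :fn key:

    ;; project.clj: clear Leiningen's default JVM args in the :jmh
    ;; profile so the forked JVMs run with full optimizations
    :profiles {:jmh {:jvm-opts ^:replace []}}

    ;; demo/core.clj
    (ns demo.core)

    (defn nothing
      "Baseline: does no work, returns nil."
      [])

    (defn add
      "Primitive-hinted addition. jmh-clojure sees the ^long metadata
      and compiles the benchmark without boxing."
      ^long [^long a ^long b]
      (+ a b))

    ;; benchmarks.edn: a bare symbol is normalized to {:fn symbol}
    {:benchmarks [demo.core/nothing]}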
Okay, wait, it's faster than doing nothing: nine million more measured operations per second. So, this has to do with how JMH uses the return values of functions, which are passed to what it imposingly calls black holes. Basically, these ensure that the results, and therefore the function calls themselves, are not optimized away, so it can accurately measure them. Here, the empty function's execution time is being inflated, due to the increased cost of consuming its return value. It's a very small cost, but even with a nil reference it's still noticeable. Most of the time this overhead will be a very tiny fraction of your benchmark, but for artificial tests like these it's more apparent. I have run these tests more than once, and sometimes the empty function is slightly faster, and sometimes, like here, the addition is slightly faster.

Anyway, we can get back on track. We can quote-unquote fix this and construct a benchmark that is always measurably faster. And there is a reason for all this, so just indulge me for a second. Instead of a function that returns a nil object reference, we return a constant primitive. So, simply returning a constant is faster than doing the addition. I've run this a few times, and I get at least 30 million more measured operations per second. So we've learned here that JMH can consume an immediate primitive value, like a long, slightly faster than an object reference, even a nil one.

But we still haven't answered the initial question. To do so, we really need to do nothing. So we need to revive an idiom that most of the time has no place in Clojure: the void method. Clojure's concept of a function that does nothing is one that returns an object that represents nothing; this is a different thing at the bytecode level. We can tell the compiler to generate a void method like this. We see the normal way to specify a benchmark: by giving a map that has the function, along with options and other things that I'll get to later. We use the void flag here. Now we're going to see a dramatic change in our result.

This result is very much not real, however. My machine here cannot actually achieve this many function calls per second. With nothing for JMH to consume, and therefore no visible effect, our function has been completely removed by the JIT. I've taken the time to go through all of this for good reason. If you ever see results like this that seem way too high, most of the time the JIT has determined that your function, or parts of your function, do no visible work and can be optimized away. Like this function here, which never uses the result of an expensive computation; in this case the JIT will likely remove the code. While it's obvious here what's going on, a lot of times it will be less so. For advanced cases like this, you can prevent it from happening by using black holes explicitly; a sketch of this follows below. Keep in mind that you will rarely, if ever, need to do this in practice, at least for normal benchmarks.

So again, I'm sorry for the slight tangent here. What this is trying to show you is that you can't always trust the results you get, even from JMH. Here are some relevant quotes from the docs: to gain reusable insights, you need to follow up on why the numbers are the way they are. Do not assume that the numbers tell you what you want them to tell.
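As an illustration of the explicit black-hole technique just mentioned: JMH's org.openjdk.jmh.infra.Blackhole has a consume method for exactly this, and the talk mentions later that JMH special values like the black hole are available as special keyword arguments. The :jmh/blackhole key and function names here are my assumption of that mechanism, so treat this as a sketch:

    (ns demo.core
      (:import [org.openjdk.jmh.infra Blackhole]))

    (defn expensive [n]
      (reduce + (range n)))

    (defn compute
      "Explicitly consume an intermediate result so the JIT cannot
      prove the work is unused and eliminate it."
      [^Blackhole bh n]
      (.consume bh (expensive n))
      true)

    ;; benchmarks.edn (sketch): :jmh/blackhole asks the harness to
    ;; supply the Blackhole as the first argument
    {:benchmarks [{:fn demo.core/compute, :args [:jmh/blackhole, :state/n]}]
     :states {:n 100000}}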
Okay, with that out of the way, the rest of the talk is going to be more straightforward, which is good. We're going to start by demonstrating in more depth the benchmarking modes provided by JMH. We're going to add another benchmark file and another function to our namespace. Here in the data, we're explicitly setting the mode and the time unit used for our result, and the function is simply sleeping for a hundred milliseconds. When we run, we get around 10 operations, again, function calls, per second, as we would expect. The docs say of this mode that it measures the raw throughput by continuously calling the benchmark function in a time-bound iteration and counting how many times it executed the function.

So what does this mean, exactly? Most modes use this time-bound iteration, which just means that a function will be measured until the iteration's period expires. We get five iterations, each 10 seconds long, and also five warm-up iterations, also 10 seconds long. So in the default configuration, most modes can be expected to run for about a hundred seconds in total: that's 50 seconds of warm-up and 50 seconds of measurement, in that order. And this is per fork, mind you.

The next mode is average time, and this one is easy to explain: it's simply the reciprocal of throughput mode. So we see here that our sleep function takes about one tenth of a second, as expected. Note that to change modes, we didn't actually have to change anything in our file. We can override the options given in the file, which are actually just defaults, by passing options to the runner, as we see right here.

The final two modes are sample and single-shot. We show here that you can actually specify a sequence of modes to run, so we're now going to have two results, one for each mode. Single-shot measures only a single function invocation; the normal iteration time is not used in this mode, so as soon as the benchmark function returns, the iteration is over. I mainly use this for testing before an extended run. It can also be used to measure cold performance, that is, how fast the code is before the JIT has a chance to optimize it. And sample mode randomly samples the function's execution time. It will automatically adjust the sampling frequency depending on how long your function takes to execute. These last two modes, especially sample, are a bit more specialized in their application, and using them effectively can require a bit more experience with JMH. In my experience, most of the time the throughput or average time modes are what you're going to want, especially if you're new to JMH.

And I should mention another convenience feature the plugin provides: table output. So we see here, in sample mode, percentiles and some additional data displayed in an easier-to-read way. You can also always enable the default JMH log via the status option, which I actually use quite frequently.

But there's something missing here. Most of the time, your benchmarks will need input. For this, we use states. We specify states similarly to benchmarks, by giving a function, the difference being that a state's job is just to produce a value for a function that uses it, via a state keyword. To start with, we have a function in our new demo state namespace that produces a vector filled with random numbers. We now add a new benchmark to our core namespace. The sum function expects an input sequence, which will be provided by our state. The benchmark data now features a new key that gives the available states, and the function we're testing uses the args key to give its inputs.
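A minimal sketch of this state wiring, following the :states / :args pattern described above; the demo.state namespace and the vector size are illustrative:

    ;; demo/state.clj
    (ns demo.state)

    (defn rand-vec
      "Produce a vector of n random doubles for benchmarks to consume."
      [n]
      (vec (repeatedly n rand)))

    ;; demo/core.clj
    (defn sum [xs]
      (reduce + xs))

    ;; benchmarks.edn: the :state/vec argument refers to the
    ;; :vec entry under :states
    {:benchmarks [{:fn demo.core/sum, :args [:state/vec]}]
     :states {:vec {:fn demo.state/rand-vec, :args [100]}}}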
Here it's the vec state; the arrow is highlighting that. Now to run. On my machine, I can sum a small vector about one million times a second. Later, we're going to make states a lot more useful when we get to parameters. For example, parameters will allow us to dynamically size the vector. Also, state creation functions are run before benchmarks, and their execution time is not going to affect the overall measurement. This is an important property. There is a caveat to this that is a more advanced topic; you can see more about that in the docs.

Okay, states have a lot of options that can get complicated for advanced use cases, like heavily concurrent benchmarks, but most of the time they're pretty simple. We're going to continue looking at states by measuring the performance of updating a Clojure atom that is uncontended versus heavily contended. This is the kind of thing that makes JMH really shine, in my opinion. So first we have a function that swaps an ever-increasing integer into a provided atom. I'll just mention here that we could actually put this inc helper inside the defn. If you look at the generated bytecode, we see that Clojure actually hoists it out by default, since all the arguments are constant, but we're going to be explicit here by wrapping it with a let.

Now, a new data file. There are two defined states. These are using function expression data instead of var symbols. This is a convenience feature, and it's mainly intended for very simple states like this; they're best used sparingly. We're using Clojure's built-in atom function. The important thing to notice here is that they have different scopes. The first one, named global, uses the benchmark scope. This is the default. It means that only one state value will ever be created during the entire run, essentially. This will make it heavily contended, depending on how many benchmark threads you tell JMH to use. The other state is specified with the thread scope, meaning that each benchmark thread will have a thread-local state value created specifically for it, meaning here that it will be completely uncontended.

So for this run, we're going to tell JMH how many worker threads to use. We're still only forking once, one Java process, but the process will utilize four threads, each running their own iterations invoking our reset-atom function. Keep in mind, JMH synchronizes threads by default, which means that each worker thread waits for the others to be ready before running an iteration. Again, by default, there are five iterations. So to recap: we had two benchmarks of the same function, which means we're going to have two results, and when each benchmark was run, our function was called from multiple threads with an atom whose thread visibility is managed by JMH. And we can see here, for this test on my machine, an uncontended atom is around an order of magnitude faster with a simple update function.

Real quick, think about what JMH is doing for you here, and imagine having to set up a similar benchmark harness manually. There's also a lot more JMH can do for you if you're writing concurrent benchmarks, like worker thread groups and desynchronization.
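Here is a rough sketch of the shape of this contended-versus-uncontended setup. The talk's actual function hoists its update helper with a let; I've simplified that to swap!/inc. The function expression and :scope key follow the state options described above, but the exact names are illustrative:

    ;; demo/core.clj
    (defn reset-atom
      "Swap an ever-increasing integer into the provided atom."
      [a]
      (swap! a inc))

    ;; benchmarks.edn
    {:benchmarks [{:fn demo.core/reset-atom, :args [:state/global]}
                  {:fn demo.core/reset-atom, :args [:state/local]}]
     :states {;; benchmark scope (the default): one atom shared by all threads
              :global {:fn (fn [] (atom 0))}
              ;; thread scope: a fresh atom per worker thread, so uncontended
              :local  {:fn (fn [] (atom 0)), :scope :thread}}}

Running with, say, four worker threads would then just be a matter of passing the thread count option to the runner.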
Okay, I quickly wanted to explain in full the rationale behind using data to represent our benchmarks and states. Some of you may understandably perceive this as unnecessary ceremony or verbosity. Why do this? First, this is what enables the automatic state management shown before. If you're passing the state arguments to your test functions manually, or capturing them in a closure, that's with an S, not a J, this won't work. There are thread semantics, lifecycles, and so on. Letting the framework manage state was a concept foreign to me when I first started using JMH, and now I find it really helpful: not having to create your own makeshift harness when testing a function with varying implementations and sizes of arguments, like the indexing benchmark we saw earlier. And second, more importantly, this is what enables the process isolation feature. For example, how would JMH fork your benchmark code into an isolated process if you were attempting to test a function that only existed in memory, like in a REPL? Keep in mind, forking in the Unix sense of the word, where a process shares a copy-on-write address space with its parent, is not what we're talking about here; that doesn't exist on all platforms. So to be able to utilize forking, your benchmarks and states must be loadable from the file system via the normal Clojure require function, or your function expression data must be able to be evaluated in an empty compilation environment. This is why function expressions are best used sparingly. This caveat is not usually noticed in practice, but it bears mentioning; JMH was meant to be used from ahead-of-time compiled languages like Java. If you're curious about what's going on under the hood, you can see the wiki for more. It's not very interesting; it's basically jumping through a bunch of hoops. Do note that this is all hidden from you, and you don't usually need to know anything about it to use the library effectively. And yeah, the data specification may open up other possibilities as well. I've never used these things, but maybe somebody could find a way to make this more interesting.

Getting back to the examples. For more complex state values, we can use lifecycles. Let me just check the time here. So, if you had a test that required a temporary file during the run, you would write something similar to this. Instead of just giving a state function via the fn keyword, right here, we're instead using the setup and teardown lifecycle keys. In fact, the fn keyword for states is actually just an alias for the setup key. So running this benchmark will create a temporary file, one for each fork, and delete it before the run concludes.

There's quite a bit more to lifecycles that I don't have time to get into right now. Real quick, just know that what we've been specifying previously for states is actually a shortcut. Each state can utilize a setup and teardown function at one of three levels, in order of execution: the trial level, which is what we saw with the temp file, the iteration level, and then finally the invocation level. Also, for advanced use cases, states can use other states via their own args key. All the info on this is in the sample file that's in the jmh-clojure repo. If we have time at the end, I can come back and show a more complex example that uses multiple levels, if anyone wants to see more.
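A minimal sketch of such a temp-file state, assuming hypothetical helpers in a demo.util namespace; the :setup and :teardown keys are the lifecycle keys just described, and I'm assuming the teardown function receives the state value:

    ;; demo/util.clj
    (ns demo.util
      (:import [java.io File]))

    (defn temp-file
      "Create a temporary file for the benchmark run to use."
      ^File []
      (File/createTempFile "bench" ".tmp"))

    (defn delete-file
      "Remove the file when the run concludes."
      [^File f]
      (.delete f))

    ;; benchmarks.edn (the benchmark fn here is a placeholder)
    {:benchmarks [{:fn demo.core/read-file, :args [:state/temp-file]}]
     :states {:temp-file {:setup demo.util/temp-file
                          :teardown demo.util/delete-file}}}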
One of the things that is different about JMH compared to some other libraries is that it explicitly discourages using loops. For a somewhat contrived example, once again, our addition operation from before. Here's what the data file could look like, right here. Now, you might think the overhead of calling the add function would dwarf such a trivial operation, and that maybe we should minimize the overhead by doing the operation some number of times in a loop instead. Here, to attempt to offset the cost, we're doing the addition via a loop, a Clojure loop, an arbitrary number of times. This is the kind of thing I used to do when I was benchmarking with Criterium. Sometimes it seemed like it helped: instead of running a function that I perceived to be essentially too short to measure, I would do it some number of times in a loop. Again, this is a contrived example, but apparently this sort of thing is sometimes even encouraged by some benchmarking libraries, to measure short, small operations like this. There are exceptions, but usually for JMH you don't want to do this. To quote the docs again: you will see there is more magic happening when we allow optimizers to merge the loop iterations. Also, in many cases, the function call will be inlined away. Real quick, I'll explain this. For a tiny function like add, the call site will almost certainly be inlined by the JIT into the method that's generated by jmh-clojure, and this generated method is itself explicitly inlined by JMH. The moral of the story being: most of the time, inlining can make your function call overhead disappear.

Almost done now with the tutorial type of examples. Next, we have params. These are very useful with states. So, we want to test a couple of functions that operate on an integer array. The first uses the built-in Java array hashCode method, and the second uses Clojure's hash function, also known as hasheq. Here in our data file, we can see one of the shortcuts that jmh-clojure provides: the ns keyword allows us to save some typing when testing functions in the same namespace. The state here, called int-array, takes a parameter as its only argument. The parameter is defined right below it. So we're going to measure the performance of arrays of differing lengths.

Now, the results with size 100. Why is the hash code so much slower? Well, in one second. Importantly, params, like options, are defaults; they can be overridden. This is what makes them useful. We can change the size of the arrays dynamically, and we can also specify multiple values, as we do here: a size of 1,000 and a size of, what is that, one million. So here we can infer, based on the throughput alone, that Clojure's hash does not actually look at the array elements. This is not surprising, because it exists mainly to be used with Clojure's built-in data types. Notice we have four results, one for each combination of parameter and function.
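Roughly, this params setup might look like the following. I'm writing fully qualified symbols here instead of the ns shortcut the talk mentions, and all namespaces and names are illustrative; the :param/ reference pattern follows the library's data format:

    ;; demo/core.clj
    (defn array-hash
      "Java's Arrays.hashCode: reads every element."
      [^ints a]
      (java.util.Arrays/hashCode a))

    (defn clojure-hash
      "Clojure's hash (hasheq) applied to the array object."
      [a]
      (hash a))

    ;; demo/state.clj
    (defn rand-ints
      "An int array of n random values."
      [n]
      (int-array (repeatedly n #(rand-int 1000))))

    ;; benchmarks.edn
    {:benchmarks [{:fn demo.core/array-hash,   :args [:state/ints]}
                  {:fn demo.core/clojure-hash, :args [:state/ints]}]
     :states {:ints {:fn demo.state/rand-ints, :args [:param/size]}}
     :params {:size 100}}

Since params are just defaults, overriding them at run time, for example with {:params {:size [1000 1000000]}} passed to the runner, is what produces the one-result-per-combination matrix described above.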
I'm not sure exactly how much time I have left. I think I probably have maybe five to 10 minutes, but I'll continue. Feel free to interrupt me. (Sorry, you have about 20 minutes more.) Okay, I wasn't sure exactly when this started; then I'll go a bit slower. Thank you.

For some final examples, we're going to look at profiling support. Let's see what JMH provides out of the box on my Linux machine. We can selectively enable one or more of these profilers by name for our run. To do this, we use the profilers option, right there. We'll try the GC profiler with one of our previous tests, the sum benchmark we saw before. The secondary results related to GC are indented below the primary one when using the table format here. These are sometimes useful, but personally I rarely use them, since JMH also supports external profilers. For those, I prefer to use a custom one that enables Java Flight Recorder for the run. If you're not familiar with JFR, it's one of the most accurate profilers that I know of for the JVM, and a few years back it was merged into OpenJDK, after previously being Oracle-branded JVMs only. I made a demo version that will be included in the talk repository. It's about 60 lines of AOT-compiled Clojure. We can enable custom profilers by providing a package-prefixed class name. Note that we do need to actually have a class file on disk for these, which is why we're compiling it first; they need to be available on the forked process's classpath. So here's a snapshot I took with the sum benchmark. I was going to show this in Java Mission Control, and I can do that at the end, but I'm not sure I want to try switching screens right now. Java Mission Control, if you don't know, is another very useful tool, now under the OpenJDK umbrella, for viewing profile data. And by the way, this is the easiest way that I know of to run JFR for Clojure code. The alternative is doing the setup manually: starting a new JVM process with the appropriate arguments and then explicitly running the code you want to profile. So it's pretty neat.

Okay, now I'm going to look at some things that I didn't have time to cover more completely, although since I have more time than I thought I did, I'm going to spend maybe a bit of extra time on them. The first one is selectors. These are for running only a subset of the benchmarks. They're similar to Leiningen test selectors, if you've used those. So here we have two selectors, one called non-void and one called sum. I'll explain the sum one here. This is function expression data again, and here it's composing three functions. It's going to look at the map, which in this case is a normalized benchmark map. So, for example, if you just specified a symbol for your benchmark, it's going to be normalized to a map with the symbol under the fn keyword. It's taking the name of the fn and then checking if that name is sum, and if so, it's going to select that benchmark. And again, these are provided to the runner by the keyword they're defined under. So we can see here, we pass this keyword and it will select all benchmarks that match the selector. I use these all the time.

Next, option sets are for defining named aliases for groups of common options. You can also use the jmh default alias to give defaults for all your benchmarks, which is useful. You see our two benchmarks here. You pass one or more keywords denoting option sets, or additional option maps, as a sequence. So this one here, our first function, is using the fast option set, which is defined below it. And the second function here is using the stress option set, along with providing new options. These are merged from left to right, and again, the defaults are merged first, so options specified later are going to override them. This is a lot better than annotations; if you've ever used JMH from Java, trying to avoid boilerplate is nearly impossible. You can also change the JVM arguments for forked processes; like here, we're turning on Clojure's direct linking flag for faster var performance. And there are a few more things regarding function specifications: all JMH special values, like the black hole, are supported as special keyword arguments, and you can also apply a function to a variable sequence of arguments using the apply flag, like we do here.
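Pulling the selector and option-set ideas together, a sketch of the relevant entries might look like this. The sum selector is the three-function composition described above; the specific option keys (:fork, :threads, the JVM argument key) are illustrative guesses, so check the library's sample file for the real names:

    ;; benchmarks.edn (sketch)
    {:benchmarks [{:fn demo.core/sum,  :args [:state/vec], :options [:fast]}
                  {:fn demo.core/sort, :args [:state/vec], :options [:stress {:fork 5}]}]

     ;; a selector is a predicate over the normalized benchmark map:
     ;; take the :fn symbol, take its name, keep it if that name is "sum"
     :selectors {:sum (comp #{"sum"} name :fn)}

     ;; named option sets, merged left to right on top of :jmh/default
     :options {:jmh/default {:mode :throughput}
               :fast   {:fork 1}
               :stress {:fork 3, :threads 8,
                        ;; key name illustrative: direct linking in the forked JVMs
                        :jvm-args ["-Dclojure.compiler.direct-linking=true"]}}}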
Finally, external benchmarks and states are also supported: classes compiled via gen-class, or compiled from another JVM language entirely. You just have to use package-prefixed symbols. For benchmarks, you just need to provide your classes to the externs runner option, right here, and they'll be run alongside any Clojure tests. This can be useful to compare JVM language performance, which I've actually done multiple times. And external states can be given inline. There are a very small number of advanced state techniques that require this method.

Since I have some extra time, I'm going to go back and show the more complex state example. So, I was saying before that state lifecycles run in a specific order: the trial, then iteration, then invocation levels. This temp-file state here runs at the trial level by default, since we're not specifying what level we want, which means it's going to run essentially once for the entire benchmark run, and the teardown is going to run at the end of the entire benchmark.

For a more complex example, we see here we're going to have kind of a mock service. This is all empty, because I'm just showing an example here. We have a protocol for disposing of a type of resource, we have a protocol we're going to call resource, and we have a service. The service implements both the dispose protocol and the resource protocol, and the resource it yields is itself another disposable resource. Again, this is kind of unnecessary; I'm just trying to show an example, and these are all empty.

So here, the benchmark itself doesn't matter, but we have one state. The trial level is going to be calling our state payload function; we see that here. This is just creating a map, and we're putting whatever we want in this map. This is going to be available through the entire run. So we're adding our service from before. At the end of the entire benchmark run, we're going to be disposing of the service, which we see below: this is just taking the service key out of the map and disposing of it. And in this case, the return value doesn't matter, because we're at the end of the lifecycle, essentially.

The invocation level runs before and after each individual benchmark function call. So in this case, let's imagine we need each benchmark to be provided with a transient resource that is yielded from the service on each call. So again, we're using this resource setup fixture function, and we're taking out the service that we created here, and we're adding a resource to the payload. And similarly, when the function is done running, we're disposing of the resource and removing it from the aggregate payload. This is maybe slightly more complicated than I should be showing here, but it shows you what this would look like if you were benchmarking a web service, that type of thing. So again, okay.
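To give the shape of this multi-level state, here's a heavily simplified sketch. All of the names (IDispose, IResource, the payload helpers) are my own placeholders for the empty mock shown in the talk, not its actual code; the sample file in the repo shows how setup and teardown functions like these attach to the trial and invocation levels:

    (ns demo.service)

    (defprotocol IDispose
      (dispose [this] "Release the underlying resource."))

    (defprotocol IResource
      (acquire [this] "Yield a transient, disposable resource."))

    (defn make-service []
      (reify
        IResource (acquire [_] (reify IDispose (dispose [_])))
        IDispose  (dispose [_])))

    ;; trial level: runs once for the entire benchmark run
    (defn setup-payload []
      {:service (make-service)})

    (defn teardown-payload [payload]
      (dispose (:service payload)))

    ;; invocation level: runs around every benchmark call
    (defn acquire-resource [payload]
      (assoc payload :resource (acquire (:service payload))))

    (defn dispose-resource [payload]
      (dispose (:resource payload))
      (dissoc payload :resource))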
So now I hope you understand a bit more about JMH and how to use it from Clojure in, hopefully, an idiomatic way. Just a few final thoughts. You know about some of the benefits of JMH; what about the downsides? JMH is not simple, and there is a lot to learn. This is the main drawback. It's not simple because the JVM is not simple. There are lots of options, and using it effectively requires more effort than you might be used to. Also, JMH is quite a bit more heavyweight than some alternatives, but that's because it's also handling a lot more for you.

Benchmarking on the JVM is tough, in large part due to its inherent non-determinism. That's just the truth. While it's great how the concepts of Clojure fit together so beautifully and simply, what Clojure is running on, these modern VMs like HotSpot, is complex. When you choose to benchmark in one of these advanced environments, you're suddenly exposed to a lot of very intricate stuff. I've recently done some work with OCaml, for example, and it's literally night and day how much more straightforward it is to understand what's going on under the hood there. But again, there are always trade-offs to any technical decision, whether language or platform. The dynamic nature of the JVM that makes this stuff a challenge is also what makes Clojure such a joy to use, as it compiles code on the fly in a REPL while you're patching a running system.

Regardless of the challenges, I think writing fast, well-optimized code is an important and attainable goal. In a similar way to how ClojureScript leverages Google Closure to handle some of its important underlying machinery, we may want to consider a widely used, battle-hardened tool created by OpenJDK JVM experts. JMH goes to greater lengths than any other tool that I know of on the JVM to ensure accurate micro-benchmarking. However, as mentioned at the start, this is not meant to be a knock against any other library. Depending on what you're doing and the features you need, as long as you're aware of some of the pitfalls described here, you can still be effective using alternatives. Criterium in particular is a great choice, and it is very well written. I would recommend reading its source if you're interested in JVMs or statistics; it's much more digestible than the sprawling Java codebase of JMH. Additionally, even if you decide to adopt JMH because you need more strenuous benchmarks or some of its advanced capabilities, it's not a panacea. As shown at the beginning with the empty function example, it's still possible to get it wrong with this library. And as always, your hardware, operating system, JVM vendor, version, configuration, and many other factors will affect your benchmarks. Your results may vary.

So if you want to know more, to begin, please see the sample file provided in the repo, along with the wiki there, and check out the javadocs and sources of the JMH samples project. Some of my examples here were inspired by or adapted from the samples there. Also, I recommend these blogs if you want to learn more about performance on the JVM, among many other good ones. So finally, I'd like to thank the authors and contributors of both Criterium and JMH for their work. And once again, thank you to Bruno for inviting me and for organizing this event. I don't know how much time I have left, but if anyone has any questions, I'd be happy to take them. Otherwise, thank you for listening.