Hello and a warm welcome to everyone joining this session. We have Shakthi Kannan with us today to share his experience in the talk "Fast and Curious: Benchmarking Multicore OCaml". So without any further delay, over to you, Shakthi.

Welcome everyone. This session is on benchmarking Multicore OCaml and our experiences with the same. My name is Shakthi Kannan. When I submitted this presentation to Functional Conf, I was part of Segfault Systems, based out of IIT Madras. This is a continuation of the work done by my colleague Tom Kelly from OCaml Labs at the University of Cambridge in the UK, and since then OCaml Labs and Segfault Systems have been merged into Tarides, a company based in Paris, France. This is my email address and my social media handle.

For the outline of this talk: I am going to talk about Sandmark, the OCaml benchmarking suite that we have; then the challenges we have been working through for the last two years; some results to share with you; some solutions that we have found useful; and some key takeaways from the session.

So what is Sandmark? Sandmark is the OCaml benchmarking suite, available as free/libre and open source software on GitHub. It has both sequential as well as parallel benchmarks. If you look at the top-level directory of Sandmark, we have a number of folders there: the benchmarks folder, which contains the OCaml benchmark code; the dependencies required to build these benchmarks, which are again OCaml packages; the notebooks folder, which contains the Jupyter notebooks used to analyse the results from the benchmark runs; and the ocaml-versions folder, which has the OCaml variants, the branches that we want to build and test with. There are also some configuration JSON files and some shell scripts. The source URL for the Sandmark repository is github.com/ocaml-bench/sandmark.

What we basically do here is take the OCaml compiler, the stock compiler at github.com/ocaml/ocaml, build it, and then use the built compiler to compile these benchmarks, run them, and measure some metrics. That's pretty straightforward. So what could possibly go wrong with this?

What are the challenges that we currently face? Until January 2022, the mainline or trunk OCaml had mostly sequential execution, and the parallel multicore version was being implemented. In January this year, the parallelism and concurrency implementation was merged into stock OCaml. That was a big milestone after a lot of hard work over the years; it was a very big PR, and we expect an initial release of OCaml 5.0 sometime in June this year. But in moving from running sequential benchmarks to parallel benchmarks, a lot has changed. The language has evolved, and different variants of the OCaml compiler have been developed. For example, here you see 4.12.0+stock, which represents core OCaml development; then 4.12.0+domains, which represents the implementation of parallelism using domains in OCaml; and there is an effects variant, which some of the academics like to use. So the language is evolving, and it becomes a moving target for us to benchmark. That's a challenge in itself.
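To give a rough idea of what the domains variants (and OCaml 5) enable, here is a minimal sketch of a parallel OCaml program using Domain.spawn and Domain.join; it is only an illustration, not one of the Sandmark benchmarks.

```ocaml
(* Minimal sketch of a parallel OCaml program using domains, the
   primitive provided by the +domains variants and OCaml 5.
   Illustrative only; not a Sandmark benchmark. *)

let fib n =
  let rec go n = if n < 2 then n else go (n - 1) + go (n - 2) in
  go n

let () =
  (* Spawn two domains that can run in parallel on separate cores. *)
  let d1 = Domain.spawn (fun () -> fib 40) in
  let d2 = Domain.spawn (fun () -> fib 40) in
  (* Join them to collect the results. *)
  let r1 = Domain.join d1 and r2 = Domain.join d2 in
  Printf.printf "results: %d and %d\n" r1 r2
```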
Dune is the tool that we use to build OCaml packages and projects, similar to Cabal in Haskell. When the language changes, sometimes even the Dune build tool fails to compile. There are ways to mitigate that: we use a previous version of Dune to build and run the benchmarks. That is one workaround we currently have, but it is something we have to keep in mind. And each of these benchmarks has a large number of dependency packages of its own. We used to have about 50 to 60 packages, and every time we want to update to a new OCaml variant, we also have to make sure that these dependency packages build along with it. That is a challenge in itself.

Then there is the classification of benchmarks by runtime: some run in less than one second, some take anywhere from one second to 100 seconds. Some benchmarks you want to run in the CI (continuous integration) pipeline. Some benchmarks, like the Irmin data store tests (Irmin is a Git-based store written in OCaml), are like database tests that we want to run continuously for two or three days; those are the longer tests. So how do you classify the benchmark runs based on runtime? Those are things that are very important when we actually do benchmarking for OCaml.

The paper "Retrofitting Parallelism onto OCaml" is what I mentioned initially, where the OCaml team has merged the parallelism and concurrency implementation with stock OCaml. The paper is available online, and these are the speed-up graphs for some of the benchmarks; they look very similar to the Universal Scalability Law graphs mentioned in an earlier talk today. So right now, if you take stock OCaml, you have parallelism and concurrency implemented, and you can actually write programs using them.

We use these Sandmark benchmarks to review some of the OCaml PRs and changes that happen in the compiler itself. This is an example of a bytecode regression that we caught recently. Bytecode is basically the OCaml interpreter, the REPL that you use, where you can key in OCaml code and see the output immediately. There was this caml_ensure_stack_capacity function that is called very frequently, and with the Sandmark benchmarks we were able to detect the slowdown in performance, and it got fixed as well. So it is quite useful when you want to track down compiler PR changes for performance regressions.

Another example is runtime tracing. This is an ongoing PR where we want to add instrumentation and tracing to the runtime and ensure there is not much regression from it. The main idea is that if you take a sequential program and also run it on the parallel runtime, then of course with multiple CPU cores you should get scalability, but there should not be much difference between the sequential run and the run on the parallel runtime. So we want to reduce the deviation between the two runs. Here we didn't see much deviation, and the PR is currently being reviewed.

So how do we do this? We have a lot of benchmarks, we have different runtime variants that we want to select from, and we have different configuration files for different hardware. We use jq, and we use a tag system where you can pick the benchmarks that you want to run based on the tags for a specific configuration. Those are the three dimensions by which we can select the benchmarks to execute on a specific machine.
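To make the tag idea concrete, here is a small OCaml sketch of selecting benchmarks by tag. The real selection in Sandmark is done with jq over the JSON configuration files; the record fields, tag names and benchmark names below are only illustrative.

```ocaml
(* Illustrative sketch of tag-based benchmark selection.
   Not Sandmark's actual implementation, which uses jq over JSON
   configuration files; names here are hypothetical. *)

type benchmark = {
  name : string;
  tags : string list;   (* e.g. a classification by expected runtime *)
}

let benchmarks = [
  { name = "binarytrees";  tags = [ "macro_bench" ] };
  { name = "irmin_replay"; tags = [ "longer_run" ] };
  { name = "fib_tiny";     tags = [ "lt_1s" ] };
]

(* Pick only the benchmarks whose tags include the one requested
   for this machine / configuration. *)
let select ~tag bs =
  List.filter (fun b -> List.mem tag b.tags) bs

let () =
  select ~tag:"macro_bench" benchmarks
  |> List.iter (fun b -> print_endline b.name)
```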
I'm not going to go into the details of these in the interest of time, but I'm happy to follow up in a hangout session if you want to learn more about them.

From the configuration perspective, we work closely with the compiler developers. What is it that they primarily need? We need a way to specify the development branch they want to track. There might be some configure options they want to use when they build their compiler variant. When executing the benchmarks, they might want to pass specific options, and maybe there are some environment settings they want to use. So we want to provide all of these options to the compiler developers when they run these benchmarks, and we have support for all of them. Here is one example where one of the developers wanted to see the impact of changing the minor heap size, and they were able to do that by changing the environment parameters.

Of course, you can measure things using Linux performance tools like perf and so on, but I will come to the metrics section shortly. We are not targeting every possible metric; the main idea is to see what the compiler developers actually need. GC statistics are something they are very interested in. This is one example where we implemented a code size feature for the Flambda variant. Flambda performs a lot of optimisation passes, and they wanted to see the number of caml symbols in the generated code, and for some of the benchmarks we have those counts here. This is a feature we added as part of the metrics: the output of a benchmark run comes out as a JSON dump, which we can use for analysis. The key point is that you can have a thousand metrics, but what we really care about is which metrics are actually relevant to the compiler developers, and that is what we need to focus on.

We have quite a few machines that we use, and there are some important configuration settings that you should be aware of. This work builds on the initial work of Tom Kelly, presented at ICFP 2019 (the International Conference on Functional Programming), where he describes some initial benchmarking experiments; there is a link to the detailed notes in the reference section, but I'll briefly mention the settings here. One thing we did was disable hyper-threading: we didn't want any crossover or resource sharing on the CPU. Turbo Boost was also disabled, because we didn't want any frequency changes from external factors. Next, CPU isolation is done using isolcpus, which we pass in the kernel boot configuration. Here is an example on the navajo server, where we configure the cores for a benchmark run that uses 64 cores. We leave a few CPUs for the OS itself, so that all the interrupt handling for the OS goes to those CPUs and does not affect the CPUs that are actually running the benchmarks. ASLR is also disabled, because we want repeatable experiments. And the power-saving states are disabled as well; we don't want the machine saving power rather than actually running the benchmarks.

These are the two system configurations on which we currently run these benchmarks on a nightly basis: one is the turing machine, which is a 28-core machine, and the other is navajo, which is an AMD machine with 128 cores. NUMA is non-uniform memory access.
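As a small illustration of the GC statistics and the minor heap size experiment mentioned above, here is plain OCaml using the standard Gc module. This is not Sandmark's own instrumentation, which emits these metrics as a JSON dump per run; the workload and the heap size value are made up.

```ocaml
(* Small illustration of the kind of GC statistics compiler developers
   look at, using OCaml's standard Gc module. *)

let () =
  (* Optionally grow the minor heap (in words), similar in effect to
     setting OCAMLRUNPARAM=s=... in the environment for a run. *)
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = 8 * 1024 * 1024 };

  (* The workload being measured would run here. *)
  let _ = Array.init 1_000_000 (fun i -> string_of_int i) in

  let s = Gc.stat () in
  Printf.printf "minor_collections: %d\n" s.Gc.minor_collections;
  Printf.printf "major_collections: %d\n" s.Gc.major_collections;
  Printf.printf "promoted_words:    %.0f\n" s.Gc.promoted_words;
  Printf.printf "top_heap_words:    %d\n" s.Gc.top_heap_words
```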
A lot of hardware these days comes with this NUMA configuration, which we have yet to experiment with; it is something to consider.

We do have benchmarking as a service for the compiler developers, where they can select the variants they want to compare across these two machines, and then the timings, counts and so on are shown in the web interface. The URL is sandmark.ocamllabs.io, so you can open that website and see the nightly run results for these branches.

The Sandmark nightly configuration basically provides three types of entries for the compiler developers. One is that they can specify a branch. Another option is to specify a branch plus a commit, if you want to track all the changes on a specific branch from that commit onwards. And we have an expiry field: let's say you want to work on a particular feature for two weeks and you only want to see results for those two weeks; then you can add an expiry entry, and the nightly runs will stop running the benchmarks after that date. You can also specify a specific pull request, so if you have something you want to analyse and you want to see whether any regressions are happening for your pull request, you can explicitly specify it by its URL. The current Sandmark nightly config is there on GitHub: developers create a PR against it, the machines pick it up in the night, and the benchmarks are built and run for them.

We also experimented with Docker. current-bench is an OCurrent pipeline which allows us to create custom builds, and we found the results between Docker and native builds to be very close; we didn't find much difference. Of course, with all the tuning settings I mentioned earlier, you are able to run Sandmark inside Docker with the current-bench pipeline. It's something we plan to expand on in the future as well.

So, the key takeaways from this talk. When you're doing benchmarking, try to keep the package dependencies as minimal as possible. If you have patches that you build with specific OCaml compiler variants, it's good to push those changes upstream so that the respective package maintainers can manage them for you. The nightly runs are very useful, as we have seen: we were able to detect a few regressions in the PRs that we tracked. It's important to work closely with the compiler developers to see what metrics they really need and focus only on those, rather than producing a large plethora of metrics with little relevance; it has to be very focused on what they are trying to measure and analyse. Build failures will happen a lot; the language is evolving, and that's something you need to embrace. The UI we currently have for the Sandmark nightly config is very minimal, with only a few mandatory entries, even though the tooling itself has a lot of options available; when somebody is just starting to use it, it's good to have a very minimal configuration to begin with. And always tune the hardware before you measure, especially for the parallel benchmark runs. For reporting and analysis, we initially started with Jupyter notebooks, which are still there today in the repository. We find them very useful when you want to create new graphs or test something out, and then, once you're satisfied with the analysis, you can build on it.
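To make the three kinds of nightly-config entries and the expiry behaviour concrete, here is a hypothetical OCaml sketch. The actual configuration is a file in the Sandmark nightly config repository on GitHub; the constructor and field names below are made up purely for illustration.

```ocaml
(* Hypothetical model of the nightly configuration entries described
   above: a branch, a branch tracked from a commit, or a pull request,
   optionally with an expiry date after which nightly runs stop. *)

type target =
  | Branch of string                        (* track the tip of a branch *)
  | Branch_from_commit of string * string   (* branch name, starting commit *)
  | Pull_request of string                  (* URL of the PR to track *)

type entry = {
  target : target;
  expiry : float option;   (* time after which nightly runs stop *)
}

(* Decide whether an entry should still be picked up by tonight's run. *)
let active ~now { expiry; _ } =
  match expiry with
  | None -> true
  | Some deadline -> now < deadline

let () =
  let now = 0. in                           (* stand-in for the current time *)
  let two_weeks = 14. *. 86400. in
  let entries = [
    { target = Branch "trunk"; expiry = None };
    { target = Branch_from_commit ("my-feature", "abc123"); expiry = Some (now +. two_weeks) };
    { target = Pull_request "https://github.com/ocaml/ocaml/pull/0000"; expiry = None };
  ] in
  entries
  |> List.filter (active ~now)
  |> List.iter (fun _ -> print_endline "schedule a nightly run for this entry")
```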
There are a lot of references here: Sandmark, the nightly repository and the nightly config. We publish monthly reports on the Multicore OCaml work, which are available online, along with the paper references and, of course, current-bench, which uses Docker for its pipeline. We are hiring as well, so if you are interested in working with OCaml, we have quite a number of positions; feel free to reach out to us. And that ends my talk. Thank you.

Thanks a lot, Shakthi, for the wonderful talk and for sharing your experiences with the various metrics involved in building the benchmarking framework for OCaml.