Thank you. Thank you very much. Thank you for having me in this room and for coming here. I will talk about a library that we created at work to assess the performance of Elasticsearch queries. It's called Escobar, which is wordplay on the Spanish and Portuguese word for broom, "escoba". But without further ado, I wanted to say that this is my first time speaking in Boston, so any feedback you have on this talk would be very appreciated. I'm a free software and functional programming enthusiast, and I'm a maintainer of some projects on GitHub and also in the real world. Hippul is a non-profit that we have in the northwest of Spain, where we foster free software and such, and Dracula is a project for protecting privacy and educating about it. And I'm working for OpenShine, who sponsored this work. We're a small consultancy firm based out of Madrid, and we have some remote opportunities available, so I encourage you to check it out.

So what are we here to talk about? What's the problem with assessing Elasticsearch queries? I'm talking here about aggregations on a dataset that you have in Elasticsearch. When you have aggregations that are nested and somewhat complicated, it can be difficult to assess how well the cluster will behave with those queries, especially if you do not trust the user who inputs them. We were building an application where we had to allow users to interface with parts of the Elasticsearch query engine, and we were not sure whether the cluster would hold up to the kinds of queries they would want to perform on the dataset.

So what ways do we currently have to assess query impact on an Elasticsearch cluster? Well, for example, there are the triggers that Elasticsearch provides, the circuit breakers: a cutoff mechanism so that if something takes too many resources, the circuit breaks.
There's the Kibana profiler, although when we started looking at this it was not yet available on 5.1. And we can do load testing of the queries and make sure there's some stable way of testing all of that. But what about queries that are dynamic, where you can't know ahead of time what the queries will be? By the time they are run, it's already too late. Could we know before we run the queries? That is what this project is about, because the cluster was also shared by multiple tenants, and we didn't want one user's queries to impact the performance of the others in the cluster. We wanted to be quite conservative about resource usage, because it was a low-overhead system: we did not want to spend too much money on building more clusters, and that was good for the business purpose it served.

So we wanted to go back to the theoretical side of things and turn to static analysis. Static analysis is the discipline of answering questions about code without having to run the code. What can it answer? It can answer whether some code terminates; this does not apply to our queries, because they are not Turing-complete. But it can also answer questions such as how much memory a program needs, what the outputs (or kinds of outputs) are for a given input, whether some variable is initialized or not, whether some code is ever reached at all, or whether the code is well-typed.

But we can do more than just analyze: we can modify things based on static analysis. For example, if you are familiar with GCC, there are the -O flags (-O1, -O2, -O3) that change how the compiler optimizes: reusing processor registers one way or another, or folding constant mathematical expressions inline before the code gets compiled. There's also type inference in other languages.
And you could do query optimization — maybe not in Elasticsearch, but you can do that in SQL, for example. Or code rearrangement: fusing loops or changing loop order. But you can also do cost analysis and cost optimization, and this is done in GCC, for example.

About static cost analysis, there are papers on it. There's one from the Complutense University of Madrid about how to run cost analysis on Java bytecode. But in GCC we already have flags for this. For example, we have -march and -mtune, which target your code for a given architecture. -march restricts the set of instructions you can use to the ones specific to a given processor. -mtune says: from the instructions that are common to all our target architectures, let's choose the specific ones that perform better on a certain processor.

Graphically: if we take the x86-64 architecture, we have these two processors as an example, which can each perform some different instructions but also share a common core set. If we set -march specifically for the Intel processor, we could use those extra instructions that are not part of the x86-64 specification but are extensions to it that run on that processor. If instead we targeted the mainstream x86-64 platform, we could still -mtune for a specific processor, picking, among the instructions that can do the same thing, those that perform better on it.

So, in a way, GCC as a compiler is doing cost analysis and cost optimization for us. However, Elasticsearch queries are not exactly compiled; at some point they are transformed into a query to Lucene, the underlying search engine.
However, Elastic does not document how the internals work at such a deep level, nor does it provide an API to access the internals that run before execution. So what we ended up doing was taking the parse tree that Elasticsearch generates from the query, analyzing it, and traversing it. We could perhaps also have optimized it and run better queries, but we're not at that point yet; this is more of a prototype of what we could do in the future.

So our cost tree is generated from the parsed query. In our case this is efficient because a tree is a recursive data structure, and here we only use the child nodes to compute the parent, so we only need to traverse each node once. Our cost analysis therefore only runs over the aggregation nodes of the query. Of course, more sophisticated analyses are possible, and they would depend on the kind of grammar structure you use for the cost analysis — the kinds of relationships you have between the nodes. In our case we only wanted the children, but we could do better if we had more information about Elasticsearch's cost model: what caching happens in the middle, what network transmissions happen when the data is sharded between the different nodes of the cluster, and how the reduction steps work when the aggregations are computed. None of that is directly exposed in the query; it's something Elasticsearch does after the query is input, on its way to the search engine of each instance, and it is not available through the APIs as far as we have found. Maybe there are some undocumented techniques, but we didn't find any documentation to get to them yet.
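The single-pass, children-first traversal just described can be sketched in Scala like this (the node shape and the cost numbers are illustrative, not Escobar's actual API):

```scala
// A minimal aggregation tree in the classic composite style:
// each node knows its aggregation type and its sub-aggregations.
case class AggNode(kind: String, children: List[AggNode] = Nil)

// Illustrative per-aggregation base costs; unknown types fall back to a default.
val baseCost: Map[String, Double] =
  Map("terms" -> 10.0, "date_histogram" -> 5.0).withDefaultValue(1.0)

// Each node is visited exactly once: a parent's cost is computed only
// from its own base cost and its children's already-computed costs.
def cost(node: AggNode): Double =
  baseCost(node.kind) + node.children.map(cost).sum
```

With those example costs, `cost(AggNode("date_histogram", List(AggNode("terms"))))` yields 15.0: the traversal touches each of the two nodes once.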
So if we wanted a better cost analysis of this kind, we would need some way to expose a better cost model, so that we could take the optimizations Elasticsearch performs and compute them into our model as well. This is not implemented yet, and we don't know how to do it yet. But it could work somewhat like the Spark Catalyst optimizer does with Spark SQL, where from one query you derive other sets of queries.

But how is this actually implemented? How did we come up with it? We used Scala, because it's easier to do functional programming with it, despite Java's facilities for that in recent versions. When we needed to parse JSON we used json4s-native, because we did not want conflicts with any of the libraries Elasticsearch uses, so that our code was mostly independent from Elasticsearch on the JVM classpath. And we used configuration libraries that already exist for Scala, so that we can take a case class and load the information from a YAML or HOCON file directly into it. There's also a use case for Akka that we will get back to in a moment.

Why Scala? I've just mentioned that. The AST nodes are pretty basic: the interfaces are just those of a tree. There's the root aggregation, which aggregates all the nodes, and then the sub-aggregations — this is a classic composite pattern, where we have multiple sub-aggregations down to the leaf nodes. Each of the sub-aggregations holds a reference to the node that Elasticsearch parsed from the query, from which we get its parameters. However, as I said before, Elasticsearch does not directly provide this facility to plugins. We needed to work around the visibility of some of the fields, which are private and have no getters.
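Reading such a private field from plugin code looks roughly like this (a minimal sketch: the helper name is hypothetical, and inside an Elasticsearch plugin the security policy must additionally grant the needed reflection permissions):

```scala
import java.security.{AccessController, PrivilegedAction}

// Read a private field that has no getter, via reflection.
// Under a Java security manager this must run in a privileged block,
// and setAccessible only succeeds if the policy grants the
// ReflectPermission "suppressAccessChecks".
def readPrivateField[T](target: AnyRef, fieldName: String): T =
  AccessController.doPrivileged(new PrivilegedAction[T] {
    def run(): T = {
      val field = target.getClass.getDeclaredField(fieldName)
      field.setAccessible(true)
      field.get(target).asInstanceOf[T]
    }
  })
```

This is exactly the kind of access a plugin normally cannot perform, which is why the next part of the talk is about the security context it has to run in.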
In this case the problem was getting the complete parse tree, because on the kind of nodes Elasticsearch provides there is no sub-aggregations method to get at the nested trees. So we had to implement something that runs in a higher security context, because by default Elasticsearch runs with a restricted set of operations, so that you cannot, for example, use reflection to get to the core — that way plugins can't hijack your cluster. But we needed to "hijack" the cluster to get to these fields, and for that we need to run under a very high permission level, so you have to either trust the code or read it, because it's open source.

In the end, we analyze certain kinds of aggregations for which we came up with a specific cost model, and in the cases where we don't have a good cost model yet, or it's not implemented, we fall back to a default cost for each type of aggregation. Then we apply some mathematical aggregation over the children to compute the parent.

For configuring it, we use the same kind of configuration wherever you deploy this — I will talk about that in a moment. There's a default configuration for every node that is not specifically configured, which can carry some default properties, and then for each type of aggregation that Elasticsearch has you can provide a specific configuration. This gives you a cost model per aggregation type, and even lets you restrict what sub-aggregations may appear underneath. So, for example, you can say that nothing but a terms sub-aggregation may appear under a date histogram, so that users can't run anything more complex than what you want to allow.
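A sketch of what such a configuration and check might look like (the case-class names and the rule shown are illustrative, not Escobar's actual configuration schema):

```scala
// Per-aggregation-type settings: a base cost plus an optional whitelist
// of the aggregation types allowed directly underneath it.
case class AggTypeConfig(cost: Double, allowedSubAggs: Option[Set[String]] = None)

// A default for unconfigured types, plus specific per-type overrides.
case class CostConfig(default: AggTypeConfig, perType: Map[String, AggTypeConfig]) {
  def forType(kind: String): AggTypeConfig = perType.getOrElse(kind, default)
}

case class Node(kind: String, children: List[Node] = Nil)

// Reject a tree as soon as any node has a child type its whitelist forbids.
def allowed(node: Node, config: CostConfig): Boolean = {
  val whitelist = config.forType(node.kind).allowedSubAggs
  val childrenOk = whitelist.forall(w => node.children.forall(c => w(c.kind)))
  childrenOk && node.children.forall(allowed(_, config))
}
```

With a configuration that whitelists only `terms` under `date_histogram`, a `terms` sub-aggregation passes the check while any other sub-aggregation type is rejected.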
We built this as a library, so you can plug it in wherever you want in your projects, but there are sub-projects with examples of how you would deploy it in different ways. For example, you can deploy it as an Elasticsearch plugin hosted in the Elasticsearch cluster; in that case it runs against the Elasticsearch libraries of the exact version the cluster is running, so you get perfect compatibility between the parse tree we are generating and the configuration. But you can also run it independently as a microservice, and in that case we're using Akka. You can use it as a proxy in front of Elasticsearch, so that all queries go through the microservice and it forwards them to the Elasticsearch backend, or you can just have it run the cost analysis on them.

For the plugin, we created the sbt (the Scala build tool) structure for building Elasticsearch plugins, based on the Gradle code that Elasticsearch provides for that, because all our code was based on sbt and we wanted to keep it that way. So if you are using Scala for any of your projects and you want to build an Elasticsearch plugin, you can use this as an example of how to start creating a plugin from sbt.

There's also the option of running this as just a web frontend based on Akka HTTP; that sub-project shows how you would run it on your own, including Escobar as a library, and how you would interact with its API from an HTTP service. And of course this can be deployed on Kubernetes, because that's also how we're deploying it internally, which is a nice feature. For that we already provide a sample Helm chart that exposes the configuration through the values.yaml file, so you can configure everything in one single place.
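As a rough idea of the sbt side, the plugin sub-project boils down to compiling against the cluster's exact Elasticsearch version and packaging the jar the way the official Gradle build does (a hedged sketch; the names and versions here are illustrative, not the project's actual build definition):

```scala
// build.sbt (sketch) — compile against the exact Elasticsearch version
// the cluster runs, marked Provided because the cluster supplies those
// jars at runtime.
name := "escobar-plugin"
scalaVersion := "2.12.8"
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "6.2.4" % Provided

// An Elasticsearch plugin is distributed as a zip containing the plugin
// jar (plus any non-provided dependencies) and a
// plugin-descriptor.properties file; a custom sbt task can assemble
// that zip, mirroring what Elasticsearch's Gradle build does.
```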
Next steps: we are going to create a logo, because as you have seen we don't have any visual way of referencing this. But we also need to future-proof this for Elasticsearch 7.2, because we have no guarantee that it will run on future versions of Elasticsearch or that this mechanism will keep working. We are using some things we would rather not, like accessing private fields directly; we would like some kind of API for doing this, which would also make this kind of usage feasible for other plugins.

And we would love some way of analyzing data from the dynamic usage of the plugin, so that when queries go through the static cost analysis, we could compare its results with what, for example, the Kibana profiler gives us, and tune our static cost parameters towards what actually happens in the cluster. That way the default cost parameters we ship on GitHub would be much better fitted to how an Elasticsearch cluster actually performs. And from there we could also improve the static analysis methods so that they are more accurate to what's really happening, including caching and so on.

That's all I had for the talk, so any questions are very welcome. Thank you. Okay, any questions? Yes.

Audience: Do you provide any way of running the static analysis on the query itself, or just on aggregations?

Okay, the question is whether the plugin provides any kind of static analysis over the query part of the query, not just the aggregations. Currently not, because our use case didn't need it, but it's very easy to do: we would just need to add a different interface so that we could handle both aggregations and queries. However — so that you can think of more questions — queries are not as much of a problem, because a query always restricts the dataset.
So even though there's a should clause and you can have multiple conditions in play at the same time, you're always cutting datasets down; you're not expanding the data that you're generating. Anything else? Yes.

Okay. The question is whether there's any benefit to using the internal data structures of the parse tree over parsing the JSON directly. That's a tricky question, because it's a trade-off. We could have our own JSON parser for the queries, but then if Elasticsearch changes the way aggregations are handled, or how the JSON grammar works, we would have to rework that for future versions of Elasticsearch. This way, unless they change the names or the paths of the aggregations, we're done, and it's much easier to adapt: since Java is a statically compiled language, we know when something has changed, because compiling against a newer version of Elasticsearch gives us a compile error, and it's easy to fix — we just use auto-completion in the editor and find out the new name. That's the main reason, but also because Elasticsearch can have multiple ways of expressing a query. For example, Elasticsearch 6 recognizes both "aggs" and "aggregations" as the key for sub-aggregations, so we would have to handle that kind of caveat and quirk ourselves. Okay. Thank you.