The microphone is green and on? Yeah. OK. Let's get started with a double edition: Lucene, and a tiny bit of Solr. Yeah, hi. The next talk will be somehow two stages lower, as he said. This one is not about specific Solr or Elasticsearch features; it's more about what's below everything. The talk will be about Lucene, but I will also have a little bit of Solr, version 8, here. Very quickly, my background: I'm one of the committers of Apache Lucene and Solr, but in my daily life I'm doing more Elasticsearch stuff. And I was the one who implemented the first numeric queries that all of you might use. So this talk will be about Lucene 8. And of course the first question everybody will ask is: when does it come out? The official answer is no comment. But as far as we can see, there are now plans to really do the first test releases, so I hope you get it maybe in the next month, or in two months. Let's see. Important: the release branch for the whole 8 series was cut in mid-January. Based on that, bug fixes are currently committed to three branches: the master branch, which will be Lucene 9, and the 8.x branch. We are also doing some additional deprecations because we want to remove some stuff, and those will go into the Lucene 7.7 release, which will come shortly before the 8.0 release. So that's the plan. OK, the first thing is: what will change in Lucene? Here you already see something about 10 times faster queries. And that's really happening, because the most important change is that Lucene got a new result collection engine. When it executes a query, the way the results are collected was changed so that you can short-circuit the search. That means if you do not need the whole result count, the exact count of matches, then you don't need to visit all the matches and calculate their scores.
You can short-circuit and just say: I have found more than 1,000 results, or something like that. And that brings a lot for specific types of queries, especially Boolean queries. If you have many OR clauses, and maybe some of those OR clauses have very common terms like stop words, you get many, many results. But you still want the OR, because in natural language processing and full-text search engines you are rarely doing AND queries; you are doing OR queries, because you want to show the most relevant results first. But you have a very long tail, and in most cases you are not interested in the long tail. You also don't usually want to know exactly how many results there are, so you can simply cut that off. The new result collection engine works with term queries, Boolean queries, phrase queries, and also constant-score queries. But how does it work? The idea is to add some additional statistical information to the index so you can do that short-circuit: when you're collecting the top-ranking results, you need some information from the index to jump over results when you know they are not relevant at all because their score is too low. To help with that, the maximum term frequency and also the norm are saved per block of the postings lists, and then you can jump over whole blocks. I will explain that a little later. It's also done multi-level, and it's stored in the so-called skip list. Here's the paper about what we implemented, I think from 2011: "Faster Top-k Document Retrieval Using Block-Max Indexes". The cool thing is that those block-max indexes work very closely with the skip lists already in Lucene, so you can reuse a lot of code to implement them, and that made it much easier for us to include this. So the first question is: what is a skip list?
In a standard Lucene index, if you query for two words, say "lucene" and "search", each postings list is a list of the document numbers which contain that term. Normally, for an OR query, you would simply collect all those document numbers and score them. But for an AND query you can make use of the so-called skip list, because for an AND query you have to find all the matches which are on the same document. So you first iterate through the postings list of the first term and find, say, document number seven. Then you tell the other list: skip to any document which is at or after seven. So it jumps from there to that seven. Then you go forward again and find the next one, which must be after seven; here it's 12, but 12 does not help, so you go to 15. And with some additional statistics, if you know, for example, that you want to jump to 15 in the first postings list and you're somewhere earlier, you can reuse information saved per block. The list is split into blocks of four items here, and then you can quickly jump to number 15 or to number 57. That's the so-called skip list. For AND queries the skip list is already heavily used today: when you apply a filter in a Lucene query, most of the query time is spent jumping through those postings. The skip list is also multi-layered: if you are here and want to jump to the document at or after 33, there is a direct path on a higher level, so you can jump even faster. The idea now is to add some additional information, like the term frequency, into that skip list.
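To make the jumping concrete, here is a minimal toy sketch of that AND intersection (my own illustration, not Lucene's actual code: real postings are block-encoded with a multi-level skip list, and `advance` uses skip pointers rather than the binary search used here, but the leapfrogging pattern is the same):

```java
import java.util.ArrayList;
import java.util.List;

public class AndIntersection {
    // advance(): index of the first posting >= target, starting at 'from'.
    // A binary search stands in for Lucene's skip-pointer jumps here.
    static int advance(int[] postings, int from, int target) {
        int lo = from, hi = postings.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (postings[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Leapfrog: each list repeatedly skips to the other list's current doc,
    // so long stretches of non-matching documents are never touched.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { hits.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i = advance(a, i, b[j]);
            else j = advance(b, j, a[i]);
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] lucene = {2, 7, 12, 15, 33, 57};  // docs containing "lucene"
        int[] search = {7, 15, 20, 57, 80};     // docs containing "search"
        System.out.println(intersect(lucene, search)); // [7, 15, 57]
    }
}
```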
Then you can do almost the same for OR queries, because while collecting the top-ranking results you have a specific score. If your top-ranking results are already filled and the lowest competitive score is, say, 10, then any document which would score nine would never get into the top-ranking result list, so you can simply jump over those. Because we cannot store the scores themselves in the postings list, what we do instead is add some of those statistics: we say all documents in this block have a term frequency of at most three. And the score is monotonic: if the term frequency goes up, the score also goes up. The same works for the norms, just the other way around. Based on that, for an OR query you just ask: I need a new document whose score is at least 20 to get into the top-ranking results, and then you calculate from the skip list what the best position is to move to. As you see, of course, I cannot count those documents anymore, because I don't know how many documents I jumped over. That's why you lose the information about how many total hits your query has at the end. If you want more details, especially how this works with the other queries, with Boolean queries, with phrase queries, and so on, there will be a talk by Alan Woodward directly after this one showing you more of that. I wanted to keep it short, but as always, when you explain something, you give as much as possible. In addition to that, Lucene 8 has some new field types. One is the feature field, which already exists in previous versions of Lucene.
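The block-max skipping for a single OR clause can be sketched like this (again a toy illustration under my own simplifications: one flat list of blocks, and a score that is just the term frequency, which satisfies the monotonicity requirement described above):

```java
import java.util.List;

public class BlockMaxSketch {
    // One postings block: doc IDs, per-doc term frequencies, and the
    // block's precomputed maximum frequency (stored in the skip list
    // in the real index format).
    static final class Block {
        final int[] docs; final int[] freqs; final int maxFreq;
        Block(int[] docs, int[] freqs) {
            this.docs = docs; this.freqs = freqs;
            int m = 0; for (int f : freqs) m = Math.max(m, f);
            this.maxFreq = m;
        }
    }

    // Toy monotonic score: higher term frequency => higher score.
    static float score(int freq) { return freq; }

    // Returns how many documents were actually scored. A block whose
    // best possible score cannot beat the minimum competitive score
    // is jumped over entirely -- neither scored nor counted.
    static int collect(List<Block> blocks, float minCompetitive) {
        int scored = 0;
        for (Block b : blocks) {
            if (score(b.maxFreq) < minCompetitive) continue; // skip block
            for (int f : b.freqs) { score(f); scored++; }
        }
        return scored;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
            new Block(new int[]{1, 4, 7},   new int[]{1, 2, 1}),  // maxFreq 2
            new Block(new int[]{9, 12, 15}, new int[]{5, 3, 8}),  // maxFreq 8
            new Block(new int[]{20, 22, 30}, new int[]{1, 1, 2})); // maxFreq 2
        // With a minimum competitive score of 3, only the middle block
        // needs scoring: 3 documents touched instead of 9.
        System.out.println(collect(blocks, 3f)); // 3
    }
}
```

This is also why the total hit count is lost: the six documents in the skipped blocks were never visited, so they could not be counted.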
The idea behind that field type is that you can now use it with a block-max index to jump over documents when you have something like ranking factors inside. Say you have a per-document feature like PageRank and you want to use it in the scoring as an additional factor. As soon as you use a function query, all the block-max stuff no longer works, because you don't know whether the TF and norm in the index help to figure out if the result gets in. So with a function query it won't work, and if you have done that, for example, in Solr using a doc-values field, it won't work here either. The idea instead is to use a feature field, where you simply have a hardcoded term frequency as the value. For example, if the PageRank is high, you set the term frequency to 200 in the term dictionary, and then you can use exactly the algorithms from before to jump over the documents you're not interested in. Most of those fields have so-called feature queries, which are new in Lucene. There is also one for LongPoint, and for LatLonPoint there is a distance feature query that can, for example, score based on the distance from the center of your map. That would be one possibility. In general, those feature queries match all documents; the only thing they give you is an additional score. So what you do is add those feature queries as an additional should clause to your Boolean query, and through that you get an additional boost in the score. The important thing is that the actual selection of documents is still done by the must clauses and the other should clauses that are there; it depends on how you build your Boolean query. That makes it very easy to separate scoring from the Boolean query logic.
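As a rough illustration of why encoding the feature as a term frequency works: the feature value is mapped to a bounded score through a monotonic function, for example a saturation curve, so the block-max machinery can still reason about it (this is my own toy sketch of the idea; the exact functions and encoding in Lucene's feature field may differ):

```java
public class FeatureSketch {
    // saturation(f) = f / (f + pivot): monotonic in f and bounded by 1,
    // so a higher stored "term frequency" always means a higher score --
    // exactly the property block-max skipping relies on.
    static double saturation(double freq, double pivot) {
        return freq / (freq + pivot);
    }

    public static void main(String[] args) {
        // A high "pagerank" stored as term frequency 200 versus a low
        // one stored as 5, with a pivot of 10:
        System.out.println(saturation(200, 10)); // ~0.952
        System.out.println(saturation(5, 10));   // ~0.333
    }
}
```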
Another thing, which already started in 7.4, I think, is for all the people who love span queries; I heard about those in the talk before. Everybody who liked span queries, and I know a lot of people from the patent world, wants to say: find me all documents where in the first 100 words there is "electric" and "car", and all that crazy stuff. If you use span queries, you quickly figure out that they are very, very hard to use. Alan, who is doing the next talk, started to work on simplifying that. There were some problems; for example, you were able to combine span queries on different field names, which of course cannot easily work. So now, instead of a complex construction of span-near queries and span queries, you have only one query, an interval query. That interval query gets so-called intervals, where you define what you're actually querying: the simplest cases are terms, phrases, or ordered and unordered sets. Similar to a span-near query, you can also give a maximum width or maximum gap between the terms, and then build the query. Here, for example, is an interval query on one field, so everything runs on the same field. Then you have an interval which is ordered; ordered means the terms or the intervals coming later must appear in that order. Here, "lucene" must come first, and at some point later in the document there must be the terms "foo" and "bar", also ordered. Between "foo" and "bar" the gap should be limited to three terms, so they cannot be further apart than that. And then you can execute it. As you see, by applying all those filters like containing, not-containing, not-within, you can build really complex queries based on the positions.
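The semantics of that example interval can be shown with a toy matcher over a token list (this is my own illustration of what "ordered with a maximum gap" means, not the Lucene intervals API, which composes operators like ordered, unordered, maxgaps, containing, and not-within over positions from the index):

```java
import java.util.Arrays;
import java.util.List;

public class IntervalSketch {
    // Matches: "lucene" appears, and somewhere after it an ordered
    // pair foo ... bar whose gap (tokens strictly between them)
    // does not exceed maxGaps.
    static boolean matches(List<String> tokens, int maxGaps) {
        int lucenePos = tokens.indexOf("lucene");
        if (lucenePos < 0) return false;
        for (int i = lucenePos + 1; i < tokens.size(); i++) {
            if (!tokens.get(i).equals("foo")) continue;
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals("bar") && (j - i - 1) <= maxGaps) {
                    return true; // ordered: lucene ... foo ..<=maxGaps.. bar
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("lucene", "is", "foo", "x", "y", "bar");
        System.out.println(matches(doc, 3)); // true: 2 tokens between foo and bar
        System.out.println(matches(doc, 1)); // false: foo and bar too far apart
    }
}
```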
Those new interval queries are currently only in the Lucene sandbox, so they are not yet in the main JAR file, but I think Elasticsearch in its next version will already have a query parser for them. This is how your JSON would look at the end: you would give those intervals and use that. Unfortunately, Solr does not really have support for it yet, but I think that will come quite soon. So that's one thing. Another thing that came to Lucene is a long-running issue: a lot of people think it's a good idea to load the index into main memory, and the first thing that looks like the right tool is the so-called RAMDirectory, which has been there since the very first version of Lucene. But this one behaves really badly in production; it was only added for test cases. It has broken concurrency, so if you do multiple searches on the same RAMDirectory, it actually gets slower because of all the contention. The other problem is that it saves file contents in blocks of 8192 bytes, so if you have an index of several gigabytes, you can count the number of byte arrays in your heap, and your garbage collector goes crazy. That's not what you want, because the idea was to have it only for test cases. I know some people are using it in Solr; in Elasticsearch, I think, there's no possibility to use RAMDirectory. For small test indexes it's fine. But it is now replaced in the new version by the so-called ByteBuffersDirectory, which is based on the same infrastructure as the MMapDirectory that you use for on-disk indexes: it is built on byte buffers, so it can also be off-heap. That means you can allocate a ByteBuffersDirectory off-heap, and it shares most of the code internally. So you can actually replace RAMDirectory with ByteBuffersDirectory in the new Solr version.
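The garbage-collection problem is easy to put into numbers: with 8192-byte chunks, the object count grows linearly with index size (the 8192-byte chunk size is the one mentioned above; the arithmetic is my own illustration):

```java
public class RamDirChunks {
    // Number of fixed-size chunks needed to hold indexBytes,
    // rounded up -- each chunk is a separate byte[] on the heap
    // that the garbage collector must trace.
    static long chunkCount(long indexBytes, int chunkSize) {
        return (indexBytes + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        long fourGiB = 4L * 1024 * 1024 * 1024;
        // A 4 GiB index held in 8192-byte chunks:
        System.out.println(chunkCount(fourGiB, 8192)); // 524288 byte arrays
    }
}
```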
If you use RAMDirectoryFactory, I think it will implicitly use that one automatically; the name is just deprecated, but it works. There are also some index format improvements. The first one you already know: the block-max statistics in the skip list, which speed up disjunctions. But we also have a skip-list-like structure for the doc values now. That means that for a function query, doc-values-based queries can now jump to later doc IDs in constant time. The index gets a little larger, but that should immensely speed up queries relying on scoring factors in the doc values. Now, about how to migrate. Since Lucene 7 there is index version enforcement: when an index is created, the version that created the first segment is stored inside the index directory, and it is preserved during upgrades. That's very important, because there were old index formats with broken data, like negative offsets, and we also changed the norms data type; it's no longer a byte, it's a long now, still stored the same way. The problem is that an old index makes upgrading hard. Because of that, there's a Lucene 8 anti-feature: the complete removal of Lucene 6 index support, which also affects Solr people. From the talks I gave in earlier years, you know there was always the possibility to use the IndexUpgrader to raise an index to a higher version. That's unfortunately no longer possible because, as I said before, the minimum index version is stored in the index, so it will immediately say: no, that's impossible.
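The doc-values jump idea can be sketched as follows (my own toy structure, not Lucene's actual encoding: an index entry every fixed number of documents lets `advance` land near the target directly instead of scanning from the current position):

```java
public class DocValuesJump {
    static final int BLOCK = 64;

    final int[] docs;  // sorted doc IDs that actually have a value
    final int[] jump;  // jump[b] = first index i with docs[i] >= b * BLOCK

    DocValuesJump(int[] docs, int maxDoc) {
        this.docs = docs;
        this.jump = new int[maxDoc / BLOCK + 2];
        int i = 0;
        for (int b = 0; b < jump.length; b++) {
            while (i < docs.length && docs[i] < b * BLOCK) i++;
            jump[b] = i;
        }
    }

    // Advance to the first doc >= target: one jump-table lookup plus
    // a short scan inside one block, independent of how far away the
    // target is.
    int advance(int target) {
        int i = jump[target / BLOCK];
        while (i < docs.length && docs[i] < target) i++;
        return i < docs.length ? docs[i] : Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        DocValuesJump dv = new DocValuesJump(new int[]{3, 70, 140, 900, 5000}, 6000);
        System.out.println(dv.advance(800));  // 900
        System.out.println(dv.advance(5000)); // 5000
    }
}
```

The trade-off mentioned above is visible here too: the jump table is extra index data, paid for with faster skipping.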
So you have to reindex. Elasticsearch supports reindexing out of the box, because everything is in the stored source; in Solr you have to figure out whether all your fields are stored, and if they are, you can reindex, but there's no way around it. If you really, really need it, there is a possibility to still do it with some low-level code, but you will lose correct scoring. You can still raise a very, very old index that way; I have brought indexes even from Lucene version 1.4 up to version 6, and it works, but if your analyzers are not compatible it won't really help you, and scoring won't work anymore. So theoretically it's possible. Yeah, I think we have a few more minutes. Perfect. We also have some new features and changes in Solr 8. The most important one, in my opinion, because most of the other features were already added to previous Solr versions, is that starting with Solr 8, the inter-node communication, but also the external HTTP connector, can accept and communicate using HTTP/2. By default it will use the new HTTP/2 SolrClient to talk between the nodes, so all internal requests are automatically sent using HTTP/2. The downside is that Solr 8 nodes cannot talk to old nodes, because by default they try HTTP/2, and the old Solr nodes cannot do that. The most important thing for upgrading: if you want to do rolling upgrades, you can start the new Solr version with a special system property, solr.http1=true, and replace one node after the other across the whole cluster. Once you have done that, you restart all the nodes again with the HTTP/2 connector enabled. One other thing: if you want TLS, meaning encryption, you are required to use Java 9 or later. If you start Solr on Java 8, it will automatically disable HTTP/2 for encrypted connections.
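The rolling-upgrade sequence described above might look roughly like this (a sketch only; `solr.http1` is the property named in the talk, but check the Solr 8 upgrade notes for the exact flags and start options of your version):

```shell
# Phase 1: bring up each upgraded Solr 8 node with HTTP/1.1
# inter-node communication, so it can still talk to the remaining
# old nodes while you replace them one by one.
bin/solr start -cloud -Dsolr.http1=true

# Phase 2: once every node in the cluster runs Solr 8, restart the
# nodes without the property so inter-node traffic uses HTTP/2.
bin/solr stop
bin/solr start -cloud
```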
There are also some changes in BM25 that Solr users need to know about: absolute scores will be lower, but this will not change the sort order in normal cases. And if your luceneMatchVersion is smaller than 8, the legacy scoring is used. Then the minimum Java version: Solr and Lucene stay on Java 8 as the minimum. Later versions work perfectly, and you can easily upgrade. There's a fix for Hadoop with Kerberos coming, HTTP/2 requires Java 9, and there are also some performance improvements if you're using later versions; we have multi-release JARs for that. So at the moment, the recommendation is to use Java 11. And that's more or less the final slide, because public support for Java 8 ended three days ago. You should go to 11, in my opinion, and it's tested very well, I think. So, as the final summary: Solr 8 stays on Java 8, and in the master branch we will likely switch to 11 as the minimum version in the near future. The recommendation, as I said before, is to use Java 11, which should work perfectly with Lucene and Solr version 8. Thank you. I think we have time for one more question. Yeah? I have like two minutes or so, yeah. Okay, two, please. Okay, three people. Okay, yeah? [Audience] Thanks for the talk; I didn't know many of these features. My question is: if I have a ranking function that is different from TF-IDF, say in a function query, is it possible to plug it into the Lucene logic that decides based on the blocks?
So the question was whether it's possible to plug in another ranking function and still have it work with block-max. Yes, you can do that, but the ranking function needs to be monotonic: if the TF goes up, the score should also go up. You cannot do something which behaves completely differently, but you can simply replace the function, and the reason is that we are not storing something like a max score; we are storing the TF and the norm in the index, so you can use any scoring function that works with that. We have a minute. Yeah? [Question about testing] For testing, we also switched to that one. There's a replacement for testing which is simply on-heap, and it behaves almost identically to the RAMDirectory. So for testing there's ByteBuffersDirectory; you don't need to put it off-heap. [Is RAMDirectory gone?] Yeah, it's gone, although I'm not sure; maybe it still survives in the test framework. Okay.