Hi, Nick. You can start the talk now. All right, thanks. So after the last tech talk, we asked for volunteers for who would do the next one, and most people were traveling. So I figured I would talk about Elasticsearch because I'm not traveling. It's convenient, I guess. So I'm going to give you a crash course on Elasticsearch, sort of what it is, and a little bit about how we use it. I'm not going to go into a ton of depth, but that's what you get in 20 minutes. So let's go. So this huge sentence is at least what I think Elasticsearch is. It's a distributed, replicated, RESTful document store with a rich query language that can sort results by relevance, filter the returned documents, extract excerpts from the documents, suggest better search terms, and run aggregations across the query results. Got that? Because I sure couldn't even say it all in one breath. So I'm going to sort of unpack that, and by the end, you'll be able to read that sentence and at least understand what I think Elasticsearch is. So I'm going to start with the RESTful document store. That basically means that you can shove JSON in it and get JSON back out. You POST or PUT to get the JSON in, and you GET to get it out. It's reasonably simple. But sort of tacking that onto search is really quite useful. Oh, God. And it's going off the edge of the screen. But that's OK. I'll just keep going. So the distributed, replicated part: the distributed part basically means that it's sharded. By default, Elasticsearch uses five shards. We default to one shard per index. It's lower overhead that way. But for the most part, sharding is invisible to the client code, right? You don't generally know that your search request is being sharded across 8, 10, 12, 20, 100, whatever shards. It's not a big deal to you. Elasticsearch shards are also replicated. You can configure the number of replicas to keep.
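To make the RESTful document store concrete, here is a minimal sketch of what indexing and fetching a document looks like. The index name, document type, and id here are made up for illustration, and this only builds the paths and bodies you'd send, rather than talking to a live cluster:

```python
import json

def index_request(index, doc_type, doc_id, document):
    """PUT this body to this path to shove JSON into the store."""
    return f"/{index}/{doc_type}/{doc_id}", json.dumps(document)

def get_request(index, doc_type, doc_id):
    """GET this path to pull the JSON back out."""
    return f"/{index}/{doc_type}/{doc_id}"

doc = {"title": "Segment merging", "incoming_links": 7}
path, body = index_request("enwiki_content", "page", "42", doc)
print(path)  # /enwiki_content/page/42
# Fetching uses the same path you indexed to.
print(get_request("enwiki_content", "page", "42") == path)  # True
```

The point is just that the whole store is addressed by URL and speaks JSON in both directions.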
This is neat because you can shoot any one Elasticsearch server and it'll start rebuilding the replicas. In fact, when you do a restart of Elasticsearch, you have to tell it, no, sorry, dude, this machine's going to come back up soon, so please don't go rebuild all the replicas. Yeah, but it's kind of useful. So we can lose machines or even whole racks and continue to serve searches. At diminished capacity, obviously. So it has a rich query language that's made up of posted JSON. It can get very verbose, but it's very rich. Like here, we're just querying one term on the top, and on the bottom, we've got a query with a filter. And you can get more depth than I care to see. But it's very, very strong and rich that way. So I also have said in this, yes? Is that full query interface publicly exposed, in our case, or is it? No, and it really shouldn't be. It's way too dangerous to let people write arbitrary queries in here. So the only thing that we get is the web interface. Is there an additional Elasticsearch API that you're going to talk about that we publicly expose? So we publicly expose the standard search API, and that is how Elasticsearch gets publicly exposed. Any other public exposure of Elasticsearch will be through other similar APIs that will have to be reviewed and things. Internally, we can use this rich query language to do things like reporting. I used it a while ago when I was trying to determine average sizes of things, or average sizes of the documents in our store. We use it internally to find some JavaScript. The problem is, the query language is so rich that you can do really, really, really slow, horrible things with it, that if we exposed it to too many people, or publicly, maybe even publicly at all, then it would be a really easy way to shoot the cluster. And users with shell access can use this on our cluster? Yeah, anyone with shell access can basically hit Elasticsearch with anything.
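As a rough sketch of those two query shapes — a single term query on top, a query plus filter on the bottom — here is what the JSON bodies look like, written as Python dicts. The field names are illustrative, and the `filtered` wrapper reflects Elasticsearch versions of that era; newer releases fold filters into `bool` queries instead:

```python
# A single-term query: find documents whose title contains the term.
term_query = {
    "query": {"term": {"title": "elasticsearch"}}
}

# A query combined with a filter: score on the text term, but only
# consider documents in namespace 0.  Filters don't affect scoring.
filtered_query = {
    "query": {
        "filtered": {
            "query": {"term": {"text": "search"}},
            "filter": {"term": {"namespace": 0}},
        }
    }
}

print("term" in term_query["query"])  # True
```

It gets much more verbose from here, but everything composes the same way: queries nesting inside queries.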
So in that big, long sentence, I said that Elasticsearch would sort results by relevance. That's pretty standard for search systems. Essentially, if you have half a million hits, you need to know which ones matter to you the most. And we have four stages of how relevance is calculated, and I will go through them. The first one is this thing called similarity. Similarity is configurable in Elasticsearch, and it is essentially the lowest-level bit that scores a document based on query terms. The default one is called term frequency inverse document frequency. Or rather, it's called default similarity, but it uses this term frequency inverse document frequency concept. The idea being that documents which contain rarer query terms more times get a higher score. There's a lot of math, and floating points compressed into byte sizes and stuff. There's a lot behind this statement, but it mostly works like that oversimplification that I wrote there. On top of that, we boost certain fields. So when you search, you want matches in, say, the title to be worth, in our case, 10 times as much as matches in the article text. This is done by just sort of multiplying the scores, and that's built into Elasticsearch. Cirrus also uses this field boost to make exact matches worth more. So the example I've got here is if you search for cats, it'll still find cat, but cats plural will be worth more. It's something like four times more in Cirrus. Beyond that, we take the top, I don't know, I think it's like 8,000 scoring documents per shard after all those previous steps, and then we re-score them. What that means is we take the score that's there and sort of do more to it, but only to the top 8,000 or so, because it would be too much work to do it to the top million. There are two kinds of re-scores. I'll get into function score in the next slide.
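The term frequency inverse document frequency idea can be sketched in a few lines. This is only the textbook formula, a stand-in for Lucene's real similarity, which adds length normalization, norms compressed into single bytes, and so on:

```python
import math

def tf_idf(term_freq, doc_freq, num_docs):
    """Classic tf-idf: more occurrences of rarer terms score higher.
    term_freq: occurrences of the term in this document.
    doc_freq:  number of documents containing the term at all."""
    idf = math.log(num_docs / (1 + doc_freq))
    return term_freq * idf

# A rare term (in 10 of 1M docs) far outweighs a common one
# (in 100k of 1M docs), even at the same term frequency.
rare = tf_idf(term_freq=3, doc_freq=10, num_docs=1_000_000)
common = tf_idf(term_freq=3, doc_freq=100_000, num_docs=1_000_000)
print(rare > common)  # True
```

That's the oversimplification from the slide, just made executable.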
But the other re-score that we do is a phrase re-score. And that's pretty simple. Basically, imagine that you shoved quotes around your query term to search for just an exact phrase match. That's pretty much what a phrase re-score does: it shoves the quotes around and then reruns the query against only the documents that you hit the first time around. And you'll still find examples where the phrase isn't in there. But any time you actually have a perfect phrase line up, then that document will be worth more. Lastly, we do this function score step, which is sort of a catch-all for "do math on the old query score based on stuff." So you can do the math based on stuff that's already in the document. Remember how it's a document store? So one of the fields we have is incoming links. So we multiply the score that came in by the log of incoming links plus 2. The only reason there's a plus 2 is it keeps the log always positive. And this is actually a holdover from the way LsearchD works. And it seems to work pretty well to make more popular or more highly linked articles appear closer to the top, but it doesn't get in the way of what the user was searching for. After that, you can, and we do, on some wikis, do things like if any of the pages contain a certain template, then we raise or lower the query score by some percentage. And this is actually really useful on Commons, because term frequency inverse document frequency is not that great at spitting out what are really, really good images or really good videos. It's pretty decent at finding text inside of PDFs and things like that, and for that, this doesn't apply as much. But if you want to find a good picture of a river, you can use Cirrus to search Commons for river, and you'll get the ones that people have already marked as quality. Do you have a question?
Yeah, do you realize that now, in addition to Google bombing, wiki bombing is possible? It's always been possible. But yeah. So the idea is that we use this to link into the community's sort of curated quality markers, right? So it's certainly possible with that incoming links thing to bomb, but it's not ever come up, really. Presumably, the regular editor process will catch the fact that people are adding links to George Bush from pages about action, right? Those would just get removed by the regular process. Yeah, I think Wikimedia projects are more resilient to malicious types of activity like that. And remember that the boost is only logarithmic, right? And I imagine PageRank is something like that, too. So you'd have to add millions, right, before it started to be much better than thousands. Also, we don't have to compete among 17 pages about Benjamin Franklin where I want to push mine on top of the others, which is the Google bombing main case. Yeah, I mean, it's slightly interesting. It's interesting the extent to which Wikipedia is on top of Google results for certain things, but Google's not using our search anyway. So I think that the number of people who would find it interesting to type a term into Wikipedia's own search and come up with something unexpected is way down there, in any case. And the other thing: this particular thing, the logarithmic score boosting on incoming links, is big for the suggester in the upper right. When you start typing the name of a page, that is pretty much only sorted by the number of incoming links. And that's how it worked before, too. But that gives you sort of the more likely pages that you're going to want. Just following up on the previous question, is there any case in which a search-generated page is indexed by Google? No, I don't believe we'd generate that.
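The logarithmic incoming-links boost being discussed looks roughly like this. This is a minimal sketch of the formula from the function score slide, not Cirrus's actual code:

```python
import math

def incoming_links_boost(score, incoming_links):
    """Multiply the query score by log(incoming_links + 2).
    The +2 keeps the log positive even with zero incoming links."""
    return score * math.log(incoming_links + 2)

# Even a page with no incoming links keeps a positive score.
print(incoming_links_boost(1.0, 0) > 0)  # True

# Because the boost is logarithmic, going from a thousand links to a
# million links only doubles the boost, not a thousand-fold increase.
print(incoming_links_boost(1.0, 999_998) / incoming_links_boost(1.0, 998))  # 2.0
```

Which is why bombing it by adding links is a lot of work for not much gain.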
There have been proposals to power certain things like categories by a search, and those may end up indexed. But those, I imagine, are similar. They're curated, right? People are making the category and saying, this is a category that is the intersection of these three categories. So I don't think there's that much danger there either. Yeah, I guess I'm just putting a mental bookmark there that if the search result page is ever indexable by Google directly, then we might start having to worry about people gaming our search. Yeah, and I just made a quick comment on the chat here. We, as a policy with our robots configuration, no special pages are indexed by Google at all, Special:Search being one of them. Because we don't want Google to be indexing dynamically generated content like that. All right, should I keep going? Yes, please. I will keep going. So, if you wanted to go back to the beginning of the presentation when I spewed that big sentence out, the next thing I said was filter the returned document. This is pretty important for us. If we index 10 megabytes of text, like we do for some of the documents on Commons, some of the books in PDFs that we break up, we index the whole thing. We don't want to send that back to MediaWiki to render the search page. So instead, we sort of do two things. The first one is that we only ask for the bits of the document that we need, and Elasticsearch will filter that out on the server. And the second one is that we ask for excerpts from the documents. Elasticsearch calls this highlighting because Lucene's always called it highlighting. That's what it's called. But this sort of boils the query that the user made, or rather, the query that Elasticsearch received from wherever it came from, in our case, from Cirrus. It boils that query down into terms and weights. And then it zips through the, I'm hearing myself, and it's throwing me off.
Anyway, it zips through the text, and it finds those terms and tries to make a summary of the document. And in this case, because I asked for a really, really short summary, something like 30 characters, it did something weird and put the highlighted portion at the very beginning of the summary. Normally, it's centered in the summary, so it's a little better. But in any case, the thing is, this is not as easy as just finding the strings that the user typed in that 10 megabytes of text. Even that might be slow with 10 megabytes of text. But instead, you have to simulate what Lucene did. You have to find the terms that match. So remember how, well, we'll get to language analysis, but you have to find the exact terms that match. And there are three highlighters that go about doing this in Elasticsearch. We're migrating to a fourth one that we've written and released open source. And hopefully, it'll be better. I mean, it has been better, and it's faster, and better on memory and stuff. So, moving on to another thing that Elasticsearch does. I have one question. Can you briefly explain what is the problem in the other three and what have you solved in this one? Not briefly, but: the other three highlighters tie segmentation of the text to identification of hits. They tie those two things together. Basically, each highlighter has one segmentation strategy and one hit identification strategy. And that's not particularly good, because the fastest hit identification strategy is actually tied to the slowest segmentation strategy. As well, it's tied to a segmentation strategy that really only works for prose. And wiki pages ain't prose. There's just too much stuff in them that isn't all sentences. And there's no real way for me to reliably extract all the non-sentence stuff.
So the fourth highlighter solves that problem by allowing you to pick which hit identification strategy is used as well as pick the segmentation strategy, along with a few other bonus things that were easy to implement because we were implementing our own highlighter. Things like: the old highlighters would weight hits more highly that contained multiple instances of the same word. So if you had three instances of the same term from your query, that would be worth more than one instance each of two different query terms. If you search for A, B, then the old highlighter would rate A, A, A higher than A, B. And it would rate A, B, C the same as A, A, A. And that's just silly, right? So that's one of the things that the highlighter we've written fixes. So you said you're running this on wikitext, not on the HTML output of the page, right? No, we run it on the HTML output of the page. Running it on wikitext is wrong because it wouldn't include the transcluded templates. So why is it hard to get rid of all the non-sentence problems? Because people don't just put sentences in paragraph tags. They put them in tables. They put them in definition lists. And it's hard to tell whether the table contents is, indeed, a sentence, or whether the definition list is or isn't a sentence. The other thing is, the sentence segmenters don't have a good concept of space. So if somebody writes a huge sentence, then it won't break it apart in there. That's obviously a problem we can solve. But combining that with just the trouble of breaking out the non-sentence stuff — we do a little bit of that. We break out tables from the article text. We break out image captions from the article text. We break out infoboxes from the article text. But we don't think that we do it reliably enough to go to trying to segment on sentence breaks. We think we get too much non-sentence stuff still in the article text.
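Putting the filter-the-returned-document and excerpt pieces together, a search request asking for only a couple of fields plus short highlighted fragments looks roughly like this, written as a Python dict. Field names and sizes are illustrative, not Cirrus's actual request:

```python
search_body = {
    # only return these stored fields, not the whole 10 MB document
    "fields": ["title", "namespace"],
    "query": {"term": {"text": "river"}},
    # ask the server for a short excerpt around the matched terms
    "highlight": {
        "fields": {
            "text": {"fragment_size": 30, "number_of_fragments": 1}
        }
    },
}

print(sorted(search_body))  # ['fields', 'highlight', 'query']
```

All of the heavy work — trimming the document and building excerpts — happens server-side, so only small responses travel back to MediaWiki.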
OK, I mean, we could talk about that offline. Sure. I mean, I'd love to be able to get more of it out. Yeah, I mean, when I was working on the PDF back end, I also have a plain text back end that takes a wikitext article and generates a very nice, plain ASCII text output. And that seems to be pretty readable without a lot of non-sentence stuff. But probably I'm just misunderstanding the problem that you're solving. So we can talk about that. Maybe. Or maybe you've done it better than I have. That's not unlikely. All right, I'm going to keep going, and we'll talk later. So another thing Elasticsearch can do is suggest better search terms. So like I say, in this first case I have, it'll actually catch that the term is misspelled. That's not all that interesting. We've had spell check as you type in Microsoft Word or wherever. It's existed everywhere for 10 years. So that's not that cool. But the other thing it can do that is reasonably cool is it can detect when you have spelled things correctly but are using them in a context that's sort of silly. Like if you say Noble Prize winners, it says, no, no, no, I think you mean Nobel Prize winners. And Cirrus, unlike Google, does not take liberties with the search and say, no, your search was just totally whacked, so I'm going to give you this, my suggestion, instead. We'll always give you your original search, but we will also offer you the option of a suggestion, if we can find one that is better. But we won't look if your search actually is an existing page in MediaWiki. So that's just something that it does. And that has an interesting effect, because enwiki has tons of redirects to handle these odd cases. Like Noble Prize is a redirect to Nobel Prize on enwiki. So enwiki doesn't get as much benefit out of this as the others do, because the others don't have quite as many of these redirects, but this is still useful. Cirrus seeds these suggestions with titles and redirect titles.
It does not seed these suggestions with suggestions from the article text, which means that's kind of a mixed bag. But that's how we do it now, and maybe we'll change it. I don't know. The last thing that I said Elasticsearch does is something that we don't actually do very much at all. But I included it there because they're very proud of it. It does these aggregations. So normally, when you search, you get the top N matching documents, filtered and highlighted. So you ask for 50, you get 50 results back, filtered and highlighted. You also get a count of all the matched documents. The aggregations sort of expand on that count. So for all the matched documents, you could get statistical information about a field, or distinct values, distinct terms that come from the field. You could get a histogram of where the field values fall. You could do faceting; my favorite example is faceting based on categories. So we could tell people, hey, your search found 10 articles that are about people and 20 articles that are about cars or whatever. I think it'd be interesting, but that's one of the uses of Elasticsearch that we just haven't really gotten into too deeply. You can ask for as many of these aggregations as you want on any query. These aggregations, though, are famous for being relatively CPU- and memory-intensive. So now that I've told you what Elasticsearch is, it's not really fair to talk about Elasticsearch without talking about Lucene. Lucene, another big sentence I'm going to unpack, is a Java library that implements language analysis, search data structures, and queries that use them. Many of the features that I said were Elasticsearch features are, I mean, they're Elasticsearch features, but they're built almost directly, like almost word for word, on Lucene. And so a lot of the stuff that you do, like the query syntax, for example, is JSON that is translated into Lucene query objects that are executed.
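The category-faceting example could be expressed as a terms aggregation riding along on a normal query, something like this sketch (the field name and size are illustrative):

```python
facet_query = {
    "query": {"term": {"text": "river"}},
    "aggregations": {
        # distinct category values among the matched documents, with a
        # count for each: "10 articles about people, 20 about cars"
        "by_category": {"terms": {"field": "category", "size": 10}},
    },
}

print("by_category" in facet_query["aggregations"])  # True
```

The response would carry the usual top hits plus a bucket list of categories and counts, which is where the CPU and memory cost comes in.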
So it basically allows you to execute Lucene stuff over REST and in a distributed way. So Elasticsearch has to sort of merge the results later. So I will unpack that sentence, and I will talk briefly about Java. But I will not talk about Java; I will instead talk about the JVM, because almost everyone here knows what Java is, and either likes it, or doesn't, or whatever. The bits that are important about the JVM are that it's a virtual machine that does garbage collection. It's a single-process, many-threads, shared-memory model. So not PHP or Node, or Erlang, or any of this stuff. It's more sort of your traditional, be very careful with your memory and your locks kind of runtime environment. It's known to start really slow, but once it gets up and running, it's reasonably quick. There have been some silly benchmarks that compare it to C and say, look, Java is faster than C in this one weird corner case. And that may be the case, but in reality, it is a place to go if you want code that runs pretty quick and has garbage collection. Java's also famous for eating gigabytes upon gigabytes of RAM. And Elasticsearch is no different. We feed it a lot, a lot of RAM. Finally, when you say JVM, there are actually like five different JVMs, and there have been more in the past. What really matters is that we use the OpenJDK. And at least to my understanding, most of the JVMs are based on the OpenJDK. So when there are errors, someone at Oracle or wherever tends to submit patches to the OpenJDK to fix it and then re-release their Oracle VM built on top of the OpenJDK, or something; I'm not really sure exactly how it works. But the upshot is the OpenJDK is basically the same thing. So Lucene is a little different from a regular Java library in that Lucene is sort of old school, right? It doesn't have many dependencies. Java applications have tons, tons, hundreds of dependencies.
Lucene has none in its default way that it's shipped. It breaks itself down very small so that you can run it with very, very little RAM or with very, very few dependencies. If, say, you want to analyze Polish correctly, you might need another dependency, and it'll have a small place to put that in, right? Elasticsearch doesn't include all of those. Some of those turn into Elasticsearch plugins. And we want them all, it turns out, because we do want to analyze Polish well, and we do want to analyze Japanese well, and we do want to analyze Chinese well. So actually, one of the things that I'm working on now is integrating all of those plugins. In any case, Lucene was sort of designed to be minimal. So as I said, Lucene implements language analysis. So for example, if you search for use, it really ought to find the word used, right? They are similar enough that a search for one should find the other. Lucene does this with a tokenizer and a chain of token filters. For example, if you take the fragment Used to throw projectiles, the tokenizer is the thing that decides that it is indeed four tokens. There's a filter here that lowercases, so it squashes the capital U away. And there's a filter here that stems used to use. There's tons. This is actually where we get into the most interesting bits: using the appropriate filters for the appropriate languages and things along those lines. I think on enwiki, we use six filters, rather than the three that are shown. But they're all useful to make things better. So Lucene has this thing called an index. And Elasticsearch has this thing called an index. And they ain't the same thing at all, right? An Elasticsearch index is the thing that you refer to in the URL. So we have an Elasticsearch index for enwiki's content articles, and then we have an Elasticsearch index for enwiki's everything-else articles. This is actually pretty much how all of the wikis go.
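Going back to the analysis chain for a second, that tokenizer-plus-filters pipeline can be sketched in miniature. The "stemmer" here is a deliberately crude stand-in for Lucene's real, language-specific stemmers, just to show how the stages chain together:

```python
def stem(token):
    """Toy stand-in for a real stemmer: strip a trailing 'd' or 's'.
    Real stemmers are language-specific and far more careful."""
    if token.endswith(("d", "s")) and len(token) > 3:
        return token[:-1]
    return token

def analyze(text):
    tokens = text.split()                  # tokenizer: decides token boundaries
    tokens = [t.lower() for t in tokens]   # filter: squash the capital U away
    return [stem(t) for t in tokens]       # filter: used -> use

print(analyze("Used to throw projectiles"))
# ['use', 'to', 'throw', 'projectile']
```

The same chain runs at index time and at query time, which is why a search for use can find used: both sides end up as the token use.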
We have an index for the content and an index for the everything else. And Elasticsearch indexes are some state plus some shards, right? And that state is for coordination across the cluster. It's generally held in memory, and the masters get to mutate it and all that other great stuff, right? That's all very Elasticsearch. The shards themselves are one primary and zero or more replicas. Each of those, the primary and the replicas, is a Lucene index. So that's how you get from the phrase Elasticsearch index to the phrase Lucene index: Elasticsearch indexes are made up of many Lucene indexes. In the case of enwiki, it's made up of 60. So every Lucene index is made up of segments. And this gets sort of deep into Lucene, but Lucene segments are written once. Deletes are done by tombstones in the segments. And updates are done by atomic deletes and then adds to new segments. The only way to get the tombstones out is to rewrite the segments by merging them with other segments. And this also sort of is how you go from, well, shit, I have 30 segments, but that's too many; it makes querying very slow. So the active indexing in Elasticsearch is constantly this creation of new segments and merging of them. Because to make something visible for a search, you have to finish the segment that it is inside. Elasticsearch by default does this every second. So every second, it just smashes that segment closed. It either writes it to disk or holds it in memory, depending on the size. And then as time goes on, it picks up those small segments and merges them together into larger segments, and then does it again and again and again. We've raised the default to 30 seconds because we thought it would lower load on Elasticsearch. It has, marginally, but I'm not going backwards.
I'm not gonna set it back to one second, just because all of the Cirrus updates go through the job queue, and that tends to have at least 30 seconds of latency anyway. So only doing the refresh to make these things visible every 30 seconds isn't a big deal. Or rather, it doesn't make Cirrus feel less responsive. It's still pretty snappy feeling. Anyway, Lucene segments are made up of many files. Each index lives in a directory where you just, like, spew tons of these files. There are way too many types for me to go through, but I will give you three quick examples, right? So if you were to go into the directory and, like, you know, ls in there, you'd see .frq files, which are a list of the documents that contain each term and the term's frequency in each document; .tim files, which hold each term's overall frequency and pointers into the .frq files; and .tip files, which are an index into the .tim files. And there are like a dozen of these file types that do these other things. And in fact, that's how the document store is implemented. Like, that's how everything is saved, in these files. Elasticsearch has no backend, other than for cluster state maintenance, that isn't in Lucene files. So, and this is my index slide. What's left for Cirrus? Like, why aren't we using this everywhere? The big problem is that it's just not fast enough to serve enwiki's traffic, which is really disappointing, at least to me, right? It feels like I'm playing whack-a-mole with speed issues. And you know, every time we get at it, sometimes it'll be something that we need to fix in Cirrus, or sometimes it'll be something that we need to fix in Elasticsearch.
And sometimes it'll be a combination of both. Or sometimes I'll have to, like what I did with the highlighter, just spin up my own project, because there aren't enough people at Elasticsearch that have the time to review another highlighter, and I felt that we needed to move quicker on the highlighter than the regular Elasticsearch release cycle. So, unfortunately, this is sort of the point of the project, right? We want to use something like Elasticsearch that isn't just our baby, right? And we want to use it because, you know, there'll be more expertise floating around. We want to use it because any advances that we make, you know, we want to raise all the other ships, right? That's sort of one of the points, right? That's why everything that we write is open source. That's the idea, right? And it's useful that our old search was very fast. It is very fast. But that speed hasn't migrated its way back into Lucene, right? And so some of what I do is go look around the old search and find what ideas are in there and see what can apply where, right? In any case, that's the big point. That's the big reason why we're not using it everywhere. Secondarily, there are some languages for which LsearchD does a better job, but I am not clear on which they are. If you had asked me two weeks ago, I would have said LsearchD does a better job with Hebrew. But then early this week, someone pointed me to an article in Hebrew, which I could not read, that said that Cirrus was great and was awesome for Hebrew. And they were very, very happy about it. So I don't know, man. We have, as I said, these plugins that we're trying to get time to integrate. And basically, my life is juggling fixing these bug reports where we're not mimicking some weird syntax that LsearchD had and never documented, by the way. So it's that, and improving results in non-English languages. English is pretty good because I can read it and tell.
And thirdly, doing performance stuff. It's gotten to the point with the performance stuff, by the way, that we're not actually using much CPU. When I replay traffic into CirrusSearch from LsearchD, we're not actually using that much Elasticsearch CPU. It's fine. In fact, when I look at the hot threads and all the other reports that one usually does in Java, they say, like, yep, this server is doing just fine. But we're spending 25% of wall clock time garbage collecting. Like, that's what I discovered the past week. So that's the new thing. And until three days ago, I had no clue how to debug garbage collection issues. I mean, I knew a little, but I'm getting much more versed in that. And this is, hopefully, going to be one of those all-ships-rise things, where I find out that there's something that we can do, that LsearchD did, maybe, that we can use and sort of apply back into the wider community. I'm going to stop rambling and take questions. All right, so on IRC, we've got some questions. Let me start with one from Subbu. Can Nick talk a bit about the Wikipedia document base that makes this search problem different from, say, a set of docs on some other random website? Since fully expanded docs are processed, presumably, templates don't really get in the way, or do they? Templates. So the old search did not expand templates. The old search actually wrote its own query parser, its own expansion parser, which is, like, you know, the parser people know that's hard, right? So it obviously cut a lot of corners, right? Cirrus just asks MediaWiki to do the job, right, and expand the templates. So in that sense, no, templates don't get in the way, right? And in fact, Elasticsearch really doesn't know anything about templates, beyond that we send it a list of templates that are used in the pages so that we can search those, right?
One of the things that is different and more difficult about the Wikipedia document base than a regular document base is that people make changes to templates all the time, and those templates are included in lots of pages, and those templates actually do cause meaningful changes to the pages. And so we end up having to rebuild the search index for those pages whenever the templates that they include change. We do not have an optimization where we look it up and say, like, oh, did the page change meaningfully, where we pull it out of Elasticsearch and say, like, oh, you know, there's no extra text in here, it's okay, we'll just move on. We honestly probably should. But that's one of the issues that we have: we end up having to re-index the page over and over again as people change templates, which they seem to love to do. Especially on enwiki, Commons, and English Wiktionary, I don't know why. All right, there's another question from Sumana. Which non-English languages are we especially grateful for testers in? All the ones I don't speak, which is all of them. I am especially grateful for the help that I've gotten from Matanya in Hebrew. He's the one that originally said, man, this is horrible. And then found a post that said it's great. And I've been working with him to get an analysis chain deployed that is really good in Hebrew, that will let you find the things that you wanna find. I had some people early on ask me about, I actually forget which language it is, but it's one of the Indic languages that uses zero-width joiners pretty frequently. And it took me months to get around to being able to integrate the plugin that could do the appropriate fixing. So basically, anyone with experience in any language that ain't English: if you speak it, if you barely speak it, you're better off than I am, and that would be great.
All right, and there's also been another question, a bit of chit-chat here while you were doing your presentation, about the comparison between Elasticsearch and Solr. Anything you want to say about that and our reasons for going with Elasticsearch?

So the story is we actually started using Solr. And Chad went on vacation, and I spent a week porting it to Elasticsearch to see how hard it would be. And by the end of the week I was in love. I liked it a lot. The documentation was really good. I have contributed to both Elasticsearch and to Lucene, and I enjoyed the process of contributing to Elasticsearch more than the process of contributing to Lucene, so that weighed in my decision. The other thing is we got a little bit of pressure from ops. Just a little, like bugging me all the time, but it was the correct kind of bugging, the why-don't-you-try-this kind of bugging. And that may be because Elasticsearch is very commonly used in the log analysis and system administration world, and we use Elasticsearch there too. And the fact that we're using the same tool both for the log stuff and for search is useful: fewer tools means more people can have more expertise in the tools that you have. The other thing is that at the time when we started, Solr didn't have a good story to tell about how you built the schema, how you described how fields should be analyzed. Rather, it didn't have a good story about how you got that description of how the fields should be analyzed from your application into the server that was doing the analysis. It assumed that you basically SCPed the files from one place to the other, whereas Elasticsearch has an actual API for this.
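To make that contrast concrete: in Elasticsearch, the analysis configuration is just part of an ordinary index-creation request body, so the application can ship it over HTTP instead of copying schema files onto each server. A minimal sketch, where the analyzer definition is illustrative rather than the actual CirrusSearch configuration, and the host in the comment is an assumption:

```python
import json

# Index settings carrying an analysis chain: a custom analyzer built from
# the standard tokenizer plus lowercase and ASCII-folding token filters
# (all real Elasticsearch built-ins; the analyzer itself is made up here).
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "text_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}

body = json.dumps(settings)
# A client would send this as, e.g.:
#   PUT http://localhost:9200/my_index   (host and index name assumed)
print(body)
```

The point is that the description of how fields get analyzed lives in the application and travels with an API call, rather than living in files that have to be deployed to the search servers out of band.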
And I'm pretty sure the Solr folks have solved that problem by now. But that kind of thing, that we-should-have-an-API-for-everything, the-whole-thing-administered-via-REST approach, is really, really convenient, and it's something that I really like about Elasticsearch.

I just wanted to add one thing to that, because I'm talking and not typing. I went on vacation, I came back, Nick had done this, and I fell in love with it as well, for a bunch of the same reasons. One of the things that turned me off Solr more than it drew me to Elasticsearch was that at the time, and I think this has gotten a little bit better since then, the large-cluster infrastructure within Solr was more of an afterthought in the Solr community, because it had initially been much more of a master-slave setup, a lot more like your traditional database setup, and not the multi-master, multi-node setup we have with Elasticsearch. Elasticsearch was built with that mindset from the ground up. So there's a lot less room to shoot yourself in the foot when setting it up the way that we wanted to set it up.

So, a big question around the API: if I was building an application that uses the search, would it just be using the existing OpenSearch API, with all the Elasticsearch functionality just added to the query string? Or is there an additional API that I should be aware of?

The first one. Everything that's going to get exposed to users goes through that. The API expressivity is more of a wow-this-is-really-good for the people talking directly to Elasticsearch. But it is very easy to form a query against Elasticsearch that is just too slow, or useless, or, you know, very silly.
And so the way they're going to be exposed to other folks is through Cirrus, through the OpenSearch API, or through other APIs. If we need to, or want to, whatever verb, make another API that hits search or Elasticsearch, then we can, and it's reasonably easy. The library that we use to communicate with Elasticsearch is great; we had great support from that department as well. So it's very easy to build more things on top of this.

That makes sense. And a question regarding the scaling issues with enwiki that you mentioned: to what extent are the scaling characteristics such that we could address them by just throwing more hardware at the problem?

We could address them by throwing more hardware at the problem. We have 16 nodes right now, which is a lot. If we felt like tripling it, then yeah, we could totally just cut enwiki over. Well, once the servers are up.

Great. Yeah, we might want to consider that, depending on the development effort that we'd have to throw at solving the problem in other ways, because that's not free either, right?

Right. The advantage of the development effort is, well, I guess there are two advantages. One is that it makes me feel better that I've made something that's fast. But secondly, it's going to help the community as well. And also, if we want to build more on Elasticsearch, we're going to need it to be not near the edge; we're going to need lots of headroom. And I'm not sure that if we wanted to throw more hardware at the problem, we would want to stop trying to make it more efficient anyway. What it might do, what it should do, is push us closer to the day when we turn off lsearchd, and we'd still keep working to make Elasticsearch more efficient after that.

Yeah, I mean, know that you have that option if you want to discuss it.
Yeah, I'll throw in two performance-related comments, too. One is that I'm a little bit nervous about people chasing down garbage collection issues in Java programs, because sometimes that is just normal program behavior that you'd see as calls to malloc in a non-garbage-collected language. The fact that you're allocating lots of memory means that you're going to be garbage-collecting lots of stuff, and that might not be a bug; it might be that you actually need this garbage. But I'd check myself after doing an initial investigation, to make sure that I wasn't just chasing phantoms.

The thing that makes me want to dig further here is not that garbage collection is bad, or that it's some sort of demon. It's that lsearchd is able to do better. And if it's able to do better, then...

I mean, my point is that...

It's able to do like an order of magnitude better. And lsearchd is still in service. If Elasticsearch were 50% less efficient, that is something we could probably live with. It's not great, but it's something we could probably live with. But it's like 10 times less efficient right now. Because lsearchd is handling all of the enwiki queries without breaking a sweat, and when I shuffle 75% of them to Elasticsearch, the servers in Ganglia start to glow very bright orange and red. And that's just not cool.

And you'd think the hard part of the problem should be in Lucene anyway, and that part isn't changing, so why is...?

Kind of? I mean, it could be that the way lsearchd does its job is using Lucene more efficiently than we are. And it might be that all I have to do is mimic that with Elasticsearch. The other thing is that it's not clear where the problem is. It was easy for me to find the slow bits when they were CPU-bound.
But garbage collection is sort of a delayed thing, so it's very hard to say, oh, this is the bit. I don't know. And we've submitted quite a few things to Elasticsearch that have made portions of it quite a bit quicker, so it's not that this is impossible.

All right, we've been going 53 minutes here. There's one question, and then I think we need to wrap up. There's one question from Jeremy B: what's the balance of work on performance versus work on features, language support, et cetera, in a typical month?

I don't have typical months. I have typical weeks, maybe.

Then in a typical week?

And it changes even then. A month ago, I spent two weeks straight on performance issues. And last week, I spent 80% of my time on language issues and the bugs at the bottom, where we weren't quite mimicking lsearchd properly. So this week, my guess is I'm going to go more towards performance issues. It certainly felt like more performance issues this week, though I'm not religious about my time tracking, so I'm just estimating.

All right, well, I think that's all. So thank you very much, Nick. And where should people report any feedback and find you?

Oh, God. You can report feedback on the CirrusSearch component in Bugzilla. You can report it on the talk page for the search page on mediawiki.org. Or you can bother me on IRC in, like, a bajillion channels; I am manybubbles on IRC.

All right, thank you very much. Thanks, guys. Thank you.