Thank you for having me here. The talk is Beyond the Basics with Elasticsearch, and it is essentially about all the use cases where you can use Elasticsearch that might not be immediately obvious. So we'll walk through several of those and see how and why it is suited for each particular scenario. But before we go beyond the basics, we need to talk about what we are going beyond. What is the base functionality of Elasticsearch, and how come we can do all these other things? Where is it all coming from?

Well, it's all coming from search. Search, especially full-text search, is the primary function of Elasticsearch. And search is not a new problem. It's been around for a while, and it hasn't actually changed much. The first index over a book, over some text, was created around 1230, and we still use the same data structures to this day. Of course, there have been plenty of improvements, but the underlying infrastructure, the inverted index, remains the same. It is the index that you're familiar with if you've ever read any book, which I sincerely hope you have.

And this is how it looks. You have the list of interesting words, and then for each of these words you have a list: in the case of a book, a list of pages; for us, a list of documents that actually contain this word. And notice several things. First of all, the words are sorted. Of course, that makes sense, because you need to be able to find the word you're looking for so you can go to the page that actually contains it. But the pages, or the documents, are sorted as well. That is not accidental; it is very important for us, and we'll see how. Also, when we're talking about search using a computer, there are other things involved in this data structure, notably some statistics. For example, how many times is this word contained in this document? What is the length of the list? Things that will be very important later on.

So when we have a data structure like this, how does search work? Well, it's super simple. If we're looking for a document that mentions both Python and Django, we get back the two lists, and now we just walk the lists and merge them together. Whenever we find a document that is present in both lists, that is our result. If we wanted to do something like a phrase search, where we're looking only for places where Python is immediately followed by the word web, all we have to do is add one more piece of information to the inverted index: offsets, the position of each word in the document. Then, when we're going through the merging process, we say we care not only that the document is in both lists, but also that the offsets immediately follow each other. So Python would be at position n, and web would be at position n + 1. You can see that doing a phrase search is not any more expensive than doing a regular search; you're just adding one more comparison, and a numerical comparison at that. So it is fairly efficient.

What else you can imagine here is that I can get the lists of documents from anywhere. They don't have to come from the same index. So I can have multiple indices, I can have an index on every single field in my document, and I can use them all.
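To make this concrete, here is a toy sketch in Python, nothing like Lucene's real on-disk format, just the idea: sorted posting lists, a merge for AND queries, and positions for phrase queries.

```python
from collections import defaultdict

# A toy inverted index: word -> sorted list of (doc_id, positions).
# All data here is made up, purely to illustrate the structure.
index = defaultdict(list)

def add_document(doc_id, text):
    for position, word in enumerate(text.lower().split()):
        postings = index[word]
        if postings and postings[-1][0] == doc_id:
            postings[-1][1].append(position)
        else:
            postings.append((doc_id, [position]))

def search_and(word_a, word_b):
    """Walk both sorted posting lists, keep doc ids present in both."""
    list_a, list_b = index.get(word_a, []), index.get(word_b, [])
    i = j = 0
    hits = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i][0] == list_b[j][0]:
            hits.append(list_a[i][0])
            i += 1
            j += 1
        elif list_a[i][0] < list_b[j][0]:
            i += 1
        else:
            j += 1
    return hits

def search_phrase(first, second):
    """Same merge, plus one extra check: the positions must be adjacent."""
    hits = []
    for doc_id in search_and(first, second):
        pos_first = dict(index[first])[doc_id]
        pos_second = set(dict(index[second])[doc_id])
        if any(p + 1 in pos_second for p in pos_first):
            hits.append(doc_id)
    return hits

add_document(1, "python django web framework")
add_document(2, "python web scraping")
print(search_and("python", "web"))     # [1, 2]
print(search_phrase("python", "web"))  # [2], adjacent only in doc 2
```

The phrase variant is the same sorted merge plus one numeric check, and the merge does not care where each posting list came from, which is why querying many indices at once stays cheap.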
If I have one condition on the title, one condition on the category, and one on the body, I will just query those three inverted indices to get the posting lists, as they're called, and merge them together. So we don't have the limitation of many other data stores on the number of indices you can use per query, per collection. That is another thing we benefit greatly from with this data structure.

And finally, the last thing you do when you do this merging: when you find your match, you quantify how good a match it is. That is the primary difference between a search engine and a database. We not only tell you which documents match your query, but also how well they match. Is it a good match, or is it just meh? And we know that because we have the statistics. This is called relevancy: we tell you how relevant the document is to your query.

So how is relevancy computed? At the base of it there are two numbers, numbers we call TF and IDF. TF is term frequency. It is just the number of occurrences of the word in the given document, or in the given field. So if I'm looking for Django in a document: how many times does this document contain the word Django? Is it there only once? Is it there three times? And obviously, the higher the number, the better the relevancy. IDF is inverse document frequency, which is just a fancy way of saying how common or rare this word is in your entire dataset. Is this a word that is contained in every single document you have, or is it only present in 1% of your documents? We can get this information right away from the inverted index, because it is essentially the length of the posting list compared to the number of documents we have overall. It's fairly easy to calculate, and there's the exact formula if you're so inclined. This number has the opposite effect: the more common the word is, the less relevant the document is for the result. Because if we find a word that is in every single document, well, who cares, it's in every single document. Of course we're going to find it. That doesn't mean anything.

So this is the base formula for anything that has to do with relevancy, and it works very well for text. Now, Lucene, the library that does the indexing and the heavy lifting for Elasticsearch, adds some more stuff on top of it. You can see the exact formula there, and in the middle of it you can see the TF-IDF, the big sum. What Lucene adds on top is that it takes into account, for example, the length of the field. Because finding the word Django in the title versus in the body gives us different information, right? If we have a short field and we still find the word there, it's more relevant than if we have the full text of a book and find it there as well. Those are different kinds of information. So it improves on the basic TF-IDF formula, but it is still relying only on the statistics it learned about your dataset.

And sometimes you want to go a little further. Sometimes you have other information about your data. So imagine you have a Q&A website where people ask questions and give answers. Let's call it, I don't know, Buffer Overflow. And you have the users rate the questions and the answers: this is a good question, this is a good answer. That is information you want to take into account. But you don't want to sort by it, because that would completely destroy your relevancy.
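To make those two numbers concrete, here is a minimal sketch of the classic weighting; classic Lucene scoring used roughly this shape, and the full formula on the slide adds field norms and other factors on top.

```python
import math

def tf_idf(occurrences_in_doc, docs_with_term, total_docs):
    # TF: more occurrences of the term in this document, higher score
    tf = math.sqrt(occurrences_in_doc)
    # IDF: the rarer the term across the whole dataset, the more it counts
    idf = 1.0 + math.log(total_docs / (docs_with_term + 1.0))
    return tf * idf

# "django" appears 3 times in a doc, and in 10 of 1000 docs overall:
print(tf_idf(3, 10, 1000))    # ~9.5: rare term, strong signal
# a near-ubiquitous word in 950 of 1000 docs barely registers:
print(tf_idf(3, 950, 1000))   # ~1.8: common term, weak signal
```

Both numbers are pure collection statistics that fall straight out of the inverted index; nothing here knows about votes, ratings, or distance yet.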
If you had one very high quality question with many, many votes, it would always be on top no matter what people searched for. That's not what you want. You want to take the relevancy, take the popularity, and combine those numbers together into something you can then sort by. Another use case: I'm looking for a hotel, or for a venue for an event, and I want it to be in a certain location. I'm not strictly limited to that location, but I would prefer it to be there, or around a certain time. So again, we can take a numerical indicator, the distance from our ideal, and feed it into the relevancy mechanism.

So that is the theory behind it, and this is the practice; this is how it actually looks. This is a fairly simple query in Elasticsearch where we're looking for a hotel. We're looking for a hotel called the Grand Hotel, and we have several other criteria. We would prefer it to have a balcony; we are not limiting ourselves to hotels with balconies, but if a hotel does have one, we want to add two to the score, to bump it up. Also, we want it to be in central London, within one kilometer of the center of London; well, not the center, Greenwich, because you can identify it by the zero in the coordinates. And we want it to be there, but we don't want to exclude hotels that might be so good that they would get to the top even without fulfilling this criterion. So we say we want to use a sort of cost function to calculate the score; it's one of the shapes in the image on the right that determines how fast the score drops once you get outside your ideal zone. We also want to take into account the popularity of the hotel.

And then we want to add some random numbers, because random numbers are always good; they make everything so much better. In this case, the random numbers are there to shuffle the results around a little bit so that people have a chance to discover new things. We actually took this from a customer, from a real example of how they're doing it. They have this random score in there because otherwise there would be some hotels, some results, that would never be hit, because they would always be just below the fold. And people would also perceive the results as stale. But if you shuffle things around a little, people will always find something new, they will always be excited, and hopefully they'll come back to your website.

So this is one of the ways we use the statistics, the TF-IDF and all the things we know about your data. This is the most straightforward way: we use it to calculate relevancy, and we allow you to hook into that process yourself. If you are so inclined, you can remove all of this and just say, hey, instead of all these different criteria I just want to use a script, and give it an expression in your favorite programming language, whatever that is, even Python, and do all these calculations yourself. These functions are essentially just pre-built scripts that we ship so that you can use them without having to expose the scripting functionality, because obviously that can have some issues.
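Put together, a hedged reconstruction of the hotel query described above; the index name, field names, coordinates, and weights are all illustrative, not the customer's actual code.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

query = {
    "query": {
        "function_score": {
            "query": {"match": {"name": "grand hotel"}},
            "functions": [
                # prefer, don't require: a balcony bumps the score by 2
                {"filter": {"term": {"features": "balcony"}}, "weight": 2},
                # decay curve: full score near Greenwich, then it drops off,
                # one of the shapes on the slide (gauss, linear, exp)
                {"gauss": {"location": {"origin": "51.48,0.0",
                                        "scale": "1km"}}},
                # popularity feeds into the score instead of sorting by it
                {"field_value_factor": {"field": "popularity",
                                        "modifier": "log1p"}},
                # a little randomness so the results get shuffled around
                {"random_score": {"seed": 42}},
            ],
            "score_mode": "sum",
        }
    }
}

results = es.search(index="hotels", body=query)
```

The script hook mentioned above lives in the same place: any of the functions in that list could be replaced with a script_score carrying your own expression.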
So that was the first use case: getting more out of the relevancy we already have. Another interesting use case revolves around reverse search, or as we call it, the percolator. And it is exactly what it sounds like: search in reverse. In the normal world, you index your documents and then you run your queries. With the percolator, you index your queries and then you run your documents.

What is this useful for? Alerting, for example. If you have something like a stored-search functionality on your website, you have classifieds or something like that, you allow the user to search, and nothing shows up. But the user wants to say: hey, I'm interested in this search, save it, and whenever there is a new item, a new document that matches this search, send me an email and I'll come back. Normally that's a fairly hard problem. With the percolator, it comes out of the box. You index the query they were running, including all the bells and whistles that Elasticsearch allows, and when a new document comes in, you just ask for it to be percolated, and you get back all the queries that people have registered to be alerted on.

Some people even use it to power something like live search. If you've ever been on a website where you're searching and suddenly a pop-up appears saying, hey, in the time you've been looking at these results there are ten new ones: that, again, can be powered by the percolator. When you do a search, you register the percolation at the same time, every new document that comes in gets percolated as well, and you know which browser you need to push the document into, who is actually looking at those results right now.

Those are the fairly obvious use cases. Now I'll talk about my favorite one for the percolator, and that is classification, because there is a bunch of stuff that's super easy to search for but not that easy to do the other way around. For example, geolocation. It is fairly easy to construct a query that will look for events in Austin. You just have to have the shape of Austin somewhere. You could pass it into the query, but that's not optimal, so typically you have it indexed somewhere in Elasticsearch. In my case I have an index called shapes where I have cities, Austin among them. So I say: hey, I am interested in anything that falls within the city of Austin, and that is a very simple query to run. But what if I want to do the opposite? I have a geo point, a set of coordinates, and I want to know where it is. What city? What's the address? That's not trivial at all, unless you have something like this: a bunch of queries indexed in your Elasticsearch, so you just show it a document with a geo point and it tells you, oh yeah, these queries matched. The one representing North America, the one representing the United States, the one representing Texas, the one representing Austin, maybe even one representing a city block, so you can really pinpoint the exact address. At that point it's just a matter of how good your data is and how much CPU you want to burn on it. But it gives you this non-obvious reconstruction of the data.

Another interesting one is language classification. This operates on the assumption that every language, and I've chosen Polish because English would be way too weird, every language has a few words that don't exist in any other language. So the assumption is that I can run a query that looks for those words. In this case I give it a list of words and say that if at least four of them are in the document, I consider it a match. So here I'm looking for documents written in Polish, because no other language in the world would ever produce words like these. It's a fairly good indicator.
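In the 1.x-era API that was current at the time of this talk, registering such classifier queries and then percolating a document might look roughly like this; the index, ids, word list, and coordinates are all illustrative, and the location field is assumed to be mapped as geo_shape.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Queries are registered as documents in the special .percolator type.
# A language classifier: match if at least 4 distinctly Polish words occur.
es.index(
    index="events", doc_type=".percolator", id="lang-pl",
    body={"query": {"match": {"description": {
        "query": "zoladz jeszcze wszystko gdzie wlasnie dziekuje",
        "minimum_should_match": 4,
    }}}},
)

# A geo classifier: match anything inside the pre-indexed shape of Austin.
es.index(
    index="events", doc_type=".percolator", id="city-austin",
    body={"query": {"geo_shape": {"location": {"indexed_shape": {
        "index": "shapes", "type": "city", "id": "austin", "path": "shape",
    }}}}},
)

# Reverse search: show Elasticsearch a document, get matching query ids back.
result = es.percolate(
    index="events", doc_type="event",
    body={"doc": {
        "description": "jakis polski opis wydarzenia",
        "location": {"type": "point", "coordinates": [-97.74, 30.27]},
    }},
)
print([m["_id"] for m in result["matches"]])  # e.g. ["lang-pl", "city-austin"]
```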
So again, the query is very easy to write, and once we have the query, we can reverse the process and ask which queries matched. So if I then have a document for an event, a DjangoCon US with a Polish description, otherwise my demo would fall apart, this is what I get... Sorry about that. This is what I get back: the identifiers of all the queries that matched. So I know that this is an event in the city of Austin, it is in Polish, and it deals with the topic of Python. Please don't try to look up where the coordinates are; they are nowhere near Austin.

So this is how you can use the percolator for classification. You typically do this before you index your document. You have your document, you're about to index it, so you run the percolation, you get all the dynamic classifiers, the topics and the language and the geolocation, you add them to the document, and then you index it. Then when you're looking for something in the city of Austin, it's a simple, exact lookup, which will obviously be much faster than running the geo lookup every single time.

And you can take the percolator a little further. You can attach metadata to the percolation queries: for example, who requested this percolation? Is it a user who paid me, or a user who can wait? And other criteria like that. You also don't have to run all the percolations every single time; you can use this metadata to filter them, et cetera. You can also use it for highlighting. If you want to highlight some passages, that can be a fairly hard problem, but it's a problem search engines are really good at. So you can ask Elasticsearch to highlight the passages for you and then index the already-highlighted text. So this has been our journey into the depths of the percolator.

And then there is one last big thing I want to talk about, and that's aggregations. Now, many different databases have aggregations; how come a search engine has them too? Well, it all started with something like this. This is an interface you might be familiar with, called faceted search, or faceted navigation. You type something in, you get the results, the ten blue links, but on the left side you also get an overview of what actually matched your query. In this case I'm looking for Django, so I can immediately see that Django is closely connected with Python, and I can see how many repositories and how many users matched my query. And that is the huge difference between facets and search. Search is great when you know what you're looking for, when you know how to spell Django, or how it sounds, or something like that. Facets are great for exploration, because you don't need to know; you look and you see. It is one thing to do this with code; it's the same with hotels or books, or if you've ever shopped on a website like Amazon: you can see the categories, the brands, the price distribution. You don't have to read all the results to get that information.
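Under the hood, a facet is just an aggregation computed alongside the query, something like this sketch, with invented index and field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

response = es.search(
    index="repos",
    body={
        "query": {"match": {"description": "django"}},
        "aggs": {
            # one bucket per language, with the count of matching repos
            "languages": {"terms": {"field": "language"}},
        },
    },
)

for bucket in response["aggregations"]["languages"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])  # e.g. python 6420, ruby 310
```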
So we have taken this one step further with Elasticsearch, and we power some analytics on top of it. And we visualize it, because humans are essentially pattern-recognizing machines; you're very good at recognizing patterns. You can probably spot several weird things about this picture, like the gap in the timeline, or the fact that two of the charts are completely different, and you can see that immediately. If you wanted a computer to see that, you would have to tell it what to look for, or have something very, very sophisticated. But any human can spot it immediately. That is why facets and aggregations have become so important, and why we continue to develop them.

But this is the boring stuff. I don't want to talk about this; this is just counting things, and any database can do that. We can do better than that. We are a search engine: we understand your data, and we can use it. So let's see how we would actually use Elasticsearch and aggregations to do something like recommendations. Let's say I have a music website with different users, and they like different artists. So I have a document per user, and in that JSON document there is a list of all the artists the user likes. One naive way to do recommendations is to just ask for an aggregation: look for people who like the same things I do, then give me the top ten popular artists in that group that I have not been exposed to yet. It's easy to write and easy to run, but not as useful, because popular doesn't mean relevant. Just because everybody listens to One Direction doesn't mean it's relevant to my group.

So what can we do instead? We need to identify the artists that are more relevant to my group, the group of people who like the same things I do, compared to the background. And the code barely changes: we replace the word terms with significant_terms, and that's all there is to it. Because now we are essentially telling Elasticsearch: hey, we have this group, we have defined it based on the results of the search, now give me the stuff that is more relevant for this group compared to all the others. And we can do this. We are the search guys. We understand the data; we have all the statistics, all the numbers.

You can even see a graphical representation of what's happening on the right side. Normally, when you take a random selection of users, you would expect the same distribution of people who like something in that group as in the general populace, so you would expect all the data to be laid out on the diagonal. What using significant terms did is select the values that are pretty much on the vertical, which means they are much more liked in my group than in the general populace, which is exactly what I was asking for. I was asking for a recommendation: based on the people who are similar to me, give me what I'm more likely to like. Using the data, still using just the dumb statistics about the distribution of the individual values throughout the dataset; no learning involved. And this is actually a fairly simple aggregation to run; you can see the code is not that expensive to write, it's not that involved.
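Side by side, the naive version and the one-word change, again with invented index and field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

my_likes = ["radiohead", "portishead", "massive attack"]

# Naive version: the most popular artists among people similar to me.
naive = {
    "query": {"terms": {"likes": my_likes}},          # people who like what I like
    "aggs": {"popular": {"terms": {"field": "likes"}}},
}

# Better version: what is significantly *more* liked in this group
# than in the background of all users.
better = {
    "query": {"terms": {"likes": my_likes}},
    "aggs": {"relevant": {"significant_terms": {"field": "likes"}}},
}

recommendations = es.search(index="users", body=better)
```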
But this still has some problems. This is an aggregation we've had for a while, and we've noticed there are cases where it doesn't work as well as we would like. The one problem is the terms that everybody likes, the term that every document has. One Direction is my go-to example of this, because everybody likes them, especially around DjangoCons. So what do I do there? How do I make sure I don't suffer from the bias that every single document I have likes this? Well, a lot of it is already filtered out by significant terms, but I can actually do much better: I can also ask for a sample of the documents.

So I will not run this analysis on all the users who have something in common with me, only on those that are the most relevant. Notice that I've now included relevancy at least twice in the query I'm running. First, I'm looking for the users most similar to me, and most similar means they have the highest relevancy for my query, because, let's repeat, relevancy is based on TF, IDF, norms, et cetera. So what does that translate to? TF, term frequency, immediately translates to people who have more things in common with me. IDF, inverse document frequency, means I prefer people who share my rare choices: I will ignore the One Directions, because that doesn't bring me anything, but I will hang on to the weird groups that nobody else in the world knows about. And then the norms, the stuff Lucene adds on top when it takes the length of the field into account: I'll prefer people with shorter lists, which means people who like pretty much exactly what I like, not people who like everything in the world, because that would not be as relevant. So I can directly reuse the tools we built at the beginning of the talk for text matching, the same numbers, the same formulas, to find the people who are most similar to me.

And then I say: take the top 500 of those, on every shard, because Elasticsearch is distributed, so everything happens at the shard level, and then run the significant terms. Tell me what is specific for that group. So we have refined the selection: not anybody who has anything to do with me, but the most relevant, most similar people, and then what is specific for that group. This is the sampler aggregation; it is just arriving in the newest releases of Elasticsearch. These two together allow me to do recommendations that are more relevant, and also faster, because we are only looking at a subset of the data. We have used all the information we have about the data to identify the most relevant part of it, so we don't lose any precision, quite the contrary.
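As a sketch, the refined query is just one more wrapper around the previous aggregation, assuming the same hypothetical index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

my_likes = ["radiohead", "portishead", "massive attack"]

sampled = {
    "query": {"terms": {"likes": my_likes}},
    "aggs": {
        "sample": {
            # keep only the ~500 best-scoring (most similar) users per shard
            "sampler": {"shard_size": 500},
            # and compute significance inside that sample only
            "aggs": {"relevant": {"significant_terms": {"field": "likes"}}},
        }
    },
}

recommendations = es.search(index="users", body=sampled)
```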
To generalize what we have done here: in a connected graph, where we have people and artists who like each other, we have identified the connections that are meaningful, not the ones that are most popular or most common. We have, hopefully, managed to circumvent the super nodes, the super-connected nodes that are the hubs of any graph. And we can use this and go further with it; we can actually use it in a graph algorithm. So imagine you have an algorithm that calculates the shortest path. You want to go from point A to point B, or in our case, let's model it on Wikipedia, you want to go from page A to page B, and you want to see the connections. How do you get from one point to another while only taking the relevant connections into account? Because if you just use a naive graph algorithm, you will fall victim to the super nodes. You will see that, hey, the Beatles have performed in the USA, and so has pretty much every other band, so there's an immediate connection there, and that is not relevant at all. And I don't mean to insult your country, but that's just not a relevant connection. So when you use this approach to identify only those connections that are relevant, that are not just an accident of statistics where everybody has that connection, you can get much more interesting information out of it. You can actually use this to find meaningful connections.

After all, aggregations and relevancy, that is how we look at the world. I said earlier that we are pattern-recognizing machines. We look at things and immediately make assumptions, like: hey, this room is not as full as it was for the last talk. Yeah, that makes sense; Elasticsearch is not as relevant as Postgres. I get it, I do. And I can see that immediately, because I have the context, because I can see; I don't have to count all the chairs to know that. So that is the aggregation part. And the relevancy part: if I ask you what is the most popular website, the website you visit most often, many people, when I actually ask this at conferences, will tell me something like GitHub or Stack Overflow. And that is actually not true; they probably spend more time on Facebook or Google or something like that. And it's not that they are ashamed, even though they might be that too. They immediately recognize that it is not a relevant answer. It's not interesting. Everybody goes to Google. I'm not interested in that information whatsoever; I know it, you know it, let's skip to the interesting part: GitHub. That is more relevant to our group than to the general public. If I ask on the street what GitHub is, hopefully I won't get punched, I don't know, but I will definitely not get the correct answer. So we do the exact same thing. That is what is special about humans compared to computers.

So I have a question for you. If I filter on a time period, and then ask for significant terms on the tags or anything, what do I get back? What do we call that? Anybody willing to guess? We get the trending information: what's trending. Not what is most popular for that given time period, but what is more specific to that time period compared to any other. So if we only filter on the last five minutes and ask for significant terms, we get what is currently trending, what people are talking about more often right now than in general, than at any other time. So something that might appear to be a sophisticated algorithm, we can replicate with two lines, with a single query to Elasticsearch. This is a nice shortcut: if you just want to throw something out there and don't want to spend billions trying to come up with your own algorithm, this is exactly it.

There is only one other pitfall I'll warn you about if you try to play with this. The way aggregations work in Elasticsearch is that we take all the possible combinations that might come up and create a bucket, a placeholder, for each of them. And that can blow up very fast. For example, if you have a dataset from IMDB and you're looking for the actors who acted together most often, and you just run this naively, it will work, and the query is super simple. But the query will also blow up your memory like crazy, because it will essentially do a Cartesian product of actors versus actors: a huge table, essentially, a huge data structure that we then need to fill out. What we can do, however, is limit one of the dimensions before we get into it, as in the second sketch below.
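Two hedged sketches for this stretch of the talk, with invented index and field names: the trending query from a moment ago, and the actors aggregation rewritten so it does not blow up, using the collection mode explained next.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# 1) Trending: significant_terms inside a "last five minutes" filter.
#    What are people talking about right now more than usual?
trending = {
    "query": {"range": {"timestamp": {"gte": "now-5m"}}},
    "aggs": {"trending": {"significant_terms": {"field": "tags"}}},
}
es.search(index="posts", body=trending)

# 2) Co-actors without the Cartesian product: collect breadth-first,
#    so only the buckets of the top 10 actors ever get expanded.
co_actors = {
    "aggs": {
        "actors": {
            "terms": {"field": "actors", "size": 10,
                      "collect_mode": "breadth_first"},
            "aggs": {"co_actors": {"terms": {"field": "actors"}}},
        }
    }
}
es.search(index="movies", body=co_actors)
```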
By default, we will try to do everything in one pass over the data, because it's the most effective way, and because we are a distributed database, so we need that to function. But in this case, if you really insist, if you know this will blow up your memory, either because you've tried or because you can count, what you can do is say: do it another way, do a breadth-first collection. So first identify the top ten actors, and then only find the co-actors of those ten. Simply identify the dimension you can limit most effectively, do that, and you'll be fine. You can ask for all the information in the world, and we will only give you the tiny, tiny, tiny little sliver of it that you need.

So remember: information is power. We have the information, we have information about your data, and we can use it. That is the one leg up that we, as a search engine, have over the more traditional data stores, which are limited to the boring kind of stuff: filtering, counting. That can be useful, but it's not exciting. If you want exciting, use Elasticsearch. Thank you very much.

So we have five minutes for questions, if anyone has any.

Yeah, great presentation. A question about language detection: have you tried to use language detection within Elasticsearch in a production environment?

No. What I typically recommend when people deal with multiple languages is to just use everything at the same time. The problem with language detection is that even if you can detect the language of the source document, where you have enough information to go on and you can do it, it's very hard to identify the language of the query, because if somebody just types in two words, it's very hard to say what language they are in. So what I typically recommend is: if you know you're going to be dealing with these five languages, analyze everything five ways. Analyze it as English, as Czech, as German, as Japanese, and then do the same for the query; when a query comes in, query all of these fields at once. And Elasticsearch has tools to allow you to do that. You can specify that I have this one field but I want it analyzed in multiple ways, and then I have this query and I want to run it against all these differently analyzed fields. It's what I call the shotgun approach: just throw everything in there and see what sticks. Because of how relevancy works, and how the relevancies from the different queries are combined, without trying so hard to actually think about the problem, you will actually get the most relevant results. Does that make sense?

I may have missed it, but is the vector space model still a common way of combining information from different query terms, or are there more sophisticated methods now...?

Yes, it's still the same thing. I talked about TF-IDF and everything, but that is only to give weight to the individual parts of the vector. Overall, we're still essentially talking about vector metrics and vector distance. By default, essentially what I showed, the formula, that's actually a cosine metric, pretty much. There are some modifications and things, but yes, it's still based on that.

Thank you very much for having me.
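A closing footnote on the multi-language answer above: the shotgun approach could look roughly like this, with one stored field analyzed several ways and a multi_match query across the variants; this is a sketch in the 1.x-era syntax of the talk, with all names invented.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# One stored field, analyzed several ways at index time.
es.indices.create(index="docs", body={"mappings": {"doc": {"properties": {
    "body": {
        "type": "string",
        "fields": {
            "en": {"type": "string", "analyzer": "english"},
            "de": {"type": "string", "analyzer": "german"},
            "cs": {"type": "string", "analyzer": "czech"},
        },
    },
}}}})

# The query, whatever language it is in, runs against all the variants at
# once, and the per-field relevancy scores are combined.
es.search(index="docs", body={"query": {"multi_match": {
    "query": "some user input",
    "fields": ["body", "body.en", "body.de", "body.cs"],
}}})
```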