So hello everyone, and good morning. I'm here to talk about what's beyond the basics with Elasticsearch. I work for Elastic, the company behind it, so we've seen a lot of use cases, and some of them actually surprised us, and definitely surprised many people who know Elasticsearch as the full-text search solution. But before we get beyond the basics, we first need to know what the basics are. So, super quickly: this is where we come from. We are a search product, an open source search product, and search is not a new thing. It's been around for a long while, and the really down-to-earth basic theory hasn't changed that much since those times. We still use the same data structure that you find at the end of any book: the index, specifically the inverted index, which looks something like this. It looks the same in a book as it does in a computer. It is a sorted list of words that actually exist somewhere in our data set, and for each of those words we have, again sorted, the list of documents (or files, or pages, in the case of a book) where the word actually occurs. We also store some additional information there: for example, how many files contain the word "python", how many times it is present in file one, at what positions, and so on. Those statistics will be very important for us as we go on through the talk. So this is the data structure that we use. How does search work, then?
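The inverted index just described can be sketched in a few lines of Python. The toy documents are my own; the point is the shape of the structure: each term maps to a sorted set of document ids, with positions kept alongside.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the documents containing it, plus term positions.

    Structure: term -> {doc_id: [positions]}. Document frequency is the
    number of keys, term frequency the length of a position list.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Illustrative mini-corpus:
docs = {
    1: "python django tutorial",
    2: "python snakes in the wild",
    3: "django reinhardt biography",
}
index = build_inverted_index(docs)
print(sorted(index["python"]))  # documents containing "python": [1, 2]
```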
Well, it's super simple. If we're looking for "python" and "django", it's the same search you would do if you were looking for those words in a book. You locate the line mentioning "django" and the line mentioning "python"; you can do that efficiently, both as a computer and as a person, because, again, the words are sorted. Then you just walk both lists, and if you find a document that is present in both lists, that's your result. Naturally, if you want to do an OR search instead of AND, you just take everything from both lists. But that's not enough, because this gives you the information about what matches without giving you the most important thing for us: how well does it match? What is the difference between the Django book that talks specifically about Python and Django, and the biography of Django Reinhardt that mentions in one passage that he had an encounter with a python, the snake? Obviously there is a big difference between those two books, and the difference is in relevancy. It is a numerical value, a score, essentially saying how well a given document matches a given query. A lot of research has gone into how best to calculate that score, and again, it hasn't changed that much since the beginning. At the core of it there is still the TF-IDF formula. Those are fancy shortcuts for term frequency and inverse document frequency; together they represent how rare the word we are looking for is, and how many times we found it in the document.
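The walk over two sorted postings lists can be sketched directly; since both lists are sorted, one linear pass finds the AND result, and the OR result is just the merged union.

```python
def intersect(postings_a, postings_b):
    """AND search: walk two sorted postings lists in step,
    keeping only the document ids present in both."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

a = [1, 2, 5]  # documents containing "python"
b = [1, 3, 5]  # documents containing "django"
and_result = intersect(a, b)
or_result = sorted(set(a) | set(b))  # OR search: everything from both lists
print(and_result)  # [1, 5]
```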
This essentially captures the fact that finding the word "the" in a document doesn't really mean much; every English document in the world will contain "the", so that's not good information. The IDF, the inverse document frequency, is the part that tells you this is not a specific word, it appears in almost every document. If, however, you find a word like "framework", that is fairly specific. So that's the IDF part, and the TF part is just how many times you found the word there. If it's mentioned only once in a book, that doesn't mean much, but if it's there a hundred times, that probably means more. And we can keep building on top of that. Lucene, for example, adds another factor: a normalization for the length of the field. That's essentially the equivalent of saying that yes, there is a fish somewhere in the ocean. Probably true, but not really relevant or surprising. If you have a bucket of water and you say there is a fish in it, that is much more actionable information. So that's the second part of it, the normalization for field length: finding something in a very large field means less than finding it in a much shorter field, for example the title compared to the body. So we already have a formula baked into Lucene, and into Elasticsearch, that does very well for text and for search. But sometimes even that is not enough. For example, you're not dealing with text but with numerical information, or you have some additional information that Elasticsearch is not aware of: the quality of a document, some user-contributed value, or somebody even paid you to promote this piece of content. Or you want to penalize or favor things based on a distance, say from a geolocation, or from some numerical range. So how do you do that?
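The three factors just described can be put into a toy scoring function. This is loosely in the spirit of Lucene's classic TF-IDF similarity, not its exact formula; it is just here to show how the parts pull in different directions.

```python
import math

def score(term_freq, doc_freq, num_docs, field_length):
    """Toy relevancy score combining the three factors from the talk.

    - tf: repeated occurrences raise the score (dampened by sqrt)
    - idf: rare terms (low doc_freq) raise the score
    - norm: matches in long fields count for less
    Illustration only; not Lucene's exact formula.
    """
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))
    norm = 1.0 / math.sqrt(field_length)
    return tf * idf * norm

# "the" appears in nearly every document; "framework" is rare:
common = score(term_freq=3, doc_freq=999, num_docs=1000, field_length=100)
rare = score(term_freq=3, doc_freq=10, num_docs=1000, field_length=100)
# the same match in a short title vs. a long body:
title = score(term_freq=1, doc_freq=10, num_docs=1000, field_length=5)
body = score(term_freq=1, doc_freq=10, num_docs=1000, field_length=500)
```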
We have a few ways of expressing that, and the best way to show it is with an example. This is a standard query for Elasticsearch, using the function_score query type. The function_score query takes a regular query: normally, we are looking for a hotel, let's call it the Grand Hotel. So far so good. Then we want that hotel to have a balcony; we want our balcony in our room. But we don't want to filter down to just the hotels that have balconies, because then we would be robbing ourselves of the opportunity to discover something else. Instead, if a hotel has a balcony, we favor it: we just add two to the score, so all the hotels with balconies will be towards the top. Then we want the hotel to be in central London, within one kilometer of the center. If it's within one kilometer, it's a perfect match; the further away it gets, the more the score decreases. It will still match, but the score will be smaller. Again, that means the hotel that perfectly matches our criteria will be at the top, but a super good match outside the radius will still show up. Then we also take popularity into account: how happy have people been with this hotel? There is a special function called field_value_factor, which essentially tells the search engine: there is a numerical value in this field that represents quality, fold it into the score. And finally, we add some random noise. This is actually taken from a real-life example, because people use this to mix things up a little, to give users the chance to discover something new, something they wouldn't otherwise see. All of these things together will make sure that you find your perfect hotel. We're not limiting your choices.
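The hotel query just described might look roughly like this as a request body, written here as a Python dict. The index layout is my own assumption: the field names (`name`, `amenities`, `rating`, `location`) and the central-London coordinates are illustrative, and the exact DSL details can vary between Elasticsearch versions.

```python
# Sketch of the function_score query from the talk (field names assumed).
hotel_query = {
    "query": {
        "function_score": {
            # the regular full-text part: we are looking for the Grand Hotel
            "query": {"match": {"name": "grand hotel"}},
            "functions": [
                # hotels with a balcony get 2 added to their score
                {"filter": {"term": {"amenities": "balcony"}}, "weight": 2},
                # perfect score within 1 km of central London,
                # decaying smoothly the further away the hotel is
                {"gauss": {"location": {"origin": "51.5074,-0.1278",
                                        "scale": "1km"}}},
                # fold a stored popularity/quality value into the score
                {"field_value_factor": {"field": "rating"}},
                # a pinch of randomness so users discover new hotels
                {"random_score": {"seed": 42}},
            ],
            # combine all the factors by summing them
            "score_mode": "sum",
        }
    }
}
```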
Just because you say you want a balcony, we will still show you the hotel that is almost perfect for you except for the balcony part. We are also not just sorting by popularity, where something that is really not that good a match, but is really popular, would end up at the top. We take all these factors and combine them together. So this is one of the main ways we can use the score in a more advanced way: take all the factors that go into the perfect result and combine them. You're not limited to picking one and sorting by it; you can combine them all. Then it's just a matter of figuring out what the weights should be relative to one another, and what will actually give your application the best results. Some people actually use machine learning techniques to figure out the best ones. They have a training set and everything, and it's not that hard, because you have only a limited number of options, and typically those are flat numerical values. So if you know what a good match looks like, you can actually train the perfect query for your application. This is all for when you're doing search and you already know what you're looking for. But sometimes it's the other way around: you don't have the document, you have the query, and you want to find the documents. Imagine you want to do something like alerting, or classification. For example, you're indexing stock prices and you want to be alerted whenever a stock price rises above a certain value. Sure, you could keep running a query in a continuous loop and see if there is something new. But what we can do instead, with the percolator feature of Elasticsearch, is to actually index that query into Elasticsearch. Then we just show it a document, and it will tell us all the queries
that matched. That is very powerful, especially because it can use all the features of Elasticsearch. So that's the alerting use case, a sort of stored-search functionality. If you supply your users with search, and you want them to be able to store a search and then be alerted whenever a new piece of content matches it, with the percolator you get that essentially for free. You just index their query, and whenever there is a new piece of content, you run it by the percolator, and it will tell you: hey, you should probably send an email to that user who was here the other day, he was really interested in this. That's the stored-search case. You can also use it to do a live search. If you've ever been on a website, done some searching, and while you were looking through the results a pop-up appeared saying there are five new documents matching your query since you started looking: again, easy. Once you execute a query, you also store it as a percolator, and whenever a new piece of content arrives in the meantime, you can push it to the browser and say, hey, there are newer results. So again, something that is otherwise fairly hard to do, or would require some busy loop, you can do this way. But we'll go a little bit further than that.
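The percolator flow sketched above can be written down as two request bodies. A caveat: the percolator API has changed across Elasticsearch versions (recent versions store queries in a field of type `percolator` and match documents with a `percolate` query, while older versions used a dedicated `.percolator` type), and the index and field names here are my own illustrative choices.

```python
# 1) The stored search: indexed into Elasticsearch like any document.
#    Here: alert whenever a stock price rises above 100.
stored_alert = {
    "query": {"range": {"price": {"gt": 100}}}
}

# 2) A new document arrives; instead of querying for documents,
#    we ask which stored queries match this one document.
new_document = {"ticker": "ESTC", "price": 105}
percolate_request = {
    "query": {
        "percolate": {
            "field": "query",          # the percolator-typed field
            "document": new_document,  # matched against all stored queries
        }
    }
}
# The response lists the stored queries (alerts) that fired,
# so the application knows whom to notify.
```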
We'll look at the classification use case. That is essentially using percolation to enrich the data in your documents. Imagine you're trying to index events, and all you have for location is a set of coordinates, and you want to find the address. This is something that's easy to do the other way around: if you have the address and you want to find all the events in that location, you just do a geo_shape filter, looking for anything that falls within a shape, say the shape of the city of Warsaw. That's a super simple search. So with the percolator, we can turn it into a super simple reverse search. Let's say we get our hands on a data set with all the cities in Europe, or in the world; it's not that much data. We index the cities into an index, so we don't have to construct the polygon every single time; we store them in an index called "shapes", under the type "city". Then we create a query for each city and register it under a name. When a document comes along and the coordinates in its "location" field fall within that shape, we will know that it is actually happening in Warsaw, Poland. So something that is super simple to do one way, but difficult to do the other, we can do with percolation, essentially using brute force, but in a smart way: by outsourcing the brute force to Elasticsearch, we can do it very efficiently, and in a distributed fashion. So that's geo classification. Another thing that is easy to search for, but usually not that easy to do the other way around, is language classification. Usually any language has a few words that are super specific to that language.
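One stored query per city, referencing a pre-indexed polygon, might look like this. This is a sketch: the index name `shapes`, document id `warsaw`, and the field names `location` and `boundary` are my own assumptions, and the `geo_shape`/`indexed_shape` syntax varies slightly by Elasticsearch version.

```python
# Stored (percolated) query for one city: matches any document whose
# "location" falls inside the pre-indexed Warsaw polygon.
warsaw_query = {
    "query": {
        "geo_shape": {
            "location": {                  # field on the incoming documents
                "indexed_shape": {
                    "index": "shapes",     # where the city polygons live
                    "id": "warsaw",
                    "path": "boundary",    # field holding the polygon
                }
            }
        }
    }
}

# An incoming event with nothing but coordinates; percolating it against
# the stored per-city queries tells us it happens in Warsaw.
event = {"name": "PyCon PL", "location": {"lat": 52.23, "lon": 21.01}}
```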
They don't exist in any other. These are some examples; this is essentially just a test of how many Polish people there are in the audience. The assumption here is that if we look for these specific words and find at least four, because four is always a good number (42 would be too high), then this is probably a document that contains Polish. Sure, it's a simplification, it's a heuristic, but it actually works fairly well. It just depends on the quality of your words; mine are super good, for Polish that is. So if you have a set of words like this for each language, you can just start a collection of queries, and when a document comes along, with a description of an event and a geolocation, you can immediately get back the classifiers. You get back the location in a human-readable format: it's actually Warsaw. (Note that the coordinates on the slide are, by the way, not Warsaw, but whatever.) You also get the language back: it's in Polish. You can use similar classifiers to determine the topic. If you have keywords like "programming" and "python" and "django", it's a fairly accurate assessment to say the conference is probably something about Python. So this is how we can use percolation to enrich our data, and to determine something that would otherwise be hard to do. Another use case for this: imagine you have a blog, a CMS, and you have a category defined as a search. That's super easy to do one way, but if you have a blog post and you want to see which categories this blog post falls into, that's the harder part. Again, with percolation, something like this is super easy: you can tag the blog post with its categories as it comes in. And you can obviously do a little bit more with the percolator. You can attach metadata to the percolators, and you
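The "at least four language-specific words" heuristic maps naturally onto a `bool` query with `minimum_should_match`. The word list here is my own illustrative pick of common Polish function words, and the field name `description` is an assumption.

```python
# Stored classifier query: a document is tagged as Polish if its
# description contains at least 4 of these (illustrative) Polish words.
POLISH_WORDS = ["gdzie", "jeszcze", "żeby", "właśnie", "również", "według"]

polish_query = {
    "query": {
        "bool": {
            "should": [
                {"term": {"description": word}} for word in POLISH_WORDS
            ],
            # four is always a good number; 42 would be too high
            "minimum_should_match": 4,
        }
    }
}
```

Registered as a percolator alongside the per-city geo queries, a single percolation of an incoming event then returns both its city and its language in one round trip.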
can filter on that metadata. You can aggregate them, so in the response you will not only get the percolators that matched, but also, say, their distribution across categories. You can even use them to highlight: you can search for certain words in your documents, then highlight the fragments that actually contain them and store those separately in the document for easy presentation, and so on. You can get the top 10 hottest categories for a piece of content, for example. So far we've been working with individual documents; but we can also look at many documents at the same time. This is the traditional search interface: you look for something and get back the top 10 links. What we also have here is something called faceted search. The search part is really good when you know what you're looking for; the faceted part shows you what is actually in your data, so you can immediately see the distribution. You can see that if you're looking for something related to Django, most results are in Python and some in JavaScript. So it allows you to discover data. Some people have taken it even further, and we have enabled that with aggregations, multi-dimensional aggregations, so you can aggregate over multiple dimensions at the same time. But that is still boring; that is still just counting things, and that's not really interesting, any database can do that. What we need is to use the data that we have, the statistics. To see how, let's look at how we would do recommendations using Elasticsearch. This is our data set: we have a document per user, and for each user we have a list of artists, of musicians, that they like. And we want to do recommendation: assuming that I like these things, what should I listen to next? In this case, we have two users.
They have artists in common, and there are three other artists. The naive way to do it is to just aggregate, to ask for the most common things: give me all the users that like the same things I do, then give me the most popular artists in that group, minus the ones I already know. That way I will get the most popular artists, but not necessarily the most relevant. It's like asking what is the most common website you go to. Probably Google. Not interesting, because everybody goes to Google. But if I ask the people in this room what is more specific for this group, compared to asking somewhere on the street, it will be something like GitHub. You probably all go to GitHub; almost nobody in the outside world goes there, most don't even know it exists. That is relevant; that would be a good recommendation. And we can do that with Elasticsearch, because we have all the information: we have the statistics about how rare a term is, and what its distribution is across the whole population. So what we ask for is simple: we ask for the significant terms. It will use the scoring, compare the group against the background, and the results will look something like this. This part is important: what I would expect is for all the dots to lie on the diagonal line, because that's what would happen with a random sample. The further a dot moves away from that line, the more specific it is. That is how we can do relevant recommendations: we can see that this dot here is obviously much more common in this group than in the general population, where it would sit here, so it has moved a great deal. And because we have all the information, because we've analyzed the data, because we are the search people, we understand the text, we understand the frequencies, and we can use them.
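The recommendation request described above is a query for similar users plus a `significant_terms` aggregation over what they like. The field name `likes` and the artist names are illustrative.

```python
# Find users who like the same artists I do, then ask which artists are
# significantly MORE common in that group than in the background
# population (the GitHub answer, not the Google answer).
recommendation = {
    "query": {"terms": {"likes": ["Radiohead", "Portishead"]}},
    "aggs": {
        "suggested_artists": {
            "significant_terms": {"field": "likes"}
        }
    },
}
```

A plain `terms` aggregation in the same place would return the most *popular* artists in the group; `significant_terms` is what compares the group's frequencies against the whole index.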
We can actually produce something like that. There are obviously some caveats. For example, if I like a very popular band, like One Direction, it will skew my results, because everybody likes One Direction, right? So I need a way to combat this, because otherwise I would get completely irrelevant recommendations. And again, we are the search people; we understand data, we understand documents, so we can find and sample just the users that are most similar to me. We have all the tools at our disposal already. Remember TF-IDF and normalization? TF: the more things someone likes that I also like, the better they match me. IDF: the people who share the rarer things I like, put them towards the top. Then just take the 500 best results and drive the recommendations based only on that group. It will make the query both faster and more relevant. It allows you to discard all the irrelevant connections you might find, and to focus only on the meaningful connections, on the things that are relevant for your group, in this case the group of people who like the same things you like. It will provide you with a recommendation. So just by applying the concepts we have learned from search to other things, like aggregations, we can get much more out of it. Another example: take Wikipedia articles, where the labels and links are the "words", and apply the same concept; you get meaningful connections between different concepts. If you tried to do it based on popularity, things would always be linked through something like "yes, that person and that person, they're both people". Okay. Not exciting. But if you apply this principle, you get something more out of it. So if you combine aggregation and relevancy, all the statistics that we can compute, that is actually how we as humans look at the world. If I ask you what is the most common website you go to, you'll
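Sampling only the best-matching users before computing significance maps onto a `sampler` aggregation wrapped around the `significant_terms`. In this sketch, a scored full-text match on the `likes` field stands in for the TF-IDF-style ranking of similar users described above; field and artist names are illustrative.

```python
# Rank users by how well their tastes match mine (TF-IDF scoring does the
# work: shared rare artists count for more), keep only the ~500 best
# matches per shard, and compute significant terms inside that sample.
sampled_recommendation = {
    "query": {"match": {"likes": "Radiohead Portishead"}},
    "aggs": {
        "similar_users": {
            "sampler": {"shard_size": 500},   # only the top-scoring users
            "aggs": {
                "suggested_artists": {
                    "significant_terms": {"field": "likes"}
                }
            },
        }
    },
}
```

This keeps one hugely popular band from dragging in dissimilar users and drowning out the meaningful signal.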
probably not say Google, because you know that's not interesting. We as humans have been trained from the very beginning to recognize patterns and to spot anomalies at the very same time. And this concept can be used for other things as well. For example, if you use the same principle, the significant_terms aggregation, per time period, so you split your data into time periods and ask what is significant for each period, what do you call that feature? Well, it's a very common feature that we see everywhere now: it's "what's trending". That's exactly it, because it's more specific. It's not necessarily more popular than in any other period, but it is more specific for this one time period, for the current period, let's say, compared to yesterday, compared to the general background. Once you're doing these aggregations, there is one common caveat you can run into: you can have too many options, too many buckets, too many things to calculate. So imagine you're looking for combinations of actors that star together very often. I'm looking for the top 10 actors, and then, for each of those, a set of the top 10 actors they appear together with. If I just ran this, what would happen in the background is that I would essentially compute a matrix of all actors by all actors, and it would be huge. It wouldn't fit into memory; it would probably blow up my cluster. Actually, Elasticsearch would probably refuse to run the query, saying: hey, I would need too much memory, this is just not going to fly. So what you can do is say: do it breadth-first. First get the list of the top 10 actors, which greatly limits the matrix you will need to calculate, and only then go deeper.
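The breadth-first fix is a `collect_mode` setting on the outer `terms` aggregation. The field name `actors` is illustrative.

```python
# Co-starring actors: top 10 actors, and for each of them the top 10
# actors they appear with. "breadth_first" resolves the outer top 10
# before collecting sub-buckets, instead of materialising the full
# actors-by-actors matrix in memory.
co_stars = {
    "aggs": {
        "top_actors": {
            "terms": {
                "field": "actors",
                "size": 10,
                "collect_mode": "breadth_first",
            },
            "aggs": {
                "co_stars": {"terms": {"field": "actors", "size": 10}}
            },
        }
    }
}
```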
So it will be a little slower, because it has to run through the data essentially twice, but it will actually finish, and it will still finish in quite a reasonable time. That's a common caveat people run into when they start exploring aggregations, especially multi-dimensional ones. So, to wrap things up, because we are approaching the end and questions: the lesson here is that information is power. We have a lot of information about your data; we have all the statistics, all the distributions of the individual words. If you understand this, and if you can map your data onto this problem, you can get a lot more out of Elasticsearch than just finding a good hotel in London or the conference events in Warsaw. That's it for me, and if you have any questions, I'm here to answer them.

Question from the audience: You showed an example of how to search for people, like the 500 most like you. Can you do "people that are at least 90 percent like me" instead of a fixed number? A fixed number you have to find and tune.

Of course. You can do that with a simple query, because aggregations are always run on the results of a query. Remember the language-classification example, where I was looking for at least four words? I could do the same here: give me only the users that share at least 70 percent, or 90 percent, or at least nine of the artists that I like; I can use both relative and absolute numbers, and use those as the basis for the aggregation. So yes, absolutely, and it would actually be much simpler: you wouldn't even need the sampler aggregation.

Any other questions? Is anyone still awake?
Okay, I'll take that as a yes. A question, going once, going twice, sold.

Question from the audience: Are there any performance implications of running, say, hundreds of percolators?

Of course, but you can scale way beyond hundreds. I've seen people doing millions of percolations, and it still works; it scales very well with the distributed nature of Elasticsearch. Essentially the only resource that percolation consumes is CPU, so add more CPU, either to a box or by adding more boxes, and it will scale fairly linearly. The more boxes and more CPU you have, the faster it will get. You don't need much memory, you don't need faster disks, you only need CPU, so it's very easy and fairly cheap to scale. To give you an idea: if you want to run hundreds of thousands or millions of percolations, you will need something like five reasonable boxes, and you will get responses within milliseconds. So it actually does scale very well. Another question?

Question from the audience: Could you give us some examples of the use cases from the customers that you mentioned, the ones you didn't expect?

So, some of what we didn't expect was the percolator example. There are people running big clusters of Elasticsearch that don't store any data in them; they have a cluster of 15 or 20 machines without storing any data.
That is a weird experience for what is essentially a data store, so that's definitely one of them. We also always run into these situations where we have a feature, we recommend people use it, people listen to our advice, and we find out that we might have underestimated the people in the wild. For example, we introduced the idea of index aliases: you can have an alias for an index, essentially like a symlink, so you can decouple the design of your indices from what the application sees. You could have an alias per user, with all the users living together in one big index, and the alias just points to that index plus a filter. And that works very well, until we encountered a user that had millions of users, and suddenly we had millions of aliases, and we never thought that would happen. As with anything else in computer engineering: assumptions, assumptions, assumptions. So we encountered something like that, we had to go back and fix it, and we reworked the aliases. These are the two most notable examples where we were really surprised by how our users used our product in ways we didn't foresee. And it's good, because we always learn something new, and it allows us to orient ourselves better to what users actually need. Okay, any last questions?

Question from the audience: Hello, I have a question regarding reverse queries for language classification. Elasticsearch supports n-gram indices; could you actually use those for classification of languages?
So, n-grams have the problem that they have a very wide spread. They might give you some correlation with the language, but they will definitely not be precise. Just to explain: an n-gram essentially splits a word into all the tuples of letters; for example, with "thanks" I would have "tha", "han", "ank", and so on, and then I would query for these triplets. It will obviously have a correlation, but it will by no means be decisive enough, especially for something like language classification, where you're really interested in the probability. N-grams are very good as an addition to something else. Because of their nature, they always match something; that's why you typically don't want to use them alone. They're fine if you have some more optimistic methods, like exact matching, then the regular fuzzy matching and so on, and then you throw n-grams into the mix to boost the signal if they match, and to catch some things when nothing else matches. So I definitely wouldn't use n-grams for language classification, and I typically only use them in combination with other query types and other analysis steps. Does that make sense? Okay, I think we're running out of time, so thank you very much, and if you have more questions, I'll be outside.