So, what does it actually mean to go from icontains to search? In this talk I'd like to tackle the distinction between filtering and searching, and what search actually is. It's going to be a more theoretical talk: what's behind it and how it works, and hopefully you'll be able to draw some implications for actual performance and use cases. The more concrete stuff using Elasticsearch will come tomorrow, when I have a talk on how to integrate Python and Django with Elasticsearch. This is mostly theory that can be applied to any and all search engines.
I said theory, and any good theory, and sometimes even a bad one, starts with some definitions. So let's define what search is for us. For me, and for the purposes of this talk, search is any interface to data: anything where you have some data set and you need to explore and find what's in there. There is a difference between exploring and finding, and we'll deal with that a little bit later. But briefly: when you know what you're looking for, that's search. When you don't know what you're looking for and you want to know what's there, that's the exploration part. And that part is often more interesting.
So if I asked you to implement a search, how would you do it? How do people do it? Typically the first response is: I'll just go through it. If I'm looking for a book in my library that mentions Django, I'll pull the books out one after another, read them, and set aside all those that actually mention Django. If I have a file I'll do the same, just using grep, so it'll be much faster but still not ideal. And many people say: if I have a database, I'll just run an icontains query, and that's essentially the same thing. What this means is I'm going through all the books in my database, going through all the text, and looking for Django in there.
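As a rough illustration of the "just go through it" approach, here is a minimal Python sketch. The `books` dictionary and the `naive_search` helper are made up for this example. An icontains lookup is conceptually doing the same kind of work: a case-insensitive substring check against every row.

```python
def naive_search(documents, needle):
    """Return the ids of all documents whose text contains `needle`.

    Every document is read in full on every query, so the cost grows
    with the total amount of text, not with the number of matches.
    """
    needle = needle.lower()
    return [doc_id for doc_id, text in documents.items() if needle in text.lower()]


# Hypothetical sample data, invented for this sketch.
books = {
    1: "Two Scoops of Django covers best practices.",
    2: "A history of jazz guitar.",
    3: "Building web apps with Flask and Python.",
    4: "Python web development with django and friends.",
}

print(naive_search(books, "Django"))  # [1, 4]
```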
You can see that this doesn't really scale, because the indices in the database don't help us. Grep will still have to read almost every single byte in the file. That's not strictly true, because grep is amazing, but it will still have to go through the files sequentially. And I don't even have to tell you how much work it would be to read all the books in your library, at least I hope. I do hope that you people read; you should, it's awesome. So clearly this is not the way to go.
But someone must have had this problem before. Clearly we're not the first ones, not even the first generation, to think about search. If you think that way, you're correct: this has already been solved. When you're reading a book you can sometimes flip to the end and you will see an index, which points you to where the relevant terms occur in the book. This data structure first occurs in the 1200s in a small monastery in Europe. Concretely, in 1230 a group of monks completed a concordance of the Bible: essentially they took all the references from the Bible, put them in sorted order, and for each of them recorded the chapters and passages in the Bible where it actually occurs. This enabled them to have much better discussions and to find things much more quickly.
And this very same data structure powers 90% of all the searches out there. It's called an inverted index. It is a list of terms, a list of words that actually occur in the data set. In this case we have words like Django, Flask, Python, Jazz, and note that in the index they're kept in sorted order. That's important. Then for each of those, in the rows, we have the list of, in our case, files where those terms occur. And again you can see that those are in sorted order, and that will be important. So how do we use this data structure for search?
If we want to search for Python and Django, we just find the two relevant lines. Finding those two lines is simple because the lines are sorted, so we can go immediately to the Django line and to the Python line. What we have now is two lists of documents, or two posting lists, which is the proper term, that we just need to merge. So we are merging two sorted lists and outputting the documents that are present in both. That's a fairly simple operation: iterate over multiple sorted lists at the same time and output anything that's in both. It's the sort of task you might get during an interview. It can get a little tricky when you have more than two lists, but it's really not that hard.
This data structure also lends itself very nicely to further enhancement. As we have it now, it's safe to say that we can do AND and OR queries and anything like that. But if we add some more information into the data structure, we can do a lot more. For example, we can add positions: for each document we record the position of the word. So we can see that in file one Python was the fourth word, and in file two web was the second word. This is useful because it lets us make the condition in the merging phase a lot more sophisticated. We can do phrase searches. If we look for the phrase "Python web", which means the word Python followed by the word web, it's easy. We do the same merge process, except this time we take the position, the offset of the word, into consideration. We only consider it a match if the difference between the positions of the two words is exactly one, plus one in this case. So that means I can do phrase searches.
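The merge and the positional check can be sketched in a few lines of Python. This is only an illustration under assumptions: the layout of `index` (term to document id to positions), the sample postings, and the `max_gap` parameter are invented here, and a real engine would store and walk these lists far more compactly.

```python
def intersect(postings_a, postings_b):
    """Merge two sorted lists of document ids, keeping ids present in both."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


def phrase_search(index, first, second, max_gap=1):
    """Find documents where `second` occurs 1..max_gap positions after `first`.

    `index` maps a term to {doc_id: sorted list of word positions}.
    """
    docs_a = index.get(first, {})
    docs_b = index.get(second, {})
    # Same merge as above, run over the sorted document ids of both terms.
    candidates = intersect(sorted(docs_a), sorted(docs_b))
    hits = []
    for doc_id in candidates:
        # Brute-force pairwise check, kept simple for readability.
        if any(0 < pb - pa <= max_gap
               for pa in docs_a[doc_id] for pb in docs_b[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical positional index for three files.
index = {
    "django": {1: [1], 3: [2]},
    "python": {1: [4], 2: [1], 3: [5]},
    "web": {1: [5], 2: [2], 3: [7]},
}

print(intersect(sorted(index["django"]), sorted(index["python"])))  # [1, 3]
print(phrase_search(index, "python", "web"))                        # [1, 2]
```

With `max_gap=1` only adjacent words count as a phrase. A larger value loosens the same check, so file 3, where "web" comes two positions after "python", would also match.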
So I can do phrase searches even though I'm only indexing individual words. And when doing searches I don't care how big the documents are; I go for the words directly. So it doesn't scale linearly, it scales much better. Just by adding more information you can even do searches across words and even search across relations. You can immediately see that the offset doesn't have to be one. I can, for example, say Python followed by web with at most 10 words in between, so I can do proximity searches. I can do a lot more with that, and we'll discover a little bit today of everything that I can do. But in the end it's just the inverted index, which is a very nice data structure, and the question is how I use it and what I have to put in to be able to get it out.
So this was a brief introduction into how an inverted index works, and the important thing to remember is that you're always just merging sorted lists. That is a very, very efficient operation, once you have the inverted index on disk or, even better, in memory.
So how do you get there? How do you build this data structure so that you can then search very effectively? Hopefully I've already convinced you that you can search very effectively with an inverted index, so now let's see how you actually build it. Let's assume that we have a phrase, a sentence, or let's call it a document for our use case. And this is surely a sentence that you've never heard before: "Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design." This is the document, the raw text that we have on the input. Now we need to split it into words. How do we do this? I've already provided you with the result, and you can see that there are several surprising things. For example, I've omitted some words.
I don't care about words like "is" or "a" or "and" or "that"; they're way too common. Chances are that every single document will have those words, so there is no point in indexing them because they bring nothing to the table. So as a form of optimization, I'll just leave them out. No one cares. That's one thing that I've done.
The other thing I've done here is that everything is lowercase, because the inverted index does exact matches. If somebody searches for lowercase django, they should still be able to find this document, right?
I've also normalized the words. You can see that in the text I have "encourages", but what I'm actually going to index is just "encourage". This is to allow for the morphology of the language, for the different shapes of the words. Because "jumping" and "jump" are the same word: when I search for "jumping", I want to find a document that just mentions "jumps" or "jumped". This process is called stemming: you take the word and find just the stem, the core of the word that defines the meaning, and omit all the grammatical constructs around it.
One last thing I'll mention: you see that for "rapid" I also index "fast". So the inverted index doesn't necessarily have to contain the same words in the same shape, or even the same number of words, as the original document. I can index additional words, for example "fast", so that when somebody searches for "fast development" as a phrase, they find this document. Even though the creators of Django have a very, very large vocabulary and are not limited to simple words like "fast", most of the people who search are. So we need to take that into account. This whole process is called analysis.
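To make the steps concrete, here is a toy analyzer in Python. Everything in it is an assumption made for illustration: the stop word list and the synonym table are hand-picked, and `stem` is a crude suffix stripper standing in for a real stemming algorithm. It is not what any particular search engine ships.

```python
import re

# Tiny, hand-picked tables; assumptions for this example only.
STOP_WORDS = {"a", "and", "is", "or", "that", "the"}
SYNONYMS = {"rapid": ["fast"]}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Tokenize, lowercase, drop stop words, stem, then add synonyms."""
    tokens = []
    for raw in re.findall(r"[a-z0-9]+", text.lower()):
        if raw in STOP_WORDS:
            continue
        token = stem(raw)
        tokens.append(token)
        # Index extra tokens next to the original one.
        tokens.extend(SYNONYMS.get(token, []))
    return tokens


document = (
    "Django is a high-level Python web framework that encourages "
    "rapid development and clean, pragmatic design."
)
print(analyze(document))
```

On the example sentence this drops "is", "a", "that" and "and", turns "encourages" into "encourage", and emits "fast" right after "rapid", so the token list is no longer a one-to-one copy of the original words.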
The example that I gave here was composed of these steps: split the text into tokens, or words; remove the ones that I don't like; lowercase everything; do the stemming; apply the synonyms. This is a very interesting and important process for doing searches. It varies between languages. It also varies by use case, because in some use cases you want to do the analysis, and in some cases you just want to index the raw value as it is. In some cases you might want to do completely different types of analysis. For example, you may want to index every three letters as they follow each other, so-called n-grams, so you can do partial matches. It's all defined as part of the analysis.
And it happens at index time. When a document comes in, we apply this analysis process to get the tokens that we will then put into the inverted index. So once you change it, you need to reindex your data. This is the one drawback of search engines: if you change any of this high-level configuration, you need to reindex your data.
The other important thing to keep in mind is that this needs to be applied to the queries as well. It's all nice and well if I lowercase everything in my document, but if I then don't lowercase the query that comes in, it will never match. If I don't remove the stop words from the search and I try to look for a word which I haven't indexed, it will fail. So I need to apply the same process to queries as well.
So this was analysis: the process of taking the text, splitting it up, and populating the inverted index. During this process we got all the tokens, and we obviously have the document ID to put into the posting list. So now we have all we need to actually build the inverted index. So what else is part of the search?
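Before moving on, here is a small Python sketch of those last two points: building the index from analyzed tokens, and running the query through the same analyzer. The `analyze`, `build_index` and `search` helpers and the two sample documents are invented for this example. The analyzer is deliberately minimal (lowercasing and stop words only), and the search uses set intersection instead of a sorted merge to keep the code short.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "and", "is", "that", "the"}  # assumed, minimal list


def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def build_index(documents):
    """Map each analyzed token to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, query, analyzer=analyze):
    """Return ids of documents containing every token of the query."""
    tokens = analyzer(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


docs = {
    1: "Django is a high-level Python web framework.",
    2: "Flask is a Python microframework.",
}
index = build_index(docs)

print(search(index, "the Django framework"))             # [1]
print(search(index, "the Django framework", str.split))  # []
```

The second call skips the analysis and looks up the raw terms "the" and "Django", neither of which exists in the index, so nothing matches even though document 1 clearly should.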
What else is the major difference between filtering and searching? Well, the biggest one is probably relevancy. When you just do a filter, when you just do an ILIKE query, or icontains in our case, it will tell you which documents contain this word, but it will not tell you which contain it more or better, because it doesn't have any concept of relevancy. We do. There's even a science behind it.
This is a formula from Lucene, one of the most widespread search libraries out there. It's the library that powers both Elasticsearch and Solr, and it can even be used raw; there is even a Python interface to Lucene. It's a scary formula, but it's actually not that difficult to wrap your head around when you break it apart. The most relevant part is in the middle: it's the sum, and it takes into account TF and IDF. You'll find a TF-IDF formula in the middle of any relevancy formula out there. TF stands for term frequency and IDF stands for inverse document frequency.
The TL;DR version is: the higher the TF, which means we have found multiple occurrences of the term in the document, so the word is repeated multiple times in the body or in the title, the higher the score. And IDF means: the more common the word, the lower the score. So it's a balancing act. If I find a word five times, but the word is contained in almost every other document out there, that's not really saying much. But if I find a word twice that is only contained in 5% of my documents, that means it's a good match. The formula is just a mathematical expression of how to put this down. What we also take into account is the length of the field.
The assumption is that if we find something in the title or in a tag or something like that, it's probably more relevant than if we just find it somewhere in the middle of the body. There are a lot of things in the body, so the chances of our word being there are higher, and we need to take that into account.
So this is relevancy. This is a very high-level, helicopter overview of how relevancy works and how we can actually compute it. We have access to the term frequency because that's part of the inverted index. We also have access to the document frequency, which means how many documents contain this term, because that's just the length of the posting list for that term. So we don't need to do any lookup. We have direct access to those numbers, so we can calculate the relevancy, the score, very quickly.
That's all you need to build a full-text search. You can implement this in Python, use Redis for your data storage, and it will work just fine.
But this is not all there is to search. We've just built the search bar, the text input. There are other parts on this page that we might consider part of the search and that are definitely part of the search experience. So let's look at them very briefly.
The first one is highlighting. When I search through a book and I find the word Django somewhere in there, I cannot return the whole book and just tell the user: here, I found it somewhere in there. That's not very useful, and it's not very nice of us. Instead we can return just fragments: I found it in the fifth chapter, in the second paragraph, in the second sentence, and this is how the sentence looks.
That process is called highlighting, because within that fragment I can also highlight the matching text. It's again done easily, just by augmenting the inverted index to also contain the byte offset of the original term. For example, for the word "fast", which is nowhere in the text by the way, the thing that I index is "fast" together with the offset, which is 60 to 65. Then when I have a match on the word "fast", I just reach into the document, retrieve the text around bytes 60 to 65, and put HTML emphasis tags around those characters. So again, a very simple operation: I go to a byte offset, and at precisely specified bytes I insert some HTML markers. That's all I need for highlighting.
The last one we have time for is my favorite one, and that's facets and filtering. On the left side on GitHub, for example, or if you're ever looking for something on Amazon, you have the breakdown by different things.
On GitHub it's a breakdown by type (is this a repository, is this code, is it a user) and a breakdown by language, and for each of those you also have a little visualization: how many repositories have I found in Python, how many in Ruby or JavaScript. This is something that's called a facet or an aggregation. If you remember from the beginning, this is the part about exploration: it enables you to explore what's in your data set without having to know beforehand. If you're looking for Django, you have to know that you're looking for Django. But once you've searched for Django, you immediately see that the majority of the repositories that mention Django are written in Python. You can see it right there. Then you can click on Python and have the results filtered to only contain the Python repositories.
How this works under the covers is that we have a feature called aggregations. This part is specific to Elasticsearch; other search engines do it slightly differently, but the functionality is in all of them. We have two different types of aggregations: buckets and metrics. Buckets are those that define the groups of documents; you could say it's what you would put in the GROUP BY in SQL. For example, you can group by language or type, by geo distance, or by month. That's how you define the buckets. The interesting part is that you can nest those buckets: you can define a bucket per language, and for each language a bucket per month, so in one query you get the distribution over time for different languages. Then inside these buckets you can ask for certain metrics: just the count of documents, how many documents fell into that bucket, or the average value of a certain field, or something more sophisticated.
We've already seen faceted navigation and how it can be used. The other way this can be used is to just visualize the data. This is Kibana, by the way. It's just a JavaScript application that serves data out of Elasticsearch, and it's just doing aggregations and visualizing them. You can see the time series; that's just a date histogram, so basically everything is split by time, in this case I believe into five-minute intervals, and within each five minutes it's split between the four or five different types of request.
Then all you need to do is filter. If somebody clicks on something, you add a filter. A filter again uses the inverted index, but in this case there is no analysis, there's nothing, so it's just an exact match. And because it also doesn't compute any score, it's very fast and very cacheable.
And that's it for now. We ran out of time a little bit, I'm sorry for that. So if you have any questions... Do we have any time for questions? We do have time for questions, so I'll be happy to answer any of them.