or any of the open source search engines to build a search feature for your website. I am going to be talking about the principles this software is built on, the theory behind a search engine. But before I get into the theory, let me answer one question that you might have: should you be doing this, should you be building a search engine from scratch instead of using any of the open source solutions? My own opinion: no, you should not be. I think you would rather go ahead and use one of Lucene, Solr, Sphinx, Elasticsearch, any of those; these are great solutions. In fact, we at Flipkart use Solr for our website search and it has been great; we have been able to customize it a lot. Then why am I giving this talk, why are we all here? Well, to use a database you do not need to know what goes on underneath the database, how a database is built, but if you do know that, it helps you be a lot more effective with databases. It is quite similar with a search engine: if you understand what is underneath that search engine, it will help you be a lot more effective with it. Also, some of the techniques and ideas that are used in a search engine are very, very useful, and these are things that you might be able to apply in very different kinds of solutions and very different kinds of situations as well. So let's get started then. Is this visible? I have to work with that. So this is a very high level block diagram of what a search engine looks like. We are usually familiar with just that part of it there, where a user sends a query to a search engine and gets a response back. In the background, there is an index which the search engine uses to answer that query, and there is an indexer which actually builds that index. I am going to go over some of these parts.
I am actually going to go over the searcher. I have named that part the searcher to make a distinction between the whole system, which is what we refer to as the search engine, versus just the part which does the searching. I am going to go over what a search engine index looks like; it is quite different from the kinds of indexes that are used in, say, databases. And I am going to go over how the indexer actually builds that kind of an index. Then I will talk a little bit about how to scale these pieces. But before getting into the searcher or the index or the indexer, I will go over a little bit of the theory behind search engines. One of the first things that a search engine does when faced with a query, or with documents that it needs to index, is some text analysis. What kind of text analysis? Let us take an example: this is a phrase from Shakespeare's Julius Caesar. Given this sentence, a search engine applies a number of transformations to it, both at index time as well as at query time. The first thing it would do is what is known as tokenization, where you take this string and break it up into words, or what we refer to as terms. You remove all punctuation and break the string down into individual words. This is a key step: if we did not do this, what the search engine would end up doing is essentially a grep, scanning through all the text. Instead we break the text up into individual words, and from here on the search engine deals only with terms, not with whole sentences. Then it goes on to do something important known as stop word removal. Stop words are words which are very, very common in the language; for English that would be prepositions and articles. We go ahead and drop all the stop words, so words like 'the', 'was' and 'there' are struck out.
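As a minimal sketch of these first two steps, here is tokenization and stop-word removal in Python. The stop-word list is a made-up illustration, not the list a real engine would ship with:

```python
import re

# Illustrative stop-word list; real engines use much larger, curated lists.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "was", "is", "there"}

def tokenize(text):
    # Strip punctuation and break the string into individual terms.
    return re.findall(r"[a-zA-Z']+", text)

def remove_stop_words(terms):
    # Drop the very common words that carry little information.
    return [t for t in terms if t.lower() not in STOP_WORDS]

terms = tokenize("Friends, Romans, countrymen, lend me your ears")
print(remove_stop_words(terms))
```

From this point on, everything downstream works on the surviving terms rather than the raw text.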
They are dropped because they add little value. I would have to get into some information theory, around entropy and so on, to explain why we would want to do this, but I will not do that. Let it just sink in that these words do not add a lot of information, so they can be safely dropped, to reduce the size of the index and also to increase the accuracy of the search. Then we go ahead and do a whole bunch of things which are collectively referred to as normalization. This could be something as simple as case folding, wherein we lowercase all the words. It could be using something like a synonyms database to make sure that different forms of a word match each other. So colour is spelt differently in America versus in Britain, or there is pavement versus footpath; to make sure that these different forms match each other, we use a synonyms database to map them to one another. And there are various other kinds of normalization that are done. There is removal of accents, and there is normalizing of abbreviations: U.S.A. and USA should match each other. How do you make sure of that? Abbreviation normalization helps with that. There are very different kinds of normalization as well, using phonetics and so on. Another thing search engines go on to do is what is known as stemming. Stemming is an activity where different inflections of a word, different forms of a word, are translated to its base form. So you can take something like killer, killed, killing, and transform them all to kill. Maybe the document actually contains killer or killing, but the user types killed; they should still match each other, and to enable that, search engines stem. Stemming, and most of the things that I am talking about, are done both at query time as well as index time. There are rule-based stemmers as well as dictionary-based stemmers, and getting the stemming right is actually pretty tough.
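A rough sketch of normalization and stemming, assuming a toy synonym map and toy suffix-stripping rules (a real engine would use a proper dictionary- or rule-based stemmer such as Porter's):

```python
# Illustrative synonym map; real engines use a full synonyms database.
SYNONYMS = {"colour": "color", "pavement": "footpath"}

def normalize(term):
    term = term.lower()               # case folding
    term = term.replace(".", "")      # crude abbreviation normalization: u.s.a -> usa
    return SYNONYMS.get(term, term)   # map synonyms onto one canonical form

def stem(term):
    # A toy rule-based stemmer: strip a few common inflectional suffixes,
    # keeping at least three characters of the base.
    for suffix in ("ing", "ed", "er", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

print(stem("killing"), stem("killed"), stem("killer"))
print(normalize("Colour"), normalize("U.S.A"))
```

Because the same pipeline runs at both index time and query time, a document containing "killing" and a query for "killed" meet at the common term "kill".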
So stemming is a very important step, and Google had not introduced stemming until about 2003 or so, because getting the stemmer right is not an easy job. Then there is a more advanced technique known as lemmatization. This is something that not all search engines do. This is where the context around a word is used to decide which base form the word should map to. Let's say you see the word saw, S-A-W. It could have been used as the past tense of 'see', or it could have been used to refer to a hacksaw. Which usage was it in this particular instance? How do you figure that out? There are various techniques to do that, mostly involving some natural language processing. There is stuff like part-of-speech tagging, where you tag each word in the sentence with the part of speech it belongs to, and then you use that, plus some additional resources, to decide which kind of word it is. One important thing to see is that all of these steps except tokenization are being done to make fuzzy matching possible. Search engine matching is in general very fuzzy. This is very different from the matching in a database, for example. In a database you have a query, and either a row or a document matches that query or it does not. It is not so in search engines; it is a very fuzzy match. So let me move on to the index itself. I will give an overview of what a search engine index looks like. One very simple way of thinking of an index is to think of it as what we call a term-document matrix. A fancy term for something very simple. Imagine a matrix of 1s and 0s, where the rows represent terms, that is words, and the columns represent documents. Here we are using Shakespeare's plays as examples. These are words from Shakespeare's plays, and those are the names of the plays. Whenever a particular term occurs in a document, at the corresponding position where that row and column intersect, there is a one.
And there is a zero everywhere else. Very simple. Let's take an example query and see how we can use a term-document matrix like this to answer it. We have the query Brutus AND Caesar AND NOT Calpurnia. What we are asking here is: give me all the plays where Brutus is mentioned and Caesar is mentioned, but Calpurnia is not mentioned. Let's see what happens with this. One way to think about it is that every row here represents a term, and a term is a sequence of 1s and 0s. Let's pull out what those sequences are. Brutus is a sequence of 1s and 0s, so is Caesar, so is Calpurnia. And since we are interested in NOT Calpurnia, that is, documents where Calpurnia does not occur, let's complement that sequence and get NOT Calpurnia. Now, in order to match this query against the documents, it is as simple as doing a binary AND of all these binary sequences. So we end up with 1, 0, 0, 1, 0, 0, and this represents the documents which match this query. The first document and the fourth document match: that is, Antony and Cleopatra and Hamlet have Brutus and Caesar but do not mention Calpurnia. That sounds very simple, right? This is the most basic form of how a search engine works, and it is beautiful. The only problem is that it is very toy-like. It will work for a very small data set, maybe a few thousand documents and a few thousand terms. But as soon as you start talking about tens of thousands, hundreds of thousands, or even millions or billions of documents, this is not going to scale. So how do we make it scale? A key observation that will allow us to do that is that most of this matrix is actually zeros. This is what is referred to as a sparse matrix: most of the matrix is zeros, and there are ones strewn just here and there.
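The walk-through above can be reproduced directly in code. The rows below follow the example's result (documents 1 and 4 match); the other incidence values are illustrative:

```python
# Term-document incidence matrix. Columns (illustrative): Antony and Cleopatra,
# Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
matrix = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

def AND(a, b):
    # Binary AND of two rows, position by position.
    return [x & y for x, y in zip(a, b)]

def NOT(a):
    # Complement a row: documents where the term does NOT occur.
    return [1 - x for x in a]

# Query: Brutus AND Caesar AND NOT Calpurnia
result = AND(AND(matrix["brutus"], matrix["caesar"]), NOT(matrix["calpurnia"]))
print(result)  # -> [1, 0, 0, 1, 0, 0]: the first and fourth documents match
```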
Why do we even have to store the zeros? Why not just store the ones? That gives birth to the idea of an inverted index. Now, this is a bit of a misnomer, since all indexes are inverted in this sense, but it is the term that the information retrieval community uses heavily, and it is the well understood term. An inverted index is very simple. You take all the terms, just like we had before, but against each term you store the list of document IDs which contain that term. You do not record which documents do not contain the term; you store only the list of document IDs which do. The list of terms itself is referred to as the dictionary, and the list of documents where a term occurs is referred to as a posting list, a rather peculiar term. The entire set of posting lists is referred to as the postings. Now let's take the query that we were looking at, Brutus AND Caesar AND NOT Calpurnia. How do we answer this query using this structure? We cannot do the same thing that we did earlier; we can no longer just do a binary AND. What we do instead is take Brutus's posting list, the list of document IDs, then take Caesar's posting list, consider these as sets, and take a set intersection. Then take the posting list for Calpurnia and take a set difference of what we had earlier and this, and boom: what you have is the documents that match this query. That makes sense, right? And in order to make sure that we are able to do this in a very, very fast manner, the document IDs are actually stored in sorted order. You can see how that makes it really fast: since they are already stored in sorted order, the intersection is a matter of going through those lists just once. It is a single pass over each term's entire posting list to do the intersection.
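That single-pass intersection, and the set difference used for the NOT, can be sketched as follows; the posting lists themselves are made up for illustration:

```python
def intersect(p1, p2):
    # Merge-style intersection of two sorted posting lists: O(n1 + n2).
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def difference(p1, p2):
    # p1 minus p2, again a single pass over both sorted lists.
    out = []
    j = 0
    for doc in p1:
        while j < len(p2) and p2[j] < doc:
            j += 1
        if j == len(p2) or p2[j] != doc:
            out.append(doc)
    return out

# Brutus AND Caesar AND NOT Calpurnia, with illustrative posting lists:
brutus, caesar, calpurnia = [1, 2, 4], [1, 2, 4, 5, 6], [2]
print(difference(intersect(brutus, caesar), calpurnia))  # -> [1, 4]
```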
So if against each term we have N1, N2, N3 documents stored, the complexity of this is O(N1 + N2 + N3). That's all, and that's very fast. Now let's take a look at the searcher, and within the searcher I will focus on relevance ranking. Let me explain what I mean by that. Take a sample query: say somebody fired the query 'MySQL performance'. And take four sample documents which match it; we could have matched them in a manner very similar to what I showed earlier. But now the question comes up: in what order do we show these results? Which document should be first, which second, which third? Consider that there could be millions of documents which match this query. What order do we present them in? That is a very, very critical question, a question that makes or breaks a search engine. One key insight we can use to order them is to say: hey, let's see how many times these documents use the term MySQL and the term performance, and let's order them by that. This is the technique referred to as term frequency, TF, wherein we take the number of occurrences of each query term within each document, sum them up, and simply order by that. So we end up with the documents in some order. This by itself yields a huge return. Of the millions of documents which match 'MySQL performance', many of them would be mentioning these terms just in passing, and they all go towards the end. The documents which are probably actually talking about MySQL performance bubble up to the top. But you will notice one key issue here. Look at the last two documents: the first one says 'top 25 best Linux performance monitoring', and it probably has just one mention of MySQL.
The next one says 'Linux performance: is Linux becoming too slow?', and that has three occurrences of MySQL. Then there is 'MySQL performance blog' and 'eight great MySQL performance'. The last two here seem like they should be the most relevant ones, but they are coming up much lower. The reason this is happening is that we are giving the same amount of importance to the term MySQL as to the term performance, whereas they are not equally important. We are not interested in just any performance-related articles; we are interested in articles related to MySQL performance. Now, how do we make sure that documents which are about MySQL performance in particular come up to the top? The technique used for this is called inverse document frequency, which is a fancy way of asking: how common or how rare is this term? How rare is the term MySQL? How rare is the term performance? The idea behind this is that the rarer a term is, the more importance we should give to its occurrence. A very simple idea. The inverse document frequency itself can be computed simply as 1 upon the number of documents in which that term occurs, which gives a measure of how common or rare the term is. So we take this and define a score, for every document and query term, as the product of the term frequency and the inverse document frequency of that term. I will show an example shortly which will make this clearer. In practice, though, it is important to note that the term frequency itself is normalized based on the document length; this is to ensure that longer documents do not overpower all the shorter documents. And the IDF itself is dampened by applying some function to it, the most common function being log; there are a whole bunch of functions that are used. Let's take some examples.
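The TF × IDF score just defined can be sketched as code. The term counts and IDF values below are cooked up purely for demonstration, in the same spirit as the numbers in the talk:

```python
# Made-up term counts per document, for the query terms only.
docs = {
    "MySQL performance blog":                          {"mysql": 12, "performance": 9},
    "8 great MySQL performance tips":                  {"mysql": 10, "performance": 8},
    "Linux performance: is Linux becoming too slow?":  {"mysql": 3,  "performance": 15},
    "Top 25 best Linux performance monitoring tools":  {"mysql": 1,  "performance": 20},
}
# Cooked-up IDFs: MySQL is far rarer than performance, so it weighs more.
idf = {"mysql": 10.0, "performance": 2.0}

def score(doc_terms, query):
    # Sum of TF x IDF over the query terms.
    return sum(doc_terms.get(t, 0) * idf[t] for t in query)

query = ["mysql", "performance"]
for title in sorted(docs, key=lambda t: -score(docs[t], query)):
    print(score(docs[title], query), title)
```

With the MySQL occurrences weighted five times as heavily, the two MySQL-specific documents rank above the general Linux performance ones.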
All the numbers I am showing here are random numbers that I have made up for the purpose of demonstration. Let's say MySQL comes up with an IDF of 10; it is a rare term. And performance comes up with an IDF of 2; it is much more common than MySQL. So the number here indicates how rare that term is: MySQL is five times rarer than performance. Let's see what happens to our ranking if we use this formula, TF times IDF. MySQL comes up with some score, performance comes up with some score, and let's order by this. Great, so it looks like TF-IDF actually works, with this cooked-up set of numbers at least. We get the things which are talking about MySQL performance ranked higher than the things which are just talking about performance. In practice, TF-IDF ends up working very, very well; just by using TF-IDF, ranking becomes brilliant. Now let me talk about another component, the indexer, and how we do the indexing part. It is quite simple, actually. We take the documents and run them through the text analysis that we already talked about, which, after tokenization and everything, results in pairs of term and document ID. We then go ahead and sort these, which gives us sorted (term, document ID) pairs, and then we just persist them. The key thing is that we persist them in three different pieces. First, every term is assigned a term ID, and that mapping is stored. Then we store a mapping from the term ID to a posting ID; this is referred to as the dictionary. And then we have a postings file which maps the posting ID to the posting list itself, that is, the list of document IDs. It is as simple as that. The only problem is the sort over there; it is a killer. Imagine you have a million documents, 10 million documents, and 200,000 terms, right?
You will end up with on the order of 200,000 times 10 million entries that you need to sort. That is a killer; it is never going to work out. What do you do? The simple solution is to divide and conquer. You take, say, 10,000 documents at a time, create a posting list for that batch, and persist it to disk. Take another 10,000 documents, create a posting list for them, persist that, and keep doing this. Once you have all of them done, you go ahead and merge all of these posting lists; it is a simple matter of doing a merge sort over them. This also allows you to index a large collection as long as you have enough disk space, even if you do not have a lot of memory: if you cannot fit all the documents in memory, you can still do it this way. Something you can see from here is that this is obviously distributable: you can easily take this and put it in a map-reduce paradigm. How do you do that? Well, the mapper is quite simple: it takes a split of documents, does the parsing and tokenization, converts them into (term, document ID) pairs, and creates a local postings file for the set of documents it has received. Then the reducer does the merging of these postings files. Something as simple as that. So the indexing part is taken care of. But if you really do have such a large index, how is your search server even going to work? It cannot load the entire index into memory; we are talking about huge indexes. You would actually have to shard the index itself. Let's take an example: say this is a simple index of ten terms, and we need to shard it. The simplest way of thinking about it is: we will divide it, put the first few terms on one server and the remaining terms on another server, and boom, we have sharded. Now a query needs to be sent to both these servers. Great.
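This term-partitioned scheme can be sketched as follows: each server owns a slice of the dictionary, the query terms are routed to whichever server owns them, and the returned posting lists are intersected at one central place. The shard contents here are illustrative:

```python
# Term partitioning: each server owns a slice of the dictionary.
term_shards = [
    {"brutus": [1, 2, 4], "caesar": [1, 2, 4, 5, 6]},   # server 1 owns these terms
    {"calpurnia": [2], "hamlet": [4]},                  # server 2 owns these terms
]

def fetch_postings(term):
    # Route to the server owning this term and fetch its posting list.
    for shard in term_shards:
        if term in shard:
            return shard[term]
    return []

def query_and(terms):
    # Gather the posting lists from the term servers, then intersect them
    # centrally; this central intersection is the expensive step.
    lists = [fetch_postings(t) for t in terms]
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)
    return sorted(result)

print(query_and(["brutus", "caesar"]))  # -> [1, 2, 4]
```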
The only problem with this is that while the vocabulary, the set of terms, does not grow very fast, the set of documents keeps growing. Newer documents keep coming, and when a new document comes in, you might have to touch almost all the servers that you have, because the document is going to contain terms which are spread across machines. Moreover, let's see what a search does. A query comes in and needs to be sent to the servers, they send back document IDs, and remember, for each term we pick up the document IDs and then do an intersection. This intersection has to be done at some central place, probably the place from which we are firing the query. So we fire a query off to multiple servers, get the document lists back, and take an intersection on the querying server, which can turn out to be very expensive. This is referred to as partitioning by terms, and in practice most search engines do not use it. What they do instead is partition by documents. Quite simple again: say we assign D1 to D5 to one server and D6 to D10 to another server. Now each server will probably contain a large part of the vocabulary, the dictionary itself, and the one disadvantage is that every query needs to be sent to every shard. But after the documents are returned, what needs to be done is very simple: we just need to take a union of them, which is a far cheaper operation to do. Right, I think I will stop right here. Just a small attribution: most of the images that I have used, or at least some of them, are from a book called Introduction to Information Retrieval by Christopher Manning, Prabhakar Raghavan and Hinrich Schütze. It is a wonderful book; if you are interested in information retrieval or any of these things I am talking about, it is a brilliant reference, and it is a free e-book available online.
The authors have made it available online for free. Highly recommended. I will take any questions. [Audience] So, when you distribute documents, you have basically screwed up IDF. How do you then do ranking? [Speaker] Yeah, that is a big problem. What he is asking is basically this: remember IDF, which is one upon the number of documents the term is in. Now each server has only local information about what documents are present, so each server really has only an incorrect IDF. How do we work that out? How does the ranking happen? The technique that most search engines employ for this is a background process, or a set of background processes, which continuously compute the IDF across all the servers and distribute this value back to all the servers, and this is what gets used to finally do the ranking. Other questions? Thank you, Satari. Thank you.