lost. So our next speaker is Ankit Bahuguna, and he'll be talking about query embeddings: web-scale search powered by deep learning and Python.

Thanks a lot. I will be talking about query embeddings, a system we have developed at Cliqz which uses deep learning and is built entirely in Python. A bit about myself: I'm a software engineer in research at Cliqz, with a background in computer science, natural language processing, and deep learning. We are building a web search engine that is part of a web browser, and the browser works on mobile too. The areas that interest me are NLP, information retrieval, and deep learning, and I have also been a Mozilla representative since 2012.

About Cliqz: we are based in Munich, majority-owned by Hubert Burda Media. We are an international team of 90 experts from 28 different countries, and we combine the power of data, search, and browsers to redefine the browsing experience. The website is cliqz.com, where you can check out our browsers.

I'm here to talk about search, so I'll start with it. Search at Cliqz looks something like this. When you open your web browser, you usually either go to a link or type a search term. What the Cliqz experience gives you is a web browser with an address bar that is intelligent enough to take you directly to the site your query refers to. Say you're searching for something like "python wiki": you get the Python wiki link. If you search for the weather in Bilbao, you get the weather snippet; interestingly, I found out that on Monday, that's today, it's 41 degrees, so take care. And of course, if you search for news, you get real-time news. So it's a lot of data built into a browser, with the technology of search behind it: all three things combined.

A bit of history on how traditional search works. Search is a very long-studied problem, and by search I mean information retrieval, web search. The classic approach was to create a vector model of your documents and your query and match them at query time, the aim being to come up with the best URLs or documents for the user query. Over time, search engines evolved: the web became rich with Web 2.0, a lot of media came in, and people expected more from the web.

Now to our search story. Search at Cliqz is based on matching a user query against a query in our index, and our index is built from query logs. So if you type "facebook" or "fb", it has to go to facebook.com. With such an index, you can construct a much more meaningful search result experience for the user, because it is enriched by how many times people issue a query and land on the same page. What we aim for is to construct alternative queries for a user query: if we find the query directly, great; but if it's something different that we have not seen before, we construct alternatives at runtime and search for their results in our index. Broadly, our index looks like this: a query maps to URL IDs, where each ID is a hash value linked to the actual URL people go to for that query, along with frequency counts and other statistics which allow us to predict which page the user actually intended.
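To make the shape of this concrete: here is a toy sketch, in plain Python, of what such a query-log index could look like. The field names, hash values, and counts are all made up for illustration; this is not Cliqz's actual schema.

```python
# Toy sketch of a query-log index: each query maps to URL IDs (hash values)
# plus how often users issuing that query landed on each URL.
index = {
    "fb": [
        {"url_id": 0x5F3A2C, "url": "https://facebook.com", "count": 98123},
    ],
    "python wiki": [
        {"url_id": 0x91C2B7, "url": "https://wiki.python.org/moin/", "count": 5411},
        {"url_id": 0x77D019, "url": "https://en.wikipedia.org/wiki/Python_(programming_language)", "count": 1290},
    ],
}

def best_page(query):
    """Predict the intended page: the most frequent landing URL for the query."""
    return max(index[query], key=lambda entry: entry["count"])["url"]

print(best_page("fb"))  # -> https://facebook.com
```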
To look at the search problem itself in a bit more depth, it can be seen as a two-step process: first recall, then ranking. Given your index of pages, you first aim to get the best set of candidate pages for a user query: say, the 10,000 best-fitting URLs out of billions of pages. Then comes the ranking problem: given these 10,000 pages, give me the top 10 results. As you might know, on any search engine result page, the second page is a dead page; everybody cares about the first page. So it's very important that the top five, or top three, results are the best results for the query. And that's what we care about at Cliqz: given a user query, we try to come up with the three best results from the two billion pages in our index.

So where does deep learning come in? At Cliqz we do use the traditional method of fuzzy-matching the words in the query against a document, but we also use something a bit deeper and a bit different: semantic vectors, or distributed representations of words. We represent our queries as vectors. A vector here is a fixed-dimensional list of floating-point numbers, and the idea is that the vector for a query should capture the semantic meaning of that query. This is called a distributed representation: words which appear in the same context share semantic meaning, and the meaning of the query is defined by this vector. These query vectors are learned in an unsupervised (strictly, self-supervised) manner, where we focus on the context of the words in the sentences or queries. The area that studies this is the neural probabilistic language model.

Similarity between queries is measured as the cosine distance between two vectors: if two vectors are close together in the vector space, the queries are more similar. So we fetch the closest queries, the ones whose vectors are closest in space to the user query vector. This gives us a recall set, the first set we can fetch from our index that most accurately corresponds to the user query.

A simple example to illustrate this: say a user types a query like "sims game pc download", which is a game. Our system gives us a list of queries along with their cosine distance to the vector of the query the user typed: a sorted list, where the first entry is closest to "sims game pc download". Bear in mind this is a bit different to what you might expect, because you're not doing a word-to-word match but a vector-to-vector match. The vector for "sims game pc download" is closest to "download game pc sims": our search backend uses a bag of words, because we want to optimize for space as well, so the vector for both comes out the same. The values on the right are the cosine distances.
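As a rough illustration of this recall step, here is a minimal numpy sketch. The vectors are random stand-ins; in the real system they would be the learned query embeddings.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 2 = opposite direction."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(42)
user_vec = rng.normal(size=100)                   # vector for "sims game pc download"
candidates = {q: rng.normal(size=100) for q in    # vectors for queries in the index
              ["download game pc sims", "sims 4 free download", "weather bilbao"]}

# Sort indexed queries by distance to the user query vector; keep the closest 50.
ranked = sorted(((q, cosine_distance(user_vec, v)) for q, v in candidates.items()),
                key=lambda pair: pair[1])
print(ranked[:50])
```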
As we move down the list, the cosine distance increases and we start getting somewhat further-off results. We are usually concerned with the top 50 closest queries that come out of this system.

A bit more about how this learning process works and what we actually use in production. We use an unsupervised learning technique to learn these word representations. Effectively, given continuous representations of words, we would like the distance C(w) − C(w′) between two words to reflect a meaningful similarity. For example, if you take the vector for "king", subtract the vector for "man", and add the vector for "woman", you get a vector which is close to the vector for "queen". The algorithm behind this is word2vec, and with it we learn these representations and the corresponding vectors.

A bit more about word2vec. It was introduced by Mikolov in 2013, with two different models: the continuous bag-of-words (CBOW) representation and the skip-gram model. Both focus, again, on distributed representations learned by neural networks, and both are trained using stochastic gradient descent and backpropagation. To give a more visual picture: in the CBOW model, on the left, we have a context of, say, five words, and we try to predict the center word. In this case, for "the cat sat on the mat", the word "sat" has to be predicted from the other context words. The skip-gram model does the exact reverse: given the center word of a context window, you try to predict the surrounding words. With these two models, you define a vector for each word you see, as a lookup table, and learn the vectors using stochastic gradient descent.

I'll take this part slowly because it has a lot of math in it. A neural probabilistic language model tries to optimize for how often you see a particular word in a given context versus how often you see a word outside its context. A good language model will say: given a certain sequence of words, you will see this next word, and given a certain sequence of words, you will not see that word. That's how the model learns. Here is an example of how a traditional language model works: for "the cat sits on the mat", you try to predict the probability of "mat" coming after the sequence, over the vocabulary you have.

The only catch to worry about is that your vocabulary can be very, very large: say you have 7 to 10 million words, so you would have to compute the probability of one single word against all of them. To avoid this, we use something called noise contrastive estimation, where we don't test our word against the entire vocabulary. Instead, we pick a set of five or ten noisy words. For the sequence "the cat sits on the mat", you're pretty sure "mat" is the right word, but other words could appear too; something like "the cat sits on the hair" is not a sequence you would find in real life.
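For reference, a word2vec model of the kind described can be trained with the gensim library. This is a minimal sketch on a toy corpus; all parameter choices here are illustrative, and gensim's negative sampling stands in for the NCE idea discussed above.

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # 100-dim vectors, as in the talk (older gensim calls this `size`)
    window=5,         # context window size
    sg=1,             # 1 = skip-gram (0 = CBOW)
    negative=5,       # 5 noise words per positive pair
    min_count=1,      # keep every word in this tiny toy corpus
)

vec = model.wv["quick"]                        # the learned 100-dim vector for a word
print(model.wv.most_similar("quick", topn=3))  # nearest words by cosine similarity
```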
You can pick those noisy words at random from a uniform distribution and use them as negative training examples. So what your model effectively learns is: given the sequence, which word is the right one to come next, and which words are not. If the system makes this distinction over and over again with millions of examples, trained over a number of iterations, you get a model which separates the positions of the right words from the positions of the bad words by a clear distance.

Let's see how this works with an example. Take the document "the quick brown fox jumped over the lazy dog" and a context window size of one. For the first three words, "the quick brown", we have the center word "quick" and the surrounding words "the" and "brown". In the continuous bag-of-words model, the question is: can you predict "quick" from "the" and "brown"? That's a very simple example; in production we found skip-gram does much better. So effectively we predict the context words from a target word: we predict "the" and "brown" from "quick", i.e. given "quick", what is the probability of "the", and what is the probability of "brown"? The objective function is defined over the entire data set. Our data set is built from a lot of Wikipedia data, a lot of query data, titles and descriptions, and other textual data, from which we learn how queries and sentences are formed and what the sequences of these words look like. And we use SGD for this.

Say at training time t we have the pair "quick" and "the", and our goal is to predict "the" from "quick". We select some noisy examples; say the number of noise words is one, and we pick "sheep", which should not be part of this context. Next we compute a loss for this pair of observed and noisy examples, and we get the objective function, which is the log of the score: the correct context pair ("quick", "the") should get a score of one, and the noise pair ("quick", "sheep") a score of zero. By updating the parameters theta, which the score depends on, we maximize this objective as a log likelihood, doing gradient steps on top of it. We perform an update on the embeddings and repeat this process over and over for different examples across the entire corpus, and we end up with a lookup table from words to vectors. We can choose the dimensionality of the vectors; as I said on my slide, we use 100 dimensions to represent a word, and that works pretty well for us.
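Here is a toy numpy sketch of that update, assuming a single noise word per positive pair, as in the "quick"/"the"/"sheep" example. Real word2vec training keeps full vocabulary tables and draws several noise words per step; this only shows the direction of the gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100
vocab = ["the", "quick", "brown", "fox", "sheep"]
# Two embedding tables, as in word2vec: "input" (target) and "output" (context).
target_emb = {w: rng.normal(scale=0.1, size=dim) for w in vocab}
context_emb = {w: rng.normal(scale=0.1, size=dim) for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(target, true_ctx, noise_ctx, lr=0.1):
    """Push the score of (target, true_ctx) toward 1, (target, noise_ctx) toward 0."""
    for ctx, label in [(true_ctx, 1.0), (noise_ctx, 0.0)]:
        v, u = target_emb[target], context_emb[ctx]
        score = sigmoid(v @ u)   # current probability that the pair is a "real" pair
        grad = score - label     # derivative of the logistic loss w.r.t. (v @ u)
        context_emb[ctx] = u - lr * grad * v
        target_emb[target] = v - lr * grad * u

# One update for the worked example: predict "the" from "quick", noise word "sheep".
train_step("quick", "the", "sheep")
```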
So what do these word embeddings actually look like, and what has actually been learned? If you project these vectors into space, you find that the vector offset between "man" and "woman" is roughly the same as between "king" and "queen". And you find not just variation in gender, but also variation in verb tense, like "walking" and "walked", "swimming" and "swam": you have sentences where the person is walking or the person is running, and "he walks" or "he runs" occur in the same contexts, which the model captures pretty nicely. Not just that: we also get other relational features, like countries and capitals: Spain and Madrid, Italy and Rome, Germany and Berlin; these are country-capital relationships. This is a projection onto 2D using t-SNE; it's a bit small, but you can see some characters at the bottom, at the top words like "may", "should", "would", and over here "more", "less" and some more adjective-like identifiers. In this projection, semantically similar words are closer in vector space, and this is a very important property: if you can leverage it to construct sentence or document representations, you get similar documents close together in space as well. That is what query embeddings addresses.

The way we generate a query vector from these word vectors is as follows. For the same query, "sims game pc download", we have a vector for each of these words. We don't use the word vectors as they are; we first compute term relevances. Term relevance for us is a somewhat custom process, but effectively you get a score for each term in the query. This tells us that "sims" is the most relevant word in the query, because it's the name identifier. Next, we use these term relevances together with the word vectors to compute a weighted average: given the vectors of the different words and their weights, their term relevances, you do a numpy average and get an average representation of those words. That weighted average is what we call our query vector. So at the end, "sims game pc download" is nothing but this 100-dimensional vector, and that is what we use as our query vector.
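A minimal sketch of this weighted averaging with numpy. The word vectors and relevance scores below are made-up stand-ins for the learned word2vec vectors and the term-relevance statistics.

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-in 100-dim word vectors; in production these come from word2vec.
word_vecs = {w: rng.normal(size=100) for w in ["sims", "game", "pc", "download"]}
# Stand-in term relevances; "sims" dominates because it is the name identifier.
relevance = {"sims": 0.62, "game": 0.14, "pc": 0.09, "download": 0.15}

words = list(word_vecs)
query_vec = np.average([word_vecs[w] for w in words], axis=0,
                       weights=[relevance[w] for w in words])
print(query_vec.shape)  # (100,) -- the whole query is now one 100-dim vector
```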
A bit about term relevance. We have two different modes of term relevance. The usual choice is word frequency, but that doesn't work well at scale; you might use something like TF-IDF, or representations of that sort. One signal we use is: given the queries linked to a page, how many times a term occurs among that page's top five queries, which is a much better indication given the data we have. Roughly, combining such word statistics and frequencies gives us an absolute term relevance; the relative one is a normalization across the index. We found that if you normalize the scores across all the pages of your index, the vectors get slightly better and give slightly better results. These are all data dependent; we compute them on the fly each time we refresh our index. For example, it looks like this: for each word you have features like frequency, UQF, and so on, and similarly for all the other words.

What we then create is a query vector index. Alongside the traditional index with all the documents, we have all the queries and their vectors, and we do a query-vector lookup. We cannot do this for all queries, because there are simply too many. What we found is that for every page in our index we can pick the top five queries, which effectively represent the page; from the page models we can get this data. Roughly, we end up with 465 million queries representing all the pages in our index, and we learn a query vector for each one of them. If you just dump the whole system to disk, it's around 700 GB.

The problem now is: how do we get similar queries out of these 465 million? Given a user query, find the closest 50 among 465 million queries. Brute force? Far too slow. We cannot use hashing techniques effectively either, because they are not accurate enough for these vectors: the vectors are semantic, and even a small loss in precision can lead to poor results. So the solution required applying a cosine similarity metric that could somehow scale to 465 million queries and take 10 milliseconds or less.

The answer we came up with was approximate nearest neighbor vector models, and they were pretty helpful for us. The library we use is called Annoy; it's C++ with a Python wrapper, and it builds approximate nearest neighbor models for all the queries we have. Annoy is used in production at Spotify, and now at Cliqz as well. We could train Annoy on all 465 million queries at once, but that's too slow, because it is memory intensive. So we don't train them all together: we have a cluster where we host these models alongside our search index, and we train 10 models, with around 46 million queries each; how these shards are queried, I'll explain in a moment. After training, each shard is around 27 GB, so around 270 GB across the 10 models, and everything is stored in RAM, because for us the most important thing is latency: given a search, you want the results to come back quickly. Later I'll show you a demo of how this is actually used in production.
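A minimal sketch of building and querying one shard with Annoy's Python API. The vectors are tiny toy stand-ins for the roughly 46 million query vectors of a production shard, and the file name is hypothetical.

```python
import random
from annoy import AnnoyIndex

DIM = 100  # same dimensionality as the query vectors

# Toy stand-ins for one shard's query vectors.
query_vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]

index = AnnoyIndex(DIM, 'angular')  # 'angular' is Annoy's cosine-style metric
for i, vec in enumerate(query_vectors):
    index.add_item(i, vec)          # Annoy stores integer ids, not strings
index.build(10)                     # forest of 10 random-split trees
index.save('shard0.ann')

# At query time: load (memory-mapped) and fetch the nearest neighbours.
shard = AnnoyIndex(DIM, 'angular')
shard.load('shard0.ann')
user_vec = query_vectors[0]
ids, dists = shard.get_nns_by_vector(user_vec, 50, include_distances=True)
```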
At runtime, you query all 10 shards simultaneously and sort the results by the cosine distances you get back: different shards can hold different close queries, and what you eventually want is the best overall set of queries closely matching the user query. A nice cutoff we found was 755, a heuristic number for how many nearest queries serve the system well without hurting recall, and without hurting latency either, because this step has a real latency cost.

But first I want to explain how we use Annoy and how Annoy actually works. It's one of those nice frameworks to use if you are working with vector-based approaches for recall or ranking. We want to find the nearest points in sub-linear time: you cannot compare one by one, which is O(n); you want the closest queries in O(log n) time, and the natural data structure for that is a tree. All your query vectors are points in a space, each point representing a single query, and given a certain point, the user query vector, some point in that space, you want to find the nearest ones. To build the tree, you split the space recursively: take two points at random and split the space between them, do it again, and you get a tree with a certain number of points per node. You keep splitting and end up with a big binary tree. The nice property of this binary tree is that points which are close to each other in the space are likely to be close to each other in the tree: if you navigate down to a node and its child nodes, that whole branch is composed of nodes that are similar in the vector space.

So how do we search for a point among these splits? Say the red X is our user query vector, and we want to find the vectors nearest to it and the queries related to them. You go down the path in the binary tree, you get these neighbors, and you use the cosine metric to judge how close they are: if the distance is between 0 and 0.5, it's very close; if it's more than 1, it's far, since cosine distance takes values between 0 and 2. So you can decide how close your vector is. But here's the problem: in this picture you only see 7 neighbors. What if we want more than 7 close queries? Instead of navigating only one branch of the tree, Annoy can navigate into the second branch as well; this is maintained in a priority queue, so you can traverse both parts of the tree and get the closest vectors. You look not only at the light blue part but also at the slightly darker blue part, both sides of the split, and you find that both of those areas of the space are close to the user vector.
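A sketch of that fan-out, assuming ten shard files on local disk and using a thread pool as a stand-in for however the production cluster distributes the work; 755 is the cutoff heuristic mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from annoy import AnnoyIndex

DIM, CUTOFF = 100, 755  # 755 = the heuristic cutoff from the talk

def search_shard(path, user_vec):
    shard = AnnoyIndex(DIM, 'angular')
    shard.load(path)  # memory-mapped file, cheap to open
    ids, dists = shard.get_nns_by_vector(user_vec, CUTOFF, include_distances=True)
    return [(path, i, d) for i, d in zip(ids, dists)]

def search_all_shards(shard_paths, user_vec):
    # Query every shard in parallel, then merge and sort by distance.
    with ThreadPoolExecutor(max_workers=len(shard_paths)) as pool:
        per_shard = pool.map(lambda p: search_shard(p, user_vec), shard_paths)
    merged = [hit for hits in per_shard for hit in hits]
    merged.sort(key=lambda hit: hit[2])  # smaller angular distance = closer
    return merged[:CUTOFF]

# Hypothetical usage with ten shard files, as in the talk:
# results = search_all_shards([f"shard{i}.ann" for i in range(10)], user_vec)
```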
But sometimes, because we split at random, you can miss some good zones, since you just split across two random points. To minimize this, you train a forest of trees: you don't train just one sequence of splits, you randomize the splits across, say, ten trees, so effectively your model learns these ten configurations at once and searches them in parallel in real time. This gives you a pretty good representation, because when you sort the results from the forest of trees, you get good similarity between queries.

One missing feature in Annoy, or maybe it's a feature, not a bug, is that it doesn't let you store string values; it only stores integer indexes. So for the query "sims game pc download" you give it a unique index, say five or one, and that index is stored with the vector when the model is built. When you query Annoy, you get back the indexes of everything close to your vector. What we have at Cliqz is a system we developed called keyvi, a key-value index which also backs our entire search index; it's much better than Redis or anything comparable in terms of reads and maintainability. We developed it in-house; it's written in C++, again with Python wrappers, and it stores the index-to-query mapping. So effectively: given a user query, you compute a query vector, you search the Annoy models for the closest query vectors, you get their indexes, and by looking those up in the keyvi index you get the queries back, so you can fetch the pages for all the queries closest to the user query.

This is how we improve our recall, and the results are pretty amazing, in the sense that we get a much richer set of candidate pages after the first fetching step, with a higher chance that the expected pages are among them. The reason is that we are not just going beyond synonyms or doing a simple fuzzy match, but actually exploiting how the vectors were learned semantically. It screws up sometimes, but most of the time there's a definite improvement, because the model always learns from words that are near in context, and that's a very important feature. Queries are now matched in real time using cosine similarity between query vectors, on top of the classical information. Overall, the recall improvement over our previous release was around 5-7% on internal tests, and the translated improvement in the final top 3 results is around 1%; that gives us a clear indication of where these vectors are actually useful. Note that the system triggers only for queries we have never seen before. That's an important point: for seen queries like "fb" or "google" you land on a certain page and you're sure about it, but for queries which are new to us, which are not in the index, you have to go beyond the traditional techniques, and this technique helps a lot.

Before I conclude, I wanted to show what the browser actually looks like. This is the Cliqz browser search page, and we have this snippet which comes up.
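A sketch of this last step, mapping Annoy's integer ids back to query strings. A plain dict stands in for keyvi here, since I'm not reproducing keyvi's actual API; the id-to-query pairs are hypothetical.

```python
from annoy import AnnoyIndex

# Plain dict as a stand-in for the keyvi key-value index (id -> query string).
id_to_query = {0: "sims game pc download", 1: "download game pc sims"}

def closest_queries(shard_path, user_vec, k=50):
    """Annoy returns integer ids; the key-value store turns them back into queries."""
    shard = AnnoyIndex(100, 'angular')
    shard.load(shard_path)  # e.g. the 'shard0.ann' built in the earlier sketch
    ids, dists = shard.get_nns_by_vector(user_vec, k, include_distances=True)
    return [(id_to_query.get(i, "<unknown>"), d) for i, d in zip(ids, dists)]
```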
The idea of this was to remove the whole search-engine-result-page step, so you can get directly to the page. As for libraries: Annoy, from Spotify, is available on GitHub, and keyvi you'll find under Cliqz OSS on GitHub; it's pretty useful and a pretty active project. One note: it's a little bit different, but I would recommend using the original C code, because it's a bit more optimized; judging by the commit history, there are certain variations in the models that get built. There are other Cliqz OSS projects you can contribute to as well. If you want the slides, they are on Speaker Deck; search for "QE python".

Before I conclude, I'll say that we are still working on this system. We have the first version ready, but we are looking at other deep learning approaches, like long short-term memory (LSTM) networks. The downside of that approach is that most user queries are keyword-based; you don't usually find people typing "what is the height of the Statue of Liberty". That sort of linguistic relationship may be well captured by LSTMs, but they are more complicated, and this system is simple enough to still give pretty good results. We are also trying to bring this new metric into ranking; we are working on query-to-page similarity using document vectors, again with an LSTM-based model or a paragraph-vectors model; and we are trying to improve coverage for pages which have never been queried: we have a long list of such pages, and we are trying to find the best way to represent them, either with vectors or with a traditional n-grams approach.

Last but not least: thank you, and I'll finish with this quote from John Rupert Firth in 1957: "You shall know a word by the company it keeps." Mikolov developed a model using exactly this contextual view of words, and it helped us get good results. So thank you. Any questions?
Yeah, so one of the reasons was that we wanted something unified. We tried a lot of these key-value stores, and we wanted control, in the sense that sometimes we have a vector index where the values need to be a list of vectors, sometimes just strings, sometimes repeated strings where you have the same JSON data structure again and again; we can optimize much more if we write those parts of the code ourselves, and that's how we started. Keyvi is a much bigger project and I'm not really the expert on it, but what I can say is: you can compress your keys, you can do msgpack-style serialization with zlib or snappy compression, and that gives you much more compact values; it's faster to index, faster to read, and it's scalable in the sense that we don't have to hold it all in memory: we can keep it on disk and memory-map it, so you can still work with lots of data. What we wanted for our use case was optimized reads, because we have no writes at all: we write once, and at runtime we take a query and fetch data from the index. For that, keyvi works very nicely for us.

[Audience] You were talking about having no writes on the database. I was wondering how you handle new data: new queries to train your embeddings, or rather a new query index, because as far as I know there are no implementations of nearest neighbors that can update the index incrementally.

Yeah, that's true. What we do is have a release cycle where we compile each Annoy index every month. But it's true: if tomorrow I immediately want to include a set of results for brand-new queries, I cannot do that. To address that issue we have news: the news vertical handles anything that is trending right now, in the news section. We also have concepts already learned from Wikipedia data, and that's what we use, so the meaning of new queries can often be composed from known words; but for some Gen-X, Gen-Y word that comes up tomorrow, you probably won't have it. It's a very hard problem anyway. Anyone else? Okay, let's give a big hand of applause.
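As a footnote to the keyvi answer above: the pack-then-compress idea looks roughly like this with the standard msgpack and zlib modules; the record shape is hypothetical.

```python
import msgpack
import zlib

# Hypothetical repeated-JSON-style record of the kind described in the answer.
record = {"query": "sims game pc download", "url_ids": [12, 87, 903], "counts": [41, 17, 5]}

packed = zlib.compress(msgpack.packb(record))            # serialize, then compress
restored = msgpack.unpackb(zlib.decompress(packed), raw=False)
assert restored == record
print(len(packed), "bytes stored")
```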