Hello. I'm Machi, and I'm a data scientist at Lyst. We're a fashion company: we go all around the internet, gather fashion products from all sorts of retailers and all sorts of designers, and put them in one place, so that as a user you can go to a single website, follow your favourite designers, browse all your favourite fashion products, and buy them from us. So that's the principle.

I'm going to be talking about nearest neighbour search. Nearest neighbour search is very simple in principle: you have a point and you want to find other points that are close to it. The most obvious application is maps, something we use every day. You've got your location on a map and you ask Google or Bing, or whoever you want to ask, what are the nearest restaurants or the nearest cafés? It figures out where you are, or rather you give it a location, it looks up the other points on the map that are the restaurants you're looking for, calculates the distances between where you are and where each restaurant is, and tries to give you the closest ones. So that's the essence of nearest neighbour search: given a point, give me other points that are close to it.

That's the most obvious application, but even if you're not building a mapping application, which you may well not be (we certainly aren't), you may still find it very useful. So what do we use it for? We use it for image search and we use it for recommendations.

Image search: how does it work? The principle is that you've got an image. Let's say a user submits an image of a dress to us, and they want us to give them images of similar dresses, or of something that's a good substitute for the dress they've submitted. As programmers, it's hard to find similar images just by looking at an image, so the first step is to transform the image into something you can more easily work with. One very naive idea: an image is RGB, red, green and blue, three numbers per pixel. For a given image we could average the values of each colour channel and then try to find other images in our database with similar average colour values. That's a very naive approach and it's definitely not going to work well, but it illustrates the principle: after transforming the image into a numerical representation you've got a point, in this case a point in a 3D space, and that's where nearest neighbour search comes in. If we want to find images similar to the one the user has just given us, we look for other images, other points, which are close in that space.
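To make that naive colour-averaging idea concrete, here is a minimal sketch (my own illustration, not code from the talk; the file names, and the use of NumPy and Pillow, are assumptions):

```python
# Minimal sketch of the naive average-colour idea: represent each image by its
# mean (R, G, B) values and find the image whose averages are closest.
# The file names here are hypothetical; any RGB images would do.
import numpy as np
from PIL import Image

def average_rgb(path):
    """Load an image and return its mean red, green and blue values."""
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    return pixels.reshape(-1, 3).mean(axis=0)  # a point in 3D colour space

database = {p: average_rgb(p) for p in ["dress_1.jpg", "dress_2.jpg", "dress_3.jpg"]}

def nearest(query_path, database):
    """Brute-force nearest neighbour in the 3D average-colour space."""
    q = average_rgb(query_path)
    return min(database, key=lambda p: np.linalg.norm(database[p] - q))

print(nearest("query_dress.jpg", database))
```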
The way we would actually do it is something more complicated. Recently the most fashionable way of doing this, and probably the most effective, is deep learning. Picture a very simplified diagram of a convolutional neural network: the black square at the far left of the slide is the image you start with, and then you build a machine learning model which successively detects more and more interesting features of the image. At first you detect just simple edges: is there a line in that part of the image that goes from left to right, that sort of thing? But as you progressively build better and better representations, you start to learn more and more about the image. Maybe in the first layer you just detect edges, in the second layer you detect shapes (is it a square, is it a circle?), and in the following layers you detect high-level concepts: is it a cat, is it a dog, is it a building, is it a bridge? The nice thing about it is that in the final layers you've got a long, ordered list of numbers, a vector, which represents a point in a high-dimensional space, and images of cats in that space are going to be close to other images of cats, and images of bridges will be close to other images of bridges. That's how you can do very, very good image search.

This is indeed what we do at Lyst. We take images and process them into this point representation in a high-dimensional space, and then we use nearest neighbour search to find similar images. That's useful for two things. One is search: you give us an image and we can find you similar images, or maybe you type in a text phrase, we convert the phrase into a point in the same high-dimensional space where the images live, and suddenly we can find images which are similar to the text you typed. So that's really cool, and that's something we can do. Another useful application is de-duplication. We've got two images with no metadata associated with them, but the underlying truth is that they show the same product, and we can use nearest neighbour search to discover that they are the same product and present them as a single thing, rather than two disparate things, on our website. So that's really useful.

Another application is recommendations. This approach is known as collaborative filtering. You have your products and you have your users, and you represent both as points in space. We take our products, a handful of points, and cast them at random into this high-dimensional space, and we do the same for users: we represent them as points and cast those points at random into the space too. Then, if a given user interacted with a given product, we draw the two points together, and if a given user did not interact with a given product, we push them further apart. The nice thing about it is that at the end of this pushing-apart and pulling-together process, users end up close to the products that they will like and far from the things that they wouldn't like. So when we want to recommend things to a user, we look up the point that represents them in the high-dimensional space and use nearest neighbour search to find the products that they would like. So that's also very, very useful.
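As a toy illustration of the pulling-together and pushing-apart idea (again my own sketch, not Lyst's actual recommender; the sizes, learning rate, and the fake interaction data are all made up):

```python
# Toy sketch of embedding users and products as points, pulling interacting
# pairs together and pushing sampled non-interacting pairs apart.
# A real model would also update the product vectors; here only users move.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_products, dim = 50, 200, 16
users = rng.normal(size=(n_users, dim)) * 0.1        # users cast at random into space
products = rng.normal(size=(n_products, dim)) * 0.1  # products cast at random into space
interactions = {(u, int(rng.integers(n_products))) for u in range(n_users)}  # fake data

lr = 0.05
for _ in range(100):
    for u, p in interactions:
        users[u] += lr * (products[p] - users[u])         # pull interacting pair together
        neg = int(rng.integers(n_products))               # a product the user did not touch
        if (u, neg) not in interactions:
            users[u] -= lr * (products[neg] - users[u])   # push non-interacting pair apart

# Recommend: nearest products to a user's point (brute force here).
scores = np.linalg.norm(products - users[0], axis=1)
print("top products for user 0:", np.argsort(scores)[:5])
```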
I'm a data scientist, and you know data scientists are very excited by these sorts of things; we spend a lot of time thinking about them and implementing them. You go away and work for six months and come up with this amazing solution where you translate pictures of products into this high-dimensional space, with similar products close together in space, and you say: as a data scientist I've done this amazing thing, it works, it's going to be great, let's just deploy it on the website and make users really, really happy. So this is where you are as a data scientist: you've got your beautiful child, it's going to be great, and you go, let's just deploy it, let's make it work.

So how would you do it? You're given a query point, say a user, and you want to find the nearest products. That's simple, right? You take all the points representing products, compute the distance between your query point and every product in the database, sort by distance, and return the closest ones. Simple, right? It couldn't be simpler.

Well, yes, but no. The problem is that at Lyst we have 80 million images and about 9 million products, so if we did the simple thing, calculating distances to all the points and returning the closest, users would be very, very bored by the time we finished. It would take literally minutes, so that will not work.

Okay, so how do we make it work? Locality sensitive hashing to the rescue. We all know about hash tables, or dictionaries in Python, and we are going to build a special hash table. We pick a hash function which, unlike normal hash functions, maps points that are close together in space to the same hash code. That's very different from what normal hash functions do: they are supposed to map things uniformly over the hash space. This one is special: two points that are close together in your space will map to the same hash code. Then you build a normal hash table using that: you take your points, compute their hash codes, and put each point in the hash bucket corresponding to its hash code. Magically, points that are close together end up in the same bucket, and when you are doing a search you just look up the bucket that you need and search within that bucket, which is really nice.

To do this at Lyst we use random projection forests, and I'm going to tell you how they work. Imagine a space of points: about 100 grey points and one blue point, the query point, the point we want to find the nearest neighbours for. If we didn't do locality sensitive hashing, we would have to calculate the distance between the blue point and every other point, which takes too long; we cannot do that. To make it faster we draw a random line, and that's the beauty of it: we just take a random line and draw it. It has to go through the origin, but otherwise any line will do. The nice thing about it is that, if we look at the picture, most of the points that are close to the query point end up on the same side of the line, and the points that are not close end up on the other side. So just by drawing a random line we've created two hash buckets, and suddenly we only have to look through half of our points to find the nearest neighbours. That's already a speed-up factor of two, with just one random line, and we didn't have to do anything intelligent to draw that line: it's just a random line. If that speed-up is not enough for you, you draw another random line, again completely random, and the points that end up on the same side of both lines end up in the same hash bucket. If the speed-up is still not enough, you keep drawing lines until you've got few enough points in each hash bucket, and that's your speed-up: you draw enough lines to have small enough hash buckets.
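Here is a minimal sketch of that bucketing idea: hash each point by which side of a few random hyperplanes through the origin it falls on, then only search the query's bucket. The data sizes and the number of hyperplanes are arbitrary choices for the example, not figures from the talk.

```python
# Minimal locality sensitive hashing sketch with random hyperplanes: each
# point's hash code is the pattern of signs of its dot products with a few
# random directions through the origin. Data sizes are made up.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 64))      # our database of points
hyperplanes = rng.normal(size=(8, 64))    # 8 random lines/hyperplanes through the origin

def hash_code(x):
    """Which side of each random hyperplane the point x falls on."""
    return tuple((hyperplanes @ x) > 0)

buckets = defaultdict(list)
for i, p in enumerate(points):
    buckets[hash_code(p)].append(i)       # nearby points tend to share a bucket

query = rng.normal(size=64)
candidates = buckets[hash_code(query)]    # only search within the query's bucket
best = min(candidates, key=lambda i: np.linalg.norm(points[i] - query), default=None)
print(len(candidates), "candidates, best:", best)
```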
Then, when you need to perform a nearest neighbour query, you take the blue query point, calculate which hash bucket it should end up in, and compute brute-force distances between the query point and only the points that are in the same hash bucket. So that's the principle.

You can also think of it as building a binary tree. You start with all the points, then you make a split: the points on the left side of the line go into the left subtree and the points on the right side go into the right subtree. Then you follow, say, the right subtree and do another split into a left subtree and a right subtree, and another split, and another. With a query point you start at the root of the tree and follow the splits until you end up in the right hash bucket. So that's the principle of how it works.

Now, it works really, really well in some cases, but in some cases it doesn't work very well. The way we started this at Lyst, we thought: we're going to draw a fixed number of these lines, and hopefully that will give us a speed-up and be accurate enough. So we decided on a fixed depth, let's say 100 random splits, and after 100 random splits we stop; the things that end up in the same bucket are the candidate nearest neighbours and the rest we discard. That works reasonably well if your points are fairly uniformly distributed in your space, because all regions are of roughly equal density: wherever you draw the lines, the hash buckets end up roughly the same size, with roughly the same number of points, and the splits are good enough. But in problems where some regions of your space are high density and other regions are low density, you end up with some buckets having lots and lots of points and some buckets being completely empty, neither of which is good. If you have a bucket with lots and lots of points, you're not going to get a good speed-up, and if you have a bucket with very few points in it, you're not going to get any results back. Both of which are horrible.

So the first point is: keep splitting until the nodes are small enough. You don't take a fixed number of splits; you build a binary tree and you split and split and split until you reach a stopping criterion, say, this bucket contains no more than X points, and when that happens you stop splitting and take that tree.

The second point is: use median splits. With purely random splits you can end up with highly unbalanced trees: the left subtree will be very shallow because for some reason there are very few points there, but the right subtree will be very deep because there are lots of points in that part of the space. That's not horrible, but it's not great either, because you spend more time traversing the deep part of the tree, and you will be traversing the deeper part more often, because that's where more of the points are. With median splits, you take a random line, calculate the median of the points' projections onto it, and split at that median, so it's guaranteed that half of the points always go to the left subtree and half always go to the right subtree. That gives you nicely balanced trees and faster traversal times.
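A short sketch of the tree just described, splitting on the median projection onto a random direction until the leaves are small enough (this is my own illustration, not the rpforest source; the leaf size and data are made up):

```python
# Sketch of a single random projection tree with median splits: keep splitting
# on the median projection onto a random direction until each leaf is small.
import numpy as np

rng = np.random.default_rng(0)

def build_tree(indices, points, leaf_size=20):
    if len(indices) <= leaf_size:
        return indices                               # leaf: a small bucket of points
    direction = rng.normal(size=points.shape[1])     # random hyperplane through the origin
    proj = points[indices] @ direction
    median = np.median(proj)                         # median split -> balanced subtrees
    left, right = indices[proj <= median], indices[proj > median]
    if len(left) == 0 or len(right) == 0:            # degenerate split, stop here
        return indices
    return (direction, median,
            build_tree(left, points, leaf_size),
            build_tree(right, points, leaf_size))

def query_tree(node, x):
    if not isinstance(node, tuple):
        return node                                  # reached a leaf: candidate indices
    direction, median, left, right = node
    return query_tree(left if x @ direction <= median else right, x)

points = rng.normal(size=(10_000, 32))
tree = build_tree(np.arange(len(points)), points)
candidates = query_tree(tree, rng.normal(size=32))
print(len(candidates), "candidates to brute-force")
```

A forest would simply repeat `build_tree` with different random directions and merge the candidate lists before the final brute-force step, which is the point made next.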
The final point is: build a forest of trees. You don't just build one tree, you build lots of trees. What's the reason for that? Well, the random projections algorithm, and locality sensitive hashing in general, are approximate algorithms: they don't give you an exact answer. They're probabilistically correct, but in some cases they will be wrong. If you look at the query point in the picture, maybe there are some points to the left of the line that are closer to the query point than the points on the right side of the line, but if you build just one tree we're never going to surface those points, because they end up in a different part of the tree. That's a mistake this algorithm makes, and that's not great: we want results that are as good as we can get, given the speed-up. The way we get around this problem is to build lots of trees. Because in each tree the lines are chosen at random again, each tree will make its own errors, but they will not repeat each other's errors, so when you combine the candidates from all the trees they end up correcting for each other, and the accuracy of the aggregate will be higher than the accuracy of any single tree. That's why you build lots of trees.

Another nice property of this approach is that if you build more trees you get more accurate results, at the expense of having to traverse more trees, so it's going to be slower. But this is something you can control: you can pick the trade-off. The number of trees you build draws your performance curve, and you can pick a point on that speed-versus-accuracy curve that's appropriate for your application, which is very nice. If you want pure speed, build few trees: you give up some accuracy but you'll be very quick. If you want something accurate and don't care about speed that much, build lots of trees: it'll be accurate, maybe not so fast. So that's the principle of the algorithm. Does anybody have any questions at this point? I'm happy to clarify.

[Audience] Obviously it works well for two-dimensional points; how do you generalise it to higher dimensions? I assume you use hyperplanes, but how do you split the feature space with a hyperplane?

Right, so one of the main reasons for the existence of this algorithm is high-dimensional spaces. The reason it's in 2D on the slides is that 2D is easy to visualise and high-dimensional spaces aren't. But basically you draw a random hyperplane, whatever your dimensionality is, and then you do the same calculation: which side of the hyperplane does each point fall on? It's exactly the same principle, just in higher dimensions, and it works very well for very high dimensions; we can talk about that afterwards.

So that's the principle of the algorithm. How do you do it in Python? At a Python conference it's useful to give an idea of the Python packages, and there are several for doing this. One of them is Annoy (Approximate Nearest Neighbours Oh Yeah), a very cleverly named package from Spotify. It's a Python wrapper around C++ code, it's pip-installable, it's very, very fast; it's actually very nice. Another is LSHForest: those of you who are data scientists and play with machine learning probably already have scikit-learn on your computers, so it's really easy to use because it's already there, and it's also quite easy to use. And then you've got FLANN, which is, I believe, C++ code, and it's sort of gnarly and hard to deploy. The nice thing about it is that you give it your data, it takes a long time to train, but it figures out the optimal structure for your problem, and there's a Python wrapper for it which works, I'm told. There are some bits of these that we don't like as much. LSHForest itself is in scikit-learn, so you can read the source, and it's fairly readable.
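For a feel of what using one of these packages looks like, here is a minimal Annoy example (the vectors are random, and the dimensionality and tree count are arbitrary choices for the sketch):

```python
# Minimal Annoy example: index some vectors in a forest of trees, then query
# the 10 approximate nearest neighbours of a new point.
import numpy as np
from annoy import AnnoyIndex

dim = 64
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, dim))

index = AnnoyIndex(dim, 'angular')           # angular (cosine-like) distance
for i, v in enumerate(vectors):
    index.add_item(i, v.tolist())
index.build(50)                               # more trees -> more accurate, but slower

query = rng.normal(size=dim)
print(index.get_nns_by_vector(query.tolist(), 10))  # 10 approximate nearest neighbours
```

The number passed to `build()` is exactly the speed-versus-accuracy knob discussed above.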
But LSHForest is actually very slow, so if you want to build a high-performance application it's maybe not the best solution. FLANN is a pain to deploy: it's C++ code, you need CMake and all sorts of dependencies. Annoy is great, I recommend you use it, but for us it didn't fulfil two important requirements. One, once you build a forest of trees you can't add any more points to it, which for us was a no-no: we need to add new products to the index as they come in, so this is something we needed. And secondly, you cannot do this out of core: you have to keep all the vectors in memory all the time.

So, like any engineer, we wrote our own. It's called rpforest, which speaks to the algorithm it implements; it's available on GitHub and it's pip-installable as well, so please go forth and try it out, break it in all sorts of novel ways, and I'll try to fix them. It's quite fast: not as fast as Annoy, but fast enough for us, and certainly much, much faster than the LSHForest that's built into scikit-learn. It allows adding new items to the index and does not require us to store all the points in memory, which is really very nice.

So how do we use it? We use it in conjunction with PostgreSQL. Basically we have a lightweight service that holds the ANN indexes, the random projection forests. We send a query point to it, and what it gives us back is the product IDs or image IDs you are going to be interested in. We take those IDs and push them to Postgres and say: dear Postgres, here are the IDs, please apply the following business rules, all your WHERE clauses and so on, and then do the final distance calculations, also in Postgres, using C extensions. We store the actual point locations, the actual vectors, as arrays in Postgres, and we've written some C extensions to Postgres that allow us to do the distance calculations inside Postgres, which is quite nice.

Side note: Postgres is awesome. If you're doing all sorts of numerical stuff, you have arrays in Postgres just as you have arrays in Python, and you can write custom functions in C to do anything you want. So if you really want to, you can write your stochastic gradient descent machine learning algorithms in Postgres and run them in Postgres. I'm not sure you should do that, but it's definitely possible.

The whole combination, the algorithm, the implementation, and Postgres as a backing store, gives us a fast and reliable ANN service with 60% precision at 10: if we fetch 10 nearest neighbours with the approximate approach, we get 6 out of the 10 actual nearest neighbours, at a 100-times speed-up, which I think is reasonably good.
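To sketch that two-step flow, candidate IDs from the ANN service followed by business rules and exact distances in Postgres: the table, columns, connection string, and the `euclidean_distance()` SQL function (standing in for the custom C extension mentioned in the talk) are all hypothetical, and the index is assumed to expose an Annoy-style `get_nns_by_vector` method.

```python
# Sketch of the two-step lookup: an ANN index returns rough candidate IDs,
# then Postgres applies business rules and exact distances on just those rows.
# Schema, DSN, and euclidean_distance() are hypothetical stand-ins.
import psycopg2

def similar_products(query_vector, ann_index, conn, n=10):
    """query_vector is a plain list of floats; ann_index is an Annoy-style object."""
    candidate_ids = ann_index.get_nns_by_vector(query_vector, 100)  # rough candidates
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id
            FROM products
            WHERE id = ANY(%s)
              AND in_stock                              -- business rules live in SQL
            ORDER BY euclidean_distance(embedding, %s)  -- exact distance, inside Postgres
            LIMIT %s
            """,
            (candidate_ids, query_vector, n),
        )
        return [row[0] for row in cur.fetchall()]

conn = psycopg2.connect("dbname=shop")  # hypothetical connection string
```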
[Audience] A 100-times speed-up over what?

A speed-up over brute force, which is not a particularly demanding baseline, but that's where we started. And the speed-up we gained allows us to serve results in real time: we don't have to pre-compute, we can just serve it in real time, which is also very nice. And it's all built on top of a real database, so things like concurrency and updating the database are taken care of by people smarter than us. So it works well. Anyway, thank you, and I'd be very happy to take any questions.

[Audience] Hi. I've just built something like this for chemistry, with distances, and I'm doing brute force. I'd like to ask you for an estimate: at what number of entries is this going to get hard? When is my application going to need trees?

I don't know when; it's an empirical question, and it depends on your requirements as well. Let's say you're doing offline processing and you don't really mind that a lookup takes 10 or 20 seconds with the data you have; then you're fine. But when one lookup takes 10 seconds and you need to do that lookup 100,000 times, that's the point where you really want to look into these solutions. If it's fast enough for what you're doing, you don't need it.

[Audience] So what was the pain point in your application? What was too long for you?

If we want to serve web requests, anything like 100 milliseconds is too long for us, and doing this brute force would take anywhere up to 3 seconds and would completely destroy the database. So getting from 3 seconds to under 100 milliseconds, that was the difference for us.

[Audience] Hello, and thanks. I have a question about clustering: are there any algorithms, like k-means and so on, which could be fast and have nice precision?

Yes, so the question is: can you use clustering algorithms to achieve the same sort of effect? And yes, you can. There's an approach called hierarchical clustering, which again builds, say, a binary tree: you take all your data points and split them into two clusters, and those are your two clusters on the first level of the tree; then you go down into each cluster and recursively build more clusters, and keep splitting the tree. That's also an approach. It's not something I have investigated, so I can't give you a good answer on what the performance trade-offs are.

Any more questions? Thank you.
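As a rough sketch of that recursive-clustering idea (my own illustration, using scikit-learn's KMeans; the sizes and leaf threshold are arbitrary):

```python
# Toy sketch of hierarchical clustering as a search tree: recursively split the
# data into two k-means clusters until the leaves are small, then route a query
# down the tree by picking the nearer centroid at each split.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def build(points, indices, leaf_size=50):
    if len(indices) <= leaf_size:
        return indices                                   # leaf: a small bucket of points
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points[indices])
    left, right = indices[km.labels_ == 0], indices[km.labels_ == 1]
    if len(left) == 0 or len(right) == 0:
        return indices
    return (km.cluster_centers_,
            build(points, left, leaf_size),
            build(points, right, leaf_size))

def query(node, x):
    if not isinstance(node, tuple):
        return node                                      # candidate indices to brute-force
    centers, left, right = node
    d = np.linalg.norm(centers - x, axis=1)
    return query(left if d[0] <= d[1] else right, x)

points = rng.normal(size=(2000, 16))
candidates = query(build(points, np.arange(len(points))), rng.normal(size=16))
print(len(candidates), "candidates")
```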