 Yes, thank you, hello everyone, my name is Victor and I come from Sweden but at the moment I'm working in China for a Chinese dating company called Tantan. So let's start. At Tantan we use Postgres for basically all our data storage, require usage. Tantan är en dating app som du har på telefonen. På bakgrunden har vi en kod i Go och på Postgres som database. Och vi har en post-GIS på topo av Postgres. En av Tantan's core features är att vi använder user för dig att se. Och för att Tantan är en mobil app, vi rankar ut de utgifter som vi ser till dig. Så vi vill visa utgifter som är nära. Om jag är nära till dig, så är det säkert att jag vill ha höre utgifter än om vi är väldigt fara från varandra. Och för att vi använder de här utgifterna, vi har en väldigt stor sikrelselekt staten som visar ut många utgifter för dig att se. Och för att vi använder de här utgifterna är det ganska svårt faktiskt. Och det är för två reasons. Och den nummeren är hur vi ska göra det. Jag har bara sagt att vi vill ranka utgifter som är baserade på ditt distans från dig. Men det finns också många andra utgifter som kan användas för att ranka utgifterna. T.ex. de är populärer, de är ägare och de är gender. Men för detta tal, jag vill inte tala så mycket om den exakta utgifterna för att ranka utgifter. Men jag vill tala om andra utgifter, nämligen hur vi kan göra det så snabbt. I Tantan har vi ungefär 64 server nu, tror jag. De gör den här rankningen. Och i våra utgifterna gör vi ungefär 1 000 server per sekund till utgifterna av utgifterna. Så det ljudet på systemet är ganska hög för den här rankningen. Och tror jag att de har 60 server eller 64 server. De spender runt 50% av de här utgifterna för att runna den här rankningen. Så vad är det nya, exakt funktionalitet som jag hade på det här talet? Det här är det här ljudet här. Det är en ny operatör som är available i post-GIS 2.2. Och med det kan du ranka eller byta 4 dimensioner, utan de 2 dimensioner som du har i den tidigare versionen av post-GIS. Och den här rankningen kan vara done using index, så det är ganska snabbt. Och jag kommer att tala mer om det här ljudet i några ljuder. Så min plan för idag är att jag kommer att visa hur det kan vara done. Och jag kommer att använda Tantan för att illustrera utgifterna för detta. Men förstås är det inte respektivt till våra utgifter. Jag tror att det kan vara utgifterna i många andra applikationer. För exempelvis vill du ha de 100 flesta kvar som är av en certain h, kanske mellan 10 och 15 år sen kan det potentiellt också vara utgifterna. Så för att visa hur indexen kan vara utgifter, har jag tre utgifter som är utgifterna för utgifterna, för utgifterna. Och jag pickar dem för att de har olika utgifter, de arbetar olika, så att du kan få en fel av olika utgifter. Så den första är popularitet, så vi har äg, och den första är aktiviteten, den senaste aktivtidning av utgifter. Och jag kommer att tala mer om de här tre utgifterna. In a minute. Så om vi börjar med att kolla på de två dimensioner, den normala ordning just basen på lokaler, så har vi basically X här, med den långa utgifterna och vi har latitude Y, det är inte väldigt svårt. Och en liten problem kan vara, om du är väldigt nära norrf eller särskilt, eller om du är ute i den specifika ögonen, för då liten och långa utgifter kommer att röra om. Om du kommer till den mest norrf av ögonen, så blir det latitude. Ja, det rörar om det. Och samma med långa utgifter, om du passar om ögonen, så rörar det från positivt till negativt. Men för Tantan's use case, vi bara känner om, basically we don't have any users in the north or south pole, and we don't have any users in the middle of the ocean. So it's not very relevant for us, which is very good. So to show you this more in detail, I have created a little table here, that contains, I prefilled, in my examples, I have prefilled it with around 1 million users from our production system. And it basically contains, we have a user ID, birthday, the location, active time, and something called popularity. And the interesting thing here is the location that is geometry type, which basically store this X and Y coordinate in a very basic way. And finally we create an index of this location so that we can use two-dimensional indexing on this column. I should also mention that post-GIS have another data type called geography that will take into account the roundness of the earth. But it's a little bit more limited in the functionality supports and it's not as fast. So for now, for here I will just go with geometry, and it's good enough for these kind of use cases. Maybe if you want to deal with, for example, plotting a flight path from one city to another, yeah, then it will be more important with geography. But for this user ranking geometry is fine. And then, selecting from this table becomes like this. So we select from this table and we order by how far away they are from us. And since we are in Singapore, so I just use the X and Y of Singapore. And then a little limit of 10. So we take the 10 closest users to us here. And then by using this operator this query will operate very fast. And it runs in about less than one millisecond for my million test users. So yeah, this basically sets the ground for my coming examples. So let's start at the first case, popularity. In Tantan we think that the more popular users are basically better users in our system. And also our research has shown that yeah, it's better to make the popular user a little bit more popular by ranking them higher, so they are shown earlier. And what do I mean by popularity? Yeah, we define popularity as the number of likes you get divided by the number of likes plus the number of dislikes that you get. So if you have 10 likes and 0 dislikes then you are popular 1. And if you have 0 likes but 10 dislikes then your popularity is 0 basically. And the nice thing with this popularity is that it becomes a number between 0 and 1. So it's pretty easy to reason about what happens when we use this later. So what does this query look like when we want to take advantage when we want to have the popularity affect our ranking of users. Well, the normal way of doing this that we actually currently do this in production is by first selecting a pretty big amount of users in this case 100. And then in the second part of this query we take those 100 users and order them by the distance times the popularity and then a limit of 10 which means that among these 100 users we pick out the 10 best ones based on the distance times the popularity. And since popularity is better when it's higher and distance is better when it's smaller I just made 1 divided by popularity plus 1. And yeah, plus 1 is in case popularity is 0 so we don't have any division by 0 problems. And now comes the question how can we make this into a 3D query instead so we can use index for this second part here. Well, the key insight is that we can put popularity on the Z axis in this way. We have longitude, we have latitude and then we have popularity going up transformed into SQL code it becomes something like this. We add a column called location popularity we use the same geometry data type as before but when we fill it with data we put, yeah, we use this make point then we take the X and the Y and then we insert our popularity there. And then finally in the bottom we create an index on this new column and the important thing is here that you need to use this you need to specify that you want to be that you want to use more than two dimensions and you can also see here that I have scaled the popularity by multiplying it with 0.01 that is so that later when we do our queries we won't well basically of course we want the popularity to play one part but we also want to have the distance to play its part so we need to scale it down a little bit because, yeah, so it comes into reasonable size compared to the X and Y so basically this scaling thing that I did here multiplying it by 0.01 I just tried it out to figure out the proper number and I think this is the best method because then you can try different values and see which one is the best one which one gives you the ordering that you want and to use the four-dimensional query or three-dimensional query we have to use this new operator that I showed a couple of slides back and this is directly from the documentation of this it says that this operator returns the end the distance between the centroids of A and B bounding boxes and then this operator will make use of any indexes and then this also has to be used in the order by part of the query but there is no problem for us since, yeah, we will just use it there anyway and the fact that it returns the distance between the centroids of the bounding boxes is also no problem because we have basically point locations for our users and the bounding box of a point location yeah, that's also a point so there's no problem so now when we have our operator we have our data then we can rewrite our query into this we select from users we order by and then we have our new column location popularity the operator and we make a point where we are x&y and then we add 0 because we want the popularity to be as close to 0 as possible and then limit by 10 again so in this case the query become pretty easy to understand and read but a little bit more difficult to reason about because you can't immediately see how the popularity affect this order by by just looking at this query so moving on the second case I want to talk about is age of course when you have a dating application the age of those people that you look at is very important I think most people have some age requirements for their potential future boyfriends or girlfriends but in contrast to popularity this is a strict filtering criteria we don't use the age to show younger people or older people earlier instead we are just interested in either you are within this age range or you're not so if you just look at our table again we can see I had a birth date as a date column here which we can use for this query so the normal way of doing this query would be to just add a where clause where age of the birth date is between 20 and 30 years and actually this looks this works very fine it works perfect in most cases the problem is that it can create a problem in some special case and the problem can be seen on this slide some users like to look at strange ages in this case 74 and 75 year olds or actually yeah between 74 and 75 and this creates a pretty big problem and you can easily see this problem in this next slide here I have done analyse on the query to show what it does and we can see here that it removed almost 200 000 rows by this order by no by this filter here and then the execution time was 500 milliseconds and this is of course not very good it's actually a pretty big problem and a real problem on my slide here I have a screenshot from WeChat where we had a small conversation this was from the day after we allowed 16 and 70 year old people to sign up on Tantan basically it started out like this a lot of suggested query seems to run there some suggested have been running for 2 minutes la la la let's kill the queries obviously not a very good situation to be in so what can we do about this well when we actually had this problem in our production environment a couple of minutes later where we had time to look into why is this happening we noticed that the problem was that some of these 16 and 70 year old people they were just interested in looking at other 16 and 70 year old quite reasonable but the problem becomes that if we have 100 of those in the database and they are spread out geographically then basically the big query has to look through all of them and become very slow so what could we do to prevent this well one solution would be to restrict it so you're not allowed to look at certain ranges maybe you're not allowed to look for 74 year old people you have to look for I don't know 60 to 80 another solution would be to add a distance restriction to our query you could do it so that yeah if you want to look at this very strict search criteria then you only look 10 kilometers 100 kilometers away from you and not more and then the third option would be to add the age to the geometry index that we have so obviously when we had this problem in production we didn't have time to do any fancy solutions so what we did was we added distance restriction so that yeah if you are 16 70 year old yeah then you are only allowed to see those uses within now I don't remember the exact number but maybe 50 kilometer away from you and that took care of the problem for that moment but for this presentation of course we want to add the age to the yield index so how can we do that well just as for popularity we add a column location age we fill it with data and then we create an index and when filling it with data we take the X we take the Y and then we take the year from your birth date and divide it by 100 000 and this is because later when we do the ordering we don't want the age to affect the ordering but we want to have the age inside so that we can restrict the query to only look at a specific age span so if we move on to the query it looks like this we select from users and here comes the interesting part where this location age is within this triple end operator means that location age has to be within and this bounding box that we create and here I have defined the bounding box to include every possible X and Y value we see here from X can be between 180 and minus 180 Y can be between 90 minus 90 så it includes all of them but then for the year your birth date we only include those 74 and 75 year old people and as you see I have to use the same scaling factor here and then of course we use the same ordering with this four dimensional the four dimensional ordering and if we look at the query plan we see that our problem has been fixed the execution plan over here is down to 19 and we can see that we use index condition to do the filtering so yeah everything looks fine yeah why do we use a bounding box and 90, 180, 90, 90 shouldn't it be restricted to 90? yeah, for my example here I basically took all the positions because I didn't want to think too much about it in our application actually we have a setting where you can decide how far away you want the users to be from you so you can be let's say 10 kilometers away 20 kilometers away you can decide for yourself what range you want to look at and yeah, for this example I just took everything because then I don't, otherwise let's say that I only want to see the users within 10 kilometers from me then it's a little bit tricky to calculate what these numbers should be in my box here so then I would have to explain that on a slide as well that's the reason why I took all of them yeah and and if we look at the actual output of this the real users age that were returned by my query they look like this and the interesting thing here is that it actually returned a 75 year old person and a couple of users that are 73 year old and this would be a problem maybe not for those users that are looking for 74 year old because I would assume that they can also handle a 75 and 73 year old guy but of course if you are looking at let's say you are looking at 18 plus and then you get a 17 year old yeah that's not so good so we need to do something about this and I think the reason for this problem to happen is that the geo index is lossy and also we scale the age to a very small number and those two things probably create this problem that is not exact so what we need to do is basically we add back the old where statement that where age between 74 and 75 and this will fix the problem so let's move on to the final property that we will look at activity so activity is the last active time and this is interesting because if I like you and you like me and yeah then we can start chatting and then it's of course good if we are both active at the same time I don't want to if you were active in our app one month ago if I send you a message you probably won't reply to me so that's why we also want to write by activity and the old style the normal style of doing this query will be just as popularity we select out 100 users ordered by the distance and then among those 100 users we take the distance times the activity time the interesting thing here is that what is the difference between popularity age and time the activity time popularity was a number between 7 and 1 for everyone age is also a very bounded problem it's between in our case between 16 and 100 but yeah I imagine that most use cases have also a bounded age however time increases infinitely in one direction it always goes up so when we add this column and later when we do the query we have to be a bit careful but adding it is quite simple we just well put let's see what we do here we take the number of seconds divided by 60 to get minutes and then we scale it by dividing by 100,000 10,000 this should give it a pretty similar strength ranking strength as the location as you will see and then to select with this time parameter the activity we can do it in this way basically we want the activity to be as close to now as possible so if I just quickly go back here we define that activity time the set goes up up up all the time number of seconds so it will be a pretty big number and it goes up and then now if we take the number of seconds from the epoch to now it's an even bigger number so that means that those uses that are closest to this big number should be shown first and then of course we have the Singapore location again so what have we looked at so far if we take a little bit of a review well, first we looked at the X & Y location how to do the very basic query then we talked a little bit about popularity age and activity but now of course maybe you are thinking hmm, we talked about two dimensions and we talked about three dimensions where is the fourth dimension well since I now explain the basics of how to add one of these three to the as the third dimension I think you can pretty easy understand what my plan is for utilizing four dimensions basically we just picked two of these popularity age and activity and in my next slide I have picked popularity and age and just as before we first add a column, geometry with location, popularity and age we fill it with data and finally we create an index and popularity, yeah, that was calculated in this way and then the age in this way, so we just put in there popularity becomes the Z and the age becomes the M for the fourth dimension and then for the actual query it's pretty simple the ordering here we want the popularity to be close to zero again and for the age well basically we don't care so we put zero there as well and then for our search box of course we must allow for any popularity in that one so yeah, here I just took from 100 to minus 100 but since popularity can not be such a big number we could have used smaller numbers as well so now when you have looked at this maybe you are curious do we get any benefit from doing these pretty complicated things well I made some benchmarks in my office yeah, this is from a table I used around 1 million rows from our production and then ran these different queries in the bottom here we have the normal location query it's extremely fast and then the other queries and basically what we can see here is that yeah, we get a pretty big benefit in the location plus age where age was the 74 year olds because yeah, actually it doesn't even fit on this in this graph here but in all the other cases our query becomes quite a bit slower yeah, we have the four dimensional queries 50 and this one here location and popularity 33 34 basically everything except the location activity query becomes very slow I would say and as I said in the beginning of our talk we have 64 servers doing this query and if we increase the query time by like 10 or even more then maybe we can't afford 10 times more servers so this is a bit of a problem actually so the query times did not really impress us here maybe if we look at the index it gets more interesting this is the different indexes that we created in blue I have the indexes when they are fresh and in orange when I made them a bit bloated I updated every row in the table yeah and then it becomes like this and we can see that the index sure it increases a little bit but the difference between two dimensions and three dimensions is not that big so if the query times were okay then these indexes they would be precise would be fine we need a little bit more RAM but yeah it would be okay I guess so then we come to the conclusion of this so for basically for popularity we can see that maybe it improves the ranking a little bit because we avoid the case where maybe the most popular user is a little bit more far away from you and not among the 100 closest one however it makes the ranking formula much more difficult to reason about because now it's like hidden away in that geometry table geometry column it also makes the query pretty much slower for age this is maybe the success trial because it actually fixes our main problem here that if you have an outlier age you only want to look at 74 year olds then it actually fixes that case and make it fast however the average query time also becomes slower and you have to make special care so you don't disturb your normal ordering and of course you need that extra care to make sure you don't return rows that you don't want and yeah the story is similar in time that yeah maybe it improves the ranking a little bit if the most nearby users were active a long time ago yeah then this will help a little bit on the other hand it also makes the ranking formula quite difficult to reason about if the query is slower if we move to the next slide and my final conclusion of this is that it's quite a lot more difficult to use this free and four dimension things for this kind of like hacky usage that I tried here and I guess if you went to the keynote opening of Foss Asia he said one of the speakers said something like yes you don't learn from success it's from the failures you learn things and I guess maybe this is connected to this yeah basically yeah if you want to use this I think the best path forward is to at least try it out quite carefully and test a lot first yeah that's basically it for me any questions yeah I have two questions please roughly how many users and roughly how fine a grain on your location like is it street address or is it town no we have the location reported by your phone so basically so that is basically like I don't know sometimes it's 100 meters sometimes it's 10 meters how many users we have around 2.5 3 million users per day thank you actually i my experience with trying to search for especially we are actually working with the recurve volume of data trying to search what we are doing is actually we are trying to use the political boundaries or geographic boundaries and basically the boundaries to find the search and just do something like that all right then you can minimize the yeah of course for this presentation I just focus on this free for dimensional indexing things in Tantan for example if you have already looked at someone then you have to filter out those users if I only want to look at girls and you are a guide then I shouldn't see you but if I like you anyway and then for you maybe you are not interested in looking at me basically the whole query becomes pretty complicated it helps in some cases but at the same time it can become troublesome basically we update the location every five minutes or if you move more than 200 meters I think so basically every request that the phone is doing to our backend servers sends the location as well and then we check if is this new location like did you change your location and if you did then we update it this is for us we encounter some issues that maybe after reading certain region or maybe squeeze in the bound box the problem for us becomes that if you have very restrictive searches for example this 16-70 year old if you are the one of the only 16 year olds full of China and there are 1 million other users in China then of course we want to give you something to look at so then we can't really restrict your search range too much because it's not good for the user experience because if you don't get any users to look at then you can't do anything with that ok, that was a great presentation Victor, thanks a lot