tools for different Python frameworks. When it comes to search engines, it's really difficult to find useful information, especially benchmarks comparing the accuracy and quality of search. That's why it's really difficult to select a search engine for your project, whether you are starting a project or continuing one.
The agenda: a little bit about me, what full-text search is, different full-text search engines like Postgres, Elasticsearch, Whoosh and Sphinx, then search accuracy and search speed, and what's next.
I have expertise in both Python and Golang, and together with my friends I created a startup, [unclear], which helps you build a checkout very quickly. I speak here, and I have a blog.
It's hard to believe, but 18 years ago there was no Google; other web search engines were around back then: AltaVista, HotBot, Inktomi and the rest. What's even more unbelievable is that 26 years ago there was no web search at all. The world has changed rapidly since then: the volume of information available and the bandwidth give us the opportunity to get that information. Unfortunately, the rate at which a human being can consume information has not changed much, and this inevitably transforms search from something only geeks ever cared about into something every single one of us deals with on a daily basis.
Let's start with simple text search. Text search is not a new problem; every day every developer does something like what I'm about to show.
We have the CPython code base, or your project's code base, and we try to find all occurrences of the OrderedDict class. The common tool on every Unix platform is grep, and you can see it finds them in less than three seconds on my laptop. If we do the same task with ack, which is like an improved grep optimized for programmers, it takes a little less than two seconds. My favorite one, though not super fast, is pss, a Python tool for searching through code: less than one second. And my absolute favorite is the Golang-based Platinum Searcher; maybe everyone uses it in Vim or Emacs because it's super fast, well under a second. These numbers are just from tests I made myself.
But okay, that's direct search: you have a pattern and you search for it, and it's a simple problem. What if we talk about full-text search? Full-text search provides the capability to identify natural-language documents that satisfy a query and to sort them by relevance to the query. If you read any book, at the end you can find an index; this one, by the way, is the index of one of my favorite books about Elasticsearch. The purpose of an index is to optimize performance: without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 documents could take hours. The disadvantages of an index are the additional storage required to hold it and the time needed to create or refresh it, because data can change.
Let's imagine a simple example with two sentences, and let's try to build an inverted index for them; that's the common term in full-text search. First we split the content field of each document into words, then we create a sorted list of terms, like you see in the first column (quick, zebra, et cetera), and then we mark each occurrence in each document and the place where it occurs.
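The construction just described can be sketched in a few lines of Python. This is only an illustration: the two sentences, the function names and the simple letter-based tokenizer are assumptions, not the contents of the slide. No normalization is applied yet, so "Quick" and "quick" are stored as different terms.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into word tokens; deliberately no lowercasing or stemming."""
    return re.findall(r"[A-Za-z]+", text)


def build_inverted_index(documents):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    # Sort terms so the index reads like the sorted first column of a table.
    return dict(sorted(index.items()))


def search(index, query):
    """Return {doc_id: number_of_query_terms_found}, best matches first."""
    hits = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, {}):
            hits[doc_id] += 1
    return dict(sorted(hits.items(), key=lambda kv: (-kv[1], kv[0])))


# Hypothetical example documents (assumed, not taken from the talk).
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
results = search(index, "quick brown")
print(results)
```

With these assumed sentences, "brown" is found in both documents while lowercase "quick" is found only in the first, so document 1 ranks ahead of document 2.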
For my example I left out the positions and kept only the fact that a term exists in a document. If we search with the query "quick brown", our inverted index shows us the occurrences of brown and quick: brown exists in both documents and quick in only one. But you can see that I have a bit of redundancy in my index, and that's why we should apply normalization. It means we should lowercase "Quick", reduce plurals like "dogs" to the singular, use the root forms of verbs, et cetera, and maybe use synonyms like jump and leap.
Okay, let's talk about which search engines we have at the moment. There are lots of different search engines, but today we will talk about only four of them: Postgres full-text search, Elasticsearch, the pure-Python search engine Whoosh, and Sphinx.
Let's start with Postgres full-text search. Everybody uses Postgres; it was created by Michael Stonebraker, over eight years starting in 1986. The interesting fact is that full-text search has been supported since version 8.3, which means you can use it in every project, because I doubt you're running anything older than 8.x; the latest stable version is 9.5. There are lots of advantages to this database. Let's see an example. We have a simple query and a simple text, and we run a full-text search over it. In Postgres we use two functions: to_tsvector, which gives the format your data is transformed into, and the special function to_tsquery, and the result looks like this. You see that it returns true, which means our search found a match.
Okay, next: full-text search is all about indexes. If you want to understand how it works, you should understand how indexes work, and Postgres provides two kinds of index.
The first is the generalized inverted index (GIN), which I showed in the example before, and the second is based on the generalized search tree (GiST). The latter is a little bit lossy: the index might produce false matches, because it uses a very limited hash function for the text we search. It means it can represent different phrases with the same ID, so you can get a false match. That's why it's somewhat not recommended. But choosing between these indexes is very simple. When your data is static, meaning it doesn't change often, you can use the first one. If you have dynamic data which changes every day, every minute, every second, and you need to search it, you should use the generalized search tree.
Next, and important, is ranking search results: how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. Sometimes it's very useful, and Postgres provides two C-based functions: ts_rank and ts_rank_cd, where "cd" stands for cover density. I have a small example of how to use it right after this slide.
Last but not least is highlighting results. Every user wants to see what they searched for and where it occurs. For this Postgres provides the ts_headline function, which is very easy: you just call it and it marks your results with some HTML tags, et cetera.
Also very important are stop words. Stop words are words, in English for example, which are so common that they are uninformative, like "it", "is", et cetera.
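To make normalization and stop-word removal concrete, here is a minimal pure-Python sketch. It is not how Postgres implements to_tsvector: the stop-word list, the synonym map and the crude suffix stripping are all assumptions chosen for illustration, and a real engine would use a proper stemmer and dictionary.

```python
import re

# Assumed, tiny stop-word list and synonym map (illustrative only).
STOP_WORDS = frozenset({"a", "an", "the", "is", "it", "in", "of", "over"})
SYNONYMS = {"leap": "jump"}


def naive_stem(word):
    """Strip a few common suffixes; a real engine would use a real stemmer."""
    for suffix in ("es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop stop words, stem, then map synonyms to one term."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue
        stem = naive_stem(token)
        terms.append(SYNONYMS.get(stem, stem))
    return terms


first = normalize("The quick brown fox jumped over the lazy dog")
second = normalize("Quick brown foxes leap over lazy dogs in summer")
print(first)
print(second)
```

After this step both assumed sentences share the terms quick, brown, fox, jump, lazy and dog, so a query for any of them matches both documents regardless of case, plural form or synonym.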
Working with stop words is also part of the Postgres settings. When you apply the to_tsvector function to a column that holds some text, the result is in a special Postgres format where you see only the useful information, with the stop words excluded, because in some cases there is no need to search by those words.
Next, a little bit about what Postgres full-text search offers for Python. If you use Django, there's good news: in Django 1.10 Postgres search functionality has already been added, and it relies only on the Postgres full-text search engine, which means it will be super fast if you use it in your project. For older versions of Django there is a Django ORM extension module. It was written a couple of years ago and works perfectly with old versions of Django. The third option is SQLAlchemy.
Here's an example of how to apply it to your project. If you already have a model called Page, you just create a search index, which is a special field, and override your manager with a search manager, where you add the configuration and the search fields. The auto-update search field option means that on each save, update or delete the index is updated automatically by Postgres. As a result you can run very ordinary ORM queries with the search keyword: you just call search, and here you search for "documentation" and "about". The only limitation, which comes from Postgres and so applies to Django too, is the very simple query construction mechanism: you can use only two Boolean operators, AND and OR. You see that in the second example I combine "about" or "document" or "Django", et cetera.
As for Django 1.10 itself, as you can see, they added a __search lookup by default for each field, so you can use it without installing anything, unlike what I showed on the previous slide. Or you can annotate with a search vector, filter by "cheese", and see the results.
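The two Boolean operators correspond to `&` (AND) and `|` (OR) in the Postgres tsquery syntax, and parentheses group them. The sketch below only builds such a query string; the helper names, the `page` table and the `body` column are hypothetical, and how the string is sent to the database (for example as a parameter of a driver call) is left out.

```python
import re

_WORD = re.compile(r"^\w+$")


def term(word):
    """Accept a single plain word so operators cannot be injected."""
    if not _WORD.match(word):
        raise ValueError(f"not a single word: {word!r}")
    return word


def all_of(*parts):
    """Combine sub-expressions with the tsquery AND operator."""
    return "(" + " & ".join(parts) + ")"


def any_of(*parts):
    """Combine sub-expressions with the tsquery OR operator."""
    return "(" + " | ".join(parts) + ")"


# ("about" OR "document") AND "django"
query = all_of(any_of(term("about"), term("document")), term("django"))

# Hypothetical table and column names; %s is a driver-style placeholder.
SQL = (
    "SELECT id FROM page "
    "WHERE to_tsvector('english', body) @@ to_tsquery('english', %s)"
)
print(query)
```

Keeping the query as a separate parameter rather than formatting it into the SQL text is a deliberate choice here, since user input should never be concatenated into the statement itself.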
It's awesome, because it gets converted to a direct tsquery/tsvector SQL query and Postgres executes it very fast. And examples... yeah, this commit was made, I'm not sure, a couple of months ago; it's super fresh information. There is no documentation about it; you can find it only in the source code, in this commit. Maybe they will update the docs, but I'm not sure it has been done yet.
Okay, let's finish with Postgres full-text search. The pros: quick implementation, you saw it, and no dependencies. The disadvantages: you need to manage indexes manually, because it's not done automatically. It depends on Postgres: if you use MySQL it will not work; if you use another database it will not work. There is no analytics data. What I mean is that I can't get analytics on search from Postgres; I can only search, and that's all. If I want to extract some important data from natural-language text, I can't do it. And the query builder is very simple.
Okay, let's continue with Elasticsearch. Elasticsearch is a distributed, scalable, real-time search and analytics engine. That's important because it enables us to search, analyze and explore our data. It's based on the Apache Lucene search library, which is currently the most advanced and highest-performing one out there. Who uses Elasticsearch? GitHub uses Elasticsearch to query 130 billion lines of code; you do it every day. Stack Overflow combines full-text search with geolocation, which is sometimes very useful. The Guardian, like lots of companies, parses logs with it; Wikipedia provides full-text search with highlighted results; and there are Datadog, [unclear] and others.
The idea of Elasticsearch is very simple. It's not an exact equivalence, but a parallel helps you understand how it works. On one side you have a relational database, on the other Elasticsearch: databases correspond to indices, tables to types, rows to documents, and columns to fields. The most important difference is maybe locks.
Elasticsearch uses optimistic concurrency control. It means that when you change a document in Elasticsearch, it just updates it and bumps the version of the document, and when you search for a document you get its latest version.
For Elasticsearch there are lots of Python clients: the default Python client, a new version made by Honza Král with asyncio, and also the DSL, with which you can build your queries. If you work with Elasticsearch you know how difficult it sometimes is to work with those big JSONs and manipulate them; it's annoying. Honza created the DSL and it looks pretty awesome.
Some examples. You can get data; you can create an index with a number of shards and a number of replicas, which is how you scale it; you can add JSON to the index, which is just how you create data for your index. You can manage stop words: for instance, you can add a list of stop words. You can highlight results, and my favorite feature is that you can choose the tag, because sometimes it's useful to use your own predefined tag instead of the default one. And relevance: you can explain a query and see the weights. I removed lots of details, but it's a big, big explanation of why this query returns these results and what the value of each weight is. I like it: it's difficult to do in Postgres, but here it's really easy to understand and calculate why a result comes first if you, for example, override your relevance function or rank function, et cetera.
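As a rough illustration of those examples, the request bodies can be written as plain Python dictionaries and serialized to JSON. This sketch deliberately makes no client calls; the field name `title`, the filter and analyzer names, and the stop-word list are assumptions, and the exact body layout may differ between Elasticsearch versions.

```python
import json


def index_settings(shards, replicas, stop_words):
    """Body for creating an index with shards, replicas and custom stop words."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
            "analysis": {
                # "my_stop" and "my_analyzer" are hypothetical names.
                "filter": {
                    "my_stop": {"type": "stop", "stopwords": list(stop_words)}
                },
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_stop"],
                    }
                },
            },
        }
    }


def highlighted_match(field, text, tag="mark"):
    """Search body: match query, custom highlight tag, score explanation."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "pre_tags": [f"<{tag}>"],
            "post_tags": [f"</{tag}>"],
            "fields": {field: {}},
        },
        "explain": True,  # ask for a per-hit breakdown of the relevance score
    }


create_body = index_settings(3, 1, ["and", "is", "the"])
search_body = highlighted_match("title", "python search")
print(json.dumps(search_body, indent=2))
```

Building the bodies as functions keeps the nested JSON in one place, which is the same pain point the DSL library is meant to address more thoroughly.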
Okay, next, very quickly: Sphinx. I only put the differences on the slide. Sphinx is written in C++ (sorry, C++, I found a bug) and it uses, for example, MySQL as a data source. Compared with Elasticsearch, it's not written in Java, and Sphinx assumes that you already have a MySQL database, with everything else built around MySQL. But that's not mandatory: you can use Postgres, you can use any provider. About the Sphinx search server, the terminology differs a little: a DB table is a Sphinx index, DB rows are Sphinx documents, and DB columns are Sphinx fields and attributes. It's not similar to Postgres; maybe it's similar to Elasticsearch. The query language is not SQL, it's the Sphinx query language (SphinxQL), but it's very similar to ordinary SQL: you select from test1, which is the index name, where MATCH 'EuroPython', and it looks something like this. I mean, I put only the differences; everything else is very similar to Elasticsearch.
And last but not least, the pure-Python Whoosh, created by Matt Chaput. His idea was: okay, my clients have no ability to install Java, so he created a full-text search engine in pure Python. It's not super fast, but compared with other pure-Python search engines it is super fast. It has pluggable scoring algorithms, and you can add and configure lots of stuff. By the way, you can find more information in his talk; I'm not going to repeat it. Some small examples: Whoosh depends a little bit on Postgres, because it uses, for example, the Postgres stop words and creates a frozenset from the Postgres stop words, but you can choose any set of stop words. I mean, this is just an example from the source code.
You can also highlight search results; assume we have hits in the title. And the most interesting part is that it uses the Best Match 25 algorithm (BM25), which, by the way, is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It's a commonly used algorithm, and it was developed in the '90s, or the '70s, I think.
Now, I created a comparison table for you, because when I started working on my first project with full-text search it was difficult to take in all the information and structure it. That's why I created a table where I can look things up. You see that most of the search engines support Python 3; for Sphinx there is an open PR. There are lots of clients; you can use this table as a reference. It's interesting that Postgres and Elasticsearch both have async clients, while Sphinx and Whoosh don't.
I added Django just as an example: if you use Django, sometimes you need an ORM-like layer, et cetera, and that's why you may find Haystack very useful. Speaking of Haystack: it provides modular search for Django. It puts one API layer over a couple of different search engines and gives you Django-ORM-like functionality for search. But I can't believe it's really useful: you have a project, you build full-text search, you apply Haystack, and then you decide, okay, tomorrow I'll use Elasticsearch, today Solr, the day after tomorrow Whoosh, et cetera? It's strange, because only a very, very simple set of features is common; the other features differ between engines. That's why I call Haystack a Swiss Army knife: it's useful, but not for a specific task. I put together some pros and cons of Haystack for you. Yes, it's easy to set up, it looks like the Django ORM, it's search-engine independent, and it supports four engines now. If we go deeper, the SearchQuerySet API is very poor. I mean, really poor: you can't create very smart queries.
It's difficult to manage stop words, because you need to go to the search engine backend and do it by hand, by yourself; Haystack doesn't take care of it. You lose performance, because results have to be converted to SearchQuerySets and then you work with those, maybe in memory. And it's model-based: most full-text search engines promote the NoSQL concept, where you have an object or a document, not a model, not a single table, so that makes things a little bit difficult. And the ugliest thing about Haystack, I think, is the lots of hard-coded search engine settings. If you open the Haystack source code, you can find hard-coded Elasticsearch settings, hard-coded settings for Solr, et cetera. And it's annoying: if you want to change something, you need to change Haystack, or patch it, or something like that.
Let's continue with my table. The next difficult and interesting thing is which index each search engine uses. I put that Elasticsearch uses Apache Lucene; you can find more information about it; it's a standard inverted index. As I said before, Postgres uses the generalized inverted index and generalized search trees. Sphinx has three options: disk indexes, real-time indexes and distributed indexes. By the way, a distributed index is just a container for lots of disk and real-time indexes; it's how you scale Sphinx. And Whoosh uses a very simple index folder. As I said before, the guy who created Whoosh said: you have only Python and a folder, without any database, Java, et cetera. That's why he uses a simple approach.
The last column is interesting. Sometimes, when you have a database, you need to search in memory without creating an index, and that's possible only with Postgres. I like this feature because you can use it on any database, with no need to create anything. If you want, you can create an index, but you can search without one. With all the other search engines you need to get the data from the data source, put it into an index, build the index, and only then can you search. But Postgres can do it in real time for you.
The next interesting thing is ranking, relevance, et cetera: which probabilistic algorithm each engine uses for search. Elasticsearch uses the very common term frequency / inverse document frequency (TF-IDF). It weighs how often your term occurs in a document against how often it occurs in the whole document database. As for Postgres, we already talked about ts_rank_cd. It's interesting that you can pass some weights to ts_rank_cd as an input parameter, but those few parameters are the only way you can influence how the ts_rank_cd formula calculates the rank. Sphinx is cool because it has lots and lots of variants. By default it uses two factors. The first, the major part, is the phrase proximity between the document text and the query, based on something like the longest common subsequence; the second is the well-known Best Match 25. And Whoosh uses what is, from my point of view, the smartest relevance, because it's an improved Best Match 25. What's also interesting is that in Whoosh you can replace the relevance function with any other, and Sphinx has a big table of formulas; I mean, you can configure it in ways you can't in Postgres or Elasticsearch.
As for configuring stop words, you can do it in all the engines. You can highlight search results in all the engines. These are the common features that you need. Sometimes it's useful to have synonyms, and you'll find that all these engines support synonyms except Whoosh, but there you can do it manually: replace words, or create a dictionary which associates one word with a set of synonyms.
About scaling, I would say the most scalable is Elasticsearch, because it works out of the box and you can just use it. For Postgres you should think about partitioning, table inheritance, et cetera. About Sphinx, I already said that it uses distributed searching: you can include lots of indexes in a distributed index, and that's how you do it manually. Whoosh does not support any scaling.
And finally, I would like to present some load tests that I ran on real production data.
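As a side note on the ranking functions compared above, here is a minimal sketch of the textbook Okapi BM25 formula in pure Python. It is not the exact scoring used by Sphinx, Whoosh or any other engine mentioned here: the defaults k1=1.2 and b=0.75 are commonly quoted values used as assumptions, the IDF uses the non-negative "+1" variant, and the tiny corpus is made up.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    documents is a list of token lists; returns one float per document.
    """
    if not documents:
        return []
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1
            )
            # Term frequency saturates (k1) and is length-normalized (b).
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical, already-normalized corpus.
corpus = [["quick", "brown", "fox"], ["brown", "dog"], ["lazy", "cat"]]
scores = bm25_scores(["quick", "brown"], corpus)
ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
print(ranked, scores)
```

The first document matches both query terms, including the rarer "quick", so it scores highest; the second matches only "brown"; the third matches nothing and scores zero.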
I have one million music artists, I put them into each search engine and I search, because most of the load tests I found for search engines use something like white noise: they generate random combinations of letters and search for them, which doesn't make any sense. The performance results are interesting. With the data in one table and an index created, Postgres returns in four milliseconds; that's the latest version I could find, 9.6 beta or something like that. Elasticsearch returns in nine milliseconds, which is also pretty awesome. Sphinx returns in six milliseconds, but I'm not sure I configured it correctly, so some of the results may not be very meaningful. And Whoosh has lower performance. The only question is what happens when you have more data than fits in Postgres. As my next task, I plan to run smarter queries, and I have a database with 300 million records, which I'm not sure I can put into one table in Postgres, et cetera. So yeah, maybe the results will be different.
To wrap up, I would like to recommend some books which I found very useful: about Elasticsearch, about Lucene, about Sphinx if you're interested in Sphinx, and a very cool book by Stonebraker, called the Red Book, about database systems. I also created a list of references for you, because it's really difficult to share the details of each index in a talk; you can find them through some very useful links and read about them. When you're stuck and your customer decides, okay, relevance should work, I mean indexing should work this way or that, you can read about each index and find out in which case your index will be more efficient. The same goes for ranking; ranking is a really difficult part, so I put it in the links as well. You can read about each scoring function, how it's calculated, et cetera. Because performance depends on two big factors.
First, the ranking algorithm, because you have to calculate ranks, and second, indexing: how you build your index. And thank you; you can find the slides at this link. Thank you for your attention, we're hiring, and questions, please.
Well, any questions?
I've got a question about operators in Django full-text search. You mentioned that there are only AND and OR operators. Can we combine them? I mean, John AND Doe, OR Foo AND Bar.
Yes, you can. By the way, it's not a Django feature; I'll show you, maybe on this slide. It's a feature of Postgres. Okay, other questions? Yeah, please.
Hi, what's a good way to compare the performance of different search engines, not in terms of speed of response, but in terms of the quality of ranking?
Yeah, I understand. Thank you for the question. What I do on an everyday basis: I work on our application, which is not just full-text search through data. We try to match users, et cetera, by their interests, and that means ranking is very important for me. I have lots of tests for that, where I build very big queries with AND and OR, with synonyms, without synonyms, et cetera. I prepare the expected results manually, run my tests and look at the results. So yeah, unfortunately it's only manual work. It depends on your real task. Other questions?
Apart from Haystack, do you have any recommendations for Django and Elasticsearch?
Django and Elasticsearch?
Yeah, for combining those two.
From my experience, it's just using the Python client. I mean, you can create a management task which refreshes your index. Maybe you plan to store your data in Postgres or MySQL, and on some action you need to refresh the index and then search from Elasticsearch. I found that a great solution is to just use a plain Python client, elasticsearch-dsl or elasticsearch-py, which Honza mostly maintains.
You only add a manage.py command to refresh the index, create asynchronous tasks for refreshing the index, et cetera. If you plan to use Haystack, there is, I don't remember the name, an interesting library which overrides some settings from Haystack, so you can add your synonyms, change configuration, et cetera. I recommend it if you plan to use Haystack. But the problem with Haystack is that it doesn't support the latest version of Elasticsearch, and you will be stuck on, I don't know, 1.7.5 or something like that. Yeah, yeah, please.
Is there a reason why you haven't talked a lot about Solr?
Could you please repeat?
Is there a reason why you haven't talked a lot about the Solr search engine, S-O-L-R?
Ah, Solr. I have no experience with Solr, but as far as I know Solr also uses Lucene. The only difference I know of is that Solr is not easy to scale. You can, but it's not so easy. If you already use Solr, maybe you should continue. But for new projects, I think, I found Elasticsearch more useful for me, something like that.
Okay, thank you very much.
Thank you.