Thank you. Who here is using Elasticsearch already? Awesome. Elasticsearch is becoming quite popular these days, whether it's for backing your app's search, your site search, or having your application and server logs all in one central place. Elasticsearch is gaining lots of mindshare. However, as a search engine, it's quite different from more traditional data stores. So this talk is about how a search engine works, and how a distributed one like Elasticsearch works in particular.

My name is Alex. I work for Found. We do hosted Elasticsearch as a service. My background from university is in search, and that's mostly what I've been doing ever since. Through Found, I've been in contact with hundreds of developers, and I have an impression of what kinds of challenges they face when they go beyond the basic usage of Elasticsearch. So this is the sort of background theory I've had great experience sharing with other developers. The kinds of questions you'll hopefully be better able to deduce the answer to are things like: why isn't my search returning what I expect, even if I search for exactly the same text as in my document? How can it make sense that deleting documents doesn't immediately shrink the index, but adding documents can cause it to be smaller? And why does Elasticsearch use so much memory?

Before I get into the good stuff, I just want to set some context around what we're going to talk about. This is sort of like an agenda in reverse: I'm going to first go in, and then back out later on. When you work with Elasticsearch, you have a cluster of nodes. Within the cluster, you have lots of Elasticsearch indexes that can span multiple nodes through shards, and a shard is essentially a Lucene index. Lucene is the full-text search library Elasticsearch is built on; Elasticsearch makes Lucene's awesomeness available in a distributed setting. So this talk is also a lot about how Lucene works.
And lots of Elasticsearch documentation sort of assumes some familiarity with Lucene as well. Within a Lucene index, you have segments, which are sort of like mini-indexes. And within the segments, we have certain data structures like an inverted index, stored fields, document values, and so on. This is where we'll start.

The inverted index is the key data structure to understand when you work with search. It consists of two parts: the sorted dictionary, which contains the indexed terms, and, for every term, a postings list, which is the list of documents containing the term. When you do a search, you first operate on the sorted dictionary, and then process the postings. So if you have this quite simple document, you can index it by first lower-casing the text, removing some punctuation, and splitting, or tokenizing, on whitespace. When you want to search for "the fury", for example, you first find the terms in the dictionary and then intersect or union the postings, depending on what kind of search you want. This is quite a basic example, but the principle is the same for all kinds of searches: first you operate on the dictionary to find candidate terms, and then you operate on the postings.

So the terms you generate, the ones that end up in your index structure, decide how you can search. Therefore, how you analyze and process the text is key when you work with search. You really need to understand the text processing that's happening. For example, if you want to do a prefix search, like in this case finding everything starting with a "c", or in a more realistic case something like autocompletion, you can easily do so by doing a binary search in the dictionary. But if you want to, for example, find every term containing the substring "hour", you essentially have to go through every term in the index. This is quite expensive and doesn't scale. But it's what happens if you, for example, wrap wildcards around your search.
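A minimal sketch of these ideas in Python may help. The documents and tokenizer here are made up for illustration, and real analysis chains and dictionary encodings are far more sophisticated, but the shape is the same: a sorted dictionary of terms, postings per term, intersection for AND searches, and binary search for cheap prefix lookups.

```python
import re
from bisect import bisect_left

def analyze(text):
    # Crude stand-in for a real analysis chain: lower-case and
    # keep alphanumeric runs (tokenize, drop punctuation).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # The inverted index: a sorted dictionary of terms, and for each
    # term a postings list of the documents containing it.
    postings = {}
    for doc_id, text in enumerate(docs):
        for term in analyze(text):
            docs_for_term = postings.setdefault(term, [])
            if not docs_for_term or docs_for_term[-1] != doc_id:
                docs_for_term.append(doc_id)
    dictionary = sorted(postings)
    return dictionary, postings

def search_and(postings, query):
    # AND search: look up every query term in the dictionary,
    # then intersect the postings lists.
    sets = [set(postings.get(t, [])) for t in analyze(query)]
    return sorted(set.intersection(*sets)) if sets else []

def terms_with_prefix(dictionary, prefix):
    # Prefix search is cheap: binary-search the sorted dictionary to
    # the first candidate, then scan while the prefix still matches.
    i = bisect_left(dictionary, prefix)
    out = []
    while i < len(dictionary) and dictionary[i].startswith(prefix):
        out.append(dictionary[i])
        i += 1
    return out

docs = ["The fury of winter", "Winter is coming", "The sound and the fury"]
dictionary, postings = build_index(docs)
```

With this index, `search_and(postings, "the fury")` intersects the postings for "the" and "fury", while a substring search would have no choice but to walk the whole dictionary.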
So the right approach in this case would be to generate the proper terms, and there's lots of different things you can do. When what you have is the inverted index, you want to transform the search problem until it looks like a problem where you have to find some prefix. If you want to search for suffixes, you can index the reversed text and search for the reverse. For things like geo locations, Lucene will convert the data into a geohash, where a longer shared prefix means more precision. And something similar is done for numerical data, because just indexing the strings "1" through "3" doesn't really allow for good numerical range searches. So even things that don't appear to be about string prefix lookups get converted into them.

This ranges from the rather simple to the mind-bogglingly complex, which we won't really get into, but it's an interesting story about how some really bright people came up with using Levenshtein automata to find misspellings in a really efficient way. They found a Python library that they used to generate some Java code. They didn't know exactly what was going on, but the tests proved it worked, and the benchmarks said it was like 100 times faster. By now it's been cleaned up, but it's just an example of the really hardcore things Lucene will do to make things insanely fast. So when you work with search, text processing is really important.

The inverted index is not very useful, however, when you want to look up a value given a document, like: what's the title of document number 2? To do that, there are other data structures, like stored fields, which is essentially a simple key-value store where you have some data blob that you want to retrieve when you render the search results. By default, Elasticsearch will store the entire JSON source using this.
But even this kind of structure isn't very helpful when you need to read millions of values for a field, such as when you sort or facet or aggregate, because you would be reading lots of data that you don't really need. So there's another structure called document values, which is sort of like a columnar store. It's highly optimized for storing values of the same type, so it's quite useful when you want to aggregate or sort on millions of values. If you don't specify that you want these document values, Elasticsearch will use what's called the field cache instead, which means that it will load all the values for the field in the entire index into memory. That will be quite fast to use, but it will use tons of memory.

So these data structures, the inverted index, stored fields, document values, and certain caches, are chunked up into what's called segments. When Lucene searches across an index, it searches all the segments and merges the results. There are a few properties of segments that are quite important. First, they are immutable; they never change. This means, for example, that when you delete a document, there's a bitmap that marks the document as deleted, and Lucene will filter it out for every subsequent search, but the segment itself doesn't change. An update, for example, is essentially a delete followed by a re-index. So keep that in mind if you store things like rapidly updated counters in your index. On the upside, however, Lucene can use all the tricks in the book to compress things. Lucene is really great at compressing data. And as it turns out, segments are a great scope for caches, and we'll get back to why.

These segments get created in one of two ways. First, as you index new documents, Elasticsearch will buffer these documents, and then every refresh interval, which defaults to every second, it will write a new segment, and the documents will become available for search.
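A toy sketch of the immutable-segment idea, with deletes as a bitmap and merging as the point where deleted documents actually disappear. The document texts are made up, and a real Lucene segment is a set of on-disk files, not a Python list, but the lifecycle is the same:

```python
class Segment:
    # An immutable chunk of the index. Documents are frozen at creation;
    # a delete only flips a bit in a separate "live docs" bitmap that
    # every search consults.
    def __init__(self, docs):
        self.docs = list(docs)
        self.live = [True] * len(self.docs)

    def delete(self, doc_id):
        # The segment data itself is untouched -- the document is only
        # marked as deleted and filtered out of subsequent searches.
        self.live[doc_id] = False

    def search(self, term):
        return [i for i, text in enumerate(self.docs)
                if self.live[i] and term in text.split()]

def merge(segments):
    # Merging writes a brand-new segment; documents marked as deleted
    # are finally dropped for good, which is why the index can shrink.
    kept = [d for s in segments for i, d in enumerate(s.docs) if s.live[i]]
    return Segment(kept)

seg = Segment(["holy grail", "black knight", "holy hand grenade"])
seg.delete(0)
```

After the delete, `seg.search("holy")` no longer returns document 0, even though its bytes are still in the segment; only `merge` actually drops it.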
This, of course, means that over time you'll get lots of segments. So every now and then, Elasticsearch will merge them together, and during this process deleted documents are finally completely removed. That's why adding documents can cause the index to be smaller: it can trigger a merge, which causes more compaction. So say you have these two segments that get merged; they'll then be completely replaced by a new segment. We'll get back to it a bit later, but this new segment will, of course, have cold caches. The majority of the data, however, is in the older, untouched segments at this point, which have warm caches. This is key for Elasticsearch's real-time capabilities: as new data comes in, the amount of cache invalidation it has to do is quite limited.

All this happens within a single Lucene index, which is a shard in the Elasticsearch index, which is allocated across nodes in your cluster. When you search these shards, it's pretty much the same as searching segments: you search them all and then merge things together. But at this point, the searching can happen across different nodes, and as you merge data, you need to transfer things across the network. One key thing to notice is that searching one Elasticsearch index with two shards is essentially the same as searching two Elasticsearch indexes with one shard each. In both cases, you are searching across two shards, that is, two Lucene indexes. So sharding and partitioning into different indexes are two different yet similar approaches to slicing up your data, to prepare for handling massive amounts of data. You could easily fill a talk about different approaches to this, but one approach is so common it's worth mentioning: when you have log-like data with a timestamp, it's often a good idea to partition it into one index per day, for example.
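The index-per-day pattern is easy to sketch. The index name scheme and dates here are made up for illustration; the point is just that index names are derived from the timestamp:

```python
from datetime import date, timedelta

def index_for(day):
    # One index per day of log data, e.g. "logs-2014.06.12".
    return "logs-{:%Y.%m.%d}".format(day)

def indices_for_last(days, today):
    # Searching last week's data then only touches seven small
    # indexes instead of one huge one.
    return [index_for(today - timedelta(days=n)) for n in range(days)]
```

Deleting last month's logs is then a matter of dropping whole indexes, rather than deleting individual documents and waiting for merges to reclaim the space.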
This will massively reduce the search space when you only need to search today's data, for example, or last week's. And when you need to delete older data, you can simply delete the entire index. You don't have to have the documents marked as deleted and then eventually removed later on. Also, the indexing performance on today's data isn't affected by the fact that you have all the data in other indexes.

So we have multiple Elasticsearch indexes with two shards each in this case. Shards are used to evenly distribute data across one index, because you have too much data for one single node to cope with. When you plan how you're going to scale, it's important to remember that you cannot split a shard. You can easily add more nodes and move shards around, but you cannot turn one shard into two. While this might be possible in the future, the reason is that by the time you realize you need more shards, you probably have a high enough load that adding the extra load of redistributing everything would be problematic. So it's important to plan ahead. Lots of people try to avoid the problem by saying, okay, I'm just going to make a thousand shards and forget about the problem. But then you have lots of duplicated internal data structures, like the dictionary, and there's also overhead to searching multiple shards. So you want a balance between having enough shards and having too many.

These shards get allocated to nodes in your cluster. You can associate any attribute with the nodes, like: this node is running in data center A, in a certain rack, or it's quite a powerful machine. So you can do things like make sure there's a replica in every zone, or make sure this popular index is hosted on the more powerful machines. The cluster also has what's called a cluster state, which is replicated to all the nodes.
It has things like mappings, which is sort of like the schema that tells how a certain field has its text processed, for example. It has the entire shard routing table, so any node in the cluster knows how to route any search request.

At this point, we're essentially back on top, abstraction-wise. So we'll try to piece things together by looking at how a real search request is processed. Say you have this search: the query is of type filtered, it has a simple term filter, and a match query across multiple fields. We also have an aggregation on authors; we want the top 10 authors as well as the top 10 hits. And I also specify shard size, which is something I'll get back to.

This search request can be sent to any node in your cluster. That node becomes the coordinator for that search request. It will decide which shards to route the request to, based on what indexes you have specified to search across, which replicas are available, and so on. Then it sends the request to the relevant shards. But before the search can actually be executed on a shard, there's a certain amount of rewriting that needs to happen. Elasticsearch's query DSL is sometimes criticized for being quite verbose and deeply nested. I actually think it's quite awesome, for precisely the same reasons: its nested structure makes it a lot easier to work with in code. You don't have to compile this huge search string. And there's also quite a close match between how Elasticsearch defines its filters and queries and how the Lucene operators they end up being converted to work, so your knowledge of Elasticsearch or Lucene will sort of go both ways. One exception to this rule, however, is the match family of queries. The match query is something you're going to become quite familiar with, because it's the kind of query that will look up the field in the mapping and see how the text should be processed.
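Conceptually, the rewriting a match query goes through can be sketched like this. The whitespace-and-lowercase analyzer is a crude stand-in for whatever analyzer the mapping actually specifies, and the query dicts mirror the shape of the DSL rather than real Lucene objects:

```python
def analyze(text):
    # Crude stand-in for the field's analysis chain from the mapping:
    # lower-case and split on whitespace.
    return text.lower().split()

def rewrite_match(field, text):
    # A "match" query has no Lucene counterpart. Conceptually it is
    # rewritten into a bool query of term queries over the tokens that
    # the field's analyzer produces from the query text.
    return {
        "bool": {
            "should": [{"term": {field: token}} for token in analyze(text)]
        }
    }
```

So `rewrite_match("title", "Holy Grail")` yields a bool query wrapping term queries for "holy" and "grail"; with fuzziness configured, the leaves would conceptually be fuzzy queries instead of term queries.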
And as we remember, how text gets processed is really important when you deal with search. Quite a common source of pain when you work with Elasticsearch is having incompatible text processing when you index and when you search. So when you do not get the results you expect, the text processing should be your first suspect. But the match query does not exist in Lucene; it's an Elasticsearch abstraction that makes things quite a lot nicer than having to do it yourself. What it would actually look like when converted to Lucene is something like this: the match is actually converted to a bool query that puts together the different fields, and the text, "holy grail" in this case, has been processed: it has been lower-cased, and so on. If you were to configure your match query differently, say by specifying fuzziness, this would be rewritten to something with a fuzzy query at the bottom.

So at this point you have a Lucene query that can be run. It will be run on all the segments, and at this point it matters what has happened before. Often you need to use the same filter, or the same fields you aggregate or sort on, across multiple requests, and Elasticsearch will cache these, as we remember, per segment. So assuming these two red segments here are newly created, because of new documents or a merge, they will have cold caches, and the filters and fields will need to be reprocessed. But the majority of the data is in the segments with warm caches. This is sort of the source of Elasticsearch's mind-boggling performance: when the filter and the fields are already in the cache, using them is really fast. Filters are pretty much the same per search, so they can be cached as a really compact bitmap. Queries, on the other hand, are scored: it's not just whether the document matches, but how well it matches. So queries are not cached. If you need to run the same query over and over again, you should probably cache it in your application layer.
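A rough sketch of why immutable segments are such a good cache scope: a filter's result can be kept per segment as a bitmap and reused for as long as the segment lives. The segment contents and the filter here are made up, and the real cache keys and bitmap representation differ, but the invariant is the same: a cached bitmap can never go stale, so only newly written segments are cold.

```python
class FilterCache:
    # Per-segment filter cache. Because segments are immutable, a
    # cached bitmap never needs invalidating; only brand-new segments
    # (from indexing or merging) start out cold.
    def __init__(self):
        self.cache = {}  # (segment_id, filter_key) -> bitmap

    def matching_docs(self, segment_id, segment_docs, filter_key, predicate):
        key = (segment_id, filter_key)
        if key not in self.cache:
            # Cold cache: evaluate the filter once over the whole
            # segment and store the result as a compact bitmap.
            self.cache[key] = [predicate(doc) for doc in segment_docs]
        return [i for i, hit in enumerate(self.cache[key]) if hit]

cache = FilterCache()
segment = ["spam", "ham", "spam"]
hits = cache.matching_docs(0, segment, "is-spam", lambda doc: doc == "spam")
```

A repeated request with the same filter key is answered straight from the bitmap without re-evaluating the predicate, which is the cheap case the talk calls "warm caches". Scored queries get no such treatment, since their result is not a yes/no bitmap.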
So knowing this, you should prefer to use filters when you can, and use queries only when you need scoring. This is run on all the segments within the Lucene index, which is a shard in the Elasticsearch index. And the results get sent back to the search coordinator. The amount of data transferred here can matter a lot. By default, Elasticsearch will just ask for the IDs of the documents for the top hits, because it doesn't really need all the documents' sources; it just needs them for the top 10 results. But this is quite different when you do aggregations. It's quite possible that an author that should be in the top 10, the global top 10, is in the 11th position on one of the shards. That's why we specified a shard size of 100: to make it less likely that that happens. Of course, it's still possible. So you always need to weigh the amount of data you transfer against the precision you need, and this is inherent in any distributed aggregation. So the coordinator has all the data; it merges it together, asks the shards again, hey, can you please give me the source for these documents, and sends the results back to you, the user.

So at this point, we have looked at the inverted index and seen how the index terms you generate largely dictate how you can search, and that the text processing that generates these terms is quite important. We have looked at how a search happens per segment, and how a segment has several data structures: some used when you search, some used when you aggregate, and so on. We have discussed the consequences of the segments being immutable, and that this can affect indexing performance; for example, when you need real-time search or great indexing throughput, you may want to adjust the refresh interval so you don't constantly create new segments. We have seen how a shard is essentially the same as a separate Lucene index.
And that the Elasticsearch index is generally just an abstraction on top of Lucene indexes, and you can combine them either as shards in one Elasticsearch index or across multiple indexes. And at this point, of course, across nodes in your cluster, it's a distributed search engine. You can easily add nodes, but you also need to be aware of the kind of data being transferred between the nodes as you search.

So this was intended to be an introduction to different things I hope you want to learn more about. The talk is based on an article of the same name, and you can find it in Foundation, which is our article collection about Elasticsearch. We try to keep the articles as helpful as possible for anyone using Elasticsearch; it's not just for Found customers. There's also an Elasticsearch meetup later today, around six, I think. So if you want to learn more about Elasticsearch, I hope to see you there. And if you have some questions, now is your time. Thank you.

No questions? Hello. I have a question about replication. So, for example, if I have some important documents that I would like to search even if one of the nodes, or several nodes, go down, what's the recommended way to do it in Elasticsearch?

You want to... you have documents indexed in replicas and a node goes down?

Well, so I'm adding a new document to the index. What's the recommended way to add it in such a way that one node failure doesn't take down the document?

Yeah, okay. So this talk wasn't that much about Elasticsearch in production; I used to do another talk about that. There's lots of different things to keep in mind when you run a cluster of any distributed system. You want to have a majority of nodes available, for example, to avoid things like split-brains. You want to have replicas available in different, for example, if you're on Amazon, in different availability zones, to make sure that you always have a replica available when errors happen.
And in a distributed system, failure is guaranteed to happen. So in any production configuration, you should have multiple nodes running on infrastructure that doesn't have any common failure points. In bigger production clusters, you should have dedicated master nodes, for example, and you need at least three of them to have a majority in the event of failure. A quite common setup is to have two nodes in your cluster, with one replica each. But when there's a network partition between them, you cannot have a majority when you have just a single node and your cluster is composed of two. So there's lots of different aspects, and I'm happy to talk more about Elasticsearch in production afterwards. Just come find me.

Alright, thanks. Thanks a lot for the fascinating talk. What would you say about the code base of Elasticsearch? Is it worth reading through? How is the quality? Would one actually learn something by looking through it?

Yeah, it's quite a complicated system. I think the code quality is generally quite high compared to other search systems I've read. Lucene has really high-quality code; it's a bit higher than Elasticsearch's, I'd say, but Elasticsearch is still quite good. You can see the fact that it now has tons of new developers, which is good, but it's a code base I'd recommend looking at.

It's pure Java, right?

Yeah.

Thanks a lot. One over there. Thanks. So if I recall it right, Lucene has this precise formula to rank documents. How is this working between shards? How is the ranking working between shards, when you have documents and term frequencies and inverse document frequencies, just like you showed?

Yeah, I'll try to find the relevant slide. We still have a few minutes. So when Lucene is scoring documents, it takes into account things like the frequency of the term. For example, words like "the" and "in" don't add much value, but rarer words are considered to be more relevant.
And so it sort of tries to find the rare words in your query and prioritize them, while not caring too much about the really common words. But of course, these frequencies can be different across the different shards. So it's possible to tell Elasticsearch to, before the search itself happens, have all the shards report their true frequencies, so you can get more accurate scoring. But when it comes to actually ranking and scoring, I would pay just as much attention to things like function score, where you can boost based on, for example, filters, or prefer new documents, or prefer documents within a certain section of your content, and so on. So do not just judge relevancy by the default relevancy that Lucene gives you; also look at all the tools Elasticsearch has to tweak your scoring.

Do we have any more questions?

Is it comparable if you just have a single Lucene index and you put all the documents in the single Lucene index, and then you have the same index across shards? Do you get the same results, or is it different because of the statistics?

So if your data can fit in a single shard and you don't need to scale it, for example, you should prefer to have a single shard. Storing it in two shards will be more than twice as expensive, so usually you want to prefer having fewer shards. When you search multiple shards, these frequencies can differ between the shards, so you can get different results. So indexing everything into just a single shard can yield different results from having it in two. Usually there shouldn't be huge differences, and again, you probably also want to look a lot into function scoring, for example.

Thank you. Any more questions? People are hungry. Thank you very much, Alex. Please give a round of applause.