Hello Pythonistas, hello Chennai, and hello world. We are here to present on performance optimization with Elasticsearch. I am Manaswini Das and she is Anisha Swain. Before we go forward with the talk, we would like to introduce ourselves. We are both Red Hatters and active open-source contributors. Speaking about Anisha, she is a research fellow with the Indian Academy of Sciences and she is also an artist. I will hand over the control to Anisha right now.

So first we will talk about the basics of Elasticsearch and its architecture. Then Manaswini will show you some demos of how exactly we index data in Elasticsearch, and so on. At the end, I will talk about the techniques we can use to increase performance when we are using Elasticsearch.

So what is Elasticsearch? Elasticsearch is a distributed search and analytics engine, and please mark the word powerful: it is a very powerful system. Elasticsearch is the heart of the Elastic Stack. In the Elastic Stack we use Logstash and Beats to collect and aggregate data; by aggregate I mean do more analysis on the data. And we use Kibana for visualization. In Elasticsearch we can store both structured and unstructured data, and not just text: we can store numerical and geospatial data as well. And it is comparatively easier to use than other analytics and search engines.

So this is the whole structure of the Elastic Stack. We use Beats to collect the data, then Logstash aggregates and analyzes the collected data, Elasticsearch indexes and stores it, and after that we visualize the data in Kibana. Elastic also provides a premium extension called X-Pack, which adds security, alerting when there is a problem in the system, monitoring, and some machine learning analysis.

So why Elasticsearch? Elasticsearch is persistent data storage, a distributed document store. It stores data as serialized JSON documents, which makes it very reliable and very fast: newly indexed data typically becomes searchable within about one second. It has a very high indexing rate, approximately two to three times that of comparable systems, it favors sequential input-output patterns, and it also indexes schema-less data: with dynamic mapping it can automatically detect and add new fields and data types while we are storing the data.

It is very easy to maintain data in Elasticsearch because of its scalability. An Elasticsearch index is a collection of documents. We can distribute the documents over various shards, the shards over nodes, and the nodes are grouped into a cluster. So if our data is increasing, we can simply add more nodes to the cluster and scale out. For faster search, Elasticsearch uses data structures like the inverted index and BKD trees, which give near real-time full-text queries. And, as I said, it offers clustering and ease of data maintenance. We can also write three kinds of queries for Elasticsearch: structured queries like we write in SQL, full-text queries using the query DSL, and combinations of both to make complex queries.

Now we'll hand over to Manaswini to talk about the basic architecture. Thank you, Anisha. Yeah, as you can see right here, in a node there are two shards, one in dark blue and one in light blue. Each shard contains a collection of documents, so documents are distributed over shards, shards are distributed over nodes, and the nodes together form a cluster, as Anisha already mentioned.
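To make that cluster, node, and shard picture concrete, here is a minimal sketch using elasticsearch-py, the Python client mentioned later in the talk. It assumes a single-node cluster on the default local port and the 7.x client style where request bodies are passed as dicts; the "accounts" index name is just an illustration.

```python
from elasticsearch import Elasticsearch

# Assumes a local single-node cluster on the default port.
es = Elasticsearch(["http://localhost:9200"])

# Cluster health summarizes the state of all nodes and shards:
# green  = all primary and replica shards are allocated
# yellow = primaries are allocated, some replicas are not
# red    = some primary shards are unassigned
health = es.cluster.health()
print(health["status"], "-", health["number_of_nodes"], "node(s)")

# Create an index with an explicit shard layout: documents in this
# index are spread over 2 primary shards, each with 1 replica.
es.indices.create(
    index="accounts",
    body={"settings": {"number_of_shards": 2, "number_of_replicas": 1}},
)
```

On a single node the health will report yellow, because the replica of each primary shard must live on a different node than its primary to survive a node failure.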
An index in Elasticsearch is a logical grouping of physical shards, and each document is part of a primary shard. And obviously we can have replicas of a primary shard. So what are primary and replica shards? Here we can see there are three nodes, but we'll concentrate on the first two. There is one primary shard and one replica shard, but note that the replica of the first node's shard lives on the second node. That way, if the first node crashes, the second node can still serve the data without downtime.

Moving on to performance considerations about shard sizes. As discussed before, if the number of shards increases, the cluster will grow, and so there is always a maintenance problem. The larger the shards, the harder they are to move around within Elasticsearch. So obviously it depends, but a reasonable average shard size is from a few GB to a few tens of GB. Nodes inside a cluster need to be on the same network. Cross-cluster replication is a type of replication in which a copy of an index on a remote cluster is stored in a local cluster; this is used to provide a backup.

For anyone who doesn't know, Elasticsearch also has a Python client by the name of elasticsearch-py. If you want to know more, you can just go to this link.

It's demo time. I already have Kibana set up on my device; I'll just check. So, looking at the first query, I'm trying to index some documents. This is a PUT request. We can also use a POST request, but the difference is that a PUT request carries an ID, as we have right here, while a POST request doesn't need one: a hash value is automatically generated for it. If we run this, we get this response, with a total of two shards, one of them successful. Since the document is now indexed, we can just run a GET command, and yeah, we have it right here. As we have indexed it, we can also try to delete it, and here we can see the result is "deleted".

Next I index a bulk of documents. We use accounts.json, which is available on the Elasticsearch website, so if you want to look at the dataset, you can go there. Let us have a look at the response we received. The "took" key shows the number of milliseconds Elasticsearch took to produce the response. "timed_out" just tells whether the search timed out or not. "_shards" shows the number of shards it looked into: a total of one shard, and one successful shard. "hits" holds the results received, and here we can see a thousand, with the relation "eq", meaning the count is exact. Moving on to the score: the score measures how well a document matches the search query, and here we have a maximum score of 1.0. These are the hits; Elasticsearch returns 10 documents by default, so we see 10 documents right here.

We can also run a match query, where we want to match a particular field. Right here we have an example: we want to match documents whose account number is 20. So we write a match query, and if we run it, we have one successful shard and one hit. Moving on, we also have the bool query, which matches a document only when its combined boolean conditions evaluate to true. Here we have match clauses where the address should contain either "mill" or "lane"; if we just run it, we get 19 hits. Then we move on to executing filters.
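The same demo steps can be reproduced from Python instead of the Kibana console. Here is a rough equivalent using elasticsearch-py in the 7.x body-dict style; the document fields are illustrative stand-ins for the accounts data, not a fixed schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# PUT-style indexing: we choose the document ID ourselves.
es.index(index="accounts", id=1, body={"name": "John", "balance": 1000})

# POST-style indexing: omit the ID and Elasticsearch generates one.
es.index(index="accounts", body={"name": "Jane", "balance": 2000})

# Fetch the first document back, then delete it.
print(es.get(index="accounts", id=1)["_source"])
es.delete(index="accounts", id=1)

# Match query: one field, one value.
resp = es.search(index="accounts", body={
    "query": {"match": {"account_number": 20}},
})
print(resp["took"], resp["hits"]["total"])

# Bool query: match documents whose address contains "mill" or "lane".
resp = es.search(index="accounts", body={
    "query": {
        "bool": {
            "should": [
                {"match": {"address": "mill"}},
                {"match": {"address": "lane"}},
            ]
        }
    },
})
print(resp["hits"]["total"], resp["hits"]["max_score"])
```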
So Elasticsearch has two types of contexts in the query DSL: query context and filter context. This is the filter context. Filter context, as you might know if you use Flipkart or Amazon like everyone does, is like their filter feature where we can ask for T-shirts with full sleeves or without full sleeves: the document either matches or it doesn't. Yeah. Summarizing: we indexed some documents with PUT and POST requests, deleted some documents, ran a match query and a bool query, and executed filters.

Working with Elasticsearch, we always come across this word "aggregations", which provides the ability to extract statistics from the data. That is, we can run queries similar to SQL GROUP BY queries. This is very powerful and efficient: we can run a query within a query, and we can combine queries and multiple aggregations in a nested manner as well, and the results come back in one shot, avoiding network round trips.

These are the types of aggregation; since there is a time constraint, I'm just skimming across this: metrics, pipeline, matrix, and bucket. A metrics aggregation just computes statistics over the aggregated documents. A pipeline aggregation is an aggregation of aggregations. A matrix aggregation produces a matrix over multiple fields of the documents; it is doubtful whether it will remain in the next versions. And we have bucket aggregations: a bucket is a set of documents, and a bucket aggregation runs over a bucket, a particular set of documents.

Then we have analysis, which converts text into terms or tokens according to given filters. There are analyzers, normalizers, tokenizers, token filters, and character filters. Moving on to what these are. An analyzer, whether built-in or custom, is just a package of three building blocks: character filters, a tokenizer, and token filters. We'll slowly come across these. A normalizer is similar, except that it only emits a single token, so it works with character and token filters but has no tokenizer. A tokenizer just takes a text and divides it into individual tokens; there are many types, and this is just a very small subset of the tokenizers we have. Token filters accept the stream of tokens from the tokenizer and can modify them: lowercase them, remove stop words, remove words that are repeated. And character filters are used to pre-process the stream of characters before it is passed to the tokenizer. The html_strip character filter is an example: it removes opening and closing HTML tags, and if we have patterns such as &apos;, it converts them to an apostrophe.

Moving on to scan and scroll. The scan search type and the scroll API are used to retrieve large numbers of documents from Elasticsearch efficiently, without paying the price of deep pagination. Scroll works by sending repeated requests that return the results page by page until we reach the end. Scan then removes the costly scoring part on top of that, which makes these tasks really easier.
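To tie aggregations, analysis, and scan/scroll together, here is a sketch in the same 7.x client style, run against the bulk-loaded accounts data; it assumes the sample was dynamically mapped, so text fields like "state" get a .keyword sub-field.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

# A bucket aggregation (like SQL GROUP BY) with a metrics aggregation
# nested inside each bucket -- computed in one round trip.
resp = es.search(index="accounts", body={
    "size": 0,  # we only want aggregation results, not hits
    "aggs": {
        "by_state": {
            "terms": {"field": "state.keyword"},
            "aggs": {"avg_balance": {"avg": {"field": "balance"}}},
        }
    },
})
for bucket in resp["aggregations"]["by_state"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["avg_balance"]["value"])

# The analyze API shows what an analyzer does to a piece of text:
# the standard analyzer splits on word boundaries and lowercases.
tokens = es.indices.analyze(body={
    "analyzer": "standard",
    "text": "Performance Optimization with Elasticsearch",
})
print([t["token"] for t in tokens["tokens"]])

# helpers.scan drives the scroll API for us: it keeps fetching batches
# until the result set is exhausted, avoiding deep pagination.
for doc in helpers.scan(es, index="accounts",
                        query={"query": {"match_all": {}}}):
    pass  # process each document here
```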
So I hope this was not too fast for you. We still have 12 minutes left, so I'm handing over the control to Anisha. Yeah. So we will discuss some of the performance enhancement techniques that we can use to enhance the performance of indexing and searching and make them less expensive for us.

Generally, these are the don'ts while using Elasticsearch. We shouldn't return a large data set at a time. Instead, we should use the scan and scroll APIs, so that we fetch one batch of data and then, using the returned scroll ID, fetch the next batch. Second is indexing large documents. Indexing a very large document can affect performance to a high degree. The maximum content length of a document is 100 MB by default; we can scale it up to 2 GB with a setting, but it is expensive: more memory usage and more stress on the network. So we should know exactly what to index, and we should index the whole data in small parts.

As Manaswini has already shown you with scoring, the score shows how well a document matches the query we searched or filtered with. But scoring might not be consistent, because when you search for data, the request might go to a replica shard or to the primary shard. Elasticsearch has a preference setting so that if a particular user logs in with an ID, that user's searches always go to the same shards, giving consistent scoring. And each shard is responsible for its own scoring, so the score of a document in a shard depends on that shard's index statistics. Either we index all the shards with similar index statistics, or we go through all the shards and compute relative statistics with DFS query-then-fetch.

On to indexing performance enhancement. To improve indexing performance, we should use bulk requests. But again, we have to figure out the optimal size of a bulk request, so it's a trial-and-error process: we have to run a benchmark and find our optimal size. We should use multiple threads, but keep a watch on sending too many concurrent requests, because Elasticsearch can throw a TOO_MANY_REQUESTS exception; if it does, we can pause and then resume. We should increase the refresh interval: Elasticsearch periodically refreshes only the indices that have received one or more searches in the previous 30 seconds, so either do heavy indexing when your site has less traffic, or increase the refresh interval. And we shouldn't create replica shards at first: we should disable replicas for the initial load. I know it's a little risky, but it's very good for enhancing indexing performance; we can use this only during initial index generation to lower the risk, and create the replica shards once the documents are loaded.

There are some more points: we shouldn't allow swapping; we should allocate more memory to the filesystem cache for buffering input-output operations; and it's a good practice to use auto-generated IDs from Elasticsearch, so that it doesn't have to go and check for duplicates. And yes, of course, faster hardware, and virtualized storage with better input-output operations per second, work much better with Elasticsearch. And we should also keep a check on the cluster health of Elasticsearch, because if a node goes down, it takes some time to initialize it again.
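As a sketch of those indexing tips in one place (bulk requests, auto-generated IDs, refresh disabled, and no replicas during the initial load), again in the 7.x client style with illustrative names and sizes:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

# For the initial bulk load: no replicas, and refresh disabled entirely.
es.indices.create(index="accounts", body={
    "settings": {"number_of_replicas": 0, "refresh_interval": "-1"},
})

# Bulk-index documents. Omitting "_id" lets Elasticsearch auto-generate
# IDs, so it never has to check whether a duplicate already exists.
actions = (
    {"_index": "accounts", "_source": {"account_number": i, "balance": i * 10}}
    for i in range(10_000)
)
# The right chunk_size is workload-specific: benchmark to find it.
helpers.bulk(es, actions, chunk_size=1_000)

# Once the load is done, re-enable replicas and a sane refresh interval.
es.indices.put_settings(index="accounts", body={
    "index": {"number_of_replicas": 1, "refresh_interval": "30s"},
})
```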
So at that time the health would be yellow, but it should be green when the index has all its primary and replica shards allocated. Disable merge throttling: if we're having a problem with merging keeping up, we can disable merge throttling for some time. But again, it's a trade-off, because it might degrade search performance. A thread pool rejection error comes when we send too many requests to our nodes at too fast a rate; either we can scale the nodes or slow down our rate of sending requests.

Disk usage. To preserve disk space, we should disable unnecessary features. For example, if we only want to build a histogram out of a numeric field, we don't need filtering on it, so we can disable indexing for that field. Strings have norms by default, but if we don't need scoring normalization, we can disable norms. We should also avoid dynamic string mapping, because it maps every string as both text and keyword; instead we can give our own template for string mapping. We should disable _source if we don't need the original JSON, or we can use best_compression so that it takes much less space. The force merge API reduces the number of segments per shard; a segment is the physical unit of storage a shard is made of. The shrink API is used to reduce the number of shards per index, and we should use it to decrease disk usage. We should also use the smallest numeric type that fits the data.

If the data nodes are running low on disk, we have to add more data nodes to our cluster and scale it horizontally. If only certain nodes are running out of disk, it's usually a sign that we initialized the index with too few shards and need to add more; we can use rollover and index aliases for that. So if disk usage keeps growing, we can scale the nodes horizontally.

So I think that's it for the performance part. Thank you. Now, questions. Am I audible? Yeah. So, how are the performance metrics for traditional relational data? Will it work well for relational data also, or should we prefer SQL over it? It will work for relational data as well, but the only difference is that it will be in a JSON format. Does that answer your question? Yeah, I mean, what are the performance metrics compared to traditional SQL databases? Have you ever used GraphQL? No. So, as with GraphQL, there are certain constraints that we have in relational databases. Let me just take this off. Yeah. That is, sometimes we have to fill in null data: you have to give a value for a particular field because it is mandatory. But it is not mandatory when it comes to Elasticsearch: a particular document can have a particular field or not have it, so it's schema-less. Okay. Yeah.

Hi. This question is very relevant to Elasticsearch 7.x. We are actually using nested indexing with Elasticsearch, so it would be nice if you could give us some best practices for search or filter queries on nested fields. What we are trying to achieve is to get only the nested component in the result while searching, not the parent. We had a similar kind of use case: we had to index a directory structure, so it was pretty nested as well. What we are doing is, as I mentioned earlier, if there is a range query, I can just move that to a keyword or a term and then search that term instead of nesting. I can elaborate the whole structure we are using after the talk, because it might need pen and paper. Okay, I will see you offline. Sure. This way.
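A sketch of those disk-usage settings gathered into one explicit mapping, in the 7.x client style; the index and field names are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(index="metrics", body={
    "settings": {
        "index": {"codec": "best_compression"},  # smaller storage footprint
    },
    "mappings": {
        "dynamic": "strict",  # no dynamic mapping: every field is declared
        "properties": {
            # keyword instead of text when we only filter, never full-text search
            "host": {"type": "keyword"},
            # norms disabled: no scoring-normalization data stored for the field
            "message": {"type": "text", "norms": False},
            # histogram-only numeric: aggregatable via doc_values,
            # but no index structure for filtering
            "latency_ms": {"type": "integer", "index": False},
            # smallest numeric type that fits the data
            "status": {"type": "short"},
        },
    },
})

# Force merge squeezes each shard down to fewer segments, reclaiming disk;
# best run on indices that are no longer being written to.
es.indices.forcemerge(index="metrics", max_num_segments=1)
```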
We are using Elasticsearch 7 and I am getting the same performance issue you have mentioned, even though I have tried most of these things. I avoided lots of fields, I removed lots of fields, and I am still meeting the issue. So can you tell us, pointedly, the particular issue? After 30 days I get the issue, so I need to delete the data and do a clean new setup again, and then the problem is resolved. The hardware is the same, everything is the same; only the data matters. So do we need to delete the data every particular interval, or something like that? So the thing is, if you still need the old data, you can use snapshot and restore. That functionality takes a snapshot of the old data and stores it somewhere in a compressed way so it can be used later; I am not entirely sure how exactly it works internally. Otherwise, it is a good way to just delete the data if you don't use it. Are you talking about the reindexing concept? Sorry? Reindex? Not reindex, it's... Snapshot, you are talking about? Yeah. Okay.
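For reference, a minimal sketch of the snapshot-and-restore flow suggested in that answer, in the 7.x client style; it assumes a shared-filesystem repository whose location is listed under path.repo in elasticsearch.yml, and all names are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Register a shared-filesystem snapshot repository (the path must be
# allowed via path.repo on every node in the cluster).
es.snapshot.create_repository(repository="my_backup", body={
    "type": "fs",
    "settings": {"location": "/mnt/backups", "compress": True},
})

# Snapshot the old index; afterwards it is safe to delete it and
# reclaim the disk space and cluster overhead.
es.snapshot.create(repository="my_backup", snapshot="snapshot_1",
                   body={"indices": "accounts"}, wait_for_completion=True)
es.indices.delete(index="accounts")

# Later, the old data can be brought back without a full re-index.
es.snapshot.restore(repository="my_backup", snapshot="snapshot_1",
                    body={"indices": "accounts"})
```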