So, I will be talking about operating Elasticsearch at scale. What is the big deal about it, right? You can just download Elasticsearch on your local machine and start it, so what is the big deal? The problem is, every problem you face will come only in production, and not just with Elasticsearch, with any application you use. The same is the case here; we have faced many problems you cannot even imagine could exist.

A little introduction about myself: I am Jeeva, I am from PayPal, and I have been working with Elasticsearch for about three and a half years. Our team built the search platform powered by Elasticsearch across PayPal, and we serve more than 20 billion requests every day, with more than 100 million searches every day, over petabytes of data, not terabytes, petabytes of data. And we have a large customer base, I mean internal PayPal customers, more than 40 of them. That is about it.

I have organized my session into three parts: first the basics of Elasticsearch, then the challenges we faced, and then the resiliency aspects of our application, which handles billions of documents every day, as I mentioned.

So how is data organized in Elasticsearch? That is the basics. The index is the main data unit of Elasticsearch. If you want to correlate it with something, you can correlate it with a table in a database, or a collection in MongoDB, something similar to that. If you have a single index and you need to scale it, you split the index into multiple shards, and those shards reside on different nodes, each backed by its own disk. Together, they form the cluster. And if node 1 goes down, part of the data is lost; to cater to that, Elasticsearch has an inbuilt replication mechanism, where every request you send is replicated onto another node, like this.
So if node 1 goes down here, shard 0 is still there on node 2 as well, so you would have no problem with that.

The next thing I will talk about is routing. What happens when I insert a document into Elasticsearch? Let me demonstrate. Let us say we have node 1 and node 2, and I try to insert a document A, a sample document, with a particular ID. What Elasticsearch does is use a routing algorithm of its own, which takes the ID and decides which node, which shard, the document needs to go to. In this case, it decides that document A needs to go to shard 1. But shard 1 has a primary and a replica, and Elasticsearch always writes to the primary first and then replicates to the replica. This is how it works. Writing in Elasticsearch is called indexing: we index the document in the primary, replicate the same in the replica, and then send the response back to the user. That is how routing works.

Now that I have set the context a little, I come to the challenges we faced and the learnings we had. Before going into this, I will tell you a little story. There was a businessman, and he had two associates; let us call them associate 1 and associate 2. The businessman promoted associate 2. Associate 1 asked the businessman: why did you promote him and not me? The businessman wanted to demonstrate with an example. He asked associate 1 to go to the fruit shop and ask for the price of apples. He went there, asked for the price of apples, and came back. He did the same for oranges and some other things; each errand took several minutes. Then the businessman called the other associate and asked for the same. That associate went once, asked for the price of apples, oranges, and the other things, and where they were from, all the details.
What I am trying to say here is: the more round trips there are, the more time it takes. So it is always better to insert documents as a single bulk request instead of multiple single requests. We started off with single inserts: every request had only one record. When we did that, it was awfully slow. We were able to handle about 5,000 requests per second, which was not going to work at our scale. After we moved to bulk inserts, we saw a performance improvement of about 10x. 10x is a lot: from 5,000 we went to 45,000 without any tuning in Elasticsearch. With further tuning, we are now able to handle 200k per second, though the cluster is big.

The second aspect is routing. As I mentioned, Elasticsearch uses the document ID to route the request to different shards. It also has a routing facility where you can supply a routing ID to direct a request to a particular shard. We had a use case where we had to scale like crazy, pushing 6 million documents per minute, and the document size was also high, around 3 to 4 KB. We were not able to scale much even with bulk requests. What we did was set a single routing ID for the entire bulk request: if a bulk request has 1,000 records, we set the same routing ID for all of them. Why does that help? If you don't use routing and you send a bulk request with 1,000 records, node 1 receives it, as I explained before, then it fans the 1,000 records out to many different shards, they get replicated to many different shards again, and only then is the response sent back. There are a lot of round trips inside Elasticsearch itself. So routing makes it efficient.
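To make the bulk-plus-routing idea concrete, here is a minimal sketch of building an Elasticsearch `_bulk` payload where every record in the batch shares one routing value, so the whole batch targets a single shard. The index name, document IDs, and routing key are made up for illustration:

```python
import json

def build_bulk_payload(docs, index, routing):
    """Build an NDJSON _bulk body: one action line plus one source line
    per document, all carrying the same routing value."""
    lines = []
    for doc in docs:
        # Action line: index into `index`, routed by the shared routing key.
        lines.append(json.dumps({"index": {"_index": index,
                                           "_id": doc["id"],
                                           "routing": routing}}))
        # Source line: the document body itself.
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

docs = [{"id": "1", "amount": 10}, {"id": "2", "amount": 20}]
payload = build_bulk_payload(docs, index="transactions", routing="batch-42")
print(payload)
```

The payload would then be POSTed to the `_bulk` endpoint in a single request, replacing 1,000 round trips with one.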
With routing, there are no such round trips: all 1,000 records are written to one shard, replicated to its replica, and then the response is sent back to the user. This is a lot faster. We were able to scale to 6.3 million per minute consistently, without any issues.

Yes, we give a routing ID. We calculate it based on previous experience, or Elasticsearch uses its hashing algorithm to decide which shard it needs to route to. So we do some experimentation before settling on that.

Next, the mysterious search timeouts. We had this issue about a month back. Elasticsearch kept scaling up, we were receiving more and more documents, searches also kept increasing, and we started facing more and more timeouts. We did not know what was causing the problem. We tried a few things, which made it worse: the error rate reached 50%, out of 100 requests, 50 would fail because of timeouts. That is crazy. Then we dug deeper and deeper and figured out that the machines running Elasticsearch were having more and more page faults. You cannot even imagine page faults being the cause of this. With the help of the Elastic team, we changed the index store type from the default to mmapfs. Once we did that, the 50% failure rate dropped to 0.3%. That is a lot.

That is one thing; the second thing is query optimization. When we looked into the query, we found that executing the clauses in a certain order was a lot faster than the other way around. In one of the Elasticsearch blog posts, they have written that after version 2.x this was fixed: you can give the query clauses in any order, and it will execute them in the optimal order. But that was not happening for us; we had to change the order ourselves to fix the issue. Before fixing the order, it took more than one second to execute a query.
After fixing this, it was taking 45 milliseconds, which is a big improvement.

As you all know, ignorance is bliss, right? But ignorance with power is a disaster. Give a monkey control of a nuclear missile, what does it do? It keeps on clicking; it does not know the impact. The same thing happened to us. Our users do not know how Elasticsearch operates, so they issue expensive queries. Expensive as in, I will give you an example: let us say you try to do a group-by on a unique ID field. What happens? Elasticsearch tries to load everything into memory, and then it just blows up. That is what happened, and it has happened multiple times. And Elasticsearch has a lot of expensive queries, not just the group-by on a unique ID; there are nested aggregations where it can fail as well.

So what we did was create a gatekeeper between Elasticsearch and applications like Kibana and Grafana, which are basically visualization tools. This application gatekeeper has a rule engine: it checks the rules, and if a query matches any rule, it fails the query instead of passing it on to Elasticsearch. Whenever some cluster gets blown up, we take the offending query and add it as a new rule. The future plan is to use ML to detect such issues and fix them automatically.

What do you think this picture shows? Resiliency, right? Even in barren land, it is able to cope. That is resiliency. What is resiliency? Resiliency is basically staying responsive even when there are failures. There are different ways to achieve it in any system, not just Elasticsearch. This part is about the application we built on top of Elasticsearch, where we ingest billions of documents every day. We have multiple ways to achieve it. One is timeouts. Without timeouts, what happens? I will explain.
If you make a connection and it is not responding at all, and you try to make connections again and again, your resources will be exhausted. That is where timeouts come into the picture, and then retries. These should be there in any system, not just our application with Elasticsearch. Then there is back pressure, which I will talk about a little later; this is related to reactive systems. And the last one is the circuit breaker.

A circuit breaker is similar to a fuse in electronics. What happens when a high voltage hits a fuse? It just blows, and you have to replace it. It is the same here: when more and more failures come, the circuit opens, and after a certain time it lets a request through again to check whether the system has recovered.

Before I explain back pressure, I will talk about reactive streams; actually, about streams first. What is a stream? Think of a river; a river is also a stream. It has an origin and an endpoint. The origin would be a mountain, and the endpoint would be the sea, and there will be multiple villages or towns in between that it flows through. Reactive streams are similar to that. The main difference is that reactive streams have back pressure. In a river, if more and more water comes, you cannot stop it; it will reach the sea anyway. That is not the case with a software stream: the sink, the sea in our analogy, can control how much data it wants.

This is a part of our application where we consume from a Kafka source and then transform the data. Once we transform it, we bulkize it; I explained before why we need to bulkize. Bulkizing happens based on two things: time and the number of requests. If the number of requests exceeds n, or if the timer triggers, the batch is sent to the next layer, which is the Elasticsearch client. The Elasticsearch client sends the request to Elasticsearch and gets the response back.
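The count-or-timer bulkizing just described can be sketched roughly like this; the thresholds are illustrative, and in a real reactive pipeline this would be done by a stream operator rather than a hand-rolled class:

```python
import time

class Bulkizer:
    """Sketch of count-or-time batching: emit a batch when `max_items`
    records accumulate, or `max_wait` seconds pass since the first
    buffered record, whichever comes first."""

    def __init__(self, max_items=1000, max_wait=1.0, clock=time.monotonic):
        self.max_items = max_items
        self.max_wait = max_wait
        self.clock = clock          # injectable clock, handy for testing
        self.buffer = []
        self.first_at = None        # when the oldest buffered record arrived

    def add(self, record):
        """Buffer a record; return a full batch if a flush condition fired."""
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(record)
        if (len(self.buffer) >= self.max_items
                or self.clock() - self.first_at >= self.max_wait):
            batch, self.buffer = self.buffer, []
            return batch
        return None

b = Bulkizer(max_items=3, max_wait=60)
assert b.add("a") is None
assert b.add("b") is None
print(b.add("c"))  # third record hits the count threshold, batch flushes
```

The returned batch is what would become one `_bulk` request to the Elasticsearch client layer.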
The last one is the sink, and the sink has the circuit breaker. Let me explain how back pressure works in this case. As I mentioned, the sink controls the demand. The sink sends a demand to the next layer, which is the Elasticsearch client; that layer asks for demand in turn, and this goes all the way up to the source. Then the source starts pushing data if it has any, or it pulls from Kafka. Once a push is done, the outstanding demand is reduced, right? When data reaches the bulkizer, if the number of requests has not exceeded the threshold and the timer has not triggered, it asks for demand again and more records flow in. Once the number of requests exceeds the threshold, or the timer triggers, the batch goes to the ES client, which gets the response from Elasticsearch and pushes it to the sink. Now the sink checks whether the request failed or not; as it is a bulk request, there will be multiple records to check. If it failed, the sink will not ask for more demand until a timer triggers; then it asks for demand again. This is how we built resiliency: the circuit breaker and back pressure both play a part in our system.

To summarize, while you are looking at this slide, the two main takeaways from this session: the first is the gatekeeper I mentioned. If you do not put a layer between the users and the underlying system, not just Elasticsearch, your system will crash. Users do not know what they are doing; they will keep on clicking and make the system worse. We have had downtimes of 2 or 3 days, days, not hours, because of just this issue, before building the gatekeeper. The second is resiliency. If the underlying system fails and you keep sending requests to it, what happens? It fails again and again, and the situation gets worse.
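That failure loop, hammering a system that is already down, is exactly what the circuit breaker prevents. A minimal sketch of the idea: trip after some number of consecutive failures, reject calls fast while open, then allow a probe through after a cooldown. The thresholds and error types here are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch. After `max_failures` consecutive
    failures the circuit opens and calls are rejected immediately; after
    `reset_after` seconds one trial call is let through to probe recovery."""

    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None          # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0                  # a success resets the count
        return result

cb = CircuitBreaker(max_failures=2, reset_after=60.0)

def failing_call():
    raise IOError("downstream timeout")   # stands in for a failed bulk request

for _ in range(2):
    try:
        cb.call(failing_call)
    except IOError:
        pass

try:
    cb.call(lambda: "ok")                 # rejected fast, downstream untouched
except RuntimeError as e:
    print(e)
```

While the circuit is open, the expensive downstream call is never made, so a struggling Elasticsearch cluster gets breathing room to recover.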
We do not want that. Any system needs to be resilient, not just Elasticsearch or an application on top of Elasticsearch. And most of the things I covered apply not just to Elasticsearch, but to any database for that matter. Now, Q&A.

We have multiple clusters, not just a single cluster, and we have data ranging from, let us say, GBs up to around 300 TB; it is not one single cluster.

On the data-to-RAM ratio: you are asking about a single node, but as our clusters are very massive, we cannot follow that guideline, the data is too huge. It is not deterministic; it will be different for different use cases.

Yes, please. Right, so basically what we do is push the failed data to another Kafka queue and then consume it from there. That way we handle the errors as well: if it is a handleable error, we consume it later and push it again; otherwise we just fail it, because it cannot be handled. I did not show it here because it would make the diagram more cluttered.

Yeah, I did not talk about that. We use Akka Streams, which is a reactive streams implementation, and we use a framework called squbs, which is open-sourced by PayPal, for bootstrapping and productionizing the whole application. Akka Streams is basically low-latency streaming.

There are many use cases, not just one. For example, we had a use case where they had to traverse the entire data set, say 200 TB: they wanted to search for an account number across three years of data and get all of it at once. We faced a lot of issues because of that, and then we used routing at the account level. But the problem we have there is that it skews the shard sizes, because some accounts have more data, more transactions, than others.

While you go through this slide, you can look at the use cases that we have. Yes, please.

Audience: Hi. Great talk. So we also run an Elasticsearch cluster.
We currently do one million documents per second. We ran into a lot of the same problems as you, and we did a lot of things exactly as you did. So a couple of things I wanted to ask. Did you build a monitoring system to understand what the issue was? Because there are about 45 parameters you need to look at, and when you first start, you will not know where the issue is unless you talk to the Elastic people. That was my first question.

So actually, monitoring was part of this; my session was originally around 40 minutes, so I had to cut it short, and I removed the monitoring part. We do have a monitoring system: we built a plugin which collects every possible metric and pushes them to another Elasticsearch cluster, and we use that to monitor our system.

Audience: And the second part of the question: as Elasticsearch uses a mesh network, you cannot scale it to multiple data centers.

That's right. Multi-data-center is not available yet, but they say it is coming.

Audience: So what happens then is you stay in the same geolocation, and the network latency increases because the same data center has all the data. You can't scale the bandwidth of the network.

So basically you are talking about active-active, or disaster recovery? You want the same data to be available in some other geolocation, or you want the data distributed across geolocations?

Audience: Not really, the main cluster itself. What happens is, if you keep growing the cluster in the same data center, your bandwidth keeps increasing, and beyond a limit you can't keep adding nodes to the same cluster. For example, our data center had a 100-gig line, but that didn't support it later. So how did you scale that?

So basically there are two ways. One is multiple clusters, and recent releases also support cross-cluster search; you can use that.
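For illustration, with cross-cluster search you prefix index names with a remote-cluster alias, and a single request fans out to every listed cluster. The cluster aliases and index name below are assumptions, not the speaker's actual setup:

```python
# Build the index spec for a cross-cluster search request:
# "<cluster_alias>:<index>" entries, comma-separated, in one _search URL.
clusters = ["us_cluster", "eu_cluster"]      # assumed remote cluster aliases
index_spec = ",".join(f"{cluster}:transactions" for cluster in clusters)
search_url = f"/{index_spec}/_search"
print(search_url)  # /us_cluster:transactions,eu_cluster:transactions/_search
```

One query body sent to that URL is scattered to both clusters and the hits are merged, which is what lets you split data across clusters instead of endlessly growing one.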
That way, you can push some of the data into one cluster and other data into another cluster, but still use a single query to search across all the clusters. That way, you don't need to keep adding nodes to a single cluster. Tribe nodes you can also try, but the problem there is that if the index names are the same, you have a problem. That we could do, but that is like writing another Elasticsearch. Well, I think we will need to take the other questions offline. Okay, we need to set up for the next speaker.