Hi everyone, I am Dennis, and today I am going to talk about log analytics with the ELK stack. Before I get into how we optimized our ELK stack, let me set the context.

About me: I am currently working as a DevOps engineer at Bontrog Labs, and I have over six years of experience in DevOps and SRE roles, working on a variety of technologies in both service-based and product-based organizations. We are a mobile gaming company making mass-market social games: real-time, cross-platform games with more than 5 million daily active users and around 15 million monthly active users, and we are profitable. At peak we see about half a million concurrent players.

Now, the problem statement. We keep launching new games, and as the existing games become more and more popular, we found ourselves in need of a good log analytics platform. About one and a half years back we evaluated some of the common log analytics platforms, and we decided to build a self-managed one ourselves. In this presentation I am going to talk about our business requirements, our ELK architecture, the optimizations we did, and finally the cost savings and key takeaways.

Let's start with the business requirements. We were looking for a log analytics platform for web server, application and database logs. The ingestion rate would be about 300 GB per day. Frequently accessed data would be the last 8 days, meaning the data which gets queried often, and infrequently accessed data would be the next 82 days. We wanted an uptime of 99.9%, a hot retention period of 90 days and a cold retention period of 90 days; I will explain exactly what hot retention and cold retention mean in the upcoming slides. Finally, we wanted something simple and cost effective, nothing fancy. If you look on the right side, you can see our requirements are fairly predictable, and we do not serve user data or business data out of this platform; it is purely for internal use.

So choosing the right option was very important for us. We looked at the ELK stack, which is self-managed, and we looked at Splunk Cloud and Sumo Logic Professional. The pricing shown is taken from the product websites, and this is not an apples-to-apples comparison; it is just to give you an idea, because the ELK stack is self-managed while Splunk Cloud and Sumo Logic Professional are managed offerings. In fact Sumo Logic has a booth here where you can go, take a demo and see the product. Splunk and Sumo Logic are managed services and they do offer a lot of features, but for us, at the end of the day, what we wanted was a simple and cost-effective solution, so it made more sense to go with the self-managed ELK stack. One of the advantages of a self-managed stack is that you can do a lot of optimizations, and in this presentation I am going to talk about the biggest optimizations we have done.

Just to give an overview of the ELK stack: on the left side you can see Filebeat. Filebeat is an agent that you install on your servers. Filebeat collects the logs and then ships them to Logstash. That is the pipeline: Filebeat collects the logs and sends them to Logstash.
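To make the Filebeat-to-Logstash leg concrete, here is a minimal sketch of a shipper config. The log path, hostname and port are placeholders, and the exact top-level key varies by Filebeat version (older releases call it filebeat.prospectors):

```yaml
# filebeat.yml -- collect log files and ship them to Logstash
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log   # whatever your servers write

# Send everything to the Logstash tier (Beats input, port 5044)
output.logstash:
  hosts: ["logstash.internal:5044"]
```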
In Logstash you have filters: you define filters to process those logs. After Logstash, the processed logs go to Elasticsearch. Elasticsearch is the database where your data is stored, and we are using Kibana for visualization, that is, for querying the data out of Elasticsearch. In Kibana you can create dashboards and graphs. For pattern detection and alerting we are using ElastAlert and custom apps, and for monitoring and alerting we are using the TIG stack: Telegraf, InfluxDB and Grafana. We already had a TIG stack in place, so we are reusing it.

In this presentation I am going to use some Elasticsearch terms, so I just want to get everyone on the same page with them. To start with, we have an index; an index is kind of a database in Elasticsearch. An index is then divided into shards, and you can choose how many shards you want. Shards come in two types, primary shards and replica shards. What's the difference? A primary shard is where your data ingestion happens, and a replica holds a copy of that shard. Why do we need replicas? If the data gets corrupted, or you lose some of your Elasticsearch nodes, you can use the replicas to recover the data. Shards in turn are divided into segments, so a shard has multiple segments of varying sizes. Finally you have nodes: a node is simply the virtual machine or EC2 instance on which you run Elasticsearch.
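As a concrete instance of that shard terminology, here is a hedged sketch of creating an index and choosing its shard counts; the index name and the numbers are purely illustrative:

```sh
# An index with 5 primary shards, each with 1 replica copy:
# 10 shards in total, spread across the cluster's nodes
curl -XPUT 'http://localhost:9200/app-logs-2018.06.01' \
     -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'
```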
So this is the architecture we came up with: a cost-optimized architecture, built to keep scaling with ingestion. Let me walk through it. On the left side you can see Filebeat; that is the agent installed on the various servers. Then we have Logstash, and for Logstash we are using autoscaling, so we scale the Logstash instances up and down. Then we move to the Elasticsearch cluster, where we have two kinds of nodes: hot nodes and warm nodes. What's the difference? The hot nodes use general-purpose SSDs (gp2) and the warm nodes use cold HDDs (sc1); I am going to explain this hot-warm-cold architecture in detail in the upcoming slides. Finally we have Kibana, which we use to query the data. And one important point: the backup strategy. We have a two-pronged backup strategy. We take daily index snapshots from Elasticsearch and put them in S3, and at the same time we take hourly backups in the form of compressed logs, also stored in S3.

Moving on to the hot-warm-cold data architecture: we implemented tiered storage, with three types of storage. Before I go into that, let me explain hot data retention and cold data retention, which we touched on in the earlier slides. Hot retention covers data that is in Elasticsearch; you can query it normally and get results. Cold retention covers data that is no longer in Elasticsearch and cannot be queried directly; it is stored in the form of snapshots, essentially as a backup tier. Within hot retention we have hot nodes and warm nodes. As I said earlier, the primary difference is that the hot nodes use general-purpose SSDs, which are faster but also pricier. When data comes in, we store it on the hot nodes.

We keep it there for eight days, because per our requirements the frequently accessed data is the last eight days. Then we move it to the warm nodes. The warm nodes use cold HDDs, so lower IOPS, but the price is also lower, and we store the data there for 82 days. The total hot retention period is therefore 90 days. After 90 days, that is, three months, we move it to cold retention: we take index snapshots, put them in S3, and the data resides there for another 90 days. In total we have a retention of three months plus three months, six months.

Why implement this seemingly complex architecture? The whole idea is to spread your data across different storage types, three storage tiers matched to access patterns. By doing this we were able to reduce costs significantly compared to keeping everything on a single type of storage. The architecture is also scalable, in the sense that the cold tier is S3, and S3 can store not just terabytes but petabytes of information, so your data can grow without the cost growing proportionally. That is an important point here.

Now, the size and scale of our setup. As of now we have 11 nodes, with 35 cores, 127 GB of RAM and about 25 TB of storage. Daily we ingest about 300 GB of logs, our retention policy is 90 days, and we ingest around 7K log lines per second. Let me explain what I mean by that: raw logs are made up of log lines, and once a line is indexed into Elasticsearch it becomes a document. So we are actually ingesting about 7K log lines per second; that is the capacity we can sustain.

Now let's get to the optimizations we have done; the whole focus of this presentation is the optimizations. We optimized on two fronts. The first was the application side, where I will talk about optimizations for Logstash and Elasticsearch. The second was the infrastructure side; since we run on AWS, that means EC2, EBS and data transfer.

Let's start with the application-side optimizations for Logstash. In Logstash we have something called pipeline workers. What are pipeline workers? It is the number of filter-and-output threads that Logstash spawns. By default, the number of workers is equal to the number of cores, so if you have a virtual machine with two cores, the default value of pipeline.workers is two. What we found is that increasing this to four times the number of cores improves CPU utilization, because most of the spawned threads spend their time in IO wait. Let me give an example. Here's a graph with workers set to two, where average CPU utilization is about 25%. When we increased the value to 8, which is 4x, CPU utilization went up to around 30%. Here you only see a 5% improvement, but under heavy load we have seen a much bigger one. This value can be tuned according to your requirements and whatever workload you are running.
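A sketch of how that tuning looks in the Logstash settings file; the figure of 8 assumes the 2-core box from the example (4 x 2 cores), so treat the numbers as starting points rather than gospel:

```yaml
# logstash.yml -- the default worker count equals the core count;
# raising it to ~4x cores keeps the CPUs busy while threads sit in IO wait
pipeline.workers: 8
# events per worker batch; the default (125) is usually a fine companion
pipeline.batch.size: 125
```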
Next, the Logstash filter-level optimizations that we did. As you know, logs come at different levels: there are error logs, warning logs and info logs, and the info logs are not worth nearly as much to us as the errors and warnings. So, as you can see on the right side, we added a conditional: if the log level is info, route it to a separate index. Once it is in a separate index, you can set a separate retention policy for that index. Say your main retention policy is 90 days; you can separate out these info logs and keep them for maybe one month. By doing this you reduce the storage footprint significantly, because most of your logs are going to be info logs.

We did the same thing for logs with a 200 response code. If you look at Apache or Nginx as your web server, every access log line carries a response code, 200 or non-200, and most of the time, in fact about 95% of the time, it is going to be 200. We were not very interested in the 200 logs, so again we wrote a conditional: if the status is 200, route it to a separate index with a lower retention policy; you can decide whatever retention you want. This had a direct impact on cost because it reduced the storage footprint.

The next optimization I am going to talk about is a grok filter optimization; let me explain it properly. If you look at the image, there is a field called "message". When logs are processed by Logstash and sent to Elasticsearch, the original log is kept in this field: the entire raw log line is stored in the message field. Let me take an example of an Nginx log. This is a typical line: you have an IP address, a corresponding timestamp, and so on. Below it you can see a grok pattern; in Logstash you provide a grok pattern so that it can decode the logs. If Logstash is not able to apply the grok pattern for some reason, it adds a tag named _grokparsefailure, which means it could not parse the line. So what we have done, as you can see on the top right, is: if the tags do not contain _grokparsefailure, we remove the original message field; if they do contain it, we keep it. The reasoning is that if the grok pattern was applied successfully, the original line has already been split into its key-value fields, so there is no point storing the original log message as well. By doing this we were able to reduce the per-document footprint by about 30%, which means the overall footprint came down correspondingly. A combined sketch of these three filter tricks follows.
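Here is that combined sketch as a Logstash pipeline. The field names are assumptions: with the stock COMBINEDAPACHELOG grok pattern the status code lands in a field called response, while an application-level field like loglevel depends entirely on your own pattern:

```
filter {
  grok {
    # parse the raw line into key-value fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  # if grok succeeded, the fields carry everything -- drop the raw line;
  # on failure Logstash tags the event _grokparsefailure and we keep it
  if "_grokparsefailure" not in [tags] {
    mutate { remove_field => ["message"] }
  }
}

output {
  if [response] == "200" or [loglevel] == "INFO" {
    # low-value logs: separate index, so it can get a shorter retention
    elasticsearch {
      hosts => ["es-hot:9200"]
      index => "logs-lowprio-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["es-hot:9200"]
      index => "logs-%{+YYYY.MM.dd}"
    }
  }
}
```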
Now we move on to the application-level optimizations we did for Elasticsearch. As you know, Elasticsearch is a Java-based application, so it runs in a JVM, a Java Virtual Machine, and in the JVM you have something called the heap, whose size you can set. Let me explain with an example. If the heap size is set too small, you end up with a graph like this one, showing garbage collection: because the heap is too small, it fills up quickly and the garbage collector kicks in very frequently, resulting in degraded performance and even out-of-memory errors. So it is not good to have the heap too small. The second option is to set the heap size too large, but that results in long garbage collection cycles and something called GC pauses, which is again not good; you can again end up with degraded performance. And these are not contrived examples; these graphs are taken from our own setup. What worked for us was giving roughly 33% of memory to the JVM heap and 66% to non-heap, which means on a machine with 15 GB of RAM you give 5 GB to the JVM heap and 10 GB to everything else. Let me also point out that Elastic's own recommendation is to give no more than 50% of memory to the JVM heap. So you need to watch the GC graphs and then decide what the best combination is for you.

Moving on to the next Elasticsearch optimization: template configuration. On the right side you can see a template. In the template you set the number of shards; you can choose however many shards you want. I recommend making the shard count a multiple of the number of nodes you have. Suppose you have 5 Elasticsearch nodes: a count of 5, 10 or 15 is recommended, because then the shards are evenly distributed across all the nodes and you avoid the shard imbalance you would otherwise get.

Then there is an optimization where we took a fairly radical step: we got rid of all the replicas in our cluster. We removed every replica. One advantage of doing this was that we saved 50% on the storage footprint, and not just storage: there was about a 30% reduction in compute resource utilization as well, meaning CPU and RAM utilization also went down around 30% without the replicas. Of course, there are trade-offs. First of all, replicas are used while searching: a search hits the primary shards as well as the replicas, so you will see some search performance impact if you remove them. There is going to be a slight hit on searches, and that is something to keep in mind. The second thing is that it is not recommended to run production clusters without replicas, but we have recovery mechanisms in place, which I am going to talk about in the upcoming slides.
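For the heap guidance above, here is a minimal sketch of what we mean on a 15 GB machine. On Elasticsearch 5+ this lives in the jvm.options file; min and max are set to the same value so the heap never resizes:

```
# jvm.options -- ~33% of a 15 GB box for the heap; the rest stays with
# the OS page cache and Lucene. Elastic's own ceiling is 50% of RAM.
-Xms5g
-Xmx5g
```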
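And a sketch of the template configuration just described, folding in the replica removal. This uses the legacy _template API of that era, and the pattern and counts are illustrative for a 5-node cluster:

```sh
# Shard count a multiple of the node count -> even distribution;
# replicas set to 0 (the radical step -- mind the trade-offs above)
curl -XPUT 'http://localhost:9200/_template/logs' \
     -H 'Content-Type: application/json' -d '
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 0
  }
}'
```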
Now let's talk about the infrastructure-side optimizations we did, focusing on EC2, EBS and data transfer. Let me say upfront that EBS is the number one contributor to our costs; most of our cost is EBS and storage, so any optimization on the EBS side helps us significantly with cost. The next contributor is EC2, and finally we will talk about data transfer as well.

Let's start with the infrastructure optimization we did for EC2. We are using a tool called Spotinst. Spotinst is a third-party tool, an AWS partner, and it helps you run spot instances effectively; we use it heavily. What we have done is move all of our ELK nodes onto spot instances with Spotinst's help. If you look at the image below, you can see the options for a stateful spot instance: persist the root volume, persist the data volumes and maintain the private IP. Spotinst has an intelligent algorithm to predict when a spot interruption is going to happen, and before the takedown it performs these actions: it takes an AMI of the root volume, launches a new instance from it, detaches the EBS data volumes and attaches them to the new instance, and maintains the IP address. Those are the three steps it performs. So we get a new instance with the same IP, the same root volume and the same EBS data volumes. It does take some time: the node cycle, from takedown to a fresh node, takes less than 10 minutes. This is how we were able to move our entire EC2 fleet onto spot. There is a trade-off here: since we are using spot, there is always the possibility of a spot takedown, because spot capacity comes out of AWS's spare capacity, and AWS can reclaim a spot instance whenever it needs the capacity back. What we have done is move to previous-generation instance types, and by doing this we reduced spot takedowns significantly.

The next optimization we did was implementing autoscaling for Logstash. We scale based on performance and on time. Performance-based, as in: if CPU goes beyond, say, 80% we add two nodes, and if it drops below 40% we remove two nodes. Then we have time-based scaling: you know that at a particular time of day you will need more nodes because of the traffic patterns, so you can schedule the scale-up and scale-down. Both options are important. Note that we do this only for Logstash.

Next, the infrastructure optimization we did for EBS, because, as I said, EBS is the biggest cost, so the biggest wins come from here. As I talked about earlier, we implemented this hot-warm-cold architecture. How do we actually implement it? On the right side, in elasticsearch.yml, you can see a node attribute called box_type, whose value can be hot or warm: on a hot node you set it to hot, and on a warm node you set it to warm. Then you need some template configuration. Below you can see it: we set index.routing.allocation.require.box_type to hot, which means that when data arrives, the index has to be placed on the hot nodes. Why? Because they have SSDs, which give us the high IO we need for ingestion. So first we force the data onto the hot nodes; later we decide when to move it to the warm nodes, and I am going to show how we move it as well. There are trade-offs here too: the hot nodes are super fast because they use SSDs with higher IOPS, whereas for the warm nodes we use sc1 volumes, which are cold storage. The IOPS are lower, and that is fine, but searches are going to be a bit slower there, because we are on slower disks.

So how do we move data to the warm nodes? The template configuration above makes sure new data lands on the hot nodes, so now we need a mechanism to migrate it. On the right side you can see we use a tool called Curator, which is a free tool, and this is its configuration. If you look at the bottom of it, you can see "unit: days" and below it "unit_count: 8". What that says is: for indices older than 8 days, update the allocation requirement so their shards move from hot to warm. Our frequently accessed data is 8 days, right? So after 8 days we move an index to the warm nodes, and it sits there for the longer remainder of the retention period. That is how you move from hot to warm.
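To make the hot-warm plumbing concrete, here is a sketch under the usual box_type convention (the attribute name is a convention, not a built-in):

```sh
# elasticsearch.yml on each hot node (gp2 SSD):   node.attr.box_type: hot
# elasticsearch.yml on each warm node (sc1 HDD):  node.attr.box_type: warm

# Template so that newly created indices land on the hot tier
curl -XPUT 'http://localhost:9200/_template/logs-allocation' \
     -H 'Content-Type: application/json' -d '
{
  "template": "logs-*",
  "settings": {
    "index.routing.allocation.require.box_type": "hot"
  }
}'
```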
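And the Curator side, sketched as an action file; the index prefix and the timestring are assumptions about how your indices are named:

```yaml
# curator action file -- after 8 days, retag indices so their shards
# relocate from the hot (gp2) nodes to the warm (sc1) nodes
actions:
  1:
    action: allocation
    description: "Move indices older than 8 days to the warm tier"
    options:
      key: box_type
      value: warm
      allocation_type: require
      wait_for_completion: false
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 8
```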
The next optimization area was data transfer. What we saw was that, per day, there was a lot of data transfer happening between the ELK nodes; I am talking about node-to-node traffic here. It was about 700 GB per day. If you break it down, Logstash to Elasticsearch was about 300 GB, and there is also a lot of inter-node Elasticsearch communication, which came to about 180 GB. So what we did is migrate all of the ELK nodes into a single availability zone. Of course, there are trade-offs: AWS does not recommend running your entire setup in a single AZ, because if the AZ goes down you will not just have downtime; data loss can also happen. That is why we implemented the backup and restore mechanism.

Let me explain how we take a backup. First you have to install the S3 repository plugin, and then you need to register an S3 repository. Once you have done that, you can use a curl request, giving the name of the snapshot and the indices you want to include, and that takes a snapshot to S3. This is how we take our daily backups. Then restoring: again with a curl request, giving the name of the snapshot, you can restore to an on-demand cluster, meaning you can launch a brand-new cluster and restore the data there, or you can restore to the existing cluster. Both are possible.

Now, the disaster recovery mechanisms. Since we have done some aggressive cost optimization, getting rid of the replicas and moving to a single AZ, we need some way to handle things going wrong. The first scenario is data corruption. It happens: an index goes red. How do you know which of your indices are corrupted? First we list the indices whose health state is red. If an index is red, we treat it as corrupted: we delete the corrupted index and then restore it from the existing snapshots. Because we take daily snapshots, we can restore to the previous day.

The next disaster recovery scenario is node failure due to underlying hardware issues. It happens with any instance: there can be hardware failures that make your node inaccessible, or you run into out-of-memory issues. How do we deal with this? During that time the node will be down, and one thing to keep in mind is that because we removed the replicas, if one of the nodes goes down, the entire cluster is effectively down for that duration. Keep that in mind. What we do is recycle the node, using the same process I described earlier: take an AMI of the root volume, launch a new node, reattach the EBS volumes and maintain the IP. This entire process, as I said, takes less than 10 minutes; in practice we have seen about 7 to 8 minutes.
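A sketch of the snapshot and restore calls just described; the bucket, repository and snapshot names are placeholders (the plugin itself is installed with bin/elasticsearch-plugin install repository-s3):

```sh
# One-time: register an S3 repository for snapshots
curl -XPUT 'http://localhost:9200/_snapshot/s3_backup' \
     -H 'Content-Type: application/json' -d '
{ "type": "s3", "settings": { "bucket": "my-elk-snapshots" } }'

# Daily: snapshot one day's index into S3
curl -XPUT 'http://localhost:9200/_snapshot/s3_backup/snap-2018.06.01' \
     -H 'Content-Type: application/json' -d '
{ "indices": "logs-2018.06.01" }'

# Restore it -- on this cluster, or on a freshly launched one
curl -XPOST 'http://localhost:9200/_snapshot/s3_backup/snap-2018.06.01/_restore'
```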
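And the corruption path, sketched with the same placeholders: find the red index, drop it, pull it back from the last snapshot:

```sh
# Which indices are red (i.e. treated as corrupted)?
curl -s 'http://localhost:9200/_cat/indices?health=red'

# Delete the corrupted index, then restore it from the daily snapshot
curl -XDELETE 'http://localhost:9200/logs-2018.06.01'
curl -XPOST 'http://localhost:9200/_snapshot/s3_backup/snap-2018.06.01/_restore' \
     -H 'Content-Type: application/json' -d '
{ "indices": "logs-2018.06.01" }'
```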
And if all of that fails and the node and its volumes are gone, then we start restoring the data from S3, from the daily S3 snapshots. The total recovery time depends on two things: first, the time it takes to provision the new cluster, and second, the time it takes to restore the data from S3. Just to give you a perspective: when we did load testing in our POC, we found that restoring a 20 GB snapshot took less than four minutes. In this setup, where we ingest 300 GB of data per day, restoring one day of data takes us less than one hour.

Finally, coming to the most important part: we have done a lot of optimizations, so what was the impact? Let's start with EC2. This is an example of a traditional setup with none of the optimizations: you would need five Elasticsearch nodes, three Logstash nodes and one Kibana node. After optimization, what we have is Elasticsearch split into hot nodes and warm nodes: five Elasticsearch hot nodes and two Elasticsearch warm nodes. Of course, the instance configuration is different, and as I said, we are using previous-generation instances. For Logstash we again have three nodes, and for Kibana one node. If you look at the top, we were spending a daily EC2 cost of $49, and by implementing the hot-warm architecture and moving all of our nodes to spot, we were able to reduce that to $26. How? By moving to spot we get roughly a 65% saving, and Spotinst, which is not free, charges 20% of the savings; that is their commission, you could say. Netting that out, we went from $49 to $26, which is a saving of 47% on EC2.

Now to the savings on storage, and remember, storage is our number one cost, so anything here helps us a lot. A rough calculation: if we ingest 300 GB per day and retain it for 90 days, that is about 27 TB of index data, and with replicas it doubles to roughly 54 TB of gp2 storage, which would put the daily storage cost somewhere north of $200. Here is our cost after the optimizations. We implemented the tiered architecture: the ingestion rate is the same, but we removed the replicas, and the cold copy sits in S3. Because we removed the replicas, we need only 27 TB in Elasticsearch, spread across the tiers: 3 TB of gp2 and 24 TB of sc1, which adds up to 27 TB, plus 27 TB of S3 for the cold snapshots. On the back of that, we were able to bring the daily storage cost down to $58, which is a massive saving of 75 percent.

So these are the total savings: EC2, 47 percent; storage, 75 percent; and data transfer, effectively 100 percent of the inter-AZ charges, since we moved to a single AZ. All together, that translates into a daily cost saving of around 70 percent.
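Pulling the storage arithmetic above into one place (approximate figures, reconstructed from the numbers in the talk):

```
before:  300 GB/day x 90 days              ~ 27 TB of index data
         x 2 (primary + replica)           ~ 54 TB, all on gp2

after:   replicas off, data tiered
         hot:   8 days x 300 GB  on gp2    ~  3 TB
         warm: 82 days x 300 GB  on sc1    ~ 24 TB
         cold: 90 days of snapshots in S3  ~ 27 TB
```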
And here is our cost versus the other platforms we evaluated, just to give an idea. Bottom line: our cost per GB per day came to $0.98, and we were able to reduce it to $0.58; all told, a massive saving of 71 percent. Now, this total saving excludes some of the application-level optimizations we did. Earlier we talked about moving the info and 200-response logs to separate indices with lower retention, and about dropping the original message field when there was no grok failure. I have not included those in the cost savings, because it was tricky to arrive at an accurate number, so I excluded them. But if you were to add them in, you could add roughly 5 to 8 percent to this number, so it is possible to get to somewhere around 80 percent savings, which is awesome.

So let's talk about the key takeaways. We have done a lot of optimizations, and these are the seven things you will want to know about.

First, how do we scale out? For Logstash, as I said, we use autoscaling. For Elasticsearch, the approach we take is over-provisioning: as of now we run at about 60 percent capacity, so we have 40 percent headroom for growth. Our traffic patterns are quite predictable; we know roughly how our data volume is going to increase, so when needed we can do predictive vertical scaling. That is how we scale out.

Handling potential data loss: if the AZ goes down, as I said, we have mechanisms in place. We have daily and hourly backups, and we can restore them and recover the data. One thing to keep in mind is that there is a potential window of data loss of about one hour, because we take daily index snapshots and hourly log backups; if something happens in between, we can lose up to an hour of data. In our case that is acceptable, because we do not serve user-facing or business data out of this stack. Users are not affected, the business is not affected; we use it purely for internal purposes, for debugging and analysis.

Handling potential data corruption in Elasticsearch: likewise, we have a mechanism in place where we detect indices that have gone red and recover the data from S3.

Managing downtime during spot takedowns: this is very important, because there are going to be takedowns when you live on spot. AWS can decide to take a spot instance back at any time, so how do you deal with this, how do you reduce the downtime? First of all, for Logstash we have multiple nodes; as you have seen, we have three Logstash nodes, so if one goes down, the others can take over. Filebeat automatically detects that a node is down and sends its data to the other nodes. If an Elasticsearch or Kibana node goes down, though, the cluster sits in a red state until that node comes back online, because we removed the replicas. How do we deal with that? As I said, we moved to previous-generation instances, and because of this, spot takedowns are very rare in our case. In fact, we have instances that have been running on spot for more than one year, and in that time we have seen only the occasional takedown. And the time it takes Spotinst to recycle a node and bring up a new one is less than 10 minutes.

Handling back pressure when a node goes down: if a node goes down and, as I said, the cluster is down, there are going to be logs accumulating, and when the node comes back online there is going to be massive back pressure hitting the cluster. How do we deal with that? First of all, Filebeat will keep retrying, so data does not get lost while the cluster is down; once the cluster is back, Filebeat sends the data. On top of that, we use the Logstash date filter. What the date filter does is make the document timestamp in Elasticsearch the same as the timestamp of the log line itself, that is, when the line was written. So even if a log is ingested late, it carries the correct timestamp, and you do not end up with misplaced data. The second thing is that, as I said, Elasticsearch runs at about 60 percent capacity with 40 percent over-provisioned, so when the back pressure hits, we can absorb it with that 40 percent headroom.
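The date filter piece, sketched for a combined-format access log; the source field and format string are what the stock grok pattern produces, so adjust them to your own logs:

```
filter {
  # Stamp each document with the time the line was logged, not the
  # time it was ingested -- late replays after an outage still land
  # under the right timestamp
  date {
    match  => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    target => "@timestamp"
  }
}
```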
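And going back to the Logstash failover point above, the Filebeat side of it is a small config change; the host names are placeholders:

```yaml
# filebeat.yml -- fan out across all Logstash nodes and keep retrying,
# so a spot takedown of one node just shifts traffic to the rest
output.logstash:
  hosts: ["logstash-1:5044", "logstash-2:5044", "logstash-3:5044"]
  loadbalance: true
```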
On managing the platform: as I said, we evaluated self-managed ELK and a few other options at the time, but there are other platforms out there which we have not evaluated. And then there is the question of upgrades: how are we going to upgrade the stack? As of now we are thinking of blue-green deployments, but we have not thought it through completely yet.

So finally, let's reflect on what we have done. We started from our business requirements; the first slides covered them, and we did these optimizations against those requirements. They are our requirements; your requirements might be different. The whole idea of this presentation is not to promote anything, but to give you an overview of the optimizations that are available, so that based on your requirements you can pick the ones that work for you. One more thing: wherever an optimization had trade-offs, I have tried my best to call them out. So when you are picking and choosing optimizations, keep an eye on the trade-offs, and you should be fine. Thank you all.