Hello and welcome. In this session we're going to talk about Elasticsearch and OpenSearch; we'll use the terms interchangeably for the most part. We're going to talk about them in the context of logs and other time series data, and we'll focus on using a Kubernetes operator to manage Elasticsearch clusters in this context.

But before we start, let us introduce ourselves. I'm Radu. I'm a search person at Sematext, and most of my time goes into consulting, production support, and training for Elasticsearch, OpenSearch, and Solr. Sometimes I also contribute to our observability platform, which is called Sematext Cloud. Hi, my name is Ciprian. I'm a consultant for Kubernetes and automation, and I also work as a software engineer for polypoly, mostly on privacy-related projects. In my spare time I'm an open source maintainer, contributing to many projects in the Kubernetes ecosystem.

So let's start with the agenda. We'll talk about why we want such an operator, so the use cases. We'll talk about how it should work: when it should scale up and down, what it should do when that happens, and the available options. And last, a quick demo of our proof of concept.

Let's start with the why. If you have a small cluster, you don't want to have to learn all about Elasticsearch, the ins and outs, and you'd like as little maintenance as possible, so ideally zero maintenance. This kind of operator helps you get started with your logging without having to learn Elasticsearch in depth. Bigger clusters are usually multi-tenant, and with an operator you could split them more easily, because running multiple clusters is no longer a big maintenance burden.

All right, moving on to the how. In other words, what does the operator need to do in order to auto-scale? To get there, I want to talk about three things. One is using time-based indices: if you've used Logstash before, you might have noticed that it creates one index per day, and we're going to talk about why that's a good idea. Next, I'm going to argue that for most use cases, rotating indices by size instead of by time works better. And the third thing is how those indices should behave when you scale your cluster up and down.

So let's start with time-based indices. They don't have to be daily; you can have one index per month or one index per hour, but the idea is the same, and the advantages are pretty big. When it comes to indexing, the bottleneck typically comes down to the Lucene segment merges that happen in the background, and if you're indexing into a smaller index, it's going to be much, much faster. We're basically comparing time-based indices with indexing everything into one big index. When it comes to searches, most of them tend to hit the latest data, so if the data is broken down by time, we can query just a slice of it, and that's going to be faster. And even if we search across all the data, older indices are no longer being written to, so they're much more easily cacheable, both by Elasticsearch and by the operating system underneath it. In my experience, both indexing and searches will be orders of magnitude faster with this design compared to having just one index.
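To make this concrete before we get to the last advantage: here is a minimal sketch (not from the talk) of writing to and searching daily indices, assuming a local Elasticsearch or OpenSearch on localhost:9200 and the Python requests library:

```python
import datetime
import requests

ES = "http://localhost:9200"

def daily_index(prefix="logs"):
    # One index per day, e.g. logs-2024.01.15: the Logstash convention.
    return f"{prefix}-{datetime.date.today():%Y.%m.%d}"

# Writes always go to today's index. Older indices are never written to,
# so they stay small, fully merged, and easy to cache.
requests.post(f"{ES}/{daily_index()}/_doc",
              json={"service": "nginx", "message": "GET /checkout 200"})

# A search over recent data can name just the relevant indices
# instead of scanning one giant index.
resp = requests.post(f"{ES}/{daily_index()}/_search",
                     json={"query": {"match": {"service": "nginx"}}})
print(resp.json()["hits"]["total"])
```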
And the final advantage: when you have to expire data, with time-based indices you can simply delete whole indices, which by and large means deleting some files on disk, as opposed to deleting documents from within an index, which are only soft deletes and will trigger additional Lucene segment merges.

In practice, you would have multiple series of such time-based indices. Normally, we'd break them down by how you search them. For example, if we usually search nginx logs separately from syslog (we don't have to, we can always search across everything), then it makes sense to keep them in separate index series.

But this design is not perfect, and we may run into what we call the Black Friday problem. Let's say we're an e-commerce website and we're logging access logs. Hopefully, during Black Friday we'll have a lot more traffic, so that day's index will grow much larger than the indices from the following Saturday or Sunday. Then you have Cyber Monday, and again the traffic spikes, then it goes down, and so on and so forth. The problem is that the big indices generated on Friday and Monday will behave much like that one big index we talked about before: indexing will be slower, and searches will be slower, exactly when we need them the most.

To fix this problem, we can rotate indices by size instead of by time. The way this typically works is that we have an alias pointing to an index, and Logstash, or whatever puts data into Elasticsearch, writes to that alias. When that index reaches the target size (typically 10 gigabytes per shard is a good rule of thumb in our tests), we create a new index, flip the alias to the new, empty index, and continue writing there. And we just keep doing that. Thankfully, there's some automation for this in recent versions of both Elasticsearch and OpenSearch: in OpenSearch it's called Index State Management (ISM), and in Elasticsearch it's called Index Lifecycle Management (ILM). These can automate the whole process: create a new index, flip the alias, even delete old indices, which is a bit more challenging in this context because indices are no longer strictly divided by time. We'll show a quick sketch of this setup in a moment.

Of course, this design isn't perfect either. For example, if you need to backfill data, say you're onboarding a new project that already has lots of logs, those logs will all go into the latest index, and that's problematic. Also, when you're searching, say, the last 24 hours, which indices contain those 24 hours? That's a bit harder to figure out, though Elasticsearch handles this out of the box: shards that have no data matching your time frame will quickly dismiss the query and report zero hits.

OK, so how does this work in the context of auto-scaling? Let's start with one of the simplest examples. Say we have two Elasticsearch nodes and one index with two shards. Load spikes, so we want to auto-scale and we add a new node. Obviously, we cannot take advantage of it: even if we had previous indices, some shards would migrate, but the third node would not be able to contribute to indexing, which is our main load.
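Here's that sketch of size-based rotation, assuming OpenSearch with the ISM plugin and a write alias named "logs"; the 10 gigabytes per shard target follows the talk, while the names and the exact policy are illustrative:

```python
import requests

ES = "http://localhost:9200"

# Bootstrap: the first index in the series, with the write alias pointing
# at it and the ISM rollover alias set so the plugin knows what to flip.
requests.put(f"{ES}/logs-000001", json={
    "settings": {"plugins.index_state_management.rollover_alias": "logs"},
    "aliases": {"logs": {"is_write_index": True}},
})

# An ISM policy that rolls the alias over once the index reaches roughly
# 10 GB per primary shard (one primary here, hence min_size of 10gb).
policy = {
    "policy": {
        "description": "rotate by size instead of by time",
        "default_state": "hot",
        "states": [{
            "name": "hot",
            "actions": [{"rollover": {"min_size": "10gb"}}],
            "transitions": [],
        }],
        "ism_template": {"index_patterns": ["logs-*"]},
    }
}
requests.put(f"{ES}/_plugins/_ism/policies/logs-rollover", json=policy)

# Writers keep targeting the alias; ISM flips it to a fresh index
# (logs-000002, logs-000003, ...) whenever the size condition is met.
requests.post(f"{ES}/logs/_doc", json={"message": "hello"})
```

In Elasticsearch you'd express the same thing as an ILM policy with a rollover action.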
So, back to that third node. Our suggestion is: even if we're not at 10 gigabytes per shard, or whatever our threshold is, let's force-rotate the index and create a new one that's evenly spread across the cluster. This way, all three nodes can contribute to indexing. That typically implies three steps. One, we change the index template: normally, all your settings and mappings live in an index template, so that a newly created index inherits everything, and we need that template to say three shards instead of two. Two, we do the forced rollover, that is, create the new index and flip the alias. And three, we may need to adjust the lifecycle policy: if the policy said 20 gigabytes before, because we had 10 gigabytes per shard times two shards, it now has to say 30 gigabytes to stay consistent. We'll show these three steps in a code sketch shortly.

We do much the same when we need to scale back down. The only difference is that we need to make sure nodes are properly drained before we shut them down: we exclude them from allocation, and Elasticsearch moves the shards off of them before we take them down. But once we take them down, we run into a similar problem as before, in the sense that the cluster is not balanced. To make it balanced again, we do the same thing: change the template back to two shards, force a rollover, and adjust the policy again. Does that make sense? Cool. So that's what we're trying to automate.

But before we get there, I want to mention three more best practices. We can't be comprehensive here, but I think these three are important, especially in this context. One of them is that it's often tempting to judge the size of a cluster by indexing throughput, because that's our main workload: we typically do much more indexing than searching. So, if we index 1 million documents per second, what kind of cluster do we need? But typically the unit of scale is either search latency or outright disk usage. Because if we index 1 million documents per second but want to keep the data for, say, one month, we'll need a big cluster anyway, and that cluster will be more than big enough to index 1 million documents per second. So disk usage is typically what we look at.

And since searches are often the bottleneck for scaling, it's worth mentioning that searches do lots of random I/O, so I/O latency tends to be more important than throughput or even IOPS. With most cloud providers, we'd typically go for the local SSDs, the ephemeral storage, rather than the managed over-the-network disks, because the latency is much, much better, and so we can put a lot more data on the same node. Obviously, this has a downside: if a node goes down, it takes its storage with it. We have to account for that, either with more replicas (it obviously depends on how important the data is) or with regular backups to compensate.
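Here's that sketch of the three rescale steps, assuming the index template, the write alias, and the ISM policy from the previous sketch; this is not the operator's actual code, just an illustration of the procedure:

```python
import requests

ES = "http://localhost:9200"
GB_PER_SHARD = 10  # the rule-of-thumb target size per primary shard

def rescale(node_count: int):
    shards = node_count  # one primary shard per data node, as in the talk

    # 1. Change the index template so NEW indices get `shards` primaries.
    requests.put(f"{ES}/_index_template/logs", json={
        "index_patterns": ["logs-*"],
        "template": {"settings": {"number_of_shards": shards,
                                  "number_of_replicas": 1}},
    })

    # 2. Force a rollover even though the size condition isn't met, so a
    #    fresh, evenly spread index starts taking writes immediately.
    requests.post(f"{ES}/logs/_rollover")

    # 3. Keep the policy consistent: rollover size is
    #    10 GB per shard times the number of primary shards.
    cur = requests.get(f"{ES}/_plugins/_ism/policies/logs-rollover").json()
    rollover = cur["policy"]["states"][0]["actions"][0]["rollover"]
    rollover["min_size"] = f"{GB_PER_SHARD * shards}gb"
    requests.put(f"{ES}/_plugins/_ism/policies/logs-rollover",
                 params={"if_seq_no": cur["_seq_no"],
                         "if_primary_term": cur["_primary_term"]},
                 json={"policy": cur["policy"]})

rescale(3)  # e.g. after growing from two data nodes to three
```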
And the last one: you may have heard of hot, warm, cold, and colder kinds of architectures. The idea is that you have, say, the best hardware in the hot tier; that's where your indexing happens and where most of your recent searches happen, and as data becomes less relevant, you move it down to lesser hardware. Frankly, our proof of concept does not support this, and I'm not sure it's needed, because I think there's limited use for this design. You can't put trash hardware in the cold tier, because Elasticsearch still needs to keep the data open; it still needs to be able to monitor itself and things like that, and if the cold tier is too slow, it just becomes unstable. The problem isn't that you wait 30 seconds more for a query; nodes will drop out of the cluster and things like that. And if the cold tier is fast enough, at least in our experience, it's like: hey, these nodes are idling, we might as well get them back to work and do indexing on them, and then you end up back at a flat kind of design.

OK, so this idea of an Elasticsearch operator is not new at all. So we thought, OK, let's see which existing operators can help us. Unfortunately, a lot of them are either unmaintained or don't do auto-scaling at all. There is one by Elastic that obviously supports Elasticsearch very well, called Elastic Cloud on Kubernetes (ECK). Trouble is, and we don't want to get into that flame war, it's not OpenSearch and it's not open source, because it's under the Elastic License. And specifically for auto-scaling, you need to pay: you need an Enterprise license. It's also worth mentioning the Opster operator for OpenSearch, which is under active development, but right now it does not support auto-scaling; it's on the roadmap, but it's not there yet.

So the one we went for is called es-operator, and it's from Zalando. It was presented at KubeCon + CloudNativeCon before, and it is open source. It supports auto-scaling, and it supports draining nodes; as I mentioned, we need to move shards off a node before shutting it down when we scale down. And it works really well for, let's say, e-commerce types of use cases: when you have one or more indices and you just want to increase your cluster capacity to handle them, even if that means adding additional replicas. But we wanted to change it so that it works for logs, so that it does the template management, the lifecycle policy management, and the forced rotation, everything I talked about before. And this is what Ciprian will show you in a second.

So we're going to do the demo right now. I'll start with a little about the Zalando operator's inner workings. First of all, it requires you to have your own master deployment before running it, so the cluster is not fully managed. But after that, you have some simple options that control scaling and various other aspects, like excluding system indices from any calculation the operator does. In our case, the most important options are the minimum and maximum replicas we allow it to create, and we'll be using the disk usage percent boundaries for scaling the cluster up and down.

OK, so at the moment we have the demo already prepared in kind (Kubernetes in Docker), so it's pretty easy to run without any networking setup. As you can see, there is only the master; we've already applied this configuration. Once we start the operator, it will create the data nodes, and then we'll show how it scales up, how it applies the templates and the ISM configuration, and what it does at each scaling step. So, starting the operator now: immediately, it notices that it needs two replicas and has none, and it has already started the pods. Looking here, it should take very little time to get the cluster up and running.
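While we wait, the decision the operator keeps making is roughly the following; this is not its actual code, just a sketch of the logic, with illustrative disk boundaries like the ones we toggle in this demo:

```python
import requests

ES = "http://localhost:9200"
SCALE_UP_DISK_PERCENT = 75    # grow when a data node is fuller than this
SCALE_DOWN_DISK_PERCENT = 25  # shrink when every data node is emptier

def desired_data_nodes(current: int, lo: int = 2, hi: int = 4) -> int:
    # _cat/allocation reports per-node disk usage straight from
    # Elasticsearch, the same source the operator relies on.
    rows = requests.get(f"{ES}/_cat/allocation",
                        params={"format": "json"}).json()
    used = [float(r["disk.percent"]) for r in rows if r.get("disk.percent")]
    if not used:
        return current

    if max(used) > SCALE_UP_DISK_PERCENT:
        return min(current + 1, hi)   # add a node, then force-rotate
    if max(used) < SCALE_DOWN_DISK_PERCENT:
        return max(current - 1, lo)   # drain a node, then force-rotate
    return current
```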
Until then, let's look at the cluster health. So it's a green cluster; there's no index yet, no template, pretty much nothing. As you can see, the operator has already found that the nodes are healthy. It created the component template for logstash and a separate component template for scaling, and then applied them to the index; you can see that it applied both the logstash and the scaling components. It also created the ISM policy with a minimum size of 10 gigs. OK, let's see here. Indices: we have one index with one primary shard and one replica. Templates: we have the logstash and scaling templates. And the cluster should be healthy; it's still green.

If we go back to the configuration, we can change the scale-up disk usage percent boundary to 21%. This is something that works on my setup: we don't have the time to ingest enough logs or data for a real demo, so we fake it by decreasing the thresholds. So let's see. OK, it should be applied, and very soon it should detect that the disk usage percent boundary has been reached and start adding more nodes. While it does that, it will reconfigure the scaling template and the ISM policy, and roll the logstash index to match the new number of nodes, following the best practices from earlier.

Let's see. OK, still waiting for the third node to be ready, I guess. So let's see how it does. OK, it should be up and running, and we should see it set the scaling template to match the number of nodes, and increase the ISM policy's minimum size to 30 gigs to match the three primary shards. Because all the nodes in this demo share the same storage, it will not stop at three nodes; it will go up to the maximum allowed by the configuration, so four. Once it reaches that threshold, we'll change the scale-up disk usage percent boundary back, and we'll make the cluster go back to the way it was, to two nodes. At that point, it will start rolling the indices (not recreating them) and applying the correct scaling template to each.

So right now, if we look here at the indices, you can see that we no longer have one; we already have three, created at each step. It started with one primary and one replica, then went to three primaries and one replica so that allocation can be uniform across that number of nodes, and lastly it went to two primaries and one replica. So I guess right now we should have the cluster with four nodes, and it should be healthy. Let's see. It's healthy. We can check that the ISM policy was applied correctly: it's 20 gigs, the minimum size for rollover, as we expected, and the scaling template matches what we see in the logs.

OK. Now, changing the scale-up boundary back to 75% and increasing the scale-down disk usage percent boundary to 25%, so that the cluster starts decreasing in size, and applying. Right now it should quickly scale down, draining the nodes at each step and reapplying the scaling templates to the indices while rolling them. OK, let's see the indices. You can see we already have a fourth index, matching the second one created, with three primaries and one replica. And pretty soon you'll see it draining the pods, making sure the cluster stays green, and getting to two ready nodes instead of the four we had before. Indices: we're already at five, and we have, again, two data nodes in Elasticsearch. At the moment, many of the things we did here are pretty much hard-coded.
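For reference, the drain step the operator performs before removing a node looks roughly like this; a sketch with a hypothetical node name, using the standard allocation-exclusion setting:

```python
import time
import requests

ES = "http://localhost:9200"

def drain(node_name: str):
    # Ask Elasticsearch to move all shards off the node...
    requests.put(f"{ES}/_cluster/settings", json={
        "transient": {"cluster.routing.allocation.exclude._name": node_name}
    })
    # ...then wait until no shard lives on (or is leaving) it,
    # so the node can be shut down without losing shard copies.
    while True:
        shards = requests.get(f"{ES}/_cat/shards",
                              params={"format": "json"}).json()
        if not any(node_name in (s.get("node") or "") for s in shards):
            return
        time.sleep(10)

drain("es-data-3")  # hypothetical name of the node being removed
```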
It's mostly a proof of concept, but we want to improve it and publish it as a usable solution for people who want to do logs. And I think this concludes our demo. So thanks, everyone, for coming here and watching us. Thank you. Any questions?

So, basically, at the end you have the logstash-05 index with three primary shards and one replica each, but you have only two nodes in your cluster. Is the cluster beginning to be yellow at that point?

No, no, no, it wasn't three. It was one at the very end.

You have an index with three primaries at some point, and you keep it, so the cluster should be yellow, right?

It's not yellow, because when we... oh, do you want to take this? So we modified the shards-per-node setting while scaling down, so the cluster can still allocate everything and scale down properly. We don't want that setting to hold back the scaling.

Okay, thank you. No problem. Any other questions?

Yeah, just more of a high-level question on the operator. The reason I'm interested in this is that I have some customer-facing real data, not necessarily for the observability aspect. How confident would you be in this operator for genuinely customer-facing, critical usage, as opposed to data I'd be a little less worried about losing?

Well, this particular one that we showed is not production-ready, I can tell you that. The one that's already open source, the one we linked: if you don't have logs, if you have, let's say, e-commerce data, it should work. As far as we've worked with it, it's obviously up to you to test it for your use case, but Zalando have been using it for their shop for quite a long time now, so I guess it works.

Cheers. Yeah, thanks. Any other questions?

What metrics are you monitoring to know when to add more nodes, or to scale to more or fewer nodes? Or don't you know yet?

Yeah, yeah. Initially, the metrics were CPU usage and the number of shards per node; you can say, I want at most five shards per node, or something like that. That's what the original operator does. We added the disk thresholds in the version that Ciprian modified: if we go over a specific disk usage, we scale up, and if we go under, we scale down. And there's another disk-related one that acts as a safety net: when you scale back down, you don't want to go over a specific disk usage, because you don't want to run out of disk when you scale down, or run into some sort of endless loop. So those are the metrics. The plan was to add something related to search latency as well. We can pretty much get any metric we want from Elasticsearch, because these metrics come from Elasticsearch itself, from _cat/nodes or something like that. We want to use metrics from inside Elasticsearch, because Elasticsearch does quite a lot of stuff in the background, and if we use something from external sources, like some monitoring tool, it may differ from the data Elasticsearch uses for its internal thresholds. So for disk, we can't use external tools. I think the Zalando people are trying to extend their operator to use the Horizontal Pod Autoscaler, and that would make it pretty easy for us to gather more metrics and add log-related ones, so we'd improve on that. But at the moment, from what we read in their comments, they're heavily developing this feature, so we can't merge our code there or even discuss the possibility, because we don't know how it will look. It's kind of frozen at the moment. Any other questions?
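As an aside, to make the shards-per-node answer from the first question concrete: here is a sketch of relaxing the per-index limit so that an index with three primaries can still be fully allocated on two nodes; the index name and numbers are illustrative:

```python
import requests

ES = "http://localhost:9200"

# Three primaries plus one replica each is six shards. On two nodes that
# means three shards per node, so a stricter total_shards_per_node limit
# (e.g. 2, set to spread shards evenly) would leave shards unassigned and
# turn the cluster yellow. Raising it lets allocation pack shards denser.
requests.put(f"{ES}/logstash-000002/_settings", json={
    "index.routing.allocation.total_shards_per_node": 3
})
```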
Hello. In a big cluster that has big shards, like 50-gig shards and, I don't know, 60 nodes or something like that, how does the scaling up and down perform?

We did not test it at that size. My recommendation would be to not get there if you can, and rather split it into multiple smaller clusters. For a larger cluster, I think what we'd want is to scale in larger increments. You don't want to scale one node at a time, because we force-rotate on every step, and it would be kind of suboptimal to go 61, 62, 63 nodes and then back down, and so on and so forth. So I would add some sort of step size. The operator already has things like cooldowns, both when you scale up and when you scale down, so it's not that jittery, and you can obviously configure that; but I would add steps too, to reduce the noise. As for large shards, that's again something you can control, and if you use this for time series data, I think 50 or 60 gigabytes per shard is kind of large. Of course, you can't have a million shards either; there's a trade-off there, as you may be aware. Because if you have a lot of shards, your cluster state becomes large, and then it's going to be difficult to coordinate it across the nodes. In my experience, that's ultimately the scaling limit for how much data you can put in a single Elasticsearch cluster: at some point it becomes unstable because of the large cluster state that has to be coordinated. I hope I answered your question.

Yeah, okay. Any other questions? All right, if not, thank you very much for attending. Hope you have a great rest of the conference.