I've been here for about four months and I've learned a lot about Kubernetes in that time — thanks, Vincent, for that. I also learned a lot about Elasticsearch and how to run it on Kubernetes, so I thought I'd share that experience with you tonight. Let's jump right in once we have a picture up.

Okay, just a bit of a disclaimer: we're not going to talk much about what Elasticsearch is, because this is not an application-level talk, it's an infrastructure-level talk — we're going to talk about how to run Elasticsearch on Kubernetes. Quick poll: who in this room has already used Elasticsearch or is running Elasticsearch? Okay, about half, I guess. Who's running it on Kubernetes? Okay, so you can scrutinize me after the talk, please.

As I said, I'm not going to explain what Elasticsearch is; I'll just give a quick quote from Wikipedia: it's a distributed full-text search engine with an HTTP web interface and schema-free JSON documents, based on Lucene. So what can we take out of this? First, it's distributed — that's important. Second, it's based on Lucene, which means it's based on Java, which is important on the infrastructure side. It's got JSON documents and it's got a web interface, so it's got a REST API. I guess that's pretty much the most important thing we have to know about it, since we need to run it.

What are we using Elasticsearch for? Surprisingly, we're using it as the backend to power our search feature. We're not using it for log analysis; we're using it really as a search system, which makes it a pretty critical part of our setup. If Elasticsearch is down, everything is down in OSP — it's quite critical for us. I guess you know what OSP is doing — does anybody not know what we're doing? Okay, I guess everybody knows, that's great. We're delivering groceries and food, which means we have a lot of products, and our product index is about three million documents at the time of speaking. We're getting about 15 to 20k queries per hour into our cluster, latency is about 30 milliseconds, so I guess it's pretty fast.

We're running a fairly old version of Elasticsearch, which is mostly due to our backend not being able to support anything newer. We're working on it — there's quite a bit of refactoring going on — but so far we're stuck on 2.3, even though we're also running 5.3 for some other services. We're a bit on the microservice side of things, so some services are using a newer version of Elasticsearch. Kubernetes-wise, the clusters are on 1.5 or 1.7: production is still 1.5, in the process of being upgraded to 1.7; staging is already on 1.7. So most of the experience we have comes from both clusters, from both versions, and most of what I'm saying applies to all of them — it's pretty universal.

A couple of Elasticsearch concepts — I have to cover this a bit just to set the context for the rest of the talk. There's something called a cluster. "Cluster" is a bit of an overloaded term in this talk: there's the Kubernetes cluster and there's the Elasticsearch cluster, so from here on, when I say cluster I mean the Elasticsearch cluster. Clusters consist of nodes — same problem, so I'll predominantly be talking about Elasticsearch nodes. A cluster is a collection of nodes, and a node is an instance of Elasticsearch taking part in indexing and search.
You can have a cluster that consists of only a single node. On these nodes we have indices, which can be aliased, and an index is a collection of documents. It's pretty much like your average NoSQL database, even though Elasticsearch is not a NoSQL database. An index is a collection of documents that are somewhat similar — so, for instance, all the products are in one index. And the whole thing is broken down into shards. Each shard is basically a Lucene index in itself, so the sharding happens on the Elasticsearch level, not on the Lucene level.

Shards — and that's on the next slide — are basically there to ensure scalability and availability. One thing you have to keep in mind when you design your cluster: you cannot change the number of shards later. Once you have an index and you say, okay, I have five shards on my index, you can't change that afterwards. It's not possible, so you have to plan a bit ahead. But again, I'm not going to talk much about that kind of thing, because that's application design; there's a lot of documentation on the Elasticsearch website, and it's pretty good.

So you have nodes, you have shards — that's your index, broken down into, say, five pieces. The problem: one of these nodes dies. That shard is gone, which means your index is gone. How do you deal with that? That's what replication is for. Say our index has four shards — four, actually, not three — and we configure one replica; that means there are going to be eight shards in total, each shard has a copy. Elasticsearch will try its best to allocate these shards to different nodes, to make sure that no two copies of the same shard end up on the same node. So if any of these nodes goes down, you always have enough copies to restore the cluster. One thing that's also important to know: searches go to the primary shards and to the replicas. Queries can hit every one of these shards, so you effectively have eight active shards.

Nodes in Elasticsearch have roles, which is also quite important when you plan a deployment. They can be specialized, and they should be specialized in anything but a local testing setup. There is a master, which does discovery — the master basically knows all the other nodes — and it does shard allocation and creates indices. There can only be one active master at a time, which is also important to know later; masters are elected, with the usual problems of elections. Then you have the data nodes; these are the nodes that do the actual heavy lifting: they hold the actual shards, they do the search, the indexing, everything. Then you have the client nodes, which provide the REST API. Since each index can be distributed over all the data nodes, the client nodes also do quite a bit of lifting: they have to aggregate the data, so they pull your documents from the data nodes and aggregate them. It's not just some glorified thin REST proxy; it actually does quite a bit of work, which is important when you're planning resources.

Which roles a node has is controlled in elasticsearch.yml, and as I said, each node can have multiple roles. You can have one node that does all of these things, which is what you want in a single-node cluster, but if you go into production, it's better to specialize your nodes.
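For reference, a minimal sketch of what those role flags look like in elasticsearch.yml — keys are from the 2.x/5.x line, the cluster name is illustrative, and node.ingest only exists from 5.x on:

```yaml
# elasticsearch.yml -- sketch of a dedicated data node (2.x/5.x-era keys)
cluster.name: esdemo        # illustrative name
node.master: false          # does not take part in master election
node.data: true             # holds shards, does indexing and search
# node.ingest: false        # only relevant on the 5.x line
http.enabled: false         # a pure data node doesn't need to serve the REST API

# A client node would flip these: node.master: false, node.data: false, http.enabled: true.
# A dedicated master: node.master: true, node.data: false.
```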
On the slide it looks pretty much like that — this is the configuration of a data node: it's not a master, it doesn't do ingestion, it just holds the data. And depending on how you set these flags, the nodes have different configurations. Okay, any questions so far? That was the general part about Elasticsearch.

Q: What are the advantages of running Elasticsearch in a Kube cluster? — Sorry? — What would be the advantage of running Elasticsearch on Kube? — I think we'll cover that. So far this was just the general part, first talking about what Elasticsearch is; if you have questions about that, we'll get to the Kubernetes part next. Any other questions?

Q: Why did you use Elasticsearch as the index and not a proper NoSQL database? — Our primary database is Postgres, right? The system is backed by RDS. We're using Elasticsearch just for the search part. Basically, our backend uses a gem called Searchkick, which does all the indexing, so everything that goes into the database is also piped into Elasticsearch. — And why did you decide to run Elasticsearch on Kubernetes? — Right, that's the other thing. I kind of inherited that, so I wasn't part of the reasoning; I think Vincent can give more insight into that.

Anyway, if you want to deploy this onto Kubernetes, you have these three types of nodes, and we have a Deployment for each of these node roles. You also need a couple of Services. In the simplest configuration you need two: one cluster-internal service, which is not exposed, which keeps your masters together and serves as discovery for the cluster; and your API service, which is obviously exposed outside the cluster at some point — either a NodePort or, in our setup, an ingress. I suppose you're all familiar with these Kubernetes terms.

We're using Helm to deploy it. There are other ways, but that's what we chose to do. We're not sharing our own Helm chart, but there is a Helm chart in the kubernetes/charts repository which you can look at and adapt for your own setup if you want.

So why one Deployment per node role? Basically because of scaling: you want to be able to scale them independently. For masters, there's no point in having more than three, because only one can be active at a time. You want an odd number of masters because you need to do elections, and you cannot have only one master, because if that one is down, there's no master. So: three masters, that's it. You want at least three data nodes to ensure proper distribution of your indices and shards. Clients: as needed — start with a couple, see how far you get, and scale them up as needed.

You need a discovery plugin for Kubernetes: the way discovery is done in Elasticsearch is pluggable, and there's a specific plugin for Kubernetes. The Docker image we're using has everything built in already — here's the link to that plugin. Essentially it's a Docker image whose base image is Elasticsearch, it installs that plugin, and that's it; it's not very complicated. I'm going to get to that in a minute. Services, again: you need the discovery service and you need the API service.
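A sketch of what those two Services might look like — names, labels and ports are illustrative, not our actual manifests:

```yaml
# Sketch only: a cluster-internal discovery Service for the masters,
# and an API Service in front of the client nodes.
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-discovery     # not exposed; used by the discovery plugin
spec:
  clusterIP: None                   # headless works fine here; peers are found via endpoints
  selector:
    app: elasticsearch
    role: master
  ports:
    - name: transport
      port: 9300
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch               # the REST API, fronted by a NodePort or an ingress
spec:
  selector:
    app: elasticsearch
    role: client
  ports:
    - name: http
      port: 9200
```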
And if you use an STS, a stateful set, you need another service for that stateful set. Then, optionally, you can have an ingress, you can have a config map, and you can use cron jobs for things like snapshots. And service accounts — that's quite important: if you have RBAC enabled on your cluster, you need a service account that is allowed to use the Kubernetes API (there's a rough sketch of what that service account needs a bit further down). That's Kubernetes 1.7, right? 1.7. That's one of the pitfalls we stumbled on when we upgraded from 1.5 to 1.7: all of a sudden discovery stopped working, because we had RBAC on the new cluster.

Well, that's how we started. We said, okay, let's do this stateless, because Elasticsearch is awesome, Elasticsearch can recover, right? If a node goes down, it's automatically going to redistribute the shards, it's going to recover. So if we lose a node, no problem. And that's what we're still running in production, by the way. And it works, actually — pretty well. As long as you have enough data nodes and enough hosts in your Kubernetes cluster, it's not actually a big problem.

Q: So effectively you're just running this on emptyDir? — Yeah. — emptyDir? — Yeah. — And you have some sort of snapshotting? — Hang on, I'm getting to that. Part of the reasoning is that this is data we can afford to lose: it's a copy of the production data. There is a Rake task in the backend that can re-index the whole thing. It takes six hours, so there's downtime involved, but it's possible. So we said, okay, let's go for it, let's try it. And again, it works: we didn't have a lot of downtime — I think we haven't had one in like two years. Oh, initially we had some; well, I wasn't here then.

Q: You store the shards on the...? — That's exactly the point. So far it's using emptyDir, so it's basically stored on the host, but that doesn't matter. — So no persistence? — We're going to get there; this is the start, this is the original setup.

Right, so the problems are obvious. You can afford to lose one node; you can probably afford to lose two, if you're lucky. But what if multiple nodes go down? What if you want to upgrade your cluster? If you upgrade, you need to roll the pods over, and Elasticsearch isn't going to be able to recover the shards fast enough for large indices while Kubernetes rolls them over — Kubernetes doesn't know about your indices, right?

Q: You don't have to roll your containers for a cluster upgrade at all; if you do, someone's doing your cluster upgrades wrong. — Which cluster, Elasticsearch or Kubernetes? — Oh, Kubernetes. — I'm talking about Elasticsearch; I'm talking about updating the deployment. — Elasticsearch also has this concept of replicas — how many copies of your shards do you want; if you have three copies, that gives you flexibility. — It's possible, you can do it; there's a way to do an upgrade, I'm going to get there, it's just not exactly easy.

In the end, this whole thing looks like this, right? And you're right, it's not a thing you should be doing. One thing — somebody mentioned it — is that we have snapshots, which mitigate the risk a bit.
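Going back to the RBAC service account from above for a second — a rough sketch of what it might look like, assuming a discovery plugin that looks up the endpoints of the discovery service via the Kubernetes API; names are illustrative and the exact resources and verbs depend on the plugin and its version:

```yaml
# Sketch only -- adjust resources/verbs to whatever your discovery plugin actually queries.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elasticsearch
---
apiVersion: rbac.authorization.k8s.io/v1beta1   # v1beta1 fits the Kubernetes 1.7 era
kind: Role
metadata:
  name: elasticsearch-discovery
rules:
  - apiGroups: [""]
    resources: ["endpoints", "services", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: elasticsearch-discovery
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: elasticsearch-discovery
subjects:
  - kind: ServiceAccount
    name: elasticsearch
```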
Snapshots are a feature built into Elasticsearch, accessible via the API. Basically, it takes the entire index and writes it into some kind of persistent storage: local file system, S3, HDFS, Azure, GCS — that's what's supported by now, by means of plugins. So you need to install a plugin into your Elasticsearch cluster to make use of these repositories. If you combine that with cron jobs in Kubernetes, it works pretty well. We have our own custom tool for that — a Python wrapper that wraps the API and can be run as a command on Kubernetes, so you can run it as a job.

One thing you can also do with snapshots, which is a pretty interesting feature, is replicate your cluster. You can say: I take my production cluster and replicate it into staging. As long as both Elasticsearch clusters are roughly the same version — there's a compatibility table on the website — that actually works. Obviously you still have a data-loss window, so you need to look at your RPO: how much data loss, over what time interval, you can afford if something goes wrong. We also tried to use snapshots with Helm hooks — let's just take a snapshot before we do an upgrade with Helm — but that caused Helm to time out, because you have to wait for the snapshot to complete, so that was not a great idea. But it's possible, if you want it.

So how do we do a manual upgrade on our stateless clusters? Basically, we start another cluster, let the data roll over into the new nodes, and then we replace the old cluster, and then we're done. But you have to do that one node at a time, so it takes a while — it really takes an hour or so, because you have to shut down every single node, wait until the cluster recovers, which takes 10 to 15 minutes, and do that for four or five nodes. It's going to take you an hour or two. So that's a lot of YouTube videos. One thing you have to take care of — it's not a big deal if you use a Helm chart — is to make sure that both clusters are attached to the same discovery service, which is not a big problem. Then you have one Helm deployment, which is your active cluster, you start another Helm deployment, which joins your existing cluster, and then you can start rolling over. This is what we're doing in production until we can finally go for this.

And this is what is really required if you want to run Elasticsearch on Kubernetes and sleep well: stateful sets. Everybody's familiar with those, right? I'm just going to quickly run through it. It's the Kubernetes approach to stateful applications: if you want to run a database on Kubernetes, you'd better use stateful sets. It's quite similar to a Deployment, it really doesn't look that much different, but it's got some extra properties. First of all, pods have a defined order, which is quite important when we come to persistence. The naming pattern is a bit different — we're going to see that in a minute — so it's not just random characters; they're sequentially named. They will always be launched and terminated in sequence: they come up 0, 1, 2, 3 and go down 3, 2, 1, 0. There are a couple of other things, so check the documentation. And they have PVC templates, and this is the persistence part. So how does that look in practice? In our stateless setup, we had a Deployment for the data.
Now we have an STS, which basically has a PV, a persistent volume, attached to every single data node, and when we scale that thing up, it's going to automatically allocate a new PVC for us. Then, for a reason I honestly haven't quite understood yet, you need to have a headless service attached to the STS, which is there to ensure the network identity of the pods, whatever that means. But you need to have it.

I heard that question somewhere from someone: why not just use PVCs in a Deployment? You attach a volume to every pod in that deployment and you're done, right? You have persistence. It doesn't work. Why? Because pods in a Deployment are not related to each other — they're random. If a pod comes up, gets terminated, and a new pod replaces it, that new pod has no history with the previous one. There's no identity maintained across restarts, which means that while you can technically attach a PVC to a pod, you cannot do it for multiple pods: at least in a Deployment, you cannot say "I want a PVC for each of my pods". You can only say "I want three pods and three PVCs" and somehow attach them manually. So how do you maintain that? And when your pod gets rescheduled — when it gets kicked off the node and restarted on another Kubernetes host — how does the PV, the volume, follow that pod?

In a stateful set — and that's why they are ordered and why that is important — the pods maintain their identity. If pod number one goes down, pod number one comes back, and the PVC is also number one, so Kubernetes can keep track of which PVC belongs to which pod and can reattach them. That's done by something called a volume claim template, which basically defines a volume for each pod in the stateful set — and which, interestingly, even survives a helm delete --purge. So even if you delete your entire deployment, your entire Helm release, your PVCs will still be there when you bring it back up. Sorry?

Who's familiar with Helm? Is anyone not familiar with Helm? Pretty much everyone — or everybody else is sleeping? Okay. Helm is the package manager for Kubernetes. In the context of this talk: templates for Kubernetes manifests. Basically Kubernetes manifests plus Go templates — or Handlebars, if you're not familiar with Go templates. There are a lot of projects that template manifests, but Helm takes it to a level where it's not just templating for your manifests, but also managing your installations, your releases, rolling back, rolling forward.

Q: Quick question about Helm. What's the limitation of the publicly available Elasticsearch chart — why are you rolling your own? — I think when we started there was no publicly available chart, so we did our own and then stuck with it, because we have this thing deployed and it's easier to make incremental changes than to replace the entire thing. — You never merged upstream back into yours? — No. Maybe it was forked from it initially, I don't know. We have our own repository of Helm charts where we also keep some proprietary stuff. To clarify, a chart is basically a package: if you do apt-get install elasticsearch, Helm is the same thing.
helm install elasticsearch. If you do the apt-get version on a Linux machine, it's going to set up Java, set up the whole thing, set up your configuration directory, make sure there's an init script so that whenever your machine starts, Elasticsearch starts as well — all of that. Helm does the same thing for a cluster of machines, for Kubernetes. I'm going to show you a bit of our Helm chart later, so I'll give you a bit of context when we get there.

Q: Max, how is it set up? Is it a separate cluster only for Elasticsearch, or a more general cluster on which you also run Elasticsearch? — We're trying to share: one of the attractions of Kubernetes is to share a workflow and share clusters. We separate based on namespaces — we use the same clusters, but the different deployments are constrained by namespace. The pods end up running one per host, because Elasticsearch pretty much consumes all of the memory; you could also use anti-affinity. We had to do a bit of tweaking recently: when we set up a new Kubernetes cluster, it started running out of memory — I can talk about that, I'll get there.

Q: Do you have something around Helm that handles the configuration or the secret management? — We're using the fabric8 discovery plugin, which uses the Kubernetes API to find the other nodes in the cluster... — You're talking about Helm, right? With Helm you still have to pass some parameters in; do you have something to manage that? — To answer your question: no. We just keep the values files somewhere with the deployments, or for other microservices we're using a drone-helm plugin, where we basically put that into the .drone.yml file — our CI system. That was the same question you raised when you joined, right: how do I manage the different Elasticsearch deployments? We commit the values to git, but the values are not secrets — it's just configuration for the deployment; anything that's a secret is not in there.

We are looking at solutions. There's a values-store plugin for Helm which I've been advocating but haven't implemented yet. That would be a centralized index, built on DynamoDB, so you get a central index of all your deployments across all your clusters, all the values, all the parameters. That feels like the way forward. But at the moment, committing the values to git also gives you an audit trail, a diff, some control — so that's a solution I like. But then again, somebody could do a whole talk about Helm config management, right? Let's keep the focus and get back to Elasticsearch.

Okay, I'm probably just going to show you the real thing instead. So this is a Helm chart — that's why I say it's basically Kubernetes manifests plus Handlebars on the most practical level. It's not Handlebars, though, it's Go templates, but I guess Handlebars is more widely known. — Q: Do you have an example of the values file? — Oh, sorry. — Do you have an example of the values file, just to... — Yeah, yeah. Here are the values.
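To give an idea, a values file for a chart like this might look roughly like the following — these are made-up, illustrative values, not the actual ones from the talk:

```yaml
# Illustrative values.yaml for an Elasticsearch chart (sketch, not the real chart).
image:
  repository: myregistry/elasticsearch-kubernetes   # hypothetical image with the discovery plugin baked in
  tag: "2.3.5"

master:
  replicas: 3
  heapSize: 512m

data:
  replicas: 4
  heapSize: 2g
  storage: 50Gi
  storageClass: gp2

client:
  replicas: 2
  heapSize: 1g

ingress:
  enabled: true
  host: search.example.com      # the kind of per-deployment override mentioned below
```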
That's what you mean, right? Okay — can I close this thing? Yeah, so we've got a bit more space here. So this is a Helm chart. It basically contains a couple of metadata files: there's the Chart.yaml file, which contains the version and things like that; there's a readme; there are the values files — these are my default values; and then you have your templates, which are templated Kubernetes manifests, and all these placeholders are going to be replaced with the values. For specific deployments I can override specific values. So this is my default values file — it's always called values.yaml, all the defaults are in here — and for a specific deployment I create a values file that gets merged into the default one and can override certain things: for instance, the hostname for my ingress, or a different ingress controller than the default one. That then gives you your rendered Kubernetes manifests, and Helm is basically going to take all these YAML files, everything in the templates directory, and — on a very simple level — run kubectl apply on it.

So let's look at it. That's why the structure is very similar: just ignore the placeholders and think of this as a manifest with a fancy syntax. You have a stateful set. It has a spec, it has replicas, all like a Deployment. This is new: the pod management policy — the default is serial (OrderedReady), which means it starts one pod at a time, and that can take a while. For certain applications that ordering might be important; for Elasticsearch it's not, so we can just fire up all the data pods at the same time, no problem. The update strategy: by default it's actually not doing any updates — you have to delete the pods manually. We override that and set it to RollingUpdate for Elasticsearch, because we're lazy.

Okay, what is this? This is just metadata. Affinity — anti-affinity is quite important, as you can imagine: you don't want all your data pods on the same host, so you want to make sure it's only one per host. This is the service account name: as I said, we have to create a separate service account for Elasticsearch, which has more permissions than our default service account. And then here's your container. This is how we set our roles: essentially we use environment variables to override config file settings, so we say this is a data node — node.master is false, node.data is true. We do use config maps, and I can show you the config map, but the problem with config maps is that they are mounted into the container when the pod starts, and changing them doesn't restart anything. So if you want to change the configuration and you change the config map, you then have to go and delete the pod for it to pull in the new config map. Whereas if you change an environment variable and run kubectl apply, it will automatically roll your pods for you. — Q: But with Helm you can also add annotations that checksum some of the files and automatically force rolling updates. — Oh, that's nice. Yeah, I'll get back to you on that later.
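That annotation trick is a common Helm pattern — roughly like this, a sketch where the template path is whatever your chart actually uses:

```yaml
# Sketch of the config-checksum pattern inside the StatefulSet/Deployment template:
# when the rendered ConfigMap changes, the checksum changes, the pod template changes,
# and Kubernetes rolls the pods for you.
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
```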
The reason we can use environment variables at all is the Elasticsearch configuration itself — this works on the application level, it's not a feature of Kubernetes or Helm: it takes these placeholders in the config file and replaces them with environment variables. And the reason we still have it this way is that originally we didn't even have a custom config file; we just used the default one, which for the basic use case does everything you need — you can customize whatever you need with these placeholders. We only had to introduce our own config map when we started enabling scripting, which is something I'm going to get to at the end of the talk.

Okay, I'll talk a bit more about the configuration later. This is something you want to keep an eye on: you need to be careful how you set your Java heap in relation to your memory limits, because that can be a big surprise if you get it wrong. The AWS access keys are for the snapshots — for allowing the Elasticsearch cluster to write snapshots to S3; obviously I'm not going to show you the keys involved. Memory requests, ports, volume mounts. Here are the volume mounts: as you see, this one refers to a PVC — this is the specific volume claim — and this is the config map mounted into the configuration directory. You want to have a readiness probe, and here are the volumes: this is the volume for the config map itself, which is just a straightforward config map. And this is the volume claim template, which basically creates a PVC for every pod that comes up; it's then mounted as a volume into the container, straightforward. A couple of things you want to set on your volume claims — it's just a PVC again, a persistent volume claim: the access mode is ReadWriteOnce, obviously, so it can be mounted to only one pod at a time (it's not that you can only write once; the "once" stands for one pod); the storage class name, whatever your cloud infrastructure provider gives you; and then how much storage you want, depending on your indices, obviously.

Okay, so that's our stateful set. The resource limits — I mentioned them; I messed that up just last week on a staging cluster. I gave the JVM too much memory, and what Java does when it starts up is take this -Xms value — the heap size at startup — and allocate it right away. So you grab a huge chunk of memory, and if that is more than your memory limit, your pod crashes. You also need to make sure the operating system has a bit of memory left, and you shouldn't use swap — I think you know that: don't swap, it's going to be very bad for performance, so you have to live with physical memory. On data nodes, you should only allocate about 50% of the available memory as heap space; the other 50% is going to be used by the OS, and Lucene is also going to cache things itself outside of the heap. Your masters and clients hold no data and don't need that much cache, so there you can use about 75% of the available memory as heap. The recommendation for Elasticsearch is to set -Xms and -Xmx to the same value: if you set a maximum of 4 GB for your JVM, you might as well start with that amount, because Elasticsearch is going to fill it up. You'll see that in your monitoring — usually about 99% of the heap is used by Elasticsearch right away.
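To make the 50% rule concrete, a sketch of how the heap and the pod memory limit might relate for a data node — the numbers are illustrative; the env var is ES_JAVA_OPTS on the 5.x line, while 2.x uses ES_HEAP_SIZE instead:

```yaml
# Illustrative data-node sizing: heap is roughly half of the limit,
# the rest is left for the OS page cache and Lucene's off-heap usage.
env:
  - name: ES_JAVA_OPTS            # on ES 2.x: ES_HEAP_SIZE=2g instead
    value: "-Xms2g -Xmx2g"        # Xms == Xmx, as recommended
resources:
  requests:
    memory: 4Gi
  limits:
    memory: 4Gi                   # the heap above must fit well below this, or the pod gets OOM-killed
```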
Now, Elasticsearch is smart enough to allocate your shards across its nodes, so you don't have two copies of the same shard on the same Elasticsearch node. But it doesn't know about Kubernetes hosts. So what if a host goes down? If you're unlucky, you lose an index, because the two copies of a shard might be sitting in two data pods on that same host. What you want to make sure is that each of these data pods is on its own node, its own host — which Kubernetes mostly does by default, as we observed, but why take chances.

So you want to set up anti-affinity. Affinity in general is the successor to node selectors — node selectors are going to be deprecated, you're supposed to use affinity. You can do node affinity, which doesn't help us much here, because what we want to say is: don't put a data pod on a node that already has a data pod on it. So it's pod affinity — pod anti-affinity, in this case. How do you configure that? There are basically two levels: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution — a lot of letters for saying hard and soft constraints. Then there's this thing called the topology key, which basically says on which level of infrastructure it applies: the same availability zone, the same host, the same whatever. Straightforward — we just say we don't want two of these pods on the same node. And then we need to say by what condition we select the pods we don't want on the same node. In this case it's simple: our data pods are tagged with the app label and the data role, and we don't want another pod with the same labels on the same node. Relatively simple.

Maybe I can give you a quick demo. This is just a demo Kubernetes cluster: three master nodes and four worker nodes that can actually run stuff. These are our pods for Elasticsearch. The clients and masters come from two Deployments, so they're named with a random suffix, as you'd expect from a Deployment. And then you see these data pods are special: they are named by index, and they keep those indices when they get restarted — they don't get renamed. So we have four data pods and four nodes. With this anti-affinity set up on my pods, I can do kubectl get sts — that's my stateful set — and scale up the replicas: I have four pods and now I want a fifth one. You see it started the fifth data pod, and it's Pending — and it's never going to leave that state unless I scale up the Kubernetes cluster. Why is that? If I kubectl describe that pending data pod, it says FailedScheduling, because there are no nodes that match the anti-affinity condition — there are no nodes that don't already have a data pod on them. Now, this is because I said required; I could also say preferred, and then it would still schedule that pod somewhere — it would probably just put two data pods on some node. But why take chances; just keep it that way. So it's going to stay Pending until kingdom come — let's scale it back down. Okay, anti-affinity.
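For reference, the anti-affinity stanza described above looks roughly like this — the labels are illustrative, use whatever your data pods actually carry:

```yaml
# Hard anti-affinity: never schedule two Elasticsearch data pods onto the same host.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname     # "same node" is the level we care about
        labelSelector:
          matchLabels:
            app: elasticsearch
            role: data
# Use preferredDuringSchedulingIgnoredDuringExecution (with a weight) instead
# if you'd rather have a soft constraint that can still co-locate pods when space runs out.
```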
A couple of other things you want to tweak. You want to change your cluster name — although, I haven't actually tried it, but if you run two Elasticsearch clusters with the same name in the same Kubernetes cluster, I think as long as you don't connect them to the same discovery service it doesn't matter. You still want to set it for your monitoring: if you use Datadog, as we do, you use that cluster name to identify your cluster and find your stuff. Again, the JVM heap: tweak that. It can't be done in the config file; it has to be done on the command line, as an environment variable.

The node name is an interesting one. We didn't set it at the start, and it always meant we had to figure out which pod Iron Man or Captain America or whoever related to, because by default Elasticsearch names its nodes after random Marvel characters. That gives you a bit of a headache at 3 a.m. when your cluster is down. If you set the node name to the hostname, you basically name the Elasticsearch node after the pod it's running on, which is a lot easier to identify in the cat APIs — I'll talk about those endpoints a bit later.

You also want to set your node counts. For instance, you want to tell Elasticsearch how many masters it should expect; otherwise, if two masters come up at the same time, they both say "I'm the master, I'm the master" and you have a split brain. So you tell Elasticsearch to wait until there are three masters before electing one of them as the active master. The same idea can be used to say: I want at least two or three data nodes before I start recovering indices, so when you start or restart the cluster it doesn't start thrashing CPU and memory before everything that holds the data is ready.

Monitoring: I'm not going to endorse Datadog, but it's pretty cool. Basically, just point the Datadog agent at your Elasticsearch URL and it's going to give you tons and tons of metrics — really a lot of stuff. If you're using Datadog, there's a built-in Elasticsearch dashboard by default; I can only recommend opening it and looking at what kind of metrics it can track, it's pretty interesting. The way we configure it in our newer clusters is with pod annotations: you have annotations on your pod that tell Datadog to pick up metrics from that particular pod. Things that are important: memory, memory, memory. Look at it per pod and per host, to make sure your host is not going to run out of memory. CPU is not that critical so far — we never had any issues with CPU, which might be because we never set any limits — but you probably want to track your CPU usage at some point and set a proper CPU limit for your pods, so that the Kubernetes scheduler can make intelligent decisions.
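Pulling the settings just mentioned together — cluster name, node name, master count, recovery thresholds — a sketch of the relevant lines in elasticsearch.yml; keys are from the 2.x/5.x era and the numbers assume three masters and four data nodes:

```yaml
# Sketch of the node/cluster settings discussed above (2.x/5.x-era keys).
cluster.name: esdemo                       # also what the monitoring uses to identify the cluster
node.name: ${HOSTNAME}                     # name the ES node after its pod instead of a random Marvel character
discovery.zen.minimum_master_nodes: 2      # quorum of 3 masters, avoids split brain
gateway.recover_after_data_nodes: 3        # don't start recovering indices until enough data nodes are up
gateway.expected_data_nodes: 4
```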
This is also something Kelsey Hightower says — I don't know if anybody has seen his presentation on the Kubernetes scheduler, where he compares it to Tetris, to scheduling Tetris blocks: if you don't tell it what to expect, it basically doesn't use your cluster properly. And the same goes for CPU requests and limits: if you don't set them, the scheduler is not going to optimize your resources. We did start setting CPU limits just recently, actually — basically by observing how things behave without limits, looking at what the thing uses over a week or a month, and then setting the limits accordingly. I guess that's one way to make the decision. You also want to look at how many healthy Kubernetes nodes you have, if only to figure out why your cluster went yellow overnight — maybe it's just because some EC2 host restarted, or an update rolled through, or whatever.

Q: Have you ever hit disk IO limits? — So far we've only hit memory limits. We did have cases where the IO went wild when we had multiple data pods on the same node: some node became unavailable, pods got rescheduled and ended up co-located, and then we started to see huge problems, latency issues and all that. But as long as we provision them properly and set up the anti-affinity properly, we've only seen it as a symptom of something else going on. And then you also see other things: network lag, your CPU or memory spiking. — Do you have swap enabled on your nodes, your Kubernetes nodes? — No, you don't do that in the cloud. — Did you try a federated cluster, like active-active? — Not actively. We could have a whole discussion about that; it's a bit out of scope. I guess we're going to have to talk about the clusters at some point in the future. Yeah, we should.

What you also want to look at is the JVM metrics. You don't need to worry about where to get the JVM metrics from — Elasticsearch is going to report them for you. You want to check your cluster state — very important. If it's yellow, you're still fine, because you still have a chance of recovery: yellow means a shard is under-replicated. Basically, you said you want two replicas of your index, and if there aren't enough replicas, the cluster goes yellow. If you lose all copies of one shard, then your cluster is red, and then you'd better have a backup. Your search queue size is a quite important metric too: normally it should be flat — you shouldn't have any queue, and we don't normally. If it goes up, you have a problem; the node is somehow in trouble. Storage size: obviously, if you use PVCs, you want to know how full they are.

And yes, Elasticsearch will be a good test for your memory reserves — so if you want a reason to run Elasticsearch on Kubernetes, that might be one: you want to stress test it. It's also going to exercise your cluster autoscaler; it's a good way to test the autoscaler. Maybe we're missing something here, so tell us what we missed — maybe you want to highlight something.

Q: How do you collect the metrics? Do you use Filebeat or Metricbeat? — Oh, the Datadog agent. Yeah, we're paying for it. It's another daemon set; there's a pod on each node doing the collection. — What about triggering alerts based on this — is that integrated? — We're using VictorOps, but I'm not sure how the integration works exactly. All of the Datadog metrics can be plotted on dashboards.
And you can set alerts based on the metrics and integrate that with PagerDuty or whatever you use to manage your on-call duty. Initially we thought about running Prometheus and doing everything ourselves, but this is a very critical component and we already try to run too much ourselves. I'm not saying Datadog is the best solution, but there are several good solutions out there that help you with this.

Q: So what was the improvement in going from a non-stateful-set emptyDir setup to a stateful set with PVCs? How long does it take to fail over? — The most important thing is that you can just run helm upgrade on it, and it's going to take down your data pods one by one and bring them back up, and because it's all PVCs, the volumes come right back. — How do you measure the performance improvement between the two? How much was it? — I don't think we measured performance improvements; we don't have a comparison scenario, because the stateful one is staging and the stateless one is production so far. We're going to upgrade production at some point in the future — one big thing is upgrading the Kubernetes cluster for production — so that's one of the next things we're going to do, and then maybe we can give you some feedback on how that performs in practice. Was there anybody else?

Q: Coming back to the first question: what is the advantage of running Elasticsearch inside the Kubernetes cluster? — I think one thing is governance: you want to have everything in the same cloud. And as far as I know, you couldn't run the AWS-hosted Elasticsearch in Singapore. — What are you comparing it to: using the hosted AWS solution, setting up the whole Kubernetes cluster as you did, or setting up just three nodes running outside the cluster and feeding the data from the cluster into Elasticsearch? — Right, so it depends on what you're comparing it to. I was comparing it to AWS Elasticsearch — there's a hosted Elasticsearch, right? Obviously, you can also just use your favorite configuration management and fire up your own cluster bare on EC2 instances. Why run on Kubernetes instead of on VMs? That question applies to any application. Our strategy is to run everything on Kubernetes as much as possible, because it lets us focus on one piece of technology: we don't need a separate Puppet or Chef setup responsible for provisioning our Elasticsearch cluster next to our Kubernetes clusters. We use Helm and Terraform to fire up our infrastructure, and that's pretty much it. That's where we want to get.

Also, spinning up a couple of extra data nodes is very fast: if you have sufficient Kubernetes nodes available, you just increase the number of replicas and they automatically join the cluster. We did that when we had a huge performance issue — we really scaled out. But you can still do the same with configuration management and more VMs; maybe you have a baked image, and that can also go quite fast. So, I don't know — Kubernetes is just a unified way of doing it for everything. — I think the benefit also is that you can run other stuff on your cluster. — Yeah, right.
Compared to running a dedicated Elasticsearch cluster — usually with Elasticsearch you don't run anything else next to it. Well, you can say: I use the same Kubernetes cluster, but I have an instance group that is dedicated to Elasticsearch, Elasticsearch only runs on those nodes, and everything else runs on a separate set of nodes. — We do that where I work; it's a perfectly valid deployment mechanism for things that really want a dedicated instance. But say you want to run an Elasticsearch node and you know it needs 60 gigs of RAM and eight cores, and you have a machine with 10 cores and 80 gigs of RAM, and there is somebody in your company who needs to run a hundred microservices at one millicore each — then you can fill up those extra two cores. If you're running hundreds of nodes, or even tens of nodes, the little extra bit that you're not using every now and then really adds up.

Right, that's one of the main points. If you read the Borg paper, one of the main advantages is running mixed workloads across your nodes. At different times of day you need different things: batch jobs run at night and can use the resources that are available, and during the day you scale out the stateless services to handle the peak and don't run your batch jobs. That's the advantage of running things on Kubernetes. — The problem is that predicting memory limits is quite easy for Elasticsearch; CPU is harder, because you don't know how much it's going to use. — But again, from the trenches: so far we didn't have issues with CPU, or at least not many. We track the CPU usage on each host, obviously, and we see that none of these hosts are overloaded on a CPU level — yet, I have to say. It might just be that we have enough buffer.

Q: What's your practice for estimating CPU and memory? For memory you've got guidelines — what about CPU? — As I said, we basically just track the usage over time and set the limits accordingly. You start with no limits at all — which is probably not the best way — or with a large limit, to have enough buffer to see where it goes. If you see that's too much, you scale it down; if you see you're hitting the ceiling, you scale it up. Let's continue; maybe we can have a bit more discussion at the end.

Troubleshooting — just a couple of tips. Elasticsearch already tells you a lot about itself. There are these APIs: the cat APIs and the cluster and node stats APIs. The distinguishing feature of the cat APIs is that they're human readable, so you can just use watch and observe the state of your cluster over time as it refreshes. So here: I curl and I see my nodes; I curl and I see my indices. There's only one index in this cluster, our example index, but it shows you the health status — this one is green, it's fine.
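For reference, these are the kinds of calls being shown in the demo — the URL is whatever your client service answers on, so ES_URL here is illustrative:

```bash
# The cat APIs are human-readable, so they pair nicely with watch.
ES_URL=http://elasticsearch:9200            # illustrative; point at your client service

curl -s "$ES_URL/_cat/nodes?v"              # which nodes are in the cluster, and their roles
curl -s "$ES_URL/_cat/indices?v"            # indices with health, doc counts, sizes
watch -n 2 curl -s "$ES_URL/_cat/shards?v"  # shard allocation, refreshed every 2 seconds
curl -s "$ES_URL/_cat/recovery?v"           # progress of ongoing shard recoveries
```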
Another quite interesting one to watch is the shards endpoint. This gives you your shard allocation — and why is that interesting to watch? Because now I'm going to kill a node. I have four already, so I have to scale one down. It's still running, it hasn't realized yet what's coming. Let's also watch kubectl get nodes — no, pods, sorry. So this one is still there, right, and Elasticsearch is still happy... and now it's gone, and now my shards go. And if I scale it back up, Elasticsearch will always try to utilize all the nodes, so it's actually going to relocate shards. What we see now is that this index is yellow — that's one side effect. Normally somebody would get woken up at night because the cluster state went to yellow; no, I disabled that alert. It's probably going to take a minute or so to recover, but it is going to recover, because it still has enough nodes to rebalance: it's going to take these indices and copy the shards back over from the replicas.

Q: Is there one that actually shows the migrations? — Yeah, there's a cat endpoint for that too. You see it actually went pretty fast, because this is a relatively small index; with a larger index, like our product index, it takes a while, and you would actually see the shards going through the recovery states. One second — here it is. Normally you don't have to look at this, but if you're in production and production is down and you have to give a progress report, then you want to have a look at this thing: it gives you the status of your recovery jobs. — What's the relocation? — Relocation is what would happen if I bring that node back. Right now the cluster is green, everything is fine, but if I fire up another data node, it's going to start relocating some of these shards to that new node. There's no need to, because everything is fine, but it's going to relocate them anyway. Which actually makes sense, right? So there are basically two levels of auto-recovery: Kubernetes is going to bring the pod back at some point, and then Elasticsearch is just going to recover by itself. That's why we disabled that alert — Kubernetes is going to handle it for us.

The thread pool one I'm not going to show you, because it's a lot of data. But all these endpoints you can also monitor in Datadog — Datadog is basically consuming these cluster and node stats APIs, so it can give you, for instance, your JVM usage live. You rarely have to actually SSH into your nodes to see what's going on; you get everything from there. I showed you the shard allocation already, right? And this would be the node stats endpoint: you can query it per node or for all nodes — there's actually a way to get human-readable output — and this one says it's right now using one gigabyte of heap space, pretty much all of it.

There are a couple of things you can do dynamically. We don't do that a lot, but if you're more concerned about downtime, there are a couple of things you can tweak before you restart nodes and so on.
For instance, you can dynamically adjust your minimum master nodes. You can set these things as transient or persistent; transient means it resets to the default on the next full restart. I'm not sure it even matters here, because persistent settings are supposed to survive a restart, but our master pods are not persistent, so it's probably going to behave like transient anyway.

Cluster-level shard allocation: this is interesting if you have a scheduled restart. Elasticsearch is going to start recovering the indices, but if you know that node is going to come back up and it has a PVC attached to it, you probably want to disable allocation, because otherwise it's going to waste a lot of resources — relevant if you're short on resources, or if you don't want to affect other services running on the cluster. That could be done using lifecycle hooks or Helm hooks. You can also use shard allocation filtering, which is the Elasticsearch counterpart to cordoning off nodes: I say no shard should be on that particular node, and it moves all the shards off that node. It's a way to do, in a controlled manner, what I just did in a less nice way by killing the node. We're not doing it, but if you feel sensitive about that, you can use it.

A couple of other, more advanced things we're not using: I might want to add shard allocation awareness at some point. You can set arbitrary attributes on your nodes and then tell Elasticsearch to take those attributes into account when allocating shards, which is interesting for availability zones on AWS, for instance. You can spread all your shards and pods across different Kubernetes hosts, but those hosts might still be in the same AZ, so if one AZ goes down, your Elasticsearch might still be down — and with shard allocation awareness you can avoid that. Again, we're not doing that yet. And the shard allocation filtering I just mentioned might be worth looking into at some point.
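That "disable allocation before a planned restart" trick is just a call to the cluster settings API — roughly like this, reusing the illustrative ES_URL from before:

```bash
# Before a planned restart: stop Elasticsearch from re-allocating shards,
# then re-enable afterwards. "transient" resets on a full cluster restart.
curl -s -XPUT "$ES_URL/_cluster/settings" -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ... roll the node / do the maintenance ...

curl -s -XPUT "$ES_URL/_cluster/settings" -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```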
One big pitfall we ran into, and this is something I want to mention: scripting, which is disabled by default for good reason. One of the reasons is security — the scripts run with the same permissions as your Elasticsearch process, so be careful with that. If you need to do dynamic stuff in your queries — or not you, but your developers, if they come to you begging "please enable scripting" — try to convince them to use the sandboxed scripts, things like Mustache expressions.

I honestly don't know much about them. But if you really have to enable dynamic scripting, which uses Groovy — if you haven't heard of it, it's a scripting language running on the JVM, a dynamically compiled scripting language — this is the problem: it compiles these scripts every time you run a query with an embedded script in it, and that fills up something called the code cache in your JVM. What happened is that we enabled that stuff without proper testing. It didn't have any immediate impact, but after three days or so our nodes started running out of memory — pretty badly out of memory. It looked pretty much like this to us: everything went down, and it took us a while to figure out what the problem was.

To avoid that, you want to use parameterized scripts. What our developers had done was use string concatenation to put the dynamic parts and the static parts of the script into one string and send that to the cluster, which makes the cluster recompile the script every time. But there is a counterpart to parameterized queries in databases, where you have placeholders that are replaced at query time, and then the script is compiled only once — a parameterized script. It's pretty much like parameterized SQL queries. And if you go for it, test the impact on your cluster: test your CPU usage, have a look at your memory usage, make sure your cluster is not public — it better not be anyway, right? — and don't run Elasticsearch as root inside your pods; run it as a plain user. The usual due diligence.

Q: I'm surprised the scripts caused an out-of-memory problem — that they ate up memory. — It's because of that code cache and the way the scripts were written. As I explained, there were dynamic parts in the script, so the whole query was sent as text each time. With a database you can prepare statements: compiled one time, and then you just pass in the parameters. — Doesn't the JVM reclaim the memory after it's finished compiling? — I think there was also a bug in the JVM involved; in Java 8 and Java 9 that's somewhat mitigated. We found a surprising number of different factors that played into this, but basically it compiled these scripts every time you sent a query. There is something called the script cache, the compiled code cache, and it kept compiling until that cache was full; once it was full, it compiled every script every time it got a query, and that started eating up CPU along with the memory, and at some point the whole thing went down. I think this picture is a bit... too soon? Yeah — in our case it was already down. This was the cluster.
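The parameterized version looks roughly like this — a sketch against the 2.x-era query syntax; the field name, script body and parameter are made up for illustration:

```bash
# Sketch of a parameterized script in a query (2.x-era syntax; "popularity" and
# "boost_factor" are illustrative). The script text never changes, so it is
# compiled once; only the "params" vary per request.
curl -s "$ES_URL/products/_search" -d '{
  "query": {
    "function_score": {
      "query": { "match": { "name": "tomato" } },
      "script_score": {
        "script": {
          "inline": "_score * doc[\"popularity\"].value * boost_factor",
          "params": { "boost_factor": 1.2 }
        }
      }
    }
  }
}'
```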
Okay, I want to mention this thing. We're not using it, but I suppose if we were starting over, this is one of the things we would look at. It's called the Elasticsearch operator, and it basically gives you a controller with a custom resource definition called ElasticsearchCluster. So in your cluster, and I hope this is readable, you can create a resource of type ElasticsearchCluster and set custom properties for this type. Like you create a deployment or a stateful set today, you now create an ElasticsearchCluster, and that is going to create all the other stuff for you: all the deployments and stateful sets and services and everything.

Does everyone know what an operator is? It's a bit like a Helm chart, I guess?

So, just a bit of background on that. The term operator is used in Kubernetes for a bit of custom code that embeds best practice around a particular piece of technology. In the case of the Elasticsearch operator, it implements a bit of YAML that looks like the deployment or pod or anything else you're familiar with, but it takes those things and combines them together to deploy Elasticsearch in the best possible way. Behind the scenes it will create event loops, it will watch to see if the pods are running and make sure they stay healthy, and it will provide things like backups and snapshots and attaching storage. So you don't have to worry about whether it's going to be a stateful set or a deployment; it handles all of that for you, based on what the team of developers putting the operator together decided. You see this for Elasticsearch, but there are others for Kafka and etcd and a bunch of other tools out there. It's something that's been developed over the last year or two as a way of adding a bit of intelligence around running a particular piece of technology.

So, going back to Helm, is it lighter weight or heavier weight?

Heavier weight, yeah. Although it gives you lighter-weight Helm charts in the end, because a Helm chart can wrap around an operator. At the end of the day this is just passed to the Kubernetes API, the Kubernetes API stores it and then hands it to a bit of code, and that bit of code, the Elasticsearch operator, operates on it, talks to the Kubernetes back end, and creates a lot of pods, stateful sets and everything else based on best practice for you. It's usually a controller itself, in most cases written in Go, but it doesn't have to be; it can be Python or anything else that can talk to the Kubernetes API, to build something that gives you a really optimal experience running these sorts of things.

So how much are the Elastic folks developing these themselves?

That's the thing, they don't. There was a talk on YouTube where they discussed exactly that question, and so far they don't. I think it's part of the concept behind operators that the vendor creates the operator, because the vendor has the domain knowledge, and who knows the domain better than the vendor? But yeah, so far they don't, so this thing is maintained basically by one person, which is also one reason for us not to use it, because it's a bit of a gamble.
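Just to make the operator idea a bit more tangible, here is a rough sketch of what creating such a custom resource could look like with the Kubernetes Python client. The API group, version, and spec field names below are assumptions for illustration only; the actual schema is whatever the operator's custom resource definition declares, so check its README before using anything like this:

```python
from kubernetes import client, config

# Use the local kubeconfig; inside a pod you would call config.load_incluster_config().
config.load_kube_config()

# Hypothetical ElasticsearchCluster object. The apiVersion and field names are
# illustrative, not the operator's real schema.
cluster = {
    "apiVersion": "enterprises.upmc.com/v1",
    "kind": "ElasticsearchCluster",
    "metadata": {"name": "example-es"},
    "spec": {
        "master-node-replicas": 3,
        "data-node-replicas": 3,
        "client-node-replicas": 2,
        "data-volume-size": "10Gi",
    },
}

# Hand the object to the Kubernetes API; the operator's controller watches for
# these objects and then creates the deployments, stateful sets and services.
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="enterprises.upmc.com",
    version="v1",
    namespace="default",
    plural="elasticsearchclusters",
    body=cluster,
)
```

The point is that this one small object is all you manage yourself; the watch loop inside the operator reconciles everything else, which is exactly the embedded best practice described above.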
From an economic standpoint, how does it compare with a hosted Elasticsearch deployment? Is it cheaper or more expensive?

It definitely gives you more control, because it runs on our own cluster. I did invest quite a bit of time in this thing recently, but I'm also confident that this is almost closed; it was related to upgrading Elasticsearch and moving the Kubernetes cluster to 1.7, so we had a lot of new concepts to deal with.

Why did we not go hosted? We considered it. One thing is, I think you can't get hosted Elasticsearch in Singapore. Well, we could use something like AWS, or the offering from Elastic, the company; there are a lot of service providers for Elasticsearch. Right, but then it's not only a cost factor, there's also, as I mentioned before, a governance factor involved: do you want to have your data in a third-party location? And there's latency and performance involved. For Elasticsearch that's probably not the defining criterion, but for something like Kafka it might be. I think Elasticsearch was one of the very first proofs of concept that Kubernetes could handle this kind of workload for us, yeah, it was one of the things.

Okay, just to wrap this up: as Hunter said, an operator gives you a higher level of abstraction over the Kubernetes primitives, it embeds the domain knowledge, and in the best case it comes from the vendor, so you get something that's actually sound and solid. If you want to see a demo of that thing, there's a video on YouTube; I'm not going to show it now because we're running late. But again, if you look at the contributor statistics on GitHub, it's one guy doing this, so yeah. Okay, that's it.

So it's not just one guy? It's not that much from the others, right, they mostly just contributed a bit. Where are those charts again? Yeah, you see, it's predominantly one person, Steve. Six million lines, what happened there? Let me guess, it's all the Go vendoring. Six million lines is a lot of Go. He's a very productive person.

Okay, yeah, that's the end of my talk. We're running a bit late, so we shouldn't discuss too much more now. Thank you. You're welcome, because now I don't have to talk about that.