So thank you for coming to our talk about building Apache Druid on Kubernetes at Dailymotion. First, there won't be a single line of YAML in this presentation, so you can leave now if that's what you came for. Also, we're going to talk about what we are putting in our Kubernetes clusters, what workload we are running, so maybe not the most typical plan for KubeCon.

So I'm Alex Triget, I work at Dailymotion as a Senior DevOps engineer. And I'm Cyril Corbon, DevOps architect at Dailymotion. At Dailymotion we are the DevOps team. We build Kubernetes platforms on different cloud providers and on-premise. We also help developers set up and productionize all their workloads: we provide the CI/CD they use to deploy and we enforce best practices for them. And we also do all the on-call stuff.

Dailymotion, if you don't know us, is a company that builds a video platform. We have around 4 billion views per month, and we have two main products: the dailymotion.com service, which is our main platform, and a network of partners, with advertising platforms for other customers. We serve all the metrics to all those partners through Druid, and that's what we will talk about in this presentation.

So yes, I'll introduce what Apache Druid is. But first, does anyone here run Druid? Quick show of hands, please? Really? I'm so happy. Great.

Basically, Druid is a database. You use it for analytics: it's OLAP, you want to use it for time series, and that brings a lot of benefits. It's also a columnar database, so plenty of interesting capabilities.

Here we have a picture of the architecture. It's pretty complex, and there are different ways to run it. Let's take a top-down approach. You have your query nodes at the top, which are what we interact with. They handle all the routing of the queries made to the historical nodes, which serve the data. The indexers, or MiddleManagers, differ from the historicals in that they ingest the data into your cluster so it can then be served by the historicals. Underneath, we have a layer of deep storage; we run it on the S3 protocol at our cloud provider. And on the side you have a control plane, which is actually fairly complex: there is Apache ZooKeeper, and the metadata storage, which is a database, in our case MySQL. And you have some other Druid nodes that handle the persistence of the data and all the balancing, so that all your historicals stay in sync.

Now I'll explain how a query works in Druid. Basically, it's a scatter-gather approach. When a request is made to the broker, the broker scatters the query to all the historicals, which each execute a partial query and send the partial results back to be aggregated, the gather part, at the broker. It's nice because, if you have good patterns, all the RAM-intensive work happens on your historicals and the brokers mostly just do the final reconciliation, which is very efficient, since your historicals are built with a lot of RAM precisely to serve that. Thank you.

So why use Druid? I've listed the pros here, but really the big thing is the ability to do OLAP on time series. That's rather unique; you can do it with other tools, but not as well. And you can do it on columnar data, so it's pretty much a standard in terms of OLAP.
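To make the query path concrete, here is a minimal sketch of sending a SQL query to the broker, which then does the scatter-gather work underneath. The host, datasource and column names are made up for illustration:

```python
# Minimal sketch: send a Druid SQL query to the broker's SQL endpoint.
# The broker scatters the work across the historicals and gathers the
# partial results. Host, datasource and columns are placeholders.
import requests

BROKER_URL = "https://druid-broker.example.internal:8282/druid/v2/sql/"

query = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
       country,
       SUM(views) AS total_views
FROM video_stats
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY TIME_FLOOR(__time, 'PT1H'), country
ORDER BY total_views DESC
LIMIT 100
"""

resp = requests.post(BROKER_URL, json={"query": query}, timeout=60)
resp.raise_for_status()
for row in resp.json():  # one JSON object per result row
    print(row)
```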
On the schema side, this means you can just add other columns as much as you want and never have to think too hard about your schema, as long as you aggregate the data. It's easy to scale, because you can scale all your nodes horizontally: you can increase the number of nodes, or just increase the storage if that's what you need. So it's very powerful on that front. The documentation is also great, which ties into a big con here: the configuration is pretty complex. Once you have your setup up and running, you will need to do quite a bit of tuning on your values and adapt it to your use cases to really make the best of what you have. Otherwise you can just throw more RAM at the problem; it will eventually work, but that's not exactly what we're here for.

And yes, as I mentioned, the setup is pretty complex. There are a lot of moving parts, and our Druid runs something like five different types of nodes. That can be a bit of a hassle when you're doing a rollout: you need to make sure everything happens in the right order, that kind of thing. And finally, you need to actually need it, because it's something pretty unique that only fits a certain set of use cases, even if those use cases are fairly widespread. Mostly I'm thinking about statistics here. It's very powerful for that, especially since we use a lot of approximate counts to save on group-bys and that kind of query, which makes it very fast.

Now I'm going to briefly show you the Druid UI. I'm not sure you can see it quite right, we have some color problems, but basically this is a standard SQL query. You can interact with Druid using the native Druid query language, which is a bunch of JSON where you list all the operations you need and in which order, or you can use SQL, and there is a translation engine. It works fairly well, especially with the latest versions, but you always run the risk of inefficient queries, because we are on fairly specific data patterns, so specific queries run best here.

And yeah, I also didn't mention: you have three kinds of fields in Druid. You have your timestamp, which will be your primary aggregation axis; there is only one of those. Then you have dimensions, which are basically strings and can even be arrays; not something we are using, I think it's dangerous, to be honest. And, I don't have any here, but you have floats for your aggregations, for all the stats you want to serve; those are your metrics.

Also I wanted to mention ingestion. We usually ingest data hourly and daily; the data is reconciled at the end of the day, which is standard business practice for advertising data, so some things are just done that way about everywhere. We usually have a data freshness of a bit less than two hours: once an event happens, about two hours later it is counted in our Druid cluster. We use Apache Airflow for that. It's an orchestrator, I'm sure a bunch of you are familiar with it, and it runs quite well. The only problem we have is when we want to run very heavy tasks. When we are backfilling an entire table, it's called a data source in Druid but it's basically a table, it can take some days, it can be very cumbersome, it can basically kill the cluster. So what we do is spin up Hadoop clusters on demand; we use them as a SaaS and it runs well. It's not my favorite part, to be honest, and I would advise against doing it. You can find other patterns, especially with the latest versions of Druid, but legacy is legacy.
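For reference, this is roughly what one of those other patterns looks like: Druid's native batch ingestion, a JSON task spec posted to the overlord instead of spinning up a Hadoop cluster. It's a hedged sketch; every name, path and host below is made up, not our actual spec:

```python
# Rough sketch of a native batch ingestion task submitted to the overlord.
# Datasource, columns, bucket and host are placeholders for illustration.
import requests

OVERLORD_URL = "https://druid-overlord.example.internal:8290"

task = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "video_stats",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "video_id"]},
            "metricsSpec": [
                {"type": "longSum", "name": "views", "fieldName": "views"}
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "day",
                "queryGranularity": "hour",
                "intervals": ["2024-01-01/2024-01-02"],
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "s3",
                "uris": ["s3://example-bucket/events/2024-01-01.json.gz"],
            },
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "index_parallel", "maxNumConcurrentSubTasks": 4},
    },
}

resp = requests.post(f"{OVERLORD_URL}/druid/indexer/v1/task", json=task, timeout=30)
resp.raise_for_status()
print("submitted task", resp.json()["task"])
```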
So now we are going to talk about how we set up our Druid on Kubernetes. Just to know: who is running stateful workloads on their Kubernetes clusters? Okay, we got some, okay. In production? Yes, okay. So that's nice.

Just a brief recap for people who don't know operators. The goal of an operator is to have a declarative state and to keep all the logic in the operator. It reconciles the state and applies the resources. Each time you change the state, it reconciles, applies the change and reports the status. That's how it works in almost all cases. We have a lot of use cases at Dailymotion and we run a lot of operators. The main use cases we tackle are data management, database management, application monitoring and deployment, monitoring and logging, machine learning and infrastructure management. With Druid we use some of those operators, like cert-manager to provide certificates on our Ingresses, and we use others, such as the logging operator, to manage our Fluentd and Fluent Bit pipelines dynamically.

About the Druid operator, just a note: recently, together with the maintainers, we forked the repo. It's now maintained on this GitHub repository, not on the old one. On the Apache side they don't have any maintainers currently working on the project, but other people do. It contains one and only one CRD for the moment, the Druid cluster. The CRD deploys the cluster with the several types of nodes we presented: the routers, MiddleManagers, coordinators, historicals, brokers. By default everything is a StatefulSet, but you can change the kind depending on the type of node; for example, in our use case only the historicals are stateful.

So that was the operator. The main features it brings are these. First, rolling deploys: when you make a modification of the configuration on any Druid object, on any part of the CRD, it redeploys the cluster in the right order; that means, and we will talk a bit more about it later, the historicals first, and node by node, part by part. Then it provides autoscaling with HPA on all the components. Also volume expansion, which is useful for the historicals, without any disruption. We didn't use it, but it can also manage orphaned PVCs and delete the PVCs of a StatefulSet if they are left over. And the feature that we really wanted is tier management: you can create several groups of historical nodes and serve different types of data from each tier. That's really interesting, and the operator does it pretty well.

Yeah, so basically what happened is we had a legacy Druid cluster. It started as a POC and it ended up in production, but it was still kind of a POC. Mostly it was working, it was kind of reliable, I'd say, but the main issue was that we didn't know what we wanted to do with it. Basically we provided it to the data engineers and they had fun with it, so we ended up scaling a lot in terms of storage; we didn't think we would store that much. And so after about two years we ended up with an old version, because we had never made the effort to update, and a bunch of needs that had changed. So it was time for a refactoring, a re-architecture even.
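As an aside, to give an idea of what the tier management mentioned above looks like in practice: each group of historicals gets its own tier name, and load rules on the coordinator decide which tier keeps which data and with how many replicas. A rough sketch, with made-up tier names, datasource and host:

```python
# Sketch of driving historical tiering through coordinator load rules.
# Tier names, datasource and host are placeholders for illustration.
import requests

COORDINATOR_URL = "https://druid-coordinator.example.internal:8281"
DATASOURCE = "video_stats"

rules = [
    # recent data: two replicas on the "hot" (fast, expensive) historicals
    {"type": "loadByPeriod", "period": "P1M", "includeFuture": True,
     "tieredReplicants": {"hot": 2}},
    # everything older: a single replica on the default tier
    {"type": "loadForever", "tieredReplicants": {"_default_tier": 1}},
]

resp = requests.post(
    f"{COORDINATOR_URL}/druid/coordinator/v1/rules/{DATASOURCE}",
    json=rules,
    timeout=30,
)
resp.raise_for_status()
```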
We got rid of TiDB. I really like TiDB, it's a good product. But the problem is how we used it: we ran it inside our cluster, at a cloud provider, as the in-cluster database, and it didn't really suit our need, because it's good for performance and that kind of thing, but what we needed was simplicity. Basically all you need is a MySQL to hold the metadata. So in the end we cut it, but we are pretty happy about the experience we gained with it, and we intend to reuse it on other projects.

And also, yes, we had a lot of pain points, mostly around rollouts. A rollout took 28 hours. We couldn't do any operations on our node pools, it was very complicated, so the main goal was to change that. We were using local SSDs, which are very good performance for the price, but it wasn't satisfying enough. In the end we calculated that we saved a lot of money by switching to PVCs, which are twice as expensive, but the engineering time saved was massive.

What we also did is GitOps with Flux CD, which is something we use very widely: we have moved about two thirds of our projects to Flux CD, all new projects are on it obviously, and we intend to be done by the end of the year, I'd say. So yes, that was why we needed to change all of that. Also, the deployment was manual; it was really a bunch of legacy left to rot in a corner, and we needed to take back ownership of our stack. That's why we did this project. Basically the point was to build brand new Druid clusters that matched our needs, that enforced best practices, and that would cost less, because we were using big machines with 128 gigabytes of RAM and, well, there was money to save.

Yes, and we are going to talk about this: this is the current state of our Druid clusters. We have the standard node pool with all the main parts, so the coordinators, the routers, the brokers, plus the external components that are not part of the Druid operator. I would like to remind you that ZooKeeper and Memcached are not deployed by the operator; they are deployed separately. Even the ingress controller is not managed by the operator; the operator just deploys an Ingress. So we have those on a separate node pool: a node pool dedicated to exposing the service, with the Ingress, the ingress controller, and the application that serves the API and all the statistics and that queries Druid. And then we have the stateful nodes, the historicals. They are pretty huge in memory and storage usage: if I remember correctly it was more than one terabyte of PVCs, and 80 gigabytes of memory for each pod. These workloads are very memory intensive; that's the point of Druid and that's why we do it this way. To reduce cost, a part of our Druid historicals run on spot instances, because when the cluster kills a historical, the queries that were running on it simply fail and are retried on the other nodes, and since we have a replication factor of two, the queries can be served by the other nodes automatically. That saves us a lot of money. That's almost everything for this part.

Yes, so as I said, we wanted to save money and we wanted to improve performance, so what we needed was benchmarking.
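In practice, the benchmarking was roughly this shape: replay a fixed list of queries against a broker and time the whole run, smaller being better. A sketch with a placeholder host and a hypothetical query file:

```python
# Sketch of a benchmark harness: replay a set of production-like SQL queries
# against a candidate broker and report total wall time (smaller is better).
# Host and query file are placeholders.
import json
import time
import requests

BROKER_URL = "https://druid-broker-candidate.example.internal:8282/druid/v2/sql/"

def run_suite(queries: list[str]) -> float:
    start = time.monotonic()
    for sql in queries:
        resp = requests.post(BROKER_URL, json={"query": sql}, timeout=300)
        resp.raise_for_status()
    return time.monotonic() - start

if __name__ == "__main__":
    with open("benchmark_queries.json") as f:  # queries captured from production
        suite = json.load(f)
    print(f"{len(suite)} queries in {run_suite(suite):.1f}s")
```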
What we did, obviously, was build a test set of requests. Here you have the number of seconds to run the whole test, so smaller is better, and the yellow one, I guess it's yellow here, is the old cluster. We created two suites for the benchmark. One tests all the limits and edge cases, basically pulling back all the data in the cluster and trying to break it; that allows us to test resiliency as well as performance, in a context which is not really representative of real usage. The other one uses real requests taken from production. So we ended up with these results; we have more fine-grained numbers, but this is the summary. We tested a bunch of configurations. One interesting thing we did was mimic what we had: we had local SSDs, and instead of replacing four local SSDs with one PVC, we replaced each local SSD with one PVC to keep good performance. We saw very important differences when using more PVCs, and it costs exactly the same, so it was really almost a hack. Pretty happy about that.

Also, we tried remote caching. We didn't use to have it, but caching is very good for Druid, because, as I mentioned earlier, you have all those stages for your queries: you can cache query results, partial query results, or even data, with segments cached on the historicals, which use the spare RAM for that. So the more caching you have, the more efficiency you get, especially when you are serving traffic for statistics, where there is often redundancy between requests; everybody wants to see the data from the last seven days on Monday, that kind of thing. And in the end remote caching was very interesting in terms of performance, and it was very easy to set up with Memcached and its chart, as Cyril mentioned. I really advise adding caching when you can; I think it's neat.

And yeah, now I'll talk about the migration process. This was the biggest headache, I'd say, of this project, because what we needed was no downtime, so we had to do a double run of our clusters. The problem is we had concurrency between the operators, which were not the same version, so we had to be a bit careful. What we did was start by killing the old operator. When that was done, we still had our custom resources in place, with no operator controlling them, but that was fine as long as nobody did an operation on them. Afterwards we set up the new operator, all of that with Flux CD, so everything was ready to roll out; it was really much easier to perform this operation that way.

Then comes the even trickier part. We scaled down some nodes, the ones that write information to the database, so that we were able to do a database migration, because we wanted to change what we used at the same time. So we scaled down the nodes that write to the DB, the ones that interact with it, and we killed all the ingestion tasks. That was our only downtime: we had a bit less freshness on our data, which for our use cases is perfectly fine. Then we set everything up on the new cluster, we migrated the database, and in the end we just switched over to the new cluster and were able to kill the old one after two days. The whole operation took about one hour to migrate about 13 terabytes of data, which is nice, to be honest, and we didn't have any downtime. The most time-consuming part was actually dumping the database and restoring it; I guess we could have improved that, but we were happy with the overall result.
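To make the "kill all the ingestion tasks" step concrete, it is essentially a couple of calls to the overlord's task API. A sketch with a placeholder host:

```python
# Sketch of shutting down all running ingestion tasks before the metadata
# database migration, via the overlord's task API. Host is a placeholder.
import requests

OVERLORD_URL = "https://druid-overlord.example.internal:8290"

running = requests.get(
    f"{OVERLORD_URL}/druid/indexer/v1/runningTasks", timeout=30
)
running.raise_for_status()

for task in running.json():
    task_id = task["id"]
    print(f"shutting down {task_id}")
    requests.post(
        f"{OVERLORD_URL}/druid/indexer/v1/task/{task_id}/shutdown", timeout=30
    ).raise_for_status()
```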
And now we'll talk about monitoring and exposure. Yep, so for our use case we are using Datadog, but there are a lot of plugins, you can see all of them there: OpenMetrics, Prometheus and so on, those are the plugins available. And we have two Ingresses: one for the console that we showed before, and one for the API, which is protected with mTLS. We use cert-manager to provision all the certificates in our case; that's how we did it. Just one more point: we also use Datadog APM to monitor all the traces, and that's sometimes useful on our Go API.

Now we are going to talk about how we GitOps this, meaning how we deploy it in our clusters. As Alex said, we are using Flux CD. It works with two main controllers, the kustomize controller and the Helm controller, and with different kinds of sources; in our case we have Git repositories and Helm repositories. We deploy several Kustomizations in our clusters, and we do some layering there too. A first Kustomization deploys all the CRD stuff and all the operators, another one deploys all the business namespaces, one deploys Memcached and ZooKeeper, and finally one deploys the Druid clusters. That allows us to suspend a Kustomization if we have to make a modification directly in production and then port it back, if we need to fix something faster. It also allows us to have dependencies between our deployments. That's really nice and it works perfectly well.

About our common operations: we talked about the automatic rollout, which is a feature the operator brings. When we make a modification, it always rolls out in the same order: if the historicals are impacted, the historicals restart first, one by one or two by two depending on your update strategy, and then it takes the overlords, the indexers, the real-time parts, the coordinators and the brokers. Another part is storage updates. In Druid we sometimes need to add data, and metadata too, so we want to scale horizontally and vertically. The operator provides volume expansion dynamically, and it works really well. Another point: recently we had to migrate from one CSI driver to another, and the operator helped us a lot there. When we change the configuration, only the CSI on the new StatefulSets changes, and the data is reloaded node by node, so with the replication factor we don't have any disruption.

So yes, we're going to give you some feedback. As I said, everything we did was perfect, it was very nice, congratulations, we congratulated ourselves a lot... well, no. We had a few issues. One of the issues we encountered is performance. Basically we are using SQL, because it's a part that is not maintained a lot, we have had a lot of team changes, and so we don't have a lot of skills around native queries right now. The problem when you do that is that you don't have control over what is actually running on your cluster, and sometimes you end up with big inefficiencies: you are running a bunch of group-bys, you're running aggregations of aggregations, while Druid is not made for that and you don't need that. So that's one of the points. In the end we were able to make it do what we wanted by playing a bit with the SQL and by using the latest versions, which have a very good translation engine; to be honest, it's really getting there. So you can just use SQL, and if you have bad performance, rewrite your SQL to adapt it to the use case. We also had a lot of tweaking to do on the configuration. As I mentioned, there is a lot of good documentation: it explains what the defaults are and how you should do your calculations, but you need to get in there, pull out your calculator and your Excel sheets, and do a bunch of operations on how many threads you will need, how many buffers, which sizes, what that means for your brokers, that kind of thing. You can do a pretty good job on your own, but in the end you will need to test it to make sure that everything makes sense and that it runs properly.
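To give a flavor of those calculations, here is a toy sketch of the kind of arithmetic the documentation walks you through, for example the relation between processing threads, merge buffers and off-heap memory. The pod sizes below are made-up examples, not our production values:

```python
# Sketch of the sizing arithmetic the Druid docs describe for a query-serving
# process. Pod sizes are hypothetical; the relation between processing
# buffers and direct memory comes from the configuration reference.
def direct_memory_needed(num_threads: int, num_merge_buffers: int,
                         buffer_size_bytes: int) -> int:
    # Druid needs at least (numThreads + numMergeBuffers + 1) processing
    # buffers' worth of direct (off-heap) memory.
    return (num_threads + num_merge_buffers + 1) * buffer_size_bytes

GiB = 1024 ** 3

# hypothetical historical pod: 80 GiB of RAM, 16 cores
num_threads = 15                 # typically cores - 1
num_merge_buffers = 4
buffer_size = 500 * 1024 * 1024  # 500 MiB per processing buffer

off_heap = direct_memory_needed(num_threads, num_merge_buffers, buffer_size)
heap = 24 * GiB                  # JVM -Xmx, left to tune
page_cache = 80 * GiB - heap - off_heap  # what's left for segment caching

print(f"-XX:MaxDirectMemorySize must be >= {off_heap / GiB:.1f} GiB")
print(f"leaves about {page_cache / GiB:.1f} GiB of RAM for the OS page cache")
```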
Also, about the migration, we ended up with a big surprise: when we migrated our data, we changed the bucket we used underneath, and I didn't actually know that the database keeps, for every segment, the location, the bucket it lives in. It shouldn't really be needed, because Druid knows which bucket to search in, but it's specified in there as well, and we didn't change it. So we ended up with the staging data not showing up, as if it wasn't there. It took us a bit of time to understand what was going on, and we ended up running for two weeks on two different buckets at the same time, which was very messy. So we had to do a second migration, and this time we were able to rectify it.

And finally we had some very, very strange behavior when using preemptible spot instances: some very low-level, down-to-the-machine problems that we weren't able to explain. It was very hard to debug because we were using Java 8, which is not made for Kubernetes, so you are not able to monitor what it's doing with the RAM. You can just give it a huge chunk of RAM and hope for the best, basically, which is not ideal, because the more RAM you have on your Druid clusters, the better it operates, and especially with spare RAM it can cache segments and things like that and get a lot of benefit out of it. So we had a lot of issues on this.

Now we'll talk about all the improvements we see for our own clusters, but to be honest, it also reflects a bit of what is happening right now around Druid, I'd say. Currently we are planning to migrate to Druid 25 and to Java 17, because we have some issues with the Java 8 version at the moment; to be precise, we have RAM that is reserved by the JVM but not usable by Druid when we are running on spot instances. That's what happens. Then, we have some nodes that are already running on ARM, and we are planning to decrease costs by migrating all the Druid clusters to ARM, I think in the next quarter. We also plan to add a ProxySQL in front of our SQL engine, and to run Druid without ZooKeeper; the idea is to use etcd and the Kubernetes API to get all the endpoints and all the information we need. It's currently marked as unstable in the documentation, so we plan to test it on staging soon, and we hope to have this done in the next months too.
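One more note on the deep-storage surprise from earlier, for anyone who hits the same thing: each segment row in the metadata database carries a JSON payload whose loadSpec still points at the old bucket, so that is what has to be corrected. This is a heavily simplified, hypothetical sketch (placeholder credentials, S3-style loadSpec assumed); stop ingestion and back up the database before touching anything:

```python
# Hypothetical sketch: rewrite the bucket referenced in the loadSpec of each
# used segment in the druid_segments metadata table. Credentials, hosts and
# bucket names are placeholders; back up the database first.
import json
import pymysql

OLD_BUCKET = "druid-deep-storage-old"
NEW_BUCKET = "druid-deep-storage-new"

conn = pymysql.connect(host="mysql.example.internal", user="druid",
                       password="change-me", database="druid")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT id, payload FROM druid_segments WHERE used = true")
        for seg_id, payload in cur.fetchall():
            spec = json.loads(payload)
            load_spec = spec.get("loadSpec", {})
            if load_spec.get("bucket") == OLD_BUCKET:
                load_spec["bucket"] = NEW_BUCKET
                cur.execute(
                    "UPDATE druid_segments SET payload = %s WHERE id = %s",
                    (json.dumps(spec).encode("utf-8"), seg_id),
                )
    conn.commit()
finally:
    conn.close()
```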
How long do we have? I don't think we actually have time to talk about the longer term. Anyway, maybe we can finish with a closing remark. In the end, what we are doing is running StatefulSets on Kubernetes and really taking advantage of it. I would understand very well that you wouldn't want to do that, but basically it's not something that would run better on bare metal, in my opinion. So maybe it's not that well suited for Kubernetes, but then what is it suited for? We don't know, and so we think Kubernetes is really the best that can be done, even though it's maybe not that easy. And finally we'd like to address a special thanks to the Druid and Druid operator community, especially the maintainers, with whom we have worked a lot. Thank you.