All right, thanks everybody for joining. I'm Gopal, and this is my colleague Jaydeep. We are from Uber. Today we will present how Uber uses Cassandra to power our core applications at very large scale. The agenda looks like this: we'll start with an introduction, touch upon the scale, the architecture, and how we manage the fleet, then cover some of the features we have developed at Uber and pushed upstream, and finally some of the challenges we face. So let's dive in.

We provide Cassandra as a service to our internal teams. We aim to do the heavy lifting on the Cassandra side and allow teams to use Cassandra with minimum effort. The first thing that comes into play is how we fit open source Cassandra into Uber's infrastructure. For this we need to make several changes for things like service discovery, logging, metrics, change detection, and backup. We make these changes on top of open source Cassandra in an internal fork. Our journey began with version 2, and since then we have been trying to keep pace with the stable open source version, upgrading as and when possible. As of now, we are almost done upgrading to Cassandra 4.

Another major responsibility our team has is to understand Uber's needs and build solutions on top of Cassandra to power Uber-specific requirements. We push some of these features to open source wherever possible, and the second half of the presentation will cover some of them. We also operate a large fleet with lots of automation, and we provide consultation to our product teams in mapping their requirements to Cassandra features. Putting all of this together, Cassandra is made available to internal Uber teams as a service.

So let's discuss the scale. The scale is very large, and at this scale we exercise almost every feature of Cassandra. When a wide range of use cases comes into Cassandra at very large scale, we start to observe weird behavior, or sometimes even bugs. For example, although Cassandra 4 is very stable in open source, during our upgrade we recently found multiple bugs; we fixed them and pushed the fixes back to open source. We use safe deployment practices to catch issues early. We do not encourage a lot of custom tuning per use case, because when we operate at such a large scale it is not easy to maintain so many custom configurations. We want our default configuration to work for most of our use cases. Note that our default configuration is not the same as the open source default configuration.

So this is the architecture, at a very high level, 10,000 feet. As you can see, we have multiple regions, and the same cluster can have a footprint in multiple regions. At the top we have the orchestration engine, which provides hooks where we write Cassandra-specific logic for actions like create cluster, restart node, restart cluster, or replace node. We also use a sidecar that acts as a bridge between the orchestration engine and Cassandra; the sidecar interacts with Cassandra over the JMX port. On the data plane side, applications connect to Cassandra using driver libraries. These are the open source drivers with small modifications, again for service discovery, load balancing policy, and host filter policy. Applications only need to provide the cluster name; the driver handles everything else internally, which reduces friction in adopting Cassandra on the application side.
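As a rough illustration (not Uber's actual internal code), a wrapper over the DataStax Java driver 3.x could look like the sketch below. The UberCassandraClient class and the resolveContactPoints service-discovery hook are hypothetical names; the talk only says the driver is modified so applications supply just a cluster name.

```java
import java.util.Arrays;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.HostFilterPolicy;
import com.datastax.driver.core.policies.LoadBalancingPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public final class UberCassandraClient {  // hypothetical wrapper name

    // Hypothetical: in a real setup this would call internal service
    // discovery to translate a logical cluster name into live endpoints.
    private static List<String> resolveContactPoints(String clusterName) {
        return Arrays.asList("10.0.0.1", "10.0.0.2");  // placeholder
    }

    public static Session connect(String clusterName, String localDc) {
        // Token-aware routing on top of DC-aware round robin, wrapped in a
        // host filter so the driver only considers acceptable hosts.
        LoadBalancingPolicy lbp = new TokenAwarePolicy(
                DCAwareRoundRobinPolicy.builder().withLocalDc(localDc).build());
        LoadBalancingPolicy filtered = new HostFilterPolicy(
                lbp, (Host h) -> h.getDatacenter() != null);

        Cluster cluster = Cluster.builder()
                .addContactPoints(resolveContactPoints(clusterName)
                        .toArray(new String[0]))
                .withLoadBalancingPolicy(filtered)
                .build();
        return cluster.connect();
    }
}
```

Usage would simply be UberCassandraClient.connect("my-cluster", "dc1"), which matches the "cluster name is all you provide" experience described above.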
We also have a hybrid deployment model, where part of a cluster can be on premise and part on public cloud. Thanks to Uber's low-level infrastructure, which abstracts out the differences between on premise and public cloud, migration to and from public cloud is very easy.

A little bit about fleet management. At our scale, every operation has to be automated; we cannot afford to do any operation manually. The fleet automatically reacts to any changes happening: this includes vertical autoscaling, CPU and disk bad-host detection and evacuation, and node replacements to achieve better packing of VMs. All dashboards, metrics, and alerts are auto-generated when new clusters are created, so no manual actions are needed there. We provide strict SLAs to our customer teams; clusters are grouped into tiers, and the SLA depends on the cluster's tier.

Next, resiliency and fault tolerance; we have some practices we want to share here. When you book a ride or order food, it is very likely that the request goes through multiple Cassandra clusters, and therefore we need to make sure Cassandra is running all the time with the highest availability. Bad-host detection is very important for us: as soon as a CPU, memory, disk, or network error is detected on a host, a node replacement is automatically triggered. This happens multiple times a day without anybody noticing.

Next is replica placement. We never place more than one replica in a single zone, which means that if a zone goes down we still have a majority of replicas available, so a zone failure does not impact availability.

Node replacements can also be triggered for a variety of reasons beyond bad hardware. For example, we do fleet optimization, or OS and kernel upgrades where we don't want to upgrade in place, so node replacements are required, and these operations have to go very smoothly. The way we do a node replacement is to first add a new node, then decommission the old node. This ensures higher availability compared to decommissioning a node and bootstrapping a replacement with the replace-address option. In addition, we use full repair and node cleanup for guaranteed data consistency.

One of the more prominent features we use is bulk data uploading. We use Cassandra as a feature store for machine learning. Feature stores hold petabytes of data that need to be refreshed on a daily basis, so our ML platform, Michelangelo, uploads data to Cassandra using bulk upload. First, new SSTables are prepared offline using Apache Spark, and then these SSTables are streamed to Cassandra over the storage port; to the Cassandra nodes, these SSTable streams look just like inter-node streaming. The Marmaray framework, which Uber has open sourced, is used for managing bulk uploads to Cassandra.
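The offline SSTable-generation step can be done with Cassandra's own CQLSSTableWriter. Here is a minimal sketch under assumed names (the ml.rides_features table is hypothetical), with the Spark distribution and the actual streaming step (sstableloader or equivalent streaming over the storage port) left out:

```java
import java.io.File;

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public final class FeatureSSTableJob {

    public static void main(String[] args) throws Exception {
        // Hypothetical schema: in a real Spark job, each executor writes
        // its partition of rows into its own output directory.
        String schema = "CREATE TABLE ml.rides_features ("
                + " entity_id text, feature_name text, feature_value double,"
                + " PRIMARY KEY (entity_id, feature_name))";
        String insert = "INSERT INTO ml.rides_features"
                + " (entity_id, feature_name, feature_value) VALUES (?, ?, ?)";

        File out = new File("/tmp/sstables/ml/rides_features");
        out.mkdirs();  // CQLSSTableWriter expects the directory to exist

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(out)
                .forTable(schema)
                .using(insert)
                .build();

        // Placeholder rows; a real job iterates over the dataset.
        writer.addRow("rider-42", "avg_trip_minutes", 18.5);
        writer.addRow("rider-42", "trips_last_7d", 11.0);
        writer.close();

        // The resulting SSTables are then streamed to the cluster over the
        // storage port, which the nodes treat like inter-node streaming.
    }
}
```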
Now I'll hand over to Jaydeep for the rest of the presentation.

Thanks, Gopal. In this section we'll talk about the features we have developed on top of open source Cassandra, as well as some of the fixes and corner-case issues we have discovered at scale.

One of the important items is repair. I don't need to emphasize repair too much, because it's an essential part of operating Cassandra, so I'll jump straight to what we have done to solve it. When we were looking at building a solution that could scale to the fleet we operate, we considered two approaches for running repair. One is, of course, a control-plane-based approach, where you trigger repair from outside Cassandra and the control plane manages the whole process, one node after another. The other option is having repair built into Cassandra itself, like compaction: it happens automatically and you do not need to worry about it. We did a bunch of experiments, and finally we decided to go with built-in repair in Cassandra itself.

As you can see on the right-hand side, all the nodes know the metadata about each other: they know who is going through repair and they know the sequence. Once we know the sequence, we can schedule how many nodes go through repair at a time, and the status is maintained globally across all the nodes. As soon as you start your Cassandra cluster, very much like compaction, repair triggers automatically based on the configuration and thresholds you have set. It has been running for more than four years and requires essentially no manual intervention. We have taken care of the corner cases: what happens when a node is restarted, or how we automatically restart a repair thread that gets stuck in Cassandra rather than restarting it manually. Again, as Gopal mentioned, all these things are available: some have been adopted into official Cassandra, and the rest are available as patches in our public fork for you to look at. Because we were mostly on 3.0 so far, we were using Cassandra's full repair, but now that we have moved the majority of our fleet to 4.0, we are actively working on moving from full repair to incremental repair.
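As a rough illustration of the scheduling idea (not the actual patch), each node can derive the same global repair sequence from shared ring metadata and only start its own repair when it is at the head of the queue. The RepairScheduler class and the globally visible status map are hypothetical stand-ins for the real mechanism:

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Illustrative only: each node runs this check periodically and consults
// globally replicated state (e.g., a shared table) rather than a local map.
public final class RepairScheduler {

    enum Status { PENDING, RUNNING, DONE }

    private final UUID self;
    private final List<UUID> ringOrder;      // same ordering on every node
    private final Map<UUID, Status> status;  // globally visible repair state
    private final int maxConcurrent;

    RepairScheduler(UUID self, List<UUID> ringOrder,
                    Map<UUID, Status> status, int maxConcurrent) {
        this.self = self;
        this.ringOrder = ringOrder;
        this.status = status;
        this.maxConcurrent = maxConcurrent;
    }

    /** Called periodically, much like a compaction check. */
    boolean shouldStartRepair() {
        long running = status.values().stream()
                .filter(s -> s == Status.RUNNING).count();
        if (running >= maxConcurrent) return false;
        // Deterministic sequence: every node computes the same "next node
        // to repair", so only the node at the head of the queue starts.
        for (UUID node : ringOrder) {
            if (status.get(node) == Status.PENDING) return node.equals(self);
        }
        return false;
    }
}
```

A watchdog alongside this would flip a RUNNING entry back to PENDING if its node restarts or its repair thread stalls, which is the automatic-restart behavior mentioned above.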
Another feature, or framework, we have built in open source Cassandra is anti-pattern detection. Apache Cassandra is very feature-rich: the query language is feature-rich, and it gives end users lots of options, such as the compaction strategy and the consistency level you specify on your read and write queries. That is part of Cassandra's power, and it is what lets it serve a wide variety of workloads. But there is a downside: if users do not model their data correctly, it can lead to anti-patterns like large partitions, tombstones, and so on. Another thing is that people sometimes use consistency levels like LOCAL_ONE or ONE as opposed to QUORUM or LOCAL_QUORUM, and depending on your organization's resiliency story, you may have to adjust those queries and patterns. When we started facing a lot of incidents due to anti-patterns, one piece of feedback we received from our stakeholders was: yes, we would like to know what the issue is, but just telling them "you have a data hotspot" does not mean anything to them; you have to be more fine-grained.

So what we did is build an anti-pattern detector in Cassandra, where we inspect every single incoming read and write query against the list of anti-patterns we have classified. If a query falls into one of those categories, as you can see on the right-hand side, the framework automatically emits the details, and the details are very fine-grained: we provide the keyspace name, table name, partition key, and wherever possible the clustering key. It is structured data, emitted to our log files and metrics. Once we have the data, we do a lot of other things with it; on the far right-hand side you can see we have an issue generator. Unfortunately, we cannot fix all these anti-patterns magically on the server side, otherwise we wouldn't have to do any of this; things like large partition issues are data-modeling problems that are not easy to fix. So we generate issues, file tickets, and assign them to our users, so they are aware of which anti-patterns they have been running into on their Cassandra clusters. The information is also available on a Grafana dashboard. When things are running healthy, nobody cares; but when your Cassandra cluster is not performing, or you are in incident-response mode, this is very critical information: in real time it tells you which partition key is having issues, so you can act fast.
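A toy sketch of the inspection hook: each query is checked against a list of policies, and violations come out as structured records. The AntiPatternDetector class, the policy names, and the thresholds are all invented for illustration; the talk only confirms that every read and write is checked and that keyspace, table, and partition key are emitted.

```java
import java.util.ArrayList;
import java.util.List;

public final class AntiPatternDetector {  // hypothetical name

    // A simplified view of what the server knows about one query.
    record QueryInfo(String keyspace, String table, String partitionKey,
                     long partitionSizeBytes, int tombstonesScanned,
                     String consistencyLevel) {}

    record Violation(int policyId, String policyName, QueryInfo query,
                     String detail) {}

    static final long LARGE_PARTITION_BYTES = 100L * 1024 * 1024;  // assumed
    static final int TOMBSTONE_WARN = 1000;                        // assumed

    static List<Violation> inspect(QueryInfo q) {
        List<Violation> out = new ArrayList<>();
        if (q.partitionSizeBytes() > LARGE_PARTITION_BYTES)
            out.add(new Violation(0, "LARGE_PARTITION", q,
                    "size=" + q.partitionSizeBytes() + " bytes"));
        if (q.tombstonesScanned() > TOMBSTONE_WARN)
            out.add(new Violation(1, "TOMBSTONE_SCAN", q,
                    "tombstones=" + q.tombstonesScanned()));
        if ("ONE".equals(q.consistencyLevel())
                || "LOCAL_ONE".equals(q.consistencyLevel()))
            out.add(new Violation(2, "WEAK_CONSISTENCY", q,
                    "cl=" + q.consistencyLevel()));
        // Violations are then emitted to logs/metrics and fed to the issue
        // generator that files tickets against the owning team.
        return out;
    }
}
```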
Next, some of the improvements and fixes we have done. As Gopal mentioned, the scale is pretty high, so from time to time we encounter bugs or corner-case issues as we run Cassandra.

The first one is token-ownership split brain. In Cassandra there is the storage service's view of the ring and the gossip cache, and in the worst case these two can go out of sync. To fix that, you would have to restart the node, but at scale that does not work. So we introduced a mechanism in Cassandra's gossip thread itself that compares the two states and fixes them on demand, so no node restart is required. There is a ticket for it and we have submitted a patch; go take a look.

The next category is node replacement. We replace nodes pretty frequently, thousands of node replacements on a weekly basis, and we encountered several issues in Cassandra while doing them. For example, when a node is bootstrapping and the bootstrap step fails, there was no way in Cassandra to know what was going on, so our control plane would just go into an infinite loop, and manual intervention was required. Another issue we saw was orphaned hint files: when a node is evacuated, the peer nodes still keep hint files for it, so you end up transferring orphaned hint files unnecessarily, and over time they accumulate into a big chunk of data that slows down node replacement. All four tickets in the node replacement category have been merged into open source Cassandra, so you may already be getting some benefit from them.

There is also a corner-case traffic redirection bug we discovered when upgrading from Cassandra 3.0.14 to 3.0.27. This is still under discussion with the community, and we have submitted a patch; if you are proactively planning to upgrade from 3.0.14 to 3.0.27, go take a look at CASSANDRA-17248.

The last part is immutable Cassandra, which is currently work in progress at Uber. We have certain use cases that write data once and never modify it, and the write path is not through the regular CQL port; it is mainly analytical workloads through Spark, so it goes via the storage port. There are two things we are solving here: first, we need a versioning mechanism for our data in Cassandra, and second, we want to reduce resource usage on the Cassandra side. What we currently do as part of this feature is create a new table on, say, a daily basis and dump the data into it: today you create a table suffixed for December 12, tomorrow another one for December 13, and so on. Of course, you have to expire the old tables, and that automation is built in. Once you write today's data to the December 12 table, you never have to compact it, because it is only written once, so we save a lot of resources. All of this is currently under development and rollout, so we will provide more updates next time we meet.

The other idea is to have revisions, so you can control how many revisions of the data you keep. One way to control this today is through TTL: you keep a TTL and keep appending data. But if your analytical workload sometimes cannot upload the data, the TTL will expire your previous version anyway, so you cannot maintain an exact number of versions that way. That is one of the things currently in progress.
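A minimal sketch of the dated-table rotation idea, assuming a hypothetical features_YYYYMMDD naming scheme and driving everything through a driver session; the real automation is not described in detail in the talk:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

import com.datastax.driver.core.Session;

public final class ImmutableTableRotator {  // hypothetical name

    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Create today's write-once table, e.g. ml.features_20231212. */
    static String createTodayTable(Session session, LocalDate today) {
        String table = "features_" + today.format(FMT);
        session.execute("CREATE TABLE IF NOT EXISTS ml." + table + " ("
                + " entity_id text, feature_name text, feature_value double,"
                + " PRIMARY KEY (entity_id, feature_name))");
        return table;
    }

    /** Keep the newest `revisions` dated tables; drop anything older. */
    static void expireOld(Session session, LocalDate today, int revisions) {
        // Dropping whole tables gives exact version control, unlike TTL,
        // which can silently expire the previous version if a daily upload
        // is missed.
        LocalDate cutoff = today.minusDays(revisions);
        for (int i = 0; i < 30; i++) {  // bounded sweep over older days
            LocalDate day = cutoff.minusDays(i);
            session.execute("DROP TABLE IF EXISTS ml.features_"
                    + day.format(FMT));
        }
    }
}
```

Because each dated table is written exactly once, compaction on it can be skipped entirely, which is where the resource savings mentioned above come from.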
Okay, so with all these things there are a lot of challenges as well as we operate Cassandra. Anti-patterns are definitely near the top of the list, specifically large partitions, because when we have a variety of workloads you can easily end up with large partitions on Uber-style keys. One example: a larger city will have more rides than a smaller city, so if you partition by city ID, you can easily get a large partition for a bigger city versus a smaller one. As I mentioned previously, we have introduced the framework that at least proactively notifies our stakeholders about such cases, which has helped us a lot, but there is still a lot of work to do in the anti-pattern world.

The second challenge we have been facing is mostly around efficiency and managing the number of Cassandra clusters. We do get a lot of tiny use cases coming to Cassandra: "I just want to use Cassandra for maybe 100 QPS." For that, you probably do not have to create a big, physically isolated Cassandra cluster; you could have one cluster and host multiple such tiny use cases on it. But the major problem is that Cassandra does not provide isolation all the way through all the layers, so for now, even for those tiny use cases, we always create a physically isolated cluster to avoid noisy-neighbor issues.

Third is materialized views (MV). It was a blessed feature back around 2016-17, and we onboarded some use cases onto it, but it was later marked experimental. We still have those use cases lingering, and it's not easy to retire them, but we are working on some of these items.

In this section, we'll talk about what we are doing next. We are currently building a general-purpose rate limiter in Cassandra. The idea is this: you may have seen that in a Cassandra cluster, two or three nodes become hot, showing really high read or write latency, and then there is a ripple effect where other nodes that rely on those nodes also slow down, until the entire cluster becomes inoperable. The idea is to automatically figure out whether a node has saturated: we probe the CPU, Cassandra's internal queues, and other information, and once we get that signal, we proactively shed traffic. This is currently code-complete, and we are rolling it out to production. Once we have it working successfully at Uber, as Gopal mentioned, we always either keep changes in our public fork or merge them back to open source, so we'll create a ticket and provide a patch; feel free to take a look at it.
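A toy sketch of that kind of saturation-based shedding: sample a few load signals and, above a threshold, start rejecting a growing fraction of incoming requests. The SaturationShedder name, the signal sources, and the thresholds are invented for illustration; the talk only says CPU and internal queue state are probed.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

public final class SaturationShedder {  // hypothetical name

    private final DoubleSupplier cpuUtil;    // 0.0 - 1.0
    private final DoubleSupplier queueFill;  // 0.0 - 1.0, internal queues
    private final double threshold;          // e.g. 0.85

    SaturationShedder(DoubleSupplier cpuUtil, DoubleSupplier queueFill,
                      double threshold) {
        this.cpuUtil = cpuUtil;
        this.queueFill = queueFill;
        this.threshold = threshold;
    }

    /** Returns true if this request should be rejected (shed). */
    boolean shouldShed() {
        double saturation = Math.max(cpuUtil.getAsDouble(),
                                     queueFill.getAsDouble());
        if (saturation <= threshold) return false;
        // Shed probabilistically, ramping from 0% at the threshold to 100%
        // at full saturation, so load drops before a hot node ripples
        // through the rest of the cluster.
        double shedFraction = (saturation - threshold) / (1.0 - threshold);
        return ThreadLocalRandom.current().nextDouble() < shedFraction;
    }
}
```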
Another item we are currently working on is context propagation. When customers use Cassandra, we are missing the context on the server side: the application context is available all the way down the stack to the application, but from the application to Cassandra there is no context propagation. Here we are talking about reads and writes; for every single read and write query, we want to propagate the context. There is a custom payload feature available in the CQL protocol spec that we are trying to leverage. This is also an item we will be working on in 2024, along with switching from full repair to incremental repair and upgrading to 4.1. That's pretty much it. Thank you everyone for coming and supporting us. We can take some questions.

Currently we do talk to the community from time to time, but some of the items we mentioned here are mostly driven within Uber only. That said, we definitely take the community's help in reviewing the code.

So you mean, what mission-critical use cases are powered by Cassandra? We have a wide variety of use cases at Uber. If you have taken a trip or ordered a meal: when you look at ETAs, when you look at fraud detection (legit user or not), when you open the app and see the feed categorized for you, and the machine learning models behind all of that, a lot of these use cases are powered through Cassandra.

So far we have mostly just been logging violations and filing them with our customers, but now we are slowly changing course and taking action on some of them. For example, we generally prefer LOCAL_QUORUM-type consistency levels, so if somebody specifies a consistency level of ONE, we automatically switch it to LOCAL_ONE, and so on; we are trying to enforce as much as possible. Similarly, if somebody specifies a compaction strategy, say time-window compaction where it doesn't fit, we may not honor it and instead convert it behind the scenes to leveled compaction. So slowly we are taking action on the server side, because that is the best approach: you don't want to rely on hundreds of stakeholders to take action. But there are some cases, like large partitions, that are data-modeling issues we cannot fix on the server side, so there we have to rely on the teams.

To take the first question, what prompted us to build our own orchestration: Uber is not using Cassandra alone; Uber has its own storage technologies as well. On the stateful side we have a variety of storage technologies, and all of them need some kind of orchestration layer to manage them. That stateful orchestration layer is Uber's own engine, developed by Uber itself, and any new storage technology that is adopted has to fit into it. That's the motivation: it keeps every interface uniform, and we are able to scale by putting multiple technologies into the same orchestration framework.

The second question was how many people are on the Cassandra team. Everything is a separate team: for example, orchestration is a separate team, not owned by us; we are just its users. I don't know whether we can share exact numbers, but for the Cassandra team itself, it is probably fewer than ten people managing the whole of Cassandra at Uber. Of course, we leverage other teams, for example metrics from our M3 team, but core Cassandra, the fleet, and all the development around it sit with that team.

Correct, so the question was about the anti-pattern detection framework: what it looks like in terms of the anti-patterns, and second, whether it is available in the open. To answer the first question, at a high level, each policy ID corresponds to one of the anti-patterns. Say policy 0 is your large-partition policy and policy 1 is slow reads or tombstones; then the log entry you see for a violating query will have enough detail: the keyspace name is this, the table name is that, the partition key is this, and the violation is, say, that the size of that partition is 110 MB.
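For illustration only, a structured record of that shape might look like the following; the field names and JSON-like format are hypothetical, since the talk only confirms that the policy, keyspace, table, partition key, and violation details are included:

```
{
  "policyId": 0,
  "policy": "LARGE_PARTITION",
  "keyspace": "rides",
  "table": "trips_by_city",
  "partitionKey": "city_id=42",
  "violation": "partition size ~110 MB"
}
```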
It is that granular, and it is configurable: it can be emitted to a log file, to a Cassandra table itself, or you can even customize it and emit it to your Kafka topic or wherever you want. Along with that, there are also metrics, which tell you at a high level that for policy 0, say large partitions, this many violations have happened. That is the first part. The second part of your question: it has at least not been merged into open source Cassandra yet, but we have our own published fork where this code is available.

There are always anti-patterns; depending on the type, if you are dealing with large partitions or tombstones, it is a difficult problem to solve, versus just an incorrect consistency level or compaction strategy supplied when creating a new table. What we generally see first is actually a positive response, where at least everybody knows what is going on, as opposed to the cluster degrading and us only being able to say "we feel there is a large partition" with nothing concrete. Now there is concrete data available to us as well as to our stakeholders, so everybody is on the same page, which is a very important thing when any degradation happens at scale. Second, many stakeholders have proactively worked on removing some of these anti-patterns. Again, it requires dedication and effort, because some of this can take weeks to months, and to be honest, we have not solved all the anti-patterns, because it takes time; but it is slowly moving in the right direction and getting traction. It also depends on the teams and how they assess the risk of an anti-pattern: the ticket is filed to them, and it is the customer team that decides whether they are willing to continue with the risk or need to take immediate action. We as the Cassandra team cannot fix it for them; we are available for consultation, but if a change is required on their side, they need to assess the risk and prioritize it. Among the low-hanging fruit, what we have seen is that sometimes people don't even need the data, so for them it is easy: "let me just drop this table, because it is unnecessarily creating pain."

I think we are out of time, but we can take more questions offline. Thank you, thanks everybody.