Welcome. I think this is a potentially common theme, but obviously the slides have changed since March. I'm Nathan Jacobs, a principal developer at OSI, and we run big clusters. I've been with OSI for eight years using Cassandra in our historian product. We work on operational technology solutions: utilities (electric, gas, water), transportation networks, manufacturing, anything that has a control system. That's our core business, and Cassandra is an excellent time series database. We've been using it as part of our historian product for the last 10 years, and I think it's been pretty successful. We've shipped hundreds of clusters for customers, centuries of data, petabytes of data, all kinds of deployments, and it works.

So what's the customer use case? Why would you want Cassandra as the backing database for a historian? Its strength with time series data is common knowledge in the Cassandra community. For most of our customers it's a 90% write, 10% read kind of use case. From a regulatory perspective, electric utilities in the US have to keep a kind of black box recorder of what's happened, so that if there is an incident they're able to go back, analyze and learn from that failure in the grid, and change procedures to increase resiliency. We do have some cases that are the exact opposite: some customers use the historical data from distributed solar and wind generation to train models that predict how much power they'll generate over the next days, weeks, and months, so they can plan the firmer gas and coal generation to match what the renewables will provide, effectively helping to make the grid greener.

And it's highly available and resilient. Think about the CAP theorem: Cassandra is available and partition tolerant, so in the worst case you can drop down to LOCAL_ONE consistency and everything still works. Unlike the legacy RDBMSes, it's not blocking on a lot of operations, so everything scales out linearly with additional compute and disk resources. That's important because operators of the grid are relying on this data to give them context on what's happening right now. It has to work; it's not really something where we can say we have to patch this, or something happened and it's not available. I think most people who have worked with databases for a long time have seen a database become unavailable because of some back-end issue. I've seen replicas that fall too far behind the leader and have to be reinitialized, or Postgres instances where a subscription disappeared on their peer, and they say they can't start because they can't create the subscription for replication. Cassandra doesn't have that issue; it just starts and runs and works.

It's also very cost and hardware efficient. We'll look at some other time-series-focused databases, and in nearly every single case Cassandra uses about 50% less disk space than the others, or somewhere around that. And because there's less locking it uses compute more efficiently, so we're able to run with fewer resources than we would need in another database.
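As an aside on that consistency fallback: here's a minimal sketch of what dropping to LOCAL_ONE might look like with the DataStax Python driver. The contact point, keyspace, table, and fall-back-on-Unavailable policy are all illustrative assumptions, not our actual client code.

```python
# Hedged sketch: prefer LOCAL_QUORUM, fall back to LOCAL_ONE when not
# enough replicas are up. Addresses and schema are hypothetical.
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("historian")

def read_points(device_id):
    query = "SELECT ts, value FROM points WHERE device_id = %s"
    try:
        stmt = SimpleStatement(
            query, consistency_level=ConsistencyLevel.LOCAL_QUORUM)
        return session.execute(stmt, [device_id])
    except Unavailable:
        # Worst case: LOCAL_ONE keeps reads working while nodes are
        # down or the cluster is partitioned.
        stmt = SimpleStatement(
            query, consistency_level=ConsistencyLevel.LOCAL_ONE)
        return session.execute(stmt, [device_id])
```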
Compared with a traditional RDBMS, we can scale to orders of magnitude more data than you would ever be able to run in one of those databases. And then, why does the business care that we're using Cassandra? What are the benefits to us that don't necessarily impact the customer? It's incredibly fast ingest, so we're able to run extremely dense instances and to convert data from a legacy historian into our solution very quickly. A common thing everyone's heard is that Cassandra can ingest data as fast as the disk can write it; your compactions will build up, but it will accept the data that quickly. It's incredibly scalable, and we effectively don't change any schema, GC tuning, anything. We run basically the same exact configuration on clusters that store anywhere from months to multiple decades of data, from one or two instances to dozens, and on all kinds of hardware deployments. Because our customers are utilities, it has to work; we can't necessarily rely on an ISP and a connection to the cloud, so probably 90% of our deployments are on local, on-prem physical hardware, although we are seeing a lot more virtualization and hyperconverged solutions being demanded.

So what's the current understanding of our limitations? About 10 years ago, the typical recommended instance size was somewhere between 250 and 500 gigabytes. The current recommendation is one terabyte, and if you're using size-tiered compaction strategy, you need double the space to handle major compactions. We're not the only ones running bigger than this; obviously DataStax has a lot of expertise, but this is how we approach the problem.

The limiting factors on how big an instance can be are your query throughput, how much risk you're willing to take on replacement times, and what your ingestion and repair frequency are. Compute requirements scale linearly with query demand, and you can only get a server with so much CPU. There are four-socket servers out there, but at some point you are going to have to scale out horizontally; that is an upper bound. As for SSTable count: the more SSTables you have, the more time you're going to spend looking for your data, which increases your latency, and increasing your latency means spending more and more time handling those queries. So you want to keep that SSTable count down. But to keep it down, you need to run more compactions, and that also adds CPU and IO load to the system. And whether you're using TTLs or deletes, you're going to have some load from deleting old data, because disk doesn't grow forever; everyone has some kind of retention enforcement policy in their system.

Replacement times: instance size is a factor here because it relates to how fast you can copy data. The bigger your instances, the fewer instances you have available to stream data to the replacing instance. And if the replacement takes too long, you're introducing sustained load on the other servers in your cluster, and if a second server fails during that window you're introducing even more load, which could eventually cascade to where the service becomes unavailable to the user.
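To make that replacement-time concern concrete, here's a back-of-envelope estimate. Every number in it is an illustrative assumption, not a figure from the talk:

```python
# Back-of-envelope: how long does rebootstrapping a dense instance take?
# All numbers are illustrative assumptions.
instance_size_tb = 16        # data that lived on the failed instance
stream_mb_per_s = 40         # sustained streaming rate per source node
source_nodes = 4             # surviving replicas that can stream to it

total_mb = instance_size_tb * 1024 * 1024
hours = total_mb / (stream_mb_per_s * source_nodes) / 3600
print(f"~{hours:.0f} hours exposed to a second failure")  # ~29 hours
```

The point is the shape of the math: instance size sits in the numerator, so doubling density doubles the window in which a second failure hurts you.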
And repair consistency: you eventually have to have your data repaired. The more data you ingest and the more data on the instance, the longer your repairs are going to take, and eventually you might not be able to catch up if there's a hardware failure that prevents you from running the repair, since repairs require all the nodes to be up and running at the same time.

So our first approach to this is to promote some of what would normally be partition keys up to the keyspace level. It's a trick that carries over from the legacy RDBMS world, where you'd create a new database each month in your SQL server and then drop it at the end of its retention. This is like time window compaction strategy, but at the keyspace level, and it has a number of benefits for how our clusters are run.

The first thing is there are no TTLs, which means we're very flexible in how long the data is stored in the system. A lot of customers are like: I paid for this hardware, I want to use this hardware, don't delete anything until I tell you. If we had a TTL, we couldn't give them any flexibility on that. Instead we can say either you're ingesting data faster than you told us you would, so we're going to clean up sooner, or your data is coming in slower than you thought and you over-provisioned the cluster, so we'll delay until the last possible moment to drop the oldest keyspace. So where most clusters run around the typical 50% disk utilization of size-tiered compaction strategy, ours run around 95 to 98%. With time window compaction strategy you can normally get a lot closer than 50%, but because it's driven by a TTL you can't precisely predict when the disk is going to fill up, so you might have to scale the cluster out or in to drive that cost saving.

The other benefit is that partitioning by keyspace is based on the data in the table. Time window compaction strategy windows the data by write time, not by the data you're looking for. For example, we have a cluster with a 20-year retention period. If that were all in one table with time window compaction strategy, we'd be checking every single SSTable, at least at the bloom filter level, for whether it might hold data we need for a query, even if the query only covers the last week's worth of data. With monthly keyspaces, we reduce that from 240 months' worth of SSTables down to just the most recent month's worth, maybe 6, 8, or 10 SSTables, at least while the data is still being archived. That dramatically reduces how many SSTables you have to consider in a query compared to time window compaction strategy. You don't get the major compaction at the end of the window like you do with time window compaction strategy, but that's as simple as setting up a cron job to run a day or a week after the month ends, and you get a good speed-up. We had one cluster where a customer was doing a lot of reporting, on a physical server with hard drives for disks, and they saw a four-times increase in throughput with the monthly keyspaces after running major compactions on each month that had finished archival.
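Here's a sketch of what that keyspace-per-month retention might look like in Python. The naming scheme (hist_YYYY_MM), the 95% threshold, the data path, and the replication settings are all assumptions for illustration; our actual tooling is internal.

```python
# Sketch of keyspace-per-month retention: create the next month ahead
# of time, and drop the oldest month only when disk actually gets
# tight. Names, paths, and the threshold are assumptions.
import shutil
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

def create_month(year, month):
    session.execute(
        f"CREATE KEYSPACE IF NOT EXISTS hist_{year:04d}_{month:02d} "
        "WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}"
    )

def enforce_retention(data_dir="/var/lib/cassandra/data", threshold=0.95):
    usage = shutil.disk_usage(data_dir)
    if usage.used / usage.total < threshold:
        return  # still under the limit: keep everything
    months = sorted(
        row.keyspace_name
        for row in session.execute(
            "SELECT keyspace_name FROM system_schema.keyspaces")
        if row.keyspace_name.startswith("hist_"))
    if months:
        # Dropping the whole keyspace removes the oldest month without
        # writing tombstones or waiting on TTL expiry.
        session.execute(f"DROP KEYSPACE {months[0]}")
```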
And then, what about replacing instances? Our first approach, whenever possible, is to run with the replication factor equal to the number of instances. In that case we can basically clone: if an instance fails, we just blow it away, clone an existing server, and that's the replacement. We start by copying over the system keyspaces and flipping one row between system.local and system.peers, and you have exactly what was supposed to be there in the first place. The other nodes in the cluster haven't been told that instance died, they just see it as down, so you've effectively bootstrapped it without copying any of the client application data, and that's a lot faster than having to stream all of that data over. Then, once the system keyspace is recreated, we snapshot one server and copy all of that snapshot data back to the server we're replacing. Because our replication factor equals the node count, in most cases we've effectively brought back all the data, and since the repaired-at timestamp is in the SSTables, we don't lose any tracking of what data needs to be considered for repair to finish making everything consistent in the cluster. We've also used this as a manual repair process. We don't use SANs, but we did have a case where a customer was on a SAN and we had to repair manually with this kind of approach, because streaming wasn't able to keep up and compactions were building up too quickly. It's also a way you can push out the deadline on replacing hardware if you absolutely need to.

So with that said, what are the reasons we scale horizontally? I'm not recommending everyone scale vertically like this; we have some constraints in how we have to approach our systems that let us get away with it. The two main things that drive us to scale horizontally are the repair throughput per instance and how much storage we need in the system.

Repair is a compaction-like process. You generate your Merkle trees to figure out what data has to be copied; you stream the data, which at least as of 3.x is a compaction process (that might have changed in 4; we're only evaluating 4.0.13 right now); and then after streaming you run an anti-compaction to mark the data as repaired. On a server CPU you get about 10 megabytes per second of throughput, so having to do that three times leaves roughly 3 megabytes per second of repair throughput per instance. We prefer repairs to run during off-peak hours, and we prefer them to last only about an hour so that we have plenty of time to catch up in the case of a hardware failure. At that rate, each instance can ingest roughly 300 to 350 gigabytes per month before we start to get uncomfortable with how quickly data is coming into the system. When we have to ingest more than that, we scale horizontally. For the example on the title slide, the 80 terabytes per server, that's five 16-terabyte instances per server as a result of that ingestion rate.

The other case is a very long retention period, where the RAID arrays just get too big for us to be comfortable with the rebuild time, or with how likely a rebuild is. In that case we scale out horizontally so the data is spread across smaller RAID arrays on those hardware servers.
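That 300-to-350-gigabyte ceiling falls straight out of the repair numbers above; spelled out, assuming one one-hour session per night (the cadence that makes the figures line up):

```python
# The repair budget, spelled out with the talk's rough figures:
# ~10 MB/s per pass, three passes (Merkle trees, streaming,
# anti-compaction), one off-peak hour per night.
per_pass_mb_s = 10
passes = 3
effective_mb_s = per_pass_mb_s / passes               # ~3.3 MB/s

session_seconds = 3600                                # one hour per night
per_day_gb = effective_mb_s * session_seconds / 1024  # ~11.7 GB/day
per_month_gb = per_day_gb * 30                        # ~350 GB/month
print(f"~{per_month_gb:.0f} GB/month of repairable ingest per instance")
```

Past roughly that ingest rate, the nightly hour is no longer enough to keep repairs caught up, which is the point where adding instances beats adding disk.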
And a cost optimization we're able to make when we do scale out horizontally is to use very dense servers. One of the clusters we've deployed uses the server on the lower left here, a Dell R740xd2; the other one is an HPE Apollo 4200 Gen10, I think. Both of these have dual backplanes. With Cassandra it's all about compute and it's all about storage, and you can get extremely dense configurations with these. I think the 80-terabyte example is only using one backplane, so we could theoretically go to 160 terabytes with ten 16-terabyte instances on a server.

This gives customers a lot of cost savings. The first customer we had to do this for did not have the rack space to use commodity 1U servers; it was not an option for them to add two racks' worth of servers to their system, and this is what we had to come up with to serve that use case. It gives you savings in rack units: less cooling, less networking, less physical space, less administration, and compute typically becomes more cost effective as it gets into larger core counts. And then there's the chassis metal itself, how many power supplies you need, et cetera. These particular chassis are a little more expensive per chassis, but when you consider the cost of the five other 1U or 2U chassis they replace, it's cheaper at the chassis level as well.

The other benefit is better handling of hotspots. Everyone has their own standard for what they run at, but consider, as an example, a case where 50% CPU utilization is your baseline goal per server, and a hotspot triples demand. You don't have 150% CPU; you can't serve all of that, and you become throttled. But if you had a server big enough to hold five instances and demand tripled even on two of those instances, you'd still have 10% of the server's CPU available, and you've served that hotspot for your customers. (Two instances at three times baseline plus three at baseline is 450% out of the 500% a five-instance server has.) And just to be clear, this is not virtualization I'm talking about; this is multiple instances of Cassandra on a single OS. With virtualization, if you over-provision you waste CPU, and if you under-provision you don't have enough CPU for the hotspot. This won't fit everyone, but if you're overpaying on burst throughput or compute, and a hotspot only ever lands on one server at a time, it's worth considering whether you can say "I'll handle one hotspot per five servers" and make a saving on your total infrastructure cost.

All right, but now we're talking about having more instances than the replication factor in the cluster. How do we handle instance replacement in these more horizontally scaled clusters? It's a similar process. We still recreate the system keyspaces by copying and swapping the system.local and system.peers tables, but there's basically an extra step: we expand the search for data and then reduce what we found. We recreate the system keyspaces, snapshot a number of instances in the same site, and then split the SSTables by partition so that we can address each piece of data independently, rather than at an instance or cluster level. Splitting by partition is extremely fast because you're just seeking to where the partition starts and where it ends and copying that range to another location, so it effectively runs at disk speed. Then you're able to open the SSTable for an individual partition, figure out what its partition key is, and use nodetool getendpoints to ask: is this a piece of data I care about? If it is, you copy it over the network, again at disk and network speed; if not, you delete it. Then we use nodetool refresh, or I think the new thing is nodetool import, to load the data. So this again runs at effectively disk and network speed and is very fast for us. I think this is something many existing clusters could use to replace a failed instance more quickly and reduce the impact on the applications using the cluster.
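A sketch of that ownership filter, since it's the unusual step: the split tool itself is our own, so assume it leaves one SSTable per partition with the partition key recoverable from the file name. The naming convention, paths, and address lookup here are all hypothetical.

```python
# Hedged sketch: keep only the split SSTables whose partitions this
# node owns, according to nodetool getendpoints. File naming is an
# assumed convention of the (in-house) split-by-partition step.
import pathlib
import socket
import subprocess

# Assumes this resolves to the address Cassandra advertises for the node.
MY_ADDRESS = socket.gethostbyname(socket.gethostname())

def owned_here(keyspace, table, partition_key):
    # nodetool getendpoints prints one replica address per line.
    out = subprocess.run(
        ["nodetool", "getendpoints", keyspace, table, partition_key],
        capture_output=True, text=True, check=True).stdout
    return MY_ADDRESS in out.split()

def filter_split_dir(split_dir, keyspace, table):
    for data_file in pathlib.Path(split_dir).glob("*-Data.db"):
        key = data_file.stem.split("__")[-1]  # assumed naming convention
        if not owned_here(keyspace, table, key):
            data_file.unlink()                # not ours: drop it
    # What survives gets loaded with `nodetool refresh <ks> <table>`
    # (or `nodetool import` on newer versions).
```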
And then the other thing we do is more of a business and application side thing: if we can only ingest 300 to 350 gigabytes of data per month per instance, how are we getting to all these terabytes and terabytes of data? Basically, we do the same kind of node replacement strategy, where we copy the data, split it by partition, check the endpoints, and refresh it, but we're importing the data into Cassandra through parallel clusters. There was one case six or eight years ago where a project deadline moved up and we had to import decades of data in a very short period of time, so we actually spun up Cassandra clusters on 20 different laptops to run imports in parallel and copy the data that much more quickly. In that case the bottleneck was the application driver: the competitor's historian could only be accessed through a single-threaded driver, so we had to run more conversion applications, and it was like, well, we're already doing this, we might as well use Cassandra instances, convert one month at a time, and copy that into the new cluster. That was the easiest way for us to scale it.

So yeah, thanks for having me. Are there any questions?

[Audience question]

Right, so a big part of that depends on the activity on each of those keyspaces. If you're actively writing to those hundreds or thousands of tables, that is going to have an impact, because you're creating memtables for each table you're ingesting data into, and that puts pressure on the JVM heap, on disk, on compactions, and everything else. For us, because we're able to convert the data and we only load into the current month, we're really only doing writes on one or two keyspaces at a time. And memtables, once they're flushed to disk, aren't reallocated on the JVM heap until there's another insert to that table. In our longest-retention cluster we have probably 250 keyspaces with upwards of 10 tables per keyspace. That's quite a lot of tables, but because inserts really only touch the most recent keyspace, it doesn't put a lot of pressure on the JVM heap. On the query side, the older data is stagnant on disk, so you're not really doing anything with it and not using any compute. So we use pretty much the standard configuration: G1 GC, a 16-gigabyte heap, and a couple of cores per instance.

[Audience question]

Right, and it's all on the same RAID array, so we don't have hot and cold storage; that's also not supported in Cassandra yet. But from a heap perspective, yes. The SSTables are opened on initialization of the instance, but a lot of the other state just sits idle.

[Audience question]

Yeah, so because we're splitting it up by time, like time window compaction strategy does, that puts a cap on how much compaction will actually happen on that data. We use size-tiered compaction strategy as the default configuration and then just run a manual compaction at the end of the month, and that basically limits it as much as you can. How many SSTables are considered in that read path also gives you a good saving on latency.
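That end-of-month major compaction is the cron job mentioned earlier; a minimal sketch, again assuming the hypothetical hist_YYYY_MM naming:

```python
# Sketch of the end-of-month major compaction: run a day or a week
# after a month closes (e.g. from cron). Keyspace naming is assumed.
import datetime
import subprocess

def compact_last_month(today=None):
    today = today or datetime.date.today()
    last_month = today.replace(day=1) - datetime.timedelta(days=1)
    keyspace = f"hist_{last_month.year:04d}_{last_month.month:02d}"
    # One major compaction folds the finished month's SSTables together,
    # so reads against archived data touch as few files as possible.
    subprocess.run(["nodetool", "compact", keyspace], check=True)

compact_last_month()
```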
[Audience question]

Yes. I think in most clusters that would work. There are two things I think about. We have at least one customer, and there are probably others I can't think of, with uneven retention policies based on data center, so we actually need keyspaces to say this data center is RF 0 and that data center has the full replicas (there's a sketch of this at the very end). Operators don't need decades of context; they really only need a couple of months to a year or two. But corporate reporting and planning out improvements to the hardware in the field are deeper analytics, and you need a longer retention of the data to serve those queries. The other thing is, like I talked about, because we're partitioning at the keyspace level we're reducing the number of SSTables on the read path; time window compaction strategy partitions its windows on write time, whereas we partition on the actual data we're looking for in the query.

[Audience question]

I think in most cases, yes. You can of course set the write time as you insert data, but we have a number of cases where either a device in the field becomes disconnected from the control center, or there's an interruption to the Cassandra cluster, or we're consolidating systems, and we're ingesting data that's months in the past.

All right, let's see, how much time do we have before we call it? One minute, okay. Then I'll end with a hot take: there's a recommendation that you put your commit log and your data directory on different partitions or different volumes. I think it's unnecessary. If it's a hard drive, yes, absolutely put them on different partitions; but on SSD, one RAID array, no problem. All right, thanks for coming.
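For reference, the uneven-retention trick from the Q&A, as CQL through the Python driver. The data center and keyspace names are hypothetical; leaving a data center out of NetworkTopologyStrategy gives it zero replicas there.

```python
# Hedged sketch of per-data-center retention via keyspace replication.
# DC names and keyspace are hypothetical. The control-room DC is simply
# omitted, which means RF 0 there; operators read recent data from a
# separate, shorter-retention keyspace instead.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS hist_2004_01 WITH replication = "
    "{'class': 'NetworkTopologyStrategy', 'corporate_dc': 3}"
)
```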