My name is Colleen Velo. I'm standing alongside Jeff Beck, and we're going to be talking about SmartThings with Cassandra for the first nine years. It was eight years when this talk was originally envisioned, but due to the rescheduling it got changed up to nine. So, like I said, my name is Colleen Velo; I started working with Cassandra when I joined SmartThings in 2015, so I've been at SmartThings about eight years.

I'm Jeff Beck. I've been doing Cassandra for about a decade. I'm aging myself by putting my MVP award up there, but I'm still really, really proud of it. I'm really proud of it, too.

Okay, so for the agenda, what we want to do is set the landscape of where we started with Cassandra, talk about how we progressed through the various challenges, including the problems of scale you usually get, and then settle on where we are present day and the lessons that we learned through our journey.

SmartThings at a high level, in case people aren't familiar with it: it's basically an IoT platform. We were acquired by Samsung in 2014, and I think we have over 200K events processing through our system now. You forgot to say a time period. 200,000 events a second. Oh, thank you. Yeah, 200,000 events per second, not just 200,000 events total. Thank you for the correction.

So at a high level, walking through the timeline: like I said, Samsung acquired SmartThings in 2014. In 2015 you had a single application monolith, a single Cassandra ring, in a single region, in a single repo, the way most startups start. In 2016 we started expanding out into different AWS regions, which meant the rings also expanded out into those regions. We also set up our first global ring, so we had several region-level rings and then a global ring with multiple data centers.

Moving on to the middle of the timeline, that's when we started the One App launch. Basically, SmartThings had their own mobile app and Samsung had their own mobile app, and we wanted to merge them into one brand, so we did that, as well as starting to onboard Samsung TVs as OCF devices. After that, in the middle of the journey, we started re-datacentering so we could expand out, and we started doing more embedding of hubs. Fast forward to today: more Samsung appliances, which are coming through as devices, and a full microservice model with the monolith decommissioned.

So, starting at the beginning: like I said, the single monolith, single DC, single US region, monorepo. We expanded out into the two regions. We were using the Ec2MultiRegionSnitch for that, because at that point in time AWS did not have cross-region peering; that didn't come until 2017, so we were on public-facing networks.

Some challenges that we had: backups, restores, repairs. We were on DSE, DataStax Enterprise, at that time, with OpsCenter. The solution for backups and repairs that was being offered by DataStax didn't quite work for us at that time; we were on an older version of Cassandra, and that probably played into it. So we ended up with a home-grown solution, which also had some shortcomings: it had locking issues, and there was a lack of monitoring.
There was not a lot of visibility into job status, et cetera.

For monitoring and metrics, we were using DataStax OpsCenter, which, you know, I did like for the color-coded view: you could see if your cluster was running hot, or in the yellow, or in the green. But you were limited on the metrics that you could get from it. It also, as I recall, didn't have a strong RBAC model, so only we could look at it. Yeah, the RBAC model was okay, but we also didn't have super mature authorization inside the company, so we kind of solved for that by limiting the people that could touch it, because that's easy. Long story short, only the systems people could look at it; the devs often could not.

For configuration management, we were using home-grown deployment scripts written in Python.

On the application side, as we look through this, we were at a place where we were trying to scale Cassandra out to lots of the company all at once. So we took the effort to say: we're going to build libraries that have opinionated, best-practice settings baked into them. We switched over to doing a lot more in the open-source world, saying, look, let's build the connections deep into the framework, things like Ratpack and Dropwizard, and tie that into the execution models' native promise frameworks. That was all really important to our developers. What we were working on in the early years was really abstracting away as much of the detail of the inner workings of Cassandra as we could.

The TimeWindowCompactionStrategy jar is where this partnership between our operations and application sides became really interesting. We were facing the need to build and deploy a jar with all of our instances, because we wanted to use TimeWindowCompactionStrategy before it hit mainline. An absolutely amazing thing, but if you have a traditional relationship with operations, that's challenging. The app teams were like, oh, we write everything in Java, we can build our own pipeline to make sure we have quick builds, and they did all of that. So that was the kind of partnership that was really exciting in this era. We would build the jars, and you'd somehow make them show up on every server for me.
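To make that concrete: before TimeWindowCompactionStrategy shipped in mainline Cassandra, you dropped the standalone TWCS jar onto every node's classpath and then referenced the strategy by its fully qualified class name when altering a table. Here is a minimal sketch of what that looks like with the 3.x DataStax Java driver; the keyspace, table, and window settings are made up, and the class name is the one the standalone jar used (check it against the jar you actually deploy), not something from the SmartThings schema.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class EnableTwcs {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // The out-of-tree jar has to be on every node's classpath, which is why the
            // app teams ended up building a pipeline just to get it deployed everywhere.
            session.execute(
                "ALTER TABLE iot.device_events WITH compaction = {"
                + " 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy',"
                + " 'compaction_window_unit': 'HOURS',"
                + " 'compaction_window_size': '6' }");
        }
    }
}
```

Once TWCS landed upstream, the same statement just references TimeWindowCompactionStrategy by its short name and no extra jar is needed.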
So I didn't have to worry about the jar deployment myself. And then we continued to move on. So that was the beginning years.

In the mid years, what was happening with the architecture is that we still had the monolithic model, but people were quickly realizing it was not very sustainable or scalable, especially at the rate that Samsung wanted to expand. So we started the microservices rollout: if there was any new functionality needed, it was developed as a microservice and not bolted onto the monolith, and we started the very beginnings of chopping functionality off, migrating it out of the monolith and into microservices as well. The start of that activity was happening here, as well as the One App launch, which again meant more rings and much, much more scale. We also started expanding out into the China region, which again meant more applications and more Cassandra rings. And we started to encounter very significant problems of scale, highlighted by the Super Bowl, where we had a bug in one of the TV apps around volume up and down, which Jeff will talk about a little bit more, but basically it would flood our platform with events.

For operations on the Cassandra side, we did another iteration of the backup tooling. We went away from the cron job because it was somewhat unreliable, and we had the problems I mentioned before with locking and visibility. Somebody did another in-house, Groovy-based solution and wrote another backup tool for us, which did work well. We had a process in place where we tested it daily, mainly using it to populate our analytics Cassandra clusters, which fed into business Datadog graphs.

We also changed our monitoring and metrics. At this point in time we shifted away from DataStax Enterprise and went to open-source Apache Cassandra, mainly for financial reasons: we weren't really using the DataStax Enterprise features, and Apache Cassandra satisfied our needs and was very cheap, as in free.

With the backups and restores, the solution was a lot better; it was a lot more robust and reliable. But we still had some sharp edges around it. Single point of failure: the person who wrote it was the only person who really knew it in depth. Ironically enough, it also ended up being a monolithic code base, because it did more than just backups and restores; it was kind of the developer's personal tool base, and we used it for a number of Cassandra-related activities. And there still was a lack of monitoring and visibility into whether it ran correctly or failed, et cetera.

Do you have a sense of scale at this point? It's one global ring across three regions, and then regional rings on the order of six or seven, with China being special because China is isolated. Just to give a picture of what we're talking about.

When we moved from DataStax Enterprise to Apache Cassandra, we had to change up our monitoring; we lost the nice GUI graphs. Since the rest of the enterprise was using Datadog, we kind of doubled down on our Datadog graphs. We already had a couple of them, but they were fairly thin. The nice thing with that is we ended up with a lot more robust graphs, monitoring a lot more metrics in a lot more detail.
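The metrics behind those graphs are the ones Cassandra already exposes over JMX; as mentioned at the end of the talk, almost all of the table-level JMX metrics were exported into Datadog. As a rough illustration of what those MBeans look like, here is a minimal sketch that reads one table-level metric directly over JMX; the host, keyspace, and table names are placeholders, not the actual clusters.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReadTableMetric {
    public static void main(String[] args) throws Exception {
        // Cassandra publishes per-table metrics over JMX (port 7199 by default);
        // monitoring agents such as Datadog's scrape these same MBeans.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            // Placeholder keyspace/table; the object name follows the
            // org.apache.cassandra.metrics:type=Table naming used by Cassandra 3.x+.
            ObjectName readLatency = new ObjectName(
                "org.apache.cassandra.metrics:type=Table,keyspace=iot,scope=device_events,name=ReadLatency");
            Object count = mbeans.getAttribute(readLatency, "Count");
            System.out.println("Reads against iot.device_events: " + count);
        }
    }
}
```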
Following along with that move to Datadog, we could put in a lot better monitoring and a lot more early-warning systems, to know about any issues or trends before the developers do. The other nice thing is that it gave us a single pane of glass: the developers could take graphs off of ours and embed them in their own Datadog dashboards, so there was a lot more exchange of information with the developers as well.

At this point in time we also started splitting, not Cassandra yet, but some of the other apps off into their own repos. One of the things about expanding into the China region is that it highlighted the limitations of trying to use a monorepo for all apps, all infrastructure, et cetera. And of course, as Jeff has stated already, that meant more Cassandra clusters as well. At this point we also stopped using the Ec2MultiRegionSnitch as the default snitch for global rings and started using the GossipingPropertyFileSnitch. Sorry, I always mess it up. No one can say it correctly, it's fine. GPFS. Thank you.

The advantage there, which I'll cover in later slides as well, is that the Ec2MultiRegionSnitch is pretty much hardwired to derive the DC name and the rack name, and it also requires you to be on a public interface; that's actually hardwired into the Java class. And like I said, cross-region peering wasn't available back then. At this point it was available, and we wanted to start bringing our rings inside and having them run on private networks.

One other thing that we did at this time is we migrated from direct-attached storage to EBS volumes. There was a lot of discussion around this, and probably a lot more discussion could be had, but the main reasons we wanted to do it, after a lot of performance testing to make sure we wouldn't take any performance hits: we were hitting the issue where some of our data was growing so fast that we needed the ability to get more disk space very quickly, without having to constantly roll clusters. The other thing is that AWS was starting, at this time, to offer more of the newer instance types with EBS backing versus direct-attached storage volumes, and that trend seems to be continuing, unfortunately. The nice thing is that being EBS-backed gives us the ability to roll the cluster, to do AMI rolls, very quickly. All of our clusters are on AWS; I probably should have said that at the beginning. We're entirely in the cloud, and nothing is on-prem. So if we have a zero-day exploit or need some security fixes, we can quickly spin up a new AMI, attach the EBS data volume back, and keep moving without having to re-stream all of the data. The one thing that you do have to watch if you use EBS is your IOPS and your throughput; make sure those are really tuned so you don't risk throttling on your Cassandra cluster.
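Going back to the snitch change for a second, here is roughly what that move looks like in configuration. The Ec2MultiRegionSnitch derives the data center from the AWS region and the rack from the availability zone, and it expects the nodes' public addresses; with the GossipingPropertyFileSnitch you name the data center and rack yourself on each node and can stay on private addressing. This is a hedged sketch with placeholder names, not the actual SmartThings cluster layout.

```
# cassandra.yaml (each node)
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties (each node)
dc=us-east
rack=use1-az1
```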
In the application, this is where we had lots of fun stuff going on in this era. That move to Datadog was incredibly helpful, because I, on the application developer side, started to pull more and more of those graphs deeper into my production monitoring, understanding, oh, this is a write-heavy or a read-heavy thing, and I can see how the cluster is constrained. The way we tended to divvy it up, compaction settings and stuff like that tended to be application concerns in our world, not so much operations'. Operations had to make sure all the compactions were working, and they would be very nice about me waking them up when they weren't working because I messed them up or something like that. But that's where that partnership was really evolving.

And then at this point we had to start hardcoding the driver protocol version. I don't know if anyone remembers, but there was a bug for a while around protocol version upgrade, where if one of the nodes during a release was upgraded, it would upgrade all future connections. It's a bug, it's well documented. So we went to the stance of saying, okay, we're going to set the protocol version in the settings on our application side. That's the thing we had to know to make sure, again, that we can do zero-downtime deploys.

I believe this is also the era where we ran into the interesting bug where, if you're using select star with prepared statements and then alter your table, the fields coming back from the select star are going to be out of order. Simple stuff, but again, this is in the application world; we weren't as deep into that. If anyone doesn't know exactly what I'm talking about: just don't use select star in production, just write out your fields, and it'll be fine and you can forget about it. That's the level at which we were trying to make those best practices. And I don't know how much code is out there that still has select star that I one hundred percent wrote myself, from those early years when we had that one thing and we weren't making that many changes. But then as we expand, that's exactly what starts happening, and that's where we see these earlier decisions start to come back and bite us. They worked great to scale early, and right now, in this era, things are working fine.
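To make those two application-side guardrails concrete, here is a minimal sketch against the 3.x DataStax Java driver: pin the native protocol version so a half-upgraded cluster can't renegotiate it on new connections mid-deploy, and name your columns instead of using select star so an ALTER TABLE can't change what a prepared statement returns. The contact point, keyspace, table, and column names are placeholders, not SmartThings code.

```java
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ProtocolVersion;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DriverGuardrails {
    public static void main(String[] args) {
        // Pin the native protocol version instead of letting the driver negotiate it,
        // so a rolling Cassandra upgrade can't flip new connections to a newer version.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withProtocolVersion(ProtocolVersion.V4)
                .build();
             Session session = cluster.connect("iot")) {

            // List columns explicitly; with SELECT * a later ALTER TABLE can leave a
            // previously prepared statement returning columns in an unexpected order.
            PreparedStatement ps = session.prepare(
                "SELECT device_id, event_time, payload FROM device_events WHERE device_id = ?");
            BoundStatement bound = ps.bind("device-123");
            for (Row row : session.execute(bound)) {
                System.out.println(row.getString("device_id") + " " + row.getTimestamp("event_time"));
            }
        }
    }
}
```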
So then we come to the era I'm going to call the re-datacentering era, because basically we went through and completely rolled all of our clusters. Going back to what Jeff just stated, when we initially set up our Cassandra clusters, for the most part we used a lot of the default values. Not in all cases, but in most cases, and speaking specifically here about num_tokens, the defaults really were not particularly good. With that, we wanted to change num_tokens, because, as people who work with Cassandra will know, if your num_tokens is too high, any streaming operation you do becomes that much more painful, whether it's repairs or bringing in new nodes, expanding, shrinking, what have you.

Some other challenges that got thrown our way: AWS decided it was going to decommission one of its availability zones, and when you're running Cassandra in the cloud, an availability zone is pretty much analogous to a rack. So basically they were going to decommission one of our racks, and we wanted to bring up a new rack in the new AZ without taking any performance hits or, obviously, any downtime. I've already alluded to changing cluster-level parameters: we wanted to change the endpoint snitch on our running clusters. And in some cases, for some of our rings that had a lot of high-density nodes, although not as high as some, we wanted to be able to resize those up fairly quickly without having to spend days and days re-streaming data.

I have a note for The Last Pickle article there; we followed a lot of the stuff in that. We also, unfortunately, suffered the loss of a data center. So being able to bring up a new data center, configure it the way you want, rebuild all of the data in it offline without any applications going to it, and then say, okay, we're ready, and push traffic over to it, bought us a lot of value.

At this time we also made another adjustment in our repair tooling. Before, we had an in-house solution, which was very poor and resulted in us not doing repairs anywhere near as regularly as we should. So we went with the open-source Reaper tool, originally developed by Spotify, rolled that out, and that has been working pretty well.

For us on the application side, thinking about all of the current data centers: all of that best practice that I had baked into the tooling in the early years was mostly wrong at this point, given our scale and what we were doing. So we had to go back through and change a lot of our drivers, and that's when we started on our wrappers to those libraries. It made sense when we were smaller and moving fast, and we deployed a lot of these things weekly or faster in many cases, but these have become really solid systems. Our auth system at the time is completely backed by Cassandra, and it runs through everything that you can think of for IoT, because auth is incredibly important. I can sidebar about auth with anyone that wants to talk about it; it's a fun hobby, weirdly, for me.

And then we also started to need to be deliberate about which data center apps point at, not just the one they're turning up in. We had the configuration in these libraries set up so that when you turn up an app, it knows, oh, I'm in EU, I'm going to connect to EU as my default data center in the driver, so that when you have a multi-region deployment it knows to send most of your requests there. And again, we want to reduce our spend on network transfer, so we're trying to align to the availability zones within the region and all of that. And this is where it turns out we over-indexed on making it easy, and we needed more control and flexibility, because we lost a data center, and we were doing migrations where we had two clusters in the same region deployed and we wanted to be able to be more specific about what we were pointing to and moving around. So that's mostly there.
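As a rough illustration of that local-data-center defaulting, here is a minimal sketch with the 3.x DataStax Java driver. In the real libraries the local DC was derived from where the app was turned up; the property name, contact point, and data center name below are hypothetical placeholders.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class LocalDcDefaulting {
    public static void main(String[] args) {
        // Hypothetical property; framework libraries would fill this in from the
        // deployment environment rather than having apps hardcode it.
        String localDc = System.getProperty("app.localDc", "eu-west");

        try (Cluster cluster = Cluster.builder()
                .addContactPoint("cassandra.internal.example.com")
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder()
                                .withLocalDc(localDc)
                                .build()))
                .build();
             Session session = cluster.connect()) {
            // Requests now prefer coordinators in the local data center, which keeps
            // latency down and cross-region transfer costs in check.
            System.out.println("Connected with local DC " + localDc);
        }
    }
}
```

The flip side, as described above, is that during data center migrations you want to be able to override that default explicitly rather than always trusting the automatic choice.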
I think for the rest of this era, maybe read the dates to understand why it took us a really long time to do this safely, because we were also learning how to do it in a very distributed way, as opposed to getting together in a room and doing it all at once. We got really good at doing massive data center migrations with little to no impact across running production, while we were also getting an entirely new set of operational concerns, because in this era we were also seeing massive shifts in how people use their smart home, since so many people had shifted to working at home. So operationally we were seeing massive shifts that were like, is this an anomaly? Is this the new normal? I don't know. "I need a nap" is usually my answer for that one.

Yeah, so fast-forwarding to present day: the monolith app has been sunset, and we're fully up on a microservice model. We have over 300-plus microservices. Something I forgot to mention in the re-datacentering years: Cassandra has also been moved so that each of the rings is in its own separate repo. Your blast radius is a lot smaller if you're making changes; you're only risking changes to that one ring at a time. We're also using Spinnaker for rolling out deploys, so you now have the concept of a canary, or in our case what we call a jailed node, where you can bring it up, take a look at it, make sure that stuff is set the way that you expect it to be set, et cetera. We're also seeing a lot more Samsung devices coming through to our platform, which basically equates to more events coming through for us.

At this time we also retooled our backups and restores, and we're now using the open-source Medusa, which solved a lot of the shortcomings that we had before. So now we do have good monitoring around it, we do have good visibility into whether it's running properly or not, and all the other pluses that you get with using an open-source product. No more single point of failure as well. Like I said, for configuration management we're using separate repos for each of our Cassandra clusters, and we're using Spinnaker for the tooling and deployment. I don't know if you want to say anything.

All right, on the application side, we finally got to deprecate the monolith. The reason that is particularly interesting is that it had very old versions of drivers, things like that, so we got to get rid of those. Generally the Java drivers have done really well against different versions of Cassandra, so it's generally safe; it just makes me really nervous, because that's the type of person I am. The other thing on the application side that comes out of this, and it's not really something to call out so much as mention: we pulled out all of the configuration that we had baked in as best practice. That's all gone now. We're more collaborative, but we have a much deeper trust in backups and restores, and that allows us things like testing, and using our automated backups to restore to what we consider analytics clusters, which we were doing with Medusa as well. Go ahead. App-wise, that's really good, because then, as the application developer, I don't have to let anyone touch my running Cassandra cluster except for me. I don't share.

So, at a high level, the lessons. We did do a couple of roll-your-own solutions, and there definitely is a time and place for having to roll your own, but generally speaking, in our case, for the most part on the operations side, we've gotten better results out of the open-source solutions. That being said, the open-source solutions had matured since the time we did our roll-your-owns.

Another big takeaway is not to double down so much on multi-tenant clusters. That proved to be troublesome, as far as troubleshooting, as far as tuning. At this point we have two global rings and any number of single-DC rings, and in one of the rings we had a write-heavy and a read-heavy workload together, and it was just a nightmare to try to tune that. We've had cases of the noisy neighbor, where one app, not necessarily misbehaving, was definitely sucking up a lot of the resources, to the point where it was affecting the other ones. So right now we try to keep to a single app per cluster, which keeps it a lot simpler.

Another takeaway is to really minimize the time you run mixed-version clusters. That one I will really stress: when you're trying to do an upgrade, really sit down and plan that window pretty carefully. Your cluster is pretty much locked during that time: you cannot do any topology changes, you cannot do any schema changes, and you cannot do any streaming operations. So really try to minimize the time that you're running mixed versions. It sounds simple, but I've seen it over and over, so yes, I would put that one in red if I could.

And I'll let you do the rest. So for us, the takeaways. Depending on where you are in your application journey right now, I would say, yeah, minimize hard-coded values if you have good config management, and figure out how to instill the best practices that way. But if you can't, it will last you a few years; you can get by three or four years if you hard-code all your best practices into a library and distribute it. Just remember that you left it there; don't forget, like I did. We had a number of weird anomalies that were, oh right, that one bug where we pinned to a version and never went back through and cleaned it up, and three years later we had to go through and clean it up. It's those simple things, but those are application-level concerns where you need that deep partnership. And again, the whole minimizing hard-coded values point is true, but remember you're going to add that cognitive overhead for your developers. And just, again, always partner deeply with ops; we're up here together, because the best outcomes that I've ever seen in Cassandra are when the application teams and the operations teams are working together. As soon as we got those metrics on the same dashboard as my application, it's been amazing.
I remember looking at IOPS bursting issues with AWS at the same time as my upper-level throughput, so that's really fun. It's also how you know when you've made a mistake in your app code, because nothing has changed on this side and then everything suddenly turns red, and you're like, oh, I definitely did that to them. Sorry. Yeah, no problem.

These are some resources which have been helpful for me when I've gotten into various fires. The Cassandra Slack channel I have found to be very responsive. I've gotten a plethora of information from The Last Pickle blog; I highly recommend reading that. Jon Haddad's blog also has a lot of very good info, especially around tuning. The DataStax content on Stack Exchange is also very good, same with Planet Cassandra. And then there's the Apache Cassandra Corner podcast, which Aaron Ploetz does, and there's an episode where Jeff has been on. So I recommend those resources; there's a lot more, but these are kind of my top picks. Thank you very much for your attention. I appreciate it. Any questions?

And someone can check us on time. No, we're right on time. I will answer questions until the nice man that runs all the AV kicks us off.

We got more than enough heads-up on paper, but never enough time in practice. Yeah, that's an accurate statement. Yeah, always.

Yeah, I was going to say, it affected, thankfully, only a small subset of our clusters, so once we nailed down the re-datacentering process it wasn't too bad. That, plus the use of the EBS volumes, allowed us to move through it fairly quickly.

Oh, I think he had one. Yeah.

My number one villain of "oh, that's happening again" is misconfiguring speculative execution because I misread where the decimal point should be. What would happen is it would pop up and we'd go, oh, we found another one where we didn't fix it. So that was my pain point. What were some of yours?

Oh gosh, the thing that you always complained about to me was probably everything around backups and restores; making those work was what was always cropping up as painful, until we got really, really solid monitoring there. Yeah, that's true. That was the area that was just painful, because it wasn't as front and center. Everything else, I think we just ended up with unique problems pretty often, so they didn't repeat, except for when I hard-coded something and it propagated throughout 300 microservices.

I think my big one is more generic, and it's on both the operations and the developer side: really double down on your metrics. I mean, not to be besmirching DataStax OpsCenter, but that was a little bit of a blessing in disguise with us moving to Datadog, because we really could double down, pull a lot more metrics in, and really tune them to what we needed to see. Right, like, for instance, we could put in more specific metrics around EBS, you know, the queue depth, the latency, et cetera.
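On the speculative execution mishap mentioned above: in the 3.x DataStax Java driver the delay is specified in milliseconds, and speculative executions only fire for statements the driver knows are idempotent, so a slipped decimal point turns a sensible pause into a flood of duplicate requests. A minimal sketch with placeholder numbers, not the actual SmartThings settings:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

public class SpeculativeExecutionConfig {
    public static void main(String[] args) {
        // Delay is in MILLISECONDS: wait 200 ms before sending a speculative request,
        // and send at most 2 extra attempts. Slipping the decimal point (2 ms instead
        // of 200 ms) floods the cluster with duplicate reads.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withSpeculativeExecutionPolicy(
                        new ConstantSpeculativeExecutionPolicy(200, 2))
                // Speculative executions only apply to idempotent statements; marking
                // everything idempotent by default is itself a choice to make carefully.
                .withQueryOptions(new QueryOptions().setDefaultIdempotence(true))
                .build();
        cluster.close();
    }
}
```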
One thing we weren't clear about, and we should answer this and then we should stop, because I'm looking at the time: when we talk about metrics, we were exporting almost all of the JMX table-level metrics into Datadog. So we would see all of the table-level metrics in Datadog in real time, along with everything else, which was incredibly useful. All right. Yeah, that did kind of bail us out a couple of times. So thank you. Thanks, guys.