All right guys, let's get started. We're super excited today to have Jim McCollum and Jeff Carpenter from DataStax here to give a talk about the new Astra Cassandra-as-a-service product they put out in the last year. They're here to talk about the process of doing that, taking an existing NoSQL system and making it run as a cloud service. Jeff is the head of developer relations at DataStax. What was your title again? Yeah, I'm leading up the learning teams. Okay, and then Jim is the VP of cloud engineering at DataStax, where he's been since 2018. So again, the way we'll do this is: if you have any questions, just unmute yourself, announce who you are and where you're coming from, and then ask your question; interrupt at any time. Okay. All right guys, Jim, go for it. Cool. So yeah, like I said, Jim McCollum, VP of cloud engineering here at DataStax, here since 2018. Cloud engineering is really responsible for Astra, which is our database as a service, and for our internal cloud platform slash platform as a service. The whole point of that is to give us the ability to get code out into the wild and onto the various cloud systems in a consistent manner. There are our Kubernetes integration teams: they're primarily responsible for the operators we're building right now, but they're really deep into the Kubernetes project, really understanding how we integrate best as a database. Kubernetes is actually not the best system in the world for deploying stateful applications, and having folks who understand that well is important to our success at this point; I'll get into that a little later. Obviously we have the cloud DB team, which I'll talk a little bit about, whose job is really to adapt Cassandra to be more cloud native. We have a cloud special projects team, and I'll introduce one of the projects they're working on towards the end, and then our cloud SRE team, who is really responsible for when things go wrong: getting in and figuring out what it is. We put a lot of work into the operator and into remediations, but sometimes things go poorly and we need folks to get in, really understand what's going on, fix it, and then help us build new remediations for the future. I was actually previously a DataStax customer before I came here. I was at Lending Club, and we had purchased DSE. I was looking for a cloud product and it didn't exist at DataStax at the time, and then this role popped up on the website and I said, this is definitely what I want to do with my life: I'm going to build the product that I wanted to buy in the first place. That's how I got here. I've been in engineering and data leadership roles in a bunch of places, from hospitals to biotechs to Amazon to GE, so lots of different types of exposure, not just a pure database background. I live out here in Berkeley, California with my family, and my LinkedIn details are at the bottom, so you can hit me up there afterwards if you want. Then Jeff. All right. Yeah, I've been doing a bunch of different things in my three years here at DataStax.
I got involved in the Cassandra community about six years ago. I started as a user of Cassandra and an architect on a project building a new reservation system in the cloud on AWS for Choice Hotels, and I got the opportunity to update the O'Reilly book on Cassandra, which brought me into the orbit of some great people at DataStax, Patrick McFadin and others. Anyway, I ended up joining the developer relations team. Right now I'm leading the different learning teams we have that produce our training and certification programs and the new developer part of our website, datastax.com/dev, which has a lot of fun embedded training scenarios in it. If you've seen Katacoda and that terminal-based stuff, it's pretty fun. So that's what I do, and my contact information I think just flew by there. Okay, thanks. Yeah, so anyway, if you want to get connected with me, do. The one fun fact I'll share about the cloud product: I was trying to figure out what DataStax's market strategy was when I was talking to them, and then they announced the acquisition of a company that was to be the first version of the as-a-service product; I think that'll come up in the history here. When they did that acquisition, I was like, oh, I get it, I get what the strategy is of this open source company, I'm going to go do this now. I think that's maybe the jumping-off point for Jim to tell this great story of the past couple of years. Awesome, thanks, Jeff. And sorry about that, I'm really bad at slides on a Mac for some reason; I can skip through the whole thing if I really try. So yeah, I wanted to go through this. I'm not traditionally a database person, as I alluded to; my background is more about building big cloud platforms. So I want to go through this from the direction of: yes, we've done some interesting things with the database, but how do we build this huge ecosystem up around it, what was the journey to get there, what were some of the pitfalls we fell into, how did we get around them, how did we progressively improve the product, and how did that make not just the cloud product better but DataStax and Cassandra in general better as we went. Basically, as Jeff alluded to in his intro, in the beginning there was a cloud product that we had acquired about a year before I got here, and it was spaghetti at its finest. It was a big Ruby on Rails stack. Every one of these arrows did something different, and it was one big callback stack. If one piece failed, it would take three or four hours of an engineer's time to go back and trace where everything went wrong, because all it took was one callback to fail, and we would have to go through all the logs and all of the systems to figure out where it had failed and where it got stuck. It also made updates extremely difficult. It was a very, very painful system to work with. So this is what things looked like when I got here. Basically, a customer would come to us and we had a white-glove service: we would send them an AWS CloudFormation script, which would take care of peering their VPC back to our system, where we would then install DSE with ELK and Prometheus and OpsCenter and all the rest of the pieces that really made DSE a product, and then start monitoring it. It wasn't a lot of fun.
Invariably, the admin on the customer side would run this script, something would go wrong, they would have to get an engineer involved and we would have to get an engineer involved, and the two would talk and figure out why the peering failed and where things got stuck, and it was just not a lot of fun. We onboarded a few customers like this; ultimately it wasn't super successful. So my next step was to make this stable. How do we take this thing and make it stable progressively, without rewriting the whole thing from the ground up? So we moved into this contained spaghetti phase, like canned spaghetti. We took best practices, the user experience, and all of the control plane, packaged that up in a Kubernetes environment with microservices and Envoy and all the best practices of cloud native development, and really packaged that into what is the purple layer there. That made things a lot easier and a lot cleaner. There was still death and destruction down below, but at least the part the user interacted with was isolated and wouldn't completely fail in front of the user. We also changed the model pretty appreciably: we would actually deploy into a VPC of our own that DataStax had control over. So we would push into this VPC, and it was still a pretty heavy deployment, but it was three DSE (Cassandra) instances, OpsCenter, and LCM, which is our provisioning and management tooling. But then we had really our first new feature for Astra, which was a driver endpoint service. The reason we wanted that is that traditionally, when you're using a cloud database, getting connected to it is very difficult. You have to go through VPC peering, or use a specialized connection, or any number of other bits of networking magic just to do development on a laptop. First and foremost, our goal was to make it easy to get started developing on top of Cassandra. So we built this new driver endpoint that can take the Cassandra wire protocol, translate it, and pass it back to the appropriate instances behind the scenes. But that alone wouldn't be safe, because people use passwords with low entropy, and putting those things directly on the internet isn't safe. So what we actually did was make it an mTLS connection. In the Astra experience, and it still exists today, you can go in and it will give you a secure connect bundle, and the DataStax drivers for Cassandra use that secure connect bundle to set up an mTLS connection back through Astra for you, so it's really only two lines of code. Traditionally, with Java, getting mTLS working is a lot of switches on the command line and probably 20 or 30 lines of boilerplate code, and we smashed that into the driver itself and made it very convenient.
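(For readers following along: roughly what those "two lines" look like with the DataStax Java driver and a secure connect bundle downloaded from Astra. The bundle file name, credentials, and keyspace below are placeholders for values from your own database.)

```java
// Minimal sketch: connect to Astra via the secure connect bundle, no manual TLS setup.
import com.datastax.oss.driver.api.core.CqlSession;
import java.nio.file.Paths;

public class AstraConnect {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .withCloudSecureConnectBundle(Paths.get("secure-connect-mydb.zip")) // downloaded from the Astra UI
                .withAuthCredentials("my_client_id", "my_client_secret")            // placeholder credentials
                .withKeyspace("my_keyspace")
                .build()) {
            // Simple smoke test: read the server version over the mTLS connection.
            System.out.println(session.execute("SELECT release_version FROM system.local")
                    .one().getString("release_version"));
        }
    }
}
```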
This is pretty deep in the weeds; what are we actually looking at? This is the overview architecture. The DataStax Enterprise product is a packaged version of Cassandra, and, sorry, yeah, if we look up here at the top, what's in the box is really the control plane for the service itself: the UI and the services that actually orchestrate the deployment of the databases themselves. If we're looking at the databases, the original version is the box in the lower left corner here, and the new version is the deployment in the right corner over here; this one was in the customer environment, and this one was in a hosted environment that we were able to take care of on our own. So again, is this somebody who has an existing Cassandra deployment and wants to switch over to the managed one you're providing? No, this is all net new: I need a Cassandra database, I want to come in, run a Cassandra database, click a button, and make that happen. And out of curiosity, why do you have both Postgres RDS and MySQL RDS in your back-end control plane? Good point. Everything in the Rails stack over here on the right-hand side came through the acquisition. The IDP is really for our SSO solution. We use a commercial slash OSS project for this, Red Hat's Keycloak product, and that does not work with NoSQL yet. We've looked into it, and it's not optimized for that type of pattern yet. It's definitely something we want to do as a skunkworks project at some point, but there are some pretty specific searches and such that aren't super compatible with Cassandra right now. We think that with our new storage-attached indexing we might be able to make it happen, but in the spirit of moving fast, this doesn't seem like a place where we would optimize at this point. Any other questions I can answer about this slide? There are going to be fewer arrows and fewer boxes as I go; the goal is to get more consolidated as we go here. Okay, now I think it's good. Cool, so that step worked. We got to about six months in, and unfortunately the Rails application underlying everything had about a 20% failure rate, which meant that every time we had to do something, it was basically rolling a die: land on a one and it's going to fail. So we were faced with the option of going through the Rails stack and trying to fix it, or rewriting it, and we chose the latter: we took that whole Rails stack and basically baked it into this blue box here.
So we took all of the functionality, all of those callbacks, and all of those services, and made somewhat of a monolith microservice out of them that handles the entire orchestration of putting down the database itself and getting everything pushed out. At the top here, this is when we migrated onto Keycloak for our SSO. These front-end services are really what drive the user experience; these are other projects inside of DataStax. The idea, when I spoke about the platform as a service: when I arrived at DataStax, we were very much a tarball company. We built versions of DataStax Enterprise Cassandra and shipped them as tarballs. But for this type of rollout, we really needed a way to get software onto the internet very quickly, so we built this PaaS service that allows anybody internally to build an application and get it out on top of the cloud platform very easily, with full CI/CD and integrated testing, getting it into a Kubernetes environment without having to stress about really knowing anything about Kubernetes. If you go into most of the bigger cloud providers, places like Amazon or Google or Microsoft, they have these types of systems set up: you walk in day one, there's a way to deploy, a happy path, and it's all paved. We've been paving that road as we went internally here, and that is really the point of the big top box. The important part at the bottom is that we consolidated everything into this one box and pushed infrastructure provisioning into this other box. All the requests that come in, this box handles the orchestration of pushing them out; it uses a workflow engine with a big DAG of different steps to roll things out on different cloud providers and through different types of VMs. But the really big net add after this is that we started working with Kubernetes and we built in our command processor. The goal of the command processor is to keep people's hands out of the customer environments. If we need to do something inside a customer environment regularly, we want one thing to do it. I don't want engineers logging in, SSHing in, typing commands, fat-fingering things, and dropping somebody's table. We want a clean set, essentially a recipe book, of things to do on a cluster, whether it's scale up, scale down, remediate a node that may be in a failed state and has to be brought back, or perform repairs against the database. What it does is take commands off of a queue from the orchestrator service here, execute those commands, and then put the response back onto the queue again, so we have this command-and-control loop with every single instance that keeps us from having to log into it; it also stops us from needing SSH into these systems. We use these queues as asynchronous mechanisms for pushing commands out to our clusters, and it gives us the ability to affect many, many boxes or environments all at one time. What's an example of a command, like bring up the latest version of something, or install this package? Yeah, install this package, if we've given you the recipe for the package. Other things would be scale up: the first step of a deployment deploys three nodes across three different AZs, and the next step, if the customer needs to scale, is okay, let's provision three more nodes, install Cassandra on them, attach them to the cluster, and make sure they're in a good state, and if anything goes into a bad state, send a command back so an engineer gets a message that something broke and we need to break glass and get into that environment. It could be that a node is not responding, so try to restart the node; run repairs against that node; perform a point-in-time backup for that node. Basically all the various things you would have to do, but without having to actually log into the environment. And out of curiosity, what's the back end of the queue, Kafka? Actually, everything is on AWS, so we just use SQS. We have one SQS queue for every single customer environment, so each subscribes to its own queue, and then we have a shared queue for everything that comes back. We can scale up the orchestrator service to handle more load and process more messages if we need to; we've never needed to yet. It's pretty good with just the three to six nodes that run normally, but as we add more customer environments we will start getting more information back. We've also not had a storm of activity; if something goes horribly wrong on AWS, if we lose a full AZ, I'm expecting a storm of messages to come in that say: something bad happened, I'm trying to reinstate in another AZ.
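(A minimal sketch of that command-and-control loop, to make it concrete. This is not DataStax's actual implementation; the queue URLs, command names, and the execute() stub are hypothetical, though the AWS SDK v2 calls themselves are real.)

```java
// Hypothetical command processor: long-poll a per-environment SQS queue for commands,
// run them locally, and report results on a shared response queue instead of SSHing in.
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.*;

public class CommandProcessor {
    public static void main(String[] args) {
        String commandQueueUrl  = args[0]; // per-environment queue this processor subscribes to
        String responseQueueUrl = args[1]; // shared queue the orchestrator listens on
        SqsClient sqs = SqsClient.create();

        while (true) {
            // Long-poll for commands pushed down by the orchestrator.
            ReceiveMessageResponse resp = sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(commandQueueUrl)
                    .maxNumberOfMessages(10)
                    .waitTimeSeconds(20)
                    .build());

            for (Message msg : resp.messages()) {
                // e.g. "SCALE_UP", "RESTART_NODE", "RUN_REPAIR", "BACKUP" (illustrative names)
                String result = execute(msg.body());

                // Put the outcome back on the shared queue for the orchestrator.
                sqs.sendMessage(SendMessageRequest.builder()
                        .queueUrl(responseQueueUrl)
                        .messageBody(result)
                        .build());

                // Only delete the command once it has actually been handled.
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(commandQueueUrl)
                        .receiptHandle(msg.receiptHandle())
                        .build());
            }
        }
    }

    private static String execute(String commandJson) {
        // Placeholder: look the command up in the "recipe book" and run it against the node.
        return "{\"status\":\"ok\",\"command\":" + commandJson + "}";
    }
}
```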
So yeah, with these improvements, where did we get? Much better reliability: we got down to a tenth of a percent failure rate from 20%. That was pretty good, still not exactly where I want things to be, but way, way better. We got full CI/CD pipelines for all development, so at this point we're pushing thousands of times a week through the pipeline. It really is a continuous flow of changes and remediations, so it's not just that engineers have it great and can push code and it goes out; it's also that when things go poorly, we can get a fix out very, very quickly. We had a slight brownout situation a couple of weeks ago where we pushed a piece of code and it didn't do very well in front of about 1,400 people, and in about five or six minutes we had figured out what it was, had a patch, and had new code pushed out to the production environment. Out of curiosity, how long does that CI pipeline actually take? It depends on whether it's an emergency or not. If it's not an emergency, it takes about three or four hours, because we haven't gotten to the point I want to be at, which is that when you push a front-end change it only tests the front-end framework; right now it goes through a full regression every time we push anything out. We want to get to a point, with enough mocks in place, where if we exercise the front-end environment with mocks, then the front end can go right out immediately, and if we push the orchestration, then that can go out without exercising the entire infrastructure provisioning pipeline every time, as long as we don't change that part. Okay, and then updates to Cassandra, the core Cassandra runtime, is that a separate CI pipeline? That's a separate pipeline, and that one is actually a combination of CI and manual. We're getting better at that, but that's one where the blast radius, if something goes off the rails, is incredibly large, so I want people monitoring when it goes out. We've identified cohorts of databases that we phase it out through. First is a cohort of individuals on the team, and we push to them first; the next ring is DataStax employees; the next ring is free tier databases that aren't getting a lot of access; then free tier databases that are getting a lot of load; and finally, people who pay us money. We progressively tier these out so that if there is an impact, it's controlled. Where we aren't at this point is feeling super comfortable that, without running a set of manual scripts, we'll just kick off the next step; each ring is a different run and a different set of eyeballs, with people getting progressively more nervous. Best case scenario, how long is that deployment cycle to go through all of them? We just did 3,000 clusters in about three hours the other day. Okay, wow, that's impressive. Yeah, it's getting there, and I'll get to how we got to that point and why it's way easier than it used to be. I think there was maybe an implied question in there from Andy, and maybe I'm reading between the lines too much, but I think it was: what version of Cassandra is actually running on the nodes? That was not my implied question, but that's a good question too. Oh, okay, well, we can start with that question anyway. Right now we're running DataStax Enterprise, the same version that we package for DataStax customers. We're waiting for the Cassandra 4.0 release to come out before we offer Cassandra itself as an option, mostly because we're not as sure about the operational characteristics of the 3.11 release, and having two versions out there, soon to be three once 4.0 lands, would mean my SRE team would have to be familiar with three different versions of Cassandra. Since 4.0 is really where we're headed as a company, they need to get used to it, so we want to get it out there and start using it; getting them trained on 3.11 and understanding its operational characteristics and where it goes off the rails just felt like a bridge too far for the first version here. Cool. The other thing we built in: this version of DSE is actually somewhat customized. We went through and made some pretty fundamental changes; some of those have been backported into the 6.8.1 release, but some have not yet. The one that has made it in is the guardrails. Cassandra is notorious for letting you do whatever you want and then failing, and for a database as a service that's not super awesome. We want something out there that is stable, runs, and is hard for the customer to bring down by doing things they probably shouldn't. So, limits on column sizes: Cassandra would let you put a gigabyte worth of data in a column for every single row, and then you try to read the whole table, it pages it all out, you run out of memory, and the node just falls over. We got rid of those types of failure scenarios, as well as limits on the number of indexes and other specific tuning for the environment we're running in, just to make sure that if a user who's not familiar with Cassandra tries to do something, it doesn't bring the whole database down, or even bring a node down, and we're constantly improving those.
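(A purely illustrative sketch of the guardrail idea: check requests against configured limits up front and fail with a clear error, rather than letting a node fall over later. The thresholds and method names here are made up; this is not DSE's actual guardrail code.)

```java
// Hypothetical guardrail checks of the kind described above; values are illustrative only.
import java.nio.ByteBuffer;

public final class Guardrails {
    static final int MAX_COLUMN_VALUE_BYTES = 5 * 1024 * 1024; // cap on a single column value
    static final int MAX_INDEXES_PER_TABLE  = 10;              // cap on secondary indexes
    static final int MIN_REPLICATION_FACTOR = 3;               // no RF=1 keyspaces

    // Reject an oversized value at write time instead of failing on a later full-table read.
    public static void checkColumnValue(String column, ByteBuffer value) {
        if (value.remaining() > MAX_COLUMN_VALUE_BYTES) {
            throw new IllegalArgumentException("Value for column '" + column + "' is "
                    + value.remaining() + " bytes, exceeding the guardrail of "
                    + MAX_COLUMN_VALUE_BYTES + " bytes");
        }
    }

    // Refuse keyspace settings that would quietly remove fault tolerance.
    public static void checkReplicationFactor(int rf) {
        if (rf < MIN_REPLICATION_FACTOR) {
            throw new IllegalArgumentException("Replication factor " + rf
                    + " is below the minimum of " + MIN_REPLICATION_FACTOR);
        }
    }
}
```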
We got full tenant isolation, which is great, ops automation through that command-and-control loop with the queues, and centralized monitoring and alarming. But it wasn't all great yet. We had a pretty large stack; in that bottom right corner there was a lot going on, and we would run into cloud provider limits pretty quickly. We needed all these IP addresses and boxes and DNS entries, and the cloud providers themselves put limits on, say, the number of IP addresses you can have in an individual AWS account, so we actually have to shard across accounts. You can only get so many EC2 instances of a certain type, so the more of those things that exist in the customer environment, the more complicated that stack gets and the higher the probability we run into limits, which means we run straight into a wall at full speed. These limits don't all have APIs you can query; for some of them you just hit the wall, get an error back, and it's, oh crap, we're out of DNS entries again, and now we have to get on the phone with AWS to get them to give us more DNS entries. So we had to build in this idea of sharding not just databases but across accounts, so we actually have a ring of database accounts out there in the world. So Amazon knows that these are, like, 20 DataStax accounts? Yeah, they don't care as long as you pay them. And it wasn't 20; we actually went through a phase where we would do an account per customer, or an account per user, and that got up to something around 700, I think. It got to the point where it was just impossible for us to see what was going on in the world, and we moved back down and consolidated into these sharded accounts. Once we start running into the limits, we'll add a new account, it gets put into the buffer of available accounts, and we provision into that one until it catches up with the others. There's a little bit of art, which is: do we feel like we're getting close to the limits; and then there's science, which is: as we add accounts, fill them up progressively and spread the load across the top again. Sharding database accounts, I've never heard of this, this is awesome. It was pain, incredibly painful. The sharding itself actually works really well, but before that point I would spend engineering weeks just running into these walls and trying to fix them. We also had the reliance on OpsCenter and LCM, another set of tooling that has to run constantly in another environment, and there's a lot of cost: the more boxes you run, the higher the costs we have to pass on to our users, which just doesn't work. And so along came Google. This was about a year ago, May of last year, about when Amazon had just gotten in trouble for putting out DocumentDB, which was a direct clone of Mongo, and their Kafka clone. Google decided to take a different tack, which was to work with the OSS vendors and pull them in as the providers on GCP, not try to fight them or clone them, but actually bring them in and bring the community with them. We were invited to be part of the first cohort of this, along with Elastic and Confluent, for Elasticsearch and Kafka. One of the things they strongly suggested we should do is move towards Kubernetes as a rollout environment. This was right when Anthos had been announced, so they were really excited about Kubernetes for everything on the planet. This was something my team had wanted to do for a very long time, but we just hadn't had the air cover or the budget to do it, and now we were allowed to pursue it. So we went out and built this capability for Kubernetes to deploy Cassandra, something that had existed through several operators out there, but none of them were really built for mass consumption. We started working with Helm charts, and Helm charts worked okay, but they didn't really give us the ability to build smarts into what happens. What are Helm charts? Helm charts are kind of like RPMs for Kubernetes rollouts: a packaged way of deploying an application. A chart describes all the parts, all the different containers, and how they tie together, but just like an RPM package, it gets things installed without a lot of smarts in it. The difference between that and a Kubernetes operator is that the operator is actually a process that runs alongside the rest of the Kubernetes containers and keeps everything in line. It will go out and poke at all the different pods, make sure they're operating correctly, and check metrics, so you can build smarts into it that say: when throughput exceeds a certain amount, or when there's a backup of hints or a backup of repairs, either perform this action or reach out and alert the human operator or the database system that something needs remediation. It's really the difference between an install and a somewhat expert system that can help us administer and keep these systems running. We moved onto that, and it turned out really, really well, and it's fundamental to what you asked before about how fast it is to update. When we want to update a system, we build a new Docker container, push it out, and then tell all of the databases to go update themselves to the latest version. They pull it down, and the operator is smart enough to incrementally roll through each node, update it, make sure it has rejoined the ring and is healthy, and then move on to the next one. So a lot of the time it's point, fire, forget, and then just go back around and check.
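(A conceptual sketch of that rolling-update loop, not the actual open-source cass-operator code. The Node interface and image parameter are hypothetical; the point is that the operator touches one node at a time and waits for it to rejoin the ring and report healthy before moving on.)

```java
// Illustrative rolling update: never take more than one replica out of the ring at a time.
import java.time.Duration;
import java.util.List;

interface Node {
    String image();
    boolean isHealthy();          // e.g. readiness probe or metrics check
    boolean hasRejoinedRing();    // e.g. gossip status is Up/Normal again
    void restartWithImage(String image);
}

public class RollingUpdater {
    public void rollOut(List<Node> ring, String desiredImage) throws InterruptedException {
        for (Node node : ring) {
            if (desiredImage.equals(node.image()) && node.isHealthy()) {
                continue; // already on the target version and healthy, nothing to do
            }
            node.restartWithImage(desiredImage);

            // Block until this node is back in the ring before touching the next one.
            while (!(node.hasRejoinedRing() && node.isHealthy())) {
                Thread.sleep(Duration.ofSeconds(5).toMillis());
            }
        }
    }
}
```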
We also have instances of deep pings that we run into the environment, which tell us whether the environment is sane or not. Once a deployment is done, those deep pings are constantly reporting back to us; they're how we know whether we're staying within SLA. They go through and check all of the endpoints; if everything's up, great, and if not, somebody gets alerted, because with four nines, if we're down for an hour in a month we've missed our SLA. An hour of downtime in a month is roughly 99.86%, which is not four nines; four nines only allows about four minutes of downtime a month, so it's important to get somebody on top of it really, really quickly. So this is the complete evolution, from a really messy stack to something that's really nice and clean and takes care of itself on deployment. This forced us into building the operator, which is great, and we dogfood it constantly: if this thing doesn't work, then we're in trouble from a cloud perspective. We've released it open source to the community; it's tested thoroughly and continues to get tested, improved, and refined. It gives us faster provisioning, faster updates, and faster lifecycle events on each cluster. We've deployed over 25,000 clusters with it so far, both clusters that people are using and clusters created through that CI/CD pipeline testing the whole loop from beginning to end. It also allowed us to deploy our GraphQL and REST endpoints very easily. Before, we would have the environment for the front end in its own Kubernetes cluster, then those three DSE instances behind that, then the LCM and OpsCenter nodes behind that, so we're talking something like six or seven individual VMs. Now we launch on three: we make sure they're appropriately sized, everything gets squeezed into those three boxes, and when we need to upgrade, we upgrade them all at one time; when we need to scale, the front end scales immediately with the back end. One improvement we want to make, when we finally find a user with a very mismatched workload, is to allow the GraphQL and REST endpoints to scale out independently of the database itself, so we can potentially handle more load on the front end and balance that against very large databases on the back; we want to be able to provide that as a feature. How does that happen? Is the GraphQL front end the bulk of the time you're spending for a database system? Right now, most of the people using the database are Cassandra natives, so they're really going through the CQL interface. The reason behind GraphQL and REST is really to help attract a new breed of developer that doesn't want to mess around with drivers or mTLS or anything else. At some point, probably due to the serialization aspects of GraphQL and REST, those endpoints are going to need more memory; they're going to have a different JVM characteristic than the database role. Yeah, that makes sense. It also allowed us to bring along bring-your-own-KMS: if you don't trust DataStax with the keys for your data, bring your own, and we can encrypt your data with your own keys. We got VPC peering, and now multiple keyspaces. We had actually limited the ability to add keyspaces, because in Cassandra, when you add a keyspace you're allowed to set a whole bunch of parameters around how that keyspace performs, and some of those are bad from a performance standpoint. It also lets you do things like set the replication factor, and we don't want people to set the replication factor to one and then wonder why their database goes down. So we had to build some mechanisms to allow for multiple keyspaces, and we've released that code as well. All of this led to where we are now, which is that we've taken Cassandra fully multi-tenant. Our free tier right now is running across about 17 very large Kubernetes nodes with about 3,700 databases out there. We've taken the operator to the next level, which allows us to squeeze a lot of these free tier databases into a very small space. That's good from a cost control standpoint: when we're putting up free databases, having to pay for them every month is somewhat untenable in the long term. It also allows us to start testing out where we want to go with this, which is paid multi-tenant tiers. Everybody can agree that having a replication factor of three across three AZs is really good; it's also really expensive, about 1,800 bucks a month. For your average Cassandra user that's not a lot of money, but if we're going down an adoption curve and trying to get people who are okay with an RDBMS, okay with Amazon RDS, and just want a bit more stability, better scalability, and better reliability, 1,800 bucks is a lot. We want to get to the point where we can give them all of those characteristics in a cheaper, easier-to-use package. These people don't need seven, eight, nine thousand transactions per second; they need like 100, and there's no way to do that in a cost-effective manner without this path. We rolled this out about two and a half weeks ago, it's operating really well, and this is another reason we can roll an update so quickly: it really just goes out and hits a bunch of very small instances and gets them updated very fast. So as a free customer, is it per container, or are you packing multiple free customers onto a single JVM? No, everybody lives in their own container. There's full workload isolation: they have their own endpoints, their own command processor, and everything is specific to them; we treat it like a regular Astra cluster. But in theory you could do even further consolidation. I don't know what kind of internal balancing for tenancy Cassandra has, but I think it's unlikely the free users are even doing 100 transactions a second; they're probably doing one a minute. For these users you could pack them into a single JVM; you don't have to do that, but it could save you even more. Yes, it could. The one problem, though, is that the back pressure and the guardrails are not at the point where I would trust that for somebody who is giving us money, meaning if there were a bad neighbor who could knock the JVM over, then we could squeeze a bunch of people onto a single JVM, but all it would take is one bad actor to bring the whole thing down for everybody. Sure, yeah. With the container isolation, everybody gets their own slice of the CPU and their own bit of memory; if you're a jerk, you're really just compromising your own ability to do anything. And Kubernetes was firmly in our path at that point, so we felt that was the best way to solve this, as opposed to trying to restrict bad actors or noisy neighbors directly on the JVM itself. There was definitely a long discussion on the best path to go down there.
So yeah, what's next? Multi-region is coming up by the end of the month here, and support for Azure. Surprisingly enough, a year ago if you had asked me, I would have said GCP was most likely to be the number two, but Azure has really come out of the gates and they are cleaning up everywhere; Microsoft is really emerging as a dominant player in the cloud market, so we are actively and aggressively charging towards that. Kubernetes makes it a lot easier; when we moved from GCP to AWS it took about three weeks to make that change. Azure is not as simple. Their APIs are not as clean and easy to use; all of my engineers have been saying they're very peculiar and unorthodox. I think part of that is that Microsoft has a very long history of supporting a deep level of backwards compatibility, supporting everything they've ever done, forever and ever, which is great because they have a much bigger enterprise base. Amazon will just tell you, nope, we're going to deprecate this feature, tough; Microsoft will support it forever and ever, which leads to a little bit of weirdness, but we're getting around it. We're replacing the Cassandra secondary indexes with storage-attached indexes, actually building them into the SSTables themselves rather than outside of them, and building them consolidated across the cluster rather than on a per-node basis. This is new code that DataStax is contributing back to the Cassandra community; it's enabled on every Astra instance today to try out, but it hasn't replaced the native secondary indexes yet. We want to bring in pluggable authentication and authorization, more regions, annual reserved usage, and paid multi-tenant. And then the next two things are the special projects I was talking about. One: we're working on serverless, on making Cassandra itself serverless. That's a team of about five or six of our best Cassandra engineers thinking about how we can separate out compute and storage in a way that makes serverless a tenable and cost-efficient model for getting Cassandra out there for everybody on the planet. Eventually we move away from the idea of even free tiers and move to a model of ephemeral compute and durable, independently managed storage: the idea that you can have a cold node and bring it up and online within seconds to handle whatever the workload is. If it's one or two requests an hour, it'll probably sunset and come back and sunset and come back, but if it's, say, one per second, then we'll just keep the workload running, and we'll be able to be more like Dynamo or Keyspaces or Cosmos, where we bill at the IOPS level and not at the level of keeping $1,800 worth of hardware up and running. They're looking at where we can break the architecture apart and make Cassandra itself more distributed, more like a microservices-based architecture and less like a big monolithic database. And then the last thing, and you are, I think, the first people outside of DataStax to hear about this, is that we've started working on a project called C2. C2 is our reimagining of what the coordinator looks like for Cassandra; it's very much an internal project at this point. We looked at what the coordinator does, and we also looked at where people have extended Cassandra in the past, DataStax included, and much of what we do is actually in the coordination tier. If you look at integrations with Graph, with GraphQL and REST, things like integration with Kafka as an input, pluggable authentication, transparent encryption and decryption, row-level access control, read repairs, joins, the ability to output events to external systems like Kafka or Spark or any other streaming system, or to push out CDC: every single one of these ties in directly at the coordinator layer. So we've reimagined the coordinator as somewhat of a very mini app server that we can build a processing pipeline on top of, tying in new handlers for these things. Transparent encryption and decryption is really just: who is the user, what keys do they have access to, can I encrypt this for them, yes, then encrypt it on the way down; can I decrypt it on the way out, no, then they can't see it. We can do things like data masking and row-level access control through those types of mechanisms. So the way to think about this is that the existing Cassandra coordinator already has these features for the most part, right, so you're saying the existing one is a hodgepodge of things grafted onto it? Yeah, exactly, and the idea is: let's have a clean architecture where there's a clear path to get to the data and to take the data out, and then you're just hooking in different services. And the two green boxes on the top and the bottom, the coordinator service and the persistence service, are essentially what makes up the coordinator today. The one thing we really want to do with this is give everybody on the planet that has a Cassandra cluster the ability to put C2 in front of their cluster today. With a Cassandra cluster the way it exists right now, you can actually run a node in what's called fat client mode, which is essentially coordinator-only, and that buys you a little bit of extra performance because it does separate some of the compute and storage. We just want to go one level deeper and say: a, let's make it clean, so we can put these abstractions in and people aren't just trying to slam these things into the OSS project as additions; and b, how can we make this available to everybody running Cassandra today? That's really what we're illustrating at the bottom with these storage shims: you can use them to put C2 on top of any Cassandra instance on the planet and modernize your interface, even to the point where we're hoping we can contribute it back to Cassandra itself and bring things like GraphQL to a database that doesn't have that natively today, and just give people a more modern way of interacting with Cassandra. Since you're writing this from scratch, or actually, I don't know how much of these components rely on the existing Cassandra code: is everything written from scratch, or is it leveraging existing infrastructure? It's leveraging existing infrastructure; we're basing it off the OSS 4.0 code base at this point. Our goal is to work through a set of refactors that uses Cassandra as a library, so we can bring in the pieces we need and run them in a less resource-intensive manner than starting up a whole new instance of Cassandra. This is about a three-week-old effort at this point. We're at the point where we can make it work with existing Cassandra; we have GraphQL working, we have REST working. It's clunky, but it's all coming together, and we're really getting to the point where we want people to be able to run Cassandra how they want to run it and not have to be experts on the system. Abstracting out the coordinator means anybody who can do basic servlet programming can build a new pipeline handler and extend Cassandra for their needs.
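(A purely illustrative sketch of what a C2-style pipeline handler might look like. C2 is an internal project and these interfaces are not its real API; this just shows the servlet-like shape of the idea: requests pass through a chain of handlers on the coordinator path, for things like auth, masking, or transparent encryption, before reaching storage.)

```java
// Hypothetical handler interfaces for a coordinator-side processing pipeline.
interface QueryContext {
    String user();
    String statement();
}

interface PipelineHandler {
    // Return true to continue down the chain, false to reject the request.
    boolean onRequest(QueryContext ctx);

    // Called on the way back out, e.g. to mask or decrypt values in the result.
    void onResponse(QueryContext ctx, Object rows);
}

// Example handler: crude row/column-level access control for a sensitive column.
class SensitiveColumnHandler implements PipelineHandler {
    @Override
    public boolean onRequest(QueryContext ctx) {
        // Reject the statement outright if this user may not read the sensitive column.
        if (ctx.statement().contains("ssn") && !ctx.user().equals("auditor")) {
            return false;
        }
        return true;
    }

    @Override
    public void onResponse(QueryContext ctx, Object rows) {
        // A masking or transparent-decryption handler would transform rows here.
    }
}
```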
And that's about it, just where we are and where we're going. In Q3 we actually want to have multi-cloud in place, so you can run Cassandra between all the different clouds and move your workloads around if that's important to you. And with that, Jeff, do you want to take it away? Yeah, totally. What I want to make sure you understood, and Jim probably said "free tier" like a hundred times during the presentation, is that this is for you: if you want a free tier database you can go play around with, for free, you can sign up at astra.datastax.com. That's for all of you live on the call, but also anybody watching this on YouTube later on. And for folks wanting to get up to speed, not just learning CQL but also getting hands-on with these REST and GraphQL APIs and understanding what it looks like to interact with the database that way, we have a workshop series, a couple of hours at a time, for eight weeks. We just finished week two, so this is still something people can jump in on, and all the materials are available online; you can go back later and watch the videos if you missed something. This is for anybody interested in the learning aspect of this, getting a leg up with some examples so you don't have to totally start from scratch kicking the tires on the database. We've got the example code and the stuff to get you rolling, so you can ask us the hard questions, like: I did this to my database, this is the experience I had working with the REST API, what gives? We want that feedback to help us make the product better. Okay, awesome. So I will applaud Jim and Jeff for being here. Does anybody have any questions before we wrap things up? On the previous slide you showed that you support GCP and AWS now; can you roughly say what percentage of your customers are on GCP versus Amazon? And of course Azure is important, as I understand, but you're not there yet. Mostly everybody is on GCP at this point. We did a formal GA in May, so most of the folks we have right now are free tier users, and we deploy the free tier exclusively on GCP at the moment. We did that originally because GKE was free for the Kubernetes masters, so you didn't have to pay for an extra set of nodes; we've stayed there because we know it works, and at this point it's free, so for the most part it doesn't matter that much. We want to be able to provide the free tier on all three cloud providers, but we're really more focused on getting Azure working. The commercial customers are about 40/60 or 50/50 between GCP and AWS. There is a lot of pent-up demand for Azure in our existing customer base, so we really want to enable those folks. So everybody on your team does the cloud infrastructure; your team is not actively involved in developing and improving Cassandra itself, the Cassandra runtime. But I'm curious what metrics or information you can provide, or are providing, back to the Cassandra team about how people are actually using Cassandra, because now you see everything, right? You see the queries showing up, you know roughly what the data looks like. Maybe it's too early, but is there anything you're providing to the Cassandra development team to say: we see queries that look like this, make sure we run those, fix the system to run those things faster? Have you had that kind of feedback loop yet or not? So the team that's working on serverless development is all database engineers, and their secondary project is really, when we see things we don't understand from the cloud side, to hold our hands through the whole Cassandra process and what's going on there. They work very closely with the Kubernetes team, and the Kubernetes operator team is very tightly tied in with the metrics collection and reporting mechanisms. We don't actually look at people's queries specifically, unless we're running into a big problem and the customer gives us access, and then we'll go in and look at the queries, just because somebody could say "where social security number equals x" and I just don't want anybody on my team seeing that. At some point we'll have tokenized parameters, but we're not quite there yet. We have fed queries back, and the other thing we'll do is feed back the metrics that are coming in, so we'll say: GC keeps kicking off on this cluster constantly, do you know what's going on? They'll look at the metrics and the configuration and say, why don't you try this, and we'll try it, and it'll either be better or worse, and we go round and round until we come back to: okay, we should probably deploy these clusters with x, y, or z, or we should probably put another guardrail around this to prevent that type of behavior from the user side and throttle it at the user end. That sounds very manual. It is, but every one of those processes ends with: what is the change we have to make? Either it goes into the operator, or it's a new guardrail, or, rarely, it's getting the database engineers to fix something. It's definitely not self-driving. We have terabytes and terabytes of database operational data, and at some point we're looking to get the capacity to start mining through it: what are we looking for, what are some of the anomalies we can look for? And then that command processor and command-and-control loop really becomes useful, because we can just throw a message down that says change this parameter, and now we have a database that's bespoke for every one of our users. That's definitely the goal we'd like to get to. And then, for the stats you're collecting, the JVM puts out something, the database system spits out something, Kubernetes as well. Do you ever use the hardware performance counters from the CPU, do you collect that kind of stuff, or is it just user-level things? Mostly user-level things.
Cassandra is pretty well instrumented at the JMX level. One of the things we did to enable the operator was build a metrics collector that pushes that information out in a consolidated fashion, so we can push it into any Prometheus cluster. That's actually how, if you go into the interface today and click on stats, there's a Grafana dashboard: every instance has its own version of Grafana and Prometheus running to collect information and display it to you. We can also scrape all of that information off; there are a lot more metrics than we actually expose, so we can see what might be going wrong with the database itself. And, I realize this is my last question: is the verbosity of those metrics always the same, or when you recognize something's wrong do you increase your sampling rate and do a further investigation? Currently it's a standard sample rate, usually about once every couple of seconds or five seconds, so it's usually good enough to understand what's going on with the running database. What we did is just crank the retention down, so we can collect a pretty high granularity of data without overloading the box with too many metrics. Yep, okay. All right, cool, awesome, this has been really interesting, so again, thanks to Jim and Jeff for being here and sharing this with us today.