Imagine a horizontally scalable database that puts data everywhere your users are. It seems intuitive, but most databases are still light years away from true turnkey global distribution. Azure Cosmos DB offers the first globally distributed, multi-model database service for building planet-scale apps. It's been powering Microsoft's internet-scale services for years, and now it's ready to launch yours. Only Azure Cosmos DB makes global distribution turnkey: you can add Azure locations to your database anywhere across the world, at any time, with a single click, and Azure Cosmos DB will seamlessly replicate your data and make it highly available. Azure Cosmos DB allows you to scale throughput and storage elastically and globally; you only pay for the throughput and storage you need, anywhere in the world, at any time. As the first and only schema-agnostic database, Azure Cosmos DB automatically indexes all your data so you can perform blazing-fast queries. And since no data is born relational, the multi-model and multi-API capabilities remove the friction, allowing you to build with any data model and API. While most database services force you to choose between strong or eventual consistency, Azure Cosmos DB provides multiple well-defined, intuitive consistency choices, so you can select just the right one for your app. Take off with confidence knowing Azure Cosmos DB provides guarantees and comprehensive SLAs other databases would never attempt: high availability, consistency, throughput, and single-digit-millisecond latency at the 99th percentile. Your app deserves a globally distributed database service that's out of this world. Welcome to Azure Cosmos DB.

Hey everybody, Jay here. Welcome back to Azure FunBytes. Thank you so much for joining me. It's been another week of things to learn, and what do we do here? We learn about the people, the products, the process, and all the things that go into an amazing Azure experience. I've got some new features this week, and some nice slick background music. Love StreamYard; great product that helps you do some really cool things. This week I've got some really great guests joining me to talk about Cosmos DB, which we just watched a little video on. It is a globally distributed database service within Azure that allows you to create NoSQL databases and access them extremely fast: low latency, high throughput, consistency, you name it, and there are all these different models. My guests today are going to tell me all about it, so I'm going to bring them both in, one at a time.

First, I'm going to bring in Gal Levy. Hi, Gal. How are you? Hey, everyone. It's great to be here. That intro was amazing; I love it. And I'm also going to bring in Theo van Kraay. Hey, Theo, how are you? Hello. I'm good. How are you, Jay? I'm great. So where are the two of you located? Where are you broadcasting from today? I'm in New York City. Oh, me too. I'm in Brooklyn. Same, also Brooklyn. I'm in a tiny little town called Weston in the northwest of England, so the UK. Well, we are multi-continental today, but New York seems to have the big lead on where we're from.

So we've got a really cool episode today. We're going to spend a lot of time on Azure Cosmos DB. We don't have a lot of time, but the time we have, we're going to spend on that. And before we get too much into things, I want to remind everybody of a few things. One, we've always got a poll, and we'd love you to take this poll.
I think it'll help set some context for my guests about who's watching and what they're watching for. This week we've got: are you currently using NoSQL for your applications? I'd love to know a little bit more. You can vote yes, no, or it's complicated, which is one of my favorite answers. And then we've got the docs for today's show, all the documentation you'd want, pointing to the Learn module, which I absolutely recommend you go and check out. You can get this free education right within Microsoft Learn. It's gamified, so you can earn points; as you can see, I am an 8,900 XP wizard from doing things with Learn and Cosmos DB. So that is the big intro. Next, we're going to get into you two for a minute. And I want to remind everybody: if you've got questions and comments, you can send them in the chat, whichever chat you're using. That includes the native chat within Learn TV, at aka.ms/LearnTV, or if you're watching on YouTube or Twitch, you can send them that way.

So I want to ask you both a little bit about your background and how you got here; I always ask that question. Theo first: tell me a little bit about your background and how you got here. Yeah, so I don't know how far to go back. I'll just go back to when I started in tech, I guess. I started as a software engineer working for a big insurance company. I don't know why I'm not saying who they are; it's AXA. I worked for AXA for quite a long time, and then I worked in government in the UK as well, and moved into a role as an architect. And then I came to Microsoft. In between those roles, I did a master's degree in data science. I was getting into distributed data platforms, NoSQL, and big data, and developed an interest in those areas. Then I worked in the customer success unit as a cloud solution architect for a couple of years before joining the Cosmos DB team, although really I was straight onto Cosmos DB from day one. I think it was just being released at that time, so there was a lot of buzz around it. There still is a lot of buzz around it. It feels like I've always been working on it, but it's only been two, three, four years now. So that's me in a nutshell. And watching the product grow has been very, very impressive.

So, Gal, I've got the same question for you. How did you get here? Also in my mind, I'm thinking how far do I want to go back. Career-wise, I started off in software engineering at DataStax, actually, working on Cassandra. Eventually I transitioned over to PM, and now I'm a PM at Microsoft, still working on database software, working on the API for MongoDB. In between, I did an MBA at UC Berkeley.

Cool, cool. That's great to learn a little bit about both your backgrounds. I think it's always helpful to get some context about our subjects. And speaking of setting context around our subject, I'd love to learn a little bit more. I'm lucky enough to know some stuff about Cosmos DB, but both of you are working as a team on getting things out there about how to use and improve your applications with Cosmos DB. So I'd love it if one of you wanted to give me a brief intro and some information to set the context. How about you start us off, Gal? Sure. Do we mean context on Cosmos DB specifically, or the product in general? Yeah, Cosmos.
Yeah, Cosmos DB is a globally replicated database platform that's multi-model, meaning there are different APIs that can be used to access it. What's great about that is that as a developer, I don't have to learn a new database format in order to use Cosmos DB; I can use the one I'm most familiar with. So for example, with MongoDB or Cassandra, I can use the API in the same applications as I was using before and not really have to change much of anything in order to leverage the benefits of Cosmos DB. Obviously, we also offer the SQL API, which is the native API for Cosmos DB, but that's not the only option.

Gotcha, gotcha. And so you're using these different APIs like you would through any of the native tools. So if I'm using the SDK that MongoDB releases in order to connect an application, I don't have to make any modification; it's more like a connection string change, if I'm right. Exactly. From the client's perspective, you're still using a MongoDB database, but you're actually using Cosmos DB and leveraging all the benefits of that.

Sure. So you said that you've focused on the MongoDB portion of things, and I've got a question for you about that. And Theo, you'll be able to give me a little bit on this as well. If I'm building a new app, should I choose the Cassandra or Mongo API, or the SQL API? What do you believe is the right move? And Theo, I'll ask you to jump in as well after we hear from Gal.

Well, I'm biased, obviously, working on MongoDB, but from my perspective, if you're a MongoDB developer and you've gained those skills of using the MongoDB drivers and tools and all that, you probably want to keep using the knowledge you've gained over time and not have to totally change your way of thinking, and not have to change your apps. So if I were a MongoDB developer, I would use the MongoDB API on Cosmos DB, because I still get the benefits of Cosmos DB, but I don't have to learn a new query language. I can keep my apps the same as they were before and just use a new database platform underneath.

Gotcha. And Theo, same question: if I were building an app, why would I select Cassandra as opposed to one of those other APIs? Yeah, it's a really great question. It's one of those questions that demands you start with that horrible "it depends" response. It depends where you're coming from. If you're at one end of the spectrum, where you're really wanting to solve for performance, let's say, and efficiency, it would be difficult to look past the core API, the SQL API, because that's really the cloud-native platform. That's where it started, and all of our platform-level features usually ship there first. So in terms of raw performance, you'd have to recommend that. But then again, that's not to say the other APIs aren't close to matching it, and there are other things that come into this as well. Like Gal just said, if you're coming from a different developer space, it's not just about the platform; it's also about the programmability and the developer experience. And of course Mongo and Cassandra, unlike our proprietary SQL API, are open standards, and that tends to be important to a lot of companies who have multi-cloud strategies and so on. So it's a complex thing to answer. It's not something you can really recommend to somebody.
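As an aside, the connection-string swap Jay and Gal describe is about as small as it sounds. A minimal PyMongo sketch, assuming a hypothetical account named myaccount (port 10255 and the ssl/replicaSet parameters are the documented API-for-MongoDB connection-string defaults):

```python
from pymongo import MongoClient

# Point an existing MongoDB app at the Cosmos DB API for MongoDB by
# swapping the connection string; the driver and app code stay the same.
# "myaccount" and the key are placeholders for your own account values.
client = MongoClient(
    "mongodb://myaccount:<primary-key>@myaccount.mongo.cosmos.azure.com:10255/"
    "?ssl=true&replicaSet=globaldb&retrywrites=false"
)

db = client["appdb"]           # hypothetical database name
print(db["users"].find_one())  # same driver calls as before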
It really depends on where your starting point is. So, yeah, I like Gal's answer: if you're coming from a place where you really like the Mongo experience or the Cassandra experience, choose those. If you're really pure greenfield, starting from scratch, and especially if you want to solve for particular performance characteristics, then you may want to take a look at the core API.

Gotcha. Thank you so much, both of you, for those answers. I really appreciate getting that extra context. So one of the things I'd love to learn a little more about is common use cases. Gal, would you give me some common use cases for the MongoDB API? Yeah, so common use cases for the MongoDB API are ones that leverage Cosmos DB's benefits: the ability to use MongoDB as if Cosmos DB were a MongoDB database, but also scaling, which is a major factor. The MongoDB API on Cosmos DB is able to scale instantaneously, within less than a second. And that's a big benefit, because sometimes people have applications that don't actually use all the scale at once; they go up and down depending on the time of day. And you don't want to wait when you need that scale; you want it immediately. That's a major, major use case for us.

Gotcha. Theo, same question, relating to the Cassandra API. Yeah. I guess the main area where Cassandra differs is that it has a schema; it's the only API that does. So if you're particularly attached to having a schema, and there are lots of reasons you might want to have the schema at that layer, then you would use Cassandra. Also, similarly, performance: order-catalog-type scenarios, or even IoT-type scenarios. Scenarios where throughput is really important, latency is important, availability is important. Actually, all of the APIs are common in that respect; where the differences are, like we said before, is the developer experience. But where we generally see use cases across the whole of Cosmos DB is the things we already talked about: where you need great performance, IoT scenarios, retail scenarios, ordering, comparison-website scenarios. I would actually say that between all of the APIs, you have a pretty general-purpose set of functionality. So there isn't really a use case where I could say you couldn't use Cosmos DB for that, or where there wouldn't be a way of optimizing for it. But it is a distributed database, so if you were coming from the relational world, there'd be some learning to do at the very least, even though I'd confidently say you could apply your use case to it very well.

Sure. So one of the things you said is "distributed." That really lends itself to globally diverse applications that may have endpoints, or I should say user bases, in different parts of the world. So having distributed edge locations that you can replicate your data to is, I'd imagine, crucial to really getting a lot of the benefits. And I know there are multi-master reads and writes, so beyond whatever region you initially created, you can write to multiple regions, and then you can pick your consistency level: eventual consistency, session, things like that all become available.
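For reference, Cosmos DB defines five consistency levels (strong, bounded staleness, session, consistent prefix, eventual). As a hedged sketch of how an app picks one, here's the core SQL API's Python SDK relaxing the account default per client; the endpoint and key are placeholders:

```python
from azure.cosmos import CosmosClient

# A Cosmos DB account has a default consistency level; a client can
# request a weaker one for its own connection.
client = CosmosClient(
    "https://myaccount.documents.azure.com:443/",  # placeholder endpoint
    credential="<primary-key>",                    # placeholder key
    consistency_level="Session",                   # e.g. "Eventual", "Session"
)
```

A client can only weaken the account's default level this way, never strengthen it.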
So, Gal and Theo, I know you both have presentations to show me. We've got about 45 minutes or so, maybe a little less, to go through them. And Gal, I'd love you to jump into what you've prepared for today. Sure. If it's possible, please share the slides I have.

Yeah, so I wanted to drill in a bit on scale, because this really applies to all of Cosmos DB. As you mentioned, Jay, we have regions all over the world, and you're able to scale up and down based on user demand in those regions, as well as have very high availability; we offer up to five nines of availability because of the multi-primary writes feature. And when I say scaling, I don't just mean scaling VMs. Since we don't have a concept of VMs, we essentially sell throughput of the database. The reason for that is that we're a multi-tenant system, meaning we have a huge, huge pool of resources from which we can allocate resources and throughput to you at any second, very fast, as you need it. And you only pay for what you need. That allows you as a user to scale at a finer granularity than whole VMs, meaning you're only paying for what you use, and you can get all the scale you need whenever you need it. Which also means you don't have to pay when you actually don't need it.

And if I'm right, the pricing model is based on request units, which are basically the IO utilization and throughput that you're actually making use of. And I know there are different models; there's reserved capacity and there's serverless. I'll give you some time to go through that.

Yeah, so on that note, imagine you have a use case like the one in the middle here, where your workload spikes up depending on the time. If you're using VMs, you'd struggle to achieve cost-effectiveness: imagine blocks of time as you scale your VMs up and up and up in blocks, and then down and down and down. With Cosmos DB, you don't have to wait for those VMs to scale up, and you don't have to over-provision. You just pay for what you use, and you scale up, following that demand line much, much closer than you would with VMs, and that saves you money. And most importantly, it gets your users the performance they expect. Your database does not slow down; you always get 100% performance for your database, because we have giant pools of resources we can allocate to you immediately.

What you mentioned with reserved capacity is very applicable to standard provisioned throughput. The slide you see here shows the three different ways of provisioning resources in Cosmos DB. With standard provisioned throughput, you set the limits: how much request-unit budget you need for your database. That's very good for workloads that are very consistent. If you know you always need around 7,000 RUs, you can have those 7,000 RUs pre-provisioned and know you'll save money, because you don't really need to change it often. However, if you have a spiky workload, autoscale is a great choice, because you only pay for the resources you use and it scales up and down instantaneously depending on what you need. The last one here is serverless, which is GA now on the API for MongoDB: if your database is idle, you pay nothing, zero, and you only pay for the RUs you actually use. That's really good for sporadic workloads, or dev and test workloads where you use the database once a week or so.

Gotcha, gotcha. Very, very cool.
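To make the three models concrete, here's a sketch using the azure-cosmos Python SDK (assuming a recent SDK version; all names are hypothetical). It creates one container with standard provisioned throughput and one with autoscale; serverless is chosen when the account is created, so there's nothing to set per container:

```python
from azure.cosmos import CosmosClient, PartitionKey, ThroughputProperties

client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<primary-key>")  # placeholders
db = client.create_database_if_not_exists("shop")  # hypothetical database

# Standard provisioned throughput: a fixed RU/s budget you manage yourself,
# like the steady 7,000 RU/s case Gal describes.
db.create_container_if_not_exists(
    id="orders_standard",
    partition_key=PartitionKey(path="/userId"),
    offer_throughput=7000,
)

# Autoscale: set only the ceiling; the service scales between 10% of the
# maximum and the maximum, and you pay for what it scales to.
db.create_container_if_not_exists(
    id="orders_autoscale",
    partition_key=PartitionKey(path="/userId"),
    offer_throughput=ThroughputProperties(auto_scale_max_throughput=10000),
)
```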
Being able to pick a pricing model across these different options is really important for users, because you have to be budget-conscious when you're working with a globally distributed database, rather than just spinning things up and not worrying about what things cost. That's not something people are really interested in doing. And one of the things that I've really enjoyed about Cosmos DB is how it empowers the developer, because there's way less management of the underlying systems than you'd have in traditional database administration work. DBAs within your team can readjust their workflow so that they're more focused on making sure the indexes are right, ensuring consistency levels are right, and working with the SDKs for backups and restorations and things like that. All of that, I'd imagine, really gives the developer a bigger hand in the situation.

Exactly, and one great case of that is that you can even switch between throughput models on the same Mongo API collections, between standard and autoscale. That's why we always recommend users start off with autoscale first, see where their workload patterns go, and then switch to standard if they feel that will save them money. It's up to the developer to choose which one they prefer.

Another great case of where a lot of this operational workload is lifted is sharding. Sharding is very important because it allows you to horizontally scale your database, and in Cosmos DB, sharding, or partitioning, is native: it's a cloud-native database, built to horizontally scale as much as you want. Unlike many other MongoDB services, Cosmos DB manages the sharding automatically for you, as long as you choose a shard key, or partition key, that evenly splits the data. So the DevOps folks we were just talking about can focus on that instead of the actual sharding mechanics. As long as you choose the right partition key, the shard key, Cosmos DB will automatically shard for you and scale up and down to your needs. And not just scale a little bit up and down: scale as much as you want; there's no limit there. Which is really amazing, because it manages all of this for you. You don't have to decide how many shards you want or manage them; it's all done automatically by Cosmos DB. That's one of the benefits of building your MongoDB apps on the API for MongoDB on Cosmos DB.

Very, very cool. And one of the questions I have is: how long does it actually take to autoscale, or to scale up, that database I might be using? Yeah, since we have a giant pool of resources to allocate to you instantaneously, your scaling is instantaneous. You don't have to wait, your users don't have to wait, and you only pay for what you're actually using at that point in time, so it's really the best of both worlds there. And that's with the autoscale model I mentioned, right here, the middle one.
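On the shard-key point Gal made: with the API for MongoDB you declare the key exactly as you would against MongoDB itself, and Cosmos DB handles the physical partitioning behind it. A sketch using the standard shardCollection admin command, which the API accepts (the names and connection string are placeholders):

```python
from pymongo import MongoClient

client = MongoClient("<cosmos-mongo-connection-string>")  # placeholder

# Create a sharded (partitioned) collection by declaring the shard key;
# Cosmos DB decides how many physical shards sit behind it and rebalances
# automatically as the collection grows.
client.admin.command(
    "shardCollection", "appdb.users",  # hypothetical namespace
    key={"user_id": "hashed"},
)
```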
And one more question: why is the API for MongoDB more efficient than, say, other MongoDB offerings?

So, first off, it's more efficient because the operational workload of having to manage sharding and deployment and upgrades and all that is taken away from you. Sharding is done automatically for you, and upgrades take seconds, because we don't run any MongoDB code. We act as if we're a MongoDB database and implement the MongoDB wire protocol, but all of our versions are in one code base, and upgrading between versions is just feature flags switched on or off, which takes seconds. That's one aspect of efficiency. The second aspect is that we have that giant pool of resources that we can allocate to you whenever you need it, immediately, meaning you're only paying for the resources you use, instead of over-provisioning VMs, waiting for your usage to catch up, and then over-provisioning again to make sure you don't get a degradation in performance.

So rather than going with another offering from another company that's basically leveraging virtual machines or containers on a platform, whether GCP, AWS, or even Azure, rather than letting a third party do it (and I know MongoDB has their own service; it works great, but it's still running on top of somebody else's cloud), this is native to Azure. Exactly. Imagine you have a request; it doesn't matter what kind of request it is. In a VM-based framework, as your load on the VMs goes up, your requests get slower and slower and slower, until they eventually can't go through anymore. In Cosmos DB, the idea is that you want 100% guaranteed, expected performance for each request, and you get that. When you need more throughput, you can pay for more throughput, and you can autoscale to it immediately, so your requests always behave in the same expected way, and there's no surprise when suddenly a bunch of users start using your app and your database performance goes way down because you weren't prepared.

Gotcha. So one more question for you, and then I'd love to hear a little more about Cassandra, unless you've got more to show us. You mentioned all those different features, and I'm curious: how do I upgrade my database account? Because I know Cosmos DB has this hierarchy of account, then the actual container you're working with. So, upgrades. Actually, I'll go back to this slide, because it has a cat, and I have a cat, so I like talking about cats in general. If my cat needed to upgrade my database, my cat would be able to do it, because it's literally one press in the portal, or it can be done programmatically. The database stays online, everything keeps working just as you expect, and you get to leverage the new features immediately. To demonstrate: you go into the portal, click this button, set the version to whichever version you want, and your database starts working that way with every new connection. And to top that off, with those feature flags you can even undo the changes: you can go back and downgrade to a different version. Because of the way our codebase is structured, we're not forced to end-of-life older versions; we just keep older versions in the same codebase, running for as long as we want, and you can upgrade and downgrade between versions to get the features your app desires.
Well, I am a dog person, and my dog is very, very excited to be here in my apartment, so thank you very much. So I'm curious, Gal: is there anything else you'd like to show me before we head over to Theo so he can show us a little bit more about Cassandra? Would this be a good time to demonstrate just one quick thing with Robo 3T? Yeah, sure. Let me just share a different screen here; give me one moment, please. Sure. And to remind everybody, if you've got questions or comments, use the chat, and if you haven't already taken the poll, please go to aka.ms/LearnTV to do that. I'm going to bring up your screen, and you've got Robo 3T, which I know is a tool for browsing databases.

Exactly, yeah. I just want to demonstrate: this is a Cosmos DB API for MongoDB database, and you can see the collections it has. I want to show how these collections work the same way you'd expect as if this were a MongoDB database. But it's not; it's a Cosmos DB database with the API for MongoDB running on top of it, and all the tools and all the SDKs work as you expect. That's really important, because as a developer, I don't want my app to have to change. I want it to stay as similar as possible and just leverage all the benefits we mentioned of Cosmos DB. So you can see I can go into this demo collection, double-click on it, and see some documents; everything works as expected. I can insert documents; let me insert one here. In this case, this collection is sharded on user ID. I can validate it and insert it, and everything works as expected. This is the same experience we want everyone to find with their tooling, because we want the experience to be the same for developers while they're able to leverage the benefits of Cosmos DB.

Very cool. And then we have our keys and our values and everything we'd expect, along with that auto-generated ID: if you don't specify your own ID, Cosmos creates one for you. And when you're creating this, you're indexing on one of these fields, you have a shard key set on one of them, and then you're letting auto-indexing happen, so you don't have to worry about managing data distribution: you pick where you want it to go and let Cosmos move your data across. And you can see we're indexed here; we have an index on ID and a wildcard index as well. So, very, very cool. Unless you've got anything else to show me, I want to hear a little from Theo and see what I can learn about the Cassandra offering. No problem; thank you so much.

So, Theo, thank you for giving me some time. I've got some questions about the Cassandra API, and the first is: I know there are Cassandra managed instances that exist on Azure, but is that part of Cosmos DB, or is the Cassandra API a separate offering? Yeah, so the answer to that is yes and no. It is part of Cosmos DB in the sense that it was built by the Cosmos DB team, and it reuses the Cosmos DB control plane for all of the automation that we do, but it's actually a separate product. It runs open-source Apache Cassandra, and we automate deployment, scaling, and so on on top of that. So it's a separate offering. So is it kind of like the relationship that Azure SQL Database has to SQL Managed Instance? Is that a one-to-one there? Yeah, you got it; it's the same idea. Great, great.
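Before moving on to Cassandra: the round trip Gal just demonstrated in Robo 3T looks the same from driver code. A PyMongo sketch against a collection sharded on user_id (all names hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("<cosmos-mongo-connection-string>")  # placeholder
coll = client["appdb"]["demo"]                            # hypothetical names

# The shard key (user_id here) must appear in every document inserted
# into a sharded collection, a point Gal returns to later in the episode.
coll.insert_one({"user_id": "u42", "name": "Ada", "city": "Brooklyn"})

# Reads work exactly as they would against MongoDB itself.
print(coll.find_one({"user_id": "u42"}))
```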
So I know, Theo, you've got some stuff to present for us, and I'd love to take a look, if you don't mind me bringing your screen up. Sure, yeah. Let me know when you can see it, because I can't. Okay: Cassandra, the cool bits.

Yeah, so I wanted to give a little intro to Cassandra: what's good about it, and what's not so good about it. What's good about it is pretty simple: it's performance, right? It's distributed, it scales linearly, it's optimized for fast reads, it's very resilient and fault tolerant, and it's actually pretty easy to use as well; the learning curve, at least from the developer side, is easy. What's not so easy is maintaining it: the back end, replication settings, different configurations, and just running the platform in general can be difficult. Which of course makes it a great candidate for being an API on Cosmos DB. Cosmos DB's back end is a pretty similar architecture, so it makes it relatively, well, I don't want to say straightforward, I don't want to make it seem too easy, but we can surface this API, and as Gal was showing, all of the SDKs and tools that you would expect will work. That does depend on feature supportability: we don't support 100% of everything you might expect, so we definitely recommend reviewing our docs to see what we support and what we don't. But we're bringing new features all the time. By features, I usually mean things that are already expected in Cassandra that we're adding to this API so it can be completely feature-complete. Materialized views are in private preview, lightweight transactions are now in public preview, and truncate, named indexes, and clustering-key indexes are now in private preview as well. Then we have the type of features that are unique to the Cassandra API: change feed is something unique to Cosmos DB, a programmability model for event-sourcing-type applications, and feed ranges are a way of doing that processing in parallel for very large workloads. We've also got server-side retries now in GA. And there are features we're working on right now. We're working on native RBAC, the role commands in Cassandra; again, something you'd expect in Cassandra, and we're adding feature compatibility all the time. Then there are things that are, again, unique to the Cassandra API: point-in-time restore is a platform feature that we're surfacing in the API, and we're working on an Azure Cognitive Search integration as well. And then there's Cassandra managed instance, which I'll talk about a little more later; we're adding some new features there, and there's some stuff coming.

But what I wanted to hit on was a question that, as you can imagine, we're getting quite a lot since managed instance went GA last November: how do you choose, or what would we recommend? The answer is that it really depends where you're coming from. It's a bit like the answer to the other question, which API you should choose, but specifically here, when you're choosing between two offerings that look very much the same, or similar, it's really a spectrum between control and productivity. It's very much like the choice between running IaaS implementations of applications versus using PaaS services: of course, if you're rolling your own stuff you've got a lot more control but less productivity, and vice versa, you get a lot of productivity, but you might not be able to do certain things you expect, or you may need fine-grained control that isn't there.
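An aside on Theo's point that the standard drivers and tools work against the Cassandra API (subject to feature support): a minimal connection sketch with the Python driver. The account name is a placeholder; port 10350 over TLS is the documented Cassandra API endpoint:

```python
import ssl
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# The Cassandra API speaks the CQL wire protocol over TLS on port 10350.
ssl_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ssl_ctx.load_default_certs()

auth = PlainTextAuthProvider(username="myaccount",        # placeholder
                             password="<primary-key>")    # placeholder
cluster = Cluster(["myaccount.cassandra.cosmos.azure.com"],
                  port=10350, auth_provider=auth, ssl_context=ssl_ctx)

session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
```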
What this boils down to is an analogy that I think Scott Hanselman first used, and I've stolen it ever since: think of it like choosing between an Uber, a manual shift, and an automatic. If you need a car to go somewhere, you can't get a bus, you can't get a train, you need a car for whatever reason, but you don't have a driving license, or you don't have a car or can't afford one, then an Uber is going to look great to you. But if you have a car and you have a driving license, then obviously you're choosing between a manual shift and an automatic. That analogy works for me because there is this kind of line between these types of offerings: self-hosting, the managed, hosted platform version of Cassandra, and something like the Cassandra API. With managed instance, you're still in control of the platform configuration aspects; there might be more automation, but you're still in control. With the Cassandra API, Microsoft really controls all of the platform-level stuff. Got you. Did you have a question? No, no, please go ahead, continue. Yeah, and so what we're adding to managed instance over time is what I like to think of as a semi-automatic shift: more things that you can control. Though if we take the analogy further, of course, there are always going to be certain things that we're always doing, otherwise it's not really a managed service: deployments, OS patching, scaling, and so on. What this is really about is trying to give users who like Cassandra as much choice as possible. And again, there's always a trade-off; it depends where your starting point is, and so you have this kind of choice. Got you.

So I wanted to hit migration, and I've got a bunch of demos, so we'll see how they go. Migration in this context, just to be clear, is like-for-like: you already have, let's say, a self-hosted version, or even a managed version for that matter, of some flavor of Cassandra. You're not migrating from, I don't know, MongoDB or something into Cassandra; you're migrating from a flavor of Cassandra into either managed instance or Cosmos DB with the Cassandra API. Sure. So if you're using DataStax and you want to bring everything over to Azure, that would be a one-to-one, just using the migration tool, right? Exactly. And even there, there are challenges; migrations are always a challenge, especially live migrations, which is what I'm focusing on here. The nice thing with Cassandra managed instance is that as a managed service we have this unique capability of allowing people to configure hybrid clusters. What that means is that if you deploy a managed data center in Azure Managed Instance for Apache Cassandra, that data center can join an existing cluster ring in your own self-hosted deployment of Cassandra. So what you'll have is self-hosted, on premises or in the cloud, wherever it happens to be, plus a managed data center in the cloud.

And just to give a quick, I guess, Blue Peter-style demo (British people will know what I mean by that): here's one I made earlier. This is a resource group in Azure where I have deployed an open-source Apache Cassandra cluster, and I've got this thing here called Azure Managed Instance for Apache Cassandra, this cluster resource.
What this cluster resource is, in the case of hybrid clusters, is kind of like a bookmarking resource, so that the system knows that any data centers deployed for a managed instance, in this case, are going to join an existing cluster ring of this name. And so what I have here, where is it, let's go back into the cluster resource: if I go into my data center tab, I can see information about my managed data center here, but there's also a self-hosted data center. If I run the nodetool command from one of the nodes over there, I'm able to see both my self-hosted data center here and the managed data center here.

Now, that might seem a little bit weird: why do we do this, why are we mixing and matching self-hosting with the managed service? The reason is quite obvious for people who want the easiest, most seamless way possible of migrating from, let's say, a self-hosted or on-premises Apache Cassandra cluster. There's nothing better or more seamless than actually using Cassandra replication, joining your self-hosted data center with that managed data center. It's a form of hybrid, and then everything just seamlessly replicates. And similarly, it doesn't have to be used for migration: it can also be used as a way of extending your capacity. If you're comfortable where you are, whether self-hosting or on-premises, and you want to use Azure as an on-demand way of scaling up and down elastically, this is a great way of doing that as well, because we provide automation around deployments, and also around scaling nodes up and down within your managed data center. So that's why we've done that.

It sounds like a good disaster-recovery option too, for people who need their data to live in multiple places, or who are in a scenario where only a certain amount of data can exist in Azure based on compliance or regulation, and then they can have the other portion of the data hosted in their own data center.
Right, absolutely, yeah. It gives you flexibility. But if you're not in a space where you can do that, where you're not running a version that's close enough to the versions we support, or you can't do a hybrid cluster for whatever reason, maybe you're on some wire-protocol flavor of Cassandra that doesn't support connecting up to our service, then the approach we recommend, and that we see a lot of people use, is this thing called dual writes, or double writes as it's sometimes known. It's not really a new way of doing migrations. Say you have data being written to the old database over a timeline. You start by migrating the schema, and then you configure your app to write to both the source database and the target database; typically, the writes to the target would be asynchronous. Then, if necessary, you'd also have something migrating or copying the historic data, because obviously you can't guarantee that updates are going to hit all the existing data. If your retention is short enough, you just leave it running, and eventually the data in the old system becomes irrelevant; but if retention is very long, then you want something that migrates all of the old data as well. Either way, you get to a point where you feel your databases are in sync, you validate that the migration has no errors and no missing records, and then you cut over.

The problem with this pattern is this stage right here: configuring your app, changing the code to point to a different target everywhere you're doing updates, is maybe a little bit invasive, and if you've got a lot of applications hitting the same database, obviously that's not very convenient. So what we've done is develop an open-source tool we call the dual-write proxy. It's a piece of Java software that you install and run as a process on each node of a given existing Apache Cassandra cluster, and once it's running, all you have to do in your application is change the port, with no other changes. The proxy will then route requests both to the local node and, asynchronously, to the target Cassandra system, whatever that may be. So this is a way of doing the dual-writes process while taking away a lot of the pain and concern and friction.

Then, of course, what we recommend for doing the data copy is Spark, with the Cassandra Spark connector. In this case, you'd have to either back-date or preserve the original write time, so that when records are copied over, they don't overwrite anything being updated live via the proxy. For the demo I'm about to do, I'm going to use a sample that we can share that preserves the write time, and it also has a routine for doing validation and then correcting any errors that come up.
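For illustration, here's roughly what the hand-rolled dual-write pattern looks like at the application level, which is what the proxy automates away. A sketch with the Python Cassandra driver, with hypothetical hosts and schema: a synchronous write to the source and a fire-and-forget asynchronous write to the target.

```python
from cassandra.cluster import Cluster

# Two independent sessions: the existing cluster and the migration target.
# A real Cosmos DB target would also need the TLS/auth settings shown in
# the earlier connection sketch.
source = Cluster(["10.0.0.4"]).connect("shop")   # hypothetical source
target = Cluster(["10.0.0.8"]).connect("shop")   # hypothetical target

insert_cql = "INSERT INTO orders (id, user_id, total) VALUES (%s, %s, %s)"

def write_order(order_id, user_id, total):
    # The app's write path still hits the source synchronously...
    source.execute(insert_cql, (order_id, user_id, total))
    # ...and mirrors the same write to the target asynchronously, so a
    # lagging or erroring target never blocks the live request path.
    future = target.execute_async(insert_cql, (order_id, user_id, total))
    future.add_errback(lambda exc: print("target write failed:", exc))
```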
So, yeah: the moment of truth. Let's do this. Let me go to my resource group first; in fact, let me go to my app first, because it's something I need to start running. What I have here is just an app that I'm going to run, and all it's doing is dumping 100,000 records into my source database and table. That's going to simulate data that's already there. Of course, you might have millions of records, but this is a short demo, so I'm just going to dump 100,000 records.

While that's happening, let me look at the resource group that I have. Again, like with the other hybrid cluster you saw, I've just deployed a vanilla, open-source Apache Cassandra cluster with three nodes, and then I have a Cosmos DB account in here, a Cassandra API account, that I'm going to end up migrating to. I also have a CQLSH tool, which is the tool for interacting with Cassandra. So I'm connecting to my open-source database here, and I'm just going to do a count; hopefully that data has been dumped in there. There we go: 100,000 rows. And much like you saw with Gal demonstrating Robo 3T, I'm using exactly the same tool to connect to the Cassandra API. The tool works in exactly the same way. All of the drivers and open-source tools that you can think of work pretty much the same way; the feature compatibility might not be identical, but the wire protocol is the same, so all of that developer experience is preserved. So when I run a count here, I've got no records; obviously, I'm about to migrate something.

Let me go back to my app here; that's obviously finished now. What this is going to do when I hit enter is simulate updating a random selection of the records I inserted, but also adding new records as well. So let me start that running, and if I go back to my source table, let's go back to the source window, I should see the count starting to increase, since there are inserts going in and records being updated. What I didn't mention before is that I have already installed the proxy, and the app is now pointing to the proxy. You can see some activity there: it's connected, and now it's routing requests, or should be routing requests, to the Cassandra API. And if I go into my target table, I should see some records starting to appear. Which is great: I haven't had to do anything except change the port, install the proxy on the source Cassandra nodes, and configure it to point to the target. That's all I've had to do.

But of course, while it's great that I'm getting the live updates, I also want the historical data in this case. So I also have a Cassandra, sorry, a Spark script here that's going to migrate the data. This is a sample, again, that we have public that you can use; I've just taken the cells and put them into some notebooks here in Azure Databricks. So let me run this thing. And this is all in-browser, which means you don't need to worry about installing local tools; it's all there for you. Exactly, yeah. There's some code, you have to use Spark and the Spark connector and the libraries that we put together, but it's pretty straightforward to interact with, hopefully. So this is running; this is going to migrate all of those 100,000 records, and who knows how long it will take. Sometimes it's quick, sometimes it's slow. So maybe this is a good time to ask another question.
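While that runs: Theo's notebook is based on a public sample, but the core of a Spark table copy with the Cassandra connector looks roughly like the sketch below. Keyspace, table, and hosts are hypothetical, and connection settings are supplied per read/write as options; the real migrator sample additionally preserves the original write time and includes the validation routine he mentions.

```python
from pyspark.sql import SparkSession

# Assumes the Spark Cassandra Connector package is attached to the cluster
# (e.g. com.datastax.spark:spark-cassandra-connector).
spark = SparkSession.builder.getOrCreate()

# Read every row from the source cluster's table.
source_rows = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="shop", table="orders")               # hypothetical
    .option("spark.cassandra.connection.host", "10.0.0.4")  # source seed node
    .load())

# Write the same rows to the Cassandra API account (placeholders throughout).
(source_rows.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="shop", table="orders")
    .option("spark.cassandra.connection.host",
            "myaccount.cassandra.cosmos.azure.com")
    .option("spark.cassandra.connection.port", "10350")
    .option("spark.cassandra.connection.ssl.enabled", "true")
    .option("spark.cassandra.auth.username", "myaccount")
    .option("spark.cassandra.auth.password", "<primary-key>")
    .mode("append")
    .save())
```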
Sure. So one of the big differences between what you were showing, as far as Mongo is concerned, and what Theo is showing is the lack of a schema required to work with your data. And if I'm right, whenever we need to make modifications to our application, say a new key and value needs to get stored, we don't have to go and migrate our data to a new version with a new schema. Am I right about that, from a Mongo perspective? Yeah, you can basically create whichever documents you want, with whichever keys you want. The only important thing to mention is that if you are creating a sharded collection, the shard key needs to stay consistent and needs to be in every single document you insert into that collection. The reason is that the database needs to know how to split the documents across the individual shards; it needs to know where each document goes, basically.

Great. So, Theo, it looks like your process finished, right? Yeah, it finished, and while you were talking, I also ran this validation routine. You might think I'd be alarmed by this, but I'll break the suspense: I deliberately engineered this failure. There are the 1,582 records that failed. It looks like there are some missing rows in here, and I can see the reasons for failure: missing rows, or some updates didn't make it through. And again, to break the suspense, I'll tell you why this failed. When I did the data copy, I was saturating the throughput on Cosmos DB, so while the client's requests were working fine against the source database, against the target I was getting this "request rate is large" error. This is by far the most likely thing you'll run into if you're doing a migration to any API in Cosmos DB for the first time and you're used to something that's provisioned in a more traditional way. Cosmos DB works on a request-based currency, these things called request units. You have to provision enough for what you need, and if you exceed that amount, you get what we call rate-limited. That's something we could probably spend a whole other show talking about, but in this case I was deliberately a little bit silly: I ran my data copy at the same time as my steady-state workload was at its peak, so obviously I should expect to get throttled and not have enough performance.

A couple of ways you could avoid this. One, you can go into the Cassandra API account, or any account, I believe; I think we support this everywhere now, certainly Cassandra and Mongo. Go into Features and enable something called server-side retry. What that will do is, during rate limiting, actually retry on the server, up to a limit of 60 seconds, and after that it will time out. Another thing you can do is limit the throughput in the actual Spark settings during that period of time, so you can fix it that way.

So that's all well and good, but I've still got a problem: I've lost some records. How do I correct that? Well, we also have a script here that allows you to retry those transient failures: filtering on the failed rows, running the validation routine again, and then running the migration routine to pick up the records that failed and send them again. So let me try doing that, and hopefully it won't take as long as the other things I've just run. Great. And just to give you a heads-up on time, we've got maybe five minutes or so before we have to wrap up. Yeah, we're good, I think; we're making good time. It's always fun when you do it live and you're watching the clock, especially with a deliberately ambitious demo like this, where the failure is non-deterministic... never mind. I think you're doing a great job. Yeah, so it looks like we're about to get a result here. Again, this has retried all of those failed records, and hopefully it will finish pretty soon.
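Theo's fixes were server-side retry and throttling the Spark job, but the same condition can also be handled in client code. With the API for MongoDB that Gal covered, rate limiting surfaces as error code 16500 ("TooManyRequests"); a hedged sketch of client-side backoff:

```python
import time
from pymongo import MongoClient
from pymongo.errors import OperationFailure

coll = MongoClient("<cosmos-mongo-connection-string>")["appdb"]["demo"]

def insert_with_backoff(doc, max_attempts=5):
    # Cosmos DB's API for MongoDB reports rate limiting as error code
    # 16500; back off and retry, much like server-side retry does for
    # you on the service when enabled under the account's Features.
    for attempt in range(max_attempts):
        try:
            return coll.insert_one(doc)
        except OperationFailure as exc:
            if exc.code != 16500:
                raise                       # a real error, not throttling
            time.sleep(2 ** attempt * 0.1)  # exponential backoff
    raise RuntimeError("still rate-limited after retries")
```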
Sure, I was just going to ask Gal while we're waiting: is there any real limitation on how much storage I can use for the amount of data I'm going to be putting into Cosmos? Storage? Yeah, so there's no limit to the storage you can use, as long as your collection is sharded; you can scale as much as you want. I think there is a minimum amount of throughput per amount of storage, but that can be lifted somewhat depending on your use case. Overall, as long as your database or collection is sharded, you can scale as much as you want; there's no limit. That's great, because one of the more difficult parts of database management is managing the underlying storage: being concerned with how much there is, how much is getting backed up, whether it's all getting backed up. So knowing that this is all done through a PaaS, and you don't have to concern yourself with those things, is great.

So, Theo, how did everything work out? Yeah, awesome. What happened while you were talking is that I ran the retry-failures script here, and then I ran the validation to see if I was okay, and it looks like I've got no failures. So I guess the one final check I can do is run the count on my source and then run the count on my target, and hopefully it's a similar number, or the same number. Great. Awesome; yeah, that's exactly the same. So that was it, really, for the migration, and that was a live migration using completely open-source tools: the Cassandra migrator sample for the Spark connector, to preserve write time, and also the dual-write proxy. And in this case, I did it for the Cassandra API for the first time, so if you want to know how to live-migrate to the Cassandra API, this is how to do it. We've got this documented as well if you want to repeat the process, so reach out if there are questions.

Yep. And the Cosmos DB documentation I've got up here (I'm going to take your screen down now): the documentation help is at cda.ms/3j3, where you can get all the information on Cosmos DB, including all the documentation. There are tutorials, code samples, reference guides, and links to the different API SDKs you may need for the different types of APIs there are. And we only touched on two today, really, Cassandra and Mongo, plus we talked a little about the SQL one; there's also Gremlin for graph databases and, of course, Table for pretty simple key-value storage. So those are the big ideas, and we're just about out of time.

I'm going to bring up this little bit of background so we can close with some nice music. So first, Gal: if people would like to reach you and talk about anything on the internet, where can they? Yeah, feel free to reach out to me on LinkedIn; that website on screen goes straight to my LinkedIn, so feel free to send me a message. I'm always happy to answer questions, and the docs page for the API for MongoDB on Microsoft Docs is a great way to get started. Great. And Theo, same question: where can people get in touch with you to learn more about Cosmos DB or reach out with questions? Yeah, you can reach out to me on Twitter, but I almost never use it, so technically you can reach me there, my handle's there, but reach out to me on LinkedIn as well. It's easy to find me; I think I'm pretty much the only Theo van Kraay in the world, it looks like, so you put me into Google and I just seem to pop up.

Well, very cool. So, gentlemen, I want to say thank you so much for being part of the show today. For those of you watching, if you go into the show notes, you'll be able to get a full list of links to the different
documentation that you can check out, and there's also a link to the free-tier info for Cosmos DB. So if you want to get started without spending a few bucks on it, and you just want to see where it takes you, maybe create a couple of documents and see how you can access or modify them, give the free tier a try. Anyway, gentlemen, thanks so much for being part of this today. I know it's the evening out there, Theo; and Gal, we've got a little bit of our day left. I hope you stay warm; it is absolutely terrifyingly cold here. Thank you so much for having me on the show. No problem. Guys, let's give everybody the big wave goodbye, and I will catch you all next time here on Azure FunBytes. Thanks for watching; see you soon.