All right, can you see that now? Okay, perfect. So, as you can see on the screen, I'm Steve, and I'm here today talking about modeling relational data with Cosmos DB. Just a quick intro: my name's Steve Faulkner. I'm @southpolesteve on the internet — if you see that anywhere, on GitHub, on Twitter, on Facebook, that's me. I did go to the South Pole once, so that's me at the South Pole. That's a topic for a different talk, though, so you'll have to ask me about that later in the comments. I do work at Microsoft, on Cosmos DB, where I maintain the JavaScript SDK. Don't worry, this is not JavaScript specific — I'm not going to be doing any complex JavaScript stuff, and it's applicable to all users of Cosmos DB across all the SDKs. But if there are any JavaScript users out there, welcome, shout out. So again, I'm talking about modeling relational data with Cosmos DB. In order to do that, I have to go back in time a little bit, to the beginning, and talk about the history of databases and how we got here. Best I can tell from Googling around, the SQL database was invented around 1974, with the technology that was available at that time. To give you an idea — if anybody wants to guess in the comments — the price per gigabyte of storage in 1981 was about $700,000. I think this is the Apple II on the slide. That's what one gigabyte cost at the time; you couldn't even store a full gigabyte on one system, but that's what it would have cost you. Fast forward to 2017, and the price per gigabyte is two and a half cents. So clearly some things have changed since SQL was invented.
A lot of these newer technologies, like Cosmos and other NoSQL databases, are optimizing for different things. SQL was heavily optimized for storage, whereas NoSQL is optimized for speed. And that's really what I'm talking about today: how the way we model data changes for NoSQL. Specifically, I'm going to talk a lot about Cosmos. Cosmos is Azure's NoSQL database product. It's globally distributed, it's a document store, and it has this fantastic ability to scale out — to take your data, distribute it between different regions, and copy it between different regions. Here are the bits I think are important as a developer. You read and write JSON; it's just a document database. It's capable of massive scale and performance. It's schema-less, which is going to feel a little different if you're coming from a SQL database. And it's partitioned — it scales out, and I'll get into partitioning in a little bit, but it has a system where your data can exist on different machines. It also has a SQL API, which is really cool, and everything is indexed by default, which is a really neat feature and pretty unique to Cosmos. Digging a little more into the multiple APIs: these are our main APIs at the moment. We have SQL, which is the primary, original API of Cosmos. There's a Tables API, and there's a Cassandra API, so you can talk to Cosmos like you would any other Cassandra system. This is really great if you're coming from an existing workload or application that uses Cassandra — you can just point all your existing Cassandra drivers at Cosmos, which is really cool. Same thing for Mongo and Gremlin. These have been built on the open source tools that are already out there, so you can use the open source drivers and existing database SDKs for talking to those APIs.
We also have some cool things in the works. I believe Jupyter Notebooks just hit public preview, and Spark is on the horizon, and SCD as well. But today I'm really going to be focused on the SQL API and how to do relational modeling there — it's the one that most frequently gives people hiccups. So the first thing to understand: it's a SQL API, but it is not a SQL database. This definitely trips people up. They come to Cosmos, they start writing really complex SQL queries, and either the queries don't work or they're not very performant, because under the hood it's still not a SQL database — we're just exposing this SQL layer on top of your documents. Something to definitely be aware of. That leads us into Cosmos queries. Cosmos queries are SQL. This is probably the most basic one you can run: SELECT * FROM c. You can select individual attributes. You can also filter with WHERE clauses — you can do things like picking a dueAt that's greater than today. You can ORDER BY. There's a bunch of different stuff. This is not the full list of SQL that we support, but these are the big ones people tend to use in practice: SELECT, WHERE, DISTINCT, ORDER BY. COUNT, MAX, MIN, and AVERAGE are all aggregates, so there's some support for aggregate queries, which is pretty cool. We've got some other cool stuff on the horizon I want to mention: multi ORDER BY is coming; OFFSET and LIMIT we already have out; DISTINCT we already have out, as I mentioned earlier; and correlated subqueries as well. So be on the lookout for those things that are relatively new to Cosmos. But because it's still a NoSQL system under the hood, and it's a document database, we still have the same gotchas that can happen with any NoSQL system. There are two of them. The first is scans — and because Cosmos indexes everything by default, we tend to avoid this problem.
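To make the query shapes above concrete, here's a sketch of how they look as the parameterized query specs the JavaScript SDK accepts (the object shape passed to `container.items.query(...)`). The attribute names (`dueAt`, `publishedAt`, `title`) are illustrative, not from the talk:

```javascript
// The kinds of queries mentioned in the talk, as query-spec objects.
// Attribute names here are made up for illustration.
const queries = {
  // the most basic query you can run
  all: { query: "SELECT * FROM c" },
  // selecting individual attributes
  someAttributes: { query: "SELECT c.id, c.title FROM c" },
  // filtering with a WHERE clause, e.g. a dueAt greater than today
  filtered: {
    query: "SELECT * FROM c WHERE c.dueAt > @today",
    parameters: [{ name: "@today", value: "2020-01-01" }],
  },
  // ordering
  ordered: { query: "SELECT * FROM c ORDER BY c.publishedAt DESC" },
  // an aggregate
  count: { query: "SELECT VALUE COUNT(1) FROM c" },
};
```

Parameterized specs like `queries.filtered` are generally preferable to string concatenation, both for safety and so the service can cache the query plan.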
It's still possible to do a scan, but it usually doesn't happen out of the box. Other NoSQL products have to worry about this; we don't really run into it too much. The big one is cross-partition queries. This is both one of the unique features of Cosmos and one of the Achilles' heels people hit in their applications. A cross-partition query is when you're talking to multiple physical machines in order to execute an individual query. Let me jump into partitioning and how it actually works, so you can understand both what I mean by cross-partition queries and how they're executed by Cosmos. Imagine the Cosmos backend. It has machines in different regions, potentially, but even in one region there are multiple machines storing your data. What happens is that a hash of the partition key you provide as part of your data determines where that data lives — which machine it lives on. And then the ID is the thing that uniquely identifies that document on an individual machine. This is all transparent to you: you don't have to worry about machines, you don't have to have instances of Cosmos running, it's all serverless. But it's important to understand how it works under the hood so we can do things in a performant way. The important takeaway here is that a document is uniquely identified by ID and partition key. So if I do a query like SELECT * FROM c WHERE c.partitionKey = "foo", what happens behind the scenes is that Cosmos receives the query, hashes that partition key, and goes and finds the individual machine that all of that data lives on. This is why partition keys — understanding them and where they fit into the system — are so important: it means this is a very performant query.
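Here's a toy sketch of the routing idea described above. The hash function and partition count are made up for illustration — Cosmos uses its own hash internally and manages physical partitions entirely for you — but the principle is the same: the partition key alone determines which machine a document lives on.

```javascript
// Toy illustration of hash partitioning. NOT Cosmos's real hash —
// the service handles all of this transparently.
function toyHash(str) {
  let h = 0;
  for (const ch of str) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

// A document lands on exactly one physical partition, chosen by
// hashing its partition key.
function partitionFor(partitionKey, partitionCount) {
  return toyHash(partitionKey) % partitionCount;
}

// Every document with partition key "foo" hashes to the same place,
// so SELECT * FROM c WHERE c.partitionKey = "foo" touches exactly
// one partition, no matter how many exist.
const first = partitionFor("foo", 8);
const second = partitionFor("foo", 8);
// first === second, always
```

The corollary — a document is uniquely identified by ID plus partition key — is why point reads (give me the document with this ID in this partition) are the fastest operation available.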
Every single time this query is executed, for any given partition key, it only runs on one machine. All of the data for that query is co-located, so it can be very fast, and there's no fancy processing to do either on the backend or on the SDK side. But what happens if you do the reverse? We call this a cross-partition query: you filter on something that is not the partition key. What happens is the client takes over — it fetches a list of all your partitions from the backend, and then it executes the query against each physical partition. Now, Cosmos is unique in that it provides this for you, but over time, as your dataset grows, you end up in this situation: you have a cross-partition query in your application, you go from 1,000 users to 100,000 users, and suddenly that query slows way down, because it's having to execute way more individual queries against individual partitions — even partitions that don't contain any of the matching data. This can also cause problems on the client, because it's executing all of those requests in parallel. There are ways to tune this with the various SDKs, but we often see people running into client issues as well, because they're processing all these results, burning up CPU cycles, and so on. So cross-partition queries: something to be avoided. Quick recap of what I've talked about so far, because it's really important before we get into the relational modeling part: the SQL API does not mean it's a SQL database; the partition key is very important to understand, and you're going to learn to love it as we go through this presentation; and we want to avoid cross-partition queries. So that gets us to the meat of what I'm talking about today, which is relational data modeling.
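The scaling behavior described above can be sketched in a few lines. The numbers are illustrative — the point is that a keyed query stays constant-cost while a cross-partition query's fan-out grows with your data:

```javascript
// Illustrative request-count model for a query, per the fan-out
// behavior described in the talk: a query with a partition key
// filter hits one partition; a cross-partition query fans out to
// every physical partition, even those with none of the data.
function requestCount(physicalPartitions, hasPartitionKeyFilter) {
  return hasPartitionKeyFilter ? 1 : physicalPartitions;
}

// At 1,000 users your data may fit in one physical partition; at
// 100,000 users it may be spread across many (counts made up here).
const smallDataset = requestCount(1, false);  // cross-partition, 1 request
const largeDataset = requestCount(25, false); // same query, 25 requests
const keyedQuery = requestCount(25, true);    // keyed query, still 1
```

The same application query silently got 25x more expensive as the dataset grew, which is exactly the failure mode to design around.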
I'm going to use the simple example of a blog. There are some key bits here I don't want to lose, so I'll keep the example really simple: we have users, users have posts, and posts have comments. If you're coming from a SQL background, you might already have a system like this with tables, and you might just hop over to Cosmos and create three different containers: a users container, a posts container, a comments container. We need to provide partitioning information so Cosmos knows what to do. So maybe on users you partition on ID. You can do this — you can have your partition key and your ID be the same thing — and if it's possible for your system to work that way, it's highly recommended; it'll be very performant because your data will be evenly spread across lots of partitions. Maybe on posts we partition on user ID, so we can query an individual user's posts. Similarly, the comments container could be partitioned on post ID, so we can get all the comments for an individual post. You don't want to do this. I have misled you over the last 15 seconds. This is not at all the approach you want to take to modeling data in Cosmos, and it's tricky, because it's the reverse of how you were taught to model data, especially coming from a SQL background. It's how I was taught to model data — I was doing a lot of Postgres work — and it took me a while to understand that this is actually an anti-pattern. I'm going to go into why before I talk about exactly how we should be doing things. The problem is all the cross-partition queries that crop up here. Let's look at one example: our posts are partitioned on user ID. Imagine we want to find one post. How do we do that?
We only have the post ID. We don't have the user ID, so we don't have enough information to uniquely identify that post. It's still possible with Cosmos to make this query, but the problem is that it results in a cross-partition query. If you have hundreds of thousands of posts stored across several machines, it's going to talk to every one of those machines in order to find one post. You can imagine that would be very slow, and it's a real problem when users come straight to a post page on your blog — you don't have any user information, you just need to find a post by ID. This is the kind of situation you'll run into as your blog becomes wildly successful. Comments are another example of where we're going to run into cross-partition queries. We decided originally to partition on post ID; it seemed to make a lot of sense at the time. We still can't find an individual comment — it has the same problem as posts. But what if we want to find all the comments that belong to a user? We might want to give them the ability to see, or edit, a list of their comments on a page. These are all going to be problematic, and it's because we brought SQL-style thinking with us. There are a lot of things that go into SQL-style data modeling, but in my opinion, these are the critical bits to understand: in SQL, we rely on the database to store and drive the schema. We normalize our data, keeping only one copy of it whenever possible. We usually have pretty rigid consistency — very strong guarantees, transactions, things like that. And as a result of normalizing our data, we end up with many different tables, which is what led us down this path of doing lots of cross-partition queries.
NoSQL thinking is the new approach we're going to take, and it's what I'll talk about for the rest of the time. We're going to flip what we were taught to do on its head. First off, the application drives the schema. Cosmos doesn't have a schema and isn't going to enforce relationships; we rely on the application to enforce those things. We're going to de-normalize data, taking data and potentially storing it multiple times. And we're going to have flexible consistency. One thing I definitely run into with users is that they say, "Well, I can't use transactions, so this isn't going to work." In practice, transactions are not always necessary. If we write a couple of documents and need them both to succeed, we can often just retry — that actually works for a lot of systems. If you're writing credit card transactions or something like that, you should look at doing something differently, but flexible consistency is an important term here. Cosmos does have the ability to do transactions, but they're scoped to a single partition. And lastly, we're going to start with a single container: we're going to put every single piece of data into one container. It's going to look like this — our data all lives in one "data" container, and our partition key is literally just named "partition key." I've shrunk it down to just "pk" everywhere to help things fit on slides, but you can call it whatever you want — underscore-partition-key, pk is fine too. It doesn't matter, and Cosmos will let you set whatever you'd like. So let's go back and revisit our scenarios and how we're going to model them in Cosmos. First: finding an individual post.
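Here's a sketch of what the single container might hold. The `type` field is my own convention for telling document kinds apart — the talk doesn't prescribe one — and the field names are illustrative:

```javascript
// One "data" container holds everything; "pk" is the partition key.
// The "type" discriminator is an assumption of this sketch, not
// something from the talk.
const docs = [
  { id: "user-1", pk: "user-1", type: "user", name: "Steve" },
  { id: "post-1", pk: "post-1", type: "post", title: "Hello Cosmos" },
  { id: "comment-1", pk: "comment-1", type: "comment", text: "Nice post" },
];

// Top-level entities keep id === pk, which spreads data evenly
// across partitions and makes every lookup a fast point read.
const allPointReadable = docs.every((d) => d.id === d.pk);
```

Mixing document kinds in one container feels wrong coming from SQL, but since Cosmos is schema-less, the container is just a scaling boundary, not a type boundary.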
This was problematic before, because it caused a cross-partition query. What we're going to do when we create a post is the same thing we did for users: store the ID and the partition key as the same value. This lets us do really efficient, fast point reads on any of our top-level objects. We'll do the same thing for comments. Posts, comments, and users all now have this document — the primary document for the top-level entity — that stores all the information we need, and we can get it fast because we can uniquely identify it in the system: we know both ID and partition key, because they're the same thing. Now let's get into the relationships between these documents. If I have a user and I want to find all their posts, that's a good starting point. What we're going to do here is actually store two documents. This is the first thing we have to wrap our heads around: when we create a post, we're actually going to create two different documents as a result. The first is the post itself, with all the post information — the title, the published date, any other post stuff. But then we store a second document that links the post to the author. The partition key here is going to be "user-1-posts" — you can see this matches up with the author ID on the other side — and the ID is a pointer to the post's ID. Because these link documents all share the same partition key, they effectively form a collection of all the user's posts. So we're storing that information in two places. We're not relying on the database to maintain that relationship; we're doing it at the application level.
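The two-documents-per-post idea can be sketched as a small helper. The exact field and key names (`authorId`, the `"<userId>-posts"` convention) are assumptions of this sketch:

```javascript
// Creating a post produces two documents, per the pattern above:
//  1. the post itself, with id === pk for a fast point read
//  2. a link document in the author's posts partition, whose id
//     points back at the post
function createPostDocs(post) {
  const postDoc = { pk: post.id, ...post };
  const linkDoc = {
    id: post.id,                     // pointer to the post document
    pk: `${post.authorId}-posts`,    // groups all this user's posts
  };
  return [postDoc, linkDoc];
}

const [postDoc, linkDoc] = createPostDocs({
  id: "post-1",
  authorId: "user-1",
  title: "Modeling relational data",
});
// postDoc: point-readable by ("post-1", "post-1")
// linkDoc: lives in the "user-1-posts" partition
```

In a real application both writes would go through the SDK (and be retried together on failure, per the flexible-consistency discussion earlier); here they're just returned so the shapes are visible.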
If I created three posts, you can imagine I'd create three link documents that look like this, all having the same partition key but different IDs pointing to the full post documents. So now I can do this query: SELECT c.id FROM c WHERE c.pk = "user-1-posts". The important thing to understand is that this provides the partition key information. Going back to our earlier example of how Cosmos handles this on the backend, we know that query executes against a single partition — it's not going to go cross-partition. And this is the beauty of this style of relational modeling: we might end up doing more queries, because we don't have joins and things like that, but all of our queries are going to be very efficient and very fast, because they're guaranteed to either be point reads or to execute on a single machine. The return value looks like an array of all the post IDs, and then we might go fetch those posts individually, in parallel, to get more information. Let's expand a little more. If I want to find the user's posts that are in the draft state, I do the same kind of thing: I create two different documents when I create a post, and I again duplicate information. Before, I was duplicating the relationship information; now I'm also duplicating information that belongs to the post. The state I store in both places, and what that allows me to do, again, is write a query that is scoped to an individual partition key and also filters on the state — because I know it executes on one single backend partition. There's a gotcha, though: if we do that when we create the post, then any time we update the post, we also have to update that information in both places.
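The "query the link partition, then fetch the posts in parallel" step might look like the sketch below. `readPost` is a stub standing in for the SDK's point read (`container.item(id, id).read()`), so the sketch is self-contained:

```javascript
// Stub for a Cosmos point read, so this sketch runs without a
// live account. The real call would be container.item(id, id).read().
async function readPost(id) {
  return { id, title: `title of ${id}` };
}

// Given the rows returned by
//   SELECT c.id FROM c WHERE c.pk = "user-1-posts"
// fetch the full post documents in parallel as point reads.
async function postsForUser(idRows) {
  return Promise.all(idRows.map((row) => readPost(row.id)));
}

// Duplicating the post's state onto the link document means the
// draft filter also stays single-partition:
//   SELECT c.id FROM c WHERE c.pk = "user-1-posts" AND c.state = "draft"
```

More round-trips than a SQL join, but every one of them is a point read or a single-partition query, which is the trade the talk is advocating.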
This is where the application becomes more responsible for maintaining data consistency and data integrity. You're trading speed for making your application do a little more work. That isn't a bad thing or a good thing — it's just a trade-off, and it's by understanding these trade-offs that you decide what makes sense for your application. Now, let's do something a little more complicated: what if we want to find a user by email? Right now, this would be a cross-partition query — the only way we can uniquely identify users is by ID. Again, we're going to store information twice, in two documents. The first is the actual user document with all the information. The second uses the partition key almost like a secondary index: I store the email in it in a composite way, so the partition key is "user-email-" plus my email. (This is not actually my email, just to be clear — there were apparently some other Steves before me at Microsoft, so I couldn't get it.) The important thing to understand is that this lets us use partition keys as a secondary index on these attributes. So instead of doing this, which is a cross-partition query — SELECT c.id FROM c WHERE c.email = "steve@microsoft.com" — I can do the lookup as a single-partition query, and I don't have to worry about that query taking a long time. There are definitely other things you can do — I've only touched on the very basics today — but I think there's a lot to take away, because it really requires you to reverse your thinking about how to store relationships in a database. There are some more advanced techniques, too. Using change feed, as Santosh talked about earlier, is a really good example of how to do more advanced things here.
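The email lookup pattern can be sketched the same way as posts: two documents per user, with the second one's partition key doing the work of a secondary index. The `user-email-` prefix and the placeholder address are assumptions of this sketch:

```javascript
// Two documents per user: the user itself (id === pk), plus an
// index document whose partition key embeds the email, so
// find-by-email becomes a single-partition query.
function createUserDocs(user) {
  const userDoc = { pk: user.id, ...user };
  const emailIndexDoc = {
    id: user.id,                      // points back at the user
    pk: `user-email-${user.email}`,   // partition key as secondary index
  };
  return [userDoc, emailIndexDoc];
}

const [userDoc, emailIndexDoc] = createUserDocs({
  id: "user-1",
  email: "steve@example.com", // placeholder address
});
// Lookup becomes:
//   SELECT c.id FROM c WHERE c.pk = "user-email-steve@example.com"
// instead of the cross-partition:
//   SELECT c.id FROM c WHERE c.email = "steve@example.com"
```

As with the post state, the application now owns keeping the index document in sync whenever the email changes.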
You can use Azure Functions plus the change feed to update documents later and propagate some of these changes out, so you don't have to do all the work in your application at the moment you write a document — you can lazily update information in other parts of your database. Materialized documents are something we didn't really cover, but you can take documents and materialize views of them. The best example for a blog: if you wanted a truncated version of a post — a little summary that shows at the top of a page — you could store it separately from the regular post document and use Azure Functions to regenerate the truncated version every time somebody updates the post. The last one is splitting out containers. I said at the beginning that you want to put everything in one container, but at some point it might make sense to split your data into multiple containers. I like to tell people to treat this as an optimization: if you find some query that hits one partition — even if it's not cross-partition — but it's so heavily hit by your end users that you can't really cache it, and you need to make it work better, you can split that information out into a different container. There's no reason you can't do that for any part of your data, but it's better to start with one container and optimize over time. So, critical takeaways. I'm going to recap a bunch of what I've talked about, and then see what questions people have afterwards. First off, start with a single container — I realize I was just talking about splitting them, but start with a single container. I think that's the best choice for most people. What's the best partition key to use?
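The materialized-document idea can be sketched as a change-feed handler. The handler body is kept pure here so it runs anywhere; wiring it up as an actual Azure Functions Cosmos DB trigger with an output binding is left out, and the field names, `type` discriminator, and 140-character cutoff are assumptions of this sketch:

```javascript
// Build a truncated "summary" document from a full post, per the
// materialized-document pattern. Cutoff length is illustrative.
function summarize(post, maxLength = 140) {
  const summary =
    post.body.length > maxLength
      ? post.body.slice(0, maxLength) + "…"
      : post.body;
  return {
    id: `${post.id}-summary`,
    pk: post.pk,
    type: "post-summary",
    summary,
  };
}

// A change-feed handler receives the batch of changed documents and
// returns the summary documents to write back. In a real system this
// would run in an Azure Function triggered by the change feed.
function onPostsChanged(changedDocs) {
  return changedDocs
    .filter((doc) => doc.type === "post")
    .map((doc) => summarize(doc));
}
```

Because the change feed drives this, the write path for a post stays simple — the summary catches up lazily, which is exactly the flexible-consistency trade-off discussed earlier.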
Well, just use something literally called "partition key," because you want to think about your partition key in a very flexible way. It's going to act as an identifier, and it's going to act as a way to index things — including with composite values. You can really do a lot once you start stuffing data into your partition key. Top-level entities — all of your top-level models — should have ID and partition key be the same thing. It's only when you're doing the secondary relationship or index operations that you have them be separate. This evenly spreads all of your data across all of your partitions for point reads, which is exactly what you want — it's what allows you to take advantage of the speed and the SLA that Cosmos has. Use the partition key to index your relationships and your attributes, which I gave several examples of. And lastly, optimize as your data size grows. I gave a lot of examples of where you want to avoid cross-partition queries, but that doesn't mean you couldn't start with a cross-partition query and optimize it away later. Jumping back a little — find-by-email is probably the best example. For a small number of users, this might not be a problematic query at all. It is a cross-partition query, but if you only have a few partitions, it might be fast enough for a while. When you start getting more users, you'll want to optimize it into the single-partition form, which you can do — you can use things like change feed to replicate your data and build these indexes after the fact. Oops — I believe that's it, actually. So that's the end. I'm happy to take questions from the comments. Thank you so much, Steve. So we have a couple of questions from people on the channel. The first one is: is there an upper limit recommended for the number of documents in a given partition? Yes.
In Cosmos, one partition can hold 10 gigabytes. One of the beauties of Cosmos is that it handles a lot of this behind the scenes for you: it will split physical partitions and move documents between them, which is really cool. But you still can only hold a maximum of 10 gigabytes in a single partition. This can be problematic, and there are different ways of handling it. The best example I've come up with is a chat application. Most channels in a chat application will fit under that 10-gigabyte limit, but if an individual channel has more than 10 gigabytes of messages, you're going to have a problem. There are definitely ways to deal with it: you end up adding a number to the end of the partition key, and if somebody scrolls way back, they end up reading from one of the older partitions — you move your partitioning forward to accommodate the growth. There are other strategies as well that go beyond this talk, but 10 gigabytes is where you have to start doing fancier things. Another question we got: why is this better than a relational database — more scalable? Yeah, I would say that's probably the primary thing people are trying to get out of it: it's more scalable. I don't want to disparage traditional SQL relational databases. They're great, they can be great for lots of use cases, and I've used them a lot. Really, Cosmos excels when you have large amounts of data and really strict requirements about performance.
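The "add a number to the end of the partition key" strategy mentioned above can be sketched like this. The bucket size and key format are assumptions of this sketch — in practice you'd size buckets so each stays comfortably under the 10 GB limit:

```javascript
// Suffixed partition keys for data that outgrows one partition:
// messages are bucketed, writes go to the current bucket, and
// scrolling far back means walking older buckets. Bucket sizing
// here is illustrative.
function bucketKey(channelId, messageIndex, messagesPerBucket = 100000) {
  const bucket = Math.floor(messageIndex / messagesPerBucket);
  return `${channelId}-${bucket}`;
}

const recentKey = bucketKey("channel-42", 150000); // "channel-42-1"
const oldKey = bucketKey("channel-42", 5);         // "channel-42-0"

// Reading the latest page queries only the newest bucket; paging
// back decrements the bucket number, one single-partition query
// per bucket.
```

Each bucket is still a single-partition read, so queries stay fast; the cost is that the application must track which bucket is current and walk backwards when it needs history.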
With a NoSQL database, you're optimizing to maintain that speed over time as your application grows. If you have a SQL database and your application grows to the point where it needs to be sharded, you're going to hit all these same problems anyway — you're effectively partitioning your SQL database and working through all the same issues. So for small applications, SQL is great, but Cosmos really shines when you get to those upper levels of scale. Really cool. All right, perfect. I believe we're at time, but if people want to get more information, where should they go? Twitter is where I'm likely to respond the fastest — I'm on Twitter way too much — but askcosmosdb@microsoft.com goes to the entire team, and we pay really close attention to it. If you have specific questions about your architecture, or how you're doing partitioning and strategies and things like that, feel free to send an email there and somebody will get back to you. It's a great direct line into the Cosmos team. All right, thank you so much, Steve. Okay, and we're going to get ready for the next speaker.