The conference schedule is divided into three tracks, because we also have the banquet hall, with the stairs down to the new area, and then room one, with most of our off-the-record sessions. Guys, are we ready? All right, I've been told that everything is ready, so I can introduce our first speaker. So here we are at the start of this big data conference, and this morning we are going to focus on where you put your data: databases and data stores. We have four talks in a row, all about databases and data stores, and our first speaker is Dharma Shukla, a distinguished engineer at Microsoft. He built a globally distributed database called Azure Cosmos DB, and he is here to tell the story of how he created a system that stores hundreds of petabytes of data and handles trillions of requests every day. Dharma?

Hi, good morning, everyone. We're going to talk about the lessons we learned in building a globally distributed database system called Cosmos DB. But first, how many of you have used Azure? How many of you are familiar with SQL databases? How many of you are familiar with NoSQL databases? So my team and I poured blood, sweat, and tears, all of our energy, time, and care, into seven long years of building a system that is globally distributed, and I'm going to talk about the capabilities the system exposes to developers and the lessons we learned in building it. Our journey started in 2010 with zero lines of code. We had a prototype of the system that we were comparing against existing NoSQL databases back in 2010. Back then, Microsoft needed a database that could serve the needs of applications like Xbox, Office, Windows, and Active Directory, a bunch of large-scale applications at Microsoft. All of these applications had a few core requirements. They all wanted turnkey global distribution.
They wanted the database to expose a single system image of a table and make the data available wherever the users are, worldwide, in an easy-to-use way: clicking a button or calling an API. They wanted guaranteed low latency across the world. They wanted high availability. They wanted programmable consistency. When you're dealing with a database that is globally distributed, in the happy path, in steady state, you have to navigate the speed of light to give low latency. And during failures, you have to confront the CAP theorem to ensure the high availability of the system. So the standard trade-off between strong consistency and eventual consistency was not good enough. Developers inside Microsoft wanted programmable consistency. Eventual consistency is great for getting high availability but doesn't let you write intuitive applications. So they wanted programmability. They wanted to elastically scale throughput and storage worldwide and pay only for the throughput that they need. So depending on workload spikes, they wanted to programmatically scale out and scale back the throughput across the world, at different times, across different regions. And on top of all of this, these applications wanted really, really low cost. So this was the laundry list, the requirement list, that we started with. And very soon after we started building it, once first-party applications within Microsoft started using it, we realized that none of these requirements were unique to Microsoft applications. In fact, all applications around the world that run on the cloud need a database system built to give these properties. So that's what we did. Internally, the project was codenamed Florence, because that was the town in which we had the first working prototype back in 2010. And this has been a long journey. In 2017, we released this product to the rest of the developers on Azure.
At some point between 2010 and 2017, we had a slice of the system, a portion of the system, exposed on Azure as a service called DocumentDB. And then in 2017, we exposed all of the infrastructure in the form of Cosmos DB. So that's the journey. Before I go further, let me explain what Cosmos DB is in a two-minute video clip, and there's no one who can do it better than Leslie Lamport. "Cosmos DB is a database system in which the data can be replicated and copies distributed throughout the world. This permits an application to be configured so every user is near a copy of the data, without the application programmer having to worry about how many copies there are or where they're located. My TLA+ specification language and its tools were used to improve the design of Cosmos DB, helping ensure its robustness. The Cosmos DB team credits TLA+ with helping them provide great SLAs to their customers. The work of other researchers at Microsoft Research and elsewhere was crucial to developing the five different consistency models that Cosmos DB allows developers to choose from. Many consistency models have been proposed in research papers over the years, but to my knowledge, Cosmos DB is the only commercial system that has tried to identify the useful ones and implement them precisely. Implementing any distributed system involves a trade-off between, on the one hand, the degree of consistency it provides to the users and, on the other hand, its availability and response time. I believe that today, commercially available databases offer only two choices: perfect consistency, sometimes called strong consistency, and eventual consistency. Cosmos DB provides three additional intermediate choices. I expect that developers will use this flexibility to obtain the trade-off between consistency and performance that they find best for their applications." So when we started the project, we wanted to build a database that is designed for the cloud.
And we asked ourselves: if we were to build SQL Server for the cloud, for today's age, what would be the design centers of that database? We wanted to build the next-generation database designed for the cloud. And we found three properties that are unique, that we must exploit as a cloud provider. So if you're Amazon, if you're Google, if you're Microsoft, if you're one of the public cloud provider companies and you're in the business of building a new database, then you'd better exploit these three properties. Number one was global distribution. Azure is ubiquitous. Azure envelops the planet. It has hundreds of data centers across 40 regions as of now, and it's constantly growing its regional footprint. So if you deploy an application on Azure, then Azure, by virtue of being omnipresent, makes your application available around the world instantly. You get to deploy your front ends all around the world, so that wherever your users are, the front ends of your application are running. But what about the database? The database is still siloed. It's still stranded in one region. Maybe for DR purposes you have your database configured with another replica or a backup in another region. That's not what a globally distributed database is about. A globally distributed database makes the copies of your data ubiquitous, available wherever the users are, such that you get a single system image of your table, with replicas all around the world, and you can scale throughput and storage around the world. And the application front ends get low-latency access to the data wherever the data may be. So that's exploiting the innate global distribution capability that only a cloud has. That's one. The second property that the cloud provides, which as a database provider you must exploit, as we learned, is that the cloud provides elasticity.
So you should be able to horizontally partition a single table into any number of partitions, make each partition highly available through replication, and then distribute those partitions all around the world, such that you can scale out throughput on a single table. You want 1 million transactions per second on a single table across different regions in the world? You can do that. And then you can scale back. If you don't have enough traffic, you can scale it back to 100 transactions per second across all regions or across some regions. That kind of elasticity of throughput and storage on a single table is something a database must be designed to exploit. We want to break out of the shackles of a single machine, or even a single cluster, or even a single data center. We want a horizontally partitioned system. So that's the second property. The third property, which is very unique to the cloud, is that the cloud, by virtue of its abstraction, the fact that there is an abstraction called the cloud, gives you the opportunity to pack hundreds of tenants' or customers' data and their workloads on a single machine, and then pack thousands of customers and their workloads across a cluster of machines, and do dynamic load balancing of different customers' data. You can do the same across clusters within a data center, and across data centers within a region. The cloud as an abstraction means you are no longer restricted to a single machine. A database instance doesn't belong to a machine. Now, multiple instances of users' databases, or slices of users' databases, from many, many users, can be packed on a single machine and across clusters, and that allows real cost efficiencies that just cannot be achieved when you build a database that is designed to exploit all of the resources of one machine.
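The horizontal partitioning described above can be caricatured in a few lines. This is only an illustrative sketch, not Cosmos DB internals: the function name and the plain-modulo placement are assumptions made for clarity.

```python
import hashlib

def partition_for(partition_key: str, num_partitions: int) -> int:
    # Stable hash of the partition key, so every replica in every region
    # computes the same placement for the same record.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# A table scaled out to 16 physical partitions; each partition would then
# be replicated for availability and spread across regions.
p = partition_for("device-42", 16)
assert 0 <= p < 16
```

Real systems typically use range or consistent hashing so that splitting a partition moves only a fraction of the keys; plain modulo is the simplest way to show the idea.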
So the observation is that all three of these properties cannot be an afterthought. If you build a database that is designed to use all of the resources available on a machine, you cannot, after the fact, add rate limiting and resource governance to your B-tree or your query processor or your replication protocol. You cannot, after the fact, add global distribution to a database that is designed for primary-backup or DR capabilities. You cannot add horizontal partitioning as a bolt-on on top of a database that was designed to scale up, not scale out. So, observation number one: these three properties cannot be an afterthought. Observation number two: these three properties compose with each other. You can have a single table that is globally distributed as well as horizontally partitioned. You can have a single table that is globally distributed, horizontally partitioned, and also multi-tenanted. That's one of the central themes that runs across the architecture of Cosmos DB. We learned the hard way how to exploit these three properties. So what does it mean to build a globally distributed database? What does it mean to have global distribution from the ground up? The first thing we've done is that whenever Azure opens a new data center or a new region, Cosmos DB is installed by default. So it's present. Present doesn't mean that you have replication enabled, but it's present. It's available across the world. That allows you to just go to a map of the world and select different regions, like so. You have the map of the world, and you simply select different regions. So this is a single table that is distributed across one, two, three, four, five regions. This is an app that shows them: the blue dot is the database replica, and the red dot represents the application instance, the front end that is working against that replica of the table.
And these are the five regions it is in. I'm going to add a sixth region, Brazil South. And once I add it and save it, it's going to add this new region, and the data will automatically start replicating from this point on. This is as easy as it gets. That's what I mean by turnkey global distribution. So you can associate any number of regions with your database instance, and you can even restrict which regions you replicate to, to constrain your data replication. If you're in sovereign clouds, like government clouds or China or Germany, and you want your data not to leave that country or that sovereign domain, you can configure a policy that restricts it. But in general, you can simply distribute your data by selecting regions on the map of the world. All the APIs that the database system exposes are multi-homing APIs. What that means is that whenever a disaster happens, you don't have to redeploy your app. Imagine your database is spread across those five regions plus the sixth one we just added. If one of the regions goes down, you don't have to redeploy your app to configure it to talk to one of the other data centers or regions. All of the endpoints are logical, so your application continues to operate with logical endpoints, logical URIs. And this is very different from a database that is designed for DR, disaster recovery, capabilities, where you also have to redeploy your app after the disaster occurs. So this guarantees that you have high availability of your application without requiring you to redeploy the app. You can dynamically configure the priorities of your regions. With the six regions we just associated with that table in the demo app, as you add or remove regions, you can set the priorities of the regions. You can say: if US West goes down, I want you to fail over to US East.
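The prioritized-failover behavior just described can be modeled with a tiny sketch. The class and method names here are hypothetical, invented for illustration; they are not the actual Cosmos DB API.

```python
class RegionPriorities:
    """Toy model of an ordered failover list: index 0 is the region
    currently serving traffic; on failure the next region takes over."""

    def __init__(self, regions):
        self.regions = list(regions)

    def add_region(self, name):
        if name not in self.regions:
            self.regions.append(name)  # new regions join at lowest priority

    def fail_over(self):
        # Simulate the serving region going down: promote the next region
        # in priority order and demote the failed one to the end.
        failed = self.regions.pop(0)
        self.regions.append(failed)
        return self.regions[0]

cfg = RegionPriorities(["US West", "US East", "Singapore"])
cfg.add_region("Brazil South")
assert cfg.fail_over() == "US East"
assert cfg.regions == ["US East", "Singapore", "Brazil South", "US West"]
```

Because the application only ever talks to a logical endpoint, a failover like this changes which physical region answers without the app redeploying.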
If that also goes down, I want you to fail over to Singapore, India, and so on. So you can set that. You can also test the end-to-end availability of your application, not just the database, by simulating a failover. You can say: I want to simulate a failover and make sure that not only the database remains highly available, but my entire application stack also continues to operate in a highly available manner. This is in fact a proof point of a globally distributed system. If you have a globally distributed database system, then it must, as we learned, allow the users to fail over at will. Every day in our live-site telemetry, we see lots of customers failing over and testing the end-to-end availability of their applications. And this is crucial, because you don't want to take a chance on the probability of a region going down and whether the database or your application will survive. So this is what we mean by turnkey global distribution. Finally, one of the crucial elements of this is giving SLAs not just for high availability. When you are a globally distributed database, as we learned, you have to give SLAs for high availability, low latency at the 99th percentile, consistency, as well as throughput. These four dimensions, high availability, consistency, low latency, and throughput, are the four legs of the globally distributed database stool. When it comes to latency, we give less than 10 milliseconds at P99 for all reads around the world. We give less than 15 milliseconds for writes, which are fully indexed writes; I'll talk about the indexing in a minute. And we can make these latency guarantees for two central reasons. Number one, we have designed the replication protocol such that, in order to meet the quorums, we always do local reads and writes. So you never have to consult replicas spread across regions in order to serve a read or a write.
That's a crucial design choice of the replication protocol, and the consistency models that we expose also take it into account. So that's a core piece of the low-latency guarantees. The second thing we've done is design a database engine that brings in the best of the NoSQL databases in terms of log-structuring techniques. We have borrowed really good ideas from 30-plus years of research in log-structured file systems and databases for the write path. So whenever there are updates or writes, we employ some of these techniques, and I'll talk about it in a second. For the read path, we've taken the B-tree. In the relational world, there's a great history and lineage of database engine design optimized for serving queries and reads. So we've taken the best data structures and techniques from the read path, married them with the best data structures and techniques from the write path, and built a database engine that is capable of ingesting sustained volumes of writes. As we ingest, we automatically index all of the content, without requiring secondary indexes or schemas from developers, and write locally to the disk in large sequential flushes, as well as replicate, with a quorum of replicas committing the write before it gets acknowledged. This is the write path. And because these log-structured writes are married to a storage engine with a Bw-tree as the data structure, we serve really low-latency queries. We have a VLDB paper, if you are interested in database engine design. It's a very unique database engine, which we have benchmarked against popular log-structured engines like RocksDB and LevelDB and such. So those are the two techniques we employed, and the paper describes them in great detail.
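The write-path and read-path marriage described above can be caricatured in a few lines: writes go to a sequential, append-only log, while a small index serves point reads. This is a toy sketch of the general log-structured idea, not the actual engine (which the VLDB paper describes); a plain dict stands in for the tree-structured index.

```python
class MiniEngine:
    """Toy storage engine: log-structured write path plus an index
    on the read path (a dict stands in for the B-tree-style structure)."""

    def __init__(self):
        self.log = []      # append-only sequence of (key, value) records
        self.index = {}    # key -> offset of its most recent write

    def write(self, key, value):
        self.log.append((key, value))        # sequential append, never in place
        self.index[key] = len(self.log) - 1  # latest version wins

    def read(self, key):
        offset = self.index.get(key)
        return None if offset is None else self.log[offset][1]

engine = MiniEngine()
engine.write("user:1", {"name": "ada"})
engine.write("user:1", {"name": "ada lovelace"})  # update appends a new version
assert engine.read("user:1") == {"name": "ada lovelace"}
```

The point of the shape: ingestion is always a cheap sequential append, and reads never scan the log because the index points straight at the newest version. A real engine would also compact the log to reclaim space from stale versions.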
Now, since you're familiar with NoSQL databases, you know that horizontal partitioning is key. So we do that: you can take a single table and we will automatically manage the partitions on your behalf. The table transparently grows as you ingest more data into it. If you want to scale the throughput, you can also scale throughput; I'll talk about that in a second. There's nothing fundamentally groundbreaking about scaling a table or a database by doing partition management; it's like any other horizontally partitioned NoSQL system. But the key piece here is that we make sure the data is also replicated, because high availability is key. So how partition management is done in the context of global replication is the unique piece here. You can also optionally time out all of the data that you've ingested. You can set a TTL on the table, or a TTL on individual records, and let them expire. We make sure that when you have a single table that is horizontally partitioned and distributed around the world, the notion of physical time is also preserved. We employ a variant of hybrid logical clocks to make sure that if a record has expired in one of the regions, and you issue a query after it has expired, the query is not going to return that document or record as a result. And if you issue the same query in any other part of the world, the expiration is still honored. That requires that physical time is married with the replication protocol when we do the distribution. So far, am I making sense? But the real meaty problems start emerging when you want to scale out throughput. Scaling storage is okay, but you see application activity around the world: application instances, those front ends, are accessing data around the world, and customers want to provision different throughput across different regions at different times.
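The hybrid logical clock mentioned above is worth a sketch, since it is what lets replicas agree on "has this record expired yet" despite clock skew. This is a minimal version of the general HLC idea, not the variant the system actually uses.

```python
import time

class HybridLogicalClock:
    """Minimal hybrid logical clock: a timestamp is (physical_ms, counter).
    Physical time dominates; the counter breaks ties so causally ordered
    events get strictly increasing timestamps even when clocks stall."""

    def __init__(self):
        self.last_physical = 0
        self.counter = 0

    def now(self):
        # Called on a local event (e.g. a write).
        physical = int(time.time() * 1000)
        if physical > self.last_physical:
            self.last_physical, self.counter = physical, 0
        else:
            self.counter += 1  # physical clock hasn't advanced; tick logically
        return (self.last_physical, self.counter)

    def update(self, remote):
        # Called when receiving a replicated update stamped by another region.
        physical = int(time.time() * 1000)
        m = max(physical, self.last_physical, remote[0])
        if m == self.last_physical and m == remote[0]:
            self.counter = max(self.counter, remote[1]) + 1
        elif m == self.last_physical:
            self.counter += 1
        elif m == remote[0]:
            self.counter = remote[1] + 1
        else:
            self.counter = 0
        self.last_physical = m
        return (self.last_physical, self.counter)

hlc = HybridLogicalClock()
t1 = hlc.now()
t2 = hlc.now()
assert t2 > t1  # local events are always strictly ordered
```

With timestamps like these attached to records, a TTL check compares against a bound on physical time, and every region reaches the same expired-or-not verdict for a given record version.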
So how do you scale a single table from 100 transactions per second to a million, to a billion transactions per second, differently across different parts of the world, at different times? This poses an enormous challenge. It stresses all three of those properties I talked about: global distribution and the design of the replication protocol, horizontal partitioning and partition management, and multi-tenancy and resource governance. In effect, you design the entire system as a gigantic distributed queue, with rate limiting and back pressure built in across the entire stack, such that we can guarantee the throughput that customers provision at any given point in time across different parts of the world. What this means concretely is that a customer can make an API call saying, you know, table.throughput = 100 transactions per second, and the next line of code in the application can say table.throughput = 1 million transactions per second. And within a finite bound of time, we have to honor that: we have to scale out the single table, distribute all the partitions and replicas around the world, and make sure the new throughput is in effect, so that when you issue requests, you get the throughput you provisioned. This requires a fully resource-governed stack. It requires a highly responsive partition management scheme, so that when you split a partition and distribute it around the world, every replica of every partition is calibrated with the amount of throughput it is supposed to deliver. And when every replica of a customer is placed alongside thousands of other customers on the same machine or across the cluster, you want to make sure that every one of these components lives within a fixed amount of system resources and delivers the throughput it is supposed to deliver.
The interesting thing here is that we allow customers to scale throughput at both a per-second and a per-minute granularity, and there's something interesting about this. Imagine this orange line is the provisioned throughput on a table; some customer has provisioned that much throughput. The blue lines are the actual consumption: users are pounding on this table, and there are spikes. There's a spike at the 29th second. So this is one minute, and within that minute, on the 29th second, you got a spike, and another spike, and another spike. Whenever there is a spike above the throughput the developer configured on the table, how do you smooth it out? How do you smooth it out predictably? What we do is also allow developers to specify throughput at a minute granularity. So you say you want so many transactions per second, and in addition you say you want some more transactions per minute. Whenever there is a spike in any given second, we smooth it out by borrowing from the throughput you have reserved for that minute-long window. In this case, on the 29th second there was a spike, and because there was enough budget for the minute, we smoothed it out. So customers purchase throughput at a second as well as a minute granularity, and by doing that they get a predictable way of dealing with spikes. The throughput that we offer at a minute granularity is cheaper than the throughput we offer at a second granularity. These kinds of capabilities can only be built if you have designed the system with resource governance from the ground up.
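The dual-granularity budgeting described above can be sketched as a tiny admission controller. The class and numbers are illustrative assumptions, not the actual resource governor, but the borrowing logic matches the behavior described: a spike above the per-second line draws down a shared per-minute budget.

```python
class DualBudgetLimiter:
    """Toy rate limiter: each second admits up to `per_second` requests;
    anything above that can borrow from a shared per-minute budget."""

    def __init__(self, per_second, per_minute_extra):
        self.per_second = per_second
        self.minute_budget = per_minute_extra  # refilled each minute

    def admit(self, requests_this_second):
        granted = min(requests_this_second, self.per_second)
        spike = requests_this_second - granted
        borrowed = min(spike, self.minute_budget)  # smooth the spike
        self.minute_budget -= borrowed
        return granted + borrowed  # anything beyond this gets rate-limited

lim = DualBudgetLimiter(per_second=100, per_minute_extra=500)
assert lim.admit(90) == 90     # under the per-second line: fully admitted
assert lim.admit(400) == 400   # spike absorbed by the minute budget
assert lim.admit(400) == 300   # minute budget now exhausted mid-spike
```

Note the economics the talk mentions fall out naturally: per-minute capacity is a shared reserve rather than a per-second guarantee, so it can be sold more cheaply.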
Here is another example. I'm not a golfer, but there's a famous golf organization that manages its tournaments. When the US Open was on, this organization had two regions. It started with one region, East US 2. And then, just around the US Open window, and this is the power of turnkey global distribution combined with scaling throughput, they also added West US, just like I showed in the demo earlier: they clicked on the map of the world and added another region to the table. And you see, during the US Open, they got high availability, because with another region they can now survive a regional disaster. They are using throughput across both regions, the gray bars as well as the blue line, during that window of time. And they're getting low-latency access across both regions by virtue of adding that region. Then, as soon as the US Open was done, this customer removed the other region, so they don't have to pay for it, and they continue with East US 2, the one region they have. So this kind of turnkey global distribution, the ability to scale, get high availability, and get low latency: we've tried to simplify it, and this is an example that shows that. One of the learnings in building a system like this is that there is a vast array of distributed systems and database literature with different proposals of consistency models, like Dr. Lamport said in the clip earlier. There are all sorts of consistency models, ranging from the gold standard that is linearizability all the way to the weakest of the weak consistency models. And with most NoSQL systems, you configure them in different ways, and there's no telling exactly what guarantees you are getting. There's no TLA+ specification. There's no guaranteed way of building an app. You might lose data, you might lose availability. You may not get the low latency that you expect to get.
When partition splits are happening, or partitions are being managed, maybe your performance degrades; there is no telling what happens along those dimensions. There's also no telling how to write an application that can be reasoned about in an intuitive manner. When you write linearizable programs, you can reason over the sequential nature of the program just by reading the program code. When it comes to eventually consistent databases, it's very hard to build an app. So what we did was ask ourselves: out of this plethora of consistency-model research that has been done for decades, are there consistency models that can be operationalized at scale, that offer intuitive, practical programming models with which developers can build applications they can reason about, and that enable practical, real-world scenarios? Is there a consistency model for IoT? Is there a consistency model for web and mobile? Is there a consistency model for writing apps that are pub/sub or messaging, where the data is shared by multiple producers and consumers? Because we had several large first-party applications at Microsoft using the system, we tried many, many consistency models, starting by specifying them; for most of them, there was no specification at all. So we specified them, operationalized them at scale, and then harvested the ones that survived. Today the commercial state of the art is a binary choice: you get the red pill or the blue pill. You either get strong consistency with a high-latency trade-off, or you get eventual consistency with low latency, but your programmability suffers. That's the state of commercial databases, and so we've exposed these intermediate consistency models. Strong gives you linearizable consistency. Bounded staleness says that your reads will lag behind writes by some delta D, but you still get a global order of all your operations.
Session says that you get monotonic reads, monotonic writes, and read-your-writes guarantees. Consistent prefix says you get prefix order: the version order of the propagation of all your updates. And eventual is eventual. What we found, once we exposed this internally as well as externally, is that a vast majority of customers gravitated towards these intermediate consistency models. Second, there are clear trade-offs. With strong and bounded staleness, for the price of the throughput, the transactions you purchase, you get half the number of transactions when using those stronger consistency models. As you relax the consistency model, you get double the throughput for the price you pay. So there's a clear trade-off. We also learned that high availability SLAs alone are not good enough. You have to give financially backed SLAs for all of these aspects of global distribution. An example of that is Jet.com, one of the popular retailers in the United States. During the Black Friday to Cyber Monday window, they jack up the throughput on their tables to close to 10 million requests per second. And we observed the availability during that window of time. As the traffic spiked, there was an availability dip, because one of our upgrades was rolled out during that time. But we still maintained the four-nines availability that we promise, and this table is spread across two regions, so we maintained the global availability as well. We continued to maintain the latency guarantees, and all of these metrics we transparently publish to all customers, so they can see the latency, availability, throughput, and consistency as their workloads are pounding on the database. We also learned that at scale, nobody wants to do ALTER TABLE, you know, create index, drop index, manage schemas, deal with secondary indexes. Most NoSQL databases still ask you to configure secondary indexes.
And when you're globally distributed, it's a nightmare to maintain indexes and schemas. So what we've done, and this is in the VLDB paper linked at the end of the presentation, which I would love for you to read, is take every record that we ingest, here's an example of a JSON record, and turn it into a tree, these kinds of logical layouts. Every label, every text label in the JSON representation, becomes a node in the tree. And the query processor and the index management always operate on these materialized trees. The engine doesn't care about the schema, so you can add or remove new records with completely different schemas. In fact, you can have JSON documents consisting exclusively of GUIDs; we wouldn't care. We're just indexing the paths of these trees from the root to the leaves, and writing them sequentially to SSD using log-structuring techniques. Then the query processor materializes these trees and operates on them. So these logical layouts free the developers from the burden of schema and index management altogether. When you're rolling out application upgrades against a database that is distributed around the world, you don't have to do ALTER TABLE, and you don't have to keep the schema versions of your application and the database in sync. Old clients can continue to operate using the old schema against replicas you haven't yet upgraded, and new clients work off your new application schema. The queries remain consistent, thanks to this unique design approach for the database engine. The other most important realization we had was that we don't want to create our own query language or our own APIs. We found that developers are already using MongoDB and Cassandra and Gremlin for graph, and so on.
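The record-to-tree idea described a moment ago can be illustrated with a short sketch that flattens a JSON-like record into root-to-leaf paths, the kind of terms one would write to an index. This is only an illustration of the concept, not the paper's actual encoding.

```python
def index_paths(record, prefix=()):
    """Flatten a JSON-like value into root-to-leaf paths (tuples).
    Object keys and array positions become interior nodes; the
    scalar value is the leaf. No schema is needed anywhere."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from index_paths(value, prefix + (key,))
    elif isinstance(record, list):
        for i, value in enumerate(record):
            yield from index_paths(value, prefix + (i,))
    else:
        yield prefix + (record,)

doc = {"name": "ada", "tags": ["db", "cloud"], "geo": {"city": "Florence"}}
paths = list(index_paths(doc))
assert ("geo", "city", "Florence") in paths
```

Because every document is reduced to the same path vocabulary, two documents with wildly different shapes land in one index structure, which is what makes alter-table-free schema evolution possible.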
Once you have a schema-agnostic database engine, you can efficiently translate all of these data models into a core data model that the engine understands, and then expose the APIs that developers are already familiar with. So you can take your existing MongoDB app, point it at Cosmos DB, and start programming against it without learning any new APIs. I'm running out of time. So: you can have different data models, like documents, key-value, column-family, graph, and so on, and different APIs. Finally, we test the hell out of the system. You can meet me after the talk if you're interested in how we test the system, how we do upgrades, how we run the service. In summary: globally distributed databases are hard to build. You should use one that is highly available, reliable, and gives great SLAs. Schema-agnostic database engine designs are really moving the frontier in terms of freeing developers from schema and index management altogether; there's the VLDB paper link I mentioned. There are intermediate consistency models that the industry as a whole should harvest and operationalize, because there is a market for it. Developers are willing to pay for the clear trade-offs if you expose those intermediate consistency models. And as a globally distributed database, you need SLAs for all four dimensions. This is the schema-agnostic indexing paper, and a bunch of other links. You can follow me on Twitter, you can follow Azure Cosmos DB on Twitter, and we are hiring. Thank you.

Thank you very much. We have a few moments for questions and answers. Yeah, let's get you a microphone. We do have mic runners around, so those who have questions, we will send a microphone to you. We have three minutes for questions and answers, so there may not be very many. I saw a hand there and a hand there, and that's going to be all. Yeah, can you hear me? Yes.
So the question is about paying for the throughput that you need, when the window of time for provisioning the throughput is an hour; you pay for that hour. If you provision, say, 2,500 transactions per second and your app starts getting 10,000 requests in a second, you're going to get rate limited. The way to get out of it is to write a piece of code in your front end that can increase or decrease the provisioned throughput based on the incoming arrival rate. So in the negative case you can avoid the "request rate too large" errors, and in the positive case you can save some money: you can lower the throughput if you're not getting incoming traffic. There's a standard, you know, 100-line sample we can share with you. Okay, next question. I have a question, yeah. Here? Yeah. You didn't mention what transaction model is supported by Cosmos DB, because obviously MongoDB has, you know, single-document transactions. Yeah, so it supports multi-item transactions with snapshot isolation. Yeah, so, about your intermediate consistency models between strong and eventual: have you seen any pattern of a certain application domain gravitating toward a particular model, like video streaming might be eventual and e-commerce might be strong? Can you shed some light on that? Yes, yes. Which domain goes toward which? Yes. Session consistency is extremely popular with IoT, extremely popular with web and mobile workloads. Because, you know, when you're writing an app like a Twitter client, when you as a user tweet, you want monotonic reads and monotonic writes: you want to see your own tweets, you want to see the causality of the tweet stream in your client app, within your session, but you don't care about, you know, a universal global order of all the tweets, right? So these kinds of scenarios, web or mobile or IoT, wherever there's a user or a device involved.
So if you have your app rooted in a device ID or a user ID, session consistency by far gives you the best trade-offs. Bounded staleness is great when you are globally distributed and you have multiple publishers and consumers operating on the same piece of data. So we see bounded staleness in people building pub/sub systems, queuing systems, messaging, chat, social apps. Eventual and strong are around the fringes; we don't see too many use cases. Consistent prefix is the next most popular after session. But we can sync up afterwards and walk through the different observations we have found so far from users. Thanks a lot. Thank you.
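On the earlier answer about adjusting provisioned throughput from the front end, a policy like the one described, the roughly 100-line sample that grows throughput when requests are being rate-limited and shrinks it when traffic is low, might look like this sketch. The function name, parameters, and thresholds are all illustrative assumptions, not the actual sample.

```python
def next_throughput(current, observed_rate, rate_limited,
                    headroom=1.2, floor=400, ceiling=1_000_000):
    """Decide the next provisioned throughput (requests/sec) given the
    observed arrival rate and whether requests were rate-limited."""
    if rate_limited:
        target = observed_rate * headroom      # grow above the arrival rate
    elif observed_rate < current / (2 * headroom):
        target = observed_rate * headroom      # shrink to stop overpaying
    else:
        return current                         # provisioning is about right
    return int(min(max(target, floor), ceiling))

# Spiking to 10,000 req/s against 2,500 provisioned: scale up past the spike.
assert next_throughput(2_500, 10_000, rate_limited=True) == 12_000
# Traffic fell to 1,000 req/s against 10,000 provisioned: scale down.
assert next_throughput(10_000, 1_000, rate_limited=False) == 1_200
# Traffic roughly matches provisioning: leave it alone.
assert next_throughput(2_500, 2_000, rate_limited=False) == 2_500
```

A loop in the front end would call this periodically with recent telemetry and apply the result via the provisioning API, which is exactly the negative-case and positive-case behavior described in the answer.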