Hi everybody. Welcome to the talk about Distributed SQL versus Polyglot Persistence, and how to think about these two specifically when it comes to picking the right database architecture for cloud native microservices. I'm Karthik Ranganathan, one of the co-founders and the CTO of Yugabyte, and we're building the open source distributed SQL database called Yugabyte DB. Okay, so let's dive right in. At first, as we all know, there was the big bang, right? The universe was formed, planets were formed, earth was formed, life was formed, all of that stuff. And then, getting more to the point, we came up with the RDBMS. This was the first real mainstream type of database that came out: the single-node traditional RDBMS. Well, we can call it traditional now, but back then it was the cutting-edge RDBMS. But anyway, it was a single-node RDBMS. And because that database evolved out of all the needs of the applications back then, well, all applications back then ended up using the RDBMS. It was pretty powerful as a data modeling language, and it was sufficient to build most of these applications. And it was very expressive. People have had many years to learn it, there's plenty of material out there, and there are any number of frameworks built around RDBMSs, and so on and so forth. So over the last 30 to 40 years of the existence of the RDBMS, it has made its way deeper and deeper into the knowledge base of developers. And so it's here to stay. It is a very powerful paradigm, even today. Well, back then, downtime was a part of life, right? If a node running your database failed, well, you'd page somebody. Somebody would come in and promote a replica, if you had one, to be the new serving tier. And then business resumed and everything continued as normal.
But until that failover happened, you had downtime. If you wanted to do an upgrade, somebody would put out a notice: midnight to 3 a.m., we're going to have scheduled downtime for maintenance, to do upgrades and fixes and index building and all of that stuff. Similarly, if there was a network partition and your database was running in a data center, well, you were toast. You couldn't access the application. And people got that, right? This was the early days of evolution and it was completely fine. Now, let's carry forward. What else was going on? Well, a single-node database was sufficient most of the time. 90, 95, maybe even 99% of the time you could just do business with a single-node database. If a single-node database was not sufficient for those few use cases, well, you could just spend the money and get a bigger database, and then an even bigger database. This model is called scaling up. There was no scale-out; there was only scale-up. And it's okay that it was really expensive, because it happened so rarely. It wasn't the norm, far from it. So then what happened? Well, humans evolved, right? Just as after the big bang, we continued our evolution. At first, if there was no food you were angry, and if you got food you were super happy. But today's very sophisticated and refined human complains about services when they're slow: what is wrong with the service? Why isn't it up already? Planned maintenance downtime? Who does that anymore? Come on, it's 2020. And if any of this happens too often, people will just switch to a different service. And not only that, they're going to tweet about how bad the previous service was and how much better the new one is, causing brand damage to the previous one. So it's just the state of being.
I mean, there's nothing wrong with it, but it's just where we are as a society. So in order to keep up with the demands of this evolution of humans, well, applications also began evolving: from the simpler, static applications, to three-tier applications, to internet-connected applications, to what are now more service-based applications, or microservices. And the idea behind these microservices is really to give you a few things. Firstly, high availability, so that if there's a failure or an upgrade, I don't have to put out that notice; somebody is always consuming my service, so it's highly available. The second one is scale. Well, as the old saying goes, too much of any good thing can be a bad thing, and that's surely true if you aren't prepared for it. If a service becomes wildly popular, a lot of people start consuming it. If a lot of people start consuming it, the scale and the stress on the database system increases. If the stress on the system increases and you're not ready to scale out or accommodate it, well, your service could be down precisely because of the success of a lot of people using it, at which point you're no longer successful, because people will say, hey, this service is down all the time, I'm going to switch to a different one. So success can actually lead to failure. Of course, failure leads to failure, too. So what are businesses really solving for here? It is extremely important for businesses now to future-proof themselves so they can scale if the need arises: not pay too much upfront for scale, but definitely be ready to scale when it's needed. And the third bit is having a good user experience.
Well, issues like what the data center topology looks like, how many failures can happen, or how difficult the service is to build because one database isn't scalable and another doesn't offer you the feature set: those don't matter to the user. The competitive pressure across services and brands today is such that it comes down to who can offer the best customer experience while retaining high availability and scalability and all of those good features. So that's where we are in the application evolution. Okay, so to put this succinctly: modern microservices require multiple data models in order to build rich applications, they require high availability, and they require the ability to scale on demand. These are some of the core requirements. Okay, how does this translate to databases? We talked about a bunch of these. We obviously understand high availability and scalability, in the sense that your database should stay available even if a node goes down, and you should be able to start small and just add nodes in order to scale out. Those two are relatively easy to understand, but what is the impact of data model diversity? Data model diversity manifests itself in read patterns, write patterns, and other general data-oriented features requested of the database. On the read pattern side, you could have very simple applications doing simple primary key lookups, to index-based lookups, all the way to really complex joins. On the write pattern side, you could have applications doing high-throughput batch writes, simple updates with indexes and constraints and a bunch of other relational integrity, all the way to complex concurrent transactions. So this is also a spectrum, and microservices need different things at different times.
Finally, the data-oriented feature set of a database could include things like change notification (tell me when a particular row changes), or the ability to expire older data sets. Well, the good part is there are different databases, or database paradigms, that deal with each of these, and the SQL RDBMS is universal in being able to deal with most of them. But nevertheless, these are some of the impacts on the database. Okay, so when you put all of that together: with high-throughput writes, over time your data is going to build up. If your data builds up, you need to be able to go beyond a single-node RDBMS; you need to go to multiple nodes. How do you solve this issue? Well, you have two primary options. Your first option is to build a sharded SQL solution. Your second option is to have one SQL database and a bunch of NoSQL databases. Let's leave option two aside for now and start with option one. What's wrong with sharding SQL? Well, nothing, except it's extremely complex to get right. Your first problem is that you have to figure out how your application manually shards the data across all of these databases. A sharded SQL solution as-is is not resilient to failures and is not horizontally scalable unless you build those features yourself. So you need each of the databases you've sharded across to have replicas, and you need to be able to promote them. You need to figure out, if you get more data, how you will redistribute it across your fleet of sharded databases; resharding a database is a very complex endeavor. And you would lose very critical RDBMS features such as joins, indexes, and constraints, because they're only available within each SQL shard; they're not available across the global database.
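To make that application-level sharding concrete, here's a rough sketch of the routing logic every request would have to go through. The shard count, key choice, and host names are all hypothetical, purely for illustration:

```python
# Illustrative sketch of the application-level routing a manually sharded
# SQL tier forces on you. The shard count and host naming scheme here are
# invented, not from any real deployment.
import hashlib

N_SHARDS = 4  # fixed up front; changing it later means resharding your data


def shard_for(customer_id: str) -> int:
    """Deterministically map a row's key to one of the shard databases."""
    digest = hashlib.sha1(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_SHARDS


def dsn_for(customer_id: str) -> str:
    """The application, not the database, decides where each query goes."""
    return f"postgresql://shard{shard_for(customer_id)}.internal:5432/orders"


# Anything spanning shards -- joins, global secondary indexes, global
# unique constraints -- now has to be reimplemented in application code.
```

The point is that this routing is the easy part; replica promotion, resharding, and cross-shard queries are where the real engineering cost lands.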
And finally, you'd need a cache in order to limit the total number of IOPS to the database, because as increasing scale puts more and more pressure on these individual databases, you'd either have to keep resharding and scaling them out, which is a daunting manual process, or put a cache in front to protect the total number of IOPS. Large tech companies have actually adopted this type of approach. Facebook, for example, one of the places I worked, has very successfully built a sharded MySQL tier. This gives you an extreme amount of control, but it is also a very big investment on the part of the company, a very big engineering team investment. And so most enterprises are loath to do this, because it is really, really complex. So what is a paradigm that most people can follow? Well, if you take out option one and there were only two options, it turns out option two is it. The philosophy of polyglot persistence is to pick a different database for each microservice: depending on the microservice, you pick the database. Given that this was the only viable option at the time this issue hit us, around 2008, the solution was to go polyglot, because manual sharding is way too hard. So: split up into microservices or access patterns and use the appropriate database for each access pattern. Polyglot persistence as a solution was introduced around 2008. And again, remember, let's go back to 2008. You pick a separate database for each data model. RDBMSs were not scalable, so you'd move the scalable portion of the data out to other databases, and you have a number of these other data models: you have the relational data model, which requires transactions and relational integrity, and you have wide-column, document, key-value, and a number of others.
So you start moving data models that will not fit in your relational model, or be satisfied by the relational database, out to specialized databases that give up a bunch of the stuff relational databases do, but give you other features in return. So what does this mean? It means, firstly, that the SQL RDBMS limitations of lack of resilience and lack of horizontal scalability will be solved by these other databases, and you minimize your exposure to RDBMSs. The polyglot persistence strategy for data requiring high availability and scale is to start with the NoSQL databases that were starting to get built around 2008; for a rapidly expanding dataset you could again revisit sharded SQL, but that's probably way too complicated. So NoSQL was the solution to go with in order to deal with fast-growing data sets that required scale and high availability. Now, obviously these NoSQL databases wouldn't do everything you wanted out of your RDBMS, but that was the whole point of polyglot: you're picking the database for the data model. Let's take a real example to understand this. Let's take an e-commerce example. If you take a web application, you can decompose it into a number of access patterns. You have user sessions, where users log in and you have to maintain their session. You have a product catalog, where users browse the products. You have financial data and reporting: things like your orders, your billing, your invoices, and all of that. You have your shopping cart, where you actually add items, view what's in your cart, and hit the checkout button. You have recommendations for products that are similar, or "people who bought this also bought", you know, 83% of the people who viewed this product bought some other product; there are a number of these types of recommendations. There's analytics on top, like, hey, what are my most popular products?
Who's buying what? How much should I stock in my inventory? And so on. And finally there are user activity logs: hey, my website may not be functioning well because a lot of people who come to this page actually try to check out but fail. You need user activity logs in order to figure this out. So how do you model such an application, specifically when you need scale and high availability and a bunch of these other modern microservices features? Let's break it down one by one. Your financial data and reporting requires ACID transactions, relational integrity, and complex queries like joins. So that's probably not going anywhere; it's staying in an RDBMS. If one of those RDBMSs fails, you still have to deal with it manually, but hey, at least on the scale side you should be able to scale up and manage it for the longest time, so you should be okay with respect to scale. The second piece is the wide-column, fast-growing data sets. These go in traditional NoSQL databases like Apache Cassandra. This is your fast-growing user analytics and user activity logs. You typically require high write throughput, high availability, and the ability to query and handle large data sets; those are really the core requirements of this use case. Then you have your product catalog, which is well suited for a document data model, and that's because your products themselves have a ton of attributes. There's your product ID, there could be multiple IDs for a product, there could be a weight, dimensions, its product ID code in different geographies, its price; there could be a ton of things. So this requires a document data model that can evolve its schema very quickly and doesn't force you into a fixed schema. And it also requires high availability, because one of the first things users do is browse what products you have before they buy.
So they will browse, then add to cart, then check out. Those represent actions for which availability is paramount. Then if you look at the shopping cart, that requires high availability, scalability, and data consistency. So it's not a full-blown document or NoSQL use case, but it requires consistency, scalability, and HA. And finally you have user sessions, where you have high read throughput: you need to service a lot of reads, because many users will be concurrently connecting to your product. You can simply use key-value as the data model, because it's not very complex in terms of the data model, but you definitely need to be able to scale this on a dime as more and more users come in. Okay, so if this is how we lay it out, this can become really complex as microservices evolve, because each microservice can continue to add features, and its set of requirements can change over time. This is what causes the first problem in this kind of deployment: you zero in on "this microservice requires this access pattern, hence I'm picking this database for it". The problem is, what if the access pattern changes? You're not future-proof. So that's the first real problem. If you were to summarize what you lose: over time you will lose agility with this approach, and that's because NoSQL fundamentally gives up so many features that RDBMSs have. Firstly, developers need to learn how to use these new features and models, and learn a new set of tools and frameworks and ecosystems; but most of all, if your app's needs evolve, they need to find out what these databases cannot do, and perhaps migrate to a different data model or a different database altogether. The second problem is for build engineers, who need to maintain build artifacts, CI/CD pipelines, and so on.
Now you not only need to build one for every microservice, you also need to build one for every database behind every microservice, and if it's a different database each time, each working differently, that becomes really complex. And distributed databases are additionally complex, because it's not just a single node you're deploying; you're deploying multiple nodes and configuring them correctly and so on. For test automation engineers: test automation is extremely important for CI/CD, to increase agility and deliver with your tests completely automated. You now need to learn and account for the nuances of every type of database. Hey, what happens in this database if I upgrade the software, but I don't do that in the other service? You just have to worry about a much more complicated feature matrix and intercompatibility testing set. And finally, the operations teams don't have it easy either. You now have to understand your day-2 operations on multiple databases: things like scaling the database (what happens if the pressure on the database increases, what should I do?), how to do a backup and restore, how to do software upgrades, what to do for a security patch, how to encrypt data, and how to deal with a variety of failures like network partitions. Every database behaves fundamentally differently, and this causes all sorts of complexity. Okay, so many people would say the solution is: let's just go to a public cloud and use the databases they have, because that simplifies things, right? Let's take AWS as an example. Yes, to some extent AWS can simplify this, because they've taken care of some of the issues. All of these many different types of databases are pre-hosted for you; you don't need to figure out how to do the deployment. But there are issues with this approach.
Firstly, the most basic problem, that if you pick a database and your microservice evolves you're not future-proof, still exists. The second problem, of getting developers that understand each database type, still exists, because if you're picking a NoSQL database, you need somebody to either learn that NoSQL database's data modeling or somebody who already knows it. So you need specialized developers. It is also very expensive, because whether you're doing test, dev, or production, managed cloud services will charge you the same amount. So there's no notion of tier-1, tier-2, and tier-3 applications where you can differentiate on cost and actually save money. And finally, it is a form of cloud lock-in, because it is impossible to move off of one of these cloud databases, even if a different cloud like Google Cloud or Azure is offering you a similar service at a much cheaper cost. And vice versa: if you're on Google Cloud or Azure and Amazon is offering you a better cost, you wouldn't be able to move very easily. It represents a complete rearchitecture and rewrite of the application, which is very, very disruptive. So how do we make modern microservices work? How do we deal with the multiple data model requirement, and the HA and scale? Well, what if we rethought the whole problem? What if we made SQL scalable and resilient? It's really hard to do this in a database, but it can be done. It has been shown by the likes of Google Spanner that it is possible to achieve. It is difficult, but definitely possible. The second point: we talked about how RDBMSs are critical to some of the microservices. Well, guess what? Over time, those microservices are also growing in their data size. They need scalability. They need high availability.
So the bottom line is, if you're solving the database problem for a couple of microservices that are critical, you might as well think about solving it for all microservices, and see how we can bring this problem back to its original state: hey, I'm using an RDBMS, but RDBMSs are not highly available and scalable; can I fix that? That's what distributed SQL is attempting to do. So what is distributed SQL? Distributed SQL is a new database paradigm which completely sticks to what SQL and transactional databases do. It offers you everything a SQL database does, but it's highly resilient, because it runs across a cluster of nodes, and so if a node dies, other nodes can instantly take its place. It's horizontally scalable, which means if you want scale, you simply add more nodes, and the aggregate cluster of nodes behaves like a single logical database. And you can geographically distribute the data, because the replication of data and the resilience is internal to the system: you can ask the database to place copies of data in different geographies, be it multi-zone, multi-region, what have you. And finally, it's important to pick one that's open source, so that it works on any cloud. You can then have different applications or microservices run on different clouds, or even do a hybrid deployment where you have some on-premises data centers, which are private cloud, augmented with public cloud. So what does the premise of distributed SQL bring you? Well, it keeps all the capabilities and functionality of a relational database, but combines them with what you have in a cloud-native distributed architecture. Now, the NoSQL databases threw out the relational programming model, because, hey, if you were sharding your database you would be giving it up anyway. You'd be giving up transactions. You'd be giving up joins.
So the NoSQL databases said, hey, why don't we give that up, and simply take care of the data distribution, replication, and failover problems. That's what they did. But distributed SQL actually retains all of the SQL functionality, combined with the elements of NoSQL, in the sense that it can distribute, replicate, and automatically fail over data and queries. So let's go back to our e-commerce example and see how distributed SQL databases can change the way your e-commerce application gets deployed. In this example, since originally, before the advent of microservices, most applications simply used RDBMSs, we already know that most of these applications or microservices can actually leverage an RDBMS. The issue is the auxiliary requirements that come up, such as scalability, availability, resilience, and so on. We'll take the example of Yugabyte DB, which is a transactional distributed SQL database designed for resilience and scale. It's fully open source, so you're completely free to use it, and fully PostgreSQL compatible. So it retains all of the operational characteristics of PostgreSQL, and has all the enterprise-grade features, security and encryption and backups and so on, completely in the open source, very similar to PostgreSQL, and it's built to run on any cloud. So if you picked a database like this, or any other distributed SQL database, let's reimagine how your application would look. The first thing is, obviously, the microservices that required RDBMSs can straight up leverage distributed SQL, because distributed SQL is a fully relational OLTP database and can support the transactions and joins and all of the complex requirements that financial services and reporting actually need, like checkouts and invoices and your reporting. It can just take care of that.
Now, if you look at other things like analytics and user activity logs: well, distributed SQL databases are horizontally scalable, highly available, and can deal with a lot of data, because they're inherently built to be scalable. So it is possible to hold some of these larger datasets, which have relatively simple access patterns, in a distributed SQL database. You can put all of your user analytics and user activity logs and so on in this database itself. Now, it need not be the same physical instance of the database; it can be a separate logical or physical instance, but it still makes everything so much easier, because from the application developer's point of view, it's the same database, an RDBMS. From the build engineer's or test automation engineer's point of view, it's the same database you're testing. So the knowledge gained from one application is completely reusable elsewhere. And finally, operationally, it's a single unified way of managing the databases across the entire fleet, which makes it so much simpler to train people, or find people, to take care of the database. Now, let's talk about the shopping cart. We said the shopping cart doesn't really require a complex access pattern, but it requires data consistency, high availability, and the ability to scale. Well, guess what? A distributed SQL database, by virtue of its consistency properties, availability, and scalability, can directly handle that without the need for any other system. Now let's go to the product catalog, which requires a flexible schema model, something we said we'd use MongoDB for, because your schema can evolve very rapidly. You didn't want to set a fixed schema and get locked in, with every schema evolution resulting in an alter-table schema change operation.
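As a rough sketch of that shopping cart flow, here's an add-to-cart operation done as a single transaction. SQLite stands in for the database purely so the snippet is self-contained; the table and column names are made up, and against a distributed SQL database the same SQL would go through an ordinary PostgreSQL driver:

```python
import sqlite3

# Hypothetical cart schema; in production this table would live in the
# distributed SQL database rather than an in-memory SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart_items (user_id TEXT, product_id TEXT, qty INTEGER)"
)


def add_to_cart(user_id: str, product_id: str, qty: int) -> None:
    # One transaction: either the whole change lands or none of it does.
    with conn:  # commits on success, rolls back on exception
        row = conn.execute(
            "SELECT qty FROM cart_items WHERE user_id = ? AND product_id = ?",
            (user_id, product_id),
        ).fetchone()
        if row:
            conn.execute(
                "UPDATE cart_items SET qty = qty + ? "
                "WHERE user_id = ? AND product_id = ?",
                (qty, user_id, product_id),
            )
        else:
            conn.execute(
                "INSERT INTO cart_items VALUES (?, ?, ?)",
                (user_id, product_id, qty),
            )


add_to_cart("u1", "sku-9", 1)
add_to_cart("u1", "sku-9", 2)  # u1 now has qty 3 of sku-9
```

The point is simply that the cart's correctness rides on the database's transactional guarantees rather than on application-side reconciliation.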
Well, most modern RDBMSs have evolved to support flexible-schema data types, like JSONB in PostgreSQL. So with Yugabyte DB, you can simply use the JSONB column type to model, store, and manipulate flexible-schema documents. There's a ton of built-in functionality that allows you to do so. In fact, you can also create indexes on dynamic attributes in order to speed up lookups by those dynamic attribute values. So that really makes it possible to model your product catalog on top of a distributed SQL database, which gives you consistency, the flexibility of the schema, scale, resilience, and the ability to build a highly available service. Now let's look at user sessions. We said that user sessions are typically put in a Redis database. What is different about this? Why is it that a distributed SQL database can actually do this? Well, think about why Redis is used as the database here in the first place. Firstly, it is possible to live with the loss of a user session, because it's just a slightly degraded experience for the user. Say a Redis machine dies while you were logged in and browsing around on a site. What happens? You get logged out and have to log back in. It's not ideal, but it's something one can live with. But if you can support persisting this data, why not? You would want to persist it. Now, what was the pushback against persisting this in an RDBMS, or any kind of database, in the first place? Two things. Firstly, you want to expire user sessions over time. You want to say, hey, if this user has not interacted for some amount of time, I'm going to kick out the session and invalidate it. The second thing is that there could be a huge burst of users coming to the site.
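To make the flexible-schema catalog idea concrete, here's a minimal sketch of a product table with a document column and an index on one dynamic attribute. The schema and attribute names are invented; SQLite's JSON functions stand in so the snippet runs standalone, with the PostgreSQL-compatible shape you'd use in YSQL noted in the comments:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# In YSQL/PostgreSQL this column would be declared as `attrs jsonb`.
conn.execute("CREATE TABLE products (id TEXT PRIMARY KEY, attrs TEXT)")
# Expression index on a dynamic attribute; the PostgreSQL-compatible
# equivalent is roughly: CREATE INDEX ON products ((attrs->>'brand'));
conn.execute(
    "CREATE INDEX by_brand ON products (json_extract(attrs, '$.brand'))"
)


def upsert_product(pid: str, attrs: dict) -> None:
    # No fixed schema: each product carries whatever attributes it has.
    conn.execute(
        "INSERT OR REPLACE INTO products VALUES (?, ?)",
        (pid, json.dumps(attrs)),
    )


upsert_product("p1", {"brand": "Acme", "weight_kg": 1.2})
upsert_product("p2", {"brand": "Zen", "dims": [10, 20, 5], "color": "red"})

# Indexed lookup by a dynamic attribute (attrs->>'brand' = 'Acme' in YSQL).
rows = conn.execute(
    "SELECT id FROM products WHERE json_extract(attrs, '$.brand') = ?",
    ("Acme",),
).fetchall()
```

Adding a new product attribute here needs no alter-table: new documents just carry the extra key.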
For those user sessions, you'd need to service users with low latency, because if it takes a long time to log in, they probably won't come back to that service. And you also need to be able to scale very quickly: suppose the number of concurrent users jumps, you should be able to provision and add more nodes in order to scale. Well, it turns out a traditional RDBMS cannot do this. A NoSQL database might be able to, but the compromised consistency leads you to other kinds of problems. A Redis cache is actually ideal for serving a bunch of these needs. But suppose your volume is such that, if you could scale the underlying database by simply adding nodes, you'd be able to support it; then you would do just that. So take this with a grain of salt, but you might be able to get rid of your Redis cache in some cases, because the underlying DB itself can scale, and you don't need to protect the total number of IOPS or queries that the database can serve. Now, if you have an extremely high number of concurrent users and you need very, very low-latency serving, then a Redis cache is probably better. But if you have a moderate number of users, not extremely large numbers, then you might be able to unify this into the distributed SQL database as well. Finally, let's talk about graph. This is something you would currently use a specialized database for, say Neo4j. Well, in the specific case of Yugabyte DB, it is built with multi-API access: it allows you to access it through multiple known APIs. The first one, of course, is the YSQL API, which is PostgreSQL compatible, but there's also the YCQL API, which is Apache Cassandra compatible.
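Coming back to user sessions for a second: the expiry requirement can be sketched in plain SQL as a timestamp column plus a periodic cleanup. Again SQLite stands in so the snippet is self-contained, and all table, column, and TTL values are made up; in YCQL you could alternatively lean on Cassandra-style row TTLs instead of a cleanup job:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (token TEXT PRIMARY KEY, user_id TEXT, "
    "expires_at REAL)"
)

SESSION_TTL_SECS = 1800  # hypothetical 30-minute idle timeout


def put_session(token: str, user_id: str, now: float) -> None:
    # Each write (or refresh) pushes the expiry forward by the TTL.
    conn.execute(
        "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?)",
        (token, user_id, now + SESSION_TTL_SECS),
    )


def get_session(token: str, now: float):
    # Reads ignore expired rows even before the cleanup job runs.
    return conn.execute(
        "SELECT user_id FROM sessions WHERE token = ? AND expires_at > ?",
        (token, now),
    ).fetchone()


def expire_sessions(now: float) -> None:
    # Periodic cleanup; a per-row TTL in YCQL could do this automatically.
    conn.execute("DELETE FROM sessions WHERE expires_at <= ?", (now,))


t0 = time.time()
put_session("tok-1", "alice", t0)
```

Since the database itself scales by adding nodes, the same table can absorb a burst of new sessions without a separate caching tier, subject to the latency caveats above.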
On the graph side, there is a graph database called JanusGraph that can run on top of the YCQL API of Yugabyte DB and give you graph functionality. So it is actually possible, and in fact we've had a few users do this with Yugabyte DB, to move off of a database like Neo4j and use JanusGraph plus Yugabyte DB's YCQL API to simplify things. So the underlying point here is: it is possible to unify a variety of different microservice access patterns onto just a single type of database with distributed SQL, simply because SQL is a phenomenally expressive and powerful language and has been around for a really, really long time. What this means is that the original premise, when people started sacrificing RDBMS SQL functionality in order to build NoSQL databases, was the need of the hour then: there was a lot of data that didn't really require very sophisticated access patterns. But now that the core sophisticated use cases themselves need to scale, we might as well leverage that scalability in the other places too, and use it to simplify the overall deployment. That's the moral of the story here. So let's just take a look at what the simplest option is. We said sharded SQL remains very complex, because you'd have to do it yourself and you lose a lot of properties. Then there's the combination of NoSQL and SQL databases; this is where a lot of deployments are today. However, this is really complex, specifically when you go into multiple zones, multiple regions, scale on all sides, and a lot of microservices coming up; this becomes really hard to scale very quickly. And finally you have distributed SQL, where it is possible to use a single database, as opposed to a polyglot set of databases and APIs, to richly model whatever you need in your application. And that makes it a very compelling and powerful thing to use. So what I'd say is: if you're building your next microservice, folks, take a look at distributed SQL. It's come a long way.
There's a lot of rich functionality, a lot of capability, and a lot of users leveraging it, so it's not exactly unproven at this point. Google Spanner has been around for many years, almost 8 to 10 years now. So I'd say take a look at this space. It's really changing and simplifying the way modern microservices can be built, in order to keep pace with the evolution of human beings. Thank you, that's all I had. If you're interested in distributed SQL or Yugabyte DB or anything else, please join our community Slack at yugabyte.com/slack. Thank you.