So I'm here now between you and your lunch, so maybe let's do another session of stretches — I'm kidding. I'm not here for that. I'm here today to talk about this: as practitioners, we're often faced with the choice of what database to pick when building that next system of yours, and what I want to share here is essentially learnings from my experience since 2010 building infrastructure services at Flipkart, and at my current workplace, Curefit, in the healthcare space, building data stores for healthcare.

Now, there's a problem with my title. It says SQL and NoSQL — that's a very commonly accepted way of referring to relational and non-relational systems, but SQL is simply a query language that you use; it doesn't define a class of systems, and we'll get to that. In general, the intent is to help make our choices between relational and non-relational — NewSQL, NoSQL, whatever you want to call it — or whether you need to choose polyglot data stores to get your job done.

So let's start with the landscape. This slide is from one of the research reports, and the purpose of putting it up is to show that your choices range across on-premise and hosted systems, roughly classified into relational, non-relational, and grid/caching zones, with some on-premise and some hosted. That really leaves us spoilt for choice. But the harder problem is, with every single vendor out there claiming that their solution is the best for you, how do you make that decision of what database to choose?
Okay, so let's start out with some very basic ways of looking at database choices. My talk is going to be oriented more towards on-premise choices, though some of it applies to hosted solutions too. Let's start off with some of the very obvious considerations.

First and foremost is the data manipulation language — the SQL-or-NoSQL choice you're trying to make. A query language like SQL might seem like something of the last few decades, but trust me, it is very relevant. Why? Because even Spanner — there's a recent paper describing how Spanner is becoming a more SQL-like system. SQL as a query language of choice is not bad. So you want to ask: is my data more amenable to SQL-like queries? That's one choice you would make. Or is it more suitable for a key-value style of operation — get/put? Or, differently, are you looking for some kind of graph query capability? In my talk today I'll give you examples of some of these cases from real life, where you'll see that it's quite relevant and quite natural to look for the query language support that you would like to have.

Another dimension is bulk processing, both in loading and in data export — and this can be for any number of reasons: you're trying to seed your database, or you're trying to export the data and take it somewhere. So it's good to get a first-hand view of whether that database choice of yours is going to give you bulk data loads and exports.

The next one is ACID properties. Let's look first at what I'll call ACID 1.0. There are a bunch of properties — what do they mean for a database choice?
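To make the query-style point above concrete, here's a small Python sketch — the `orders` table and its data are invented for illustration — of the same question answered SQL-style versus key-value-style. With SQL the aggregate is declarative; with a plain key-value layout, anything beyond get/put by key is the application's problem.

```python
import sqlite3

# SQL style: declarative queries over a schema (hypothetical "orders" table).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "asha", 120.0), (2, "ravi", 80.0), (3, "asha", 40.0)])
total = db.execute("SELECT SUM(amount) FROM orders WHERE user = ?",
                   ("asha",)).fetchone()[0]

# Key-value style: the store gives you get/put by key;
# aggregates must be computed in application code.
kv = {1: {"user": "asha", "amount": 120.0},
      2: {"user": "ravi", "amount": 80.0},
      3: {"user": "asha", "amount": 40.0}}
kv_total = sum(v["amount"] for v in kv.values() if v["user"] == "asha")

print(total, kv_total)  # 160.0 160.0
```

If your workload is full of queries like the first one, that's a strong signal towards SQL support; if it's almost entirely keyed lookups, a key-value store may be the more natural fit.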
If you look at atomicity, it roughly translates into transaction support. Consistency talks about ordering — if there is replication, is it consistent across replicas, eventually consistent or strongly consistent, and what about data staleness? And durability talks about how durable the data is and whether it can survive crashes. So when someone says a system is ACID 1.0 compliant, you know the properties you can expect from your data store.

More recently there's another proposal — in fact, an implementation — of another set of properties called ACID 2.0. How many of you have heard of ACID 2.0? Right. ACID 2.0 is typically a matter of choice when you're trying to achieve scale, and more precisely when you're trying to achieve higher throughput. In an ACID 2.0 system, operations on data across replicas are Associative and Commutative, data gets relayed in a manner that's Idempotent, and the data nodes themselves are Distributed. What does this mean?
It typically means transaction support is a little limited. What I mean is: in the ACID 1.0 world you could say that inserts across multiple tables in a single instance are transactional, whereas in the 2.0 world it is probably limited to inserts on specific records being transactional. Not that there aren't databases that offer more, but typically transaction support is limited.

There is also relaxed ordering. One of the reasons Spanner has all that fancy infrastructure with GPS and atomic clocks is just to guarantee ordering, because they claim to be a globally consistent datastore and you want all replicas to converge, and converge in time. With relaxed ordering you get higher throughput, and that also leads to another kind of data type you can adopt called CRDTs — conflict-free replicated data types. There is ready support for this in an open-source system like Riak, where conflicts get thrown up to the application, and the application can then decide how it wants to resolve them.

Then, at-least-once delivery guarantees. I very recently read that Kafka supports exactly-once delivery semantics — that is very hard to achieve; just guaranteeing at-least-once delivery semantics is itself fairly complex. What that means is, with at-least-once delivery semantics, what you mostly get is eventually consistent data stores, and this can also lead to higher availability. So you would start considering the 2.0 properties when you're trying to achieve large scale; otherwise, if the 1.0 properties work well for you, that's fine.

Moving on: what do you expect in wire protocols?
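Before wire protocols, the CRDT idea mentioned a moment ago is worth a tiny sketch. This is a minimal grow-only counter (G-Counter) of my own, not Riak's implementation: each replica increments only its own slot, and merge is an element-wise max — associative, commutative, and idempotent, so replicas converge regardless of delivery order or duplicate (at-least-once) delivery.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge = element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        # A replica only ever increments its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Element-wise max: associative, commutative, idempotent.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
b.merge(a)  # duplicate delivery is harmless: merge is idempotent
print(a.value(), b.value())  # 5 5
```

This is exactly why CRDTs pair so well with relaxed ordering and at-least-once delivery: no coordination is needed for the replicas to agree.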
Most wire protocols are custom protocols on top of TCP/IP, some work over gRPC, and a very few even work over, say, HTTP. There's not much of a material difference for the choices you're making, but it's good to be aware of what it is. Now, support for popular database protocols — why is this relevant? If you look at CockroachDB, they give you support for being used via existing Postgres drivers, primarily so that if you're trying to shift from Postgres to CockroachDB, you can do it pretty much seamlessly without too many changes at your application layer. So if I were considering CockroachDB, I'd note that it speaks the Postgres protocol and I've been using Postgres — that matters as an end user. Or, for that matter, Couchbase has been supporting memcached as a protocol for years, so you could run Couchbase at the backend while talking the memcached protocol.

Then there are standard interfaces. In the Java world, JDBC is a standard, and you have JDBC drivers for all kinds of systems — Apache Hive has a JDBC driver; for HBase you have Phoenix, which gives you a JDBC driver. Or look at Vitess — how many of you have heard of Vitess? Any of the Google/YouTube folks? Okay, so Vitess is a MySQL scaling solution built by YouTube at Google, and we came across it at Flipkart when we tried to scale MySQL. It gives you a SQL interface on top of underlying MySQL-based storage, but what it lacked was a JDBC driver, and while we were at Flipkart our team built and contributed a JDBC driver for Vitess. Vitess basically lets you scale MySQL on-premise across a number of instances without having to build your own custom sharding solution, so it's another viable option.
It's a case in point where just building that JDBC support opened Vitess up for wider adoption within our company, where most of our systems were using MySQL: it now had a compatible JDBC driver, so you could just shift onto it. So it's good to look for that as well.

Now, on schema. I sometimes hear that schema is evil — "my system is not able to scale because it is relational, it is very schema-defined, and therefore it's a problem." But the truth is, if you really look at the number of systems that genuinely need schemaless backends, you can count them — they're very few. Either you're building some platform service, or you're trying to build an entity-attribute-value (EAV) store. For example, when we started off at Flipkart, our catalog had to store, say, a diaper and an iPhone 6 in the same catalog system. We started off with an EAV store, but that just couldn't scale, and later we built something else. For these kinds of systems, arguably, you have a need for a schemaless store — there are a lot of frequent changes, so you would go and use one. There might be fans of document databases like MongoDB here, and they would argue that it works very well for this kind of requirement.

Now, what do you lose if you don't have schema? Very interestingly, I think as of 2017 Mongo has announced support for schemas in their database. Why? Because in the absence of schema, understanding the data becomes incredibly hard, and what's even worse is wrongly typed data going into any of these fields — then, when you're trying to make sense of the data at query time, it becomes next to impossible. You're actually doing all the hard work when you're querying the data.
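The "wrongly typed data" problem just described can be sketched in a few lines of Python — the documents and field names here are invented. A schemaless store happily accepts all three shapes, and the query side then has to re-implement the type checks a schema would have enforced at write time:

```python
# Three documents a schemaless store would accept without complaint.
docs = [
    {"sku": "A1", "price": 499},        # price as a number
    {"sku": "A2", "price": "499.00"},   # price as a string -- stored silently
    {"sku": "A3"},                      # price missing entirely
]

def total_naive(docs):
    # Assumes clean data; blows up on the string and the missing field.
    return sum(d["price"] for d in docs)

def total_defensive(docs):
    # The "hard work at query time": the schema checks live in the reader.
    total = 0.0
    for d in docs:
        p = d.get("price")
        if isinstance(p, (int, float)):
            total += p
        elif isinstance(p, str):
            try:
                total += float(p)
            except ValueError:
                pass  # skip unparseable garbage
    return total

print(total_defensive(docs))  # 998.0 -- total_naive(docs) raises TypeError
```

Every consumer of the data ends up carrying a copy of this defensive logic — which is the price of going schemaless that the talk warns about.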
So schema itself is not bad. You really need to be convinced about why you're going schemaless, and be prepared to pay the price for it. And there's an in-between path: the column-family data stores require column indices. If you look at a database like HBase, or say Cassandra, they allow you to define indices on top of those column families — because you're trying to query them and make that efficient. This is a kind of in-between world where you're optimizing both storage and schema. So schema is not evil, and it is perfectly okay to expect or look for schema support in the database you're evaluating.

Next, the CAP theorem. I'm sure by now all of us are aware of what CAP is, but there's an interesting point about not taking it in an absolutely literal sense, and there's a debate about whether CA is really achievable and so on. What I like is a description from one of the papers — the references are on the slide: CAP is one good example of a fundamental trade-off between safety and liveness in fault-prone systems. That's exactly what the CAP theorem is. Safety maps to consistency, liveness maps to availability, and "fault-prone systems" says that network partitions can occur. That's the way to look at it. If you just take the conjecture as proposed at face value, it's too simplistic to say one property is flatly achievable versus not.
There are enough examples. CA is impractical in a distributed world — if you don't go beyond a single node, CA is pretty much all you have. But CAP availability and CAP consistency are actually a spectrum, never binary, and I'll give an example. Look at AP reads and AP writes. For AP reads there are enough examples: I have a cache; so long as the system stays available and partition-tolerant, I trade off consistency. Acceptable.

What's a good example of AP writes? I've seen one very good example, and it probably affects all of us here — I've read this about Uber, and I know many of you use it. I believe they built a system for capturing ride completions: when you complete a ride — when you swipe and say you're completing the ride — it is very, very important to capture that write, because the customer is going to pay and the cab driver is not going to wait. So the write can go to any of the available nodes, and with mechanisms like hinted handoffs the data finally reaches its eventual state. You may not be able to read the data right away, but the write is super important.
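A toy sketch of the hinted-handoff idea behind that AP-write example — my own simplification, not Cassandra's or Uber's actual mechanism. The write is accepted by any live node; if the key's home node is down, a hint is stored on a fallback node and replayed later, so the write survives even though reads may lag behind (eventual consistency):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (home_node, key, value) held on behalf of a down node

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def home_for(self, key):
        # Simplistic placement: hash the key onto one "home" node.
        return self.nodes[hash(key) % len(self.nodes)]

    def write(self, key, value):
        home = self.home_for(key)
        if home.up:
            home.data[key] = value
        else:
            # AP behaviour: accept the write anyway, park a hint elsewhere.
            fallback = next(n for n in self.nodes if n.up)
            fallback.hints.append((home, key, value))

    def replay_hints(self):
        # When nodes recover, hints are replayed and replicas converge.
        for n in self.nodes:
            remaining = []
            for home, key, value in n.hints:
                if home.up:
                    home.data[key] = value
                else:
                    remaining.append((home, key, value))
            n.hints = remaining

nodes = [Node("n0"), Node("n1"), Node("n2")]
c = Cluster(nodes)
home = c.home_for("ride-42")
home.up = False                  # home node partitioned away
c.write("ride-42", "completed")  # write still succeeds -- availability
home.up = True
c.replay_hints()                 # convergence happens later
print(home.data["ride-42"])      # completed
```

Between the write and the hint replay, a read of `ride-42` from the home node would miss — exactly the "you may not be able to read it right away, but the write is captured" trade-off.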
So AP itself — as we see in these two examples, the cache scenario versus the Uber-like example — is a spectrum. Therefore, even with CAP, you go and critique your data stores to see what it is that matters to you. It's never binary. And with that you define your application trade-offs and validate the impact on your NFRs — latency, for example. I can give you many examples: going back to Flipkart, we had systems where — not so much on pricing, but definitely on offers — we chose availability over consistency, and we built systems for that, and that's perfectly all right. You define those trade-offs when you critique a store using CAP, and that also serves as a good starting point for the next consideration: polyglot persistence.

What I mean by polyglot persistence is something like this. In 2010–2012, when we built the Aadhaar system, we looked at the data. There is the raw biometric enrollment data — which, at roughly 1 billion enrollments now, is probably a few petabytes — and it needs to be archived for seven years, so it sits on an NFS filer; I'll talk about why NFS versus HDFS. That's the archive store. The same data first actually comes into HDFS for very high-throughput processing. The biometric extracts — for those of you who have used Aadhaar authentication, the authentication happens against HBase, so that data is on HBase. The master data is on MySQL. We have indexes — no longer on Mongo, the slide is a little dated — again on MySQL. And then there's a query layer on top which allows all kinds of searches in a very secure way — for example, if you lose your enrollment slip, how do you find your Aadhaar record?
So it's the same data — or rather extracts of it — sitting in a highly polyglot store, and the data progresses through various layers.

Another concept here is called CQRS — show of hands if you've heard of CQRS. It means Command Query Responsibility Segregation. In terms of operations, writes and reads are obviously separate — but what if your data stores themselves are separate? CQRS essentially promotes that, and a number of systems today allow you to realize it; I think Lagom, if I'm right, among the more recent frameworks, gives you first-class support for CQRS. Basically it says: you have writes going into a data store — those are the commands — and there's a continuous flow of those changes into a read store, so you can serve the reads off completely different infrastructure.

What you're seeing on the right is essentially the system Flipkart has for its catalog, and at peak it has served about 10 million reads per second. That's the read pipeline: at the bottom is a highly replicated, highly distributed cluster with virtual buckets managed by the application layer, while the writes — the whole data that gets stored — sit on HBase and Elasticsearch and get orchestrated through a write pipeline, with write repair happening across these clusters. So this is one example of using a polyglot store and applying the techniques of CQRS.

Here's another one, where you don't have to engineer it yourself. This is some work I'm currently doing at Curefit: we're building a knowledge graph which will eventually be used for things like an automated symptom-diagnosis system, and which can even aid the doctor in disease analysis.
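Going back to CQRS for a moment, the command/read split can be sketched like this — a deliberately minimal, single-process illustration with invented command names; real systems run the projection continuously off a change stream rather than on demand:

```python
write_store = []   # append-only command log (the "write side")
read_store = {}    # denormalized view serving queries (the "read side")

def handle_command(cmd):
    # Commands only ever touch the write store.
    write_store.append(cmd)

def project_pending():
    # Projection step: fold commands into the read-optimized view.
    # In a real deployment this runs continuously as a pipeline.
    for cmd in write_store:
        if cmd["op"] == "set_price":
            read_store[cmd["sku"]] = {"price": cmd["price"]}

def query(sku):
    # Queries only ever touch the read store.
    return read_store.get(sku)

handle_command({"op": "set_price", "sku": "A1", "price": 499})
handle_command({"op": "set_price", "sku": "A1", "price": 450})
project_pending()
print(query("A1"))  # {'price': 450}
```

Because the two stores are decoupled, each side can live on infrastructure tuned for its job — a write-optimized store feeding a read-optimized one, as in the Flipkart catalog example.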
I hate to use the word AI, but this is basically a data-driven diagnosis system with no human intervention, and it starts with establishing a graph of sorts, which says: flu, as a condition, has symptoms like cold, body ache, headache, and so on — very naturally represented as a graph. We built this, and it uses Titan as the storage. We chose Titan for multiple reasons. One is that it can be backed by Cassandra, and Cassandra as we know it is an AP system, which maps very nicely to what we're doing here: more than the staleness of the data, what we care about is availability. It works well for single-hop queries like ours — once you're at flu, you traverse to, say, cold, body ache, and headache. For a system like this you can create very interesting solutions on top. But the point I was trying to make is that with Titan, for the underlying storage you have a choice of either Cassandra or HBase, and the indexing is on Elasticsearch. So it's again polyglot persistence under the hood — you didn't make each individual choice, but it's a case where it isn't a single bundled storage engine.

A couple of other things to look at. First, performance benchmarks. In general, benchmarks by vendors are lies, lies, and more lies.
Why? Because they obviously have a job to sell their databases, so they'll take the most ideal workloads and try to prove their database is right for you. But I've found independent benchmarks very useful. There was a recent benchmark on graph databases where somebody showed that, for certain workloads, you can actually implement a graph on, say, Postgres — I think they even picked Neo4j and Titan-with-Cassandra — and those graph queries actually worked better with a relational datastore as the underlying storage. Facebook's TAO, for example, is a graph database built on MySQL. You might still make your own choice — even after reading that report I would still go with Titan on Cassandra, because I don't have to build the underlying graph storage layer on top of MySQL or Postgres myself — but it's really good to be aware of these benchmarks and use them in your decisions.

Next, operations and maintenance. This is interesting. Look at Facebook: with all their engineering prowess, they didn't go and build a storage layer for TAO — they built it on top of MySQL. Same with the LinkedIn folks: they built Espresso on top of MySQL, and their engineering blog posts show that both used MySQL as the storage engine. Why did they do that?
They say: we have expertise managing MySQL as a storage engine; we know what to do when we have to replicate data, when we have to restore data. So we would rather use that as the storage engine and build our own layer on top — a document DB in the case of LinkedIn's Espresso, a graph database in the case of Facebook's TAO.

The other important thing is support — paid or community. If it's paid, is it single-vendor or multi-vendor? For community, the size of the community matters a lot. You'll see a lot of databases claim to be open source while the only company supporting them is the one that created them — that really doesn't speak of a large community.

The fourth one is hosted-service availability. When we chose Titan with Cassandra as the storage — Titan, interestingly, also has a storage backend implemented on top of DynamoDB. So if you had to deploy and operate on AWS, you'd already have a backend managed by someone else. That's a matter of choice in a public cloud, and again something to weigh when deciding.

From here we'll move on to some under-the-hood considerations — things you'd look at beyond the fairly obvious ones. First, the storage layer. Between a relational database and a columnar database, it helps to understand how data is stored internally. In a relational store, quite naturally, all the data in a row is stored together.
That means any disk seek or transfer operates at a record level. In a columnar store, by contrast, all the data for a column is stored together. Why? Because for OLAP queries — group-bys, counts, aggregates — columnar storage is more suitable, and that's why you'll see most OLAP databases choosing columnar storage at the storage layer. It also gives better compression, and so on. It definitely helps to know what the underlying storage engine or storage layer looks like for the database of your choice.

It also comes with trade-offs, and I have a case in point: suppose you're building a data store like, say, Aadhaar, where you retrieve one record at a time — a relational store, because it uses a B-tree-style row-oriented layout, is probably more efficient than a columnar one.

Then, document stores. We have four classes: relational, columnar, document, and key-value. A document store is essentially a subclass of the key-value store — data is organized from a KV pair to a collection to a document. They sometimes have challenges optimizing storage, because they don't have a data dictionary, so it becomes rather hard to optimize at the storage level.
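To make the row-versus-columnar trade-off above concrete, here's a small Python sketch with invented data. The same records are laid out both ways: the row layout makes a point lookup natural (one record's fields sit together), while the columnar layout lets an aggregate touch only the two columns it needs:

```python
# Row-oriented layout: one record per entry, fields stored together.
rows = [
    {"id": 1, "city": "BLR", "amount": 10.0},
    {"id": 2, "city": "DEL", "amount": 20.0},
    {"id": 3, "city": "BLR", "amount": 30.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {
    "id": [1, 2, 3],
    "city": ["BLR", "DEL", "BLR"],
    "amount": [10.0, 20.0, 30.0],
}

# Point lookup ("give me record 2") reads one contiguous record in row layout.
record = next(r for r in rows if r["id"] == 2)

# An OLAP-style aggregate only needs two columns in the columnar layout --
# the "id" column is never touched, unlike a row scan.
blr_total = sum(a for c, a in zip(columns["city"], columns["amount"])
                if c == "BLR")

print(record["amount"], blr_total)  # 20.0 40.0
```

On disk the effect is much larger than in this toy: columnar scans read only the needed columns' blocks and compress far better, while row stores win when whole records are fetched one at a time.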
Now come the key-value stores — often RAM-based. If you really look deep, how many of them give you strong guarantees of data durability? I wouldn't expect them to, because by design they're meant to give very low-latency reads — sub-millisecond or sub-two-millisecond — so they largely keep the data in memory. That means writes are quite likely batched and then flushed to disk. If you say, "I require a key-value store with very high durability guarantees, durable down to disk and also distributed," imagine the cost of that: it has to do RPCs, send the data across nodes, and also flush to disk. Most of them probably give you ways to tune that kind of durability, but key-value stores are better suited to mostly in-memory reads and writes with batched flushes to disk.

Next, data organization — another interesting topic. A number of relational stores organize data in more of a B-tree format, while in more recent times a lot of databases use a log-structured format. Why log-structured? Multiple reasons. It started off as a way to get very high-throughput writes — like HBase or Cassandra.
They started off using log-structured formats for their writes, but interestingly, databases like RocksDB or LevelDB also use a log-structured format. The reason is that most of them are built for storage on SSDs, and on SSDs — given write amplification and the limited write cycles you have — a log-structured format is better suited and gives the device more life compared to the in-place updates you'd see in a relational store. Log-structured also means your data is mostly keyed by the primary key; that's why in HBase or Cassandra, retrieval by primary key is super easy, while secondary indices, even where provided, aren't just as fast for lookups — we'll talk about that later. Generally it's good to see what the underlying data organization is; you'll see more and more recent databases using LSM trees.

Next is replication and consensus. If you want a globally replicated, completely consistent database, it means the replication is — I haven't read it in detail, but Spanner's replication is probably synchronous — versus a number of the databases we know, which use async replication. Then there's the number of replicas, the number of copies of the data you're guaranteed, guaranteed writes, hinted handoffs. All of these are attributes you'll see emerging as common themes if you read the white papers and engineering blogs, mostly for the open-source databases, and they help in making sense of the guarantees you can get from the database. Or whether it's a single-master system.
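The log-structured idea discussed above can be sketched as a toy LSM store — a memtable plus sorted, immutable runs ("SSTables"). This is my own simplification (no write-ahead log, no compaction): writes land in memory and are flushed as sequential sorted runs rather than updated in place, which is exactly the access pattern that suits SSD write cycles; reads check the memtable first, then runs from newest to oldest.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}        # in-memory writes
        self.sstables = []        # immutable sorted runs: [(key, value), ...]
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sequential append of a sorted run -- never an in-place update.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest run shadows older ones
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # memtable full -> flushed as a sorted run
db.put("a", 9)   # newer value shadows the flushed one
print(db.get("a"), db.get("b"))  # 9 2
```

Note how `get` by the sort key is a cheap binary search per run — the talk's point that primary-key retrieval is the easy case in log-structured stores, while anything else needs extra index structures.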
For example, HBase achieves consistency by having a single master for a piece of data — there's a single authoritative copy, and all reads and writes go to it.

On consensus: it's used in leader election, and there's all this talk of Paxos. How many of us have heard of Paxos — show of hands, please? And how many of us understand Paxos? Not me. It's incredibly complex as a consensus protocol. What you'll see is that there are newer protocols like Raft which are also employed for consensus and leader election. It does help to understand the replication and the consensus protocol in use, because sometimes, due to the wrong choice of consensus protocol, a database can perform very poorly when it comes to safety guarantees.

And if you care to look, look at the Jepsen tests — again, show of hands, who's heard of Jepsen? Very interesting. Just read up on the Jepsen tests; you'll probably not sleep at night. You might have chosen a database, and when you read its Jepsen results you'll go, "oh my god, what did I choose?" But the good part is, what started off as just a critique of these databases — today a number of databases actually come out and claim that they pass, that they adhere to, the Jepsen tests. Jepsen started off as the work of one single guy, Kyle Kingsbury. So look it up, and do read the Jepsen results for the database you're considering.

I'm running out of time.
You also want to look at data export and restore — primarily if you have strict RPO and RTO requirements and a DR scenario. Look for tools that provide full export versus incremental, and snapshot tools for restoring from an export. All of this starts to matter at very large database sizes — multiple terabytes — when you're trying to ship data around. I had a point here about large-scale data migration: when we did the Flipkart data-center migration, we actually took disks in a truck. It was the highest bandwidth we could ever get for transferring data from point A to point B. So it does matter to look at these tools for data transfer and migration.

Another aspect is mean time to recovery — really, the time you take to, say, promote a hot standby to a master in a MySQL scenario. In Vitess, as an example, it's the click of a button: you can promote a MySQL standby to master with one click. That says something about how you keep your operations running: detect failure, validate, and promote a new master or leader.

Then cost — disk-to-memory ratios. Going back to Aadhaar: we moved the archival data out of HDFS and onto an NFS filer, because to keep such a large HDFS cluster operational, the amount of compute you have to dedicate to it is higher; it's cheaper on NFS. So it's good to look at those trade-offs, the amount of data you can hold in memory, and the ratios you can sustain. And power consumption: in very large data centers, when they offer you SSDs, they're actually doing themselves a favor, because it's more power-efficient to run SSDs than spinning disks.
And if you look at cold storage like Glacier — there are some references there — it's very interesting: the way Facebook stores photos, for instance, not all disks in the rack are even powered up. They optimize that, because it's super critical when you're building such large, dense storage.

The last one is DB-specific optimizations — essentially, how effectively you can use memory and reduce or optimize access to disk. On the memory side it all comes down to data blocks, block caches, and buffer pools. There are enough guidelines around how much buffer-pool size to give a MySQL instance, or, in the case of HBase, what the block-cache size should be — it primarily reduces disk seeks and gives lower-latency reads, helped by locality of reference. There are a few references to read up on, but effectively you're trying to keep data in memory as much as possible and not hit disk. And if you do have to hit disk, there are techniques like Bloom filters that tell you which data blocks actually contain the data. Either you have indexes, or you have probabilistic data structures like Bloom filters, which tell you there's a good chance the data exists in this block — you can get a false positive, but never a false negative. If you read about databases, you'll see, for example, that HBase supports Bloom filters at the column-family level; they're just trying to optimize the disk seeks and the data blocks you have to read. So, in essence, it's about the levers you have to minimize access to disk — specifically seek times more than transfer times —
and the levers you can employ as a user.

That takes me to the end of this talk — definitely not a comprehensive list, but a bunch of things you can look at, both the obvious and the under-the-hood, when you make your next database choice. Some references — they're all over the talk, so if you start looking into each of them, you'll be able to relate them to the various topics I've covered in the last 30 minutes. With that I'm done — open to any questions.

If anybody has a question, please raise your hand and we'll get you in line. Wow, an entire auditorium without a single question — oh wait, there's a question. It's almost lunchtime; you could almost leave.

Hi, hello — at the back here, can you hear me? Yeah? Okay. I just have one question, or more like feedback. At one point you talked about LSM structures, and you said we prefer LSM for the better lifetime of SSDs, compared to in-place updates. Can you explain how exactly that happens? Because with LSM you generally have a process that comes along with it, called compaction, where you end up rewriting the files by taking deltas over again, whereas in-place updates are all about doing random writes.

Yeah — so random writes actually result in write amplification, whereas when you do compaction the way you describe it, an LSM writes larger blocks.
So the write amplification is significantly reduced.

The Paxos protocol we were talking about — you mentioned Raft, and ZAB might also have been listed, as if they're three protocols, but Raft and ZAB are just variations of Paxos. So we just need to keep that in mind; Raft is just a way to simplify Paxos.

Yeah — as an end user, I found it got easier to even make an attempt at understanding Raft; Paxos, maybe harder, I don't know — it's incredibly hard. But the point, more, was to see what the strength of the replication protocol is. If a database were to tell me that they use Paxos for both leader election and replication, I'd know those folks have done some very serious engineering work. Whereas, say, I know Elasticsearch used to use a replication protocol that was very fault-prone, and that came out in the Jepsen tests.

I'd like to thank you very much. We will have a Q&A.