Okay, so good evening everybody. I'm Vishnu, I work at DataSpark here in Singapore as a senior data engineer. I'm a big fan of MySQL and of other data stores out there, and also a doodle maker, so both of us are somewhat into art, he more than me. Prior to DataSpark I worked with a few companies, formerly at Flipkart in India, and that's where I gained most of my knowledge about data stores, which is what I thought of presenting here today. This talk is not so much about Python but more about the data part of the meetup, the "data" in PyData; my friend Max said why don't you present something, so I thought, okay, let me take the data part of it. This slide doesn't start with the title; we'll get to the title in a while. This is a comma-separated list of possibly all the well-known databases out there: Hadoop, HBase, Druid, Elasticsearch, CockroachDB, CouchDB, Vertica, Infobright, Redshift, etc. It's a huge list and it can go on. You're probably using some of them, all of them, or at least one of them; Hadoop, I guess, or HBase. What's common to them is obviously the data part, but there's something different about each and every one of them. So let's get to the fundamental question, fun being the operative word here: which one should I use? That's everyone's question in general: which database, or which data store for that matter, should I use? So let's boil it down to the fun of it. A lot of people have fun evaluating them: putting data in, firing queries, running Spark jobs, and so on. That's where the title comes in, a little late, in the middle of the presentation: demystifying data stores. What we'll try to do is demystify the data-storing part of a data store, and hopefully that should give you some insight into which one to use. So as I said, normally we go about it by asking: which one should I use?
These guys are using Elasticsearch, Facebook is using PrestoDB, things like that, which most likely draw your attention to the database. But what I say is: try to look from the database's perspective. You have a database out there; let's try and learn more about him. All of you have seen this game, a very famous Nintendo baseball game. As you can see, in baseball we have four bases (five actually, but okay, four of them). What we'll try to do is the same: we'll try to score a home run by covering all the bases. There are four of them, and the blue guy is trying to run to all of them. Base zero is the data itself. When you think of data, if you go back to the traditional way of thinking about it, it's generally a row having multiple columns, right? People think: there's a row, it has columns. The other way is key-value, which is more prominent these days; the NoSQL world uses the key-value kind of thing. Even within key-value there are different types. The first is key-blob: when I say blob, think of an opaque object. The best example is S3; you're still storing something in the data store, and it's a blob. The second, even more prominent one is key-document: think JSON, think XML. When you come to document stores, MongoDB is out there, which calls itself the document store; there are other document stores, but the operative question is how you're looking at the data. Is it a blob or a document? Then there's also the cousin, the graph database. Underneath, it's also a key-value store: the graph, which consists of nodes and edges, is somehow represented as key-value again, so in the end the graph database is still a data store. So if you look at it, the first step is always to try and see how you view your data yourself.
You can still put a blob, an mp3, or an mp4 into MySQL or Postgres, and it will work. It's not that it doesn't work, but "should I put it there?" is the first question. So base one is: how is the data stored? Let's assume this is a design discussion and we're trying to figure out: if I give you some data, how do you actually store it? That's the key job of a data store, right? What are the general approaches most data stores out there will take? Do I write it to a file, and how do I write it to a file? Or do I keep it in memory and store it there? That's another way of storing it, right? So let's take a simple row, a data record. Here it's a row which represents an order placed on Amazon or Lazada or something. There's an order ID, a customer ID, a bill amount, perhaps the address (Bugis Street), the amount of tax, and three items were ordered. If I look at this record, it actually signifies some model or entity, right? Typically everyone would immediately say the first column is the probable primary key and the rest of them are all columns; this is what immediately comes to mind from the traditional MySQL/Postgres way of thinking. So let's keep that in mind: we have a row here, it has multiple columns, and you're somehow tagging it with a primary key and other column attributes. The first approach, generally, is: let's store all the columns side by side, together, on disk. If I have 10 rows, each having 10 columns, the first row is stored with all 10 columns side by side, then the next row, then the next row. This is the typical way of storing your rows, and it's generally called a row-based data store.
Most of the older databases have been storing data this way for a long time. There's nothing wrong with it, even if some people will say you don't have to do it this way. So when is this useful? A long time ago I attended a workshop about MongoDB where the presenter was pushing the idea: stop using MySQL, let's use Mongo. Let's take the same use case, this particular order, and store it in Mongo; when I click the button on the UI, it shows me the entire order in one shot, because a customer support guy looking at your order would like to see all the columns displayed on the UI, not part of them. He'd like to see when the order was placed, by whom, and for what amount. The operative phrase here is "entire row". Imagine the entire row stored together on disk: you go to disk, start reading sequentially, and get the whole row in one shot. You can think of it as perhaps a single disk access to fetch the entire row. I'm simplifying things here, but even in the Mongo world the notion is the same: fetch the entire document in one shot rather than going to disk multiple times. This is pretty much the thinking behind a row-based data store. When you fetch the row, you fetch everything; that's why everything is stored together rather than in separate places. So let's go to approach number two. We have a row and we have our columns. Let's store each column separately, so that they're all independent by themselves. They gain independence from the row and are stored independently. So what's the advantage of this?
In the first case, all the columns were together like a family. Now we split them apart and store them in different places. This is generally called a column-based data store, because you're now thinking in terms of columns, no longer in terms of rows. When you store, you store each column separately from the others; you don't store them together. I've tried my best to give you a visual representation: if I have two orders here, I've vertically split them apart. The first column is the order IDs, the second is the customer IDs, and so on. So what's the use case of storing it this way? What kind of questions can I ask? A typical question, mostly a business use case, would be: what's the average bill amount, or the sum of the items, for the last year? In this case I'm not interested in who placed the order, or where the order went, whether it was Bugis Street or Orchard or wherever. I'm interested in one particular column. If I store the columns separately, then when I'm interested in one column, I fetch just that one column; I don't fetch everything. That's the difference from the previous approach, where to get this data I'd probably fetch the entire two rows. In terms of data fetched, it's much simpler here: you fetch what you want. If I want the average, I just take 5 and 2, compute the average, and give it to you; if I want the number of items, I just take that column and give it to you. So the operative idea is fetching only the columns you need for the questions you're asked. If you're going to disk, you're fetching less data, which means reduced IO.
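A toy Python version of that, with each column held on its own (lists standing in for per-column files; the column names come from the order example, not from any real engine):

```python
# Each column is stored separately, independent of the row it came from.
columns = {
    "order_id":    [1001, 1002],
    "customer_id": [7, 9],
    "bill_amount": [120.50, 30.00],
    "num_items":   [5, 2],
}

def average(col_name):
    col = columns[col_name]   # touch ONLY the column the query needs
    return sum(col) / len(col)

# average("num_items") reads two values, not two whole rows.
```

The aggregate never looks at customer IDs or addresses, which is the reduced-IO point in miniature.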
So these days when you say IO there's SSD versus the normal spinning hard disk and so on; I'm going to leave that out of this presentation. You can just simplify and say: I'm fetching less data, and that helps when you fire queries in various scenarios. There are other advantages of storing the columns separately. Let's take a column, age, with a hypothetical example: rows one through four, with ages 28, 30, 28, 28. With this approach of storing the columns separately, you can think of other optimizations as well. The first one they generally apply is: sort the column, then compress it. A simple compression would be storing 28 once and putting the number three beside it to say 28 occurs three times, and 30 occurs once. You no longer store 28 three times; you store 28 with a descriptor saying it's present three times. This is not exactly how the column stores do it, but it's more or less a guide to how you can think about it; you can think of bit-level tricks and a lot of other ways to do this kind of compression. That's the general selling point of columnar data stores: they say we'll give you better compression when you put your data into us, because most of them apply a sorting technique and then compress away the redundant data. In this case sorting gives me one more advantage: I can do a better search. With 28, 28, 28, then 30, I can just skip values in between, right? So let me group the databases now. MySQL and Postgres fall into row-based; that's how they store rows. The guys out there who do columnar very well: Vertica is one of them, Infobright; these are commercial players.
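The sort-then-compress idea from a moment ago, sketched in Python. Real column stores use fancier encodings (bit packing, dictionaries, and so on), so treat this as the run-length intuition only:

```python
def rle_compress(values):
    """Sort the column, then store each distinct value once with its
    run length: [28, 30, 28, 28] becomes [(28, 3), (30, 1)]."""
    runs = []
    for v in sorted(values):
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs

def rle_contains(runs, target):
    # Sorted runs also make search cheap: stop as soon as we pass target.
    for value, _count in runs:
        if value == target:
            return True
        if value > target:
            return False
    return False
```

Four stored ages shrink to two (value, count) pairs, and the membership check can bail out early exactly as described, skipping values in between.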
Vertica is based off C-Store, which comes from a fairly old and very well-known column-store paper (I forget the authors). And most recently, I put Druid up there because I've been evaluating it for the last two weeks; Druid is one of the open-source databases out there storing data in columnar format. So hopefully, with these two approaches, you get a sense of why you'd use MySQL and Postgres, and why you'd use, say, Vertica. We've only covered base one, and there are other bases as well, but with this information you at least get an idea: these guys store data like this, and maybe someone else stores it in a different format, in a different way, and that might be beneficial to me. Then there's an approach 2.5, as I've named it. In approach two we store the columns separately; in this approach you store a group of columns together, and store the groups separately. This is generally called a column family: you take a family of columns, group them, and store the group separately. I can have multiple column families, each having a number of columns in it. It's pretty much an extension of the second approach: there, n equals one; here, n can be greater than one. Just to visualize it, I can have two column families: the one on the left holds all the strings and descriptors, like name and street; the one on the right holds all the metrics, the numbers. To me this is a logical grouping: all the numbers used for crunching reports are on the right, and the ones on the left are probably not used in reporting. So how is this useful?
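Before answering that, here's how I'd picture the column-family split in Python. The family names and the grouping are my own illustration, not HBase's actual data model or API:

```python
# Each family groups columns that are read together; each family is
# stored separately (dicts here stand in for separate files/regions).
families = {
    "descriptors": {                 # strings, rarely used in reports
        "name":   ["alice", "bob"],
        "street": ["Bugis St", "Orchard Rd"],
    },
    "metrics": {                     # numbers, crunched in reports
        "bill_amount": [120.50, 30.00],
        "num_items":   [5, 2],
    },
}

def fetch_family(name):
    # One read pulls every column in the family in one shot.
    return families[name]
```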
If I store those columns together, then when I want to query all of them together I have a column family I can pull in one shot, rather than going to multiple places and pulling columns together. Typically HBase, and some would say Cassandra, fall into this; they have the notion of column families, a group of columns, while still adhering to the concept that things are stored separately: column families are stored separately from each other. So base number two, which we need to think about, is indexing. This is a common question everyone asks: what's the indexing like? MySQL and Postgres, MySQL in particular, use B-trees for indexing. Then you have the inverted index, which Solr and Elasticsearch use (Lucene, underneath, uses it), and Druid also uses an inverted index. Then there's the fractal tree index, not exactly new, which is used by TokuDB. Then there's the clustered key, bitmap indexes, or no index at all; for that matter, you can have no indexes, right? So how is this useful to you when selecting a data store? It might be, or it might not. But depending on the index, certain types of queries like certain types of indexes; it's not that one index fits every query pattern. And not all databases implement all the kinds of indexes, B-tree, fractal, inverted; some of them do, and they do it for a particular reason, because that's how they were designed. Range queries like B-trees; fractal trees like inserts, which are very cheap as per their claims. So keep this in the back of your mind: this guy has some indexing power; what's the nature of the index? Is it a tree? Is it a simple inverted index? How does it help my kind of query? The other part of it is the index loading mechanism: how does the index get loaded into memory?
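Before getting to loading, the inverted index just mentioned is easy to sketch: map each term to the set of documents containing it, so a term lookup never scans rows. This is a bare-bones version of the idea behind Lucene, minus all the real machinery (analyzers, segments, compression):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # term -> set of doc ids containing that term (a "postings list")
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "order shipped to bugis", 2: "order cancelled", 3: "shipped fast"}
index = build_inverted_index(docs)
# index["shipped"] is {1, 3}: one lookup, no document scan
```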
Because in the end, you need to read the index to actually get to the row, right? Every database might do it differently: they might load the entire index into memory, or they might load parts of it. That's where I wanted to highlight something about Redis, which is a very prominent key-value data store. Redis is memory-bound: the entire dataset has to fit in memory. If it can't, it's not as if Redis can go to disk and fetch everything; the power of Redis rests on the assumption that all the data is in memory. This is not exactly about index loading, it's more a design choice of Redis, but it's something you need to think about: if I have 200 gigs of data, do I need 200 gigs of RAM? For Redis, the answer is yes, unless you shard Redis, but we're not going into that here. So these two points, the type of index and how it gets loaded into memory, also need to come up when you choose your data store. So, base three: the CAP theorem. I'm not going to go too deep into the CAP theorem, but it's something which nags all the data stores out there; when their authors go out designing or writing the data store, they think about it. So I'll explain a little about it. The bullet points I've highlighted are horizontal scaling and sharding. Most data stores will say: we do horizontal scaling, we do sharding, you can add however many nodes you want, and it'll work fine, right? This is linked to the so-called CAP theorem, which talks about a very simple scenario: when a network partition occurs, say the network link between the nodes gets cut, what choice do you make? Do you keep your database consistent, or do you keep your data available? So how are consistency and availability handled?
That is what the CAP theorem talks about, and most data stores out there actually make a choice. Some will say: if a network partition occurs, I will not sacrifice consistency, but I will sacrifice availability. Others will say: we go for availability, and we give up strict consistency. When I talk about consistency, I mean something like: are the numbers accurate? I've inserted 100 rows but I only see three rows, things like that. Again, I'm simplifying, but when you think about horizontal scaling, adding more and more nodes, there's a silent enemy there: the network. It's not seen in the documentation or anywhere else, but it's always there, and when it breaks, it can break your DB as well. So you need to ask: how does this guy handle a network partition? Does it impact my querying? Does it impact my writing? Do I get correct data? That's something you need to ask the data store, and hopefully the documentation tells you: this is what I do. If you're cool with it, go ahead with it; if you're not, find something else. That's the point. So we're coming to the last base. Apart from the CAP theorem, there are other things as well. All of you must have heard of the word ACID: atomicity, consistency, isolation, durability, right? This is a term generally used with the typical relational databases, MySQL and Postgres and Oracle (I forgot about Oracle), and the typical words you'll hear are transaction, commit, and rollback. We probably don't hear them as much these days, but they're always there if you're using MySQL, Postgres, or Oracle: transaction, commit, rollback. This concept of transactions, rolling back and committing, not many data stores out there actually do it; very few offer this kind of behavior.
So the question is: do you really need that transaction capability? If you don't, then okay, you can choose some other data store. But if you do need it, what's the choice? If you go to Dropbox's presentations on their engineering blog, most of the metadata for Dropbox is actually stored in MySQL. Why do they do that? Their general thinking is that the most critical part of the data is the metadata, and they simply trust MySQL (or a Postgres or Oracle) to store it better, to keep it safer with the durability aspects, than a Mongo or a Cassandra or anything else out there. The data itself might not be in MySQL, but the metadata, their primary data, they keep in MySQL or Postgres. Another case: the invoices for Lazada or Amazon or Flipkart. Where would you like to keep your invoices? Would you keep them in a NoSQL, Mongo kind of thing, or Cassandra? You can, provided the guarantees of durability are not sacrificed; you cannot afford to lose your invoices. So that's an important question: do I really need ACID? Do I really need those properties from my data store? Do they actually fit my use case? More than likely you might not, depending on the use case, but if you're doing transactions and you want to roll back, obviously you have MySQL and Postgres and Oracle and some others. But that's the idea: keep ACID in mind. ACID is also sweet, by the way; that's what I always like to say. It might look tough, but it is very sweet. So that brings us to the newer concept of BASE, which most NoSQL stores talk about: basically available, soft state, eventual consistency. I like the eventual-consistency part of BASE. It's tied to what happens when the network partition from the CAP theorem comes in: what actually happens if I break the network link?
What happens to consistency? People say eventual, meaning if I make an update, it will eventually reach all the nodes in the cluster, so if I query it after an hour, I might see it. That's also the soft-state part: the state has not propagated across your so-called horizontally sharded cluster, because the network is broken, right? But it's basically available: the data is still there, it's just not consistent across the cluster. So that is a term you can tag onto many of the NoSQL stores out there. You make a choice: do you want ACID, or do you want BASE? You can't easily have both; there are solutions out there that try, but it's something you have to choose, or at least keep in the back of your mind. So the next big thing, not actually a big bullet point, but a question: if I do sharding with those horizontally scalable databases, can I do joins across large sets of data? This is a typical question everyone will ask: okay, I put my data in a Memcached-style cluster, can I do joins? The answer is: perhaps you can, but it might not come to you out of the box. It's not as if Memcached understands SQL and will go and do the join for you, right? Think about it from a logical perspective: your data is spread all over the place. It's as if you in the audience are the data, each of you a record, split across each of these shards, which are the chairs. If I want to join that side with this side, you're all in different locations, and I need to get you together to do the join, right? That's the sharding part of it: I've sharded my audience here, kind of distributed you, but now if I want a join, I have to bring you together. So that's a choice you're actually making when you say: I'm sharding my database, or my solution, say HBase or Cassandra, where sharding comes out of the box, right?
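The chairs analogy, sketched in Python. With orders and items sharded independently of each other (the layout below is hypothetical), a join has to visit every shard and drag matching rows to one place first, which is exactly the cost you sign up for:

```python
# Orders and items sharded with no relationship to each other.
order_shards = [
    {1001: "alice"},            # shard 0: order_id -> customer
    {1002: "bob"},              # shard 1
]
item_shards = [
    [(1002, "pen")],            # shard 0: (order_id, sku)
    [(1001, "mug")],            # shard 1
]

def join_orders_items():
    joined = []
    for oshard in order_shards:            # scatter: every order shard...
        for order_id, customer in oshard.items():
            for ishard in item_shards:     # ...against every item shard
                for oid, sku in ishard:
                    if oid == order_id:
                        joined.append((order_id, customer, sku))
    return sorted(joined)                  # gather happens on one node
```

Every order shard ends up being matched against every item shard; in a real cluster that means network round trips, not just nested loops.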
So that's something you also need to think about, and that's why I set this apart; it's somewhat related to the CAP theorem, because when you think about horizontal scaling and sharding, it's all well and good, but do I get joins, mostly from the reporting angle? There's one more term which is very rarely used, called distribution awareness: basically, are you aware of how you distribute your data? Let's assume that all the orders are on the left side of the room and all the order items are on the right side. In this case, I'm totally aware of where my data is, so if I want to do a join, I just need to bring you together. But if I mix up the orders and the order items, then when I want to do a join, I won't know where the data is. As far as I know, there's only one solution out there which does distribution awareness out of the box. The idea is this: when you do the sharding, say I place the order with order ID one on the first chair; all the items of order ID one also go onto that same first chair. Basically, I shard by the order ID, because every item of an order carries the same order ID. This is distribution awareness: I'm actually aware that I'm distributing the data to this particular place. I think MySQL Cluster, from Oracle, is the only one that does it out of the box; you don't have to do anything magical. Most people implement it themselves: they put order ID one on node one and put all the items for that order onto the same node. This is also referred to as co-location: you want related data to be located together, and it helps you with the join. If the order and its order items are on the same node, the join can be done locally, because everything is present on the same node.
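That co-location rule is nearly a one-liner once you pick the shard key. A minimal sketch, where the shard count and the modulo routing are my own choices for illustration:

```python
NUM_SHARDS = 4
shards = [{"orders": [], "items": []} for _ in range(NUM_SHARDS)]

def shard_for(order_id):
    # Shard by order_id, so an order and ALL of its items land together.
    return order_id % NUM_SHARDS

def place_order(order_id, skus):
    node = shards[shard_for(order_id)]
    node["orders"].append(order_id)
    node["items"].extend((order_id, sku) for sku in skus)
    # The order/items join can now run entirely on this one node.
```

Because items route by the order's ID, never their own, the join never has to leave the node, which is the whole point of distribution awareness.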
So even though the data is sharded, these are the questions that come up with sharding and joining; ACID and BASE have perhaps not much relation to this, but they all fall into this base-four bucket I have. You need to make a choice: do I actually need all of this? Is it necessary for me? How important is it? If people are not doing joins, you don't care where you shard the data. There's one more point: the query interface. What comes out of the box? Oh, this guy has a SQL layer on top of it; is that important to you? If you look at a data store like Solr, it does have SQL access, but it's pretty limited; most of the querying is done via an HTTP/JSON interface. It's not the SQL you're used to from MySQL and Postgres. Is that important to you? Can you make do without SQL? Where this actually impacts you is when you write the queries while developing your app; the ease of development is something which is very important. If I'm writing big JSONs as queries versus a simple one-line SQL statement, the SQL I can probably implement in five minutes. I wanted to highlight this point because my team at DataSpark is actually fighting this problem. One of our data stores is Solr, primarily for counting, faceting, and that kind of stuff, but most of the queries are all JSON, so it takes us a little while to write the Solr query. If it were SQL, I could have just said: select blah, blah, blah, group by, where, order by, done, in five minutes. So this is something you need to check out as well: what's the query interface, and is it important to you? It might not be, and that's fine too. So coming back to baseball, the bases, as I mentioned: the data itself, how the data is stored, the indexing part, the CAP theorem, and ACID and BASE.
So try to cover the bases; that's my message to you. Keep those questions in the back of your mind and see if they actually impact you in any way. It's perhaps like an interview: you interview a database and throw these questions at it, and if it answers satisfactorily, select it. That's what I'd like to say. And there's a big PS: there's no silver bullet in data stores. There's no magical GokuDB, from Dragon Ball Z, which does everything for you. Each one does things differently; there are pros and cons. Try to evaluate them, and hopefully you make the right choice. Thank you. Any questions I can take? Yeah. So the question is to give an example of a data store which I actually used, and why I chose it. I'll give you a very recent, live example. Our current data store is actually Solr. We index the visits of all the Singtel subscribers: basically, if a subscriber is visiting this building, that's a visit, and we index it into Solr. Then we do counting when we ask questions like: give me the number of unique people who visited this building last week, and we get the count. Because of the volume of the data, the nature of the indexing in Solr, and how it actually stores data, we are finding a bit of a problem when it comes to performance. So we started an evaluation: we first looked at Cassandra a little bit, and we couldn't complete that evaluation, but right now we're looking closely at Druid, which is a column store. Let me break down why we're looking at Druid. It's simple: I have columns like, say, person ID, age, gender, what mobile the person is using, what OS, all the person attributes basically; then obviously location is one dimension, the person has visited this location; and time is another dimension.
This person visited this building at six o'clock, right? Druid is a columnar data store, so in the first place, each of these columns will be stored separately. Druid also has some pre-aggregation logic, some magic which happens behind the scenes, which I can choose to use. Let me ask a question: give me the number of unique people who actually visited this particular location. If I store the columns separately, say the location is Orchard, and this side of the room has visited Orchard while that side never has, and I have as many rows as there are people here, then what I need to fetch is only the location column. Actually, I need to fetch two columns, the person ID and the location, and do a filter on top. I don't need to fetch all the columns: not age, not gender. So the columnar nature helps me fetch only what I need from disk, and it can help me do a group-by as well. The second thing is, we don't actually store the data in that raw format; Druid has the notion of aggregations. What we actually store is: Vishnu visited this building on this date twice. The columns now are an ID (Vishnu), one dimension for location (this building), one for time (this day), and a metric, number of visits: two. So if I ask the question, give me the number of visits by Vishnu to this building, I just need to fetch the number-of-visits column and give you an answer. What I'm trying to boil down to is that in this particular example, I'm only interested in a few columns, which I'm fetching from disk, and that's where the columnar nature helps me. Okay.
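That pre-aggregation (Druid calls it rollup) can be mimicked in a few lines of Python. The column names and the grouping key here are my own stand-ins, not Druid's actual schema or ingestion config:

```python
from collections import Counter

# Raw events: one tuple per visit, (person, location, day).
raw_visits = [
    ("vishnu", "this-building", "2016-05-01"),
    ("vishnu", "this-building", "2016-05-01"),
    ("max",    "orchard",       "2016-05-01"),
]

# Rollup: one row per (person, location, day) with a num_visits metric.
rollup = Counter(raw_visits)

def num_visits(person, location, day):
    # Answering the question reads one pre-aggregated value,
    # not every raw event row.
    return rollup[(person, location, day)]
```

Three raw rows collapse to two rolled-up rows, and the "visits by Vishnu to this building" question becomes a single lookup.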
Now, if I put all this data into, say, MySQL, I would be storing this entire record together, all the columns side by side. When I ask the same question, if the length of the row is, let's assume, one KB, I would read the entire one KB into memory to give you the answer. But in a column store like Druid, I would only fetch the one column I need, the location, so I'm only fetching a subset of the data; not one KB, but, say, 100 bytes from disk. That actually helps me answer the question more cheaply, let me put it that way. So that's where the column store actually helped. Does that answer the question? Any other questions? Thank you.