Okay, so hi everyone, I am Himanshli, I am working as a tech lead, DB engineering, at Inmobi. Today, as he already told you, I am going to talk about PostgreSQL as NoSQL. PostgreSQL is a powerful open source RDBMS, and we are using it in our company as the master database for many critical applications. And then there is this other term sitting there, NoSQL. If you look at them, they seem like conflicting terms. Actually this NoSQL term always confuses me, because if you look at PostgreSQL's history, it also was not using SQL in the beginning; there was something called QUEL that it was using. But we cannot call PostgreSQL a NoSQL database just because of that, right? So what is it that makes the modern NoSQL DBs different from the existing RDBMS? The main thing, I think, is the freedom in your data model, how you define it. It is schema-less; that is the main thing the NoSQL DBs are providing. So actually my talk is about PostgreSQL as a schema-less database, and how we can use it that way.

What I am going to cover today is: what the requirement for a schema-less database was in our scenario, what data type and volume we had to handle, how we could use our existing setup instead of migrating to a new NoSQL DB, and what performance PostgreSQL gave us when we adopted this option.

So first of all, what is schema-less data? Schema-less does not mean something that is not structured. When we are talking about a schema database, we already have a predefined data model: you define your schema, how you are going to store your data. Here you get freedom from that. Instead of the row-based storage of schema DBs, where each row follows the columns you have already defined, here you have a document-based store, where you have key-value pairs and those can even be a hierarchy of arrays. The thing is, the application needs to be aware of what kind of data you are going to store there. The data can be dynamic, because it is stored as documents, and the data inside one document can be completely different from the next; for example the first document may have five keys and the second may have 25 keys. But yes, your application should be able to handle that: however you are writing your application, it should be aware of what keys you are going to store in your documents.

So, back to the requirement that came to our team. There was an existing application already running on PostgreSQL. Now they came with a requirement where they introduced one new entity called margin. For this entity the data was not generated by us; the data was going to come from the client. And they were not sure about it: for now they only knew there could be a maximum of around 15 known fields. Even this data was not standard for all merchants. For some merchants only five fields might come, for others it could go up to the full fifteen, so it was varying in that range, and the actual shape would only be known when the data came from the client side. So this was the requirement they came to us with.
And in the existing DB, with an existing schema, how are you going to handle that? One solution that came up was: just define a table for it with a maximum of 15 columns, and if tomorrow something new comes, we will go and add one more column. But that does not look that clean, right? Tomorrow, when something new comes, you may also have to go and update the data of the older merchants for that new column you are adding. It was something hacky we could have done, but we thought, okay, it is not that clean a way. And the data was not that small either: to start with they were telling us it could be like one million records per hour.

So, as I already said, the problem with the existing approach was that the schema was predefined, there was a good chance of many null values, and whenever you add any new column you have to do a schema change, which is costly when your data size is huge, and even more so if you have to update the existing values. So we were thinking: should we move this reporting piece into a separate module and look at some NoSQL DB where you can just go and dump your data, because that was what the requirement looked like? But we had read that PostgreSQL provides a document store that you can use; we just were not using it yet.

There are three options that PostgreSQL provides for this: one is XML, the second is called HStore and the third is JSON. I am not going to talk about XML because I have not tested it; the application runs on Java, and the JSON format will be better for them anyway. So what we tested was HStore and JSON; we did some benchmarking to see how they work.

First, HStore. This is not part of core PostgreSQL; it is a contrib module that gives you this feature of storing a document type. Instead of storing your data only at the row level as in a normal relational table, you can store your data at a document level, which is what we wanted. And the plus point is that you still get the RDBMS features on top of it. The keys and values are stored as strings, and the module even provides rich functions and operators to process that data after you insert it. I have pasted some examples here: if you have a document where 'a' is a key, you can fetch what the value of 'a' is; the second one is a condition checking whether the key 'a' is present in the document at all, or whether the pair of key 'a' with value 2 is present in it. Such operators are provided.

It even gives you indexing on this type. There are two kinds of indexes it supports, GiST and GIN. We tested both of them, to see how they behave when you are fetching the data. And when you create such an index, it is not like you have to go and pick particular keys; the index is created over the document level, covering all the keys in it.
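To give a concrete picture of the operators I just described, here is a minimal sketch; the table and column names (report, attrs) are only for illustration, not our actual schema:

    -- assuming a table "report" with an hstore column "attrs"
    -- fetch the value stored under key 'a'
    SELECT attrs -> 'a' FROM report;

    -- filter on documents where the key 'a' exists at all
    SELECT * FROM report WHERE attrs ? 'a';

    -- filter on documents that contain the pair a => 2
    SELECT * FROM report WHERE attrs @> 'a=>2';

    -- the two index types we compared, built over the whole document
    CREATE INDEX report_attrs_gist ON report USING gist (attrs);
    CREATE INDEX report_attrs_gin  ON report USING gin  (attrs);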
You can also use normal expression indexes here, the way you create them in a normal RDBMS: in PostgreSQL, if you have a text field, you can create expression indexes over it, and the same kind of indexes can be used on this type if you have such a requirement. I have pasted some examples of how it looks. It is a normal extension, so there is a very simple command: you just run CREATE EXTENSION hstore, and now that extension is created in your database. Then you can use it: this is how you create the table, how you create the index, and some queries showing how, using normal SQL, you retrieve the values by their keys and apply filter conditions.

The second approach we tested that PostgreSQL provides is JSON, which I think is very famous these days. It was introduced in the 9.2 version; the current stable version is 9.3. In 9.2 they introduced it just as a JSON type, but internally it was being stored as text only. The only thing it provided was a validation layer when you insert the data into the table: it validates that it is proper JSON format, but internally it is stored as normal text. So it is not that good in performance; it is like normal text storage, and whenever you are retrieving, the query becomes very costly. That was the problem with 9.2. In 9.3 they added some more functions for fetching the data and doing different operations on the JSON type, but the storage works the same way, and that is why, as I will show in the next slides, the performance was still not good; I tested on 9.3 only. That is what they are going to fix in 9.4, where they are introducing the binary JSON format: instead of storing it as text, they are going to store it in a binary form, and with that they are also giving you more indexing features, so retrieval is going to be better, and some more functions are being introduced as well.

Other than this, for both HStore and JSON, if you want somewhat better performance there is the unlogged table feature that PostgreSQL provides, which we also tested. Normally every change to your data is logged in something called the xlog, the write-ahead log. If you want your queries to be faster, and you do not care that the data in that table will not be available again after your server goes down and comes back up, then you are choosing performance over reliability, and you can use the unlogged feature.

There are also some functions to migrate from HStore to JSON if you want: say you started with HStore, and now 9.4 has come and you want to move to JSON. These functions are provided to help you fetch from HStore and convert into the JSON format.

So here is some testing I did, just to convince them that we could go with this approach. For the data, I created a CSV file where I generated two million records, and it was tested on a normal laptop with 8 GB RAM.
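Before going to the numbers, here is a rough reconstruction of the kind of commands that were on those slides. The names (margin_report and its columns) are illustrative, and the unlogged option is only included to show the trade-off I mentioned, not as a recommendation:

    -- enable the contrib module, once per database
    CREATE EXTENSION hstore;

    -- UNLOGGED skips the xlog/WAL: faster writes, but the table is
    -- emptied after a crash, so only use it if you can reload the data
    CREATE UNLOGGED TABLE margin_report (
        id    serial PRIMARY KEY,
        attrs hstore
    );

    CREATE INDEX margin_report_attrs_gist ON margin_report USING gist (attrs);

    -- documents with different numbers of keys go into the same column
    INSERT INTO margin_report (attrs) VALUES
        ('merchant=>m1, margin=>0.12'),
        ('merchant=>m2, margin=>0.30, region=>apac, currency=>inr');

    -- retrieve a value by key with a filter condition, all normal SQL
    SELECT attrs -> 'margin' AS margin
    FROM   margin_report
    WHERE  attrs @> 'merchant=>m2';

    -- 9.3 ships a helper to convert hstore documents to json, if you
    -- later want to move over (for example to jsonb once 9.4 is out)
    SELECT hstore_to_json(attrs) FROM margin_report;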
PostgreSQL was installed there and Mongo was installed there. The first thing tested was the write speed, for those 2 million records; the graph shows how many records per second were being inserted. There was plain HStore, HStore with a GiST index, HStore with a GIN index, then JSON, and then Mongo, where Mongo had the index on the primary key. This was the result, and GIN was the slowest in the insertion part.

Next was how much volume it takes on disk. If you look at why GIN was slow on the writes, it is in how it stores the data: the GIN index size is much larger than the GiST index size. JSON came out close to the plain case, since it is just written as text. MongoDB's size, if we compare, was much bigger; I am not sure exactly how they store that data.

Then some selection queries were run: the first I tested with the filter condition on the primary key, and the other with a filter on some text key. This is the result that came. When querying on the primary key, GiST and GIN were not that good in comparison with JSON and Mongo. The next one was when the query filters on a key present inside the document; in that case they were doing well compared with Mongo's time, in milliseconds.

So what conclusions did we draw from all this testing? Yes, we can handle schema-less data here, and the storage is really very efficient with GiST indexing, and you can also build indexes on expressions. Since we did not find JSON good with this version, the real comparison was between the two HStore index options, and GiST was looking better when your data is dynamic, because retrieval is much faster with GiST indexing. So we decided to go with HStore plus GiST.

So what are you getting with PostgreSQL here? We did not change much: whatever the existing system was, you just install the extension, and then you keep using your SQL, your transactions, your constraints as before. Constraints you can still use: if you want one more layer of validation you can put check constraints over that column, and you can even do normal joins, and on top of that you get the schema-less features, and we saved the migration cost and a new setup of NoSQL DBs. The only problem here is that these are not standard RDBMS features; they are only for PostgreSQL. So if tomorrow you want to migrate to some other RDBMS, that is not possible; in that case you would need to move to NoSQL, at least for this reporting part.
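For the constraints and joins point, a small illustrative sketch, again with made-up names; the idea is just that the hstore column behaves like any other column in DDL and queries:

    -- an extra validation layer: every document must carry a merchant key
    ALTER TABLE margin_report
        ADD CONSTRAINT attrs_has_merchant CHECK (attrs ? 'merchant');

    -- a normal join between the document store and a relational table,
    -- pulling the merchant code out of the document
    SELECT m.name, r.attrs -> 'margin' AS margin
    FROM   margin_report r
    JOIN   merchants m ON m.code = r.attrs -> 'merchant';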
Yeah, thank you. Any questions?

Q: Quick question. You spoke about HStore, and I saw in one of your initial slides you had one million rows per second, or something?

A: It said one million per hour, sorry, per hour. That was the production requirement when the application came; that was the data they were expecting. The graphs I am showing are from when we tested with that much data.

Q: Okay. So your HStore is going to require some storage system, right, because it looks like you are getting a lot of data. How do you store it, is it backed by SAN or internal disk?

A: Internal disk. And for managing that much data we also introduced table partitioning, an hourly thing: normally every hour we have one partition. The data comes in the format of a CSV file, so we make sure the CSV file has the hour mentioned, and it goes directly into that partition.

Q: Okay, and when you need to do your cleanup? Because you are talking about a document store now, it is not like an RDBMS system where cleanups are very easy to work on. How easy is it, are there challenges in keeping your database cleaned up?

A: In our case it is simple: we want to keep data for seven days, and for that you just go and delete the old partition. That's all.

Q: So it is basically a date-range partition system that you maintain. Okay, cool, thanks.

Q: Up to what volume of data have you seen the performance stay stable? How many GBs of data have you seen the performance remain constant, predictable?

A: As I told you already, we are only keeping data for seven days, and you have indexes on each partition, so till now it is actually working well.

Q: For seven days, around how much data would that be?

A: It will be around 75 GB.

Q: 75 GB.
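Since declarative partitioning does not exist in the 9.3 series, the hourly partitions mentioned in the answers above would be done with table inheritance. A rough sketch of that kind of scheme, with illustrative names and timestamps, not our exact production setup:

    -- parent table holds no data itself
    CREATE TABLE margin_report_parent (
        recorded_at timestamptz NOT NULL,
        attrs       hstore
    );

    -- one child table per hour, constrained to its hour range
    CREATE TABLE margin_report_2014060110 (
        CHECK (recorded_at >= '2014-06-01 10:00'
           AND recorded_at <  '2014-06-01 11:00')
    ) INHERITS (margin_report_parent);

    -- each partition carries its own document index
    CREATE INDEX ON margin_report_2014060110 USING gist (attrs);

    -- retention: keeping seven days just means dropping old partitions
    DROP TABLE margin_report_2014052410;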