Okay, so this is the DevOps talk, and it's going to be completely about databases. It's based on an actual case study I did. We had some specific data we were working with at Naukri, and we wanted it stored somewhere for analytics purposes. I'm not saying the data had social-media-style relationships, but we weren't sure how its relationships would evolve: we didn't have a specific schema from the beginning, and we wanted to store it in some fashion so that it could be retrieved as fast as possible. In that process I realized that not everything fits in rows and columns. So I'll be talking about the specific problems and challenges I faced there and how I solved them in that context. Let me set the agenda first. I'll talk about relational data — what the data is supposed to be and how the relations actually appear; a little bit about graph databases — what they are and how they help; the graph database I found as a solution while doing my experiment; and dgraph-orm, which I created for certain reasons.
So what exactly is relational data? Relational data, in my opinion, is data where the relation is just as important as the data itself. If you look at the way the industry is moving and the way we deal with data these days, we have more metadata than data: we store one specific piece of information, our core data, and around it we store a cloud of information, which I'm calling metadata. That metadata is related to the data we had, and it creates relationships that can go anywhere, from one place to another. The relationships I'm talking about can be one-to-one, one-to-many, or many-to-many, as we've seen in RDBMS in general. So let's take some specific data. Suppose this is movie data: I have a list of movies, a list of directors, and a list of genres. In this experiment, we'll try to create a schema for this data, put it into an RDBMS and into NoSQL, and see what sort of challenges we find there.
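To make the experiment concrete, here is a minimal sketch of the kind of dataset described — plain JavaScript objects, with names taken from the example used later in the talk (Twilight Zone, Steven Spielberg, horror). The many-to-many shape of `directorIds` and `genreIds` is exactly what causes trouble in the schemas that follow:

```javascript
// Toy version of the movie dataset: movies relate to directors and
// genres many-to-many, so resolving a movie needs extra lookups.
const directors = [{ id: 1, name: "Steven Spielberg" }];
const genres = [{ id: 1, name: "Horror" }];
const movies = [
  { id: 1, name: "Twilight Zone", year: 1983, directorIds: [1], genreIds: [1] },
];

// Expanding one movie means one extra lookup per related record.
function expand(movie) {
  return {
    name: movie.name,
    directors: movie.directorIds.map(id => directors.find(d => d.id === id).name),
    genres: movie.genreIds.map(id => genres.find(g => g.id === id).name),
  };
}

console.log(expand(movies[0]));
```

The talk's point is that both the RDBMS junction tables and the NoSQL split-collection approach end up re-implementing something like `expand` in application code.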
So if I ask you to create a schema for that specific data, your first approach would probably be exactly this one: you create a movie table, put your directors and genres into two separate tables, and link them with foreign keys. But you might realize later that a single movie can have multiple directors. In that scenario, you need to migrate your movie table: you have to introduce another table just to keep the movie-director relationship the way it's needed, and you have to do the same thing for genre, because one specific movie can belong to multiple genres. So you see the challenge: initially we were not aware of this problem, and we figured it out along the way while building the application or the product. The learning from this is that RDBMS forces us to fix our schema: we need to decide the schema beforehand, before we can explore how our product will grow, we have to decide how our data will be stored. And because joins are costly, we tend to duplicate data — we store some data in specific tables to reduce the joins required in that relationship schema. We also end up with a very complex mapping process: the third table we created has only one job, storing two IDs for two different tables, and nothing else. That table is added complexity in the whole schema; every time you try to access information from the parent table, you have to join through it and the other table to get the whole set of data. And most of the time, because counts are something we filter and sort on, we tend to pre-compute the counts in the RDBMS, and that pre-computation is done completely through application logic, so we have to maintain that logic somewhere in our application.

So let's do the same thing in NoSQL and see. This is how a movie document would look: you have the movie information, and inside it you embed the director information and, in the same way, the genre information. This seems very convenient when you only have to fetch the information, but think about what happens when a director's information is updated: now you have to find that director in all the movies, update the information there, and save it. That's very hard to maintain. To solve that problem, we can split this into separate collections and keep IDs, just like we did in the RDBMS — but we cannot join these three collections to actually get the data. So ultimately, we have to run n + m + 1 queries internally to get the whole set of information for one single movie, where n is the number of genres and m is the number of directors it has. So what are the actual problems in NoSQL? The first is that it kills the relationships: we cannot maintain relationships the way we can in an RDBMS, so NoSQL may only be helpful if your data doesn't really have relationships. We also end up duplicating data in NoSQL, as you've seen — the director information is copied into every single movie document. And the other challenges we talked about for MySQL are still there in NoSQL as well.

So what is the solution? The solution seems to be graph databases. Why? Because they are sparse in schema: they don't force us to define the schema at the beginning. The schema can be defined by the way our data is stored — we just start storing data and building relationships, and the schema is created in the background; we don't have to worry about the schema at all. There's no data duplication, because now we store our data as nodes and edges, where the edges are the relations between nodes, and each node holds one individual object or value. Any join is a single lookup, because in a graph you are not pointing to an index, you are pointing to the actual data: you don't have to go to an index table to figure out where the data is, the data is directly accessible — you just follow one specific edge and you get that information. Counts are very cheap, because you know exactly how many edges are attached to the node you are currently on, and you can count the related nodes if you want to. So graph databases seem to solve all the problems we had — then why aren't they being used? Okay, I think I just missed one slide: if I change that movie data into an actual relationship format, it pretty much looks like a graph, like this. A graph would be the best way to represent and store that data — if converting the relationships with the data we had gives us a graph, why can't we store that data specifically in a graph? The reason is that there are some myths around graph databases. The first myth is that everybody thinks it's overkill: we're only going to have a very minimal amount of relationships in our database, so do we really need that kind of infrastructure and that kind of time and effort invested? The second myth — the question I ran into every time — is: are they robust, can I use one as my primary database? Most of the graph solutions I've seen are built on top of some other database; they are simply graph layers rather than complete graph databases. Scalability is also a concern, because the most popular graph databases in the market right now don't scale the way they're supposed to — not the way we can scale MySQL or MongoDB — so scalability has always been an issue with graph databases. The answers to those questions are pretty simple, at least from the way I have experienced and explored graph databases. I found it is not overkill, because getting set up was very easy: it took me around five to ten minutes to get started with a graph database instance, create a schema, and do basic insert and query operations. They can act as your primary database, because the one I'm going to talk about here is not a graph layer, it's a complete graph database: you have your storage system, your cluster management system, and so on. And third, they are scalable: the graph database we'll be talking about here is scalable by default — you don't have to run one single instance; by architecture it's divided into two or three groups, and it scales internally by itself. These are the problems I'm talking about with the current graph databases. If you've seen Neo4j: Neo4j has a single-server architecture, so you have one specific graph layer and you need to put a mammoth machine behind it to support a larger data set.
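To make the "any join is a single lookup" and "counts are very cheap" claims concrete, here's a toy in-memory sketch (plain JavaScript, not Dgraph): each node holds its values plus direct references to neighbor nodes, so following a relationship is a pointer dereference and counting related nodes is just reading an edge list's length:

```javascript
// Toy property-graph: nodes hold values and direct edges to other
// nodes, so "joins" are pointer dereferences, not index lookups.
const spielberg = { "director.name": "Steven Spielberg", out: {} };
const horror = { "genre.name": "Horror", out: {} };
const movie = {
  "movie.name": "Twilight Zone",
  out: { "movie.director": [spielberg], "movie.genre": [horror] },
};

// Traversal: follow the edge straight to the neighbor's data.
const directorName = movie.out["movie.director"][0]["director.name"];

// Counts: just the length of the edge list for that predicate.
const directorCount = movie.out["movie.director"].length;

console.log(directorName, directorCount);
```

Contrast this with the RDBMS version, where the same traversal goes through a junction table and two index lookups, and the count requires a join plus aggregation.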
And similarly, the other databases like Cayley, TitanDB, and DSE Graph are not graph databases but graph layers: you need a MySQL or MongoDB where your source of truth lives, and these layers pick information from it and convert it into a graph in memory. And in the industry there is no standard for graph databases the way we have standards for RDBMS and NoSQL, which is why every company has created its own graph database or graph layer: Google has the Knowledge Graph, Facebook has TAO, Twitter has FlockDB — name any company doing anything smart and they use some sort of graph solution, because that's how they achieve the performance they want.

So let's meet Dgraph. Dgraph is an open-source, transactional, distributed graph database, completely written in Go; you can go to dgraph.io to check out more of its features. The features are fairly simple. It's fast — it's actually built like a search engine; the founder of Dgraph is ex-Google, he was working on their search engine team, and he created the whole initial solution. Now it's a community project, and they have an enterprise offering as well. It supports distributed joins, filters, and sorts: if your data sits on multiple nodes, you can filter by them, join them, and get the complete data set the way you want. It's horizontally scalable, because you can have your data distributed across multiple nodes the way you want. It's consistent, because it uses Raft internally for replication. And it's highly available and fault tolerant by nature — the way they have architected Dgraph, it's fault tolerant and highly available.

We did a benchmark of how it performs against two famous graph databases in the market, Neo4j and Cayley, and these are the results. When it comes to data loading, Dgraph is about 100x faster than Neo4j: if you have one terabyte of data and you want to load it into a graph, it's around 100x faster. And when we started querying that data, it was three to six times faster, because of the way storage happens: Neo4j stores the predicate information one way, while Dgraph uses its UID structure — we'll talk more about this in the data modeling section. Similarly, we've seen improvements in the comparison with Cayley as well. Two benchmarks have been published, one done internally by the Dgraph team and one by the community itself; the two links are there, so you can go and check the details.

The Dgraph architecture is fairly simple. You have a Zero layer — there can be one Zero or multiple, depending on your need — and this Zero group manages your cluster: how many groups you have, how many storage engines you have, all of that is managed by Zero. Then you have an Alpha layer, which is the actual storage layer; it talks to your cluster layer for information about where to put what, and so on. These two things make up your Dgraph cluster. But there is one other thing, called Ratel: a UI application which lets you interact with your Dgraph cluster. You can have one Ratel and, behind it, a system of multiple Dgraph clusters, and it gives you information like how many predicates you have created, how many UIDs have been generated with the current setup running, and what sort of data is available in those predicates and UIDs.

For data modeling in Dgraph: you have UIDs, which are generated automatically — you don't have any control over that; you have predicates, where you define what sort of key you want your values stored under; and then you have the values themselves. In this particular example, the UID 0x1, which is a hexadecimal number, has a predicate called director.name, and I am assigning "Steven" to that name. The combination of the UID and the predicate becomes the key for the actual graph storage engine, and the values can be single or multiple — you can have an array of data stored against one single predicate as well.

The query language in Dgraph is very interesting. We needed some sort of query language that could handle the recursive nature of the graph, and we thought of creating a new query language, but we ended up using GraphQL. How many of you are familiar with GraphQL? Okay. The thing I liked about GraphQL is that it lets you query recursively: you can query one part, then another part, and another part, and so on — you can form a recursive approach to querying information, and that's how a graph query language is supposed to be. So we picked GraphQL, added some things and removed some things, and as of now we are calling it GraphQL+- — we can come up with a better name later. It's completely inspired by GraphQL, but modified for better database support, because GraphQL was built to query data from APIs, not databases: functions and other database support are not there; you only have queries and mutations in GraphQL. But there is more: you can apply filters, you can use indexes, and so on. We are also trying to evolve GraphQL+- to fully support GraphQL, because that's more familiar and there are lots of clients available for it, but I don't know how much time that is going to take. If you want to play with GraphQL+-, you can go to play.dgraph.io, a playground we have set up for you.

So, how easy is it to get started with Dgraph? With Docker you can just run one specific command and your cluster is ready — similar to what you'd do with Mongo or MySQL — so it's not a hectic job to set up. There are also binaries available for Linux, Mac, and Windows; you can go and install them the way you want. This is what I really liked about Dgraph: I don't have to figure out a lot of dependencies around graph layers and build an understanding of how they fit together before I can start building and investing time in our graph database — just like I might have tried out MySQL, I can try out this particular database as well. And the schema is fairly simple, although it's completely optional: you can create a schema beforehand, or you can just start inserting your data and the schema will be generated automatically.
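For reference, the single-command Docker setup mentioned above looks roughly like this — the `dgraph/standalone` image name and port numbers are the commonly documented ones, and the exact flags may differ across Dgraph versions:

```shell
# Start a single-node Dgraph cluster for local testing.
# 8080: Alpha HTTP endpoint, 9080: Alpha gRPC, 8000: Ratel UI.
docker run --rm -it -p 8080:8080 -p 9080:9080 -p 8000:8000 dgraph/standalone
```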
So if we take the previous example, which we did in RDBMS and NoSQL, this is the schema we'd create in Dgraph. You have one predicate called director.name, which is of type string; I'm indexing it with exact and full-text — we also have trigram, term, and other indexes on strings, which let you use regex to match information and so on. And if you look at the other fields, movie.director and movie.genre, these are of type uid: a uid doesn't hold a value, it points directly to the next piece of data, and that's where the relationship actually forms.

If we want to add data into the system, it's fairly simple. This `_:steven` is a blank node — a UID we refer to through a variable, used as a reference internally within the whole mutation. This single mutation inserts the whole information for that one movie: Twilight Zone, which has director Steven and which has the genre horror. Then you see that three UIDs are generated — 0x1, 0x2, and 0x3 — and internally this is how our current data set will look: one node with UID 0x1, which has the predicate director.name with value "Steven Spielberg"; another node with UID 0x2, where we store the genre; and similarly our movie-specific node, where I store just its name and its year, because those two are actual values, while director and genre are the relationships — the edges — so those two point over to the other nodes.

Then, if you want to query all the data back in JSON format, this is a very simple query. It looks similar to GraphQL, as you'd expect, but there are a few things, like the functions you see, which were added on top of the GraphQL system; everything else is pretty much the same. And if you look closely, I'm also using aliasing there: movie.name becomes name and movie.year becomes year in the actual data, and you can see that information in the result.

Now, the problem: when I started with this specific thing, I was working with Node.js, and this was completely new for me — I had never worked with an RDF system or learned how RDF works. GraphQL was fine, but GraphQL+- was the thing I was struggling with. So I thought: why can't we have something similar to Sequelize, something similar to Mongoose and the other ORMs we have in the market for Node.js? And here it is — I created my own: dgraph-orm. What it does is simplify how schema creation happens, how you run a query, and how you run a mutation. It's heavily inspired by Sequelize and Mongoose, and internally it uses dgraph-js, the official Dgraph client library for Node.js; you can find it at that particular link. The schema I showed you earlier becomes this in your Node.js code: you import dgraph-orm, you create a new schema, you define what the schema will look like with all its properties and so on, and then you generate a model out of it. That will internally generate this particular RDF notation of the schema, and that schema is applied internally on Dgraph.
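The idea of "the schema object becomes Dgraph schema notation" can be sketched like this — note this is a toy illustration of what such an ORM layer does, not the real dgraph-orm API:

```javascript
// Toy sketch (NOT the real dgraph-orm API): render a JS schema
// object into Dgraph schema lines, to illustrate the ORM's job.
function renderSchema(name, fields) {
  return Object.entries(fields)
    .map(([field, opts]) => {
      const index = opts.index ? ` @index(${opts.index.join(", ")})` : "";
      return `${name}.${field}: ${opts.type}${index} .`;
    })
    .join("\n");
}

const lines = renderSchema("director", {
  name: { type: "string", index: ["exact", "fulltext"] },
});

console.log(lines);
// director.name: string @index(exact, fulltext) .
```

The real library also handles models, queries, and mutations on top of dgraph-js; this only shows the schema-rendering step.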
Similarly, when you are trying to create a movie, this is how you do it. In this movie snippet, the director and genre were created earlier, which gave you the UIDs 0x1 and 0x2; I'm just using a smaller snippet to fit the screen. And this is how the actual data will be generated for dgraph-js internally. And similarly for querying: in a graph, you cannot just do a "select *". You need to define one root node from which to start your traversal — that is how a graph works. If you look at this example, I'm directly providing a UID, and then I can find all the relationships related to it. But you don't have the UID all the time, so Dgraph gives you many ways to find the root node: you can query with eq, defining one value that must equal a predicate; you can use trigram matching — "these are the few characters I know, find what matches"; there is regex pattern matching, so you can give a regex to find your root node; and you can use greater-than, less-than, and similar comparisons on various data types. It also supports geo coordinates as of now, so you can say things like "give me all the users within 200 kilometers of these specific coordinates". All of those are already implemented and available — you can go and check — and this generates the same function we had earlier in our Dgraph example.

So who is using Dgraph? There are lots of clients using the enterprise version, and we can't reveal their names as of now because of agreements they've signed, but there are two open-source projects that started very early. One was started by Intuit — Catalyst — and they are internally using Dgraph as their graph layer. The other one was from VMware, where they created a new tool, Purser, and they are using Dgraph outright as the database.

So, the takeaways from this talk. First: don't be scared of graph databases. Use them, try them, play with them, and then figure out whether they strip some of your application logic down into the database or not, and whether they are helpful for maintaining any sort of relationship, complex or minimal. They can be used as a primary database, the way we think about MySQL or MongoDB these days. They scale in no time, so you don't have to worry about scaling them: as I mentioned in the architecture, you have Zero and Alpha — the two things required in a single cluster — and each can be one or many, so you can distribute or scale your graph system in no time. And they have better performance — I'm not only talking about complex logic. Even for this movie information, which is very minimal relationship data, when we tried to put it into an RDBMS or NoSQL we hit certain challenges, and to overcome them we had to write lots of code in our applications. And we know that the next day the product manager will come and say, "okay, now we're changing this particular thing, because our data is going to have another piece of data attached," and so on. Earlier, those things were not that big a problem, because we brainstormed enough up front and didn't have that many changes in the system's data; but these days, the way we store and consume data, the data changes in a very rapid fashion, so we need somewhere to put those changes and build solutions around them quickly. And similarly, getting started with Dgraph is fairly easy: the Docker setup makes it as simple as starting with MongoDB or MySQL.

Here are some references. The dgraph-orm package is also available on npm, so if you want to try it out, it's there. I think these slides will also be available to you after the talk, in the usual place, and these are the links to the documentation, the architecture, and so on. Thank you. Any questions?

[Audience] My question is: for example, if I have a node at an edge at any of the layers — in a graph the root node can lie anywhere down the tree — so if you're trying to query from the director name, say the director name is Spielberg, how much time did it take just to run that query?

[Speaker] If you go to play.dgraph.io, the movie data is already available there as of now — we've stripped down IMDB to a certain level and put that data up. As for the data set I had: I was building a specific application using Dgraph where we query our user analytics data. At naukri.com, we store a lot of information from the user side — how they move from one page to another, how they click, their whole behavior — and we store that information in an anonymous fashion. That data as of now is around 230 gigabytes, and if I have to query one specific node — something simple like "get me all the mouse clicks" so I can plot a heat map — that takes around 120 to 130 milliseconds for the whole operation, if I'm querying from within the same cluster; there would be some additional network cost if calling from somewhere else. So that's the sort of speed we have there.

[Audience] I just had an idea: for this kind of exact-match query on the director name, what if you create an index of it with the node IDs — like you said, 0x1 with the names, in an RDBMS kind of fashion — and over there you do a perfect match and then just fetch that node by its node ID?

[Speaker] The implementation can be done in many ways — you could even use Redis to just map your director names to your UIDs — but in general, because Dgraph internally uses indexes on those specific values, you get the kind of behavior you're used to from an RDBMS or NoSQL, since they also use indexes on values. If you have the UIDs directly, queries perform considerably better compared to those two databases. But when you have fuzzy matching — if you're going to use trigram, which backs regex matching for your string values — it becomes a little more expensive, because now you have to deal with a very complex indexing system internally. So that's the one thing: you have to be a little careful about what sort of indexes you apply to what sort of values.
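For reference, the two flavors of root-node lookup discussed in this answer look roughly like this in GraphQL+- — the predicate name follows the talk's movie schema, and the exact syntax may vary across Dgraph versions:

```graphql
{
  # Exact match: cheap, served by the exact index on director.name.
  byName(func: eq(director.name, "Steven Spielberg")) {
    uid
    director.name
  }
  # Regex match: needs a trigram index and is more expensive.
  byPattern(func: regexp(director.name, /Spiel.*/)) {
    uid
    director.name
  }
}
```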