Good morning everybody. It's so nice to be here. Can everybody hear me okay? This is the first time I'm doing a talk with headphones on, so I'm just going to take some time to warm up. So yeah, I'm Shreemati. Before we get started on the topic, I wanted to find out if anybody here has heard of graph DBs. How many of you, show of hands? Okay, pretty much half the room, I think. And how many of you have used one? I mean, how many of you have used one in production? Used one in production successfully? Okay, there's one hand, two, okay. Awesome. Nice to have you here. So the goal today is to walk through the process of selecting a graph database and using one, the things to keep in mind while doing that, and some of my learnings. Quickly about me: I have been doing software development for 12, 13 years now; I've stopped counting after a point. I have spent my time building scalable back-end systems, high-performance websites, mobile apps and such. I work for a company called Sahit Software Solutions. I've been with this consulting firm for the last five years, and the good thing about being in consulting is that you get to work on different problems, tough problems. Most recently I've been working on big data platforms, that is, building data pipelines to take data science models to production, which means a lot of PySpark, Airflow and things like that. But today I want to talk about a couple of projects where I've used graph DBs to store information in a graph model and used graph queries to navigate through highly connected data sets, and my plan is to share some of my learnings. So quickly, what we're going to talk about is: what is a graph DB and why should you care, and when to use one. That is the key question I want to spend some time on and share some of my understanding of this space.
And we'll get into a type of graph model called property graphs, which underpins the most popular OLTP graph databases out there. We'll talk about Apache TinkerPop, an open source initiative that has been around for seven to eight years now to standardize graph databases and provide a standardized query language for them. And we'll look into what it takes to query a graph, to traverse a graph and get results out. First of all, this will be a basic talk; I have 30 minutes and this is a large topic. Also, I'm a mother of two girls, and my six-year-old is here today. Thank you, Paikan, for the childcare support; just wanted to say that. In line with that, I'd like to keep this talk simple so that there are takeaways for people and more interaction as well. Happy to talk more after the talk too. So, no talk on graph databases can start without mentioning graph theory. Graph theory began with a geospatial problem, the famous maths problem called the Seven Bridges of Königsberg. Königsberg is a city that is now in Russia, present-day Kaliningrad. Back in the 1700s there was a puzzle about it. The city is divided by the Pregel River, and seven bridges connected its parts. The problem was to devise a walk through the city that would cross each of the bridges exactly once. Leonhard Euler took up this problem and came to a negative resolution, proving it cannot be done, but in doing so he laid the foundations of graph theory: he modeled the problem as a set of vertices and edges and reasoned about how to traverse each edge only once. So what is a graph? A graph is nothing but a tuple of two sets: a set of vertices and a set of edges. The edges are the relationships between vertices, and they help you understand the relationships within your data. So quickly, let's look at the databases landscape out there.
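That definition, a graph as a tuple of a vertex set and an edge set, can be sketched in a few lines of plain Python, using the seven-bridges layout as toy data. The land-mass names here are my own labels for illustration, not from the talk:

```python
# A graph is just a tuple of two sets: vertices, and edges between them.
vertices = {"north_bank", "south_bank", "kneiphof", "lomse"}

# Each bridge is an edge between two land masses. Königsberg had seven
# bridges, several of them parallel, so this is a multigraph (a list,
# not a set, to allow duplicate edges).
edges = [
    ("north_bank", "kneiphof"), ("north_bank", "kneiphof"),
    ("south_bank", "kneiphof"), ("south_bank", "kneiphof"),
    ("north_bank", "lomse"), ("south_bank", "lomse"),
    ("kneiphof", "lomse"),
]

graph = (vertices, edges)

def degree(v):
    """Number of bridges touching a land mass."""
    return sum(v in e for e in edges)

# Euler's observation: a walk crossing every edge exactly once needs
# zero or two vertices of odd degree. Here all four are odd, so no
# such walk exists.
odd = [v for v in vertices if degree(v) % 2 == 1]
print(len(odd))
```

This is the whole of Euler's argument in miniature: count degrees, count the odd ones, and the impossibility falls out.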
So there are different data models out there. We're all familiar with the relational DB; it's the most popular kind of database out there, and the bulk of the web runs on relational DBs. But there are challenges when it comes to scaling, say when the data runs into terabytes and petabytes. This gave rise to the NoSQL movement, where lots of solutions have evolved, each catering to a specific need. What you're seeing here is the databases landscape, arranged from the simplest data to the most complex representation of data, where complex means highly interconnected relationships. That's your graphs; that's what we're going to talk about today. But quickly, to compare with the others: there's the key-value store, where there are no real relationships and everything is atomic; you give a key and a value, and it stores it. Your Redis and Riak, your caches, are on that end of the spectrum. Then we have the tabular databases, your wide-column stores, which let you store a little more than key-value pairs in a flexible way, but once again with no connectedness between elements. Then we have the document DBs, your MongoDBs and the like, where some connectedness can be expressed within an atomic document; across documents we can have joins, but they're not as powerful, and there's no concept of referential integrity and such. And then we come to graphs. One other thing to keep in mind about relational databases is how difficult things get when you have relationships across four or five tables. We've all been there; we've written those joins, had performance problems, and tuned them. We'll look at how graphs change this as we go along. So quickly, this is a picture from DB-Engines.
This is something I've been following for the last few years, and the trend in graph DBs has been soaring ever since 2013. Prior to that it was primarily academia trying to understand how to exploit graphs, but recently enterprises have started looking at them and have begun to see that some problems are better solved using graphs. I just thought I'd share that. Quickly, a real example of how we use graphs every day: this is a snippet produced by a Google search for the album Imagine. What we get immediately is a list of data points about Imagine: the artist, the date the album was released, the studio information, the producers and the genres. This is contextual to the music industry. Similarly, if you search for a movie, you immediately see the context switch to producers, music directors and so on. The context changes, the semantics change. Such results are produced by what Google calls its Knowledge Graph. This is one example we can immediately relate to of the power of a graph and its ability to produce results: it's a very dynamic structure, highly contextual, with flexibility and scalability when it comes to running such queries. So why graphs? If I had to summarize in one line where graphs make sense, it would be: where the connections and relationships in your data are of extreme value. Some of the popular applications: like we saw in the previous slide, knowledge graphs. "Knowledge graph" is a loaded term. It started out as a semantic understanding of a particular domain, but it has now moved toward problems like data being kept in silos in different parts of an enterprise.
To give you an example, in retail there's the in-store experience and there's the online experience. So there's the customer in two places, with data in silos: one system powering the store and another powering the website. Now how do you unify these customer journeys? Graphs have been part of trying to unify these journeys and bring the context together; these are referred to as customer 360 problems, and that's how enterprises are looking at it. Network monitoring is another application. Where is the point of failure in a network? Different components make up a system, so if one component goes down, what is the impact on the network? Where are the failure points? If we had to model this relationally, you'd be thinking recursive CTEs to get at the problem points, whereas in a graph it's a few lines of code; we'll see a bit of this in the slides to come. Then recommendation systems are one popular application. Everybody thought graphs were going to be a game changer for recommendations: if somebody likes something, what is the chance of another person liking it too? But it doesn't scale really well. On a small dataset, a recommendation system using a graph will work, but on a large one you're better off looking at ML pipelines to get your recommendations out. Then there are social networks; I don't think I really need to talk about those. Who follows whom: those answers can be got from a graph. And another one is identity and access management systems: who has permissions on a particular folder? You want to manage these permissions by group or by particular user, and you want to get these rules in place.
A graph seems to be a good solution for some of these problems in an easy way. So everything is a graph, right? Everything we see around us, any data model, everything is related. So can we solve everything using a graph? The short answer is that no one tool can solve everything. We can see that with the advent of microservices: you look at every problem in its own space and try to choose the right technology for it. I have a few examples, so let's walk through some small use cases to drive home this point. Imagine the questions you're trying to answer from your data model were something like: get me everybody who is attending this conference; find me everyone with the name John; find me everyone who lives in Chennai. If you had queries like this, what would you choose from the databases landscape? For such things, your relational database or a good search engine is good enough; you don't really need a graph to answer any of these questions. Whereas imagine you're looking at related data. What is the way to get introduced to somebody of influence? There's a score; somebody has a high score and you want to get introduced to them; what path do you take? How do person A and person B know each other? How is company X related to company Y? These are the cases where a graph is a good candidate. One of the problems I worked on was exactly that: how is company X related to company Y? This was about putting up a dashboard of companies and drilling into each company to understand its connections. But once again, going back to which technology to use for which problem: for something like a dashboard, a paginated view, you would be traversing the entire database to pull a row of data or to search for something.
In that case, something like Elasticsearch is a better option. But to show a particular company and its relationships, a graph is the better fit. This was a case where, in real life, we chose a graph even for the dashboard, felt the pain points, and had to backtrack and put in a different technology for that purpose. Something like aggregation, how many people are here, what is the average number of attendees, things like that: RDBMS. We don't even need to think about a graph for aggregation queries. But if you're trying to do something like pattern matching, whose profile is similar to whose, or is this user the same as Johannes, these fuzzy searches, then the answer pivots to either a graph or a search engine; it totally depends on your context. If you have a hammer, everything looks like a nail, and that happens with graph DBs; hence this exercise of showing that it's important to think about the problem and its needs, and to solve for that. So quickly, moving back to graph databases, what does the scene look like? There are two main graph data models. One is property graphs, which we'll look at in detail in the upcoming slides, and which is what is being standardized by the Apache TinkerPop community. On the other end of the spectrum we have RDF graphs. RDF stands for Resource Description Framework; it is a W3C standard for data exchange, a model for sharing data where the data itself is stored as a graph. RDF is more popular in academia; use cases around the semantic web seem to focus on RDF, whereas the OLTP scenarios seem to go with property graphs.
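As a tiny illustration of the difference between the two models (toy data of my own, not from the talk): in RDF everything, including attributes, is a subject-predicate-object triple, while a property graph hangs attribute maps directly off vertices and edges.

```python
# RDF-style: a flat list of triples. Attributes and relationships
# look the same.
triples = [
    ("marko", "knows", "josh"),
    ("marko", "name", "Marko"),
    ("marko", "age", 29),
]

# Property-graph-style: vertices and edges carry their own property maps.
vertex = {"id": "marko", "label": "person",
          "properties": {"name": "Marko", "age": 29}}
edge = {"out": "marko", "label": "knows", "in": "josh"}

# The same question asked of both models: who does marko know?
rdf_answer = [o for (s, p, o) in triples if s == "marko" and p == "knows"]
pg_answer = ([edge["in"]]
             if edge["out"] == "marko" and edge["label"] == "knows" else [])
print(rdf_answer, pg_answer)
```

Both answers come out the same; the difference is in how attributes are modeled and, in practice, in the query languages built on top of each.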
From a query language perspective, on the standardization front there is the Gremlin traversal API, which is what we'll be looking at in detail today, and there is SPARQL, which helps you write pattern-matching queries over RDF graphs; it typically finds its use in data mining, in finding whether a particular pattern exists, and things like that. Delving into property graphs: what we see here is a property graph connecting people and the software they have created. Fictional data, so don't read too much into it. What we see here is a directed graph; the edges have direction. There are multiple attributes: for instance, on person 1 you see multiple properties associated with that person, and there are properties on the edges as well. There is an ID associated with every node and every edge. And there is a label on every node: a label called "person" identifies the person nodes, and there is a label on each of the edges, the "created" edges as well as the "knows" edges. So quickly: we saw a set of vertices, each with a unique ID, a label, a set of outgoing and incoming edges, and possibly multiple properties. Edges can also have properties. And there is much more to it: there are multi-properties, and meta-properties you can attach to properties, to facilitate different queries and modeling purposes. Now, what we see here is the Apache TinkerPop stack. It is an open source, vendor-agnostic graph computing framework that defines the structure of the property graph. It serves the databases: there are different commercial databases out there. The most popular one is Neo4j, and AWS Neptune has been around for the last year and is TinkerPop compliant.
Being TinkerPop compliant means you can write Gremlin queries against the database. Gremlin itself is a DSL for traversing graphs, an expressive language for defining these traversals, and it has bindings in different languages. In the Python space we have Gremlin Python, which lets you write your Gremlin queries in Python; they get translated to Groovy on the Gremlin server and then routed to the database server itself. That's the flow of how the queries work. You have your dot notation for function chaining, the typical format we use for Python code, for these queries too. Just to see how it looks: this is nothing but Python code, where we have the handle to the graph, which is your g. It's very readable: add a vertex with the label "person"; put properties on it, including an ID property that you have to give; and then I record the fact that the name is Shree, from Chennai. Now we can add another node labelled "conference" with PyCon India as its properties. Once you have these two, you can put an edge between them, qualify that edge with a label called "attends", and attach attributes to that edge as well. This can really open up ways to store this data. One learning, though: getting this data model right is very hard. You will get it wrong if you embark on this journey for the very first time; it's not going to be immediately apparent, and it takes trial and error to get to the right model for the problem you're looking at. So how does the traversal itself work? Here is a simple query over the data we just saw: we want the names of the people that a person called Marko knows. Marko is node 1.
Gremlin starts at node 1 and has to traverse the two outgoing edges with the label "knows". This is where the provider can parallel-process: you can have two traversers going out, one along each "knows" edge, so Gremlin goes one way to get the name Josh and the other way to get the name Vadas. That's how traversal typically works in the Gremlin space. I also wanted to show how much richer the queries can get. Imagine your typical recommendation system: what has user A liked, who else has liked those things, and what have they liked that user A hasn't already liked? If we map that onto the example we're seeing: what has Marko created? Let's get all his collaborators. What have they created that Marko didn't create? There's a clear parallel with recommendation systems. The query would go like this: take the outgoing "created" edges from Marko and aggregate all of his creations; then follow the incoming "created" edges on those to find the others who also created the same software; then go out again to what else they created that Marko hasn't. Such queries are possible; you can get down to that level. And there's much more. I've kept it very simple today, but feel free to take a look; there are good guides out there. And this slide doesn't look so good, but it's a recursive CTE that I wrote back in, I think, 2012. It's very simple: imagine you had employees and their supervisors, two columns, and you wanted to find every employee, their supervisor, and the hierarchy level. Such a thing on a graph starts to look like that, and the whole query itself becomes simpler.
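The two traversals just described can be simulated in plain Python over the person/software sample data (names follow TinkerPop's "modern" toy graph). A real Gremlin query for the first one would be g.V(1).out('knows').values('name'); this sketch only mimics what the traversers do underneath:

```python
# Vertices with labels and properties, keyed by ID.
vertices = {
    1: {"label": "person", "name": "marko"},
    2: {"label": "person", "name": "vadas"},
    3: {"label": "software", "name": "lop"},
    4: {"label": "person", "name": "josh"},
    5: {"label": "software", "name": "ripple"},
}
# Edges as (out_vertex, edge_label, in_vertex) triples.
edges = [
    (1, "knows", 2), (1, "knows", 4), (1, "created", 3),
    (4, "created", 3), (4, "created", 5),
]

def out(vid, label):
    """Follow outgoing edges with the given label: one Gremlin-like step."""
    return [i for (o, l, i) in edges if o == vid and l == label]

def in_(vid, label):
    """Follow incoming edges with the given label."""
    return [o for (o, l, i) in edges if i == vid and l == label]

# "Who does marko know?" Two traversers walk the two 'knows' edges.
known = [vertices[v]["name"] for v in out(1, "knows")]
print(sorted(known))  # ['josh', 'vadas']

# The collaboration query from the talk: what marko created, who else
# created those things, and what *they* created that marko didn't.
created = set(out(1, "created"))
collaborators = {p for s in created for p in in_(s, "created")} - {1}
recommendations = {s for p in collaborators for s in out(p, "created")} - created
print([vertices[s]["name"] for s in recommendations])  # ['ripple']
```

Note how the second query is just three chained hops plus two set subtractions; that is the shape the equivalent Gremlin traversal takes as well.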
So it becomes three lines of code rather than the whole blurb of how we used to do it in the relational world. Just thought I'd share that quickly. So, summarizing: graphs are a natural fit for complex domains where relationships are important; they help you find and traverse relationships and get to answers. One important thing: query time is proportional to the amount of the graph you have to traverse. If you're traversing the entire graph, you're doing something wrong; but if you're traversing a small portion, it will be a very powerful query compared to your other data models. What are the bads? It is a really large, rapidly evolving ecosystem. A lot of attention is being paid to this space, so there's a lot of open source work going on; if anybody's interested, it's Hacktoberfest and there are a lot of repos in this space. And there's also a real lack of documentation, so it's an important time for communities to come together to make that better and write better code, which is what I think this community does well. One more thing I wanted to mention: it's hard for newcomers to understand what a good use case is, where a graph fits, and how to do it correctly. There are gotchas out there, so you might have to play around a bit before committing to it for a production use case. If you like the illustrations, these are the credits. So I'd like to take any questions you might have at this point.

Hi. My question is that the graph you had as an example looked more suited to questions like, does A know B? But let's say I have an application where I'm also interested in asking more meta questions, like how many persons are there, or how many persons above a height of X are there.
I feel that, correct me if I'm wrong, that is more suited to an RDBMS setting. Or even if it is not: if my application has questions whose core is solvable through graphs while the others are not, is it better to have two systems, or are graph databases optimized for answering RDBMS-style questions as well?

The short answer is: it depends. But to get into a little more detail, polyglot systems are how we're trying to design things these days. Dividing and conquering seems to give better answers than putting the whole onus on one database to scale for you. So it depends on the problem, where each store fits well, and the outcomes you see. I know it's a bit hazy, but yeah. Thank you.

Hi. Really nice talk, I enjoyed it. Thank you. My question is: there are problems in the graph space, like the travelling salesman, which are NP-hard kinds of queries. How do you protect against such a query? Do you have any mechanism in place to look at a query and say, no, this is an NP-hard query and we're not going to run it?

That's a very interesting question. There are ways to monitor query times, which is something you do as part of profiling; you can put profiling indicators in place to catch long-running queries. I haven't really faced something like the travelling salesman problem, but yes, profiling is one answer: understand how much time a query takes and how much of the graph gets traversed. If you're going to traverse the entire graph, that's going to be a costly query and you're not really going to gain much from using this model.

Okay. Yeah. A second query, to the organizers: where can I find the... It is on the website. We'll put up the slides on the website.

Hi, my name is Shankar. Given that you were saying it's going to be a bit difficult to write queries with Gremlin.
It's not going to be as easy as SQL, right? There's a learning curve, yeah. So what are the best practices in terms of profiling? What advice would you give, specific tools or specific techniques, so that we can speed up writing graph queries? And since they're typically used in large-scale scenarios these days, I think profiling takes on significance.

Correct. AWS Neptune is one database I used where they do give you the ability to understand how much time a query takes and how much of the graph it traverses, so you can turn on these profiling metrics and look at those parameters.

So the graph databases themselves have the profiling...? Some do. For some you have to interject annotations to understand how much time things take. But apart from that, there are a lot of ways to optimize Gremlin. The documentation is currently weak, so one thing that has helped me is reading the source code to understand how the traverser behaves, and working out the best steps to use: there are the basic lambda steps and then there are derived steps, and there is optimization to be had there in the order you run the costly steps versus the smaller ones.

How about profilers with visualization? Because graph DBs are typically visualized. Correct. I have had visualizers that help in development mode, where you can see how many vertices you're touching and things like that, but from a profiling standpoint, no. Okay. Thank you.

Hi. This talk was very good; it really intrigued me. My question is: there are typically two types of databases, OLTP and OLAP. Where would you fit graph databases between them? Yeah, that's a good question. The community itself is working in both spaces. What I showed today, the property graphs, are the OLTP, real-time queries.
There are also OLAP, data-mining types of queries which use Spark processors underneath to divide and conquer: there's a large data set and the traversal takes time, so there is effort being put into that space as well. The words that come to mind are Giraph processors and Spark processors, used to accumulate the results for such cases.

Yeah, but the queries I found were not like the examples I just saw. It seems you cannot use graph databases as the primary database; you would have to put some kind of RDBMS in front, extract the data into a graph database, and then run the queries there. For a production, high-scale system, that's pretty much what I think is the way to go: you use graphs where the relationships matter, where you want to decipher a pattern or get at specific relationships and work with those; there a graph suits best. But if it's a small-scale system, you can use a graph as the primary store too. So it depends on the scale of the system.

And how do you define a schema and govern it? Because a graph is very much evolving, right? It can grow as big as you want, and it can shrink. How do you do data governance, say that you should not let the graph grow that big, you just add the nodes but you don't let the relationships grow? That's a good question. What I have typically seen is that it lends itself to a very flexible schema: each node can have any number of properties. But you can put your schema constraints in the client layer, similar to how we use JSON schemas to define what an incoming object can hold. Limiting the number of relationships is something you'd have to put in a custom layer with safety checks; I haven't seen it done on the database side till now.

This is the last question, guys. Hi Shree, this is Sandhya.
We were talking about recommendation systems. Think of the different e-commerce websites, which have a lot of products, a really high number, and on top of that, user event data also keeps coming in from the websites: somebody who bought this also bought that. When things like that come in, your data space is increasing for the events and for the products both. When you define relationships that depend on different cases, created, knows, those kinds of relationships, and your search space keeps increasing, would you have an aggregation layer on top that ranks a particular product or user over all the data you have curated for a particular window, to reduce the space so that the recommendation system gives you faster results?

Correct. I did mention earlier that rather than using a graph for a recommendation system, you're better off having an ML pipeline; that's one way. Or the OLAP queries that were mentioned earlier: that can work offline, as batch processing, doing these aggregations as a summarizer layer. You have a data pipeline doing this on your weekly or daily data, accumulating the aggregated numbers, and then you use the graph to pull out the high-level aspects. I think that's one way to do it. Thank you.