Hi everyone, my name is Nicole, and today I'm going to talk with you about graph databases and how to handle relationships within your data with Python.

First, let me introduce myself. I am a full-stack developer at Labcodes, where I work mainly with Django on the back end and AngularJS on the front end. I am also a student at the Federal University of Pernambuco, working on my master's degree. I'm almost there, I'm in the final part, and my master's research is about performing OLAP queries, OLAP analysis, on graph databases. That's why I'm interested in this topic and why I'm talking to you about it today. I'm also a member of the Python user group in Pernambuco and of the PyLadies group in Recife. That's my T-shirt: PyLadies in Brazil.

But wait: Recife, Pernambuco... I said a lot of words that you probably didn't understand. I came all the way from Brazil to Rimini to attend EuroPython. It was a really long trip, 19 hours between airplanes and trains to get here. I live in Recife, in the northeast of Brazil. We have a very active community of Pythonistas back there. We organized a Django Girls event last month, where I was one of the organizers, and we also organized the 50th meeting of our Python user group. So we are very active, and we are really proud of it.

And I work at Labcodes, as I said before. What is Labcodes? Labcodes is a software studio, from Recife to the world. Being a software studio means that we develop solutions, be it a process, a web app or a product, according to our clients' wishes and needs. We can solve a problem, develop a new product or implement a new process in their business. We have five years of experience with clients in Brazil and in the US. The technologies we use are mainly Python, Django and JavaScript, mostly with AngularJS and React, and a little bit of Vue.js.
We also work with Postgres, Cassandra, Elasticsearch and a lot of others. Labcodes grew with the help of the community, which is why we are always trying to give something back. We helped organize some of the Python user group meetings in our state. We helped organize the Brazilian PyCon in 2014. Since 2012, we have been to every Brazilian Python conference, every year. And we have also taken part in a lot of Django Girls events as coaches and organizers. We came from the community, from a group of Pythonistas who met at a Brazilian conference, so it's only fair to give back.

But yeah, about this talk. This is my agenda. I will start by talking about relationships and what I mean by that. Then I'll introduce the concept of graph databases. Has anyone heard of graph databases? Whoa, that's really good. That's amazing. Then I will talk about Neo4j, the most popular graph database in the industry. There are others too, so I will compare some of the solutions available today. I will also do a small comparison between Neo4j and relational databases, because we web developers are used to working with relational databases, so I brought some of those concepts here to compare. And at the end, I will talk about some cool applications of graph databases.

So let's start with relationships. What is this thing I keep talking about? Relationships in data are pretty much the relationships we have in real life. Every time you add a new friend on Facebook, follow someone on Twitter, pin an image on Pinterest or accept a new connection on LinkedIn, you are creating a new relationship, and not just in your personal life.
In your personal life, it means that you are now friends with someone or you follow someone. But you are also creating a relationship in the database between your data and someone else's data. When you add someone as a friend, your profile data is now linked somehow to the other person's profile data. And as you can guess, we have a lot of relationships on the Internet. In Brazil alone, we have 160 million Facebook users. That's a lot of people using Facebook, and each one of them has friends, likes pages or participates in groups. So that's a lot of relationships. But how can we represent and manipulate all these relationships in a good way?

To start, I will present a small, very common scenario: a social network where a user can be friends with another user or like a page. It works in a similar way to Facebook, just small enough for us to work with. Now that we have this scenario, let's try to represent its data using tables, as we would in a relational database. First, we would have a user table to keep track of our users, where each user has a name, a gender and an age. Okay, that's cool. Now we need to store which user is friends with which user, so let's create a table called friends with. In this table we have two user IDs, and each row represents a friendship between two users. Cool. That's really nice. We understand so far. So let's create another table to represent pages. Each page has an ID, obviously, a name and a category. But we also need to store which user likes which page, so let's create a table called likes that connects a user ID to a page ID.

Okay. Now I have all my data in tables and I'm running my application, and I need to know which pages the user named John likes. Okay. The user named John.
So let's go back to the user table, and we find out that John has an ID of one. Cool. Now we have to go to the likes table and get the IDs of the pages that the user with ID one likes. Those are the pages with ID two and ID one. Then we have to go to the pages table and find that the page with ID one is Coca-Cola and the page with ID two is The Beatles. So you saw that, to answer this really simple question, I had to query three different tables just to get the pages that John likes. And that was not fun. It was a lot of going back and forth, trying to figure out from an ID which one is which. It's not great, and basically tables suck for this kind of thing. It's not the best way to do it. So what better data structure can we use to represent this data, so that we can answer this kind of question faster and more intuitively?

That's where I present graphs to you. A graph is a data structure, usually written as G in mathematical notation, formed by a set of vertices V and a set of edges E. It's a rather simple data structure, as anyone who has studied it knows. It is a very simple, intuitive and visual way to represent your data.

Given this concept, we can now represent our scenario using graphs. Here I have the same data that we had before in tables, now represented as a graph. Each circle is a vertex, or node, and each line is an edge, or a relationship between nodes. The green ones are the users and the red ones are the pages. As you can see, each relationship has a label: friends with, or likes. Given this representation, it's much easier to find out which pages a user likes. We just find the circle with the name John, follow the lines, and we get to the pages that the user likes. So that's pretty good.
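To make the idea concrete, here is a minimal sketch in plain Python of the scenario as a graph G = (V, E). It is not how a graph database stores data internally. The node ids, the edge-tuple layout and the helper name `neighbors` are assumptions made for illustration, and only the nodes named in the talk are included.

```python
# Vertices (V): node id -> (label, properties).
# Only nodes named in the talk are included; the ids are illustrative.
nodes = {
    "john": ("User", {"name": "John"}),
    "mary": ("User", {"name": "Mary"}),
    "coca_cola": ("Page", {"name": "Coca-Cola"}),
    "beatles": ("Page", {"name": "The Beatles"}),
}

# Edges (E): (start node, relationship type, end node).
edges = [
    ("john", "FRIENDS_WITH", "mary"),
    ("john", "LIKES", "coca_cola"),
    ("john", "LIKES", "beatles"),
]


def neighbors(start, rel_type):
    """Follow every edge of type rel_type that leaves the start node."""
    return [end for (src, rel, end) in edges if src == start and rel == rel_type]


# "Find the circle named John, then follow the LIKES lines."
liked_pages = sorted(nodes[n][1]["name"] for n in neighbors("john", "LIKES"))
print(liked_pages)
```

The linear scan over the edge list only keeps the sketch short; it says nothing about how Neo4j or any other graph database looks up relationships.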
We can have a graphical view of our data. It's really nice. But how can we use this kind of structure in our database? That's where graph databases come in. A graph database is a system that stores data in graph structures, which allows users to store the relationships between data explicitly. With explicit relationships, we get a direct way to retrieve information: we can retrieve the information about a relationship directly.

Besides storing relationships explicitly, graph databases have other advantages. They allow more elaborate data analysis. We can use common algorithms from graph theory, I don't know if anyone here has heard of it, such as community detection, pattern recognition or centrality measures, run them directly on our database, and find out new analytical information about our data.

Another advantage is that graph databases have a very flexible data schema. Let's imagine that we would like to introduce groups into our little scenario, like the groups on Facebook. To do that in a relational database, we would probably need to create a new table, change the columns of another table and do a lot of things to make it work. But in a graph database, we only have to add a node of type group and connect it to the existing nodes. We don't even have to look at how the database was organized before. We can just add it. That's why I'm not showing everything here, because it's not necessary. I just add the node and connect it to whatever I want. It's that simple.

Another advantage is that recent graph database implementations are built on NoSQL storage mechanisms.
So they carry all the advantages of NoSQL databases with them, which means they have horizontal scalability: we can improve the performance of the database just by increasing the number of simple machines that run our application. We don't need a huge, amazing server. We can have small, simple machines and do some distributed processing to improve the performance of our application. It's really good.

I decided to go a little further and use Neo4j as our graph database, because it is the most popular graph database according to DB-Engines. DB-Engines is a website that lists all the databases available, relational, NoSQL and graph, and keeps a ranking of the most popular ones. Neo4j is the most popular in the graph database category. It is implemented in Java and has its own query language, called Cypher, which I'm going to show some examples of. The data can be accessed through a REST API or a Java API.

Now for some examples of Cypher queries. Let's say that we want to create a vertex, or a node, as it's called in Neo4j. This is the command that we use. We just use the keyword create. Inside the parentheses, john is just an alias for the node, and we say that the type of the node is user. In the curly brackets, we pass all the attributes that the node has: name, gender and age. For this query specifically, I also return the node, just so I can show it to you. This is the result of running this command in Neo4j: just a single node with the name John. But when we do this through the REST API, we get a JSON object. Inside it, there is a graph object that comes with a list of nodes and a list of relationships in our graph. And we can see that for now we only have one node, with the label user.
Its properties are the ones we passed, such as the name John, and there are no relationships. So let's create a relationship, because that's the main thing in graph databases. How can we create relationships using Cypher? Let's say that we already have two nodes in our database, John and Mary. First, we need to retrieve these nodes using the keyword match, and we get them by name. Then we create a relationship with the label friends with between these two nodes, and I return them so we can see the result. At first, we had Mary and John totally separate, just two nodes in our database. Now, after I create the relationship, we have this arrow going from John to Mary. We use this arrow notation in the Cypher query to indicate the direction of the relationship. But this is totally optional. We can have relationships without a direction, or with directions both ways. So you can even add more information to your relationship. For this one I prefer to add a direction, but it's totally up to you. In the JSON response to the API request that creates the relationship, our graph object now has two nodes, John and Mary, and one relationship that connects them. So that's how we can do it using the REST API.

Now let's load all the information from our scenario into Neo4j. This is what we get: the three users and the three pages, with all their relationships. So now let's try to retrieve some data from this database. To query, we use the keyword match. We can just say: match the user node with the name John that has a likes relationship to a page, and return those pages. That's easy. It returns this. I took this image from the Neo4j embedded browser.
It's a browser interface that comes with Neo4j; as soon as you install it, you can go there. It's really easy to use and really visual. You can see your data just like this, which makes things easy if you are starting out with Neo4j. So it's really nice. If you run the query through the REST API, the JSON object that comes back now has a data object, which is the data returned for your query. It has two rows, because John likes two pages: one row for the relationship of John with The Beatles and one for John with Coca-Cola. So we have that information in JSON too.

Yeah, that's pretty good, but so far we are only using Cypher. Where is Python? To integrate Neo4j with your Python application, we use py2neo. py2neo is a Python module that integrates Neo4j with your application, and it supports Python 2 and 3. I will show you how to do everything I did before, create a node, create a relationship and query your database, using py2neo.

This is the Python code for it. From py2neo we import the Graph, Node and Relationship objects. With our database running, we can get the graph object; we just have to pass the password to it. Then we start a transaction with g.begin. In this transaction, we can create nodes and relationships and commit all the changes at once at the end. First I create the node for John and call the transaction's create method. I do the same for Mary, and then I create the relationship. Nothing gets pushed to the database until I commit: when I call the commit method, it pushes everything to Neo4j.

But how can we query? Querying is just as simple. We have our graph object, and we now use a node selector.
With this node selector, we can select a node with the label user and the name John and get the first one that matches. Then we can match the relationships that start at this node, the one representing John, and that have the type likes. We get all of these relationships and print the end node of each one, which corresponds to the pages that John likes. I printed it here so you can see the response. py2neo also allows you to just run a Cypher query. What I showed uses only Python, but you can also use the run command and pass a whole Cypher query to it, and it will return the response.

So now I have talked a lot about Neo4j, and I think we are pretty used to it by now. But what other options do we have? I took the other two most popular graph databases from DB-Engines: OrientDB and TitanDB. And I put together a small comparison between the three. Neo4j is a native graph database. OrientDB is a multi-model database, which means that it is not only a graph database but also supports key-value stores, column stores, document stores and a bunch of other ways to store your data. TitanDB works with graphs, but it needs a back-end database to work with. It can be Cassandra, Berkeley DB or another kind of database connected to TitanDB. All three are implemented in Java, and each one has its own query language: Cypher for Neo4j, an extended version of SQL for OrientDB, and Gremlin for TitanDB.

To compare these three query languages, we answer the same question in each: what are the pages that the user John likes? In Neo4j, we already saw how that's done. It's pretty easy, pretty straightforward.
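For reference, the three Cypher statements described earlier in the talk might look roughly like the strings below. The slides are not part of this transcript, so these are reconstructions from the spoken description, not the speaker's exact queries: the relationship type names (`FRIENDS_WITH`, `LIKES`), the `Page` label, and the gender and age values are all assumptions. They are kept as Python strings because, as mentioned, py2neo can take a whole Cypher query through its run command.

```python
# Hypothetical reconstructions of the Cypher statements described in the talk.

# Create a node: "john" is only an alias, User is the label, and the
# curly brackets hold the node's properties (the values here are made up).
CREATE_NODE = (
    "CREATE (john:User {name: 'John', gender: 'male', age: 30}) "
    "RETURN john"
)

# Retrieve two existing nodes by name, then connect them.
# The arrow gives the relationship a direction, from John to Mary.
CREATE_RELATIONSHIP = (
    "MATCH (john:User {name: 'John'}), (mary:User {name: 'Mary'}) "
    "CREATE (john)-[:FRIENDS_WITH]->(mary) "
    "RETURN john, mary"
)

# Which pages does the user named John like?
MATCH_LIKED_PAGES = (
    "MATCH (:User {name: 'John'})-[:LIKES]->(page:Page) "
    "RETURN page"
)

for statement in (CREATE_NODE, CREATE_RELATIONSHIP, MATCH_LIKED_PAGES):
    print(statement)
```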
OrientDB brings something more familiar to those who are used to SQL, because we can see the select, from, where structure that we know. They only add some keywords to work with graphs. So it's selecting the likes relationship, going both ways, and expanding it to get the nodes at the other end of this relationship, from the user with the name John. It takes the node for the user named John, expands to the nodes it is connected to, and brings back the pages. Gremlin has a totally different syntax. It's not something we are used to, but someone who has worked with Gremlin should find it easy. It's doing basically the same thing: it goes to our graph, g, and to our vertex set, V, gets the vertex that has the name John, and gets the nodes connected to it going out. It's basically the same idea.

So let's compare performance. I didn't run these experiments myself. I took them from some papers I found online that compare these three databases, and I thought it would be cool to bring them here in case someone was wondering how the three perform against each other. One of the most common operations in a database is retrieving an instance by its ID. This test calculates the average time each of the three databases takes to retrieve a node given its ID, in a graph with 500,000 vertices, with four clients performing the operation 200 times. It was done by some researchers in Belgium, who measured the average response time. Here we see that TitanDB is slightly slower than the other two. OrientDB is the fastest one, but Neo4j is quite close to it. So we see that they are pretty close.
Titan is the one that's a bit slower, but this can change if you change the back-end database; for this experiment, they were using Cassandra. I didn't find any experiments using the other kinds of back-end database, so this could change. Another performance experiment is about the amount of memory required by each of these databases. They stored a graph with 32,000 vertices and 256,000 edges. It was done by some researchers at the Institute of Technology. We see that OrientDB requires a lot of internal memory to store this kind of graph, while TitanDB requires the least. Neo4j is in the middle.

Now let's bring in relational databases, the concept we are most used to. If we wanted to run the same query from the last examples in SQL, this would be the query: we would do the select, from, where, and we would have to join these three tables to get the information. Besides that, it is not as readable as the Cypher query. I think that's the advantage of Cypher over SQL. And we know that join operations are really useful, but they take away some of the performance of your application, so that's not so good.

I also brought a performance experiment that some researchers at Mississippi University did comparing Neo4j with MySQL. They basically submitted two types of queries: a structural query and a data query. A structural query navigates through your data by following its relationships, doing something like a breadth-first search over your data, as in a tree. A data query only retrieves a node by its attributes. And we can see here that Neo4j is actually not that good for data queries.
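The two query shapes can be tried on the talk's small scenario with Python's built-in sqlite3 module: a data query that filters by an attribute, and the relationship question from earlier, which needs the three-table join. This is only a sketch. The table and column names and Mary's ID are assumptions, and it does not reproduce the benchmark from the paper.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE pages (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE likes (user_id INTEGER, page_id INTEGER);

    INSERT INTO users VALUES (1, 'John'), (2, 'Mary');
    INSERT INTO pages VALUES (1, 'Coca-Cola'), (2, 'The Beatles');
    INSERT INTO likes VALUES (1, 1), (1, 2);
    """
)

# Data query: fetch a row by one of its attributes, no relationships involved.
john_id = conn.execute(
    "SELECT id FROM users WHERE name = ?", ("John",)
).fetchone()[0]

# Relationship query: "which pages does John like?" touches all three tables.
rows = conn.execute(
    """
    SELECT pages.name
    FROM users
    JOIN likes ON likes.user_id = users.id
    JOIN pages ON pages.id = likes.page_id
    WHERE users.name = ?
    ORDER BY pages.name
    """,
    ("John",),
).fetchall()
liked_pages = [name for (name,) in rows]

print(john_id, liked_pages)
conn.close()
```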
If your application only retrieves nodes by some attributes and doesn't use the relationships, the topology of your graph, very much, maybe Neo4j is not a good idea, because MySQL can perform better. But if you navigate a lot between the relationships of your data, then Neo4j is the way to go, because it's easier to use and faster than MySQL. So you have to analyze what your application is doing and how it uses the data, and choose wisely whether or not to go for a graph database.

So let's see some applications, because so far I've only been talking theory. Where is it important to use graph databases? There are several areas. One of them is social networks. That's the example I've been using so far, and we can see that it's pretty straightforward to model users as nodes and relationships as edges. But there is also work in bioinformatics and genetic analysis. I'm not from this area, so I don't know exactly what they do. But it appears that the particles of our DNA have interactions between them, and those interactions are relationships between particles. They store this kind of information in a graph database so they can process it better than they could in a relational database. So that's an area that takes a lot of advantage of a graph structure. Another interesting area using graph databases nowadays is telecommunications, because they can easily represent a person calling another person, or the connections between cables. They can see their network visually using this kind of database.

But today I brought you one application more specifically.
For me it's important, I don't know about you: an application about Game of Thrones that uses graph databases. It's a really, really interesting application that someone built, and it was amazing. It blew my mind. So I brought you the work of these guys. It was Andrew Beveridge and a co-author, and I don't know if I'm saying those names correctly, but they took the time to analyze the network formed by the characters of the book A Storm of Swords, from Game of Thrones. They went through the whole book, recorded the relationships between the characters and gave a weight to each relationship. If a person is really close to another person, the relationship has a higher weight; if they are not so close, the weight is low. They put this in a CSV file in this format: from the character Aemon to the character Grenn, the weight is five, but Aemon with Samwell has a weight of 31. They did this for the whole book and all of its characters and, you know, that's a lot. So that's really amazing.

Then someone else, a guy called William Lyon, had the brilliant idea of loading this information into Neo4j. He started to play around with it and do some analysis on it. For this analysis, he used a module called igraph. I don't know if anyone has heard of it. It's a Python module that lets you work on a graph with network analysis algorithms such as centrality and community detection. It's pretty easy to call these algorithms from this module, and it has a pretty neat way to connect with Neo4j. So let's take a look at how William Lyon did this using igraph and Neo4j from Python. He got the graph from py2neo, connected to Neo4j. Then he loaded all the information from the Neo4j graph into igraph. And then he just called the community_walktrap method.
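The edge list described above is easy to load with the standard library. The sketch below is not the igraph and Walktrap analysis from the talk. It only reads rows in that source, target, weight shape and sums the weights per character (weighted degree), which is one simple centrality measure; the talk does not say which measure was actually used. The column header names are assumptions.

```python
import csv
import io
from collections import defaultdict

# Two rows in the format described in the talk; the header names are assumed.
EDGE_LIST = """Source,Target,Weight
Aemon,Grenn,5
Aemon,Samwell,31
"""

weighted_degree = defaultdict(int)
for row in csv.DictReader(io.StringIO(EDGE_LIST)):
    weight = int(row["Weight"])
    # The relationship is treated as undirected: both characters get the weight.
    weighted_degree[row["Source"]] += weight
    weighted_degree[row["Target"]] += weight

# Characters ordered from most to least connected by total weight.
ranking = sorted(weighted_degree.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

With the full file, the same loop would give a first rough idea of which characters sit at the center of the network before running anything more elaborate.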
He took into account the weight of each relationship, and with this method he was able to identify the clusters, or communities, among the characters. The result was a table like this. We have eight clusters, and each cluster has the set of characters that are part of that community. I don't know why Lancel is over here, but since the method is not deterministic, things like this may happen with your data. He took this information and also computed a centrality measure, and he was able to come up with this graph. It represents all the characters of the book. Each color represents a community, or cluster. The size of a node represents the importance of that character according to the centrality measure. And the width of an edge represents the weight of the relationship between the nodes. Look at the blue one, I don't know if you can see it: the big node over there is Jon, Jon Snow. And if you look around it, you see that the blue community is the people on the Wall. In the green one, the big node is Daenerys, and around her you see all the characters from her part of the show. In the big yellow one, we have Robert, the king, and Tyrion and Cersei, the important people that represent the main part of Game of Thrones.

So it was really fun. I found it a really interesting application of graph databases: how you can play around with your data and analyze it to extract more information. And it's pretty nice to see how your data behaves given this analysis. If you worked with a relational database, you couldn't have this visual view of your data, which is really nice. So yeah, that was it. If you have any questions, please feel free to ask now or later; I will be around.
This talk is on Speaker Deck, slash labcodes. I have also written a blog post about this presentation, with all the same content. You can find all the notes online on the Labcodes Medium account. And these are my Twitter and GitHub. So if you have any questions, feel free. That's it. Thanks. We also have stickers over here, if you want some.

Thank you so much for a great talk. Very important. Okay, we have a lot of questions, and we've got some time.

Hi there. Thank you for the amazing talk. I come from a NoSQL and relational database background, and my question is mainly: how do you scale a graph database? With TitanDB and Cassandra, I can see it. But with the others, I don't have the experience to understand the scaling part.

To scale?

Yeah, how do you grow when your data gets to a size where you can no longer throw memory at it to store everything on one server?

For Neo4j, you have distributed processing, so you can do that easily. OrientDB also provides a really nice way to distribute your processing, because it's also a NoSQL kind of database. I didn't bring it here, but they have really nice tutorials on their website on how you can do this. It's basically the same thing: you can bring up instances of your database and spread them out, and they have nice mechanisms to make everything distributed and come together, how can I say it, without conflicts.

Okay, so it's like the sharding we already use?

Yes. Yeah. But TitanDB, I'm not sure. I'm sorry, I didn't research TitanDB much, but for OrientDB it's totally feasible.

It has a Cassandra back end, so I imagine that's more.

Yes, I assume so. I didn't read much about it, but I assume that with Cassandra you can do it.

Yeah, sure.

Hello, thank you for your talk. I have two questions.
The first one: can I somehow specify a schema for the nodes in the database, for example that a user has only a name and an ID?

Actually, Neo4j is pretty schemaless. You don't have a way to put your schema inside Neo4j per se. You should probably do this at the application level instead of in the database. The big thing is that you don't have to come up with a schema.

Thank you. And the second question... I forgot. Yeah: can I somehow specify the storage mechanism for the graph, for sparse graphs or dense graphs? Is it optimizable, by storing relationships in a table or in a linked list, you know, inside Neo4j?

I'm not sure about that. I've never looked into whether there are different ways to do it. As far as I've gotten with Neo4j, it's pretty much a black box that takes care of these things for you. I'm not sure if you can customize the way it stores specific parts of your data. I know that with TitanDB you can change the back end as your application needs. But with Neo4j, I'm not sure you can do that.

We have time for two or three more questions.

These databases are perfect for storing relationships, but relationships change over time. How do you keep the past data? Because if you change a relationship, the old relationship is no longer there.

Yeah, the next step for me would be to work out how you would implement a time series application using these databases. But I'm not sure. You can store attributes on your relationship, so you could store a list of attributes on it that holds the history of the relationship, where it has been. I'm not sure how you would implement a real time series application using Neo4j, but you can do it with an attribute on your relationship.

Okay, thanks.

Thank you for the presentation. I have two questions.
If I'm not wrong, you said that Neo4j has a user interface where you can actually see the graph.

Yes.

Do you know how the interface reacts if you have a big graph of, I don't know, hundreds of millions of nodes?

It limits the result. I'm working now with a database that was supposed to have 16,000 nodes, and when I search for all the nodes, it only returns a thousand to me. It limits the result because it can be really heavy on the front end.

The second question: I did a quick search on TitanDB, and there were some posts asking whether TitanDB is dead. Do you know anything about it? Are they continuing the project?

Yeah, I heard about this just this morning, actually. Someone tweeted me about TitanDB, and I haven't had the time to look at it. It seems that the project is being discontinued. So I will probably have to update my talk, but I only heard of this today, and I'm going to take a closer look. I don't know if they moved the project or changed the name. But yeah, I saw it this morning. It's really weird.

Very good presentation, thank you. I would like to know: with a graph, can you use Dijkstra's algorithm, for example, to find the shortest path between two nodes? Does a graph database have this type of algorithm natively, or do I need to get the data and use igraph for this?

There are libraries that you can add to your Neo4j and then call these functions from Cypher, so they are already coupled with the graph database. The igraph one is really nice, but you can use it without a graph database. There are also libraries that calculate centrality measures and pattern recognition using only Neo4j, without igraph.

Thank you, Nicole. Great talk. And guys, remember, let's do the clapping first.