Hi, welcome to my talk. Today we're going to talk about graph database integration using GraphQL. But first, a little bit about who I am and why I'm giving this talk. You can find me on GitHub or Twitter at n3integration. My name is Rob Perry, and I've been writing software for the last 20 or so years. I've written large-scale applications and thick clients, lots of different systems, and I've interfaced with a lot of different databases: first file-system-type databases, then search indices, NoSQL databases, and more recently graph databases.

So today we're going to talk about graphs, and I thought it would be beneficial to first describe what a graph is. According to Wikipedia, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense related. This is easiest to show through an image. In this particular example, the numbers represent the vertices, and the lines connecting the numbers are the edges, which show how you can traverse from one node in the graph to another.

Since we're talking about graphs, let's talk about GraphQL, the graph query language. It's meant to be a replacement for traditional web services; it's replacing RESTful APIs, and it gives you more flexibility in how you query your data set. With a traditional RESTful API, you say "give me the list of some object," and the service sends you back everything. But maybe you don't want everything; maybe you just want identifiers, or the name of some field, or just the relationship of some object or the name of the relationship. A traditional RESTful API doesn't give you that flexibility, but GraphQL enables it, and that flexibility is implicit in the way GraphQL works. It's not custom; it's how the APIs themselves are defined. It also lets you evolve things over time much more simply than with a RESTful API.

So let's talk about Dgraph. Dgraph is an open-source, distributed, transactional graph database, and it chose GraphQL as its query language, which is different from the other graph databases out there. Some of the graph query languages I'm familiar with are SPARQL and Gremlin, so Dgraph diverged from the rest of the graph databases by using GraphQL. As far as I know, they're the only graph database that uses GraphQL as its native language. They felt that was an important decision, because GraphQL seemed like a good language for interacting with a graph; it just felt natural to them.

So let's talk about the architecture. Dgraph is composed of three different systems. At the top here, we have the Ratel UI, which is a web user interface you run in your browser. Then we have the Alpha servers, which are responsible for storing the data. Most of your interaction with Dgraph will be with the Alpha servers, whether you're hitting them directly with queries and mutations or going through the web user interface; all of those queries come in through the Alphas. And finally, we have the Zeros down at the bottom. The Zero is responsible for cluster coordination and membership, basically back-channel operations, and normally you wouldn't interface with a Zero directly, because this is a distributed system.
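To make that field-level flexibility concrete, here's a minimal sketch of a GraphQL query against a hypothetical concert API; the setlists, date, and venue names are illustrative, not from any real schema:

```graphql
# The client names exactly the fields it wants back.
# A typical REST endpoint would return the entire setlist object instead.
{
  setlists {
    date          # just the date...
    venue {
      name        # ...and the venue's name, nothing more
    }
  }
}
```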
It's displayed as an odd number of hosts here, so we have five Alphas and three Zeros, for group consensus. But if you didn't want a true HA distributed system, you could deploy a simpler architecture with just one Zero and one Alpha node, plus a Ratel node if you wanted the user interface.

So let's talk about the data set. In this case, I found a data set hosted at Carnegie Mellon: a fairly decent-sized data set of the set list for every concert the Grateful Dead played from 1972 to 1995. It has the venue they played at and the location. So we've got the set list, which has a date associated with it. Each set, or concert, was located at a venue, so we have an outgoing edge there. And a venue is located in some location; you can think of the location as city, state, country, that level of information. Finally, each set list has a set of songs, and a song is represented either as a song they played as part of the encore or as a traditional song they played as part of their set list.

So how do we get data into Dgraph? The way in is the RDF N-Quad syntax, which is a text-based format. In this particular example, if I break it down, we have a subject wrapped in angle brackets, which is the UID; we have the song, also in angle brackets, which is the predicate; and finally we have the object, which is wrapped in quotes, followed by a dot delimiter. The dot is required to end the RDF N-Quad line. If you want to create an edge, the syntax is very similar: you have the subject in angle brackets and the predicate in angle brackets, and the difference between an edge and a traditional triple is that the object is also annotated in angle brackets, followed by the dot symbol, or period.

Maybe there are cases where the triple isn't enough and you want to store additional metadata about some relationship. What Dgraph provides out of the box is this concept of a facet. Facets are additional metadata about some edge relationship, and they're annotated through parentheses. In this case, we have circa=1970, and basically anything within the parentheses is a facet that you apply to this triple.

Dgraph provides rich schemas. They have a few basic data types: bool, datetime, float, int, string, geo, and then uid. Here I've shown the predicates and then the types. A predicate is your lowest-level schema element; think of it like a column in a relational database. Then you have a type, which is similar to a table in a relational database: you have one name and a set of values associated with it. The predicate determines whether it's a reference or an actual element of a type, or a table, I suppose.

The way they support geo is through GeoJSON, so I included this screenshot of the geojson.org website. If you're not familiar with GeoJSON, you can go check that website out; it shows you what the JSON format is if you want to ingest geo data into Dgraph.

So let's talk a little bit about schema indices. As I showed a few slides back, you probably saw these annotations: @upsert, @index, @count, @reverse. This walks through what those are.
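To make the N-Quad format concrete, here are a few illustrative lines in the shape of this data set; the UIDs and predicate names are made up for illustration, not taken from the slides:

```
# A triple: subject <predicate> "literal object", terminated by a dot.
<0x1> <name> "Me and My Uncle" .

# An edge: the object is another node's UID, also in angle brackets.
<0x2> <playedSong> <0x1> .

# The same edge with a facet: key=value metadata in parentheses.
<0x2> <playedSong> <0x1> (circa=1970) .
```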
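And a schema in that style might look roughly like the sketch below, with my own hypothetical predicate and type names; the index annotations shown here are the ones covered on the next slides:

```
# Predicates: the lowest-level schema elements, like columns.
name: string @index(term) .
date: datetime @index(year) .
playedSong: [uid] @count @reverse .
isLocatedIn: uid @reverse .

# A type: a named grouping of predicates, like a table.
type Setlist {
  date
  playedSong
  isLocatedIn
}
```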
So the @upsert annotation is an enhancement to transaction support to manage conflicts: if there are multiple concurrent requests coming in and they try to manipulate the same value, it does that handling for you.

The @index annotation allows you to expose your fields as search criteria. If you don't index a field, you cannot include it as a search criterion or a filter; you can still return the data, you just can't search by it. If you want to index a datetime, the level of granularity goes from year to month to day to hour. If you have a string, you can index by exact, which is what it sounds like: it takes the string as it is, exactly the way you entered it, and that's the way it's indexed. Hash is not a cryptographic hash; it uses a separate, faster hashing algorithm and hashes the value you specify, again assuming case and everything else is the same. The term index functions like a search index, if you think about Lucene, Elasticsearch, Solr, or one of the other search indices. The term tokenizer does some level of normalization: it converts values to lowercase, strips out non-alphanumeric characters, and gives you the basic forms, which is useful if you're doing case-insensitive queries, things like that. Additionally, there's fulltext: if you're doing full-text search and inserting large blobs of text, then fulltext would probably be beneficial for you. And finally on the list we have trigram, which is useful if you're planning to support any kind of regular expressions against some field. Some of these you can use together, and some of them don't make sense together. For example, you can use term and trigram, hash and trigram, exact and trigram, or fulltext and trigram, but you wouldn't do hash and term, or hash and exact. I believe Dgraph either gives you a warning telling you that you shouldn't do that or outright prevents you, one or the other.

So those are the basic types. But if you have an edge type, a uid, you can specify the @count annotation, and what that does is create an index for counting the edge references. If you want to get counts frequently on some edge, you'd want to take advantage of that, because the counts are pre-calculated and don't have to be calculated on the fly. Additionally, there's @reverse, which manages bidirectional relationships between nodes; if you noticed a couple of slides back, that's used pretty heavily. It gives you the ability to pivot in and out of the different data types and traverse the graph in either direction, which I find to be super helpful.

And last but not least, Dgraph also provides the ability to create custom index tokenizers. The Dgraph project itself is written in Go, so if you want to write a custom index tokenizer, it should also be written in Go. And this is their interface. It's very simple: it has a name string, an identifier byte, the type string, and then the last method there, Tokens, is what actually does the magic; it performs the tokenization of some data.
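Based on that description, the interface is roughly the following Go sketch; the method names match what's on the slide, but treat the exact signatures as my approximation:

```go
package tok

// Rough reconstruction of the custom tokenizer plugin interface.
type PluginTokenizer interface {
	Name() string     // unique name for the tokenizer
	Identifier() byte // unique byte identifying the index
	Type() string     // the schema type it applies to, e.g. "string"
	// Tokens does the actual tokenization of a value.
	Tokens(value interface{}) ([]string, error)
}
```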
So some reasons you might want a custom index tokenizer: say, for example, you have an IP address and you don't want it treated as a plain string, or you have an FQDN and you want it handled a little bit differently. If you want additional optimizations and customizations, you can write your own tokenizer, and as you start up the Alpha nodes, you pass in the tokenizers you've written. It uses the built-in Go plugin support: depending on which operating system you're running on, you create a shared object or a DLL, and you pass that into the runtime when you boot it up.

I thought it might be beneficial to use the REST API, because it's a language-agnostic interface into Dgraph. If I walk through this: I've defined a function in Python, upload_schema, and I've defined a content type, application/rdf. Using the dgraph.schema file that I showed a few slides back, it reads the file and then does a POST request to the /alter endpoint, passing the schema as the body. It reads the response back and looks for the data message from the server if all went well and you got a 200; otherwise, it displays the status code if it errors out.

So let's talk a little bit more about the per-data-type query functions. If you have a string, the functions available are alloftext and anyoftext, which are useful if you're using the fulltext search index, and allofterms and anyofterms, which are useful if you're writing your own custom tokenizer or using the term index. Then there's match and regexp. Match is a fuzzy match that uses Levenshtein distance, where you pass in the number of characters that can be substituted based on the fuzziness of the search. For regexp, you just pass in a regular expression, assuming you have at least three characters, because again, the way Dgraph supports regular expressions is through trigrams. If you try to pass in a regular expression like dot-star-a or something along those lines, it's going to complain and tell you that the regular expression is inefficient, and it's not going to find any matches; you find out pretty quickly that it doesn't work.

For numeric and datetime values, we have comparison functions for less than, less than or equal, greater than, and greater than or equal, and you can apply these to strings as well. Then there's equality, which you can use for booleans, numerics, and datetimes, as well as strings. And last but not least, if you have geo data, there are specific geo functions: near, within, contains, and intersects. Again, these use the GeoJSON format for geo support.

Some other query functions: if you want to do aggregation, you can get averages, counts, min, max, and sum. Other functions are the uid function; the type function, if you're looking for a specific schema type and just want to query based on some type; and the has function, if you want to query based on the presence of a predicate. Group by is pretty standard; it works the way it would with a relational database. And val is basically the function you need if you're creating variables and want to display or make use of a variable's value, which I'll show a few slides ahead.
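Reconstructed from that description, the slide's Python looks roughly like this sketch; upload_schema, the application/rdf content type, and the dgraph.schema file name come from the talk, while the host and the rest of the details are my assumptions:

```python
import requests

def upload_schema(host="localhost:8080"):
    """Rough sketch: POST the schema file to Dgraph's /alter endpoint."""
    headers = {"Content-Type": "application/rdf"}
    with open("dgraph.schema") as f:
        schema = f.read()
    resp = requests.post(f"http://{host}/alter", data=schema, headers=headers)
    if resp.status_code == 200:
        # a successful alter comes back with a "data" message
        print(resp.json().get("data"))
    else:
        print(f"schema upload failed: {resp.status_code}")
```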
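As a small illustration of a couple of those query functions together, here's a hypothetical GraphQL+- snippet (the predicate names are mine) combining a type root function, a datetime comparison, and a term match:

```
{
  # set lists from 1994 onward, keeping only venues matching the terms
  q(func: type(Setlist)) @filter(ge(date, "1994-01-01")) {
    date
    venue @filter(anyofterms(name, "civic center")) {
      name
    }
  }
}
```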
So what does a Dgraph query look like? If we start here and look at this example, we start with setlist and date. Using the type function, what this says is: search by the node type Setlist. Right below that, we have date. Then we also want to traverse to the venue, and once we're in the venue, we traverse to the location. Then we pop back up to the set list and traverse to the song relationship, through both the played-song and played-encore edges. That's basically what this example is showing.

Some additional query functionality that Dgraph supports in its query language, GraphQL+-, is variables. Here we've defined a variable over the type Setlist, and we've defined the variable v1 as uid, where the variable name is on the left-hand side and the right-hand side is the name of whichever field you're pulling out of that particular object. uid is the field you get for free; it's basically the node's primary identifier.

In addition to that, you also get sorting: you use orderasc or orderdesc and pass in the field name you want to sort by. And you can paginate through the data using the keywords first and offset. In this case, we want to pull the first 10 at offset zero; if you want to keep paginating through, you just keep bumping the offset up by some incrementing value.

I also thought it would be beneficial to show again the language-agnostic interface to Dgraph, where we POST through the REST API. You pass in the content type application/graphql+- and hit the /query endpoint. In this case it takes a timeout as a duration, 20 seconds here; debug=true, which generates some additional data on the back end; ro, for read-only, equals true; and be, for best effort, equals true. Then, the same as the previous example, check the status code, pull out the data elements that come back, or display the error message that was returned.

So based on the data set we have, these are some questions we might want to ask. I recorded a demo earlier, and I can walk through each of these queries.

All right. The first one: we're looking for concerts that included the song "Me and My Uncle." We're creating a variable, doing an equality function on the song, traversing up to the set list through the played-song predicate, and storing the set list UIDs in the variable v1. We pass that into the root function, where we're looking for the date and the venue, and these are the results.

Next, we're looking for songs that were performed at the Civic Center. Again, we have a variable defined, looking for the terms in the location, in this case Providence. We pivot out through the is-located-in edge, which gets us to the venue. We take that variable and pass it into the next function, and because we're looking at the incoming edges of that venue, we can pivot up to the set list, pass it into the root function, and order by date here.

If we want the top N songs, in this case five, we do a query by the type Song, get the play count by doing a count on all inbound edges to that song from a set list, pass it into the root function, order descending by the play count, and take the first five.

For the next one, we want to show the top five encores.
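To give a flavor of that first demo query, here's a rough GraphQL+- sketch under made-up predicate names (name, a playedSong edge with @reverse, date, venue); the real slide's query may differ:

```
{
  # find the song node, then walk the reverse playedSong edge
  var(func: eq(name, "Me and My Uncle")) {
    v1 as ~playedSong        # the set lists that played this song
  }

  # feed the variable into the root function and pull date and venue
  concerts(func: uid(v1), orderasc: date) {
    date
    venue { name }
  }
}
```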
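And the query slide's Python is roughly the following shape; this is my reconstruction from the description, with the query parameters as named in the talk:

```python
import requests

def run_query(query, host="localhost:8080"):
    """Rough sketch: POST a GraphQL+- query to Dgraph's /query endpoint."""
    headers = {"Content-Type": "application/graphql+-"}
    # timeout, debug, read-only, and best-effort flags, as described
    params = {"timeout": "20s", "debug": "true", "ro": "true", "be": "true"}
    resp = requests.post(f"http://{host}/query", data=query,
                         headers=headers, params=params)
    if resp.status_code == 200:
        return resp.json().get("data")
    print(f"query failed: {resp.status_code}: {resp.text}")
```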
So again, we do a query by the type Song, store the UID in the variable v1, and do a query against the count of inbound edges to the song. And if we're doing a query for the top five venues, again we query by the type Venue, get the concerts, do the count on the inbound edges, and pass that in the same as the previous two examples.

If we're looking for all songs performed in Boston in 1994, we look for all terms of the location matching Boston, pivot out through is-located-in, which gives us the venue, and pass that into the next function. Then we look at the incoming edges to the venue, which give us the set lists, and we filter by dates greater than or equal to 1994. We pass that variable v2 into the root function and pull out all the songs that were played for each matching result.

And finally, if we're doing a query on international locations, we can do a regular expression query; in this case, I've used a comma-space pattern to indicate that it's an international destination. We pivot out through is-located-in to get the venue, pivot out from the venue to the set list, and pivot from the set list right back in, because we weren't interested in the location itself. In this case we re-alias the location as name, and that's stored in the variable v3. That's it for the demo.

So next, I wanted to talk a little bit about mutation support in Dgraph. Again, this is another example using the RESTful interface with Python. In this case, we build up the mutation and pass in the RDF; the RDF we're expecting here is the N-Quad format that was documented a couple of slides back. Then we have a set function: if you're passing a mutation, you can use the keyword set, which does an insert, or you can do a deletion, or, if you wanted to, you can do an upsert statement. This example doesn't demonstrate that, but for an upsert you have curly braces, then a query block: you issue a query, save the results in variables, and then use those variable names in your set block. That's how you'd issue upsert statements. We hit the /mutate endpoint with commitNow=true, saying that we want this to be an atomic commit, one transaction. Then we check the status code and the data message that comes back, and if it fails, we just display the status code there.
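Reconstructed from that walkthrough, the mutation slide's Python is roughly this sketch; the /mutate endpoint, the RDF content type, and the commitNow parameter come from the talk, and the rest is my assumption:

```python
import requests

def mutate(nquads, host="localhost:8080"):
    """Rough sketch: POST RDF N-Quads to /mutate as one atomic commit."""
    headers = {"Content-Type": "application/rdf"}
    body = "{ set { %s } }" % nquads  # wrap the N-Quads in a set block
    resp = requests.post(f"http://{host}/mutate?commitNow=true",
                         data=body, headers=headers)
    if resp.status_code == 200:
        print(resp.json().get("data"))
    else:
        print(f"mutation failed: {resp.status_code}")
```

So a call like mutate('<0x2> <playedSong> <0x1> .') would insert the edge from the earlier N-Quad example.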
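And the upsert shape described there looks roughly like this sketch: a query block that captures UIDs in a variable, and a set block that references the variable. The predicate and song name here are hypothetical:

```
upsert {
  query {
    # capture the uid of the matching song, if any, in v
    song(func: eq(name, "Dark Star")) {
      v as uid
    }
  }
  mutation {
    set {
      # reference the captured variable in the N-Quads
      uid(v) <lastPerformed> "1994-03-30" .
    }
  }
}
```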
One question that may come up is: how do you actually load the data up front? Dgraph gives you three different ways to get data into the system. Initially, you load the data through the bulk loader. The bulk loader is a binary that you run against the Dgraph Zero before your Alpha nodes are brought online, and its input format is again going to be RDF or JSON. If you use the RDF format, it's very easy to understand and very easy to troubleshoot, debug, and generate, though your mileage may vary. The bulk loader uses a MapReduce process, and they give you several different knobs you can turn to tune the performance. For example, you can choose the number of reducers: if you want to shard the data across multiple nodes, you certainly have the capability to do that; you just pass in the number of reducers you want, and that determines how many shards you're going to create. You can also define the number of mappers. The only caveat I can give you at this point is that the bulk load process is not distributed; it runs on a single node. So generally, if you have large amounts of data, you'd want to run it on a system with lots of CPU and lots of memory, especially if you're loading data volumes in the billions: billions of nodes, billions of edges, or tens or hundreds of billions. Definitely throw compute at it.

Once you have your Dgraph database up and running, you can run the live loader, in addition to the mutations I showed previously. The Dgraph team ships the live loader binary with the Docker containers and the binary distribution, and it does a lot of optimizations when running against the Alpha nodes on the same host.

But that's all I've got. I definitely appreciate you coming to hear me speak about Dgraph. Definitely reach out if you have any questions; you can hit me up on Twitter. I appreciate your time. Thank you.