Okay, hi. My name's Ted Wilmes. I work for a company called Expero. We're a consulting company, and I specifically do a lot of graph database work, so that's why I'm talking about JanusGraph today. I'm an active member of the Apache TinkerPop and JanusGraph communities, and over the last few months I've gotten to know a little bit more about, and use, FoundationDB. Today I'm going to talk about graph databases in general, just briefly, to give people an introduction to the state of the graph database world. Then we'll talk about JanusGraph, how it currently runs on a number of different storage backends, and then what it looks like on top of FoundationDB. Then I'll talk a little bit about what it took to put together an adapter to actually run JanusGraph on FoundationDB, and if there's time at the end, we'll talk a little bit about performance.

First of all, today I'm going to talk about graph databases, specifically the property graph data model. You may have heard of triple stores and RDF and things like that, but I'm going to be talking about property graphs. The basic idea of a property graph is that you have vertices and edges. Both vertices and edges can have labels, and they can have properties, properties just being key-value pairs. So it's a pretty simple model.

So what do people actually use graph databases for? There are, of course, as you'd probably imagine, the canonical graph database examples like social network analysis. But we're finding that lots of customers also just want to use graph databases for other sorts of industry use cases. Here's an example of a simple data model from a supply chain use case. Usually it's pretty easy to see graphs in the data sets you're already using.
Whether or not it's a good idea to use a graph database is usually more of an operational question, but here we have a bill of materials graph that has some connection into your supply chain, where you have a network of suppliers. You can use that to look at things like, hey, what happens if a disruption happens at this particular distributor?

Another area where I've spent a lot of time, and where I actually got started in graphs, is the IoT space. Here's a simple little graph of some infrastructure you might be monitoring. Specifically, I worked in oil and gas, so we would look at equipment out in the field and how that equipment was connected, and then couple that up with time series data. It gives you a high-level contextual view of how the system is laid out, in addition to the time series data you're actually collecting.

The graph world right now is kind of exploding; there are lots of different vendors coming into the space. The one I'm going to talk about today is specifically the JanusGraph project. You all may have heard of Titan before. I think there actually was somebody who wrote a Titan adapter for FoundationDB. Did anybody do that? Anyway, I thought so, maybe. JanusGraph is a fork of Titan, and it's hosted at the Linux Foundation. Apache TinkerPop is a graph processing framework that JanusGraph makes use of.

In the graph space right now, there are a few query languages that are common. The two most common are Cypher, which is Neo4j's language, and Gremlin, which is part of the Apache TinkerPop stack. The Apache TinkerPop stack gives you the graph query language, a web server to talk to it and drivers, and also some analytical processing tools for running graph algorithms over OLAP systems like Spark. I won't go into a ton of detail on the Gremlin query language.
This is just supposed to be a pointer in case you get interested in this later. Gremlin is a graph query language that lets you do anything from analytical-type queries to regular old OLTP-type queries, mutations, and reads.

On to the JanusGraph architecture. From the very beginning, when JanusGraph was originally Titan, years and years ago, it was built to abstract out the storage layer. JanusGraph doesn't actually store any data itself; it's a layer that sits on top of another storage backend. That's what made it particularly easy to put FoundationDB underneath it. Traditionally, folks have run JanusGraph mostly on Cassandra, and then, a somewhat distant second, on HBase. But there are also a number of other adapters that folks have developed over the years. JanusGraph can also integrate with third-party indexing tools. So there are a lot of benefits to running in this layered approach, but there can be some challenges, especially when you're running on top of these eventually consistent backends.

So what is the JanusGraph-on-FoundationDB value proposition? When FoundationDB got open sourced, I was really excited, because the thing JanusGraph was really missing was a distributed, highly available, ACID storage option. There just wasn't anything out there. You could run it on a single instance with BerkeleyDB, and you could probably pay Oracle to give you a highly available BerkeleyDB, but I'm not aware of anybody doing that. So this was really good news. On the face of it, we get high availability and fault tolerance, and ACID transactions are a huge win. Overall, I think that pays off with simplified operations and a better developer experience. And there are just a lot of things internal to JanusGraph right now that have had to be done to work around the fact that a lot of the time it's running on these distributed, eventually consistent stores.
There are all sorts of bespoke locking recipes and things like that: things that end users maybe don't see directly, but that really affect their usage of the system.

Right now, Apache TinkerPop and JanusGraph take a generic, high-level approach to transactions. They say, okay, we have this thing called a transaction, but in and of itself it doesn't actually give you any specific guarantees. Those guarantees are inherited from the storage backend you put underneath it. The syntax doesn't really matter here, but you can start explicit transactions, do multiple queries, and then at the end commit your transaction or roll it back. Whether you get any sort of ACID behavior or consistency depends on the engine underneath.

So why might ACID matter in a graph database? I'm going to tell a little story: the parable of the edge to nowhere. Here we go. We start with two vertices, vertex A and vertex B. Let's say user one looks up vertex A and looks up vertex B. Then another user comes in, and in the meantime, while this first transaction is still open, they drop one of those vertices. Pretty straightforward. Then I go in and say, okay, let's add an edge between these two things, let's say Ted knows Tom, and commit that. Well, if you're running this on Cassandra, what's going to happen is you get what we call a phantom edge, this edge to nowhere. So I go back and run a little query on my graph and say, hey, how many people does vertex A know? And it says, oh yeah, he knows one person. He doesn't really know one person in reality, but the database thinks he does. This is the equivalent of not being able to trust a foreign key constraint in your relational database. So that's no good.
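To make the parable concrete, here is a small, hypothetical Python simulation (this is not JanusGraph or FoundationDB code; all names are illustrative) of the optimistic commit-time check that a transactional backend gives you: the transaction remembers the version of everything it read, and refuses to commit if any of it changed underneath it.

```python
# Toy sketch of optimistic transaction conflict detection.
# Hypothetical code, illustrating the idea only.

class CommitConflict(Exception):
    pass

class ToyStore:
    def __init__(self):
        self.data = {}      # key -> value
        self.version = {}   # key -> integer version, bumped on every write

    def write(self, key, value):
        self.data[key] = value
        self.version[key] = self.version.get(key, 0) + 1

    def delete(self, key):
        self.data.pop(key, None)
        self.version[key] = self.version.get(key, 0) + 1

class Txn:
    """Remember versions of everything read; refuse to commit if any of
    them changed in the meantime (conceptually what FDB does for us)."""
    def __init__(self, store):
        self.store = store
        self.reads = {}    # key -> version seen at read time
        self.writes = {}   # buffered writes

    def read(self, key):
        self.reads[key] = self.store.version.get(key, 0)
        return self.store.data.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        for key, seen in self.reads.items():
            if self.store.version.get(key, 0) != seen:
                raise CommitConflict(f"{key} changed since it was read")
        for key, value in self.writes.items():
            self.store.write(key, value)

store = ToyStore()
store.write("vertex:A", {"name": "Ted"})
store.write("vertex:B", {"name": "Tom"})

# User 1 opens a transaction and reads both endpoints.
txn = Txn(store)
txn.read("vertex:A")
txn.read("vertex:B")

# User 2 drops vertex B while user 1's transaction is still open.
store.delete("vertex:B")

# User 1 tries to add the "Ted knows Tom" edge and commit.
txn.write("edge:A->B", {"label": "knows"})
try:
    txn.commit()
    committed = True
except CommitConflict:
    committed = False   # conflict detected: no edge to nowhere
```

On a last-write-wins store there is no such commit-time check, so the edge write simply lands even though one of its endpoints is gone, and that is the phantom edge.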
There are things built into JanusGraph to try to lessen the chance of this happening on an eventually consistent backend, but they're not foolproof. Ideally, this is something we want to push down into a system that is purpose-built to handle it without any issues.

There are some other operations that rely on consistency. JanusGraph vertices have IDs, and with ID block allocation in a distributed system, you want to make sure two different vertices don't end up with the same ID. ID allocation is also problematic from a performance standpoint. Schema changes need agreement; this goes back to metadata, like we discussed earlier. If you're making changes to your graph schema in a cluster, you need to rely on consistency there. Unique index constraints, too. And lastly, JanusGraph allows you to plug in third-party search indexing, so now you're storing data in, say, Cassandra and Elasticsearch. Ideally, those would stay in sync, but sometimes they don't.

So JanusGraph, like I said, has a few tools internal to it for this. There's even a nice warning in the documentation, if anybody reads it, that basically says: this works most of the time, probably, but cross your fingers. So obviously there's a lot of room for improvement here. Right now, it's largely left up to the developer. You need to do things at the application level, plan ahead, and assume from the beginning that you're not going to get these sorts of guarantees. It's extra complication, and frankly, it's something people just miss. I can't blame them; I've done it before.

So, the FoundationDB storage adapter. Building a new adapter for JanusGraph is fairly straightforward, partially because there are a lot of examples already in place.
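For a flavor of what such an adapter implements: JanusGraph's storage SPI (in Java; the interface is called KeyColumnValueStore) boils down to rows of sorted columns with slice reads over them. Here is a much-simplified, hypothetical Python rendering of that shape, with an in-memory dict standing in for the backend:

```python
# Schematic stand-in for JanusGraph's Java KeyColumnValueStore SPI.
# Hypothetical sketch, heavily simplified for illustration.
from bisect import bisect_left

class ToyKeyColumnValueStore:
    """Each key owns a sorted list of (column, value) entries; reads are
    slice queries over a column range, writes are batched mutations."""
    def __init__(self, name):
        self.name = name
        self.rows = {}   # key -> sorted list of (column, value) pairs

    def mutate(self, key, additions=(), deletions=()):
        row = dict(self.rows.get(key, []))
        for column in deletions:
            row.pop(column, None)
        for column, value in additions:
            row[column] = value
        self.rows[key] = sorted(row.items())

    def get_slice(self, key, start_col, end_col):
        """Return (column, value) pairs with start_col <= column < end_col."""
        row = self.rows.get(key, [])
        cols = [c for c, _ in row]
        return row[bisect_left(cols, start_col):bisect_left(cols, end_col)]

store = ToyKeyColumnValueStore("edgestore")
store.mutate("v1", additions=[("p:name", "Ted"), ("e:out:knows:v2", b"")])
props = store.get_slice("v1", "p:", "p~")   # slice over just the properties
```

The real SPI has more to it (key iterators, locking hooks, transaction handles), but implementing those sorted-slice semantics on top of a new backend is the core of the recipe.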
So I won't go into detail on all these bullets, but there's a little recipe here if you want to build your own JanusGraph storage adapter.

The data model for JanusGraph itself was initially built on a generic, Bigtable-style model. What we have is vertex and edge data stored in adjacency list format: for each vertex ID, that vertex's properties are stored, and then that vertex's inbound and outbound edges are stored. One of the nice things about FoundationDB, and about Cassandra in general, is that in a graph database there aren't many opportunities for sequential reads, so the ability to do slice queries on the edges is nice from a performance standpoint. If we map this to Cassandra, if you've used Cassandra before, your partition key would be the vertex ID, and your rows are sorted on the clustering keys. Right now, this data is stored as serialized byte arrays, so if you looked at it in CQL, you wouldn't be able to read it.

So how do we map this to FoundationDB? It's not really that different from the document layer example. We're basically going to use a combination of the directory layer and the tuple support. Our keys are going to look like this: where in Cassandra we had a keyspace that stored the graph for customer A and a keyspace for customer B, here our key is going to be prefixed with the graph. Then within that graph, there are different, if you want to think about it in table terms, tables: one that stores vertex and edge data, one that stores index data, one that stores schema data. Then we break it out a little further, and finally we get down to the actual key for that vertex, and then whether it's a property or an edge. The value is simply the byte array we want to pull back, which might store property value information.
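The key layout just described can be sketched with plain Python tuples standing in for FoundationDB's directory and tuple layers (the real adapter packs keys with the fdb bindings; the names below are illustrative). Because tuples compare element by element, all of one vertex's entries sort together, which is what makes reading a vertex's properties and edges a single range read:

```python
# Illustrative sketch: Python tuples as a stand-in for the FDB tuple layer.

def make_key(graph, table, vertex_id, column):
    # e.g. ("customer-a", "edgestore", 1, b"p:name") -> one ordered key
    return (graph, table, vertex_id, column)

db = {}  # we sort the keys below; FoundationDB keeps keys ordered itself

# Two graphs (what were Cassandra keyspaces) sharing one cluster:
db[make_key("customer-a", "edgestore", 1, b"p:name")] = b"Ted"
db[make_key("customer-a", "edgestore", 1, b"e:out:knows:2")] = b""
db[make_key("customer-a", "edgestore", 2, b"p:name")] = b"Tom"
db[make_key("customer-b", "edgestore", 1, b"p:name")] = b"Other tenant"

def range_read(prefix):
    """All entries whose key starts with `prefix`, in key order: the moral
    equivalent of an FDB get_range over a tuple prefix."""
    return [(k, db[k]) for k in sorted(db) if k[:len(prefix)] == prefix]

# One range read pulls back everything for vertex 1 in customer A's graph:
vertex_entries = range_read(("customer-a", "edgestore", 1))
```

Customer B's data never shows up in that read because its keys live under a different prefix, which is exactly the keyspace-style isolation described above.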
If we stretch that out a little here, you can see from the first example basically how that's mapped. Obviously, something like prefix compression would really help us out here.

We used the BerkeleyDB adapter as inspiration for this particular adapter. There's a host of different methods you basically go in and implement, and a number of tests you can use as inspiration. That said, I like that autonomous testing idea; I think we could probably use that. So we took those BerkeleyDB tests and adapted them for the FoundationDB adapter.

Now, if we go back to our example and revisit the edge-to-nowhere case, you don't really need to read this, but this is the one time I'm excited to see an exception. We did that first sequence of operations again, but now we're running on FoundationDB. And look, we don't corrupt our graph. We actually get this exception, because there was a conflict when we went to commit: that vertex was deleted.

So, future work. I think there's a lot of potential here. One thing, and maybe this will be overtaken by events because of the improved support for transactions, is that the current adapter could use some improvement around extending transactions past five seconds. Right now you can do reads past five seconds in the current adapter, but of course it doesn't maintain the same consistency guarantees it would within a single transaction. It sounds like with the Redwood storage engine, we may be able to just pull that out and make use of that storage engine for longer read or size limits on our queries. The other thing, like I mentioned, is that I think one of the biggest wins here is custom FoundationDB implementations of JanusGraph's internal pieces, like the ID allocation.
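ID allocation is a good example of that simplification. Today JanusGraph claims ID blocks with a multi-step recipe designed to survive eventually consistent stores; on a transactional backend it collapses to one atomic read-modify-write. A hypothetical Python sketch, with a lock standing in for the atomicity a single FoundationDB transaction would give you:

```python
# Hypothetical sketch of ID block allocation. A threading.Lock stands in
# for the atomicity of a single FoundationDB transaction.
import threading

class IdBlockAllocator:
    def __init__(self):
        self.next_id = 0
        self._lock = threading.Lock()

    def allocate_block(self, block_size):
        # In FDB this would be one transaction: read the counter, write
        # counter + block_size; a conflicting allocator simply retries.
        with self._lock:
            start = self.next_id
            self.next_id += block_size
            return range(start, start + block_size)

allocator = IdBlockAllocator()
blocks = []

def worker():
    # Each worker plays the role of one JanusGraph instance claiming a block.
    blocks.append(allocator.allocate_block(100))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_ids = [i for block in blocks for i in block]
```

Eight concurrent allocators end up with eight disjoint blocks, so no two instances can ever hand out the same vertex ID, and none of the bespoke lease-and-wait machinery is needed.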
For other sorts of invariants we're trying to maintain within JanusGraph, I think the JanusGraph code can be greatly reduced and simplified if we can make use of FoundationDB for them.

Another thing: there are certain use cases for us where it would be helpful to have some sort of hybrid storage model. Right now we store data in, roughly, the graph equivalent of a row-oriented format. In certain cases, it would be nice to store, say, the more historical adjacency list data in a column-oriented format. To do that online, it would be nice if the database in the backend could be moving data from row- to column-oriented format. But to do that, we need consistency; we don't want to start moving some data and then miss a write that happened in the meantime. So I think FoundationDB will greatly simplify that sort of thing.

The other thing I started looking at, but which isn't in the actual adapter yet, is that FoundationDB, as was discussed earlier, publishes information about the locations of keys. If we co-locate our JanusGraph layer with the FoundationDB nodes, I think we can do some intelligent query routing and predicate pushdown that really isn't happening in any of the layers right now. That would give performance improvements from a client perspective. And lastly, it would be great to pull full-text search in so it's no longer a third-party component we need to keep in sync. That will be a larger undertaking, but for getting full-text and geospatial search pulled in, maybe we could use some of the indexing work in the document layer as inspiration.

If you're interested in graph databases or want to try this out, we have it up on our GitHub repo. If you're unfamiliar with how to use JanusGraph, the JanusGraph site has pretty good documentation, and TinkerPop does too.
To get started, you basically download JanusGraph and then install this FoundationDB adapter. It's easy to get up and running on one instance, or if you want to stand up more FoundationDB nodes, you can do that. I should have mentioned the JanusGraph deployment model: you can horizontally scale JanusGraph itself out too, so you don't just have one JanusGraph instance, you can have however many, whether that's one-to-one with the storage layer or not.

So, I'm excited about the future of FoundationDB, excited to have something that fits this need and fills this hole we had in the JanusGraph stack, and I appreciate the opportunity to speak to you all.