I think we're about ready to get started. Thank you everybody for coming. This is the introduction to the JanusGraph database. My name is Jason Plurad. I'm a developer in open source; I work in the IBM Cognitive Applications Group. What we do is focus on open source, specifically around data and AI. We try to identify open source projects that we think could be strategic to help improve our products and our services, and we reach out to those communities and try to continue to build them and grow them, so that we can have a healthy ecosystem in the data and AI space. Is anyone here familiar with JanusGraph? Okay, great. JanusGraph is an open source, scalable graph database hosted at the Linux Foundation. This is an introductory talk, so we're just going to start at the beginning. We'll talk about typical graph use cases, what brought us at IBM to graph, some of the open source graph evolution we've seen over the years (I've been involved in this space probably since 2013 or so), and where we see JanusGraph progressing. We've been at the Linux Foundation since 2016 and continuing to grow, even though it's still a relatively small project. So we're going to be talking about graphs today, not graphs in the sense of a bar chart or a pie chart, but the mathematical graph, specifically a property graph model. It contains vertices, edges, and properties, and we'll go into these more in depth on the next slides. This here is the typical example that we show in JanusGraph, called our Graph of the Gods. It's a rich model: it connects different types of objects, or vertices (those are your entities in the graph), through edges, which are the relationships. So specifically, the vertex is the entity in the graph. It always has an identifier and a label, and it doesn't have to be connected to other vertices in the graph.
Obviously the most value comes when it is connected to other vertices via an edge. A vertex can also have additional properties on it, which we'll see on the next slide with edges. Edges connect your entities, and the relationship is directional. This is key: you have the out vertex, which is your source, connecting to the in vertex, which is your target. In this example here, we have the father edge. It starts at the son and points towards the father; the child points towards its father. Edges are similar to vertices in that they have an identifier and a label, and they can have properties as well. You can have multiple edges between the same vertices, and you can have an edge from a vertex back to itself, a self-directed edge. The way that you work through a graph is through these edges. Being able to jump from one vertex to another is what we call a traversal. Then there are properties. These exist on both vertices and edges, and they're typically how you store additional data: more specific attributes that help differentiate one vertex from another, or one edge from another. Typically these are simple key-value pairs, and you can have a collection of items as your value, whether it's a set or a list. So what we like about graph, specifically the property graph model, is that it's very rich compared to what you have in a relational model, because you don't have to pre-define the schema up front like you would with a relational database. It's very easy to connect data and add additional data later as it becomes available. The flexibility is really what attracted us to it. Being able to connect data as more relationships or more properties become available gives you a lot of flexibility to get your data together and run an analysis on it. One of the first use cases that we had was around engagement analytics. In IBM, we use a product called IBM Connections.
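To make the model concrete before we get into the use cases: the vertex/edge/property structure described above can be sketched as a minimal in-memory property graph in Python. This is purely an illustrative sketch, not JanusGraph's actual API or data structures; the PropertyGraph class and the Graph of the Gods values are made up for the example.

```python
# Minimal in-memory property graph: vertices and edges both carry
# an identifier, a label, and arbitrary key/value properties.
class PropertyGraph:
    def __init__(self):
        self.vertices = {}   # id -> {"label": ..., "props": {...}}
        self.edges = []      # directional: out vertex -> in vertex

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = {"label": label, "props": props}

    def add_edge(self, label, out_v, in_v, **props):
        # out_v is the source, in_v is the target; multiple edges
        # between the same pair, and self-loops, are both allowed.
        self.edges.append({"label": label, "out": out_v, "in": in_v,
                           "props": props})

g = PropertyGraph()
g.add_vertex(1, "god", name="jupiter", age=5000)
g.add_vertex(2, "god", name="saturn", age=10000)
g.add_edge("father", 1, 2)   # jupiter's father is saturn
```

A traversal, in these terms, is just following an edge from its out vertex to its in vertex, which is what the Gremlin examples later in the talk do step by step.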
It's our social product. It has forum posts, blogs, file sharing, those sorts of activities. We've been using this product for years internally, and we wanted to be able to derive more value out of it. The interesting thing about the IBM Connections product is that each of those applications I mentioned, forums and blogs and so on, were all developed independently using a relational database, so their data was siloed. There wasn't an opportunity to really understand how people were interacting with each other across the whole suite of applications. By putting that data into a graph, we were able to easily connect it. Here we can see some of the relationships we were able to gather through the work of putting the data into the graph, and through analysis from our data scientists of how different teams were connecting with each other. Why this can be interesting: you can better identify different clusters of expertise, whether regional or by technology. You can identify the information brokers, the people who connect pockets of information that it would be strategic to have working together better. Where we found this interesting is that those connector people are very valuable to the company. From an employee-retention perspective, keeping those people became an interesting proposition, because typically what we found is that those highly connected, central people who are able to connect diverse teams are the valuable ones. If they leave the company, they're more likely to take the senior technical people away with them to their next job. So the idea of retaining those resources was very interesting.
More on an individual level, we thought it would be valuable to give some information back to the people using the social tools, so they could really understand what kind of value they're driving, both for themselves and for the company, by using the Connections collaboration tool. This is a personal social dashboard using the same exact data, but geared towards the individual, so that they can better understand how they were contributing through the social tooling and how that connected them to other people. Along the bottom, we have scores based on how they were able to share with other people and what kinds of people reacted to the things they posted. Eminence is more a measure of how broadly your posts were shared by your peers, and obviously your network; this is pretty common these days. One of the big aspects we had to consider in building this was privacy. People were very concerned up front: what if my manager sees this? They were afraid it was possibly going to be a performance indicator. But really, the idea was to give a personal feedback loop, so you can understand what your gaps are, where you should look for mentoring, where you should look for training. In the spirit of openness, having that feedback loop really would help. The next use case we ran into was around airline routing. We worked with a major airline in the United States on this. IBM currently runs a lot of the backend systems for airline routing on the mainframe. What we wanted to see was how easy it would be to get that data into a graph database, so that when the airline needs to worry about routing a passenger, say when their flight gets delayed or canceled, it can figure out the next route from one hop to another. This graph here is actually a representation of the geo-coordinates of all the airports in the graph. It was generated by one of my colleagues.
And we actually used this ultimately as a learning tool inside the company, to help teach people how to use a graph and how to understand one. It's pretty simple to see how well connected everything is, and it's relatable on a personal level: the travel aspect of being delayed at an airport and trying to understand how to jump from one to another. And what other rules do you have to put on top of that? If I need to connect from the United States to China, I might not want to transit through Canada, because that would perhaps bring other visa restrictions. Next, cloud databases. This is a screenshot from our cloud offering. In 2015, we acquired a couple of cloud-based database companies: one was Cloudant and one was Compose. Compose is our portfolio of managed open source databases, and we added JanusGraph into that as another open source offering. What we were finding with our customers is that open source databases are great, but running them in the cloud is another story. Being able to have a 24-by-7 managed cloud offering lowered the barrier for them to adopt open source databases. Many of our customers are big companies that are well entrenched in older technologies, so having a lightweight model for them to play with an open source database on the cloud was great. We had some additional features on top, such as the visualization browser below, which allowed them to better understand how their data was connected once they put it into the graph. That's out there right now at compose.com. Typically what we found is that once we started seeing one graph, it was easy to find a graph in many different areas. In the graph database space we like to say that the graph is a whiteboard-friendly type of model. It's easy to understand: you get up to the whiteboard, draw your vertices, your nodes, your entities, and then just connect them.
So it gets pretty addictive, and that's typically one of the first things that people want to see. They say, where's my graph? Let me see it. Because once you can visualize it, you can better understand your data. Moving on to how we progressed in open source. One of the first projects we got involved with was TinkerPop. When we were thinking about doing graph, the first thing we did, obviously, was go out to open source to see what was going on. What we found at the time was that all of the graph databases were implementing this open source stack called TinkerPop. It was a vendor-neutral open source project created by Marko Rodriguez, and they were all implementing the stack, and that was great. That gave us the flexibility to use the same model and compare the performance characteristics of different graph databases at the time. What was going on was pretty amazing, because the implementation of this open source project was being done by both proprietary and open source databases. TinkerPop itself provides the graph model definition, a graph server, and the graph traversal language, which we're going to go into in more depth in the next few slides. We worked with Marko and his team to get them moved into open governance at the Apache Software Foundation. They went to Apache top level in 2016, and they're continuing to grow. Many of the major graph database vendors are still implementing the TinkerPop stack today. Neo4j is probably the biggest graph database right now that people may know about. Other big entries are DataStax Enterprise Graph; Microsoft's cloud-managed Azure Cosmos DB, which does graph among the other NoSQL styles it supports; and Amazon Neptune, which notably does both property graphs and RDF graphs. RDF is the semantic web type of graph.
So we talked a little bit about the graph structure before. Now, Gremlin. Gremlin is a domain-specific language for graphs. It defines how you walk from one vertex to another along the edges; that's what a traversal is. And Gremlin, you've seen this character as part of the TinkerPop documentation. I didn't mention before, but I'm a PMC member on Apache TinkerPop. We use Gremlin the character all through our documentation. It helps to make the concepts more approachable. Marko, our founder, actually worked with a graphic designer up front to use the characters, to make it more relatable and just tell the story of graph, and I think it's been pretty effective. So we're going to go through an example of a graph traversal. The graph here is pretty simple. I don't know if you can read all the details on it, but this is the TinkerPop modern graph: a very simple graph, six vertices, six edges. It contains people and software. Some of the people know each other, and some of the people created software projects. When I work with teams who want to work with a graph, we always say, well, what information do you want to get out of the graph? The best way to do that is just to speak it in your own language: what projects did Marko's colleagues create with others? And this is what the query looks like in Gremlin right here. We're going to take that step by step and break down what each step does. Gremlin itself contains probably about 60 different steps, but they break down into five different categories. It's a Turing-complete language, it's pretty flexible, and a lot of interesting things are happening in the TinkerPop space. We'll start at the top: g.V(). The g stands for the graph traversal source; basically, that's your graph. .V() would be all the vertices. I've designated that here with the little green guys, the little gremlins on each node.
So when you call g.V(), we have a traverser, a little gremlin, at each vertex. The next step is has('name', 'marko'). This is a filter step, and it will select only the vertices which have the name marko. Here the only green guy left is over here, the marko node; all the other traversers have been killed. You can see what the problem is here: with the g.V() in the previous step, we selected all the vertices, and now we only have one left. If you have a large graph, that is not very efficient. It works fine in a six-node graph, but if you have millions of nodes, that's a bit of a problem. So typically what you'll see in a graph database implementation is that you'll have an index. Here we've utilized an index on the name property, so the traversal goes directly to that one node. Indexes increase performance by letting you zoom in on a single node instead of doing a full scan. Beautiful. All right. Next we're moving out on the knows edge. Starting from the marko node, the traverser splits across both of these edges: there are two knows edges, so now we have a live gremlin here and a live gremlin here. Now we need to see what projects were created. Following along the created edge with outE, we're only looking at the edge, because in the next step we filter on a property of that edge, keeping only edges where the weight is less than one. This one has a weight of one, so only this other gremlin is still alive. Then inV moves us towards the target vertex; the gremlin moves from the edge to this vertex. And then we emit the values of the name property. Here we can see that the name of the software project is lop. So that's just one way to walk through this graph. We could have gone in a different direction; we could have done this in many different ways, just like any other programming language, for that matter. But that gives you a flavor of how you would do a walk.
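The walk above can be sketched in plain Python over the TinkerPop modern graph. This is an illustrative re-implementation of the traversal steps, not the Gremlin runtime; the vertex and edge data matches the standard six-vertex sample graph.

```python
# The TinkerPop "modern" sample graph: people and software.
vertices = {
    1: {"label": "person",   "name": "marko"},
    2: {"label": "person",   "name": "vadas"},
    3: {"label": "software", "name": "lop"},
    4: {"label": "person",   "name": "josh"},
    5: {"label": "software", "name": "ripple"},
    6: {"label": "person",   "name": "peter"},
}
# Edges as (out vertex, label, in vertex, weight).
edges = [
    (1, "knows",   2, 0.5), (1, "knows",   4, 1.0),
    (1, "created", 3, 0.4), (4, "created", 5, 1.0),
    (4, "created", 3, 0.4), (6, "created", 3, 0.2),
]

# g.V().has('name','marko') -- a filter step; a real database would
# use an index here instead of scanning every vertex.
marko = [vid for vid, p in vertices.items() if p["name"] == "marko"]

# .out('knows') -- hop to marko's colleagues: vadas and josh.
colleagues = [e[2] for e in edges
              for v in marko if e[0] == v and e[1] == "knows"]

# .outE('created').has('weight', lt(1)).inV() -- keep only created
# edges whose weight is below one, then move to the target vertex.
created = [e[2] for e in edges
           for v in colleagues
           if e[0] == v and e[1] == "created" and e[3] < 1]

# .values('name') -- emit the name property of the surviving vertex.
names = [vertices[vid]["name"] for vid in created]
print(names)  # the lone surviving traverser lands on "lop"
```

Each list comprehension plays the role of one Gremlin step: the traversers are just the elements flowing from one list to the next, and filters kill traversers by dropping elements.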
So that covers TinkerPop. But going deeper, TinkerPop doesn't provide a full implementation of a graph database, and that's where we turned to something called TitanDB, which fills in the gaps that TinkerPop leaves behind. This was a project created by Marko Rodriguez and his partner, Matthias Broecheler. It was a full implementation of a graph database. Again, it was an open source project, Apache licensed, built by one consulting firm, Aurelius. They were getting good adoption from the community, but they got acquired by DataStax in 2015. When they got acquired, they kind of left the community hanging: they didn't say what the future of the project was going to be. So even though it was still open source, development had slowed down significantly. Meanwhile, they released DataStax Enterprise Graph. The community was left hanging out there, not sure what to do. We knew from being in the community long enough that there were plenty of other companies out there in the same boat we were: using this great TitanDB product and wanting an alternative. I should mention that DataStax Enterprise Graph was under an enterprise commercial license only. So we got together with some partners, made a fork of the Titan graph database, and turned it into JanusGraph at the Linux Foundation, as I mentioned, in 2016. The goal here was really to reconnect the open source community and embrace open governance. Since we were hosted at the Linux Foundation, no single company controlled it anymore, and we could just move together as we had before. We have our founders: Google, IBM, Hortonworks (now Cloudera), a good set of companies. And since then we've continued to add new companies and individuals as well; they're all listed there. These are big companies. For pretty much anyone who's looking for a scalable open source graph database, JanusGraph is probably one of the top options that come to mind.
Right now in the graph database space, most of the options are proprietary. We're seeing some good adoption from other open source projects as well; I've listed some of them here. In particular, I'd point out the ones that are actually hosted at the Linux Foundation. First is a project called Egeria. It falls under the ODPi umbrella, and it's for open metadata and governance. What you're able to do with that project is keep track of your data sources and add governance around them: access control, things like that. It can use JanusGraph as a primary data store, as an example, because it has a pluggable model. ONAP, if you went to some of the keynotes earlier today, they talked a lot about how it's pretty much becoming the standard around networking. There's a component called Active and Available Inventory in the ONAP project, which uses a graph database to store how the different network components are connected together. Again, that's another classic type of graph use case: the connections in a network. This is a slide of the JanusGraph architecture. In addition to being an open source project with a good open source license, one of the key things that we liked about Titan, and now JanusGraph, was its flexibility. In this diagram, the light green items actually come out of the TinkerPop project: the TinkerPop APIs, the structure, and the language. The darker green boxes are what JanusGraph provides. It provides a management layer for managing your schema, and it provides the database layer for doing transactions and data management. But probably the best part for us was the flexible storage and indexing layer. This is an abstraction that allowed us to experiment with different storage backends. On the storage side, we have Apache Cassandra, HBase, and BerkeleyDB. Since then, we've added backends for many others; there's FoundationDB, ScyllaDB, and the list goes on.
What's great about that is that you can take advantage of the skills you already have. If you already have skills with HBase, or a data center running it, you can use that and do graph on top of it. Or if you wanted to, say, compare the performance characteristics of Cassandra versus HBase, you can do that. Google, our partner, has Google Cloud Bigtable, which is based on the HBase spec, so you can move easily using JanusGraph from on-prem to the cloud. Similarly for the indexing backends. The reason we have different backends for storage and indexing: your primary storage is for storing your vertices and your edges, but if you want to do things like full-text search, numerical search, or geospatial search, that's where additional data is stored in these other backends, whether it's Elasticsearch or Apache Solr. So again, you have the flexibility to use whichever backend. From a community perspective, the most popular are Cassandra and Elasticsearch. The key benefit here, number one, is the licensing: having a free software license that's permissive and lets you do what you want with it. Right now in the database space, there are vendors like MongoDB and Elastic that are starting to move towards more restrictive licenses, to protect how they want to operate. But with the Apache license, you can do whatever you want with this software, and not having one company leading it just means that it'll always be open for you to use and collaborate on. I already talked about the pluggable storage options. We found that to be the most interesting, because like I said, you have the choice of using either open source options or your own proprietary backends, whatever you need to suit your use cases. So again, I'm from the open source group at IBM, and our commitment to open source comes down to code, content, and community.
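Choosing among those pluggable backends comes down to configuration. A minimal sketch of a JanusGraph properties file for the combination the community uses most, Cassandra for storage and Elasticsearch for indexing, might look something like this (the hostnames are placeholders):

```properties
# Primary storage backend: Cassandra, accessed via CQL.
storage.backend=cql
storage.hostname=127.0.0.1

# Mixed-index backend for full-text, numeric, and geo search.
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
```

Swapping `cql` for `hbase` or `berkeleyje`, or `elasticsearch` for `solr`, is how you would target a different backend without changing your graph code.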
So we already talked about some of the code with JanusGraph. But we've built many different assets around helping people get onboarded with JanusGraph. I have some links down here you can take a look at. We have some utilities and a blog series to help people get started. We have an end-to-end application where you can ingest data and query it out. Lots of good stuff there, free and available, to help broaden our scope and your exposure to JanusGraph. From a project perspective, like I said, we started in 2016 and we're continuing to grow. We've added both committers and technical steering committee members since the onset, and many different companies have come on board. It really is just continuing with the idea of diversity. One of the key gaps we had from the start was around client driver support. The stack is built in Java, but many programmers these days are working with other languages, so over the past year or so there's been a big push to get client drivers out. We've already released the .NET driver; I think Python is very close to being done, and then JavaScript. Those are the top three languages people have been asking for. Different platform support: most people are using Linux, and I'm at a Linux conference, but there are still plenty of people who use Windows, and those people have been looking for support. We recently released Docker images for the first time, which is fantastic. Apache Ambari, this is more for the operations side of the house: having an operations console to easily deploy. This is good if you're using Apache HBase and Solr, so you can easily have JanusGraph in that environment. I talked a little before about the backend support; these are mostly coming from the community. FoundationDB, Apple recently open sourced this last year, and one of our TSC members started fleshing out backend support for that database.
What's interesting about FoundationDB is that they used to be open source. They got bought by Apple and went closed; then Apple opened them up again. When they were originally open source, they had a backend adapter for Titan, so it's coming full circle for them. Everyone wants to be open. As a database, what we want to continue focusing on is performance: being able to benchmark. This isn't necessarily benchmarking against other databases. With the flexibility of having different storage backends, the key question people always ask is, well, which backend should I use? Being able to benchmark there, that's something we want to focus on. The developer experience around bulk loading and ETL could also use help. From an industry perspective, there's a property graph schema working group: several vendors right now are trying to get together and build some standards around graph databases. I think with the outcomes from the property graph schema working group, the W3C, the World Wide Web Consortium, is going to have a graph standard. That's where we're headed: a standardized graph language. From a JanusGraph perspective, it's been our mission to be an open source implementation of TinkerPop, and TinkerPop will also be compatible with that. Thank you again for coming out. I work in the open; I'm pluradj in most places. Feel free to reach out to me with any questions you have. I can take some now. Thank you. The question is about performance benchmarks for JanusGraph. I don't have any benchmarks with me; I believe you can probably find some. Performance benchmarking is something I think our community can work to produce and publicize better, but right now we don't have any benchmarking in the project at all. It's been done by other parties, so I don't have that. The next question is: what's the difference between JanusGraph and Neo4j or TigerGraph?
So Neo4j came about, it's probably the oldest graph database. The key difference that people like to point to with Neo4j is that it uses Cypher. Originally they were implementing TinkerPop, and they have Gremlin support, but over time they decided to create their own graph language called Cypher. That's one of the main differences. It has a GPL license, and once you get beyond a certain scale, you're required to get a commercial license. It's a good product. As far as differences, it's single-vendor controlled, and there are good things that come with that: they have more tooling available, and they have dedicated people to do marketing. Those are some of the differences. Same with TigerGraph. That's another proprietary database, again single-vendor and commercially licensed. I think they were founded by former Twitter engineers, and when I've spoken with them, what they're trying to do is go after Neo4j, more in the large-graph arena, and they have a different language. TigerGraph does not implement TinkerPop or openCypher; they have their own language, which I think they call GSQL. But again, like I mentioned before, this is why the graph database space is still emerging. Even after so many years, it still isn't standardized the way the relational database space is with SQL. But I think moving forward with the W3C and getting to a graph standard will help settle things in the industry around graph databases. Thank you. Any other questions? Okay, thank you.