Cool. Hi everyone. My name's Will. I'm a software developer at Neo4j. Neo4j is a graph database. How many people are familiar with Neo4j or graph databases in general? Okay, say maybe 40%. I work on the developer relations team, which means I build a lot of integrations: tools to make sure you can use Neo4j with other frameworks, things like that. I also run what we call our data journalism accelerator program. We noticed more and more interest in using graph databases for data journalism, and we wanted to make sure there were resources for journalists getting started with Neo4j. So in this program we work with a few projects at a time to make sure they can be successful with Neo4j, whether that's data modeling, import, querying, or visualization: all of the things you need to draw insights from data for journalism stories.
What I'm going to talk about today are a couple of data sets related to data journalism that folks are using Neo4j with. These are the two data sets: the Panama Papers, which we'll just mention briefly, and then we'll dive into what we're calling the Trump World graph.
So first of all, for the 60% of you that didn't raise your hand when I asked if you were familiar with graph databases or Neo4j, let's talk about what a graph database is. I thought the most concise way I could explain this is, of course, in 140 characters. So this is Neo4j in 140 characters: Neo4j is open source software that stores and queries data as nodes and relationships, using the Cypher query language with index-free adjacency. That's 140 characters, but there's a lot in there.
So let's break that down a little bit. Open source software: Neo4j is an open source project. You can download it and get started, you can build it from source, the code is on GitHub. Software that stores and queries data: so it's primarily a database, focused on durability. The key point I want to get across here is that it not only stores your data, like most databases do, but also provides a mechanism for querying your data as a graph.
So let's define what we're talking about when we say graph. With a graph data model we're talking about nodes and edges. In this data model here on the right you can see we have officer nodes that are connected to legal entities. They're either a director or a shareholder of some legal entity. We can see the intermediary, which is the law firm that created this legal entity, and then we can see that they have address nodes. This is the graph data model that the ICIJ actually used to model the leaked data from a Panamanian law firm that became known as the Panama Papers. Is everyone familiar with the Panama Papers? This was a huge data leak from a Panamanian law firm. Lots and lots of famous and influential people were caught up in this leaked data, which showed that they actually did have offshore companies tied to them.
So that's the data model, and then we need some query language to query our graph. With Neo4j we use a language called Cypher, or openCypher. This is an open source language, so there are other database projects that implement it as well. You can think of Cypher as SQL, but for graphs. Has anyone written a Cypher query? A few folks? Cool. So Cypher is all about pattern matching.
So you can see on that first line we're describing a graph pattern that we're searching for in the graph. We define nodes within parentheses, so you can see we have address nodes, officer nodes, and entity nodes, and then relationships are within brackets. Here we're looking for officers, these are people, that have some address and that are connected to entities, which are the legal entities. And we can filter on that: where the address contains Portland. So those first two lines say, find paths in the graph where we have officers with an address in Portland, and now find the legal entities that they're connected to. Then in the return clause we're doing a group by and an aggregation to count the number of paths. What we end up with is the answer to the question: for everyone with an address in Portland, what are the jurisdictions of the offshore entities that they're connected to? So for Portland, the British Virgin Islands is the most common offshore jurisdiction, followed by the Cook Islands. And you can see by the counts there that we don't have that many people with offshore entities that were found in the Panama Papers data set. Largely, most of the people caught up in this data set were in Europe. Because of offshore laws it doesn't make a lot of sense to use Panama as the registry for that, but you still have some people with addresses in the US at least. Anyway, that's just meant to be an example of Cypher, to give you an idea of what it looks like.
The other piece here is that with graph databases we have this concept of index-free adjacency. What this means is just that as we traverse the graph, as we go from one node to another node that it's connected to, we do this without having to do an index lookup.
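To make those two ideas a bit more concrete, here is a minimal Python sketch. It is not Neo4j's implementation and not the query from the slide. It assumes a tiny made-up graph: the officer names, addresses, entity names, and the relationship names `REGISTERED_ADDRESS` and `CONNECTED_TO` are all hypothetical. Each node holds direct references to its neighbours, which is the spirit of index-free adjacency, and the function walks the address, officer, entity pattern and counts paths per jurisdiction, roughly what the group by and count in the return clause do.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str   # e.g. "Officer", "Address", "Entity"
    props: dict
    # Direct references to neighbouring nodes, grouped by relationship type.
    # Following a relationship is a reference hop, not a global index lookup.
    out: dict = field(default_factory=dict)

    def connect(self, rel_type: str, other: "Node") -> None:
        self.out.setdefault(rel_type, []).append(other)


def jurisdictions_for_city(officers, city):
    """Count (address, officer, entity) paths per entity jurisdiction."""
    counts = Counter()
    for officer in officers:
        for address in officer.out.get("REGISTERED_ADDRESS", []):
            if city not in address.props["address"]:
                continue  # the WHERE-style filter on the address text
            for entity in officer.out.get("CONNECTED_TO", []):
                counts[entity.props["jurisdiction"]] += 1
    return counts.most_common()  # most frequent jurisdiction first


# Hypothetical sample data, invented for illustration only.
portland = Node("Address", {"address": "1 Example St, Portland, OR"})
berlin = Node("Address", {"address": "2 Beispielweg, Berlin"})
alpha = Node("Entity", {"name": "Alpha Ltd", "jurisdiction": "British Virgin Islands"})
beta = Node("Entity", {"name": "Beta Ltd", "jurisdiction": "British Virgin Islands"})
gamma = Node("Entity", {"name": "Gamma Trust", "jurisdiction": "Cook Islands"})

officer_a = Node("Officer", {"name": "Officer A"})
officer_b = Node("Officer", {"name": "Officer B"})
officer_c = Node("Officer", {"name": "Officer C"})

officer_a.connect("REGISTERED_ADDRESS", portland)
officer_a.connect("CONNECTED_TO", alpha)
officer_a.connect("CONNECTED_TO", gamma)
officer_b.connect("REGISTERED_ADDRESS", portland)
officer_b.connect("CONNECTED_TO", beta)
officer_c.connect("REGISTERED_ADDRESS", berlin)
officer_c.connect("CONNECTED_TO", gamma)

result = jurisdictions_for_city([officer_a, officer_b, officer_c], "Portland")
print(result)
```

The cost of each hop here depends only on how many relationships the current node has, not on how many nodes exist overall, which is the property the talk is pointing at.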
So we have constant-time performance for each step of a graph traversal, which means that our queries can scale to very, very large data sets and still have the same performance characteristics. Okay, so that was an overview of graph databases, so we're all on the same page.
So let's talk about the Trump World data set. What is this? Well, I think it's been an interesting last year or so in data journalism. Since Trump was seen as a serious candidate, I think a lot of journalists realized that his vast business connections are very different from those of lots of other presidential candidates. And so the ways they reported on these kinds of things in the past don't carry over directly to Trump. If you look at his financial disclosure documents, which are the starting point for a lot of these types of investigations into connections to his assets and business connections, you can see there are hundreds of LLCs and corporations in there, LLCs and corporations that have fractional ownership in each other, and so on. And if you look at the tools that journalists need to make sense of this data, both for their investigations internally and also to present this data to their readers, a very common data structure that's used is the graph. A great example is LittleSis. LittleSis collects crowdsourced data: anyone can upload information, usually about influential people and their connections. They do a great job of modeling and presenting this data as a graph. We've also seen lots of other research, across lots of different publications, using a graph as a way to explain these types of connections.
But it's important to realize that, instead of just using these types of visualizations to present information to readers, we as journalists can also store, model, and query our data as a graph, at scale, to work with public data. Earlier this year Frederik Obermaier and Bastian Obermayer, two German journalists at Süddeutsche Zeitung, wrote an opinion piece in the Guardian about the reason that the data investigation for the Panama Papers worked. In their eyes, the reason it worked was collaboration. Frederik and Bastian were the original journalists to receive the leaked Panama Papers data directly from the source. Rather than keeping the data to report on and investigate themselves, they went to the ICIJ, the International Consortium of Investigative Journalists, which is basically a network of over 300 journalists that collaborate. They took the data to the ICIJ because they realized that to get the most value out of this data, we need to share it, we need a large team to work together. What the ICIJ ultimately ended up building was a tool that modeled the data in the Panama Papers leak as a graph in Neo4j, and they built a search engine on top of it. So the journalists could search the data set, as a graph, for people in their area, for example the geographic area they're interested in covering.
And here we can see an example. At the bottom in the middle there, that is the former prime minister of Iceland, Gunnlaugsson, and I say former because he was forced to resign when the information in this visualization came to light. And that's the fact that he had an interest in an offshore company that he did not report.
As for how this was discovered: you can see the other person node at the top of this visualization. That's his wife, who also has an interest in this offshore company, Wintris. And this was discovered because they have a shared address. Off to the right, that's the address that they have in common. So the journalists were able to see, oh, there's this connection from the prime minister to someone who shares the same address. I wonder if they have other connections. Oh, that's his wife. And digging into the documents further, they saw that he had sold his interest in this company to his wife for a dollar and didn't really report that, and he ultimately had to resign. So this is just one example of the kinds of insights you can get from this type of data when you look at it as a graph.
Meanwhile, we've seen lots of journalists working to make sense of the public data that's out there around Trump, starting with the financial disclosure documents and digging into other areas as well. But there wasn't anything like the ICIJ and the Panama Papers, sharing data across groups publicly, until BuzzFeed published what they called Trump World. This was the data set that they had used internally to keep track of connections around the Trump administration: people-to-company connections and people-to-people connections as well. They released this to the public in a blog post that said, hey, here's the data that we have. Please let us know, what are we missing? Please help us fill this in. And very quickly, I think they tripled the size of their data set from other people submitting things in just a couple of days.
One of my co-workers wrote a blog post about how you can import this data into Neo4j and model it as a graph, so you can do things like see clusters around some of the cabinet members: what companies are they connected to, what companies are they connected to in common, these kinds of things. And I should say, this is the format that BuzzFeed originally released the data in. It's just a Google spreadsheet, which we can easily export to CSV to import into lots of different things. When we published this blog post, a lot of journalists came to us and said, hey, this is great, we want to work with this, but we have this other data set. Can you help us merge these, or can you make this publicly available in a way where I don't have to go through the import process in Neo4j?
My team was working on another project at the same time that we called Neo4j Sandbox. The idea with Neo4j Sandbox is that we wanted a way to allow anyone to spin up a Neo4j instance with a data set already loaded, and with a way to share queries and visualizations. We had an alpha version of this, and we thought, well, people are very interested in this data set, so let's make this one of the Sandbox examples, which we did. So I'll give you an example of the Sandbox. You can go to neo4j.com/sandbox and just sign in with Twitter, and then you're presented with these different use cases. I'm going to spin up the Trump World Sandbox. What this is doing is spinning up an AWS instance and loading it with, initially, the data from BuzzFeed. We import directly from the Google spreadsheet, and then we give the user access to this one Neo4j instance.
So this is private just to that user. We can jump into Neo4j, and here's an example of what Neo4j looks like. This is Neo4j Browser, which is like a query workbench for Neo4j. The font's a little small, but we have lots of different guides that we can walk through to explore the data. Let's skip ahead here and just look at a couple of interesting queries that we can run to give us an idea of what this data looks like.
So here's a query you might want to run. This is two degrees of Jared Kushner. I want to change this to Joshua Kushner, because Jared ends up being very connected. This essentially says: find the person node that represents Joshua Kushner, then traverse out two levels deep and show me everything that you find. So find all of the people and companies that are within two hops of Joshua Kushner. And if we look at this, we can see, okay, here's Joshua Kushner, he's on the board of a lot of companies, he's invested in a lot of companies. He's the founder of this VC fund, Thrive Capital, which invests in a lot of startups. We can see all of the startups, like GitHub and Mapbox, that Thrive Capital has invested in. Peter Thiel is here. Peter Thiel is an investor in Thrive Capital, as well as directly connected to Donald Trump because he had some role in his transition team. So you can see this is the type of data that we have here. We can use Neo4j Browser and graph visualization to explore this data.
Another thing I think is interesting, to give you an idea of the connections in the data, is to look for shortest paths between two people or between two companies. So what's the shortest path from Vladimir Putin to Donald Trump in this data set? It turns out to be not a very long path.
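As a rough sketch of what those two queries compute, here is a small Python example. It is not the Cypher from the slides and not how Neo4j executes it. It uses only a handful of connections mentioned in this talk, treats every relationship as undirected and untyped (a simplifying assumption), and the function names are made up for illustration. Both functions are plain breadth-first searches.

```python
from collections import deque

# A few connections mentioned in the talk, flattened to undirected pairs.
EDGES = [
    ("Joshua Kushner", "Thrive Capital"),
    ("Thrive Capital", "GitHub"),
    ("Thrive Capital", "Mapbox"),
    ("Peter Thiel", "Thrive Capital"),
    ("Peter Thiel", "Donald Trump"),
    ("Donald Trump", "Rex Tillerson"),
    ("Rex Tillerson", "Vladimir Putin"),
]


def build_adjacency(edges):
    """Map each node name to the set of names it is directly connected to."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def within_hops(adjacency, start, max_hops):
    """Return {node: distance} for every node within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


adjacency = build_adjacency(EDGES)
two_hops = within_hops(adjacency, "Joshua Kushner", 2)
path = shortest_path(adjacency, "Vladimir Putin", "Donald Trump")
print(sorted(two_hops))
print(path)
```

In this toy graph Donald Trump is three hops from Joshua Kushner, so he does not appear in the two-hop result, while Peter Thiel does.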
So we can see that Donald Trump is connected to Rex Tillerson, who's the Secretary of State, who has some connection to Vladimir Putin. And by the way, we can inspect any of these relationships and see the source document for the relationship, which comes from the BuzzFeed data set. So we have two connections that are just two degrees from Donald Trump to Vladimir Putin. That's an example of the type of things we can look at. I'll share these slides later; lots of these queries are in here.
But one thing I want to talk about before we break here is this idea of data fusion. Being able to model and query your data as a graph is helpful. I think a graph is a very intuitive data model. Cypher is very expressive. That's nice. But really one of the big powers of using graphs in data journalism is this idea of data fusion. A graph is a very flexible data model that allows us to very easily combine data sets. One of the things that someone in our journalism accelerator program was interested in looking at was data on federal contracts. Are there any vendors of federal contracts that show up in the Trump World data set? So, companies that are receiving federal contracts that may have, say, a board member or an investor who may be related to the administration. This is potentially interesting information. So what we did is we loaded this data into Neo4j in the Sandbox. There's an example that shows how we can pull in the CSV file for all government contracts. We search the Trump World data set to see if there are any of those vendors in there, and then we extend the data model a bit. So here's just one example, from a specific contract. Here's the contract.
So we have a contract node that we add. This specific contract was issued by the Federal Prison System. If we inspect it, it's for lease of prison space, essentially a private prison contract, that was awarded to this Merchandise Mart Properties company. Well, it turns out this is a subsidiary of Vornado Realty Trust, which is an investor in Kushner Companies. So there's some connection to Jared Kushner, who's the CEO of Kushner Companies, or was at least. And then we can see that the chairman of Vornado is an advisor to Donald Trump, and we can also see that the CEO of Vornado made campaign finance donations to Donald Trump as well. And it's also interesting to note that since Trump was elected, private prison stocks have gone up significantly, on the belief that the Trump administration is going to be beneficial for the private prison system. So it's this type of data that I think is useful to dig into, to see whether there are potential conflicts of interest in these kinds of things. This is just one simple example. Digging into this, we can also see the contract that was awarded by the Secret Service to one of the Trump real estate companies for the lease of space in Trump Tower to maintain an office there, and so on.
We've worked a lot on importing other data sets as well. Things like OpenCorporates, which is a corporate registry. Things like LittleSis, which I mentioned earlier. All of this is on GitHub, so there are scripts to import all of these data sets, which are very interesting when you're able to query across them. There are lots of non-profits in Trump World, so we import some of the 990 data as well.
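The data fusion step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the import scripts from the GitHub repository. The CSV column names, the contract ids, the relationship names, the second vendor row, and the name-normalization rule are all hypothetical. The organizations and their connections are the ones mentioned in this talk.

```python
import csv
import io
import re

# Existing graph, as (source, relationship, target) triples from the talk.
trump_world = [
    ("Merchandise Mart Properties", "SUBSIDIARY_OF", "Vornado Realty Trust"),
    ("Vornado Realty Trust", "INVESTOR_IN", "Kushner Companies"),
    ("Jared Kushner", "CEO_OF", "Kushner Companies"),
]

# Hypothetical contracts file: the layout and the second row are made up.
CONTRACTS_CSV = """contract_id,vendor_name,agency,description
contract-001,"MERCHANDISE MART PROPERTIES, INC.",Federal Prison System,Lease of space
contract-002,Example Widgets LLC,Hypothetical Agency,Office supplies
"""

# Legal suffixes dropped before comparing names (an assumed, crude rule).
SUFFIXES = {"INC", "LLC", "CORP", "CO", "LTD"}


def normalize(name):
    """Upper-case, strip punctuation, and drop common legal suffixes."""
    words = re.sub(r"[^A-Za-z0-9 ]", " ", name).upper().split()
    return " ".join(word for word in words if word not in SUFFIXES)


def fuse_contracts(graph, contracts_csv):
    """Return new triples linking contracts to vendors already in the graph."""
    known = {}
    for source, _, target in graph:
        known[normalize(source)] = source
        known[normalize(target)] = target

    new_triples = []
    for row in csv.DictReader(io.StringIO(contracts_csv)):
        vendor = known.get(normalize(row["vendor_name"]))
        if vendor is None:
            continue  # vendor does not appear in the existing graph
        # Extend the model: a contract node tied to its agency and vendor.
        new_triples.append((row["agency"], "ISSUED", row["contract_id"]))
        new_triples.append((row["contract_id"], "AWARDED_TO", vendor))
    return new_triples


new_triples = fuse_contracts(trump_world, CONTRACTS_CSV)
fused_graph = trump_world + new_triples
for triple in new_triples:
    print(triple)
```

Exact matching on normalized names is the simplest possible join. Real vendor names are messier, so in practice matches like this would need to be reviewed before being reported on.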
So the key takeaway I want to leave you with is that there are two benefits to using graph databases for open data. One is that the data model is very intuitive. Modeling data as a graph makes a lot of sense because that's how we think of the data, and then we have this nice query language. We don't have to write SQL joins. We just draw the graph pattern that we want to search for. The other benefit is this idea of data fusion and our ability to combine data sets. I think it's really, really powerful when you can combine data sets and then query across them to find insights that you wouldn't find just by inspecting one of the data sets.
If you're interested in learning more about this, there's the Neo4j Sandbox, which I mentioned, that also has a data set on U.S. Congress and campaign finance. The Panama Papers data set is in there. There are also some more data science type things, like how you build a recommendation system with movie data, this kind of stuff. So I would encourage you to use that as a starting point for digging in further. And with that, I think I am just about out of time. Two minutes for questions.
Alright. Yeah. So the question is basically, are there stories coming out of this data using this approach? And there are things in the works, right? Panama Papers is a great example of stories that have been published using this kind of analysis. And I mentioned the data journalism accelerator program that we have. There are lots of organizations we're working with that are working towards publishing things that maybe haven't come out yet.
The other open source graph databases? Like Gephi? Okay.
Okay, yeah, so a quick comparison between Neo4j and Gephi. Gephi is a tool that's mainly useful for running graph algorithms and visualization. Neo4j can do those kinds of things as well, but Neo4j is primarily a database focused on durability and storage of your data. So it's a transactional database, whereas Gephi is more useful for more complex visualizations, not necessarily for storing and querying your data. Yeah. Cool. Thanks everyone.