Yeah, hi everyone. I'm Martin, this is Max. We are both engineers at Neo4j, and in the previous talk Stefan talked about features of openCypher; in this talk we will show you how some of those features are implemented in our Cypher for Apache Spark implementation. So this is the team: it's Max and I, then Mats is our team lead, Stefan the former team lead, and Philip is also working on that project. So why Cypher for big data? I mean, if you know Cypher then you probably also know Neo4j, and Neo4j is a transactional database for OLTP workloads; that's where you typically position this database system. However, many of our customers also have data lakes and already use big data tools for data integration, so ETL as you probably know it, and also for large-scale analytical processing, so more like OLAP. And we were thinking about how we can help those people with Cypher, which they are used to because they are already customers: how can they use Cypher within those two scenarios? The typical big data applications that we see, and this is just a collection here, collect data from user interactions at websites, or typically they take internal data from their companies and put it together to do some analytics, like billing, marketing, ERP system data and so on, or they combine it with external data. And of course the goal is to improve customer targeting, supply chain optimization, fraud detection and so on; so all the graph use cases, but at large scale. A common framework that you see in this environment is Apache Spark, a distributed data flow system. And we thought about, okay, how can we help with Cypher, how can we help those big data analytical applications? So one pillar is data integration. Here you want to be able to use multiple large-scale data sets within your analytical program. You want to retain and reuse intermediate results between queries. You want to integrate data from multiple data sources, so not only HDFS or, in our case, Neo4j, but maybe also a relational database, and just get it all into a single program and start your analytics. And of course you want to support heterogeneous data. The second pillar is complex analytical processing, which means that you have building blocks, which are Cypher queries, and then you want to compose them to build up your complex workflows. What's also nice about a system like Spark is that you can integrate our library with any other library that is already available, so for example machine learning, graph processing with GraphX, or your own domain-specific business logic. And of course Spark itself gives you all the nice features of distributed execution, typically on a shared-nothing cluster. So here we introduce CAPS. CAPS is a Scala library, Cypher for Apache Spark, and its main use case is of course to execute Cypher on these distributed graphs. You can integrate it into your Spark pipelines. At the moment we support different data sources: for example you can pull your data from Neo4j into the Spark environment, or you have your graph stored in HDFS or your local file system in a CSV format which we support. It can handle heterogeneous data, and it's already able to compose Cypher queries, which you will see later in an example and in our demo. CAPS is built by Neo4j and we donated it to the openCypher community that Stefan introduced in the previous talk. It's already released on GitHub, Apache 2.0 licensed.
And we are planning a GA release in mid-May, where we will have the supported data sources Neo4j, HDFS and Hive, so you can also use SQL to get data out of Hive into our system. And for the rest of 2018, or the beginning of 2019, we also want to add more data sources, for example relational DBMSs like Oracle, Postgres and so on, and also support systems that are basically Gremlin-focused; there is a translator, Cypher for Gremlin, which you can use to also get data out of those systems. Okay, so let's get a bit more technical: what's the architecture? As you see at the top, this is a regular Cypher query. It's just looking for a person who loves a specific system, in this case Neo4j, and there's an optional match, so this person might also love Spark, and we want to return the person and the two systems. We have a multi-layer architecture within CAPS. We start of course with parsing the query. For that we use the openCypher front end, which is a shared module that is also used by Neo4j. So you put the query string in and get the abstract syntax tree out, and it does all the parsing, rewriting of the query and some algebraic optimization. Then there's CAPS. CAPS is mainly used to translate the AST, the abstract syntax tree, into something that can be executed on data frames, which is the abstraction that Spark gives you, so it's basically a translator between those two things. It also gives you the ability to import and export data, and we have schema and type handling, because as you might know, data frames have a fixed schema, whereas Neo4j, or a property graph in general, doesn't have a fixed schema, so there needs to be some handling between those two worlds, which I will explain a bit later. Then, once we have our data frame program, we hand it over to Spark, and Spark has this nice thing called the Catalyst optimizer, a rule-based optimizer for query optimization, which will probably give you some improvements on your runtimes. And at the end, of course, the Spark runtime executes the query on your cluster. So this is the high-level view of the system. Okay, as I already mentioned, there are some differences between Neo4j and Apache Spark, or specifically the data frame API. First let's talk about the graph format. Neo4j, as you might know, is a so-called native graph database, which means it's built from the bottom up optimized for graph operations; from the storage layout up to the query language, everything is optimized for graphs. A data frame, which is the abstraction in Apache Spark, is more like a table in a relational DBMS: it has a fixed schema, you have the relational operators like join, union and so on, and that's of course a gap. As I mentioned, we have relational operators on the Spark side and native graph operators on the Neo4j side; for example, Expand is an operator where I get all the relationships of a specific node, and VarExpand computes a variable-length path expression, like someone knows someone over one to ten hops. Then the schema, which I already mentioned: we have schema optionality on the Neo4j side, so you can put some constraints on the schema, but even nodes with the same label can have different kinds of properties, whereas on the Spark side this is all fixed. And then of course there's the type system: we have the Cypher type system on one hand and the Spark SQL type system on the other, which are not compatible by default, so those also have to be mapped.
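To make that type gap concrete, here is a minimal sketch in Scala of the kind of mapping such a layer has to perform. The `CypherType` hierarchy and the mapping rules below are illustrative assumptions, not the actual CAPS classes:

```scala
import org.apache.spark.sql.types._

// Illustrative stand-in for a Cypher type model; the real CAPS type classes differ.
sealed trait CypherType
case object CTString  extends CypherType
case object CTInteger extends CypherType // Cypher integers are 64-bit
case object CTFloat   extends CypherType
case object CTBoolean extends CypherType

// One plausible mapping onto Spark SQL types.
def toSparkType(ct: CypherType): DataType = ct match {
  case CTString  => StringType
  case CTInteger => LongType    // 64-bit, so LongType rather than IntegerType
  case CTFloat   => DoubleType
  case CTBoolean => BooleanType
}
```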
So let's talk about a few of those challenges, starting with the schema. As I said, we require a schema for Spark data frames. If you use, for example, our HDFS data source, which as I said is CSV, then you already have a schema available that is explicitly defined, and we can just derive it from there. If you instead load your data out of Neo4j, we implicitly infer the schema while loading the data, so we build it up as we see the data. For example, if you look at this graph here, and I hope you can see some of it, it's not that complicated: just two persons that know each other, and they love different systems. For that example a schema would look like this: you have the node labels, for example Person, and a Person has two properties, name, which is of type string, and year of birth, which is of type integer and is nullable, because not all persons have the year-of-birth property key. Then there's also the Employee label, and Employee is an implied label: if someone is an Employee, they are also a Person, because those two labels always appear together, so a Person who is an Employee also has the name property in this example. And then the same for relationship types: you have the KNOWS relationship between those two persons, which has a property key, since, of type integer. The second challenge is the graph representation. From a logical point of view you have the property graph, so how do you represent that in data frames? Data frames, as I said, are tables, so we chose the concept of node and relationship tables, which are constructed per label. In this example we have one table for persons and one table for the systems, which you see at the bottom. The Employee label, which as I said is an implied label that always occurs together with the Person label, is just an additional column in that node table; it's a boolean column, so it's just true or false depending on whether that person is also an employee. And then you have all the properties, like you would also model it in a relational database, with null values for the optional property keys. The same goes for relationships: in the relationship table, as we call it, you additionally have the source and target, or start and end, node identifiers of the relationship, and then the properties that the relationship might have.
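As a rough illustration of this tabular layout, here is a small, self-contained Spark sketch of what a Person node table and a KNOWS relationship table could look like; the column names and values are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("caps-table-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Node table for :Person. Employee is an implied label, so it becomes a boolean column,
// and the optional yearOfBirth property becomes a nullable column.
val personNodes = Seq(
  (0L, "Alice", Some(1984L), true),
  (1L, "Bob",   None,        false)
).toDF("id", "name", "yearOfBirth", "employee")

// Relationship table for :KNOWS, with start and end node identifiers plus the 'since' property.
val knowsRels = Seq(
  (0L, 0L, 1L, 2016L)
).toDF("id", "source", "target", "since")

personNodes.show()
knowsRels.show()
```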
Okay, the next challenge is query translation. So again we have our Cypher query here, and the physical view is query optimization, or query engine handling in general. On the left side we have our input, which are the node tables that I just showed you, then we have a series of operators, and in the end we have the result of our query. This is basically where the magic happens within the system, and today I want to explain at least some of that magic. On the left side you see the same architecture as before, the high-level view of the system; here is CAPS, and within CAPS we have four phases of query planning. The first phase is the so-called intermediate language phase. It's an intermediate representation of the query which is back-end agnostic, because one other goal, besides running this on Spark, is being able to port this project to another back-end system, maybe an in-memory system or Apache Flink, for example. So at first this is a back-end agnostic query representation, and we translate the AST into that. Then we start with logical planning. These are still graph-specific operators, so on the logical planning side you still have operators like Expand and VarExpand that I mentioned before, and we do some basic additional optimization on top of what the front end is already doing. Then we have the flat planning. Flat planning is the step where we compute the column layout of the resulting data frames: if you for example do an Expand, so you go from a node to its relationships, then you need to compute the column layout, basically the schema, for the resulting data frame, and this is what flat planning does. Then, in the physical planning, we actually translate the graph-native operators like Expand, VarExpand and so on into data frame operations like join, select, distinct and so on. And then, as I said before, this is handed over to the Spark engine, optimized by the Catalyst optimizer and executed by the runtime. Okay, and when a query finishes you of course have the result available, which is again a data frame that we call a Cypher result, where each row represents one result record. In this case we wanted to return the user and the two systems, so for example Alice here, as you can see, loves Spark and Neo4j, while Bob only loves Neo4j, so the rest of that row is filled up with a null value. Okay, now that we know at least some internals of CAPS, we want to talk about the API before we show you the demo. We try to make our API pretty similar to what Spark already provides with the data frame API, to make it especially easy for people who are used to Spark to also use CAPS. The central entry point is the so-called CAPS session, analogous to the Spark session, and this is the minimal program to run a Cypher query on Apache Spark. In the first line we just create a local session, which initializes the Spark cluster and so on. In the second line we specify a data source by just saying session.readFrom and giving it a URI, where the scheme defines which kind of data source it is; in this case, you might not be able to read it, but it's hdfs+csv, which internally triggers the right code path to get the right reader for this kind of graph, followed by the address where the graph is stored within HDFS. On line four, well, line four is actually line three, but anyway, we actually trigger the query, and on the last line we print the result, which leads to the console output that you can see here, as you are used to from Neo4j.
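For reference, that minimal program could look roughly like the following. The package path, the URI and the exact method names are assumptions based on the talk, so treat this as a sketch rather than the released CAPS API:

```scala
// Sketch of the minimal CAPS program; names follow the talk and may differ in the release.
import org.opencypher.caps.api.CAPSSession // assumed package path

object MinimalCapsExample extends App {
  // 1. Create a local CAPS session, which also spins up a local Spark session.
  val session = CAPSSession.local()

  // 2. Pick a data source via a URI; the "hdfs+csv" scheme selects the CSV-on-HDFS reader.
  val graph = session.readFrom("hdfs+csv://hdfs-cluster/graphs/social-network")

  // 3. Run a Cypher query against that graph (query and path are hypothetical).
  val result = graph.cypher("MATCH (p:Person)-[:LOVES]->(s:System) RETURN p.name, s.name")

  // 4. Print the result table to the console, similar to the Neo4j shell output.
  result.print
}
```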
Then there's the second example, which is probably only readable from the first rows, sorry about that. On the left side this is just regular Spark code to create data frames; just trust me that the one at the top constructs a node data frame and this one a relationship data frame, each with some schema. On the right side is the program that actually uses those data frames to run a query on them. There's a concept called entity mapping, which is basically just a mapping between the column names in the Spark data frame and the concepts that we need to construct nodes and relationships from it: you have to tell us in which column to find the node ID, for example, and in which column to find a specific property key, and the same for relationships, with the addition of the start and end node keys. So it's pretty simple, and that's the goal of the API. Stefan talked before about multiple graphs; this is an example that is also currently available in CAPS and that you can run, which involves multiple graphs. Max will give you a more advanced example later, but it's just to give you the idea. What we do here is take two graphs: we mount the first one from HDFS, and the second one comes from Neo4j, indicated by the bolt scheme in the URI, with two queries to get the data out of Neo4j. We store them as two graphs, myHdfsGraph and myNeoGraph, and then within the query we can say: from my HDFS graph give me all the employees, and from my Neo4j graph give me all the persons, and then join those two graphs where the persons and the employees have a matching email address. So we basically do data integration: we have two separate graphs, one in HDFS and one in Neo4j, and we join them together based on some knowledge that we have about those two graphs, which creates a new relationship, called same-as, between the employees and the persons that match. And then this graph is used with an additional query that you can see here, so it's also an example of composition, because this is the first query and this is the second query that we run on the result of the first one.
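To give a feel for the syntax, the data-integration query just described could look roughly like this. It is illustrative pseudo-Cypher: the graph names, labels and the same-as relationship type follow the description above, and the exact multiple-graph keywords in the CAPS release may differ:

```scala
// Illustrative multiple-graph query; exact keywords and names are assumptions.
val integrationQuery =
  """FROM GRAPH myHdfsGraph
    |MATCH (e:Employee)
    |FROM GRAPH myNeoGraph
    |MATCH (p:Person)
    |WHERE e.email = p.email
    |CONSTRUCT
    |  CLONE e, p
    |  NEW (e)-[:SAME_AS]->(p)
    |RETURN GRAPH""".stripMargin
```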
Okay, that's the concept, and Max will now show you a more advanced, running example of that. Okay, so before we can start the example, I just want to set you all up with the scenario we are talking about. Let's assume we are a company and we want to do a marketing campaign in specific metropolitan areas, and we have access to two data sets. One data set is a social network, a generated social network, which we split up into two regions, so our social network is partitioned into a North America part and a Europe part, and we store it in two Neo4j databases, just for scalability and performance. On the other side we have a customer data set, so customers who have bought products, and the products are also categorized into product categories. On the social network side we have people who know one another, who live in certain cities, and who have different interests. And as you can already see, we have a shared attribute across those data sets: people on the social network side as well as customers on the product side both have an email address, and this is where we can connect those data sets. So what we do is we assume that if a person on the social network side and a customer on the product side have the same email address, they are the same person in reality, so they represent the same entity. I forgot to mention that the product data set is stored in HDFS, just to make this example a bit more complex, or rather more diverse. So what we want to do is: we load the data from the corresponding data sources, as I said Neo4j and HDFS; we extract subgraphs that only contain the data concerning those metropolitan areas we want to target, in our case New York City, San Francisco, Berlin and Stockholm; we merge those two graphs on the email attribute; and then we compute recommendations, so we want to recommend to users products that their friends have bought and which both their friends and the user are interested in. Yes, that's basically our setting for the demo, and now let's switch to Zeppelin. Zeppelin is a data science tool which allows you to run Spark queries in an interactive fashion, and we have adapted CAPS so that we somewhat support Zeppelin and can use it for our demo. Is it big enough, is it readable from at least some places in this room? Should I increase the font size even more? Okay, I'll leave it as it is. So what we did is we imported all the CAPS libraries, like the CAPS session, and we have this special Zeppelin support class here just for Zeppelin, and we've already created the CAPS session. What we start with is loading the data sets. We first read our graphs from Neo4j; this is done in those first two lines, and I'll just run it here because it sometimes takes a while. So we read the data from Neo4j for the two regions, and we read the data from our product data set, which is done here. You see this one is using bolt, which is the Neo4j scheme, and then we have hdfs+csv, so we tell our system this graph is stored in HDFS and uses our custom CSV format. And what we do here is somewhat analogous to the mount call that Martin showed previously: we have our graph, and we now store it in the session under a certain name, or in our case currently under a URI, so we can later use it from within the Cypher query, so we can reference that graph, change our input graph and read from a different source. So this worked, I guess. Yes. Now let's start with some simple queries to show you the Cypher 9 capabilities that we already support. Let's take our social networks and have a look at the distribution of interests in our cities, both for New York City and San Francisco as well as Stockholm and Berlin. What we do here is we select all people and their interests, then group by city name and interest, and have a look at what people in those cities like. This is also to demonstrate how nicely this integrates with Zeppelin: you can write your Cypher queries, which are quite readable, and then you get a quite nice visualization of the result. So what we see is that here in New York City the interests are more or less evenly distributed, and San Francisco is somewhat more old school: they seem to like videos more than DVDs, I guess these are real videos like VHS, and they don't read as many books. But I mean, it's artificial data, so it's just for fun. It looks almost the same for Berlin and Stockholm. What we do now, to make our lives a bit easier, is extract those metropolitan areas: we don't want to look at all the cities that are available in our data set, but, as I said, we just want to target those four cities. So we extract graphs of people who live in one of our target cities, and if two people live in the same city and know one another over one or two hops, that is, they are friends or friends of friends, then we create a new graph with an edge between those people, so from now on they live in the same social circle, and then we return that graph.
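Roughly, the extraction query described above could look like the following pseudo-Cypher; the labels, relationship types, property names and the graph-returning syntax are assumptions about the demo's schema rather than its actual code:

```scala
// Sketch of the "social circle" extraction for the North America graph; names are assumptions.
val socialCircleQuery =
  """MATCH (a:Person)-[:LIVES_IN]->(city:City)<-[:LIVES_IN]-(b:Person),
    |      (a)-[:KNOWS*1..2]-(b)
    |WHERE city.name IN ['New York City', 'San Francisco']
    |CONSTRUCT
    |  CLONE a, b
    |  NEW (a)-[:IN_SAME_CIRCLE]->(b)
    |RETURN GRAPH""".stripMargin
```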
Let's run those queries. And here we don't have a table representation anymore, because we don't return a tabular result but a graph result: we return a new graph. So what we get here is a visualization of the new graph. You see those two communities; I guess one of them represents all the people living in New York City, the other one all the people living in San Francisco, and they're all somewhat connected if they know one another or their friends know one another. Let me run this as well. What we do now is combine those two graphs. So far we have always worked with the two different social network data sets, the one for North America and the one for Europe, and we now want to combine them into a single graph, just because it makes life easier for us, and we again store this combined graph in our session under the friends URI. And now we merge the two data sets; this is where the multiple graph capabilities come in handy. We start by reading from our friends graph, so this is where we use this graph-at-friends feature: we tell the system that for now I want to read from the friends graph, and we find all people and access their email address. Then we switch our input graph: from now on we keep the results that we have computed so far, but all reads that happen now are done using the products graph. So again we find all users and their email addresses, and now we do a cross product, or rather a value join: we find people and users that have the same email address, and for every such pair we create, in a new graph, an IS edge, so the user is the person, or rather is the same person. So up until now we have done our ETL. Wait, didn't I run that? I did it already. So this was all ETL stuff: we had our more or less raw input data and we transformed it so that we can now do our analysis. First we had to combine all the data into a single graph, which again just makes it easier, because we also want to access the person attributes, their interests and things like that, so we create one large graph and use this graph to do our analysis. What we do is: we find people that are in the same social circle and like a certain interest, and then, if the person in the social network is the same as a user in our products database and has bought a certain product that belongs to a category which his friend likes, then we recommend his friend this product. And we also rank those recommendations by the rating of the product and by how helpful those votes actually are, or how many votes a certain rating has gotten.
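The recommendation query behind this step could look roughly like the sketch below; again, all labels, relationship types and property names are assumptions, not the demo's actual schema (the IS edge is the one created in the merge step above):

```scala
// Hedged sketch of the recommendation query described above.
val recommendationQuery =
  """MATCH (person:Person)-[:IN_SAME_CIRCLE]-(friend:Person),
    |      (friend)-[:HAS_INTEREST]->(category:Category),
    |      (person)-[:IS]->(customer:Customer)-[:BOUGHT]->(product:Product),
    |      (product)-[:BELONGS_TO]->(category)
    |RETURN friend.name, friend.email, product.title AS recommendation, product.rating
    |ORDER BY product.rating DESC""".stripMargin
```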
Yes, and then we execute this and can actually visualize it in a table. This takes a while running locally on my laptop right now; it's a graph with a couple of thousand nodes and some tens or hundreds of thousands of edges. And as I said, we started with two raw input data sets, we combined them using some ETL capabilities, and then, using the same language, that is Cypher, we could also do an analysis of the data. Here we are: we now have a table with recommendations, and what we see is that we could recommend to Billy Anderson, with a certain email address, that he should buy the book The Life of Pi. Yeah, that's all for the demo and also for our talk. As this slide says, we thank you for listening and we welcome you all to check out our project. It's on GitHub, as we said, open source licensed, and you can use it. It's still in alpha, so we are constantly changing stuff; don't rely too much on certain details. We are finalizing the API right now, so that part is somewhat stable already. Not only if you're interested in Cypher for Apache Spark, but also if you're interested in implementing Cypher on another system, check out our project. We try to structure the project in a way that makes it easy to reuse parts: depending on how your back end is structured, if your back end is also relational, then you can reuse quite a lot of our project, all the schema translations, all the header calculations; or if you just want to build your own in-memory database, then you can just use the front end and the intermediate representation, and then you're already quite a few steps ahead of having to do all the implementation on your own. So whether you're interested in Spark or want to implement Cypher yourself, check out our project, and contact us if you have any questions. Thank you. Please proceed with the questions; we have five minutes for questions. Any questions for Max or Martin? Okay, if there are no questions... ah, this one, what do you say? So the question was why Scala, because CAPS is written in Scala. Yeah, Spark is easily accessible from Scala; it's also possible with Java, but it's just nicer with Scala, and the rest of Cypher on the Neo4j side is also written in Scala, so it just makes our life a whole lot easier. Is that a new version of Zeppelin that you're using, or do you use an extension to Zeppelin? The version of Zeppelin is a preview version, 0.8.0-SNAPSHOT or something like that, just because it has this nice network, or graph, visualization. On our GitHub project there's a wiki link on how you can integrate CAPS with Zeppelin, and we also explain in some short detail how you can build the snapshot version, but you can also use it with the current stable release. By the way, the Zeppelin support for graphs and Neo4j was done by one of our community members in Italy, and as soon as Zeppelin 0.8 is out you can all use graphs with Zeppelin, which is really cool. Have you planned, or do you know, how to efficiently distribute the graph on the Spark cluster? The question was whether we already looked into, I guess what you mean is, graph partitioning, so how to smartly partition the graph over HDFS. It's on our roadmap for the near future, so it's part of our performance tuning, of course, and this is mostly related to performance: having a good partitioning of the input graph and also recomputing the partitioning during query execution. So that's on the roadmap, but currently we only rely on what Spark gives us by default, which is, I guess, hash partitioning. So the openCypher implementation for this is in Scala; does that mean that if you want to write your own openCypher implementation to interface with your own source, there are only libraries and tools available for Scala, like the parser, or are there other languages? I can take this one, maybe. So the question was whether the tooling available for writing your own Cypher implementation is only available in Scala. That's not true: the TCK is language agnostic, so the set of test suites that your implementation needs to conform to, to be a good implementation, is completely language agnostic. Also the grammar is available both as an EBNF grammar and as an ANTLR grammar, so you could easily generate a parser for all kinds of languages. But the layer beyond that is currently only available in Scala as an open source implementation. However, I know of one more open source implementation, or an implementation that's going to be open source, that's coming out, and it's actually not from Neo4j, so watch this space; I think more things are going to happen. So that's the other side of it, but you get some starting tooling from us, so you could do it if you wanted to. Thank you so much for your time.