Good morning, everybody. Wonderful day, first talk of the graph processing devroom here at FOSDEM. The title of the talk is "openCypher: An Evolving Query Language for Property Graphs". My name is Stefan Plantikow. I'm the lead of the openCypher language group at Neo4j. That is the group at Neo4j that predominantly works on the design of the Cypher language. And it's not just me, it's also my colleagues Petra Selmer, Andrés Taylor, Tobias Lindaaker, Mats Rydberg and Alastair Green. I'd like to talk about Cypher today. This is, I guess, a pretty diverse crowd, since the topic of the devroom is graph processing in general, so I thought I'd give a quick introduction to property graphs and Cypher. Then I'd like to talk a little bit about what we're doing in the openCypher project in order to establish Cypher as a standard for graph querying. And then the meat of the talk will be a bit of a sneak preview. I won't be able to go deep into all the topics, but I want to talk about some of the things that we have planned for Cypher 10, the next major edition of the Cypher language. And of course some closing words on how you could get involved if you're interested, if you want to try your hand at language design, if you're working on a graph database product, etc.
So, property graphs with Cypher. Who here knows what property graphs are? Okay. Who works with property graphs? Still reasonable. Okay. Who knows what Cypher is? Okay. Good. So it's good that I have this section. Property graphs. Well, data, right? How do we model it? People have been doing that for a long time in 15 different ways. Everybody knows SQL and tables, and maybe some here know RDF as a dominant way of modeling. The one that we are interested in for Cypher is property graphs, which is the graph data model that was, I think, born out of the observations of application engineers. Right?
So if you're writing an application, in an enterprise setting or some other setting, and you want to model your data, and it feels natural for you to model your data as a graph, and in many cases that is just a natural way to model your data, then this data model is for you. So what do we have here? We have two people in the real world and they're friends. Hopefully that happens by the end of FOSDEM for some of you in here. So let's capture some of that. Let's say one of them is a person. The other is also a person, and also a director, maybe, of a movie. And then we could capture all kinds of information about them. So one of them maybe is called Ed Jones and he's 37 and his favorite color is blue. And the other is called Peter Fry and he has a nickname, and maybe we even have information about his biography or something like that, right? And that's great. And then they're friends, but they're not just friends, they have a really good friendship. It's going strong. And they've been friends since 2003, right? And just the way it pops up here on the screen. Can you see that, by the way, in the back? Is that visible? I'm not sure if I can do it. Is there a way to broadcast the link to the slides maybe, Michael? Could you look into that? Cool, thanks. That's what I can do, sorry.
So the way it pops up here, right? If you were trying to model that situation, that's maybe what you would put on a whiteboard. It probably wouldn't look as pretty on a whiteboard, but generally that's what you would do, right? And that's exactly what you do in the property graph model. In the property graph model, this is a node, right? A person. It has labels that kind of classify it, like being a person or a director. And this thing here in the middle is called a relationship. They're friends, right? And then of course both nodes and relationships can have properties, and that is the property graph model. So how do you query it?
Well, we invented this language called Cypher at Neo. The property graph model is very visual, and we tried to keep that flavor; that was really key, I think, in the design of Cypher. So the way you query graphs is by essentially writing, in ASCII art, tiny little graphs of the data that you're interested in. Then you get all the instances in the database that match that ASCII art pattern, and that's the result that you get. And here we see a bit of an example of Cypher. You see my mouse, I guess. Here we have MATCH. I'm looking for a person, bound to a variable called me, with a FRIEND relationship to some other friend over here. Could be a person, could be something else, actually. Maybe I am also friends with an animal or so. And then I'm saying WHERE me.name is Frank Black. So I can filter with predicates, right? And I can say, okay, I'm looking for friends that are older than me. That's nice. So that finds me little pieces of graph. And then, of course, in the end, for my application, I may want to have a tabular projection, so I can say something like here: okay, give me the name of my friend in uppercase as name. And maybe they have a title and I want the title as well. This is kind of the style of querying in Cypher.
And then, of course, it's nice to get data out of your database, but you also want to get the data in. So we have an update language that also uses the pattern syntax, which makes the whole language very visual. So you say CREATE (you:Person). That means you have just created a new node labeled Person, and then we can maybe record the name and say something like SET you.name. Here's some name, Aaron Fletcher. I stole this slide. And then we can, of course, create a relationship. I say, okay, I want to connect this guy to me. In this case, me here is this Frank Black from above, and now they're friends, right?
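[Editor's sketch] To make the matching step concrete, here is a minimal Python model of what the query on the slide is described as doing: find FRIEND relationships starting at the person named Frank Black, keep the friends who are older, and project a small table. This is not how any Cypher engine works internally. The graph contents (Frank Black's age, the title, who is friends with whom) are made-up assumptions for illustration, and the Cypher text in the docstring is a reconstruction from the spoken description, not a copy of the slide.

```python
# A tiny in-memory property graph: nodes and relationships both carry properties.
# All data below is illustrative; only Ed Jones being 37 comes from the talk.
nodes = {
    1: {"labels": {"Person"}, "props": {"name": "Frank Black", "age": 35}},
    2: {"labels": {"Person"}, "props": {"name": "Ed Jones", "age": 37, "title": "Dr."}},
    3: {"labels": {"Person", "Director"}, "props": {"name": "Peter Fry", "age": 29}},
}
rels = [
    {"type": "FRIEND", "start": 1, "end": 2, "props": {"since": 2003}},
    {"type": "FRIEND", "start": 1, "end": 3, "props": {}},
]


def older_friends(name):
    """Roughly corresponds to (reconstructed from the spoken description):

        MATCH (me:Person)-[:FRIEND]->(friend)
        WHERE me.name = $name AND friend.age > me.age
        RETURN toUpper(friend.name) AS name, friend.title AS title
    """
    rows = []
    for rel in rels:
        if rel["type"] != "FRIEND":
            continue
        me, friend = nodes[rel["start"]], nodes[rel["end"]]
        # The pattern requires the start node to be a Person with the given name.
        if "Person" not in me["labels"] or me["props"].get("name") != name:
            continue
        # The WHERE predicate: only friends older than me.
        if friend["props"]["age"] > me["props"]["age"]:
            rows.append({
                "name": friend["props"]["name"].upper(),
                "title": friend["props"].get("title"),  # None if absent
            })
    return rows


result = older_friends("Frank Black")
print(result)
```

Each match of the pattern becomes one row of the result, which is the "tabular projection" mentioned above.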
And then the last point to highlight here: Cypher is already, to a degree, a compositional language, in the sense that you can not only write queries like "find me something and then return it", but you can find something and then do other projections, do aggregations, and kind of create these linear pipelines for processing your data until you actually have the result that you want. Here, in this case, we're collecting friends together with the number of enemies they have, if you had such a graph, right? And the highlight here is a bit how we do aggregation in Cypher.
Yeah, there are all kinds of patterns. Patterns are the central concept, right? So we have node patterns for saying, I'm talking about this node in the graph, and then you can specify the labels or what properties you expect. You have rigid relationship patterns. Rigid here means that we're only talking about a single relationship, not many, right? So you can say, okay, it must be outgoing, it must have this relationship type, it must have these properties, or maybe I don't care about the direction, all kinds of variations. It's very visual, right? And then there are also variable length relationship patterns. That's when you're looking for sequences of arbitrary length, right? When you're saying, okay, someone over there and someone over there are connected, I don't care through how many steps, that kind of thing. And Cypher also supports paths as first-class values. So you can not only find persons and their relationships in the graph, but also longer chains, and treat them as values that you return to the application, right? That's a little primer on Cypher.
So what is the openCypher project? And again, who here is aware of the openCypher project as opposed to just Cypher? Oh, okay, some, good, good. So openCypher is a community effort to evolve Cypher and make it the de facto language for querying property graphs. And there are a couple of openCypher implementations.
I would expect that the best known is still the implementation by Neo4j, of course, but there's an implementation by SAP, there's an implementation by Redis, there's an implementation on top of Postgres by a Korean company, called AgensGraph. There's a Polish implementation about to be released. There's Cypher for Apache Spark. There's another thing I'm going to talk about in a minute. So there are a couple of implementations already. And what is this community? This community is the people who take part in either the calls or the meetings of the so-called openCypher Implementers Group. That's a group of vendors, researchers, implementers, interested parties, very diverse. It evolves Cypher through an open process, right? Last year we had something like three meetings where we present, discuss, and agree upon new features, and in this way things are driven forward. We have a website, yay, which actually has grown quite a lot last year. There are quite a few things you can find there. There's a blog, where you have some overview of new features or features under consideration. It has all the information on meetings. If you really want to dig deeper into all of this, it has upcoming meetings, recordings, slides, references, links to papers. We try to maintain a bibliography for the property graph space there, right? Not just for Cypher. And it has a bunch of artifacts, which are the actual outcomes that we publish to help people adopt Cypher, beyond what I already listed here. Right now these are the Cypher 9 specification, which just went online last Friday. We have ANTLR and EBNF grammars for the language. We have formal semantics. We have a test suite. There's a style guide, so how should you write Cypher, those kinds of things, right? And then on the associated GitHub repository we also host implementations of Cypher. That's going to be three. Right now it's one.
Namely, it's Cypher for Apache Spark. You'll learn more about that in the next talk. Soon there's going to be Cypher for Gremlin. I can't talk too much about this at FOSDEM, but stay tuned, that's going to happen very soon. There's also going to be the open source front-end, which currently still lives in the Neo4j repository, but is more liberally licensed than Neo4j itself, and it's going to be here very soon. Two of these artifacts I'd like to highlight. We just got a notification that our SIGMOD paper on the formal semantics of Cypher has been accepted, so there's now a proper mathematical definition of the semantics of Cypher, which is very important for implementers and also for proper academic treatment of the language. We're going to see how we do it; probably not the paper itself, but an excerpt of the paper is going to be published on opencypher.org. And the other thing that I'd like to highlight here is the TCK. We have an ever-growing and very extensive suite of Cucumber tests, which helps you: if you want to build an implementation of Cypher, you can use that test suite to make sure that your implementation covers the features. Yes, I'm going to skip a bit, as I'm running a bit short on time now.
Okay, so much for openCypher. Cypher 9 is what we have. That's essentially what you find in Neo4j today, minus some features that we think are too implementation-specific to be put into the standard, plus some clarifications, plus some other specification documents. But we want to move Cypher to the next level, and for us that is called Cypher 10. We're working towards that, and we have actually begun to actively work on a natural language specification for it, in the hope of having that out this year as well. It will cover new features: more extensive subqueries, support for working with multiple graphs in one query, path patterns. It's a bit unclear if that will be part of Cypher 10 or be treated as a separate specification.
And configurable pattern matching semantics as well. I want to talk a little bit about those things in this talk, briefly. The focus today is concepts, not syntax, okay? Because the syntax of a lot of these things is still a bit in flux. If you want to be involved in that, then, yeah, join the openCypher Implementers Group or join the discussions on GitHub. So here it's about concepts. Going over the slides again yesterday, I was thinking about what the underlying theme actually is. The underlying theme is really query composition. And what is query composition? Well, what is composition in general? The meaning of the whole is determined by the meaning of its constituents and the rules used to combine them. So what does that mean? You try to organize your query into multiple parts so that you can extract parts of your query and put them somewhere else. That gives you a lot of flexibility. It means you can reuse pieces of your code. And it also allows you to build more complex queries, more complex data processing workflows, programmatically as well.
So, subqueries. Subqueries are actually an old idea, right? Subqueries are self-contained queries. They run somehow within the scope of an outer context. That could be a clause, so at the level of something like a SELECT statement in SQL, or it could be within an expression. Cypher already has some forms of subqueries, but we want to generalize this and make it better than what we have today. And why is it a good idea to have subqueries? Well, it makes it easier to construct, maintain and read queries. And as I already said, subqueries enable composition; they enable post-processing of results. So the types of subqueries that we want to look at are nested subqueries that basically run any complete query. Today you can have two queries, get two results, and do the union. But you cannot go on after the union, right?
And one of the main motivations for adding nested subqueries is what, on GitHub, is one of the most highly upvoted issues: post-union processing, right? The ability to do a union and then, on top of the union, do sorting and slicing. And here we see a bit of syntax, right? Curly brace, then we have a full Cypher query with a UNION in the middle, curly brace, and then we can go on, right? And say, okay, I have the union of authored tweets and favorited tweets, and now I want to filter those, and I'm only interested in the ones where the country of the tweet is Sweden, and then I sort them by time or something like that, right? That's going to be enabled by this.
So let's talk a bit about the other kinds of subqueries, because they come in three flavors. The first one is an existential subquery, which you see quite a lot in graph querying. You say, okay, find me all instances of the pattern, but make sure that this other thing also exists. I don't really need it back, and I certainly don't need all of it back. I just need to make sure that something like this pattern also exists to the side. So to give an example here: find me all the actors where... something is off here in the example. Yes, it's okay. Find me all the actors that played in a movie with someone that had the same name as them, right? So all the actors for which there exists a movie that they acted in, right? With some other guy, such that the other guy and himself had the same name, right? Other subqueries are scalar subqueries. That's when you just want to use the query to get a value at the expression level. So in this case, we find a director of a movie, and then for that director we want the minimum age of that guy, and that we do with a scalar subquery. And there's a similar form where we're just retrieving a list of matches. So that's subqueries.
Let's talk about another big topic for Cypher 10, multiple graphs.
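[Editor's sketch] Going back to post-union processing for a moment: the idea is that the union is computed as an inner step and the outer query then keeps filtering and sorting its result. The Python below is only a conceptual model under assumed data. The two input tables, the field names `text`, `country` and `time`, and the helper `union` are hypothetical, and no proposed Cypher 10 syntax is implied.

```python
# Two hypothetical result tables, standing in for "authored" and "favorited" tweets.
authored = [
    {"text": "hello", "country": "Sweden", "time": 3},
    {"text": "graphs", "country": "Belgium", "time": 1},
]
favorited = [
    {"text": "cypher", "country": "Sweden", "time": 2},
    {"text": "hello", "country": "Sweden", "time": 3},  # duplicate of an authored row
]


def union(*results):
    """Model of UNION: concatenate the results and drop duplicate rows."""
    seen, out = set(), []
    for rows in results:
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                out.append(row)
    return out


# Inner step (the part inside the curly braces): compute the union first ...
combined = union(authored, favorited)

# ... then the outer query carries on: filter by country, sort by time.
swedish = sorted(
    (row for row in combined if row["country"] == "Sweden"),
    key=lambda row: row["time"],
)
texts = [row["text"] for row in swedish]
print(texts)
```

The point of the design is that `combined` is just another table, so anything that can follow a normal match can also follow a union.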
So what people usually do is take an input graph, select something, and then return a tabular projection. But what if, instead of returning a tabular projection from my graph, I actually want to return a subgraph and then maybe do something with that subgraph, right? Instead of extracting a table. That's the whole idea. And there are a lot of reasons for wanting to work with multiple graphs. One is this use case I just outlined. Others are combining and transforming graphs from multiple sources, versioning, snapshotting, computing difference graphs, so maybe you have yesterday's data and today's data and want to know what the difference is, those kinds of things, right? You might use graph views for access control. It might be helpful for shaping and integrating heterogeneous data. So it's a real feature. There are a lot of interesting topics in multiple graphs. And graph views also give an interesting... damn it, I shouldn't have done that. Sorry. They give an interesting mechanism for extracting data, for saying, okay, I have a graph with a lot of detail and now I'm trying to build up a graph that contains an aggregation or a summary, essentially, of my dataset. So we want that in Cypher. But of course, what we have today is usually a graph database system and a client that talks to it, and it talks to a single graph, right? Where we want to move is a system that looks more like this, with a multiple graphs model, where you have an application server with a client that talks to the graph database system, and the graph database system now has a catalog of graphs, right? Many, many graphs, separate from each other, and you can query them, you can combine data from them, you can build new ones, right? That's kind of the goal. And what does it mean for a language like Cypher to do that?
Well, you need to pass both multiple graphs and tabular data into a query, you need to return both multiple graphs and tabular data from a query, and of course you need some means of saying which graph you're talking about: am I matching data from here or am I matching data from there, right? And of course, and that's the big topic, I think, you need to be able to construct new graphs from existing graphs, right? And how do you do that? Well, here's a little slide to show an analogy. If you think about how pattern matching works, you write out an ASCII-art graph pattern of the kind of things you're looking for, right? And then you find all the potential matches against your data, basically all the actors and the directors that worked together on the same movie or something like that. You get all the possible matches, and you can think of them first as matches. And then of course, in a language like Cypher that gives you a tabular projection, these matches are turned into records, into tabular data. And if you think about all the records together, that's a table, essentially, at the client side. Well, if matching means going from a graph to a table, then graph construction obviously could mean going from a table to a graph, right? And that's the idea we're pursuing here. So assume you have a table of, let's say, a node, a relationship and a node: A, R, B. And you want to turn it into a graph. Well, you have to describe how to take each row and bundle it up again into a single small graph, and then if you take all these single graphs together, you have one new big graph, right? This is the style of processing we want to explore here for Cypher. And we'll see some of that in the next talk as well. So I'm not going to show too much syntax. Sorry, I need to wait for this to go away.
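[Editor's sketch] The analogy above, matching as "graph to table" and construction as "table to graph", can be modelled in a few lines of Python. This is a conceptual sketch only: the function names `match_pairs` and `construct`, the node and relationship data, and the representation of a graph as a pair of dictionaries are all assumptions made for illustration, not part of any openCypher proposal.

```python
def match_pairs(rels):
    """Matching: graph -> table. One record (a, r, b) per relationship."""
    return [{"a": rel["start"], "r": rel_id, "b": rel["end"]}
            for rel_id, rel in rels.items()]


def construct(table, nodes, rels):
    """Construction: table -> graph.

    Each row describes one tiny graph (two nodes and the relationship
    between them); the result is the union of all those tiny graphs.
    """
    new_nodes, new_rels = {}, {}
    for row in table:
        for node_id in (row["a"], row["b"]):
            new_nodes[node_id] = nodes[node_id]
        new_rels[row["r"]] = rels[row["r"]]
    return new_nodes, new_rels


# Hypothetical source graph.
nodes = {1: {"name": "Ed"}, 2: {"name": "Peter"}, 3: {"name": "Frank"}, 4: {"name": "Aaron"}}
rels = {
    10: {"type": "FRIEND", "start": 1, "end": 2},
    11: {"type": "FRIEND", "start": 2, "end": 3},
    12: {"type": "ENEMY", "start": 3, "end": 4},
}

# Graph -> table, keeping only the FRIEND rows ...
table = [row for row in match_pairs(rels) if rels[row["r"]]["type"] == "FRIEND"]
# ... and table -> graph: a new "friends only" subgraph.
sub_nodes, sub_rels = construct(table, nodes, rels)
print(sorted(sub_nodes), sorted(sub_rels))
```

Node 2 appears in two rows but only once in the constructed graph, which is the "bundle the rows up again" step described in the talk.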
But the upshot is that we want to move to a model where you have a query language that can take data from multiple graphs, that can take tabular data as input, and then produce, as part of its processing, new graphs and tabular data. And we want it to be compositional, so not just in one step but in multiple steps, so that you can really build complex data processing pipelines. A lot of things are still a bit up in the air. I think a lot of this will become clearer in the first half of 2018. Graphs are addressed using catalog names, or maybe also through some kind of graph URI, right? There are a lot of extensions planned here, like set operations, subqueries over multiple graphs, managing graph persistence, creating views. I don't have the time to double-click on all of that. But if you're interested in those topics, there's really a lot of material on opencypher.org to look a bit deeper, or, of course, get involved in the discussions around this on opencypher.org or via Twitter. And a first prototype of how this could all come together, as I already said, will be shown by Martin and Max in the talk on Cypher for Apache Spark. That's coming up next.
Okay, let's talk about something else. Let's talk about path patterns. So, the meat of graph querying is patterns, right? As I already said, you can find things that are connected. But how can you describe the ways in which they are connected, how much expressivity do you have there? Right now in Cypher, you can say, okay, connected via a relationship, or connected with a relationship of this type in five steps, or in any number of steps, and shortest paths. Roughly, that's what we can do, right? And we want to take that to the next level. For the next level, there's actually a lot of research about regular path patterns. There's a language called GXPath from academia that has explored this in great depth. And we want to add these features to Cypher.
And what that will give us is the capability to describe essentially the chain of labels along the path that you're matching, in the same way that you would match a piece of text using a regular expression. Roughly, right? And there's also the desire to combine this with costs. So, just to give a brief example here, this is the so-called Bacon path, right? You have an actor who acted in a movie, outgoing, no, yeah, outgoing, and then incoming to a node called bacon. The actor's name we take from a parameter when we run the query. And there's an end missing here, sorry. And of course, the name of the node called bacon over there is Kevin Bacon. But what we're looking for is essentially these chains, right? We always have an actor that goes out to a movie, and then it goes back again to another actor. But we want to repeat this. So we have this pattern of outgoing, movie, incoming, actor, right? This little piece of two relationships with the movie in the middle, we want to have that repeated multiple times. We can express this through this language here, where we have the little plus, similar to how it works in regex. So you can say, okay, repeat this pattern for me, right? And there's a lot of syntax around this. There's a very long CIP by my colleague Tobias Lindaaker. A CIP is our language change proposal. You can find that on GitHub. There's a lot of material on opencypher.org about this, about all the things you can do there. But essentially, it's very close to regex. As I already said, you can sequence. You can do alternatives. You can do grouping. You can do transitive closure. You can also do relationships that you're traversing. And that gives a lot of added power in the kinds of patterns you can match in your graph.
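[Editor's sketch] The repeated "actor out to movie, movie back in to another actor" step can be modelled as a search that applies that two-relationship step one or more times, like the `+` in a regular expression. The Python below is a sketch under assumptions: the cast data is invented, the names `co_actors` and `bacon_number` are hypothetical, and breadth-first search is just one straightforward way to find the smallest number of repetitions, not the mechanism proposed in the CIP.

```python
from collections import deque

# Hypothetical data: actor -> set of movies they acted in.
acted_in = {
    "Alice": {"Movie A"},
    "Bob": {"Movie A", "Movie B"},
    "Kevin Bacon": {"Movie B"},
    "Carol": {"Movie C"},
}


def co_actors(actor):
    """One repetition of the step: out to a movie, back in to another actor."""
    return {other for other, movies in acted_in.items()
            if other != actor and movies & acted_in[actor]}


def bacon_number(start, target="Kevin Bacon"):
    """Smallest number of repetitions of the step (at least one) that links
    start to target, or None if no chain exists."""
    queue = deque([(start, 0)])
    seen = set()
    while queue:
        actor, steps = queue.popleft()
        for other in sorted(co_actors(actor)):
            if other == target:
                return steps + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, steps + 1))
    return None


print(bacon_number("Alice"), bacon_number("Bob"), bacon_number("Carol"))
```

Sequencing corresponds to chaining the two hops inside `co_actors`, and the repetition corresponds to the loop in `bacon_number`.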
We're also looking into a thing called a path pattern, which basically allows you to define some kind of macro, so that you can define such a pattern once and then reuse it in your query multiple times, and also nest those within each other. I'm not going to show that in depth now due to time. Last but not least, another thing we're looking into here is to add a notion of a cost function on top of that, so that with these defined path patterns you can associate a way to calculate cost. That is intended to be a means for allowing weighted shortest paths and similar things in the language going forward. Yeah, that's path patterns.
That gets me to the last topic: configurable pattern and path matching semantics. It gets a little technical now, but it's necessary. So the way patterns are matched today in Cypher: if you have a chain, I'm going to read it for those in the back, from A via R1 to a node called B, via R2 to a node called C. What it finds you is all the possible nodes for A, all the possible nodes for B, all the possible nodes for C, all the possible relationships for R1 that connect A and B, all the possible relationships for R2 that connect B and C, but, and here comes the catch, R1 and R2 must be different. That is called uniqueness, and we've had that in Cypher from basically day one. The reason we do that is this: if you have a variable length relationship pattern, where you can bind arbitrarily long chains, and we didn't have the uniqueness criterion, variable length patterns would give us infinite result sets. So we need some kind of bound on the size of the result sets. That's why we have Cyphermorphism. It works reasonably well in practice, but some people have qualms with it. There's some critique from the theoretical community, because certain variable length patterns under different semantics are not tractable from a complexity point of view. There's a lot of discussion on this.
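[Editor's sketch] The effect of the uniqueness rule is easy to see on a two-relationship chain. The Python below enumerates matches for an undirected pattern from A via R1 to B via R2 to C, once allowing R1 and R2 to be the same relationship and once requiring them to differ. The graph and the function name `two_hop_matches` are assumptions for illustration; the sketch only models the rule as described in the talk.

```python
from itertools import product

# Hypothetical graph: relationships as (id, start, end).
rels = [("r1", "a", "b"), ("r2", "b", "c")]


def two_hop_matches(rels, unique_rels):
    """Match (x)-[p]-(y)-[q]-(z), ignoring direction.

    With unique_rels=True, p and q must be different relationships,
    which is the uniqueness rule described above.
    """
    matches = []
    for (p, p_start, p_end), (q, q_start, q_end) in product(rels, repeat=2):
        if unique_rels and p == q:
            continue
        # Each relationship can be walked in either direction.
        for x, y in ((p_start, p_end), (p_end, p_start)):
            for y2, z in ((q_start, q_end), (q_end, q_start)):
                if y == y2:
                    matches.append((x, p, y, q, z))
    return matches


without_uniqueness = two_hop_matches(rels, unique_rels=False)
with_uniqueness = two_hop_matches(rels, unique_rels=True)
print(len(without_uniqueness), len(with_uniqueness))
```

Without the rule, matches such as a, r1, b, r1, a walk back over the same relationship. With a variable length pattern that back-and-forth could be repeated indefinitely, which is the unbounded result set the uniqueness rule is there to prevent.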
So there are these different classes of matching semantics called homomorphism, Cyphermorphism, and node isomorphism. Our stance on that is that we want to change the language so that users have a choice of the pattern matching semantics that they use. There will essentially be a way of saying: match with homomorphism, match with this morphism, match with that morphism. There are other variations of pattern matching semantics, like saying I want all the matches, or I just want any match. Sometimes that's desired. It's very similar to an existential subquery. And then there's a related topic as well. The whole topic of morphism is repeated, because you can ask the same kind of question not just for arbitrarily complex patterns, but also just for linear patterns. And there's a nice symmetry, in that in graph theory you have the notions of walk, trail, and path. It may be a good idea to add that to the language as well, so that you can say, okay, overall I want to match with homomorphism, but this little sequential piece in my pattern, I want that to be a trail, or I want that to be a node path, et cetera. And then, of course, once you have a good means of making this configurable, other forms of semantics become possible, like saying I want to do shortest paths, or I want to do cheapest paths. That uses the new cost function that we are going to introduce as part of path patterns. So, the summary of all that: we will move, I think, in the direction of having configurable matching semantics, so that you will have more flexibility. That will be out of the way of most standard users, but those users who care will have this feature. Not all combinations of the matching semantics are compatible or make sense, so it's actually a bit tricky to get that configuration matrix right. And the end goal of all of that is that we want to give full control to the user here. Okay, so slowly closing down.
If you find all of this interesting and intriguing, I'd really like you to get involved. Follow the news at opencypher.org or via the openCypher Twitter account. There's a great Slack channel for implementers. I think that's also linked from the openCypher website. We run this Slack just for people who want to talk about openCypher. There's going to be a face-to-face meeting in Europe, likely in May. We haven't fixed the date yet, and also not the location, but somewhere in Central Europe. And we'll also schedule new video calls, I think, for 2018. Another way to get involved is, of course, to go to GitHub to look at the discussions, to look at the Cypher Improvement Requests, which are essentially issues, or to look at full proposals, which are called CIPs. Or talk to us, or just create a pull request if you have a good idea for openCypher. And then, yeah, summary. Cypher 10 is the next evolutionary step for querying property graphs. It will enable multiple graphs composition, subquery composition, configurable matching semantics, and potentially also other features like path patterns. Thank you very much.
No, I didn't get the second part, sorry. Yeah, I'm not sure if I got it, okay. So the first part of the question was, how do you control the runtime characteristics? I'm assuming you're referring to performance there, right? Yeah, you don't. Because you achieve that in a different way. You specify the semantics, and then people try to build good implementations, right? That's how it is. I think that's really hard to achieve, right? And it could also be very restricting. Even if you specified some kind of complexity bounds there, it might be restricting for implementers. It's not even sure that that is a good idea, right? And the second part: I mean, we provide a lot of tooling for analyzing queries. You can use the Cypher front-end.
You could use the formal semantics in order to try to better understand what the semantics of a query should be. So we provide some tooling, right? Cool. Other questions? There was one.
[Audience] I have an existing use case where I need to construct a pipeline consisting of different databases, even different kinds of databases, some relational, some not. I'm sort of constructing this myself in code. Could something like implementing an openCypher layer on top of that be the solution?
That could be a solution. I can imagine that. And you really may want to listen to the talk that's coming up next. Well, the question was how to use multiple databases together in one query, and whether putting an openCypher layer on top of that might be a good idea. And I was saying, yes, that's a splendid idea. That's why we're doing something in that direction in the Cypher for Apache Spark project. Okay? Cool.
Otherwise, you can also grab Stefan. He's here the whole day. So if you have any questions or if you want to implement openCypher for your database, then... I want to know you if you want to do that. The next talk? Yes.