Great stuff. Thank you very much. Hello, everyone. My name is Jim Webber. I'm Neo4j's chief scientist. A few minutes ago, you saw the marketing-level view of Neo4j and what it means to be a graph database. In this talk, I'm going to take you right down into the weeds. We're going to talk about some computer science, and we're going to talk about the future of distributed graph databases. A lot of the talks at Big Things Spain, and previously at Big Data Spain, were about uses of data. I'm going to talk to you about how some of that machinery works. So if you enjoyed computer science at university, this should be for you.

Now, I'm very sad that we can't all be there this year. I miss coming to Madrid at this time of year, when it's cold and grey here in the UK. But my colleague helpfully created an artist's impression of how it would look if we were all in the same room. It's not very flattering, is it? But there we go: it would have looked like this if we could have got together in Madrid.

This talk is going to be pretty quick and pretty hard-hitting. I'm going to tell you in a few slides how graphs work and how graph databases work. I'm going to insult you by saying that programming is too easy and that distributed programming is too easy, which of course is not true. And then we're going to talk about some distributed systems theory. In particular, we're going to talk about the trade-off between reliability and availability in distributed systems, and how, knowing that, we might be able to build a distributed version of Neo4j which has good characteristics for scale, good characteristics for safety, and good characteristics for availability. As it happens, all of these things are in competition with each other, so it's going to be a terrible mess. I hope you can stay tuned to the end.

So here's a graph. This graph represents the London Underground. Could you store this graph in Oracle or MongoDB or Cassandra? Yeah, you almost certainly could. But actually querying this graph in those databases would be pretty horrible and painful. If you have a graph database like Neo4j, making sense of this graph is quite natural. For example, suppose you were to land at Heathrow Terminal 5, which is in the bottom-left corner of this graph, and I said to you: please come to the Neo4j office in Southwark, which is in the middle, just south of the River Thames. You, knowing the rules of graphs, you clever humans, could find a route there. You'd probably go along the blue line until you hit the silver line, then come down the silver line into Southwark, and I'd be waiting to meet you at the tube station there. How did you know to do that? How did you know those rules? Well, of course, graphs are very simple: you can go from node to node if and only if those nodes are connected by a relationship. And Neo4j's job is to make sure that we can store these graphs and that you can query them at high speed with good safety properties, so the graph isn't corrupted.

So Neo4j's data model is very simple. We have these nodes, and those nodes can have optional labels that tell you what the node is: a station or an airport or a person or a product. And inside those nodes you can have properties: data like first name, last name, social security number, and so on. And connecting those nodes, you can have named, directed relationships: follows, loves, hates, bought, returned, connects.
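To make that concrete, here's a minimal Cypher sketch of that data model: labeled nodes carrying properties, joined by a named, directed relationship. The station names are from the example above; the relationship type and property names are just illustrative.

```cypher
// Two labeled nodes with properties...
CREATE (t5:Station {name: 'Heathrow Terminal 5'})
CREATE (sw:Station {name: 'Southwark'})
// ...connected by a named, directed relationship, which can carry
// properties of its own (e.g. a travel time for route finding)
CREATE (t5)-[:CONNECTS_TO {travelTimeMinutes: 45}]->(sw)
```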
And those named, directed relationships can also have properties on them. So in the tube map, for example, you might have travel time between the stations; you might have weights for a general route-finding algorithm; and so on. And relationships always have exactly one start node and one end node, and actually they can loop back to themselves. And it turns out that's the whole data model you need to learn. Now, compare that to relational databases: I'm sure many of us read Elmasri and Navathe when we were at university. It's quite a compact data model, one widescreen slide compared to 1,272 pages. Okay, I cheated a little, but even so, the graph databases book is only 209 pages.

So, for example, if we wanted to write a query for a graph, this is a piece of Neo4j's Cypher. It's very straightforward for those of us who have seen SQL, except it's designed to be humane. What we're telling the database here is: hey, database, find me a pattern (that's MATCH) where I've got a node labeled Customer (nodes use round brackets), which has an outgoing relationship typed FRIEND, at depth one or two, to another node labeled Customer, and where the friend and I aren't the same person, because remember, a friend of a friend could be yourself. Send that friend through to the next part of the query, where we look for a basket the friend bought that had a product in it. Return the product and the count of the product, ordered by the count in descending fashion, limited to five. And this super-compact query tells us the five most popular products that our friends and friends-of-friends have bought. It's very, very humane. This is Neo4j's Cypher, of course, but I will point out that Cypher is being standardized by ISO: the same committee that brought us SQL is now bringing us GQL. So you should find that this kind of Cypher-like syntax will be available in many databases in the future.

So that's all about graph databases. That's the humane part; that was just to lure you in. Now comes the tough bit. I'm going to tell you that programming is too easy. So this is real systems programming here. I have a creation of and an assignment to a variable: int i = 10. I have another assignment to that variable: i = 11. And when I do System.out.println(i), I get the value 11, which is exactly what we'd expect, right? Sure. No, I'm kidding. This is real programming: int i = 10, later i = 11, Thread.sleep(2000), the classic debugging technique, System.out.println(i)... 10. Wait, what? Did that just change? 11? 10? What's going on? This would be an awful way to program, if we couldn't understand or reason about the assignment of values to variables. And programming is hard enough already. But if this kind of behavior were presented to us, this true behavior, the way the machines potentially work with out-of-order execution and high degrees of parallelism, it would drive you crazy, right? You would definitely quit and get another job, rather than just threaten to quit and get another job, as many of us do on a daily basis. So in reality, what's happening? There is a collusion, a cooperation, between the code the compiler emits and the functionality in the hardware. And we have these abstractions called memory barriers. A memory barrier enforces that once a value is written, it is that value which is read by subsequent threads.
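In Java, one language-level way to ask for those barriers is the volatile keyword. Here's a minimal sketch, with invented class and field names, of the visibility guarantee just described:

```java
// A toy sketch of memory visibility in Java. Without 'volatile' on
// 'ready', the spinning reader below might never see the writer's update.
class VisibilityDemo {
    static volatile boolean ready = false; // volatile read/write acts as a barrier
    static int value = 10;

    public static void main(String[] args) {
        new Thread(() -> {
            value = 11;
            ready = true; // happens-before: publishes 'value' as well
        }).start();

        while (!ready) { /* spin until the write becomes visible */ }
        System.out.println(value); // guaranteed to print 11
    }
}
```

The point being that the barrier, not luck or a Thread.sleep(2000), is what makes the read deterministic.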
Do we notice these things day to day, writing Java or Python or whatever? No, absolutely not. Even though the hardware is a crazily concurrent little network of computers with lots of cores, I don't notice this when I'm writing a single-threaded app. I reason about my program in a very sequential manner. So in this case, because of the availability of memory barriers, we get the output we expect, which is 11.

So obviously, that should mean that distributed systems are easy, right? We can take the same kinds of abstractions that our multi-core, multi-processor systems have, and we can just scale them up. Cool. So here's an app. I think we've all either used or built apps like this, where you've found an item you'd like to buy, and then you need to log in to go through the checkout process. Well, I don't have an account, so I'm going to create one: username jimw, password super-top-secret, can't share that with you. So that's super. And now that I've got an account, I can log in. So I type in jimw and the password I just created. And then, of course, what do I get? No account found. Try again. That is just insane. I've just created an account, I'm trying to give you money, and now I'm getting this kind of awful, awful user experience. Now, I would like to blame the user experience designers, but I have a suspicion it's probably not their fault. Because if I put in the distributed systems debugging statement, Thread.sleep(2000), then: login successful. Which in many ways is actually worse, because now I don't even know why it worked. And it really does make you cry out in anguish as to why this sometimes works and sometimes doesn't.

The reason is that we don't really have the equivalent of memory barriers in distributed systems, because they can be slow, they can be hard to build, they can be hard to maintain. And anyway, not bothering with memory barriers, doing a refresh or a Thread.sleep(2000), has a cool name: we call it eventual consistency. And because the real world is eventual, it all kind of works out, right? Maybe. I would like to introduce you to some old magic: Fischer, Lynch, and Paterson, or FLP as it's known in the computer science literature. A long time ago, Fischer, Lynch, and Paterson showed that, in a sense, distributed systems don't work. The short version is this: you can never guarantee consensus in an asynchronous distributed system once even a single process can fail. In practice, things don't go wrong all that often, so there's a useful window of opportunity where things do work. But in the pathological case, distributed systems fail, and there's nothing you can do about it. No amount of soul-searching about the CAP theorem can get you out of this.

For those of us building distributed systems, it means there's a trade-off between availability and reliability. Availability is the property of a system that it can respond to an event, that it can process a read or a write, even when faults occur. Reliability is the property of a system that it produces the correct result in response to events, even when faults occur. And there's a trade-off here: you can't achieve both. That's saddening, because we'd like to have highly available systems that always tell the truth. But in the pathological case we can have systems which are available or systems which are truthful, never both. Thanks a lot, computer science.
So in practice you're damned if you do and damned if you don't. Systems that require coordination, those systems that want to be reliable, will block when enough failures occur, and you'll pay coordination costs all the time to keep things correct. Systems that don't use coordination, these kinds of highly scalable databases, will be wrong in some cases: they will serve you stale data, in failure and non-failure cases alike. So what we really want to do is try to maximize both of these to some reasonably practical level. We can make availability easier by potentially losing writes and serving stale or inconsistent reads. That might be okay for some domains, but it does compromise reliability. We could make reliability better by occasionally coordinating, occasionally ensuring correctness with ordering or consistency. But of course coordination costs, and it does reduce availability. Now, some folks like Google manage to do this at scale with clever hacks and clever workarounds like TrueTime, but few others manage to do this well.

So how might we build a distributed Neo4j, where the graph you want to store is so huge, exabytes huge, that you must spread it over multiple servers? What I'm going to tell you about now is the result of several years of research into doing this. The version of Neo4j you can download today (in fact, we released Neo4j 4.2 just a few hours ago, so please do go and download it if you're into graphs) uses coordination. It uses Raft to safely commit transactions across multiple cooperating servers. That means it's very safe. It also means it's relatively expensive, because every transaction has to be acknowledged by a majority of machines.

Now, Raft is lovely, if you're interested in distributed systems. Raft is in the same family as Paxos. It's a leader-oriented consensus algorithm, which means it gets computers to agree on something. And it's non-blocking, because if the leader dies, another machine in your network can pick up where the leader left off and continue. So it's good for continuously available systems. What's really nice about Raft is that it was designed to be understood by normal humans, something Paxos sadly didn't pay much attention to. The way it works here is that I've got a leader, in this case server one, which proposes values to the other servers, which we call the followers. Once a majority of the machines have agreed, in this case three out of five, the leader considers the proposed value committed. And so it goes on. What's lovely about this is that Raft assumes a log; in fact, Raft is logs tied together. And Raft has this lovely inductive property: if the ith element of any two logs agrees, then the (i-1)th element, all the way back to zero, implicitly agrees. So it's a lovely, friendly protocol. It keeps logs tied together. Those logs contain the data the user wants to store, as well as information about cluster membership, so you've always got an authoritative view of your cluster topology. Entries, that is values, are appended and committed if a simple majority agrees. Now, that's not necessarily the most efficient possible thing to do, but it's really easy to reason about. The implication is that a majority of machines have the same log as proposed. And at any point, anyone can call an election; the term acts as a logical clock, and the highest term will win.
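The ordering used to pick a winner, which I'll spell out in a second (highest term, then highest committed index, then highest appended index), can be sketched as a simple comparator. The field names here are invented; this is a sketch of the rule as described in the talk, not Neo4j's implementation:

```java
import java.util.Comparator;

// The candidate state that matters for an election, per the rule above.
record ServerState(long term, long committedIndex, long appendedIndex) {}

class ElectionOrder {
    // Highest term wins; ties fall to highest committed, then highest appended.
    static final Comparator<ServerState> ELECTION_WINNER =
            Comparator.comparingLong(ServerState::term)
                      .thenComparingLong(ServerState::committedIndex)
                      .thenComparingLong(ServerState::appendedIndex);
    // The server that is greatest under this ordering wins the election.
}
```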
So the server with the highest term wins: that's the biggest term number first, followed by highest committed, followed by highest appended. And this is fine. Raft simplifies everything by majority decision. It's not optimal, but it's really easy to reason about, so I love it. Of course, one of the funny things about Raft, even though it's a protocol for strong consistency, is that because of physics, cluster members are always slightly ahead of or behind each other. So from a God's-eye point of view this is technically inconsistent, but it's also normal behavior. And we can deal with it; we can actually take this kind of behavior into account.

So take that app we saw earlier. I create an account: my iPhone or my mobile device sends a message via REST or something to an app server; the app server has a Neo4j driver in it, which sends a message over to the database; and we create a user. And then when I try to log in with that user: oh, no account found. That's a real Jackie Chan moment again. That's really surprising, right? Because I've just spent the last few minutes talking about Raft and how wonderful it is, and yet we still got the Jackie Chan thing. That's because Raft doesn't impose a causal boundary. There's nothing to stop a client binding to a server that's behind the leader. That's okay; it's perfectly normal in Raft and, as I mentioned, perfectly normal in distributed systems.

In Neo4j we have the notion of a bookmark for this. A bookmark is just a token that lives in your session. It happens to be an opaque string, and it represents the last transaction that the client saw. So every time the client wants to interact with the database, it can provide this token and say: hey, database, process my query or update only if your view of the world is the same as my view of the world. If you've ever used ETags in HTTP, it's rather like that: a very simple optimistic concurrency control mechanism. Using it is absolutely simple: when you've done your interaction with the database, you just ask the session object for the last bookmark. Now, when I create an account, I flow those bookmarks through the system. On the bottom path here, when I log in with the user I just created, I provide the bookmark, and the login query won't run until, in this case, the bottom server has received transaction 11. It's trivial to use: we just pass the bookmark into the begin-transaction method of the session (there's a sketch of this in code below). And then we get this thing. This is lovely, because now the database acts like a variable. There's none of this Thread.sleep(2000) stuff going on, as there was earlier.

But we do have a problem, and that is that the leader in Raft, or a Raft-like protocol, always coordinates all of the followers. That coordination is potentially expensive, potentially a bottleneck, potentially limiting. If I want a cluster with 1,000 servers in it, waiting for 501 responses from those servers before I can consider something committed might be onerous. So we need to improve on coordination. Now, other databases you can download today partition their data to try to scale across many servers. But there's a real snag here, and that's correctness: correctly storing the data, in a way that doesn't corrupt it as you store it, requires coordination. Bother.
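Before we dig into that snag, here's the bookmark pattern from a moment ago as a minimal sketch, assuming the 4.x Neo4j Java driver, where the bookmark is supplied via the session configuration rather than as a begin-transaction argument. The URI, credentials, and queries are invented for illustration:

```java
import org.neo4j.driver.*;

public class BookmarkDemo {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("neo4j://localhost:7687",
                AuthTokens.basic("neo4j", "secret"))) {

            Bookmark bookmark;
            // Create the account, then capture the bookmark: an opaque token
            // naming the last transaction this session observed.
            try (Session session = driver.session()) {
                session.writeTransaction(tx -> {
                    tx.run("CREATE (:User {name: $name})",
                            Values.parameters("name", "jimw"));
                    return null;
                });
                bookmark = session.lastBookmark();
            }

            // The login read won't run until whichever server we land on
            // has caught up to the bookmarked transaction.
            try (Session session = driver.session(
                    SessionConfig.builder().withBookmarks(bookmark).build())) {
                long found = session.readTransaction(tx ->
                        tx.run("MATCH (u:User {name: $name}) RETURN count(u)",
                                Values.parameters("name", "jimw"))
                          .single().get(0).asLong());
                System.out.println(found == 1 ? "login successful" : "no account found");
            }
        }
    }
}
```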
And worse still than the cost of coordination, there's this notion of reciprocal consistency. So if I have a graph, a tiny graph of me and my boss Emil, and we're on different servers, and there's a relationship between us, let's say a works-for: then if you query outwards from my server, you should see the Jim node with an outbound relationship to Emil marked works-for. And if you're querying from the Emil side of things, you should see an incoming works-for relationship. That's reciprocal consistency; it's reciprocity, in English. And if you can't maintain reciprocal consistency, then your data is corrupt. And that's awful. There was a paper by Paul Ezhilchelvan, Isi Mitrani, and myself, a piece of theoretical computer science, in which we demonstrated that for any reasonably sized database with any reasonable amount of activity, even with no faults at all, just because of the weak underlying consistency semantics, your database is going to be corrupt within months. Your database is useless within months. That is an absolutely awful result, and it unfortunately means that many techniques we might like to use are not accessible to us.

So we need some computer science. Do I remind you of Alan Turing? Maybe not; I don't have the hair. I'm possibly more like Rick Sanchez than Alan Turing anyway. So let's try to make our consensus better. Now, there's a researcher at the University of Cambridge called Heidi Howard, and she's really into Paxos. She's really, really super smart about this, and she's done a bunch of really cool work on achieving consensus with smaller quorums. If you remember, Raft has this very simple idea: if a majority agree, then that is what we go with. But it doesn't have to be that way. You can achieve consensus with smaller quorums; you could have trees or grids of servers that create consensus, so you don't have to do an expensive majority round. There's an interesting trade-off here, of course: while consensus is cheaper, recovery is more intricate. Depending on which instance fails, recovery may be easier or harder. But on the whole, if your system is working, not failing much, this might give you some advantage, because you don't have to ask a majority of machines. Your mileage may vary. I picked Raft because it's simple, but if you're clever enough to do this stuff, it's going to be helpful.

Another idea is simply to avoid coordination. Bailis et al., including my friend Alan Fekete from the University of Sydney, thought about whether you need to coordinate all the time, because sometimes you may have two threads, two transactions, each touching completely different parts of the data set, where they never overlap. In principle, you can see that those things should be able to run concurrently. And what they found was that in TPC-C you only needed to coordinate 13%, that's one-three percent, of the time to preserve correctness. Can you guess how much of the time a typical database coordinates? Well, it's 100% of the time, or, if you're running an eventually consistent database, 0% of the time. So some kinds of workloads simply don't need coordination, because the results commute; some workloads really do require coordination, or you'll get wrong results. So why not coordinate only where you need to? (At heart, deciding whether two transactions need coordination is a disjointness test over the data they touch; there's a sketch of that below.) And here's a common theme emerging from both Heidi's work and the Bailis work: the coordinator. You need some agent on your network that decides who goes first.
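That disjointness test, as a toy sketch. I'm inventing a "footprint" here, just the set of items a transaction reads and writes, to keep it simple:

```java
import java.util.Collections;
import java.util.Set;

class CoordinationCheck {
    // Two transactions can run without coordination when the sets of items
    // they touch don't intersect; only overlapping footprints force them
    // onto the expensive coordinated path.
    static boolean needsCoordination(Set<String> footprintA, Set<String> footprintB) {
        return !Collections.disjoint(footprintA, footprintB);
    }
}
```

That's the 13% from a moment ago: for most pairs of TPC-C transactions, the answer comes back no.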
Who goes first: is it Rosa, or Carl, or Antonio? Having the coordinator manage small quorums is helpful. Having the coordinator do less work is helpful. So what if the coordinator could manage smaller quorums and do less work? Well, then we'd get a lot more uncoordinated work where that's possible, but there would still be the coordinator, and that's problematic. So how do we make coordination scalable?

I'd like to introduce you to a very old piece of work on chain replication protocols. Chain replication is very simple: you push updates into the head, they get propagated through what is effectively a linked list of servers, and you read from the tail. All a given server needs to know is the next server in the chain, where it pushes the update. The tail effectively coordinates, providing strong consistency. It's got slightly tricky failure and recovery semantics, because if a node in the middle fails you need to know how to repair the linked list, but it does provide some usefulness. In a way, though, this is isomorphic to having a coordinator; it's the same situation, because all machines in the chain must agree for the update to reach the tail, where it can be read. So again, you might be thinking: oh, Jim, why did you tell us that? That's just more hard computer science for no benefit. And I'd agree with you. Chains are annoying, and chain replication is not a technique I would choose to use today.

But there was a paper from, I think, again, Cornell and Waterloo (oh yes, it says so here) from these folks, one of whom is now very famous in the electronic currency and blockchain community. What they built was a key-value system that could do multi-key updates. And that's interesting, because a graph is kind of a fancy multi-key value store. They were interested in this, and they used chains to do it. In the system they designed, the chains give one-copy serializable ACID transactions with no spurious coordination. That's a bit computer science-y; what it means is that your data is safe, and you only need to go through the expensive coordination path when transactions overlap. So suddenly chains are cool. Not chain replication, but chain transactions.

A chain transaction protocol works like this (I'll sketch it in code in a moment). You go from server to server preparing the work that needs to be done; this is the outward chain, and it's analogous to prepare in two-phase commit. When you get to the final server in your chain, you come back along the servers and make the work durable; this is, I guess, analogous to commit in two-phase commit. Now, you might be tempted to say: hey, Jim, this looks a lot like two-phase commit, except each server does the coordination for its downstream server. Doesn't it have the same drawbacks as 2PC? Maybe it's even harder to reason about, because with 2PC at least I have a single coordinator, not a chain of mini-coordinators. And you may be tempted to be frustrated with me again. But I would say no. There is some underappreciated genius in the work these folks did, and it says here in their paper: transactions are ordered by servers where they overlap. This is really interesting, because it means that if I've got a transaction running that uses servers one, two, and three, and you have a transaction running that uses servers four, five, and six, they cannot overlap. Therefore they do not need to be coordinated.
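Here's that two-pass shape as a toy sketch, with invented types; this is the shape of the protocol, not the real thing. Work is prepared on the way out along the chain and committed on the way back:

```java
import java.util.List;

// One participant in a chain transaction (invented interface).
interface ChainServer {
    void prepare(String txnId); // outward pass: stage the work (like 2PC prepare)
    void commit(String txnId);  // backward pass: make it durable (like 2PC commit)
}

class ChainTransaction {
    static void run(String txnId, List<ChainServer> chain) {
        // Outward pass: head to tail, each server preparing its piece.
        for (ChainServer server : chain) {
            server.prepare(txnId);
        }
        // Backward pass: tail to head, making the work durable.
        for (int i = chain.size() - 1; i >= 0; i--) {
            chain.get(i).commit(txnId);
        }
    }
}
```

Two transactions whose chains share no servers never meet, so nothing ever has to order them.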
But if we have two transactions that touch the same object, the same server, the same page, the same row, or, for me, the same graph object, then the server on which we contend records that dependency, and it allows one machine, one transaction, to go first. The genius is that it doesn't matter which transaction goes first, provided that order is preserved everywhere else. Which means you can federate coordination into the network: no longer do we need a single coordinator.

Let me show you how it works. Each transaction proceeds through the servers it's using. On the forward pass, each server identifies other in-flight transactions that might overlap with it as conflicting pairs. It captures these dependencies in a dependency graph and stores them, so that when the backward pass happens, it knows it has to resolve those dependencies before it can allow a transaction to go ahead. In this case, you can see that Rosa's transaction and Carl's transaction contend on servers one and three. Now, on the backward pass, let's say for the sake of argument that Carl's transaction finishes its outward pass first and begins its backward pass. Server three is the highest common server for the two transactions, so it decides the ordering, and it makes sense for it to pick Carl here. So Carl commits at three, then at one, and then acks the client. Later, as Rosa's transaction comes back, she will see that her dependency on Carl has already been met, hopefully, by server three and server one, and she can commit immediately. If Carl hasn't finished committing on server three or server one yet, Rosa's transaction will be paused until that dependency is met.

This is wonderful, because it means that in this case server three has acted as a coordinator, but you can imagine scaling this up to any number of servers, each of which might act as a coordinator for the transactions it's currently processing. So you begin to see that this provides a bedrock, an underlay, for a system which is highly, highly scalable. Moreover, in those situations where Rosa and Carl don't contend, where they're writing to different parts of the data set, there is an absolutely dirt-cheap path for them to take when committing their data. They don't need to do the dependency management stuff; they can proceed in parallel with very little coordination overhead. Now, you might say: Jim, in practice, transaction chains are always going to cross. And I'd say, well, yes, quite possibly. But good modeling, enforcing good locality with your model or with your graph partitioning algorithms, should help to keep these things separate, so you maximize your chances of transactions executing concurrently.

So I really do think this is underappreciated genius: transactions are ordered by servers where they overlap. Should Rosa go first, or Carl? The absolutely wonderful observation the researchers made is that it doesn't matter, provided everyone agrees to it. You can choose Rosa or Carl on the toss of a coin; you can choose based on the perceived arrival order; you can do it based on the first of them to change direction; I don't mind. This is a brilliant result, because it means we can have many coordinators and use them only when chains cross. And that gives us scale-out with good strong consistency properties.
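A toy model of "servers order transactions where they overlap": each contended item keeps a queue of transaction ids, and a transaction's backward pass may commit only once it is at the head of every queue it joined on the way out. All names are invented, and the real protocol's dependency graph is richer than a queue:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// One contended object (a node, relationship, page, row...) on some server.
// The order in which transactions enlist here is the order that everyone
// else must honor; which transaction happens to be first doesn't matter.
class ContentionPoint {
    private final Deque<String> waiting = new ArrayDeque<>();

    synchronized void enlist(String txnId) {        // forward pass
        waiting.addLast(txnId);
    }

    synchronized boolean mayCommit(String txnId) {  // backward pass: am I first?
        return txnId.equals(waiting.peekFirst());
    }

    synchronized void release(String txnId) {       // after commit, unblock the next
        if (mayCommit(txnId)) waiting.pollFirst();
    }
}
```

In the Rosa-and-Carl example, the contention points on servers one and three would both put Carl at the head, so Rosa waits exactly until Carl's release.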
In fact, we could go one step further and say: if we observe long chains, we can start to mix approaches. So if, for example, we had nine servers splitting logically into three shards, and we only had the chain transaction protocol, we'd have to create chains that went across nine servers. But we already know how a shard, a group of servers, a replica set, can be replicated using leader-oriented protocols. Do you remember, a little while ago, when we talked about Raft? That's a really good way of making one server a copy of another; it's essentially just copying bits over the wire. So we can now begin to combine these two protocols: we can use Raft for copying bits across identical replica sets, and we can use the chain transaction protocol, the linear transaction protocol, for actually coordinating our commits across those replica sets. So the complexity of the overall commit, the number of messages exchanged, and the points where things can go wrong are drastically reduced. We go from a chain length of nine, which is unwieldy, has hard recovery, and is slow because it's serialized, to a chain length of three, which is much easier to reason about. You get much simpler recovery, because you can recover each shard in its own right, and it's now faster: we've got a kind of parallel commit going on per shard, if you like, and you only need a majority of machines in each shard to commit before you move on to the next step in your chain. And this is really important, I think. In real systems, when failures happen, people are stressed, and having a few simple, dependable building blocks is a great deal of help, so we don't get too stressed out.

So, where we're going, we don't need roads, or in this case, bottlenecks. We have this wonderful composable set of protocols which enables us to create a horizontally scalable architecture with strong consistency guarantees. We're using leader-oriented consensus for replication within shards, across replica sets, and we're using chains for the inter-shard coordination. So we've got a very distinct separation of concerns, a separation of responsibilities. Now, of course, you could imagine pathological scenarios that cause slowdowns through high contention, but you can construct those for any database, and they can be mitigated a great deal with good modeling and sharding strategies. If we take work like the Taper algorithm by Hugo Firth, which gives us good workload-aware strategies for sharding, then these kinds of long-chain contention problems simply won't appear. So we're able to create an architecture that's fit for the future.

Now, look, there's absolutely loads more computer science to come. What I've shown you here barely scratches the surface of probably a ten-year program of work underway at Neo4j. This deals with safe coordination in large-scale, partitioned, distributed graphs. But of course, to build a database, it's not just the transaction protocol. We now have to think about distributed query planning: do we send queries over to the remote shards, or do we fetch data from the remote shards and cache it locally, in a kind of distributed-shared-memory fashion? Are we able to use remote direct memory access, RDMA, to very cost-effectively access remote shards, as we can, for example, in Microsoft Azure's cloud? Or are we running on AWS, where we can't do RDMA and have to do different things? Are we able to do responsive resharding? Are we able to understand the prevailing workload and look for those cases where we keep going across
relationships that span between shards? Can we develop algorithms like Taper that would take that remote relationship and re-home it within a single shard, so you get locality for querying within that single shard and we keep a kind of cohesion? Are we able to do things like automatic tuning? Are we able to use machine learning algorithms to learn about the configuration of our database, as PelotonDB from CMU does, or indeed to learn what a good partition would look like given the prevailing workload, and to tune that partition so it gives us the best locality rates, keeping as much data as possible on a single shard to get those locality bonuses? And, even at the low level, close to the hardware, are we able to provide concurrency and IO in the database kernel that are tuned to the way graphs work, to think differently about concurrency models in a way that suits how users view graphs?

Now, remember that computer science is a wonderful, wonderful career choice. If you want to look like Rick Sanchez here, please talk to me; we'd love to talk to you about joining Neo4j. Finally, I just want to thank you all for listening. Sorry that we couldn't all be there in person this year; it's something I'm missing, and I'm sure you folks are too. The stuff we've talked about here is the future of Neo4j. If that excites you, please do ask some questions; we'd be more than happy to answer. And if you're interested in getting in touch with the Neo4j community in Spain: many larger businesses in Spain are running Neo4j, and there's a wonderful, vibrant Neo4j community in Spain as well, so hit up the Neo4j community site and we can get you connected there. But thank you all for listening. I hope you enjoyed the talk.

Thank you, Jim. We have a couple of questions, but we only have time for one, so I suggest people drop you a line with the rest. Okay, so the question is from Lorena Iturra, asking which customers would benefit from this technology these days. Which kinds of customers?

Yeah, that's a really good question. At Neo4j we are absolutely seeing a variety of customers. In the early days we were very much social and telecoms, because they're obvious networks. But nowadays, if you look at even, say, the Spanish market, we have Santander and BBVA, who are financial customers using Neo4j in a variety of use cases. We have Inditex, the clothing company, doing things like recommendations, stock management, and so on. We have airlines using Neo4j; we have other tech companies using Neo4j. I think today we have the tyranny of choice: asking what's the ideal situation for graphs is hard, because graphs are a very horizontal thing; they're a very general-purpose modeling idiom. So I think in those situations where we might have used a relational database in years gone by, graphs are often now a better fit, no matter which vertical you're in.

Thanks a lot, Jim. There were some more questions on the same topic, so there's enormous interest in knowing which kinds of customers can use your technology. So thanks a lot, Jim. Take care.

Pleasure, thank you. And folks, if you need to contact me, just hit me up on Twitter, and I'm super happy to give you the answers to your questions.