Hello, my name is Ben. I'm at Queen Mary University of London, and I'm going to be talking about Raphtory, a system that my two advisors and I have been building, which looks at distributed temporal graphs and how we build and maintain them from a set of event streams.

First, a little bit of background on how we got to this point. Some of the original distributed graph processing systems have this idea that you've got a big chunk of data on disk and some chosen algorithm: you load the data in, turn it into your graph, churn through a couple of iterations, and out pops your result. Then if you want to see how things have changed over time, you might have snapshots, say once a day for the last few months. Again, you load all of these in, build them into graphs, get a set of outputs, and then compute deltas between them to see how things have changed. That's quite coarse: if you only have snapshots once a day, you lose whatever happens in between.

This has improved with stream-based graph processing systems, where you have some event source out in the wild. Some of the examples we've been looking at are cryptocurrencies, mapping data (so people moving around cities), and obviously social networks. Changes in these event sources can then affect your in-memory graph: in the case of a social network, a user joins the network, someone follows their friend, and so on. These can all be inserted, and users of your system can then query it, request processing, and get their results back. This is great if you want to do some analysis on the most recent version of the graph, or alternatively if you've got some metric that you're interested in monitoring and then seeing how it changes over time.

So what we were thinking was: if you've got all these changes coming in,
and all these problems trying to keep your graph in sync and up to date, why don't we just keep all of the changes and build a full temporal graph? In some ways this simplifies how we actually synchronize, but it also allows us to do things like comparing the newest state to all of the previous versions of the state, and to run proper temporal queries. For something like a shortest path, it might be that I only want to go out on edges that are younger than the one I came in on. Or alternatively, say for planes flying around, edges only exist for a certain period of time, and you need to get there and onto that edge before it disappears.

With those ideas in mind, we came up with Raphtory. Our initial work was on formalizing this temporal graph model and its update semantics: how we add and remove vertices and edges, as well as update the key-value set of properties associated with them; how we actually distribute and manage this graph in memory, as a set of partitions each holding a set of vertices and edges; and how we stream all of these updates into those partitions and keep them in sync. We also provide a Pregel-like temporal graph analysis model, in which the user can request analysis on the live graph or at any point back in time, down to the resolution of the actual timestamps on the data. So you could ask what the graph looked like last Thursday at 3:02 in the afternoon, or look through ranges, hopping throughout the whole history of the graph, computing different metrics and seeing how they change.

I'll now do a quick run through the architecture. Over here we have the data spout.
The spout is how the user decides how to connect to the outside world: read this file, connect to this database, listen to this Kafka stream, something along those lines. This feeds into a set of graph routers. Effectively, a router takes a user-defined function describing what the raw input translates into in terms of graph updates: what is a vertex, what is an edge, is this an update to a property, and so on. It then forwards each update to the correct partition manager, the partition of the graph which deals with the affected vertex or edge. And while this is constantly running and maintaining the graph, users can submit analysis requests which talk to the partitions; I'll come on to that in a second.

If we dive into one of these partitions, it holds a set of vertices and edges, as I said, and all of these have a history appended to them. In this example, a vertex was created at time 8 and then had an update appended to it at time 14; an edge was created at time 14 (possibly why the vertex has an update) and was then deleted at some point later on. As we're split across several partitions, we use an edge partitioning algorithm: because vertex 1 and vertex 2 live on different machines, the edge between them is actually split across the two machines, and you can see they're kept in sync.

One thing that's really interesting about this type of history is that all of our updates effectively become additive. Even if a delete happens first and an add arrives after it, as long as we keep this chronological list, we can just slot them into the correct position.
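As a rough sketch of this additive, chronological history idea (class and method names here are illustrative assumptions, not Raphtory's actual API):

```python
import bisect

class TemporalEntity:
    """A vertex or edge whose full history is kept as a chronologically
    sorted list of (timestamp, event) pairs. Illustrative sketch only."""

    def __init__(self):
        self.history = []  # sorted list of (time, event) tuples

    def insert(self, time, event):
        # Out-of-order arrivals are fine: because updates are additive,
        # we just slot each one into its chronological position.
        bisect.insort(self.history, (time, event))

    def alive_at(self, time):
        # Replay the history up to `time`; the latest add/delete wins.
        alive = False
        for t, event in self.history:
            if t > time:
                break
            alive = (event == "add")
        return alive

# Even a delete arriving before its matching add yields the right graph:
v = TemporalEntity()
v.insert(20, "delete")   # the delete arrives first...
v.insert(8, "add")       # ...then the late add slots in before it
assert v.alive_at(10)        # alive between creation and deletion
assert not v.alive_at(25)    # deleted by time 25
```

A placeholder vertex in this scheme is simply an entity whose history so far contains only the events that have arrived, with the rest slotted in whenever they turn up.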
So you always end up with the right graph. This is nice, because when updates come in the wrong order, a lot of other systems either have to drop and ignore them, or end up with an incorrect state in one of their partitions.

As an example, say an edge add comes in at time 14. Partition manager one deals with it, because vertex 1 is the source node: we insert it into that machine, the edge gets created, and vertex 1 gets updated. We then synchronize across to the other node to say "hey, I've got an edge that I share with you", so vertex 2 gets updated and the edge gets created there as well. That's all fantastic: everything's happened in the correct order, everything's brilliant.

What happens if an edge gets added before we get its vertex? In this case we create both objects, but the vertex is just a placeholder. Again we synchronize and do everything exactly the same, and if the vertex add comes in at a later point, we just slot it into the history behind, so that when it arrives with all its properties, all the interesting metadata about that vertex, it can be inserted at that point.

And then, obviously, things can go completely haywire. Say some packets have been lost, or the network's gone all over the place, and a vertex has actually been deleted before it's even been created. In a lot of systems you might find this is treated as nonsense: just drop it and ignore it. That's obviously not what we want to do here.
So again, we have a placeholder object which holds the deletion. When the edge add in this instance comes into the other machine, it does its work there and then synchronizes across, at which point the vertex gets its creation at the correct point and we can insert the edge. And because this vertex was deleted, all of its incoming and outgoing edges should be deleted too, so we don't have anything hanging, and that can then synchronize back. So even though this all went completely wrong, you still end up with the same state and the same temporal graph moving forward. We've also stuck in some watermarking, so you know when it gets to the point where this is safe, or if you want, you can go with the approximate approach of "just give me what's in memory now".

On that point, and I know this is a bit of a whistle-stop tour, let's pop on to the analysis. The general idea is that the routers are constantly ingesting new information from whatever source you've specified, assuming it's unbounded, and the partition managers are constantly keeping in sync with each other while waiting for requests from an analysis manager. The user says "hey, I want to run this analysis, can I submit it?" This goes off to all the partition managers, which go through their set of vertices, run this vertex-centric algorithm, and return to the analysis manager. It does some magic and says: okay, all of my vertices have either voted to halt, or another iteration is required. This goes back and forth until it's happy that it's finished, and the result can be returned to the user.

So what can the user actually request?
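The vote-to-halt loop just described can be sketched roughly as follows; the function names and the toy vertex program are illustrative assumptions, not Raphtory's actual interface:

```python
def run_analysis(vertices, vertex_program, max_steps=100):
    """Minimal vertex-centric (Pregel-style) coordinator: keep running
    supersteps until every vertex votes to halt. Illustrative sketch."""
    for step in range(max_steps):
        votes = [vertex_program(v, step) for v in vertices]
        if all(votes):            # every vertex voted to halt
            return step + 1       # number of supersteps taken
    return max_steps

# Toy vertex program: each vertex counts down to zero, then votes to halt.
class V:
    def __init__(self, n):
        self.n = n

def countdown(v, step):
    if v.n > 0:
        v.n -= 1
        return False   # still working, so another iteration is required
    return True        # vote to halt

vs = [V(0), V(2), V(3)]
steps = run_analysis(vs, countdown)
assert steps == 4                   # halts once the slowest vertex is done
assert all(v.n == 0 for v in vs)
```

In the real system each partition manager would run the program over its own vertices in parallel and report back, but the back-and-forth termination logic is the same shape.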
Well, the first thing is that, since we have this temporal graph in memory, we can say: give me what the live graph looks like. This is the most recent version of the graph, either watermarked, which is the safe live graph, or alternatively you can ask for the bleeding edge, the absolute most recent version, in which case you accept some margin of error. It depends on what sort of use case you have.

Alternatively you might say: give me what it looked like last week, last month, a year ago, something like this. These tend to be stored in memory, so we build that view; I'll go over that in a second. Or, if Raphtory has been running for a very long time and the older stuff has had to be pushed out of memory, we can start loading some of it back in if you want to go back that far. We're also looking at ways of offloading very old queries onto a different set of partition managers, so that such a query doesn't interrupt what's going on in the most recent version of the graph; that's obviously future work, though.

Cool. So say we have this full history of everything that's been ingested, from time zero up until time n, the newest update. We might say: I want to see what the graph looked like at t10. A good way of thinking about a view is that it's like a right-hand filter.
You're saying: this is everything that's happened, but I'm not interested in anything that happened after this point, so let's get rid of that for the moment. You then get to see exactly what the graph looked like at that point in time, and that can be used for your analysis.

But one of the things we found is that if you're looking at very large datasets that have existed for years and years, there are an awful lot of patterns in the short term that get hidden by this huge aggregate mass of data. So we added in something we like to call graph windowing, which is like the left-hand filter. In this case you're saying: I'm only interested in things that happened in this band of time, from this timestamp back over the last day, the last week, the last month, and so on. That lets you view the short-term patterns as well as the long-term ones. On top of that we offer window batches, so you can start at a point and then continuously decrease the size of the window until you've done all the ones you're interested in.

As well as these individual views, you might say: I'm interested in a range of time, say the last year, and I want to hop through it at some set interval, maybe an hour or a day. Again, you can do this: we build a view at the oldest point, say time 4, then hop forward to time 6, then a new view is generated at time 8, and at time 10. And if you're doing these ranges, you can have all the windowing and window batching on top as well.

That's obviously all very theoretical, and a concrete use case would be nice, as I imagine a lot of you are thinking. One of the things we had a look at here was a network called Gab. Has anyone heard of Gab? Good, I wouldn't think so. Gab is a Twitter clone.
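Before getting into the use case, the view, window, and range operations described above can be sketched over a plain event list; the function names are my own illustrative assumptions, not Raphtory's actual API:

```python
def view(events, t):
    """'Right-hand filter': everything that happened up to and including t."""
    return [(ts, e) for ts, e in events if ts <= t]

def window(events, t, size):
    """Add the 'left-hand filter': only events in the band (t - size, t]."""
    return [(ts, e) for ts, e in events if t - size < ts <= t]

def range_query(events, start, end, increment, size=None):
    """Hop through history at a set interval, yielding a view
    (or a window, if a size is given) at each point."""
    t = start
    while t <= end:
        yield t, window(events, t, size) if size else view(events, t)
        t += increment

events = [(4, "a"), (6, "b"), (8, "c"), (10, "d")]
assert view(events, 8) == [(4, "a"), (6, "b"), (8, "c")]
assert window(events, 8, 2) == [(8, "c")]   # only events in (6, 8]
snapshots = dict(range_query(events, 4, 10, 2))
assert len(snapshots) == 4                  # views at t = 4, 6, 8, 10
assert snapshots[6] == [(4, "a"), (6, "b")]
```

A window batch is then just the same window call repeated with a decreasing size, and a windowed range query passes a fixed size into each hop.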
It's this sort of right-wing forum, but it had an open REST API, so I downloaded all of their posts. I think they're big on the whole free-speech thing, so I assume that's why it's open. I scraped everything between the end of 2016 and mid-2018, and we then had a look at what happens if we set a query running for that whole range of time, hopping forward an hour at a time: do we see any changes in something simple like the largest connected component? For this we set several different window sizes: a very small window like an hour, then a day, a week, a month, a year, and then the full aggregate graph.

An interesting thing here is that even though you're running the same algorithm, you actually see very different patterns. The aggregate shows the largest connected component continuously growing, whereas if you look at something like the month window, you have these peaks of interest: this one is Donald Trump's election, this one is the Charlottesville riots. There are these peaks of activity when people join the network and start using it, and then it drops back down again.

If we zoom in a little further, down to the hour scale, you can see that for everything above a window size of about a day, the largest connected component is always close to a hundred percent of the graph: everyone's always connected, and it doesn't really change very much. However, for an hour-long window you get this lovely diurnal pattern as people go to sleep and wake back up.
During the day it's something like 80% of the graph, so almost everyone is connected, but as people start to go to sleep, this all breaks down into very small communities that are talking through the wee hours, and then it builds back up again as people come back online. So even the same query, run over these different lenses or views of the graph, gives you very different results. We're starting to explore this a little now, and we're obviously interested in anyone who wants to talk about this sort of stuff.

On that point, if you are interested in using Raphtory, it is available on GitHub. It's all Dockerized, and has some honestly pretty dreadful scripts to run it with; I'm working on improving those. There are examples including the actual Gab graph I just went over. We've got loads of spouts for ingesting different data: Gab, Twitter, Bitcoin, Ethereum, and loads of other random ones. We have actually ingested the whole Bitcoin and Ethereum graphs over a big cluster of machines, and we're working with a couple of different companies on some know-your-customer and entity-resolution work. We also have multiple analysis functions, things like connected components and PageRank, and we're looking at information diffusion.
That's things like spreading taint across a cryptocurrency, plus simple things like degree ranking and so on.

As for the future of Raphtory: we've just been funded by the Alan Turing Institute in London, for any of you who know it, to turn this from an initial research project into an actual product that researchers can use. We're partnering with the University of Leeds to look at some very large transport datasets, mapping people moving around cities, and then seeing how that changes over time. If the council does something like put in a pelican crossing, how long do I have to monitor to see changes in foot traffic? We're also now spinning this out of Queen Mary into a company called Choreograph, so if you do see this name pop up, it's probably me, or someone trying to steal my name. If you are interested, please drop me a line, or leave anything on the GitHub; I'm always on there. Thank you very much for listening.

[Q&A]

So the question is: can we achieve performance improvements by taking snapshots? Do you mean on the actual processing side, or on the ingestion side? On the processing side, we're looking at this a little bit. For all the in-memory stuff, when you build a view, all the previous versions are already there, so you just go to the vertex and pull what it would look like at that point in time: you filter initially, and then you can use that vertex as it already exists in memory. For the stuff that's pulled back in from disk,
we're having a look at different snapshotting and replay mechanics to make sure they work properly. The idea is that every x minutes you take a snapshot, and then you use message replay, via Akka's message replay, to get back to exactly the point you're interested in, and then work out the heuristics around that. That's the next step we're looking at at the moment. For everything that's in memory, it works pretty much as is: that's the point of truth.

[Audience question] Yes, so we're having a look at perhaps some sort of vector clock implementation, or something like that, where you get different timestamps from all over the place coming in. For the moment, our assumption is that if you're attaching to an outside source, that source provides the timestamps we use. A lot of the ones we've looked at do: for the social networks, for example, that's done within their servers; for the cryptocurrencies, it's done when the block is published. Most of our use cases have focused on that. It would be really interesting to see how we would handle the other case, but for the moment it's not so much of a priority. I see what you mean.

[Audience question] Yes, so that, with perhaps some sort of rollback feature, would be for when an update comes in that shouldn't have come in. I don't know how we'd do it at the moment, but it's definitely something to consider, actually. I'll put it in my notes of things to add at some point.
Yeah, no, thank you very much.

[Audience question: "What's the biggest challenge of ...?"] No, so at the moment the view is that when you build a view which is safe, either it's been watermarked or it's at a previous point in time, then it's just a static graph, and that's fine. When it comes to analysis on the most recent version, it obviously runs in parallel, but you do have some degree of approximation. We're having a look at whether we can work out what that degree is, or something around that. But we've only really started work on the analysis side in the last six to nine months, so yes, it's definitely the next frontier, for sure.

[Audience: maybe self-driving cars, but also games that have dynamic environments?] So I guess the question is: can you use it for pathfinding in a dynamic environment? I think it depends on the speed you're interested in, and of course the size of the graph. You probably could if you're going for, say, around a couple of hundred milliseconds. If you're interested in proper real time, microsecond or sub-millisecond, it probably wouldn't return fast enough. It's been more optimized for general queries throughout time, like the ones we were showing: you chunk the data in, leave it running, and it goes forever. But if we had the data to have a play around with, I'd love to give it a go. Thank you very much.