So thank you all for coming. My name is Ryan Wright. I'm the founder and CEO of thatDot. thatDot is the company behind a new open source project called Quine, and that's the topic of today's conversation: streaming graphs, because we can't afford to query anymore. So very nice to meet you.

The plan for the talk today, as I mentioned, is to introduce Quine, the world's first streaming graph. It's kind of like a graph database, but meant for reactive systems, stream processing, and that data pipeline kind of world. The talk is going to start by orienting us among some of the modern challenges you'll run into if you're building a stream processing system or a data pipeline and trying to do it in a reactive sort of way; those challenges were some of the motivation behind Quine. We'll talk about where Quine fits in and how it works under the hood. I plan to give you a demo with some live coding, so I can't wait to see what breaks there. And then hopefully we'll wrap up with some links and, if we're lucky, have a few minutes for questions. So collect them as we go and please ask away when we get to the end. That's the plan for the talk today.

Let's take a look at the world real quick. We've got streaming data. If you're just going to watch that stream and count how many times you see something flow through, that's not so hard: you can count things, sort them, filter them, and so on. But if you want to put those pieces together into something more informative and do complex event processing on your data stream, that starts to become a big challenge. The goal is to take what's coming out of that stream, one record at a time, one event at a time, and put the events together in the right combination, because when you have that combined set, it's more meaningful, it's more expressive, it tells you more. That's what you're trying to find in the stream: something built from many different events flowing through.

If you try to build a streaming pipeline to do that, pretty quickly you're going to have to start building queues to store the events that aren't yet ready to be completed into that whole object over on the right. Those queues of unmatched events have to go somewhere. They have to get collected, monitored, updated, expired out. Those queues are what hold on to each individual event until you've got the full set. It's common to hold these in RAM, but this starts spiraling, because not only do you need queues for the single events that are still unmatched, you also need queues for your partial results. If we're looking to build a three-event pattern, then you've got to stuff your two-item pairs somewhere so that when that third event comes, you can pull them back and combine them. And then you've got to maintain those queues. The common approach is to do all of this in RAM. So the amount of memory your machine has available turns into the size of the time window you can allow for holding onto these data records and putting the pieces together. But when you run out of RAM, you have to expire them out so you can keep up with new things. And I guess we just didn't find matches for those couple of pieces we were keeping track of, so they get expired out.
And if the matching data arrives later on, after those pieces have been expired out, well, that's just a result we're going to lose. So time windows are, sadly, kind of the state of the art for stream processing, for how you do matching and put the pieces together for complex event processing. But they lead to lost results.

If you want to do better than time windows, you'll have to go the route of storing the stuff on disk. So set up a key-value store, save records in that key-value store, fetch them back, pair them up. Same pattern, but now it's durable. And now you've got a second system to set up and orchestrate. That second system means what started as a hopefully simple application turns into a complex collection of systems. Worse than that, the semantics of what you're trying to match, the shape of that goal object, defines the shape of your architecture. So if the team decides we need to match something a little richer, we need to add a field to this, well, you're going to have to rebuild your pipeline and maybe even redesign your architecture to support higher load, more queries, more reads and writes back and forth. Because if you try to use a relational store or something heavier for this, it'll never keep up with the volume in the stream. So what starts as a simple problem turns into this complex solution.

If you've ever built these in the real world, this is an architectural diagram for what they often look like in practice: these microservice architectures. They start off with lofty goals, and it's great that we can actually put them together. We can build them. It takes a while, it's a little tricky, but it is technically possible. I've been a part of teams who've done this as well. Building these systems is the backbone of modern data infrastructure at a lot of different companies. It takes a small army to do it. But if you've ever run one, then you've probably also been part of watching it go down in flames over some implicit expectation about how systems are going to deliver data, when it's going to be there, and whether it's fully consistent or not. These systems are hard, they're error prone, they're really slow to build, possible to scale but expensive in the long run. There are so many implicit expectations between the microservices that get built that if a couple of people end up leaving the team, you're probably better off just throwing it away, starting from scratch, and building the next iteration of your data pipeline. But this is kind of the state of the art. So it's fine. That's the world we live in.

Well, eight years ago, this was the motivation for trying to do something better, something bigger. So we started the project now called Quine. Yeah, really eight years ago, I guess. It's currently run by the team at thatDot. We had major support from DARPA along the way for doing a lot of core research and development, applying it to some really hard cybersecurity problems for the Department of Defense. Quine is a streaming graph interpreter for complex event processing. It feels like a graph database, but it's a graph interpreter for high-volume stream processing. It gives you the ability to put those pieces together, like we showed in that first slide, to join items together to get expressive patterns out of your streaming data and then emit them into a stream in real time. It sits basically between two streams.
So you plug it into a stream on one side and it builds the graph. Then we monitor that graph and stream out the results. There are lots of different options for connectors and kinds of streams you can plug into, from Kafka and Kinesis and SNS and SQS, to other things like server-sent events, WebSockets, files, named pipes, and standard in and out. There are a lot of similar options for where the data goes at the end. And in order to get past that bounded-by-memory problem, there's a stateful layer as well, which is also configurable. Using plenty of other open source tools like Cassandra and RocksDB, that stateful layer provides the supporting structure for the graph so it can persist over very long time periods, and it's why we don't need to rely on time windows in RAM anymore.

Using Quine is really a two-step process. First you say, here's my streaming data, I'll plug it in there, and I'm going to define one database query for how to take that stream and build it into a little part of the graph. Quine then fits that together with other streams or other preexisting parts of the graph. They overlap, they connect, they build this rich, expressive graph. The second part is a second database query. In the Cypher language, you say, here's the pattern I want to find, as a standing query. That standing query lives in the system, in the graph. It pushes itself around through the graph. Every time the incoming stream causes the graph to change in a way that means the query matches again, it streams a result out. So it's like live materialized views on that query, built on a graph from an incoming stream, publishing results to an outgoing stream. The overall goal is to take a high volume of data in and turn it into a smaller amount of high-value data coming out.

So why a graph? Well, we started with a graph because graphs are really everywhere. They are a very natural, rich way to express data. At first, it feels like a graph is something different, that you have to have a certain kind of problem to need a graph, and graphs are slow, so we can't really use them in production. But in truth, what we found over the last eight years of R&D is that graphs don't have to be slow. They can actually be profoundly fast, and they can be the key to scaling the expressivity of the data structures we want to find. A graph gives us a flexible structure in which we can represent anything, including everything you'd otherwise put in a relational table or a document store like MongoDB. That same kind of data fits very naturally into nodes connected by edges in a graph. The graph data model is as flexible as human language: whatever you could express in English, you could also represent directly in a graph. That subject-predicate-object pattern that shows up in the English language is exactly equivalent to the node-edge-node pattern in the underlying graph. So anything you could say in words, you can express as data and compute over in a graph.

What's novel about Quine is that under the hood it is backed by the actor model, implemented on top of Akka. The nodes in the graph get backed by actors. So the graph data model is paired with a graph computational model. And that's really where Quine veers off the beaten path. To my knowledge, nobody had ever done this before. That independent, scalable, asynchronous, event-driven, message-passing architecture is what's behind the data that we represent in a graph.
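To make that subject-predicate-object idea concrete, here's a small Cypher sketch; the names are invented for illustration, but the shape is the point: the subject and object become nodes, and the predicate becomes the edge between them.

```cypher
// "Alice knows Bob" expressed as node-edge-node:
// subject -> node, predicate -> edge, object -> node
MATCH (alice:Person { name: "Alice" })-[:knows]->(bob:Person { name: "Bob" })
RETURN alice, bob
```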
What that means for each node, and for how a graph gets built, is that we take the whole connected graph, whether it's running on one machine or spread across a whole cluster, and we cut the graph up on the edges. We do what's called an edge cut, and represent one edge with two half edges that point at each other. From the perspective of one node, there is an edge connecting it to another node if that node has an outgoing half edge of a certain name, type, and direction, and the other node has the reciprocal incoming half edge of the same name, type, and direction. That pair is what builds up the graph structure. Each node itself holds onto properties, which are basically just key-value pairs, to represent the data. That gives us the structure of a graph that we can chop up as needed and implement as a collection of actors, spread across maybe a large distributed system. Each of those actors can then take action, trigger its own action, do its own computation. That becomes the building block for all sorts of crazy ideas.

So from there, we said, well, what if we built this graph not in the normal way, where you start with an empty system and have to create nodes along the way, and before you can create an edge you have to check and make sure node A is there before node B has an edge created to it — your nodes have to exist first, then you create the edges. Instead of that traditional approach, what if we came at it from the perspective of assuming all nodes exist? Every node you could ever conceive of already exists in the graph. What if we design the system with that as the foundation, instead of saying, well, a database begins empty, so let's start empty? It turns out this gives us some advantages. We can avoid the operations for creating nodes and deleting nodes and instead just focus on individual instructions to an actor that say: add this property or edge, remove this property or edge. If you want to delete a node, remove all of its edges. And if you have a picture like the one on the right, where there are a whole bunch of extra nodes in the conceptual graph you're working with, those extra nodes don't matter for queries. They don't contribute to the result. We're still going to find our A-B-C-D pattern in exactly the same way; we just have these extra nodes in there.

So why would we do that? It gives us some really powerful primitives for working with streaming data, saving it and loading it and all of this. Because under the hood, each node in the graph saves its data using the reactive technique of event sourcing. A node in the graph receives an instruction to set a property, to add an edge, to do something to change its state, and it materializes the result of that change in memory. But what gets saved on disk is actually just the change. That change is added to an append-only log. The next time there's another change, it's added to that log, and so on. Whenever we don't need that node in memory anymore, we can expire out the in-memory side and let the changes on disk stay persistent. And if we need that node again to participate in a query, we wake it up by replaying its history. All nodes exist, but that doesn't mean we have an interesting history for every node. Any node that exists but hasn't been referred to before just has an empty list for its history. So there's nothing to replay, no interesting history to talk about, and the node begins in what is effectively an empty state.
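In query terms, that means you never write CREATE for a node: you address the node you want by its ID and issue those add/remove instructions directly. A minimal sketch of the idiom, roughly as an ingest query would use it (the property and argument names here are placeholders; idFrom is the Quine ID-hashing function that shows up later in the demo):

```cypher
// The node is assumed to already exist; idFrom hashes its arguments into that node's ID.
MATCH (p) WHERE id(p) = idFrom('process', $that.pid)
SET p.pid = $that.pid   // add a property -- no CREATE or MERGE step needed
```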
But we didn't have to create it. That gives us a lot of advantages for stream processing and for the out-of-order data problems that arise when working with streaming events. This event sourcing approach also gives us a durable persistence model that is extremely fast for the write-heavy load we have in stream processing, because it basically behaves like a write-ahead log. We've got a small change that gets computed in memory, and that change gets appended to disk. We can do that very quickly, and we can do it very quickly in parallel. And since all of these changes go through the same node, the log for that node is perfectly ordered. It gives us the ordered history through time of the changes to that node. So we can even go back and query how that data used to be at some historical point and say: give me the answer to my Cypher query right now, and then, here's a timestamp, tell me what the answer was 10 minutes ago, 20 minutes ago.

I mentioned before that standing queries are how we get streaming events out of the graph. A standing query is just an ordinary graph query, written in the Cypher language. But when you issue that query as a standing query, it gets dropped into the graph and stays alive, stays resident in the graph, and pushes itself around through the nodes in the graph. As the graph changes, the events that cause the graph to change also cause the standing query to test and say: do I match my sub-portion of this query? If so, relay the next portion to the next node. Each node is an actor; it can do arbitrary computation whenever it receives an event, so it can receive a query AST saying, continue this query. In that way, standing queries remain live in the graph, push themselves through, and find patterns autonomously behind the scenes. Whenever enough data has streamed in to create a new match, that match gets detected and can trigger any action, including publishing results out to the next system or downstream service, or it can even call back into the graph and perform other operations to update nodes or to query for extra data. I'll show you an example of this.

So if I drop out of the slide presentation and flip over here, hopefully this is legible enough from where you are. This is the Quine open-source website, so you can go to quine.io and check it out if you're interested. There are download links and everything there. If you download the jar file from the download page right here, then you can even follow along and do exactly what I'm going to do. I'll show a link in just a moment for all the resources and everything that I'm going to show. I've quickly set up a Kafka server locally on my laptop, because conference Wi-Fi, you never know. So Kafka is running on my laptop, preloaded with a little data to help show this demonstration. Using Quine is really just a matter of calling java -jar, passing in my Quine fat jar, and then saying -r, for something called recipes, and feeding it my reactive22 YAML file as the recipe. On the Quine website there's a collection of recipes that have been built by the community around Quine, demonstrating the configuration and everything needed for different kinds of use cases. So if you want to go try one of these out, it's really just a simple matter of picking a recipe and running the command with the appropriate name after -r.
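For reference, the command is roughly this (the jar file name will depend on the version you download from quine.io, and reactive22.yaml is just what the local recipe file for this talk is called):

```shell
java -jar quine-x.y.z.jar -r reactive22.yaml
```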
In this case, I'm going to run a local YAML file as my recipe, and it's called reactive22. Let's just run it and see what happens. This is what my recipe looks like. It's pretty empty; there's not much here. This is the minimum template for a recipe. It does have some stuff in the last couple of fields, but that's just cosmetics, trust me on that. The real meat of this example is going to be in lines four and six as we fill those in, because those are the two parts I mentioned: streaming data coming in to build the graph, that's the ingest streams, and then standing queries, which monitor the graph to find graph structure and then trigger action to send it out. So I started this empty recipe, it started at the command line, and it did nothing interesting. That's exactly what I'd expect. Let's kill it and do something interesting.

I'm going to edit this recipe and make the simplest possible ingest stream. This is a Kafka ingest from localhost:9092. I'm going to run this a couple of times, so we're going to start from the beginning of the topic each time, and the name of my topic is "endpoint", because this example is a cybersecurity example. It consumes data from two different kinds of sources: endpoint data, and I'll show you one for network data as well. When we pull those two sources together, we can find some really powerful patterns that indicate a bad guy doing something under the hood in the data. That's our goal here. This is technically all it takes to build an ingest stream, so let's just run that and see what happens. I'm just going to run the same recipe again now that I've saved it. All right, it started up. It's ingesting; as you can see down there, there's a counter going up. And I'm going to go to localhost. This is what the Quine UI looks like. There we go, we've got some data.

Technically, this is a graph. It's not a very interesting graph, because there are no edges, nothing's connected. It would be a waste of time to build a graph just to hold some JSON objects like this. But you can see there are key-value pairs in there. There's an event type and an object; this apparently represents a process. There's a process ID on it, a timestamp, some information like that. Cool, we've got some JSON data that just came in, each JSON object saved as a node in the graph. It technically works, but it's not very interesting. So shut that down and let's do something more interesting.

I'm going to add a specific ingest query. That simple definition just used the default ingest query of taking every object and making it into a node. So if we add a little more to it, I'm going to say this is a Cypher JSON ingest, and here's the Cypher query that actually takes that record and builds some graph structure. I'm going to take one JSON object and build three nodes out of it: a process, an event, and an object. What's unique about Quine is that Quine is very fast for streaming use cases if you know the ID of a node. If you have to go look it up and scan every node that exists, it's going to be incredibly slow. So Quine has some built-in tools like this idFrom function. idFrom is basically a consistent-hashing kind of technique that takes in any number of arguments and hashes them into the ID of a node. So if I take my incoming data, represented here as $that — this is one record we're reading out of Kafka — I can read a field off of that.
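Roughly, that simplest ingest stream looks like this in the recipe. This is a sketch of the shape rather than the exact file, so take the field names as an approximation of the Quine recipe format:

```yaml
ingestStreams:
  - type: KafkaIngest
    topics:
      - endpoint                  # the Kafka topic preloaded with endpoint data
    bootstrapServers: localhost:9092
    autoOffsetReset: earliest     # start from the beginning of the topic on each run
```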
I'm going to use the process ID in this case, and I'm going to generate a node ID using idFrom just by passing in that process ID. That's the constraint on my ingest query: match these three nodes, pull them out of the soup of all nodes that exist, but constrain this incoming reference to the process to the specific ID created by hashing that process ID. That gives us a specific node ID. Down here we can set a few properties on that node, and the last line is just creating some graph structure, some edges. If I save that, I can run it again. There we go, we're ingesting. Let's reload and take a look. I'm going to use a Cypher query to call recentNodes, and we start seeing some more interesting things here. Oops, there's more here than I actually want to look at. But take, for example, this three-node pattern. That's what we're creating right in the middle: this is an incoming process, and it has an event that is reading a file. This is sample data meant to simulate real endpoint information, but it's based on a true story. In this case, that is one record, one incoming JSON object; we just pulled apart some fields and created some graph structure to represent it.

The reason we would want to do that is because there are other connections that occur. As we refer to the same node using the same values in our idFrom calls, we build some other interesting graph structure, and we can see this process in the center here did a lot of things. In this case it read and wrote to this one particular file, and that file was also read and written by some other processes, so you can start to see the graph structure materializing from the data that is streaming into the system.

All right, so we built our ingest stream to construct a graph. Just to make it a little more interesting, I'm going to add a second ingest stream. It's going to work a whole lot like the first one, but this time we have network data coming in. That network data is also going to be built into a three-node pattern, and it works essentially the same way. But the fun part is then we get into standing queries. I want to watch the graph that's being built for a certain kind of pattern, and every time that pattern exists, I'm going to define outputs. These outputs take a match and trigger some action based on it. I'm going to show you pictures of what this looks like, because it's a little easier to glance at when we actually see the diagrams, but I'll show you the pattern this is defined to find. Every time it finds a match, it's going to make another Cypher query to fetch some extra data, take that data, and package it up in a new event that we're going to write out to Kafka. So we're going to watch for a complex pattern, and every time we find it, we're going to fetch some additional data, format it into a Kafka event, and publish it out to Kafka. So let's save that, close it, and run it again. Now down here in the corner I've got a Kafka consumer running, just a command-line consumer watching the place we're publishing to, which is this "results" topic.
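Putting those pieces together, the interesting parts of the recipe at this point look roughly like the sketch below. The ingest query and the standing query here are simplified stand-ins with made-up edge and property names; the real reactive22 recipe has a richer pattern, and you can get the actual file from the link at the end of the talk.

```yaml
ingestStreams:
  - type: KafkaIngest
    topics: [ endpoint ]
    bootstrapServers: localhost:9092
    format:
      type: CypherJson
      query: |-
        // One endpoint record becomes three nodes: a process, an event, and an object.
        MATCH (proc), (ev), (object)
        WHERE id(proc)   = idFrom('process', $that.pid)
          AND id(ev)     = idFrom('event', $that.event_id)
          AND id(object) = idFrom('object', $that.object_id)
        SET proc.pid = $that.pid,
            ev.type = $that.event_type,
            object.name = $that.object_name
        CREATE (proc)-[:EVENT]->(ev)-[:TARGET]->(object)
standingQueries:
  - pattern:
      type: Cypher
      query: |-
        // Simplified stand-in for the real multi-node pattern used in the demo.
        MATCH (proc)-[:EVENT]->(ev)-[:TARGET]->(file)
        WHERE ev.type = 'WRITE'
        RETURN DISTINCT id(file) AS fileId
    outputs:
      publish-result:
        type: CypherQuery
        query: |-
          // Fetch extra context for each match before publishing it downstream.
          MATCH (file) WHERE id(file) = $that.data.fileId
          RETURN file.name AS name
        andThen:
          type: WriteToKafka
          topic: results
          bootstrapServers: localhost:9092
```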
So you can see data is ingesting, it's streaming in. The UI is really meant for visualization purposes and inspecting small graphs, but behind the scenes Quine can scale to run in a clustered environment and handle huge volumes of data streaming through. Behind the scenes it's building a graph, it's monitoring that graph, and if it finds a match — there we go, I saw something happen in the background. Our event got published to Kafka and the consumer picked it up. This is what that event looks like: it has metadata in here about the result, and it has the actual payload of the data result. I was publishing out a URL, so let's take that URL. If I open a new tab and paste in the URL, it formatted a query as part of that payload that points me to exactly the spot in this data that matched the pattern I was watching for.

So the pattern we set as a standing query is looking for this pretty complex graph structure. This is a case where a process had a write event pointing at some file, and that file was actually a covert communications channel behind the scenes for another process, called NT Clean, to read that file and then delete it later, but first it's going to send some data out to a specific IP address. So we were watching the stream, the stream was building each of these nodes one at a time, incrementally building the graph and monitoring for this complex pattern. Once we found it, it collected the data and published it out as a URL that we could just click on in a browser to see the details of what we're looking at.

Now, if I were a cybersecurity analyst, I could go further. I could say, well, here's Excel: what files was it reading? Let me right-click and choose the right thing, what files were read. Well, it read a spreadsheet, and that spreadsheet probably had a malicious macro in it, and it turns out it also then read this private plans document. So we can start to see what's happening here. Well, where did this NT Clean process come from? It was started by — whoops, there we go — it had a spawn event connected to an SSH process. That SSH process itself received data on a local IP and port, and that IP and port had some external communication from the same IP address that the data ended up being sent out to. So in the Quine UI we can bake in some of these right-click exploration kinds of functionality to go explore this particular use case. Different use cases can support different functionality, but in this case the triggered event found the critical result that communicated what we wanted, and a human could come back in and explore more later on.

So that's an example of using Quine connected into Kafka to consume data, build a graph, monitor the graph for complex patterns, and every time one of those patterns is found, send a downstream event, so that you don't have to build oodles of infrastructure and microservices just to do that complex event processing.

With that, I want to wrap up with a pointer to this URL. Quine.io is the open source website. You can go check out the code; there are links to GitHub on there if you'd love to try it out. There's a community to help answer questions and get started, and as I mentioned, there are recipes to demonstrate different use cases and show how to get started. This is supported by the team at thatDot, and if this is your thing, we are hiring right now, so if you want to come work on this kind of stuff, we'd love to talk to you. There's also another demo of a different application that you can see at quine.io/demo. But if you'd like to try and run everything I showed here, check out
quine.io/reactive22. You'll find the slides, the files I was using, the sample files, all of that, so you can run this and try it out yourself. Thank you very much. If you have questions, we are out of time, but I would love to hang around and take them, so catch me afterwards. Thank you very much for your time and attention.