Well, hey everyone, welcome — am I getting a little bit of feedback here? Maybe if I stand over here it's better. So hey, welcome to "Complex Event Processing Made Easy with Apache Cassandra and Quine Streaming Graph." My name is Aaron Ploetz. I'm a developer advocate with DataStax, and I'm fairly active in the Cassandra community — if you've had problems and asked questions on Stack Overflow, there's, I figured out one day, something like a 16% chance that I've helped you, since I've answered so many of the questions in the Cassandra tag. And then, of course, with me I have Ethan, from thatDot — so I'll let him take it away.

Yeah, I'm Ethan. I've been at thatDot since just about its inception, I know the internals of Quine at every layer, and I just love discussing it. So that's what we'll be doing a little bit of today, as well as playing with the general ideas of stream processing.

So I'm going to jump right into it: stream processing. I want to give a quick layout of what we're going to go over. We'll cover the general topic of stream processing; where "complex" comes into it — where we start thinking about complex event processing; the challenges you're likely to face if you're dealing with any kind of stream processing; how Quine, the product my company makes, addresses a lot of those challenges; and how Cassandra really enables Quine to address them. Then we'll run through a case study of a password spraying attack and how you might detect it in a streaming paradigm, and we'll conclude with a few resources — places you can go if you're curious, if you want to learn more, or just want to chat with us. We're fun, I promise. Yeah, we're not totally lame.

So what is stream processing? Stream processing is a kind of data-oriented development where you're typically trying to draw real-time conclusions from current information. I have a little asterisk on "current," because current can mean a few things. It can mean real time; it can mean "this happened two minutes ago and I want to react to it"; it can mean "this happened 14 milliseconds ago and I want to react to it"; or it can just mean current with respect to a big pool of data I'm reading, where I want to be focused on the head of that pool — the latest thing I've read.

Stream processing tasks are typically event-based, so you're thinking about things like social media updates and server logs — anything that's more about small changes over time rather than big snapshots being relayed in a loop. Related to that, one of the limitations on what stream processing is, in my opinion, is that you tend to think less in terms of batching, less in terms of large blobs of data coming through. As a result, stream processing has some limitations — and some strengths. A couple of those limitations: loose consistency bounds, meaning you don't have the ability to say, "this answer I've computed is correct over all of the data," because you're not dealing with all the data — you're dealing with the latest data, or some subset of it. And error correction falls in that same vein: I'm not necessarily going to be totally consistent in my answers with respect to the entire data set, but I will be able to give you some kind of bounded answer. I'll be able to say, "based on this information, here's a definitively correct answer."
The thing is, though, you often don't need those full guarantees — and we'll get into that a little more.

So, the difference between stream processing and complex event processing: in typical stream processing, your data comes in and you do some operation, then another operation, then another — you're filtering, you're mapping, folding, whatever you want to do. Complex event processing is where you start to introduce persistent state, and that typically comes in one of three ways.

First, you're dealing with more than one event at a time — more than a static number of events at a time. Here you're thinking about things like sliding windows: "I want to be computing answers in real time based on the last 20 events, or the last 400,000 events, or the last two days of events." You can get time-based here as well.

Another common way you get into this complex, stateful space is when you're joining multiple streams. The worry here is: what happens when those sources of information talk about the same entity — they talk about the same user — but they're coming from different sources, and therefore arriving at different times? How do I join that information together without losing anything, while still retaining bounded resource usage?

And then, going one step further, you have the problem of joining a stream with itself, which is all the problems of joining streams and sources plus all the problems of dealing with more than one event at a time. Here you've got something like: "I want to change the action I'm taking on incoming data based on something I've already computed about that data." Feeding that decision back in is equivalent to joining multiple streams, and at the same time I'm reasoning about multiple things at once — I've got to deal with this part of the stream, that part of the stream, and the connection between them. That can be really difficult, but it also enables far more powerful computation. You get back some of the trade-offs that stream processing gives up in the first place, like consistency: you can say, "I'm going to make a heuristic initial decision, and then look back at my existing data to ensure that my computed answer is correct for everything that's come in since then, or everything in the past." That's again an example of a stream joining back on itself.

One concrete example: you might use an alert to update your alerting rule. You might say, "hey, this account looks suspicious — let me set a new flag in real time to watch everything related to that account, or everything related to everything related to that account." You can do a kind of tag propagation, if you will.
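To make that concrete, here's a minimal sketch of what tag propagation can look like in Cypher — the labels, edge names, and properties here are all hypothetical, invented just for illustration:

```cypher
// Hypothetical tag propagation: once an account has been flagged as
// suspicious, push a watchlist flag onto everything directly related to it,
// so queries evaluating later events can take it into account.
MATCH (account)-[:RELATED_TO]->(neighbor)
WHERE account.suspicious = true
SET neighbor.watchlisted = true
```

Because the flag lives in shared state rather than at any one position in a stream, events arriving later — from any stream — can react to it.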
So stream processing can be hard. Especially when you're dealing with complex event processing, you have to choose some things: you have to choose your window size, your time interval. You have to handle data that's bigger than memory, and therefore decide how you downsample. That's the case for all of these constraints — when you have more data than you can deal with right now, how do you deal with it? That's not always an easy answer. It's not as simple as "I'm going to keep the latest 700 events," because what happens if something really important was the 701st-most-recent event and you lose it? You don't want to lose results like that if you can avoid it. At some point you have to make some trade-off decisions, but you can push that point out a little further with some of the technology we're going to show today.

Then there are other challenges. Latency is a big one — again, the nature of streaming means you're usually dealing with some notion of real time, some notion of "current," where current can mean current with respect to a pool of existing data, or current with respect to live data coming in from the world. There's the consistency aspect as well, and a specific problem within consistency is result invalidation: I computed an answer based on positions 5,000 through 6,000 in the stream, but something comes in at position 6,502 and says, "actually, I was a little too eager with my data at position 5,000 — please correct this." Somehow, even though you've already computed that answer in a streaming fashion, you need to be alerted that the decision might have changed. And then there's the general problem of distributed systems: if you have multiple data sources, you want to make sure your reasoning is consistent with respect to some notion of time across those sources.

So this is the part where I say we've got an answer. Stream processing was hard, but now we've got Quine. Quine is the product I spend 99% of my time working on at thatDot. The idea is that we take a lot of data and allow the user to declaratively configure how they want to ingest that data, how they want to work with it, and how they want to act on it — including actions taken to affect the actions they're taking, which is a little abstract; I'll get back to it.

Put another way, there are kind of three steps here. We have a property graph data model, which means we represent data as a mathematical graph where everything is either edges or nodes (nodes, vertexes, vertices — same thing). A node has properties — a key-value map — and labels; edges can have labels too. Then the usage experience looks like: "I've got my data, and it's not necessarily something I'd think of as a graph. How do I want to structure it as a graph? What structure do I want to look for in my data?" As well as: "how do I want to handle that structure being found?"

I would just add to that: Quine, as a graph tool, isn't a lot different from the other graph databases or graph tools you might already be used to. Things like JanusGraph and Neo4j are all property-graph-based models. And likewise, I think they allow — well, JanusGraph maybe not — but they allow you to use Cypher, which is kind of becoming a standard in the graph world, and which Quine supports too. We'll show you a little bit of Cypher coming up as well.
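As a quick, illustrative sketch of that property-graph model in standard Cypher — the labels and properties are invented for the example:

```cypher
// Nodes carry labels (User, UserAgent) and key-value properties;
// edges carry labels too (USED_BY). All names here are illustrative.
CREATE (u:User { userId: "alice" })
CREATE (a:UserAgent { ip: "203.0.113.7" })
CREATE (a)-[:USED_BY]->(u)
```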
So let's step back a second — I've been talking a lot about data and computation in a very abstract sense. Let's look at a specific example: a real-world problem, password spraying. I've got some malicious actor who has made a list of usernames somewhere, got a list of passwords somewhere, and wants to find where those usernames and passwords overlap. A very classic, easy problem — you'd think.

The typical solution to this problem is: look for someone failing their password entry three times, then require them to reset the password — usually kind of a bad user experience. If you look at this through a streaming lens, there's a question of rate: three times within what window? If it's indefinite, how do I remember it indefinitely? If it's bounded, what happens if someone goes just slower than that bound? What happens if someone is spraying passwords — in this case, trying the same password across a bunch of different users? That's another vector, and one that traditional heuristics for detecting these logins often miss — and that Quine handles no problem.

So this is the kind of structure we're going to be thinking about. Again, going back to step one — define the structure you're looking for — we are looking for a repeated sequence of accesses, over any time frame, where someone tries to log in and fails, tries again and fails, fails, fails, fails... and then succeeds. A sequence of failures from the same user agent, from the same client, for the same account, that ultimately ends up succeeding — but only after a certain number of failures, a suspicious number of failures. And that could be over any time frame — maybe months.

We can encode this pattern using the Cypher graph language — again, kind of the current industry standard for interacting with graph-shaped data. I'm not going to get too into the weeds on it, but the orange node is our victim — our potential victim — and the blue node is our potential attacker, corresponding to the top and bottom of this structure. In between I have a sequence of nodes representing login attempts that were all failures, followed by a success. If I find that pattern, I return the location where it exists, so I can do some further processing on it.

Then step two: how do I respond to a detection of that pattern? In this case, I'm going to put it back in the graph — add it to my event stream. This is a case of feeding back into the source event stream. I'm going to say, "hey, I found this pattern; therefore I'm going to flag the victim with the time they were likely compromised, and I'm going to flag the attacker and say this user agent is a likely bad actor." And then I can chain that into further patterns. Again, feeding back into the source stream lets me define any number of patterns I like and act on them independently: I can act on the pattern of a malicious user agent; I can act on the pattern of someone who logged in after they were likely compromised. I can do any of that, all from the same stream, all in this kind of declarative way.
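To give a rough idea of the shape of those two steps — this is a simplified sketch, not the demo's actual queries: the run of failures is shortened to two, and all of the edge and property names are invented — the pattern side might look like:

```cypher
// Pattern: a chain of failed login attempts ending in a success, tied to one
// account (the victim) and one user agent (the attacker). This assumes the
// ingest chains successive attempts together with a hypothetical NEXT edge.
MATCH (user)-[:ATTEMPTED]->(a1)-[:NEXT]->(a2)-[:NEXT]->(a3),
      (agent)-[:MADE]->(a3)
WHERE a1.success = false AND a2.success = false AND a3.success = true
RETURN id(user) AS victim, id(agent) AS attacker, a3.time AS compromisedAt
```

and the response side — assuming the matched values come back in as query parameters — writes the conclusion straight back into the graph:

```cypher
// Flag the victim with the likely time of compromise, and mark the
// user agent as a likely bad actor, so further patterns can chain off it.
MATCH (user), (agent)
WHERE id(user) = $victim AND id(agent) = $attacker
SET user.compromisedAt = $compromisedAt,
    agent.likelyBadActor = true
```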
And that just leaves: how do I actually get the data into the system in the first place? How do I make this graph? Because what I'm looking at is a log of attempted logins — I'm not getting Cypher queries out of my Apache log, I'm getting log files, right? Let's say they're JSON, because it simplifies things a little and it's not too hard to imagine encoding or decoding from that, where each login record just has a user ID, a success flag, a timestamp, and the login IP to represent the user agent. That's all I need to write an ingest query that says: identify the user based on the ID in the log; identify the user agent based on the IP in the log; set the data I've acquired — make a note of it in a persistent way; and then link those pieces of information together at the end. Just create some graph structure. It's as simple as that to do the ingest.
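Sketched in Quine's Cypher dialect — where $that refers to the incoming record and idFrom() derives a stable node ID from its arguments — an ingest query for records like that could look something like this; the JSON field names are assumptions, not the demo's exact schema:

```cypher
// Hypothetical ingest for records like:
//   { "userId": "...", "success": false, "timestamp": "...", "ip": "..." }
// idFrom() gives deterministic IDs, so the same user ID or IP always
// resolves to the same node, no matter when or from where it arrives.
MATCH (user), (agent), (attempt)
WHERE id(user)    = idFrom('user', $that.userId)
  AND id(agent)   = idFrom('agent', $that.ip)
  AND id(attempt) = idFrom('attempt', $that.userId, $that.ip, $that.timestamp)
SET user.userId     = $that.userId,
    agent.ip        = $that.ip,
    attempt.success = $that.success,
    attempt.time    = $that.timestamp
CREATE (user)-[:ATTEMPTED]->(attempt),
       (agent)-[:MADE]->(attempt)
// A fuller version would also chain successive attempts (e.g. the NEXT edge
// assumed in the pattern sketch earlier) so sequences can be matched.
```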
Now, at this point, if you're at all engineering-oriented, you might be thinking: "okay, but you just told me about all these shortcomings of complex event processing and how these things are really hard. Why didn't you mention windows? Why didn't you mention time bounds? Why didn't you mention the problem of out-of-order events?" And the answer is — I kind of did: I said Quine, and that's the entire premise of Quine. What I spend 99% of my time on is making it so that Quine handles that, so you don't have to. The way it manages that is through an intelligent in-memory cache: it keeps the important nodes — the nodes you're interacting with frequently — rapidly available, and any others it persists via Cassandra in a very scalable way, and in a way that allows new events to still influence arbitrarily old events. Between these two technologies — both of which scale essentially unbounded, horizontally and linearly — you get a really robust system. And to get a little more into how that works, I'm going to hand off to Aaron now to talk about Cassandra a bit.

All right — and of course, Cassandra. That's really one of the reasons we're all here. Just a quick poll of the room: how many in here consider yourselves maybe new to Cassandra? Yeah, all right, I've got a few. Don't worry, Carter, you're not alone — oh, I just called you out there; I'm sorry, man. Okay, and everyone else has used Cassandra a bit? Okay, that's good.

But of course, the Cassandra we all know and love, as Ethan mentioned, scales linearly and horizontally. It distributes replicas geographically, which, to be honest, is probably my favorite feature of Cassandra. In all the years I've been using it — about 11 now — I haven't found another database that does this as well as Cassandra does. Maybe you could make an argument for something like Google Spanner, but when you also own the network that you're running your database on, there are some liberties you can take to make things run a little faster — so I don't know if I count that. And then, of course, it maintains high availability. That's really where I think Cassandra's great engineering around performance shows: keeping things up to date and running really, really well.

When you're using Quine with Cassandra, this is roughly how it looks: your data comes in through ingest, streaming into the graph, and the graph engine interacts with Cassandra directly. There are aspects of Quine we haven't touched on too much yet, like standing queries — and what was the other one up here? Oh, yeah, just standing queries, sure. "Standing query" is the piece of terminology for that combination: define the structure you want, and define the action you're going to take in response to that structure being found. That's all a standing query is — the combination of those two concepts. So that's what you're seeing in this graphic.

Come on, clicker — okay, there we go. And of course, a general overview of Cassandra: it works in a peer-to-peer architecture where every node can handle every single request. That's really what allows Cassandra to scale so well. It's contained like this in each individual — I'd say region, or data center — and then those can communicate with each other via the gossip protocol. Actually, with Cassandra 5.1, the gossip protocol will be replaced with — I'm trying to remember the name; I can picture the Cassandra Enhancement Proposal number — transactional metadata, there you go, thank you. I knew it was CEP-21.

Also important to mention: you can run Quine on top of Astra DB as well. So if you decide you don't want to manage Cassandra yourself, or keep people on staff who know how to do that, you can run it that way. And I think Quine also runs on — is it MapDB and RocksDB? We have RocksDB as a local option. RocksDB, yeah — but getting into production, you definitely want to benefit from that horizontal scalability. Actually, when I was working at Target, I got into an argument with somebody once about whether RocksDB is actually a database, because I don't think it is. When I think of databases, I think of a completed Lego set, and RocksDB is a box of Legos: take it, build it, put it together, manage your own file system access. To me, that's not a database — I mean, it does do that job, but... anyway, I digress.
So hey, we have a demo where we can show you this password spraying attack. I apologize for the slide — I copied this one from another presentation, and I was on a big Cyberpunk 2077 kick when I built it, so that's why it looks the way it does. Essentially, we're going to start Quine and Cassandra locally, and you'll be able to watch that. Then we're going to run this password-spraying ingest, which will simulate all of these different user interactions — user logons, user authentications — and the idea is that we'll throw an alert when we see one that deviates from the standard pattern of access we're used to seeing. So let me just plow through that.

While Aaron's working on this, I can emphasize: the reason this all works, from the perspective of those hard problems in stream processing, is careful use of the strengths of Quine and Cassandra together. In both cases we're dealing with very clever work-distribution models, and with a very holistic approach to data management, where events are first-class. That's something you don't see in a lot of traditional systems. That ability of Quine and Cassandra to treat your operational events — the individual pieces that end up building the big picture — as first-class citizens, rather than having to wait for that big picture to be built up, that's the secret sauce that makes all of this function.

So, effectively — hey, look at that, we got a hit. We have our standing queries here: one keeps track of normal activity, and the second looks for anything that deviates from that. As you can see, we have a count of one on that second standing query. So what I'm going to do is try to decipher where the URL starts here... I think it's just that — oh, you know what, I bet I don't want that trailing quote, so I'll backspace that. Okay. So if I paste that in, rid of that double quote, this should bring up the Quine graph with the problematic anomaly that we detected.
Oh, look at that. All right, excuse the physics engine here for a second while I take some of these and line them up a little better... There. So, as you can see, we have what looks like one, two, three, four bad attempts at logging in for this particular user, and then on the fifth one they finally made it in. Kind of like what Ethan was saying: this type of architecture layer allows you to pick up problems in flight, as they're happening. Otherwise you'd be running a report like this two weeks later and going, "oh, I think we had a malicious attack here, and we didn't realize it until now," whereas this lets you pick it up right away. And by the way, I'm super glad Ethan suggested running this all locally on my laptop — I originally wasn't going to, and we'd have been at the mercy of the conference Wi-Fi. That whole talk about latency earlier? That was very real.

All right, I can break out of that. I just wanted to mention, too, that this collaboration between DataStax and thatDot kind of originated as a workshop we ran together. I don't know how many of you follow our DataStax Developers YouTube channel, but about a year and a half ago we put together this workshop, and the Git repo is still out there — I have the link at the end. It's got step-by-step instructions for how you can recreate this process, and it talks you through what's happening: how to make it work, how to configure it, how to get either Cassandra or Astra DB running on the back end. Oh — it works off Quine 1.3.2; that might be a little old, so substitute version numbers appropriately.

So, real quick, I want to touch back on the generalized picture: where might you apply this? Where might this Quine-Cassandra-Astra combination be really helpful as part of your production data pipeline? If you ever deal with real-time data, or large bodies of event-oriented data; any time you're combining data from multiple sources; anything that's self-referential — that's where you should have something in the back of your mind going, "remember that Quine thing? This might be applicable." But also just as a lightweight utility: we have this thing called recipes, which is actually how this demo is implemented. You just give it a file that bundles up how you want to do those three steps — structuring, monitoring, and acting — and run it through. I love using this in pre-processing pipelines for AI work, for any kind of data engineering where I want to do some data manipulation that's a little more complicated than what I can encode in something like jq or xsv or standard SQL. It's also got that UI, which is good for data-set exploration, if you want to click around your data and see how it manifests in a very real way.

With that in mind, we've got some resources, and I think we've got time for two questions. These slides are also available on the scheduling app the conference is being run through. Any questions? Yes — I know you've had your hand up. So the question was: we saw some Akka-related logs — is that what Quine's built on?
Absolutely. Quine is built on an actor-oriented model — Akka, or Pekko, roughly equivalent — where, in fact, each node corresponds exactly to one actor. It's kind of an enhanced actor — a smart actor that knows how to shut itself down and persist itself to disk, and then rehydrate itself from disk where necessary — where "disk," again, means Cassandra, or a Cassandra-equivalent data store. But yes, Akka is absolutely some of the magic here.

Yes? — Yes: how do you actually construct the graph? Say you have logs, with different columns — the same client, the same actor, different attempts at the password. But if you have something like a regular IP network, you have routers, and I'm getting events from each, or some, of the routers. In that case the structural relationships between the routers are already known from the network, whereas here you're dynamically inferring them from the log files based on the columns. Is there any way I could adapt this for event correlation on a network where you already know the structural relationships between the nodes?

Yeah, absolutely. It's helpful to think of that initial topology as a static stream of events — kind of an initial stream. You can combine any number of data sources into Quine. So, starting with something like a topology, you can ingest each of those routers as a node, or as whatever representation you want — I would probably start with one node per router and make the edges correspond. The trick comes in with how you resolve those nodes, and there we'd recommend using a form of deterministic hashing to compute the node ID in advance, based on some unique identifier that's consistent across your different data sources — something like the MAC address, or the router ID, exactly. And you can do any kind of combination of this: if you don't have that identifier available in all contexts, then instead of representing each router as a single node, you might represent it as a little cluster of related nodes, and build that cluster up as your data streams in. Then you include that larger shape in the pattern you're looking for, and use any data from anywhere in that shape in your response.
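As a sketch of that resolution trick in Quine's Cypher dialect — idFrom() is the deterministic-ID function, $that is the incoming record, and the field and edge names are invented for the example:

```cypher
// Hypothetical topology ingest: one node per router, resolved by MAC address,
// so every source that mentions the same MAC lands on the same node.
MATCH (router), (neighbor)
WHERE id(router)   = idFrom('router', $that.mac)
  AND id(neighbor) = idFrom('router', $that.neighborMac)
SET router.mac = $that.mac
CREATE (router)-[:LINKED_TO]->(neighbor)
```

Since the node ID is a pure function of the identifier, the static topology stream and the live event streams converge on the same nodes without any explicit join.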
All right then — one more. Yes: probably a practical question, out of all the projects where you've used Quine. Say you have events with 400 properties, and your events are coming in at 3,000 per second. Is that something it can handle?

Absolutely — and the Cassandra backend, I'm sure... Yes, absolutely: Cassandra does give us a lot of that power. Quine, on top of Cassandra, is a relatively lightweight layer, and it scales very, very well — in a way very similar to Cassandra, in terms of using that peer-to-peer model between the different hosts. We have scaled Quine up to processing just short of a hundred million events per second — on a ridiculously large cluster, naturally. But on a single host — on a single MacBook like this one — you can pretty easily get around 9,000 to 10,000 events per second. The reason I'm talking in terms of events here is that there's one component doing the event decoding — the JSON deserializing, building the query to interact with the graph, that sort of thing — and there's also the component where each node is doing some computation. So, back of the envelope, about 9,000 events per second per host is a reasonable estimate for the capacity of a Quine cluster.

How big are those events? Great question — it doesn't matter much, is kind of the thing. If it's 500 properties, it's 500 properties. You start to run into a couple of issues if you have a single property that's particularly large, but that's down to choices in data modeling and can usually be worked around without any kind of trouble — especially if you just normalize your large payloads into an external store, in kind of the canonical way. Does that answer your question? Awesome, all right.

Yes — the bulk of the latency, in every case we've seen, is deserialization and serialization: the time it takes to write to and read from the network, or from a file. Exactly, yeah. And in practice, a structure like the one we looked at here — here it is — matching all parts of this has a round-trip latency of about 13 milliseconds, something at that level. Even if you're using SSDs? Yeah — and this was across a network.

Anybody else? All right, I think that's all we've got, huh? I think so. Thank you, everyone.