All right, welcome everybody. I'm Charles Herring, chief technology officer and co-founder at WitFoo. We're going to talk about building a global cyber grid: sharing data across clusters on Cassandra. I'm going to get off the stage for a second so I can see my screen, but up on the site, CharlesHerring.com, my personal blog, we have this talk and the other talks I'll reference, along with my social media links. If you'd like to connect, please do, and we can continue the conversation.

A little on the agenda. We're going to cover the scope of the project we're working on, how we distribute configuration across our clusters, the idea, or theory, of predestination of data, how we configure keyspaces, dealing with natural language processing and graph processing, and federating data ops.

A little about the scope of what we do in the WitFoo project. It's taking in tons of data from different organizations, different users and customers, storing it and processing it for a wide array of personas. Your standard security operations center folks are the ones knocking bad guys off the network, doing detection and response. There are auditors doing audits for compliance, regulatory, and legal issues, insurers, and law enforcement reporting crime, investigating crime, and dealing with national security. It's about bringing everybody together, utilizing data, sharing data, and sharing intelligence in a way that can drive down cybercrime.

We're talking about hundreds of organizations bringing in hundreds of gigabytes to ten or more terabytes of data each day, and the data may have to be stored for years. So you're talking about a lot of data stored and a lot of data in motion, and we want to coordinate some of that data, if not all of it, across multiple clusters. So, from an input/output perspective, in each cluster:
We're looking at different data sources coming in: syslog messages from servers and from different applications; networking data coming from the network infrastructure, which also includes flow logs from AWS VPC, so any record of metadata of conversations; agents on different endpoints shipping data in; and connections to hundreds of different APIs pulling data into each cluster. The clusters communicate in an upstream or downstream capacity: if a cluster wants to push data to another cluster, it can; if it wants to pull data from another cluster, it can. We'll talk about how we do that along the way.

Inside each cluster there are different components responsible for different parts of the processing. We have data ingestion going into a Kafka topic, and this is just the raw message, so we have transports of different types writing to a topic. The contents of that topic are consumed by a natural language processor that analyzes those raw messages to comprehend: What does that signal mean? Where did it come from? What product created it? How do we parse it, and what are we supposed to do with it? That result is put into a separate topic. That data is then consumed by a graph processor to understand the relationships inferred from the signals being received. Then, on a different topic, we analyze what we're learning from that graph and create units of work reporting different types of analytics. Those go into Kafka and are written as different streams into Cassandra. With 5.0 coming out, we're working on vectorizing that data, prepping for generative AI and other LLM use.

Another issue that we sort of have is that it deploys everywhere, right?
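Stepping back to the processing chain just described, it can be sketched as a simple sequence of stages. This is a hedged, in-process sketch: the stage names and data shapes are illustrative, not WitFoo's actual code, and in production each arrow between stages is a Kafka topic rather than a function call.

```python
def nlp_parse(raw_msg: str) -> dict:
    """Stand-in for the natural language processor: raw message -> artifact."""
    return {"artifact": {"raw": raw_msg, "fields": {}}}

def graph_update(artifact: dict) -> dict:
    """Stand-in for the graph processor: artifact -> inferred relationships."""
    return {"nodes": [], "edges": [], "source": artifact}

def analyze(graph_delta: dict) -> dict:
    """Stand-in for analytics: graph changes -> units of work / reports."""
    return {"unit_of_work": graph_delta}

# Ingestion -> NLP -> graph -> analytics; results stream into Cassandra.
PIPELINE = [nlp_parse, graph_update, analyze]

def process(raw_msg: str) -> dict:
    result = raw_msg
    for stage in PIPELINE:
        result = stage(result)
    return result  # final units of work are written to Cassandra as streams
```

The point of the shape is that every stage reads one topic and writes another, so each component can scale and fail independently.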
We have some deployments on Raspberry Pi and some that are multi-cloud, multi-data-center, so each node can deploy however it needs to be deployed, whether that's a piece of physical hardware someplace, inside a public cloud or private cloud, or across data centers. Each node can be configured on whatever fabric it needs to be deployed upon. So how do you handle that? How do you connect different data nodes, processing nodes, and Kafka nodes in a cluster that are potentially geographically dispersed across different network topologies?

To do that we use an agent, a daemon, that receives input from either a user manually spinning up a new node or an orchestration script, which tells us where it is: What data center are we in? What rack are we on? The license key, which identifies what customer, organization, user, and cluster this thing is a part of, and what role we should spin up. Is it just going to be a new data node, or is it going to run Kafka? What is its role going to be? The agent takes that data, goes to a consolidated service we call Library, the WitFoo Library, and says: give me the basic information I need to join this cluster. It comes back with the seed nodes it might need for Cassandra. We use some brokering for shared secrets so it can join the cluster, so it has the secret and it has a configuration, and then the agent pulls the Docker images it needs. If it's Cassandra, it pulls a configured Docker image of Cassandra and spins up that container, and likewise if it's Kafka or one of our NLP agents; whatever needs to be configured, it does. It launches that container, joins the cluster, and then ships metrics up to Library. On my website I have another talk called Metric-Driven DevOps that goes deeper into how all this coordination happens and how we ship metrics; feel free to grab it, you don't even need to give me an email address. And then all those issues, alerts, and metrics
boil up to Library, where they go to WitFoo support personnel, to the cluster, and to the end user and customer.

This idea of predestination of data is how I describe the overall philosophy for processing multi-cluster data for many types of personas. In data broadly, and in cybersecurity in particular, the industry has taken a philosophy of putting all of the data into data lakes with the idea that we'll figure out the questions later: we'll re-index the data, we'll figure out what we're going to do with it. The philosophy we leaned into instead is that when we receive a signal, we need to know everything that's going to happen in the life cycle of that piece of information. How will it be transformed into new information? Who will query it? How long will it live, giving it a time to live, and when does it die? Understanding everything that could possibly happen in the entire life cycle of a piece of information is critical to how we handle it, which means we're constantly interviewing the different personas and looking at the different end games of the evolution of that data.

In building out the schema, the tables in the keyspace and the keyspace configuration, we tend to start with SimpleStrategy replication at replication factor three, but when we have things like multiple data centers, we can move to NetworkTopologyStrategy as needed. The key part here is how we construct tables. The partition key, and I'm going to walk over here for a second, is the org ID plus a field we call partition. The org ID is a unique identifier for that organization, normally a UUID; or not normally, it is always a UUID, validated by Library. When you join the whole cyber grid, your user and customer data is tagged with this UUID. The partition is generally a TimeUUID, but it
doesn't have to be, and it's our way of limiting how much data goes into a partition. So how do we avoid hotspots? Anybody that's ever run into hotspots knows there should be a much more graphic term for what that is; it's a painful, horrible situation. Having this allows us to avoid collisions in how we're shipping data, and we're able to keep a given organization's data separate from a query perspective. In previous iterations we used a separate keyspace per org ID, but Cassandra has a limitation as you scale out the number of tables and keyspaces: the cluster degrades, it impacts performance negatively. By putting it all in a single keyspace, the overhead in the memtables and those pieces is much easier to deal with.

The clustering key is just a TimeUUID of when we inserted that row; that's all we need to know about it. Then all of the data goes into the object field, which is a JSON object. When we're talking about a raw record coming in that we're extracting fields from, that extraction creates a JSON object, and that JSON object is stored here. We're not doing any indexing inside of Cassandra; I'll talk about how we do queries and indexing in a second. So these two fields form our partition key, plus our clustering key, and combined,
this is our primary key. So when we're doing scans, if we for instance want to do a time-based query, everything that happened in the last hour, we can do a table scan using the TimeUUID on the partition as the criterion. What we more commonly do, which I'll talk about in a second, is a full fetch of the partition, which is a very kind query to Cassandra. We also use compression, which blew my mind, because I'm an old guy and compression used to mean making everything slow. But it turns out that because our bottleneck is on I/O, disk I/O and network I/O, shrinking the data down with compression has a huge positive impact on performance. And then TTL: telling the data when it's going to die. Always set a default (there is a default whether you set it or not, of course), but we manage it on the inserts.

When we're inserting data, there are really three principles that we follow, and they are: no hotspots, no hotspots, no hotspots. We count how many rows we're putting into a partition. Whatever the inserting code is, it's counting: that's one, that's two, that's three, that's four, and when it hits the max level, it no longer uses that partition key; it's gone. While we're creating the inserts, we're also going to index what is in that given partition. In the case of artifacts, we're tracking things like client IP and username, so it would be a JSON array of every client IP observed in that partition, and we save that in the partition summary. So we have one row per partition in the partitions table, alongside the artifacts table for the different types of objects. And we also set the TTL. One of the greatest things in Cassandra is the free delete: putting a death date on the row. When is this row going to expire? When is it going to be naturally garbage collected
out of the system? One reason we do it this way, putting the indexes inside a JSON structure instead of using a native Cassandra index, is that we're looking at 100, 200, 500 different indexable fields, and when we start doing that in Cassandra, we run into non-linear scale: each node gets more expensive if we go that way. By not adding the indexes, our scope is much smaller. I am excited to see Storage-Attached Indexing coming out in 5.0; we'll be at that talk later today. But that's currently how we do it.

So when we're fetching, say we want specific records: for instance, give me all of the rows that have username John Doe in the last day, or ten days, whatever. We start by fetching the partition summaries: give me all the partition summaries in the time range, and check whether those summaries contain the criterion. Is John Doe in it? If it is, that partition is a candidate for doing a full pull of the partition. Once we have that list, we go partition by partition, fetch the full partition, and then do a reduce to the rows that met the criterion. It sounds like a lot to do compared to a SELECT-all-WHERE, but it's very effective when you're talking about multi-petabyte clusters of data across hundreds of indexes. It's a linear search that's predictable in the amount of time it takes, based on the time range, and it's also very friendly to Cassandra: you're not stressing out Merkle trees and all of that.

I'm watching my time, so real quick on receiving messages: we're using an approach from natural language processing that has a database of semantic frames. Where does this message come from? How do you extract it? What is this IP address? What is this timestamp?
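Before moving on: the write path and read path just described can be sketched end to end. This is a hedged, in-memory stand-in, with Python dicts playing the role of the artifacts and partitions tables; the row cap, the summarized field names, and the data shapes are illustrative, not WitFoo's actual values, and real code would issue the inserts and fetches against Cassandra with a per-cluster TTL.

```python
import uuid

MAX_ROWS_PER_PARTITION = 4  # illustrative; a real cap would be far larger

class PartitionStore:
    """Counts rows per partition, rotates the partition key at the cap
    (no hotspots), and keeps one summary row per partition recording the
    field values observed in it."""

    def __init__(self, org_id):
        self.org_id = org_id
        self.partitions = {}  # partition key -> list of row dicts
        self.summaries = []   # one summary per partition
        self._rotate()

    def _rotate(self):
        # Fresh TimeUUID partition key; the old key is never reused.
        self.current = uuid.uuid1()
        self.partitions[self.current] = []
        self.summaries.append({"partition": self.current,
                               "client_ip": set(), "username": set()})

    def insert(self, artifact: dict):
        # That's one, that's two, ... at the max, the partition key is gone.
        if len(self.partitions[self.current]) >= MAX_ROWS_PER_PARTITION:
            self._rotate()
        summary = self.summaries[-1]
        for field in ("client_ip", "username"):
            if field in artifact:
                summary[field].add(artifact[field])
        # Real code: INSERT INTO artifacts ... USING TTL <per-cluster setting>
        self.partitions[self.current].append(artifact)

    def find(self, field, value):
        """Read path: scan summaries in range, full-fetch only candidate
        partitions, then reduce to the rows that match."""
        candidates = [s["partition"] for s in self.summaries
                      if value in s[field]]
        hits = []
        for key in candidates:
            hits.extend(r for r in self.partitions[key]
                        if r.get(field) == value)
        return hits

store = PartitionStore(org_id=uuid.uuid4())
for i in range(10):
    store.insert({"username": "jdoe" if i % 2 else "asmith",
                  "client_ip": f"10.0.0.{i}"})
jdoe_rows = store.find("username", "jdoe")
```

The `find` call only ever does full-partition fetches, which is the "kind query" shape the talk describes; the time-range filter on summaries would simply bound which summary rows are scanned.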
Everything there is to know about the message is loaded into memory on the processor, and then it outputs the artifact in JSON format. If it doesn't know the message, it goes up and we start doing our research: we have machine learning, we reach out to the vendor, we figure out what the thing is, and we update the frames. The reason we do it that way is that I'd love to use an LLM here, but it would be bonkers computationally expensive at message rates of a million a second; you couldn't do that many inferences currently.

Once we have the artifacts, we build a graph out of them. This is just pseudo-output of some extracted fields: this is a DHCP lease renewal, this machine exists, it has this IP address at this time, and that's its MAC address. We get another artifact that tells us user one is logged into that machine and file Z is present on it; you get some more nodes, some more relationships. Another one tells me that the IP address belonging to client A talked to a thing called server B. We're tracking all of these relationships; the artifacts are building a contextual graph of what they're describing, and we're also understanding the intent. Why did it tell me this? It was telling me this because there's a malware exploit, or whatever.

Now that we have a graph sitting there, we persist it to Cassandra. We're able to analyze the graph, in this case look for a multi-stage data theft attack, a crime that occurred, and from this we can generate metrics: how many of these have occurred, what types of tools were used. So you're doing graph-level, unit-of-work-level analysis. This is also standard JSON structure, and rendering it is pretty easy; there's a great MIT-licensed library called Cytoscape.js that we use for that. And so now we're just down to moving JSON around to make this magic work, right?
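The artifact-to-graph step above can be sketched with the same DHCP/logon/netflow examples from the talk. This is a minimal illustration, the artifact schema and relation names here are assumptions, not WitFoo's actual vocabulary:

```python
class ContextGraph:
    """Accumulates nodes and (src, relation, dst) edges from artifacts."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_artifact(self, artifact: dict):
        kind = artifact["type"]  # hypothetical field set per artifact type
        if kind == "dhcp_lease":
            # machine exists, has this IP at this time, that's its MAC
            self.nodes.update([artifact["host"], artifact["ip"], artifact["mac"]])
            self.edges.add((artifact["host"], "has_ip", artifact["ip"]))
            self.edges.add((artifact["host"], "has_mac", artifact["mac"]))
        elif kind == "logon":
            self.nodes.update([artifact["user"], artifact["host"]])
            self.edges.add((artifact["user"], "logged_into", artifact["host"]))
        elif kind == "netflow":
            self.nodes.update([artifact["client"], artifact["server"]])
            self.edges.add((artifact["client"], "talked_to", artifact["server"]))

g = ContextGraph()
g.add_artifact({"type": "dhcp_lease", "host": "machine1",
                "ip": "10.0.0.5", "mac": "aa:bb:cc:dd:ee:ff"})
g.add_artifact({"type": "logon", "user": "user1", "host": "machine1"})
g.add_artifact({"type": "netflow", "client": "10.0.0.5", "server": "server-b"})
```

Because every artifact only adds nodes and edges, independent signals (a lease, a logon, a flow) converge into one connected picture that analytics can walk for multi-stage patterns.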
JSON is great: it compresses well, it has great entropy, and you can't do REST without JSON, so the way we already share information across the internet is in this JSON structure. We can nest data, so the artifact can generate a node in the graph that's a child of it, which can be a child of the incident, which can be a child of the campaign; you can go from the system level down to the atomic level inside a JSON structure without limitation. And we're not having to do an ALTER KEYSPACE or ALTER TABLE when we need to change or augment a JSON structure.

There was another talk I gave in Grand Rapids earlier in the year on how all this works if you're looking at the operational side. But we're talking about sharing JSON objects of incidents, threat intelligence, publication of bulletins. You can publish a job in JSON format to search all your data, get the results, and ship those back up to a different cluster that's subscribing, right? So we're just doing GET JSON and PUT or POST JSON, and that can be really any type of information. Then we just insert it with that same partition key. The org ID is unique, and the partition is unique from the other cluster, so when we're doing an insert into a different keyspace on a different cluster, there's no collision. You're able to do the same type of querying that way, and then you're left with the ability to have this federated system, where you can have, say, universities working inside a conference able to share data and operations with more experienced personnel up here, maybe shipping up to law enforcement when crime is being reported. It's a pretty simple way of shipping data, once we think through how we create the data and how we collect it and evolve it: save it as JSON, ship it as JSON.
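The federation step above, push or pull between clusters with no key collisions, can be sketched as a simple export/import pair. The payload shape and function names are illustrative assumptions; the collision-free property comes directly from the talk's point that both the org ID and the partition key are UUIDs, so a record carried to another cluster keeps a globally unique primary key.

```python
import json
import uuid

def export_rows(org_id, partition, rows):
    """Serialize rows for shipping to an upstream/downstream cluster
    (real transport would be a GET/PUT/POST of this JSON)."""
    return json.dumps({
        "org_id": str(org_id),        # unique per organization
        "partition": str(partition),  # unique per source partition
        "rows": rows,
    })

def import_rows(payload: str):
    """On the receiving cluster: insert under the SAME (org_id, partition)
    key; UUID uniqueness means no collision with local data."""
    doc = json.loads(payload)
    return [(doc["org_id"], doc["partition"], row) for row in doc["rows"]]

payload = export_rows(uuid.uuid4(), uuid.uuid1(),
                      [{"incident": "data theft", "stage": 3}])
inserted = import_rows(payload)
```

The same pattern carries any of the shared object types (incidents, bulletins, search jobs and their results), since everything is already JSON.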
So I'll take any questions. If you want to find my social media, this talk, or any other talks, that's me, Charles Herring. There's also the WitFoo blog, and there's educational stuff as well, so if you want licensing to play around with the software, we have that for free. And then there's my email. I think I'm almost out of time; I have one minute. Does anyone have a question? You may applaud. Oh wait, there's a question. Yes?

[Question about the TTL on inserts.] It's a configurable setting. In a cluster, some organizations or users may need to keep the raw record for a year, or five years, and that dictates what the TTL is on the insert of the partition or the artifacts. So it's not a universal TTL; it's configured per cluster. The same goes for the units of work, the incidents: how long do we want to keep those as they're being inserted? They're also immutable. So it is configurable, and it will change based on whatever the requirements of the given cluster are.

[Question about the compaction strategy.] That's a good question. It's whatever the default compaction strategy is, because with the TTLs it's very simple compaction; there's not a whole lot it has to work around. That's the nice thing. The way I've always put it with the team is: whatever Cassandra likes, we're going to do it that way; we're going to make it as easy on her, I don't know if her pronouns are her, as we can. But thank you very much for the time; I look forward to chatting with you later.