And it's recording. Hello, everyone. Welcome to our talk on Spectrum, an end-to-end framework for ML-based threat monitoring and detection. I'm Vincent Pham, and this is my co-author, Nahid Farhadi. We're both software developers at Capital One. We're going to talk about a platform that targets insider threat.

For those of you who are unfamiliar with what insider threat is, we have the formal definition here: insider threat is the potential for an individual who has or had authorized access to an organization's assets to use that access, either maliciously or unintentionally, to act in a way that could negatively affect the organization. The "unintentionally" is important, because we're not looking only at malicious or suspicious people, but also at people who have naively or unknowingly leaked data from the company, since they are a potential threat as well. I think the comic on the right side pretty much clarifies who insider threats are. If you think about who the possible threats in the world are, it's pretty much the entire population; the insider threats are the subset who work within a company. Even if someone carries zero or very minimal risk, there's still a potential threat there.

To give you a few insider threat statistics: 90% of organizations feel vulnerable to insider attacks. 53% of companies had an insider attack against their organization in the last 12 months. 37% of employees have excessive access privileges; this matters, because think about your company's most important data, who has access to it, and what would happen if it leaked. 94% of companies monitor their employees' digital footprint, and these are the data sources we can use to track where data is going and who has access to what. 86% of companies have, or are building, an insider threat program to date. And another important figure: the average cost of insider threats is currently estimated at about $1.6 million.

So we mentioned that 86% of companies are building, or thinking about building, an insider threat system. You might be wondering: should you buy a platform that already has a solution developed, or should you build your own? We list some pros of each option, buy versus build. If you buy a platform, a solution is already developed, so it could be as simple as plug and play, although in most cases it will take a couple of months, or maybe even a couple of years, just to integrate that platform into your systems. Support is also provided, so you don't have to build up your own team; you can pay for support to install the system and deliver the insider threat analysis. Lastly, the data management and handling solution is provided by the vendor as well, so once again you don't have to build your own team.

The pros of building your own system: technology is always changing, so if your company is changing its technology, changing its databases, or just changing the way everything is handled, say moving from local servers to the cloud, you can quickly adapt to that need. And if your system is constantly being audited, you have full knowledge of the system, so you don't have to wait for a vendor to hand information back, or worry that they may not release some information at all.
When you're building it yourself, you already have that information. And once again, with ever-changing technology your data may change as well, and you can adapt your own platform to that changing data very quickly.

So, to combat insider threat, we built a system called Spectrum, which operates on two types of entities and the relationships between them. The entities are subjects and objects. Subjects are the entities that take actions; in the simplest insider threat setting, those are the employees, but you could also consider IAM roles in AWS, which are like groups of profiles gathered together, and if you're looking at external rather than insider threats, your customers could be the subjects. Objects are the entities on which actions are taken, and those actions are recorded as event trails. Think of every time someone logs into a VPN and at what time, badges into a building, or opens their laptop: events like these are recordable and provide evidence of whether an insider threat is occurring or not.

The proposed framework for Spectrum works like this, going from the very top. We define a set of entities; in the insider threat case, once again, those would be the employees. We identify the different data sources linked to the entities; in this case, say, three different event sources. From each individual event source, we extract features for that source. For example, say you're using your email data set: that data set will have features such as where an email is being sent to, where it's being sent from, the time of day it was sent, the subject, and so on. From the features, we can construct patterns. For example: has the email been sent within the past 24 hours? Does the content of the email express positive or negative emotion? And so on. We can construct rules from these, and we can also do anomaly detection on the features themselves, to catch anything suspiciously far from the baseline. We can also build machine learning models, such as classifiers, to identify whether a particular event or set of features should be flagged as suspicious or not.

Once we construct all of these rules, anomaly detectors, and classifiers, we tie them back through their dependencies to the extractors, which tie back to the event sources; that's how we build the entire pipeline from the bottom up. When we run the pipeline, it reconstructs this entire scheme, tying everything back to the data sources, so that if you turn off one pattern, it might turn off a particular source altogether, just to save some cost in the meantime. The results are then used to enrich the entity, and the entity is stored along with the enriched data: whether it is suspicious or not, or a classifier's zero-to-one probability of suspiciousness. This information is then used to flag a particular entity, maybe as an alert through Slack, or to open a case in case management for analysts to look into; it can even be displayed in a UI where analysts can go through a ranked ordering of suspicious entities and events and see whether each one is really a true positive or a false positive in terms of insider threat. So, in a nutshell, this is what our system, Spectrum, looks like.
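The talk doesn't show code for this, but as a rough illustration, here is a minimal Python sketch of how these concepts might be wired together; all of the class and feature names are hypothetical, not Spectrum's actual API. The point it demonstrates is the bottom-up wiring described above: detectors declare dependencies on features, and features on event sources, so disabling a detector can automatically disable the sources that only it needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSource:
    name: str                         # e.g. "email", "proxy", "print"


@dataclass(frozen=True)
class Feature:
    name: str                         # e.g. "emails_sent_24h"
    source: EventSource               # the event source it is extracted from


@dataclass
class Detector:
    name: str                         # a rule, baseliner, or classifier
    features: list                    # features this detector depends on
    enabled: bool = True


def required_sources(detectors):
    """Walk the dependency graph bottom-up: query only the event
    sources that at least one enabled detector still needs."""
    return {
        f.source.name
        for d in detectors if d.enabled
        for f in d.features
    }


email = EventSource("email")
proxy = EventSource("proxy")

sent_24h = Feature("emails_sent_24h", email)
upload_bytes = Feature("proxy_upload_bytes", proxy)

rule = Detector("external_email_burst", [sent_24h])
baseliner = Detector("upload_anomaly", [upload_bytes])

rule.enabled = False                  # turning off a pattern...
print(required_sources([rule, baseliner]))   # ...drops email: {'proxy'}
```

In the real pipeline the same idea would extend one more level, with extractors sitting between the sources and the features, but the bottom-up dependency walk is the same.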
All right, so this proposed framework has several techniques applied to it in order to effectively save money on the cost of data handling and to make sure we don't lose any data, using modes of operation and also dependency orchestration. I'm going to go through some of the techniques we used to build this whole framework.

The first is that we have unified the source of the data. When we look at different sources of information, like the digital footprint, it can be any digital interaction the employee has with the outside world: proxy monitoring, mail servers, or the different sensors we have on endpoints, user machines, or cloud accounts that monitor all of this traffic and communication with the outside world. We can also have different event monitoring tools like Windows event logs, Sysmon, Snort, et cetera. Each of these sources provides a specific type of information for us. The important thing is that when we run the job to detect whether there have been any insider threat actions over the past 24 hours, it is important to only query the past 24 hours of that specific data.

Now, at development time, your developers are going to build different features, and each of them will query against all of these data sources, which is very costly. Our proposed solution for cost efficiency in the Spectrum development phase is that we take all of these data sources, process and sessionize them, and extract only the information we really need from each data source into one specific table. We then have a Matillion job running on a schedule, say five or six times a day, or however often your customer requests, to extract that information, process it, and keep it in that table for us. The size of this table is an order of magnitude smaller than the data sources you originally had. Plus, your developers no longer each go off and query those huge tables themselves: all of this information is unified in one table, and every time you want to develop, test, or add a new feature, you query the smaller table. This actually saved us 90% of the querying cost. We use Snowflake as the source where we keep the original data, and we use Matillion to schedule the jobs that extract and process that information from Snowflake. Here is an example of how we keep the data.

Unifying the data brings a second benefit as well. The first one was, okay, we got some cost savings. The second benefit is that this data frame is like a single object moving along the pipeline, with different actions being executed on it. The important thing is that we don't complicate our design by separating different data sources into different tables, each with its own schema. For example, an email event has a data type, certainly a hash or unique ID, a related entity ID (it could be an employee, an IAM role, or whatever entity is performing the action), and a timestamp; but in its Meta field, you have the to and from addresses, the number of attachments, et cetera. The equivalent Meta information for proxy data is very different.
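As a concrete illustration of that layout, here is a small Python sketch of what rows in such a unified event table might look like; the column and field names are hypothetical stand-ins for the ones on the slide, not the production schema.

```python
import json

# Hypothetical layout of the unified, sessionized event table: every
# event type shares the same four columns, and the type-specific
# fields are packed into a single JSON column named "meta".
rows = [
    {
        "event_id": "a1b2c3",                    # hash / unique ID
        "entity_id": "emp-001",                  # employee, IAM role, ...
        "event_type": "email",
        "timestamp": "2021-06-01T09:14:00Z",
        "meta": json.dumps({
            "from": "emp-001@corp.example",
            "to": ["outside@example.com"],
            "num_attachments": 2,
        }),
    },
    {
        "event_id": "d4e5f6",
        "entity_id": "emp-001",
        "event_type": "proxy",
        "timestamp": "2021-06-01T09:20:00Z",
        "meta": json.dumps({
            "url": "https://filehost.example/upload",
            "connection_type": "https",
            "blocked": False,
        }),
    },
]

# An extractor only parses the meta of the event type it understands.
def num_attachments(row):
    assert row["event_type"] == "email"
    return json.loads(row["meta"]).get("num_attachments", 0)

print(num_attachments(rows[0]))                  # -> 2
```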
Of course, you have the same first four columns, but the Meta field for proxy probably contains the URL that was accessed, the type of connection (is it HTTP, or what kind of connection is it?), whether it was blocked by the company or not, and other information of that sort. So imagine if we had a different data frame for each of these data types: you can have up to 20 data sources, and having a different table for each of them makes your design very complicated and very inflexible to change. With the proposed method, we instead keep the information related to each data type in JSON format in just one column, named Meta. This way, our data frame can carry different types of information, but all of it is in the same format and in one data frame, and that data frame flows through the pipeline with each module responsible for performing its specific actions on it. That's why we have extractors: the extractors pull features and other values out of the Meta field.

Another piece of design thinking behind this proposed framework is modes of operation. As in any other machine learning system, we have a normal job that runs the trained models against your current data and extracts results; in the POC phase we called it the test mode. We call this mode modeling, and we run it six times a day, or more, per the user's request.

We also have a training mode. In that mode, we only train the models. We have several baseliners and several classifiers; the baseliners and anomaly detectors are for user and entity behavior monitoring, so we compare employees against their own behavior, against their peers' behavior, or with role-based anomaly detection. In training, we obviously extract more data, over a longer period, and that's exactly where the Matillion job and the unified table help us. In modeling, we only look at the past 24 hours of data; in training, we look at, for example, 90 days of data or an even longer period. Now that we have these two different modes and have separated their duties from each other, we make sure not to run really long queries for training.

We have performance monitoring. In this mode, we monitor the thresholds and metrics of our machine learning models, the classifiers and anomaly detectors, and make sure they are behaving normally and that there are no problems with the data sources or the baseliners. If we generate alerts off of these, we go back to training and fine-tune our models.

We also have a backfill mode, for something that usually happens in production: sometimes your sensors don't work, sometimes a data source isn't available, or there are problems with the jobs that fill those tables. In backfill mode, we run the same modeling, but this time over a longer period, to make sure we haven't missed any events due to problems in production or in data delivery.
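To make the modes concrete, here is a tiny Python sketch of how a mode might translate into a query window over the unified table. The mode names follow the talk, but the backfill duration and the function itself are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Illustrative lookback per mode of operation. Modeling's 24 hours and
# training's 90 days come from the talk; the backfill value is assumed.
LOOKBACK = {
    "modeling": timedelta(hours=24),   # scheduled ~6x a day
    "training": timedelta(days=90),    # refit baseliners / classifiers
    "backfill": timedelta(days=7),     # re-score around sensor outages
}

def query_window(mode, now=None):
    """Return the (start, end) interval this run should query from the
    unified event table."""
    end = now or datetime.now(timezone.utc)
    return end - LOOKBACK[mode], end

start, end = query_window("modeling")
print(f"score events from {start:%Y-%m-%d %H:%M} to {end:%Y-%m-%d %H:%M}")
```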
Another important part of this modeling job is the dependency orchestrator. As Vincent mentioned before, we have different stages of operation: first we extract things from our data sources, and then, once the data is extracted, we have three main techniques for determining whether an event is malicious or not. We have rule-based techniques, which we call pattern matchers; we have baseliners, which are anomaly-detection-style jobs; and we have the machine learning components.

Now, these can depend on one another. For example, you might want to run a baseliner on a feature that has been extracted, or on a specific data type, say on print operations. That doesn't necessarily need to wait on extracting information from email, right? It's just a baseliner on print operations, so there is no dependency between email events and print events. In the dependency orchestrator, we come up with a plan for the most optimized way to run these modules based on their dependencies. You may have machine learning components that depend on several baseliners or several pattern matchers; you may run baseliners that depend on two features; and finally, you may only flag something as malicious if a specific feature was extracted and it violates an anomaly baseline. So there are all these dependencies.

Here is an example of the module dependencies. We read the data from our original database, and imagine that each of these colors is an event type: say this one is email, this one is proxy, and this one is print. We extract information from email; we extract information from print, for example the number of pages and which printer was used; and we extract information from proxy. Then we may want to run a baseliner on the email events to see, for example, what the normal number of emails someone sends per day is, or we may want to look at proxy upload and download volumes and use those in a baseliner. Later on come what we call detectors. A detector might say: I think this person is malicious if they printed something and if they showed some anomalous proxy behavior. So that detector depends both on the proxy baseliner and on the extractor of the other event type, print. Basically, the job of the dependency orchestrator is to come up with a plan: figure out the dependencies of each module and order them so that every module's dependencies have been met by the time it is executed. Finally, we have a store operation, where we store all of those results in a Postgres database, or any other type of database, and we end up with a list of threats detected for each entity.
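The orchestration itself boils down to a topological ordering of the module graph. Here is a minimal sketch using Python's standard-library graphlib on the hypothetical modules from the example above; this illustrates the idea, not Spectrum's actual orchestrator.

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# Hypothetical module graph mirroring the example: extractors read raw
# event types, baseliners read extractor output, and a detector
# combines a baseliner with another extractor.
deps = {
    "extract_email":  set(),
    "extract_print":  set(),
    "extract_proxy":  set(),
    "baseline_email": {"extract_email"},              # emails per day
    "baseline_proxy": {"extract_proxy"},              # upload/download
    "detect_exfil":   {"baseline_proxy", "extract_print"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    batch = list(ts.get_ready())   # every module whose deps are all met
    print("run in parallel:", batch)
    ts.done(*batch)                # mark finished, unlocking dependents
```

Modules in the same batch have no dependencies on each other, which is exactly why the print baseliner never has to wait on email extraction.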
Another capability this implementation of the proposed framework provides is scalability. First of all, for the processing engine we use Spark on AWS EMR, composed of fleet instances, with some spot instances and some on-demand instances on EC2. As the number of jobs, the amount of data, the number of events we are monitoring, or the period we are monitoring increases, it can fully scale out and complete the job successfully.

Scalability from another point of view is that we have added RBAC-restricted views over our events. For example, if you have different customers with different levels of access to the entities or events you are monitoring, you can use RBAC, role-based access control, to make sure that, say, the head of insider threat monitoring has access to all of the threats and can see all the activities of the company's executives and leadership, while a junior security analyst doesn't have as much capability. So in our detectors we have a field where we set a level of access for each of those threats, and it can be based on role, on the threat itself, or on the activity that was executed.

This also gives us the capability to serve multiple customers. The same framework, looking at the exact same data, can be used for insider threat detection, but for other customers we can check whether there was any network intrusion, which is basically the CSOC's interest; whether there is any website abuse; or, in the case of financial companies, whether there is any insider trading or fraud committed by employees. This doesn't have to be limited to specific entities or to any particular relationship between them: as long as the application involves monitoring and anomaly detection, we can simply change the source of the data, modify the dependencies, and extract the results.

Thank you so much for your time and attention. Here are our emails; we are happy to answer any questions.