Okay, with that, I'm very glad to introduce Celeste from Lawrence Livermore National Laboratory. She has had a long career with Lawrence Livermore, which is a national security lab in the East Bay, California. She is a computer scientist with a 30-year career at the national lab, and her research primarily applies data science and analytics technology to different global security issues. About 10 or 15 years ago, she started applying data science technology to cybersecurity, and she is a PI, a principal investigator, and a researcher in cybersecurity situational awareness. She also speaks frequently at different workshops and technical conferences, especially to help promote diversity and inclusion for women in data science and women in cybersecurity. One last thing I have to mention: she is an avid backpacker and bicyclist, and she did a transcontinental bike journey from San Francisco to Virginia Beach back in 2005. So with that, let's welcome Celeste to give the talk about cybersecurity research from a data scientist's perspective. Celeste, I'm going to hand it over to you. Hi, sharing my screen. Welcome everybody. It's a pleasure to be here. Are we showing the screen? Are we good? Okay, good. I'm very glad to be here today. Lawrence Livermore is about 50 miles to the east of Stanford, and this is our campus here. We are a national security laboratory, and I've had a very fun and exciting career there. I'm going to share some of my insights here, but I'm going to start with a bunch of taxonomy, so you have the context for where I'm coming from today. First, I know the term data science is a little overloaded, so I wanted to make sure you knew my perspective on it.
And here, data science is really getting scientific insight from data. It's the discipline of bringing together multiple approaches, different types of scientific methods and algorithms, to extract knowledge from data. So it's really the whole journey: the data, the data analytics, and the algorithms and methods as well. Most research includes data analytics, and I just want to make sure that when we talk about data science, we're talking about the whole process of using those data methodologies. The approach I've been advocating is really data-driven computational analysis for cybersecurity. From the research side, I want to give you the perspective of what I'm thinking about when I talk about cybersecurity research. I want to make sure you understand that cybersecurity operations and cybersecurity software development activities are essential to moving this forward; research, in my opinion, complements everything we learn from the operations and development side of the house. What I think about from a research point of view is how to build a more generalized solution: solutions that are not only about the existing current technology, but three to five years out. So we're trying to anticipate and understand the implications of our activities in the three-to-five-year term, whereas many of the development and cybersecurity operations people are addressing current needs. That's one of the differences in perspective: research is trying to think three to five years out. At least, that's what I mean by research here. I don't mean it's not applied, and I don't mean it's not practical. I just mean we're trying to build a foundational understanding of the principles behind why things work the way they do. And for me, cybersecurity is like physics in the 1900s.
We're really at the foundational point of trying to understand how things work in a fundamental way, and research is trying to make sure we don't lose the forest for the trees: that we also look at how things work, not only at how to get things done. The two are complementary. In a lot of ways I think of research as the more strategic view, and operations and development activities as more tactical; both are very essential and complementary. So with that, the rest of my talk is mostly a survey and a discussion of research perspectives. I'm really hoping to give some examples of the research thinking around things that maybe you've heard about, or that are practical concerns for you, and to explore the research side of them. From a data science point of view, cybersecurity is a fascinating place. One, there's lots of data, just massive amounts of data. And that's not necessarily always a good thing: there's so much data, but maybe not a very strong signal. Think about the amount of data that transits a network. At Livermore, we have petabytes of data that come across the enterprise boundary, and a very small fraction of that is malicious in any way. So there's a huge amount of data, and we have to find those few malicious things buried in all the rest. So there are lots of challenges. The other interesting thing about this topic is that it has a temporal domain; it's dynamic. You may find a solution to one threat, but the adversary moves very quickly, and so there's a new threat. There are always opportunities to explore new challenges. Just because you get it fixed once doesn't mean it stays fixed, and just because you found a solution for one thing doesn't mean it applies to everything. So: good job security for cybersecurity. That's a good thing, and a lot of fun, and always very interesting. There's never a dull moment.
The other thing: I think we're all more aware now, especially as we're home teleworking, of the importance of cybersecurity, because we are so connected to our networks. And the adversary is getting very sophisticated. Not only are they more sophisticated in skill level, but there has been an advancement of tools for the adversary: the criminals actually have a whole set of sophisticated software-engineering tools that make it easier for them to cause malfeasance. So there are always new threats coming along, and the rate of change is very high. Just think about these past few months: prior to March, the number of people using Zoom was probably a small percentage, and after March, all of a sudden, there's this massive change. Behaviors change rapidly to address new needs and new interests. And then there's the number of applications out there; just think about the number of apps in the app store at any one time. If you had to do a full software-assurance process to approve every single app, at the rate at which new apps are coming out, it becomes a pretty daunting task. And finally, in the old days, when we talked about defense in depth, there were very clear boundaries. The castle analogy: you have your moat and you have your perimeter. That's not so clear anymore. These networks are blurring when I can sit in my office with my phone and have multiple different approaches to getting to different networks. I'm bridging corporate network environments, OT environments, and mobile environments. So there really is no clear perimeter or boundary, and the old assumption, that everything outside my enterprise or my operational network is outside and everything else is inside, is no longer a valid way of protecting.
So I think we have to accept that everything is one big network now. We're at home using our corporate computers and our home computers; we have all our smart devices and all our OT and IT devices. These networks have merged, and that creates all these challenges, because it requires thinking differently about how to solve the problem. This kind of thinking, bringing data science in, is becoming very popular in other disciplines, and I want to pursue it and explore some of the topics as they apply to cyber. Any questions? If you have questions, please click the button in the middle to raise your hand, and then I'll unmute you; we're trying to make this interactive, so you're welcome to speak over the phone. Okay, I'm going to go through some more examples so we have a common context and can talk about some of these ideas together. In research there's a lot of discussion about behavioral analytics. Previously in cybersecurity, people talked about signatures. Signatures are the things which are easily identifiable, for which you can make a rule. For example, maybe you know that a particular computer address, an IP address, is where bad things come from. You can make a signature that says: I won't accept any traffic from that IP address. Those signatures work very well, and they're very effective when you know those kinds of well-defined things. But they have two shortfalls. One, the adversary is clever enough to change: if they get blocked on one file name or IP, they move to another. If they called it malware-one, they change the name to malware-two, and the rule that said never let in malware-one will no longer work for malware-two. The other is the rate of change: it's not always easy to find those kinds of simple rules.
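To make the signature idea concrete, here is a minimal sketch in Python. The IP address and file names are invented for illustration; they are not from the talk.

```python
# A signature is a fixed rule: a set-membership test on well-defined
# indicators. These indicator values are invented for illustration.
BLOCKED_IPS = {"203.0.113.7"}        # known-bad source address
BLOCKED_NAMES = {"malware_one.exe"}  # known-bad file name

def allowed(src_ip, filename):
    """Accept traffic only if neither indicator matches a signature."""
    return src_ip not in BLOCKED_IPS and filename not in BLOCKED_NAMES

# The rule is effective exactly as long as the indicators stay fixed:
print(allowed("203.0.113.7", "report.pdf"))        # → False (blocked by IP)
print(allowed("198.51.100.9", "malware_two.exe"))  # → True (renamed payload)
```

Renaming the payload or moving to a new source address defeats the rule entirely, which is exactly the first shortfall described above.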
So people are augmenting, and I want to make it clear, augmenting, the things we know how to look for by rule with behavioral analytics. I've walked around many cyber conferences and asked the vendors what behavioral analytics is, and I get many different answers. So let me say what I mean: with behavioral analytics, you're trying to understand what's going on and how the systems are operating. An example could be, as I illustrate here, that we have a historical view of our network. We've been collecting and monitoring over our network, and we know that computer A regularly talks to computer B, the finance department and the procurement department, with a likelihood of interacting of about 90%, and that computers A and C, over our historical view of the world, interact about 75% of the time. But computer A and computer D, the finance department and the supercomputers, don't really interact at all, less than 5%. So if we monitor this and see all of a sudden that the finance department's computer has a likelihood of communicating with the supercomputers that goes up to 90%, we can say that's anomalous, or that the behavior, the communication between those kinds of systems, has changed. Now, what we can't say: just because it changed doesn't necessarily mean it's bad. There are many reasons why; computer A may have been reallocated to a researcher whose role is to communicate with the supercomputers. So it takes a little more investigating. But the idea here is that the behavior is the connectivity and the likelihood of connectivity, and there are other kinds of behaviors, like which URLs you go to, or modeling in terms of which applications and processes are used.
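That baseline-and-compare idea can be sketched minimally like this. The link likelihoods are the illustrative numbers from the example above, and the window counts are invented.

```python
from collections import Counter

# Hypothetical historical link likelihoods learned from monitoring:
# the fraction of observation windows in which each pair communicated.
baseline = {("A", "B"): 0.90, ("A", "C"): 0.75, ("A", "D"): 0.05}

def flag_changes(observed, windows, threshold=0.5):
    """Flag links whose observed rate deviates from the baseline by more
    than the threshold. A flag means "changed", not necessarily "bad"."""
    alerts = []
    for pair, count in observed.items():
        rate = count / windows
        if abs(rate - baseline.get(pair, 0.0)) > threshold:
            alerts.append((pair, rate))
    return alerts

# Finance host A suddenly talks to supercomputer D in 90 of 100 windows:
observed = Counter({("A", "D"): 90, ("A", "B"): 88})
print(flag_changes(observed, windows=100))  # → [(('A', 'D'), 0.9)]
```

A flagged link only says the behavior changed; as noted above, a defender still has to investigate whether the change is benign.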
So these are the kinds of things we talk about when we talk about behavioral analytics. Here's another example. In research, people are talking a lot about machine learning applied to cybersecurity, so I want to give you a quick example. Suppose you wanted to determine from data whether you can identify web attacks: whether you have an attack or not an attack. For these two classes, is something an attack, a SQL injection or cross-site scripting or one of those kinds of attacks, you can actually process the URLs in the traffic in such a way that you can build a model to separate the two classes. We've done this using term-frequency approaches, that is, bag of words, looking at the frequency of words. The interesting thing is that these kinds of web attacks use programming-language concepts: if you're doing a SQL injection, there are actual SQL commands. So if you have supervised learning, some labeled data that shows which kinds of commands are bad and which are good, you can use machine learning to build a classifier that's pretty accurate and can help you detect these kinds of threats. And the good news is that with these approaches you can get very fine-grained. You don't only have to have two classes, attack or not attack; you can ask questions like: is it a SQL injection? Is it cross-site scripting? These approaches are very good, so they augment the signature approach. The challenge with approaches like this is that they require something called supervised learning, and in this case that means you need data that somebody has labeled: for the multi-class case, data that shows you what a SQL injection looks like and what cross-site scripting looks like. And in most data science applications in the world I live in, national security, it's very hard to have labeled data. It's hard to have people who are able to label it, or, in the case of cyber, the time it takes to label it may be overtaken by the fact that the threat has moved on to a new kind of threat. The dynamic nature means the effort of labeling doesn't always help you, because things have changed enough that that kind of threat no longer exists. The example I gave, cross-site scripting, is still very widely used, so these kinds of examples are well warranted for supervised learning, but not all cases are.

Again, I'm giving you a survey of the trends in research, and then we're going to dive in with some use cases and examples, so hopefully I can make this more concrete and practical for you. There are lots of other data science approaches for cybersecurity. If you wanted to do user modeling, or to understand how hosts should behave, you can build role-based behavioral models. That means you don't have to know a priori that this is a computer in finance; you can observe it through data acquisition and decide that it has the role of finance or the role of a scientist. Those kinds of approaches are a way to learn from the data rather than have people tell you, and this way, if roles change, you can use that to alert your cyber defenders to maybe explore a little further. Also, instead of the supervised models I talked about before, there are ways to do anomaly detection, which don't have the problem of needing labels, but do have the problem of actually having to generate the model. There are many different generative models one can use, built, say, on the graph structure of the network, and you can use these for different kinds of detection problems. In many cases, say in OT worlds, you may be constrained by data collection, so these may not always be relevant, or you may have to use other kinds of networks to build the original models and then apply them in OT. I think we'll talk a little about this transferring of knowledge, from the algorithms you create in one environment to another, in a slide coming up, but there are a lot of challenges around that transfer of knowledge.

Okay, so I'm going to dive into a few specific examples now, and hopefully cover as many as we can in the next few minutes, to help reinforce this idea of what we look at from a research point of view, a data science research point of view. Any questions? The one thing that's hard in Zoom is that you could all be asleep and I can't tell. If you have a question, you can click the button to raise your hand and then I'll unmute you to ask it. Okay, we do have a question here. Let's see, John Fox, let me unmute you. John: Thank you. It would help, I think, to understand some of the material on the previous slide if you gave an example. In other words, you talked about a couple of models; one was, oh, you could detect by attributes of the IP or attributes of the message. Could you expand on some of these methods just a little bit with an example? Maybe go to one of your earlier slides and say, here would be an example of a problem, and here's what this approach does to help us understand it. Celeste: Would it be okay if I go through a few examples that I have upcoming and try to do that specifically, and then if I haven't accomplished it, you can ask me again? John: Yeah, I was just trying to imagine; those words were kind of theoretical for me and I was trying to attach them to some kind of context of, okay, practically, what would this be? But if you're going there, that's great. Thank you so much. Celeste: Okay, thank you. Any more questions?
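Before moving on: to make the bag-of-words classification idea from a couple of slides back concrete, here is a toy sketch in pure Python rather than any particular ML library. The labeled URLs are invented, and a real training set would be far larger.

```python
import re
from collections import Counter, defaultdict
from math import log

def tokens(url):
    """Crude bag-of-words tokenizer over a URL string."""
    return re.findall(r"[A-Za-z]+", url.lower())

# Tiny invented labeled set; real training data would be much larger.
training = [
    ("/search?q=books", "benign"),
    ("/item?id=42", "benign"),
    ("/item?id=1 UNION SELECT user,pass FROM t", "sqli"),
    ("/comment?text=<script>alert(1)</script>", "xss"),
]

# Train a multinomial naive Bayes model from token frequencies.
class_counts = Counter()
token_counts = defaultdict(Counter)
for url, label in training:
    class_counts[label] += 1
    token_counts[label].update(tokens(url))
vocab = {t for c in token_counts.values() for t in c}

def classify(url):
    def score(label):
        counts = token_counts[label]
        total = sum(counts.values())
        prior = log(class_counts[label] / sum(class_counts.values()))
        # Laplace-smoothed log-likelihood of each known token.
        return prior + sum(
            log((counts[t] + 1) / (total + len(vocab)))
            for t in tokens(url) if t in vocab
        )
    return max(class_counts, key=score)

print(classify("/item?id=7 UNION SELECT secret FROM users"))  # → sqli
```

The SQL keywords in the query dominate the term frequencies, which is why even this tiny model separates the classes; with richer labels the same scheme extends to more attack classes, as described in the talk.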
Okay. So one of the practical things people talk about for cybersecurity, a phrase that's come out, is continuous monitoring. That sounds great, and we can all agree continuous monitoring would be a good thing, because periodic snapshots have their weaknesses: they rely on being able to catch things at the right time, and you'll have gaps. As far as I know, adversaries don't only attack at the time you happen to be scanning your network; they can attack at any time. DHS even had a whole program called Continuous Diagnostics and Mitigation. So there are a lot of things one can think about from a practical point of view; you want to do asset inventory under resource constraints and do ongoing modeling. But really, what does it mean to do continuous monitoring? There are lots of practical concerns, and from a research point of view many of these are open-ended questions. If you think about continuous monitoring, you think about huge amounts of data. And I know at the laboratory I'm in, people use their computers not for the monitoring, of course, but because they have real tasks to accomplish, so you can't put substantial load on the network just for continuous monitoring. We have to understand how much the network changes, and not only do we need continuous monitoring, we need to be able to explain those changes to a cyber defender. So some of the challenges are figuring out how much data to collect, when to collect it, and how to make it understandable, knowing that in any practical situation you're not going to have perfect visibility, and exploring automated data approaches to support popular topics like continuous monitoring.

A particular case I worked on was how to understand, in a managed computing environment, what devices are on my network and what kinds, or whether a rogue device is on my network, knowing that we're in a resource-constrained environment. One of the things we did was use our normal vulnerability-scanning approaches but change the way we did them. Your organization may do vulnerability scanning, and it may be well known in your organization that scanning happens every Wednesday. If that's well known, it could mean you miss the other days of the week, and it could mean people learn how to game the system. So what we did was monitor our network, do an analysis of all the data, and find that there were periods throughout the day in which we could collect different kinds of activity: during the day, user behavior; during the evening, server behavior like backups. We could vary our data-collection strategy and our vulnerability scanning so that it was randomized and not deterministic, and that improved our ability to detect incoming threats by a significant amount. It also improved our ability to keep users from gaming the system. What I mean by that: there are people who legitimately want to get their job done, and they may have a FileMaker Pro database on a Windows Vista machine that goes back to 2000. Under current practices those operating systems and applications are no longer supported or allowed; they're out of compliance. But those users learned that, with scanning happening every Wednesday, they just had to shut their computer off and they never got caught. With more randomized approaches, they did. And you actually didn't have to continually monitor; you could pick the periods of the day that captured the broadest behaviors, by monitoring in advance and then building a model from the analysis of which hours of the day let you see different things, like all the different users connecting. For example, at Livermore we found that if we regularly collected data between 11 a.m. and 12 p.m. for user behavior and between 2 and 3 in the morning for server behavior, snapshotting those periods and then periodically sampling the rest of the network during the rest of the day, we had a good, useful model of our network. So here we addressed the amount of data and the operations of how we collected it, and that made a huge difference in our ability to identify threats, to know about all the assets on our network, and to build models for finding unauthorized or unmanaged devices.

When we did this work, it seemed very straightforward to ask how to identify the important changes in my network. I want to understand my network and know how it changes; that seems straightforward, but it actually ends up being a very open-ended question. From a cyber defense point of view: what does change mean? What are important changes? The reason this matters is that our cyber defenders don't want to be alerted constantly, because then the alerts become useless. We want to triage the most important changes for them. And if you ask yourself what the important changes are, lots of different challenges come to bear. So, going back to an earlier question, let's talk about what kinds of things can change, and maybe for our cases we
have to get specific about what data we collect. Network properties can change, and what are those properties? They could be the communication volumes (the number of links, the number of hosts, the number of devices), the communication distributions (what protocols are being used, and where), and the communication patterns (who's communicating with whom). These are features we can collect, understand, and build distributions for, and then use those distributions to help us identify changes. Now, there are other interesting things in what we've been approaching, and there's lots still to learn here; I don't want to lead you to believe this is all done. But we have to start coming up with approaches to understand how we determine normal. Maybe a normal distribution isn't the right model, but there are different kinds of models we could use to determine normal, and in some cases decision thresholds we can use to find the interesting changes. But the concept of normal itself changes over time. So maybe one of the things we have to think about is not just building a network model once and using it forever: how do we build into our approach that we not only have to do data collection, but also have to regenerate these models? The idea, from a practical point of view: you've done some monitoring, you've figured out what kinds of things you can collect, and in each network you may have to decide on the top ten things you want to collect, or the top three; you can do some feature engineering to help with that. Then you also have to think about at what cadence, at what rate, you have to rebuild the models, beyond which a decision threshold you chose is no longer valid.

And I think for cybersecurity in general, and for this kind of research, one of the things we really don't have a good idea about is this: say I can answer these kinds of questions for Lawrence Livermore's network. Does it apply if I wanted to apply it to General Electric's network, another big corporation? This idea of whether my model is only good for me, or only good for me at this time, or whether my model is generalizable and useful for other people, this idea of transfer learning, is still very much an open question. There's some thinking that maybe, for like organizations, the model I built here at Lawrence Livermore would be suitable for a sister organization that's like us, but maybe not so much for an organization like Apple, which may be different from us. I think these are things we have to start thinking about. The other thing one has to think about, and we worked hard on this and I mentioned it a little, is how to accomplish your data monitoring given very different time behaviors. You can have gradual change in the network, just general change, and you can have sudden but now persistent change, like the use of Zoom, which wasn't popular before. There are gradual changes, like people moving away from one technology to another, and there are instantaneous, sudden changes, and one has to think about data collection and data analytics with these timeframes in mind; a lot has to do with how to build approaches to handle that. And I just want to reinforce, because at Lawrence we're very much rooted in making sure our research is applied, and we work closely with our operations and infrastructure teams so that we are not in some ivory tower, that many of the things we have to do are not only having algorithms that can detect these important changes; we actually have to be able to explain the where, what, and how, and that really drives what data we collect. Explainability becomes important, and often things like the behaviors I mentioned before, protocol distributions and communication patterns, tend to help us be explainable. Not all features we could collect lead to good explainability, and we may choose not to use them because of that; it's a trade-off space one has to look at in terms of accuracy versus explainability.

So that's one example, the whole continuous-monitoring piece, but I want to give a few more examples around standard best practices for cybersecurity, many of which you've heard about. First, though, a public service announcement: we know in cyber that user training is one of the most important things we have to do, and that's a well-known best practice. I want to reinforce that. Now I want to talk about some research ideas around other well-known best practices. Many of you have heard, or have experienced, that patching systems and patching your applications is important and is a very effective cybersecurity best practice. But let's dive into this idea of having to patch. Again, it's one of those things we all agree makes sense. But suppose you were responsible for a very large-scale operational network or OT network with hundreds to millions of devices, and you got a notification of a critical firmware update that had to happen, and each firmware update took a minute, or ten minutes; whatever time it took, it takes some amount of time to get done. You can't instantly do it, because one, it requires downtime for the device, and two, the saturation of networks and the resiliency issues won't allow it. Then you ask yourself: what are the most important devices I would have to patch first? So I actually worked on this project a little more: given resource
constraints, like not being able to take the whole network down at once, and not having enough resources to do the patching instantaneously, which devices would I patch first? What are the most important devices? I naively started off with: well, I'll go ask people. And I did; I went and asked all the organizations to give me the list of their most important devices. Of course, when I went to the finance department, their servers and devices were the most important, and when I went to HR, likewise. But nobody really had a way to intuit whether there were underlying things people didn't know about that were important: services and devices people didn't realize everything depended on. So again, everybody gets this mandate that we should patch, but most cybersecurity controls don't tell the operations folks how to go about doing it, and each individual organization has to struggle with what the right way is. So we approached this from a data-driven point of view. We did data collection, in a way similar to the continuous-monitoring discussion, where we observed the network for a short period of time to figure out how to monitor it within a resource constraint, so we didn't have to do zillions of data collections over a long period. Then we looked at things like those communication patterns, computing centrality measures, and found ways to identify whether there were nodes in this large network that we didn't know about, but that everything, or some important things, relied on. Through this data-driven approach we were able to create a ranking of importance, and we didn't actually do a strict one-to-N ordering. We decided to use binning: critical or high importance, medium importance, and low importance. It is context-dependent; sometimes the nature of the threat will factor in. Through the binning we were able to validate with the subject-matter experts and actually arrive at a strategy for getting critical patching done in the most effective way, to improve our security posture against that threat. So here's the idea: everybody says go and patch; the research side of the house asks how we go and patch in the most effective way. Having this data-driven approach gave us a way, but I do want you to know it's not once-and-done. We have to keep looking at the network's data in this way so that, as in the earlier work, we can understand how things change and get a cadence for how often one may have to re-evaluate the importance of devices, should a new patch come out.

All right, I'm going to move on to another example. Any questions so far? Okay, if you have a question, you can raise your hand; there's a button down in the middle. Okay, maybe I'll go to the next example, and we can see if that generates any questions. So, one of the most important controls that NIST mentions for managing a computer network, OT or IT, is something called segmentation. That's the dividing up of the network into different subnets, or enclaves as they're sometimes called. The idea behind it is that you prevent the spread of malicious behavior and you enforce least privilege: you make sure only the people who need access to that subnet have access. I've read all the controls and all the documents, and I do know that segmenting is important, but nowhere in the controls does it tell you how to do it, and actually nowhere in the controls does it tell you whether you just have to
do it once or whether you have to do it often. And so the challenge we explored here, from a research point of view, is how do we give advice to our cyber defenders to figure out how best to segment a network for improved security? Now, you could choose to segment a network for many reasons. You can segment because of physical proximity and have all the network for one building together; you can segment based on who owns the equipment; there are many reasons. But if we wanted to segment the network for security, we wanted to be able to give advice to our cyber defenders and be mindful of the cost. The cost of some of these approaches is non-trivial, although as we move to more software-defined networking and new approaches to network management, some of these things actually become programmable and controllable such that the cost is lower. But imagine in an OT network, where things are pretty stable, or could be pretty stable: if you were told to resegment, it could be a major effort, so you want to be mindful of the cost.

So we had to collect data to be able to do some experiments, from a research point of view, in a lab, so that we could think about how to do the segmentation. Again, we found that things like the protocol distributions and the time of day when connections occurred were important. We also used a metric of the amount of connections, not only the distribution of protocols but the amount of connections within each segment, and we developed a threshold. The reason people were grouped in a segment is that they work more with each other; their hosts or devices communicate more with each other than they communicate outside the segment, so the intra-segment communication was higher than the inter-segment communication. When that balance changed and the communication got beyond the threshold, we looked at the cost and the implications of resegmenting. And what we found, and again I want to tell you that this is research and we have not done enough experiments to know whether our results transfer to others, is that the need for segmentation does change over time, and that there is enough natural organizational movement to warrant periodically looking at your segments and considering resegmentation. So it's not something that has to happen all the time, but it definitely warrants revisiting.

The idea is that you want the segments to be tightly integrated. Often the way segments are created, there are firewalls forming these enclaves, so they're protected by security devices, and you want to keep as much traffic as possible inside the segment, where it has that extra layer of protection through those firewalls. Once that starts to fall apart, that's when resegmentation needs to be reconsidered. So we used the data-driven approach: by collecting data and identifying these clusters of hosts, using measures of reachability and amount of communication, both the volume and the number of connections, we helped decide what the original segments should be. Then we could periodically re-evaluate which segments needed a different clustering of hosts or devices over time, and do that in a way that gave the operational teams practical and understandable reasons why it was worth considering resegmentation.
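The intra- versus inter-segment balance described above can be sketched as a small computation. This is an illustrative toy, not the lab's actual method: the flow-record format, the 0.5 threshold, and all names here are hypothetical.

```python
from collections import defaultdict

def segment_cohesion(flows, segment_of):
    """For each segment, compute the share of its traffic that stays
    inside the segment. flows is a list of (src, dst, n_connections);
    segment_of maps a host to its segment label."""
    intra = defaultdict(int)
    total = defaultdict(int)
    for src, dst, n in flows:
        s, d = segment_of[src], segment_of[dst]
        total[s] += n          # every flow counts toward both endpoints' segments
        total[d] += n
        if s == d:
            intra[s] += 2 * n  # intra-segment flow counted once per endpoint
    return {seg: intra[seg] / total[seg] for seg in total}

def flag_for_resegmentation(cohesion, threshold=0.5):
    """Segments whose intra-segment share falls below the (tunable)
    threshold become candidates for re-evaluation."""
    return [seg for seg, ratio in cohesion.items() if ratio < threshold]

# Hypothetical observation window: segment A is cohesive, segment B
# mostly talks outside itself and gets flagged.
flows = [("a1", "a2", 10), ("a1", "b1", 2), ("b1", "b2", 1), ("b1", "a2", 5)]
segment_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
cohesion = segment_cohesion(flows, segment_of)
candidates = flag_for_resegmentation(cohesion)
```

Run periodically, as the talk suggests, a falling cohesion score for a segment is one concrete signal that organizational movement may warrant resegmentation.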
Now I have to say, and this goes for all the research I'm talking about, one of the most challenging parts of being a researcher in this field is the ability to validate. Whether it's because you're doing cyber threat detection and the threats are variable, or you're doing things like the organizational changes I'm talking about, like segmentation, how do you define success? I go back to what I said earlier: these concepts of what defines "more secure," and what measures we use, are still at a nascent, early stage, so it's hard to have ground truth to know if the approach I'm taking is better or worse, and it is hard to do these assessments. That's one of the challenges in this field: getting robust data sets and getting access to ways to evaluate. In many of the cases I talked about here, our initial evaluation was through subject matter expertise, and then, as we were able to conduct experiments, we could measure things such as the amount of lateral spread of a known adversary in a network when we had segmentation. But these things are very hard, and we are still in the early stages. There are many different models and many foundational questions we still have to understand; I think we're in the baby-steps phase here. But there are some real practical things we as researchers can do to help make things explainable, and to design approaches that enhance things like signatures, so that we can use anomaly detection and other variations and make progress.

So I want to recap, and then I'll take questions. Really, this idea of situational awareness, where I sit, reflects the common understanding that continuous understanding of your network is important. That is not as easy as it sounds.
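Going back to the patching example for a moment, the centrality-plus-binning idea could be sketched roughly as follows. This is a minimal illustration under stated assumptions: it uses simple degree centrality over an observed edge list, made-up device names, and arbitrary bin thresholds; the actual work used richer centrality measures and expert validation.

```python
def degree_centrality(edges):
    """Count how many distinct peers each node communicates with,
    one simple centrality measure over observed traffic."""
    peers = {}
    for a, b in edges:
        peers.setdefault(a, set()).add(b)
        peers.setdefault(b, set()).add(a)
    return {node: len(p) for node, p in peers.items()}

def bin_importance(scores):
    """Bin devices into critical/high/medium/low rather than a strict
    one-to-N ranking, as described in the talk. Thresholds relative to
    the maximum score are arbitrary choices for illustration."""
    mx = max(scores.values())
    bins = {"critical": [], "high": [], "medium": [], "low": []}
    for node, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s >= 0.75 * mx:
            bins["critical"].append(node)
        elif s >= 0.5 * mx:
            bins["high"].append(node)
        elif s >= 0.25 * mx:
            bins["medium"].append(node)
        else:
            bins["low"].append(node)
    return bins

# Hypothetical observation: "srv" is a back-end server everything
# relies on, even if no department listed it as important.
edges = [("srv", "h1"), ("srv", "h2"), ("srv", "h3"), ("srv", "h4"), ("h1", "h2")]
bins = bin_importance(degree_centrality(edges))
```

A patching strategy would then work down the bins, critical first, re-running the analysis on a cadence since importance drifts over time.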
It's a lot of data, and there are a lot of practical concerns about how much you collect and how you evaluate it all. There is lots of data in cyber, which is fun for me as a data scientist, but sometimes that's not a good thing, because we have to be able to make use of it in a timely way; time to solution is very important in cyber. Research is needed to advance the science of security. As I was just saying, we've made a lot of advances, but there are still a lot of foundational things about how networks change, the rate of change of a network, and the many constraints in terms of resources. We as a group of researchers really have to push on how to understand evaluation metrics, and the metrics of what it means to be secure. The cybersecurity community has to be thinking about these things, with input from the operational community and the cyber defenders themselves, everybody working together. As a researcher, this is exciting; there are a ton of interesting research questions that are well grounded in practical needs, and lots of fun. So if you have an interest in cyber research, I'm happy to be a research resource for you. Hopefully I gave some useful examples; I know this was more of a survey than a deep dive, and I will make some papers available if you want.

There's a lot of benefit in combining multiple sources of data: data from the operational point of view, data from more quality-of-service kinds of things, data from cybersecurity tools. And I want to reinforce that it is important that these activities and the research we do yield actionable information for cyber defenders; the goal really is an applied research approach. The data-driven approach provides new insights and a way to explore a space that is not easily intuited, given that computing networks, and the behaviors of humans and devices within them, form a very complex system. Hopefully these data analytic approaches give us a way to augment the knowledge held in subject matter expertise with insights that couldn't normally be inferred. It's very hard to intuit, for example, that there's a critical device in a network when it's a back-end server and nobody knows it's there. With that, I'm going to stop and hopefully leave some time for questions, with first priority to the person who asked the earlier question: did I give enough examples, or do I need to go back?

Thank you, Celeste. Terrific. So, John Fox, you asked the first question; let me unmute you, and then say if you have more questions or if Celeste has addressed your question. John?

Yeah, thank you very much. I'm coming at this trying to absorb it maybe in a more practical sense, but I think one example you had was very helpful. You said you can look at the type of network traffic, whether it's peer-to-peer or, you know, web, and you gave an example of several categories. So you could see that as a metric: you could just look at all the packets going out, keep track of this, and if it changes in an anomalous way, you would say someone is using this in an atypical way. Did I capture that aspect of it?

Yeah, that's absolutely correct.

Okay. But then I want to be clear: just because it changes doesn't mean it's bad.

Right, but it seems to me, and again I don't understand all these protocols that well, but it seems to me it's pretty easy, in the same way that I can get spoofed emails that claim to be coming from places they're really not coming from.
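The metric being discussed here, tracking the distribution of protocol categories in outgoing traffic and flagging large shifts, could be sketched like this. It's a toy illustration: the protocol labels, the total-variation metric, and any alert threshold are assumptions, not a description of a specific tool.

```python
from collections import Counter

def protocol_distribution(labels):
    """Turn a window of observed protocol labels into a normalized distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {proto: c / total for proto, c in counts.items()}

def distribution_shift(baseline, current):
    """Total variation distance between two protocol distributions:
    0.0 means identical, 1.0 means completely disjoint."""
    protos = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(p, 0.0) - current.get(p, 0.0))
                     for p in protos)

# Hypothetical windows: mostly-web traffic shifting toward peer-to-peer.
baseline = protocol_distribution(["http"] * 8 + ["dns"] * 2)
current = protocol_distribution(["http"] * 2 + ["p2p"] * 8)
shift = distribution_shift(baseline, current)
```

A defender might raise an alert when the shift exceeds a tuned threshold, while remembering the point made in this exchange: a change is a cue to investigate, not automatically something bad.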
I can wrap my message with false headers, and I can wrap it up with all kinds of stuff to make it look like something other than what it is. It would seem to me, and maybe this is just the game you play, where each approach has a countermeasure and somebody gets slightly more sophisticated, that you could use a variety of protocols or wrappers, choosing among them with some pseudo-random method, so that you wouldn't see these as anomalous changes at all; you'd just see white noise in the distribution. I don't know if that's too specific a question, but I'm just thinking how easy it would be to spoof some of this stuff.

So let me address that specifically. I think what's important there is the machine learning approaches. With signatures, it's very hard to have a rule with many different parts; it gets too unruly. But with a machine learning approach, or an automated algorithm, you can parse more: instead of just looking at the header, you can look at the details of the header. Maybe there are 56 or 75 different things, if you parse the header, that you can now keep track of and build a more complex algorithm around. That's what we're seeing happen. It is true that there are wrappers and different ways you can use these protocols, but there are only so many ways that will actually work, and if you have an algorithm look at not just one or two or three metrics, you can start seeing things. People actually give themselves away, because they do the craziness consistently: only the developer finds a way to take apart the protocol and use it in a unique way, and even if they vary it a little, the machine learning algorithm can find statistically significant patterns. So the real point is that with machine learning or algorithmic approaches to cyber defense, we can look at many more data elements and make the equations that look for anomalies more complex, and that's really what helps in the case of people trying to spoof you by making things look alike.

Okay, thank you. We still have a couple of questions coming; people can type questions into the Q&A or raise their hands. There's a question from the Q&A that asks about end-to-end encryption, which is a best practice in the industry for security purposes: to what extent is the use of encryption at odds with a data science approach to detecting cyber threats, and more generally to identifying opportunities to improve security?

So, end-to-end encryption definitely makes things better and more secure, and it is definitely a best practice, but there are still things one can look at from a data science perspective. With end-to-end encryption, you can still see that some device is connecting to some other device, so maybe you're not looking at the content, but you're looking at the structure, or at some of the things I talked about, those communication patterns. Even with encryption, you can still figure out some things like the protocols, and there's some header information. So I believe encryption makes detection more challenging, and it is good in terms of protection, but there are also different ways of looking at, in particular, the structural or topological connections, more like a graph analytic problem. We have a paper called Guilty by Association that explores these more structural elements, using what you can see when you can't see the content to help you protect your network.

Thank you. We are now approaching 2:30. Before I close and wrap up, I have a last question myself: Celeste, I know that you lead, or maybe still lead, the data science cyber security summer
internship program. Can you tell us a little bit about that program? Especially for our students, because maybe 80 or 90% of them have taken data science or machine learning 101, I think they would be very interested in the internship opportunity.

So I would point your students to the Data Science Summer Institute, called DSSI, at Lawrence Livermore. This year the program is already full, and it will be fully remote, but for future reference, and it doesn't only have to be during the summer, we can have students working with us throughout the year. The Data Science Summer Institute and the Cyber Defender program are programs in which students apply the kinds of data science challenges I've talked about to cybersecurity, and it's a cohort model where students work alongside their colleagues to explore. When students come to the laboratory, they have access to the lab data, which we normally can't share externally; because they are employees during the summer, or while they're interns, they get access to all the fun stuff I was telling you about. So again, the Data Science Summer Institute, DSSI; there's a website describing existing and previous projects, and that would be the best resource. I also put my contact information on the slides, and I'm happy to entertain questions about the things I talked about or any internship opportunities at Lawrence Livermore.

Thank you for that; really appreciate it. I think this was slightly different from a traditional power system talk, but it was really eye-opening and gave us some additional thoughts on how we can apply these continuous monitoring and network segmentation types of techniques to power systems. Thank you very much; appreciate it. Thank you, everyone.