So, yeah, that was my bio. I'm basically an impostor here: I'm not a data scientist, I come from the telco world, from the internet service provider and networking world. So this is also a story of multidisciplinary research that I did with my colleagues, which I believe is the kind of thing you should be looking at in machine learning and AI today.

It all starts from a statistic called dwell time, which is basically a measure of how much time passes before you realize that your network has been compromised. If you look at the industry reports, they usually claim something around 100 days, which is more than three months. That's crazy: it's like saying you operate a corporate network and for three months you don't realize it has been compromised, that an attacker has access to it.

The main culprit we see today with our customers is IoT, and by IoT I mean random internet-of-things devices from any vendor. In this example, a small Raspberry Pi, which, as you may have heard, is basically a small computer that can be used for many things. Let's say it's a temperature sensor, or a humidity sensor, or whatever, and it's periodically sending data back to a cloud backend. But it's also doing other activities: synchronizing the clock, sending statistics or telemetry, updating the firmware, and so on. Up to here, if you have a small deployment, you could just follow the traditional approach of setting prescriptive firewall rules that determine what these devices can and cannot do, what type of network traffic is allowed and what is not.

But things tend to expand in the real world; entropy always increases. After a while you find yourself with a larger deployment, with a lot of devices. At some point some of them might be at remote locations, dialing into your organization via VPN or via some weird data-center tunnel. And some of them might be from that particular vendor you bought several years ago, and they're not updated anymore. This is why you might not realize that one of them has been compromised and is now an insertion point for an attacker into your network. Typically this is very hard to identify using traditional firewall rules, because maybe the compromised device generates very little traffic, just a keep-alive to a remote server controlled by the attacker.

So we looked at this problem with several of our customers, and we thought: why don't we turn the problem upside down? Instead of trying to control what kind of traffic these devices can generate, why don't we allow every type of traffic, and instead try to determine the typical behavior of each individual device and see whether it changes over time? Obviously this is not something we invented. It's called user and entity behavior analytics, or UEBA, which is a whole market: there are a lot of startups working in this field and a lot of mainstream vendors already there; we are a late entrant. But if you look at what most solutions are doing, they typically ask you, the customer, to operate some sort of big-data backend.
You probably have to store network flows or syslog entries or even raw packets for a while, and then these are processed in batches. We instead want to differentiate ourselves with an approach completely based on online learning, where we basically tell our customers: we will look at the data only once and then throw it away. We ingest network data live, and we don't require you to store anything at all. Besides being much cheaper from an operational perspective, this is also much simpler from a legal and privacy perspective, because many of our customers are in highly regulated industries such as healthcare or finance, where even the fact that you hold data is toxic, it's expensive. So using online learning is a real facilitator for them and keeps the opportunity cost down.

The inspiration for this work comes from natural language processing, particularly from the line of work that started around 2014-2015 with word2vec and continues through the recent advancements. I probably don't have to explain the basics to this audience, but the core of modern NLP is the idea of word embeddings: the idea that the meaning of a word can be determined by looking at the contexts in which it appears most of the time. This is known as the distributional hypothesis, which dates back to the 1950s and the idea that "you shall know a word by the company it keeps", meaning you can determine the meaning of a word in English by looking at the words that typically appear next to it in English books, sentences, and so on. And this is the canonical example you always see in word2vec tutorials: you can use word embeddings, vector representations of words, to do analogies, to conjugate verbs or go from masculine to feminine, or even to do higher-order analogies such as going from a country to its capital. In fact this was used a lot in the past, even for things like basic artificial intelligence for chatbots.

But again, I come from networking, so my context is not a spoken language like English or Italian or French; it's networking. Can we use these recent advancements in NLP to model a made-up, synthetic language such as the language of networking? What if we say that a network flow, for us, is a word, and the ensemble of all the words generated by an individual device is a document? Then we would be able to determine the typical topic of conversation of that particular temperature sensor, and whether it changes over time.

So we built a small prototype that uses a set of features from layer 3 and above in the networking stack to generate synthetic words like the one in this example, taking things from the lower layers of the stack up to the deep packet inspection that we run in real time on the traffic we see on our access points, switches, and routers. Then we apply a sort of advanced word2vec, where we implemented a few mathematical tricks to adapt it to a made-up language like this one. In particular, because we are not using English or any spoken language, the size of the vocabulary is not naturally limited, so we use a few tricks to bound its maximum size; otherwise it would grow to infinity.
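To make this concrete, here is a minimal sketch of one common way to do it, the hashing trick: serialize a few flow features into a synthetic "word", then hash that word onto a fixed number of embedding rows. The field names and bucket sizes here are hypothetical, for illustration only, not our actual feature set.

```python
import hashlib

# Hypothetical flow fields; the real features come from layer 3 and above,
# including the output of deep packet inspection.
flow = {
    "proto": "TCP",
    "dst_port": 443,
    "app": "TLS",              # e.g. application guessed by DPI
    "bytes_bucket": "1-10KB",  # coarse volume bucket instead of raw byte count
}

def flow_to_word(flow):
    """Serialize the subfields into one synthetic 'word'."""
    return "|".join(f"{k}={v}" for k, v in sorted(flow.items()))

VOCAB_BUCKETS = 2 ** 18  # fixed upper bound on the vocabulary size

def word_to_id(word, buckets=VOCAB_BUCKETS):
    """Hashing trick: map an unbounded set of synthetic words onto a
    fixed number of embedding rows, so the vocabulary cannot grow forever."""
    digest = hashlib.md5(word.encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

word = flow_to_word(flow)
print(word, "->", word_to_id(word))
```

The cost of hashing is occasional collisions between unrelated words, which is usually acceptable as long as the bucket count is much larger than the number of distinct words actually observed.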
We also don't learn a vector representation only for each whole word, but for each subfield of the word you saw earlier, so that we have a vector representation of the meaning of individual networking concepts, such as a packet being TCP rather than UDP.

What we end up with is a batch of flows per device over the last few minutes across the network. Let's imagine device A in this example is a temperature sensor, and let's say over the last few minutes we saw four individual flows. We pass them through our modified word2vec approach, and in output we get vector representations: basically a large array of numbers for each flow. In practice, many flows you see on the network, in this synthetic language, will be very similar to each other. So we cluster them down to a manageable number, using cosine similarity with a particular implementation that we did so that we can run it in a few milliseconds across all the flows that crossed the network over the last few minutes. We look at the similarity between every pair of flows and group together those that are more similar than a given threshold, which is basically saying: all the flows that are almost identical, let's consider them as one.

We also keep a tiny state for each individual device, which describes its typical behavior: basically the last clusters of flows that the device has generated over the recent past. These clusters are moving averages of similar flows, and you will notice they are timestamped, so that we can update the behavior of a device by pruning away the clusters that have not been seen for a while.

What we do then is take the flows from the recent batch and compare their similarity with the clusters we have for this particular temperature sensor. We take the maximum, and we feed those maxima into a quantile sketch algorithm, which is a fancy way of describing an algorithm that keeps track of the quantiles of a distribution while keeping its storage bounded in size.

What the user sees, from a UI perspective, is a single slider like this one, where you set the minimum confidence you want the algorithm to have for the alerts you receive. If you slide it all the way to the left, you will receive an alert for every flow where we are not really certain it is an anomaly; if you push it all the way to the right, you will only see alerts for things we are absolutely sure are anomalies. What this means from a data science perspective is that we extract the corresponding quantile from the sketch, which we call beta in this example; it goes from 0 to 1, basically, or from 0 to 100.

Then we take the last row of the comparison table you saw earlier, which is the maximum similarity between the new flows and the existing clusters. If there is any previous cluster that is more similar than the threshold, this is not an anomaly: we have already seen behavior similar to this one, so we are fine. But if there isn't, this is a new behavior for this temperature sensor. If it is farther away than beta, we alert the administrator, and we also create a cluster for it, so that we recognize this behavior in the future and don't keep alerting the user over and over.
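To make the pipeline concrete, here is a minimal sketch of how these pieces could fit together. It is not our production code: numpy and a plain list stand in for our optimized similarity implementation and for a real quantile sketch, and the merge threshold, the moving-average weight, and the mapping from the slider position beta to a quantile are all illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flow embeddings (assumed non-zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class DeviceState:
    """Tiny per-device state: timestamped cluster centroids (moving averages
    of similar flows) plus past max-similarity scores. The plain score list
    is a stand-in for a real bounded-memory quantile sketch."""
    def __init__(self):
        self.clusters = []  # list of [centroid, last_seen]
        self.scores = []    # would be a quantile sketch in production

def process_flow(state, vec, now, beta=0.9, merge_thr=0.95, ttl=3600.0):
    """Compare one new flow vector against the device's known behavior.
    Returns True if the administrator should be alerted."""
    # Prune behaviors (clusters) that have not been seen for a while.
    state.clusters = [c for c in state.clusters if now - c[1] < ttl]

    sims = [cosine(vec, c[0]) for c in state.clusters]
    max_sim = max(sims, default=0.0)

    # Slider position beta -> quantile of past scores: the farther right the
    # slider, the lower the similarity must be before we raise an alert.
    threshold = np.quantile(state.scores, 1.0 - beta) if state.scores else 0.0
    state.scores.append(max_sim)

    if max_sim >= merge_thr:
        # Known behavior: fold the flow into the closest cluster.
        i = int(np.argmax(sims))
        state.clusters[i][0] = 0.9 * state.clusters[i][0] + 0.1 * vec
        state.clusters[i][1] = now
        return False
    # New behavior: remember it so we don't keep alerting over and over.
    state.clusters.append([vec, now])
    return max_sim < threshold
```

In production the plain score list would be replaced by a real quantile sketch (a t-digest or similar), so the per-device state stays bounded in size, which is the whole point of the online-learning approach.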
In our UI we also generate a very simple dimensionality-reduction visualization, where you see a dot for each individual IoT endpoint in the network. The dots are colored by the type of device, based on its MAC address, but this is not information we pass to the algorithm. So the fact that you see similarly colored clusters lying next to each other on the 3D plot is not something we told the algorithm; it's something the algorithm determined purely from behavior, by looking only at the flows. If you were to see, for example, a red dot, let's say a temperature sensor, moving away from the cloud of the other temperature sensors, that would also tell you that something is different about this individual device compared to the rest of its peers.

And this visualization is live. Actually, this example is taken from a dataset of a customer of ours operating a very large scientific park, a very large laser accelerator, where they have tens of thousands of custom-built pieces of scientific equipment. They don't even know exactly what all these devices are doing, because it's a very large deployment and they are all custom devices using all sorts of IoT protocols. So this visualization is not only useful for security; it also gives the network administrator an idea of what's going on in their network. At least you get an idea like: I have a sizable crowd of devices behaving similarly to this one, whose real function I do know.

So that's all nice and fun, and it's going through early trials with customers. But what's next? As future directions, we are looking at multiple things. Ideally, what we want to build is zero-footprint, unaugmented, federated, edge wire-speed behavioral modeling. That's a mouthful, so what does it mean?

When I say zero footprint, as I said earlier, I mean we always want to build a solution that doesn't require the user to store any data in the backend, for operational and legal reasons, and that keeps the opportunity cost down. We don't want to tell the user: you know what, you have to install a rack full of equipment to run this application. We want to be able to say: you already have a virtual machine in your data center controlling all the access points in your venue; you can also install this other lightweight virtual machine, and we will do the machine learning magic there, basically.

When I say unaugmented, I mean we want to skip the feature engineering part altogether. That slide I showed earlier, with the features we use to generate the synthetic words, we want to drop it completely. Ideally, we would ingest raw packet bytes. And even though most networking data is encrypted nowadays, there is still plenty of metadata, for example during the TLS handshake phase, that is either unencrypted or carries content, such as certificates, that doesn't change across connections. This approach would ideally also allow us to automatically learn long-term dependencies between flows. It takes ideas from computer vision rather than NLP, so it's a fusion between these two fields that we are looking at right now.

Another approach we are looking at today is federated learning; a rough sketch of the idea follows.
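We haven't settled on a specific algorithm, but a minimal federated-averaging (FedAvg-style) sketch gives the flavor of what sharing weights between deployments could look like. The sites, gradients, and learning rate here are hypothetical stand-ins, not our implementation; the key property is that only model weights ever cross the network, never traffic data.

```python
import numpy as np

def local_update(weights, local_grads, lr=0.01):
    """Stand-in for on-site training: each deployment refines the shared
    weights using gradients computed from its own traffic, which never
    leaves the site."""
    for grad in local_grads:
        weights = weights - lr * grad
    return weights

def federated_round(global_weights, sites):
    """One FedAvg-style round: every site trains locally, then the backend
    averages the resulting weights. Only parameters cross the network."""
    updates = [local_update(global_weights.copy(), grads) for grads in sites]
    return np.mean(updates, axis=0)

# Toy run: three hypothetical customer sites with random stand-in gradients.
rng = np.random.default_rng(0)
w = np.zeros(8)
sites = [[rng.normal(size=8) for _ in range(5)] for _ in range(3)]
w = federated_round(w, sites)
print(w)
```

A real deployment would layer secure aggregation or differential privacy on top, since even raw weights can leak information about local traffic; that is exactly the anonymity concern I'll come back to next.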
The motivation is that if you buy and install this software today, it learns everything from scratch. We are not doing any sort of transfer learning whatsoever: every new deployment starts with an empty model, with completely random weights and biases. So what if we found a way to share these weights between customers while preserving the anonymity of the data that goes on in their networks? That would be ideal, right? You would get higher accuracy sooner. And as a byproduct, we would get a higher-order, cloud-based repository of networking concepts, which we could use, for example, to transfer information about security attacks between customers and provide early alerts about ongoing threats.

And finally, up to here our entire implementation is pure CPU, commodity x86. We believe this will hit a wall if we stay on this road, because most of what I described in this slide cannot be done in real time on a CPU, despite all the optimizations we are attempting. So we are looking at embedding edge TPUs or embedded GPUs directly into the edge hardware, and by edge hardware I mean actual switches, routers, or access points on the roof. As you may know, most of these accelerator chips come either from home assistants, like Alexa or Google Home, or from the automobile industry, for example for self-driving. So this is a very novel field, and there isn't really anyone paving the road for us. But because the cost of these devices keeps going down thanks to economies of scale, we believe we will eventually be able to have them directly on access points.

So that's all I have. Thank you very much, and I'm happy to take any questions.