Hi, good morning, and welcome to this talk. Mark is mainly going to talk about using Ceilometer data to detect fraudulent activity in an OpenStack cluster, though as we will see, the same approach can do other things as well. It's a joint collaboration: Mark is the main person, and did this work as part of his master's thesis and while he was at Cisco as an intern; Professor William is a computer security lecturer at the University of Kent; and I work for Cisco. This whole space of problems originally started as a special project within Cisco, where we have a small advanced technology group looking at a few next-generation problems that OpenStack will face. Our problem was how to optimize big data and I/O-intensive workloads on OpenStack. We gave a series of talks in Atlanta showing initial results: smart placement, interactive visualizations, gathering insights from Ceilometer data, log data, and a bunch of other things. We had four talks in Atlanta, so I won't repeat all of that here. But the fundamental question underlying all of these optimizations for big data on OpenStack is: what is happening in my cloud? That is important for security purposes, for operational purposes, and for a bunch of other use cases. Mark then went back to his university and decided to pursue the follow-up problem: if we can figure out what is happening in our cloud from logs and metrics, how do we use that knowledge to detect fraudulent activities? So what is a fraudulent activity? There's a whole space of activities that could be fraudulent.
The first things that come to mind would be a DDoS attack, or figuring out whether somebody is doing something wrong like running a Bitcoin mining job. And in general you could even use the same techniques to discover noisy neighbors, which is a big problem in shared virtual infrastructure. With that, I'll hand over to Mark.

Thank you, Debo. Is my mic working? So the first step here was to decide what we consider acceptable workloads and what we consider fraudulent activities. Regular jobs would be using distributed systems, running web applications in the cloud, scientific calculations, or hosting databases. But is that everything happening in our cloud? No. Many other things can happen, and not all of them are so nice or so acceptable. We can have DDoS attacks. We can have Bitcoin mining, which is quite popular and actually quite bad for our clusters, as it reduces their working life quite a lot. We really don't want this. So the next question is: how can we detect these activities? We came up with a very, very simple method that reuses a component already existing in OpenStack: Ceilometer. We use Ceilometer to get billing information, and we are going to try to reuse that data for something else. So that's the formula: we take Ceilometer, add some magical machine learning, and we get (real-time) fraudulent activity detection. Real-time is in parentheses because of course you can also do it offline, saving the data and analyzing it in a later experiment, but it is possible to do it in real time. We divide the method into three steps: collect, then classify, then counteract, taking action. First, collect. That's where Ceilometer comes into play. Ceilometer provides us with many different meters; we wanted to use a minimal amount, only the default ones, without changing too many configurations.
So we end up with CPU utilization, network information, and disk information, which is privacy-friendly data. You get aggregates, you get numbers; you don't have access to the users' data, so you are not spying on them, which is quite important here. Also, for the purposes of the investigation, we changed the collection rate to five seconds so we could experiment with different rates; in production you would probably use something a bit longer. This is the data once it's processed. We have the headers; the first column is the type of activity, the class, and then each line represents five seconds of activity for one resource in our cluster, say one VM or one volume. So what kind of jobs did we use for our experiment? We tried to keep it simple and not use too many jobs, dividing them into three categories. First, general workloads: we chose Hadoop running HiBench. HiBench is a quite complete benchmark suite with real-life workloads such as PageRank, and it stresses all the different components of Hadoop, so it gives us quite a lot of diverse data. Then, since we decided to focus on detecting Bitcoin mining, which is quite CPU-intensive, we added an acceptable CPU-intensive operation, which could be a scientific calculation; in this case it was just the Linux stress tool stressing the CPU. Then we have fraudulent activities and failures. We included failures because we wanted to see whether we could detect something there too. We had an internal DDoS attack, a very, very simple ping-flood between several VMs inside the same tenant to overload the network a little, and then Bitcoin and Litecoin mining. For failures, we added a physical network failure. Once we had decided on all these jobs, we collected the data, and once we had the data, we were able to plot it. So let's see what the data looks like.
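Before the plots, the collect-and-pivot step just described (one labeled row per resource per five-second window, class first) can be sketched roughly as follows. The meter names, the stubbed fetch, and the labels are illustrative assumptions; a real deployment would query the Ceilometer API (for example via python-ceilometerclient) instead of the stub.

```python
import csv
import io

# Minimal default-style meter set, as in the talk (exact names assumed).
METERS = ["cpu_util", "network.outgoing.packets.rate", "disk.write.bytes.rate"]

def fetch_window_samples():
    """Stub for one five-second Ceilometer poll; a real deployment would
    query the API here. Returns (resource_id, meter, value) triples."""
    return [
        ("vm-1", "cpu_util", 97.3),
        ("vm-1", "network.outgoing.packets.rate", 120.0),
        ("vm-1", "disk.write.bytes.rate", 4096.0),
        ("vm-2", "cpu_util", 12.1),
        ("vm-2", "network.outgoing.packets.rate", 9800.0),
        ("vm-2", "disk.write.bytes.rate", 512.0),
    ]

def window_to_rows(samples, labels):
    """Pivot one window into the training layout: class label first, then
    one column per meter, one row per resource (VM, volume, ...)."""
    by_resource = {}
    for resource, meter, value in samples:
        by_resource.setdefault(resource, {})[meter] = value
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["class"] + METERS)
    for resource in sorted(by_resource):
        values = by_resource[resource]
        writer.writerow([labels[resource]] + [values.get(m, 0.0) for m in METERS])
    return out.getvalue()

# Ground-truth labels are known only for the training runs.
table = window_to_rows(fetch_window_samples(), {"vm-1": "mining", "vm-2": "ddos"})
print(table)
```

At classification time the same pivot is used without the label column; the classifier fills it in.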
This is the plot for the Hadoop job. As I said, HiBench has quite a diverse range of activities. We are plotting CPU utilization versus the outgoing network packet rate. There are quite a lot of points, and it's quite sparse, so as you can guess from here, it could easily be confused with other types of jobs. Let's move to the next one, the CPU-intensive one. Here, as I think you can see, the CPU is extremely busy; in fact, since OpenStack by default is configured to over-allocate resources, we are using more than 100%. We are using some network, but not too much, and the same goes for disk. Now the DDoS attack. Of course, network usage is extremely high, and we are also using a fair amount of CPU, although not much. This one seems quite concentrated up there, so it should be easy to detect. Next, Bitcoin and Litecoin mining. Initially we tried to treat them as separate classes and distinguish between Litecoin and Bitcoin, but this proved to be impossible, so we merged them into a single class, treated as cryptocurrency mining. As you can see, it uses quite a lot of CPU, and if you look at the vertical stripes, there are network patterns: the same packet size repeated several times. Finally, we have the network-down case, which is the least interesting. There's actually nothing to see here: zero network and almost no CPU, just a little because the VM is running. For that one, we simply unplugged the network cable. Very simple. So let's see how they look together in a single plot. That's all the data together. As we can guess from here, it's going to be fairly easy to detect the DDoS. Hadoop will probably present some problems, as it overlaps with Bitcoin, and Bitcoin overlaps with the CPU-intensive operation.
So I'm going to hand over to Julio now, who's going to talk about step two.

Thanks. As you saw there, of course, we're only showing two dimensions here; we have a little more than that. But the problem is not completely trivial. In some cases, yes, it's pretty easy, but in others it's far from trivial. So we have to apply some sort of machine-learning algorithm that will do a smart job of classifying. But which one? We have plenty of options here: some sort of neural network; support vector machines, which is, by the way, my favorite, and the one I was betting my money on to perform best; other approaches like random forest, Naive Bayes, or whatever. And of course the idea is to connect the best-performing of these with the data collection and generate some sort of pipeline. For that, in this particular case, we used a relatively new machine-learning tool, Orange, which is free and developed by a group of academics at the University of Ljubljana in Slovenia. It has some advantages over other, perhaps better-known machine-learning toolkits; you may be more familiar with Weka or scikit-learn or something like that. But Orange was our toolkit of choice in this particular situation. Why? Because it's really very, very easy to use, and we wanted to propose something with almost no entry-level prerequisites. It's the first, I would say, powerful-enough data-mining tool that offers drag-and-drop abilities: you just pick components, drag and drop them, and magically they work. Well, not magically, of course; underneath there's a lot of Python and a lot of C++. But it's still pretty good.
Basically, what this shows is the data entry here, feeding a number of different algorithms, because we are algorithm-agnostic: we just try them all and hope for the best. We used some of the natural approaches: SVM (support vector machines), principal component analysis, random forests, Naive Bayes, logistic regression, neural networks, and so forth. You can test and compare them and produce very nice-looking output, checking their merits with everything from ROC curves to confusion matrices. So it's a pretty nice tool, and it's free. I'm not related to it in any way, but I highly recommend it. Anyway, after some trials we got this. It's interesting to see that all of them perform quite well, I would say, but one really outperformed the rest: the random forest algorithm. Random forest is an ensemble learning algorithm that trains multiple decision trees and outputs the mode of their classifications. It's kind of a hot algorithm in data mining, and it consistently gives you good results across a variety of problems. Not that the rest do poorly, by any means; for example, a simple, classical, good old neural network achieves nearly 80%, which is pretty good, particularly after only five seconds of data. Just to let you know, we used 10-fold cross-validation here, which is the technical term in data mining for saying, basically, that we are not cheating: we are using the data but not memorizing it, we are not overfitting or overtraining on the data we have, and we are not cherry-picking. With 10-fold cross-validation, these results can be generalized and can be expected to be useful.
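The talk ran this comparison in Orange's drag-and-drop canvas, but the same test can be scripted. The sketch below uses scikit-learn on synthetic data standing in for the real Ceilometer windows; the three feature columns, the class names, and the cluster parameters are all made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Fake 3-feature windows (cpu, net, disk) for three roughly separable classes.
X = np.vstack([
    rng.normal([95, 10, 5], 5, size=(200, 3)),    # "mining": very high CPU
    rng.normal([30, 95, 5], 5, size=(200, 3)),    # "ddos": very high network
    rng.normal([50, 40, 40], 25, size=(200, 3)),  # "hadoop": diffuse, sparse
])
y = np.array(["mining"] * 200 + ["ddos"] * 200 + ["hadoop"] * 200)

for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    # 10-fold cross-validation, as in the talk: train on 9 folds, test on 1.
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

On the real data the talk reports random forest winning; here the synthetic clusters are easy enough that both models do well, so the numbers are only a demonstration of the procedure, not of the result.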
So yes, this high value here is exactly 97.5% accuracy, which is pretty good; I was surprised myself, to put it mildly. OK, so we can classify. Next step: take action. Once we have this mostly correct classification, the idea, after deploying and running this and getting some results, is to trigger some sort of alarm for further investigation: something fishy is probably going on, so you need to put in some extra time checking what's happening. Perhaps you can automate some responses, like decreasing the quota, or even stopping the resource, or blocking the user altogether. That might be a bit of an overreaction, but depending on how you have configured your system, it might be the correct one, and you will at least have a good base, a good case to make for these actions. Anyway, the good thing about this approach is that it works pretty well with a small number of workloads. Of course, you can say: basically you're classifying whatever enters your pipeline into five different classes, from DDoS to cryptocurrency mining to network failure, only five. How will this scale if, instead of classifying into five labels, you want 20 or 100? Of course you won't achieve the same results; that should be pretty obvious. We haven't tested yet how the performance decreases with more labels, but I'm pretty confident that it will degrade gracefully and won't simply collapse. The good thing is that it enables very quick and very reliable detection of something going on in my cloud. Can we improve these results? Yes, a little bit at least, by using something like a meta-classifier.
"Meta-classifier" is probably too big a word for the basic idea, which is repeating this many, many times and taking decisions not every five seconds but, say, after an hour: collect yourself after an hour and see what's going on. With 97.5% accuracy after five seconds, you can get very, very close to 100% after an hour, particularly taking into account that many of these activities won't run for just five seconds. If someone is abusing your cloud and mining Litecoins, there's really not much point in mining for only five seconds; that wouldn't generate much profit. People will be mining for hours, days, weeks, and so forth. So you have to see these figures in context: random classification would give you 20%, because we have five labels, so 97.5% is pretty good. There's also a case to be made that we don't need more labels. A more fine-grained classification would probably be a little bit creepy for our users; we would know a little too much about what's going on, and that's probably not healthy, particularly in this post-Snowden era. Speaking of privacy, I think this is very important. This technique, with all its pros and limitations, has the good point of not being invasive at all. We're just milking data that we would collect anyway because we need it for billing purposes. We are just reusing the minimal data required for billing; we are not spying on our customers in any way. If we can milk that minimal set of data and put it to good use, as we show here, that's great, but we are not invading anyone's privacy.
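The voting scheme described above (deciding after many five-second windows rather than per window) can be sketched as follows; the label strings are illustrative, and the probability estimate makes the simplifying assumptions that windows are independent and that errors are a single "wrong" outcome.

```python
from collections import Counter
from math import comb

def majority_vote(window_predictions):
    """Most common label among the per-window classifier outputs."""
    return Counter(window_predictions).most_common(1)[0][0]

def prob_majority_correct(p, n):
    """P(the correct label wins a majority of n windows) if each window is
    classified correctly, independently, with probability p. Simplified:
    every error counts as one wrong vote for a single rival label."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# One minute of 5-second windows, with a few misclassifications mixed in.
preds = ["mining"] * 9 + ["hadoop"] * 3
print(majority_vote(preds))  # -> mining

# With per-window accuracy 0.975, even a dozen windows make the majority
# verdict overwhelmingly likely to be right; an hour (720 windows) more so.
print(prob_majority_correct(0.975, 11))
```

This is why the per-window 97.5% figure compounds toward 100% over an hour, as long as the activity actually persists that long.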
Another advantage of working only on these aggregates is that if everything is encrypted, we can still use this technique, of course with some slight modifications: the amount of data sent and received could be slightly changed by padding when encryption is used, but it will be a really small difference. You can also train your machine-learning algorithms on encrypted traffic instead of the plain traffic we used here. So there are many advantages to not being too privacy-invasive, I would say. Of course, if we used more meters instead of this minimal default set, we would get a better classification, that's for sure. But we believe, in our humble opinion, that this is the right trade-off: not too invasive, not doing anything particularly creepy that would put off our users, and at the same time gathering as much intelligence as we possibly can. So yes, I think that's all. Do you have any questions?

Well, in the real world, what you would do when you detect something going wrong is investigate it further. You can use more tools, such as specialized networking tools; then you can invade the user's privacy if necessary, investigate more, and then take action. Yes, it's an easy and cheap way to raise a flag that there might be something going wrong there. And another point I forgot to mention: you can see this as a practical approach that you can implement yourself in your systems, but you can also look at it as a customer and ask: why is this particular company collecting more data? Ah, they claim it's for security purposes. I would say no, because these guys showed that with the minimal billing data they should be more than capable of doing at least a first classification of right or wrong.
So hopefully you can use this to say: no, I'm not disclosing this data, or I don't want to enter into this contract, because it's too invasive.

Also, say you are a user of OpenStack. You don't need to rely on your cloud provider to do this. You could get your own Ceilometer data and run this locally in your company, or in an extra VM, and monitor your own services. Say you have a website: you could monitor for zombie machines using this, without relying on your cloud provider offering the service or having to pay them extra. That's a good point. Sometimes your services have been compromised without you knowing: you're paying for computation, hopefully achieving something profitable for you, but your system has been compromised and, without you knowing, you're mining bitcoins for someone in some country; I don't want to name countries. That's really not very good, and this could offer, if not the final solution, at least some hints that something fishy is going on.

Oh, and not for this project. We've done that in the past, but not for this project; here we tried to use only the default data you could find in Ceilometer as of Icehouse. I'm not 100% sure, but I think that as an OpenStack user you can access your tenant's data in Ceilometer; at least, I think you can from the dashboard. Right. So you can pull the data every minute, every two minutes, or every hour if you want, get the data, and analyze it. You could even automate it, but you'd have to download the data if you are running it outside the cloud.

The training set was about 36,000 samples, about 50 hours' worth of data. I don't have it in the slides, but I have the confusion matrix and can show you afterwards if you want. OK then, thank you everyone, and that's the end.