Hello. Hi, I'm Yuval. I work for Red Hat, and I prepared this talk together with a colleague who sadly couldn't stay for this Friday, so I'll do the presentation myself. I work on the Ceph storage project, so let's see what we have today.
A little bit about the agenda. I'm going to be pretty much all over the place; there are a lot of things I'm going to touch upon, and hopefully everything will eventually converge into one nice demo. First the background: a little bit about ransomware. A big disclaimer here: I'm a storage developer, not a security expert, so please take everything I say about security with a grain of salt. Then a little background on information theory, and then the architecture. I'm going to cover Ceph, I'm going to cover Rook, I'm going to cover what I did with Lua, and then hopefully tie everything together, explain it, and demo it.
Okay. So first, ransomware. Ransomware is here to stay. Data is one of the most important assets an organization has, and criminals, or whoever wants to take advantage of that, will do so. It is such a powerful tool that it's not going away. If you think you're going to build strong walls around your organization, you're wrong. The first line of defense will break: via social engineering, somebody inside the organization, or n-day or zero-day attacks, the malware will get in. So you cannot assume that the malware will just stay out. The important thing is to think about what you're going to do once it's in.
Now, one important thing to know about ransomware: it's not as if the moment the malware crosses into the organization, everything is suddenly encrypted and you have no access to your data. There's an incubation period, and it exists for many reasons.
First of all, the ransomware usually gets into the organization through one point, and then it needs to spread across the entire place, and that takes time. The encryption process itself also takes time. And more importantly, they know that people have backups. They know that you have snapshots, and usually an operator of the ransomware, a real person who controls things from somewhere, will check what your backup systems are and what your retention period is. Let's say you have a one-month retention period: they would find that out, wait for one month, and encrypt your data in a way that keeps it accessible to you. Then, once they know that all your snapshots are encrypted as well, this is when they'll pull the plug. By that time all the data is already encrypted, all of your organization is covered by their malware, and your snapshots don't help you at all because they're encrypted as well.
So the incubation period is the time when you can still save your data, if you detect that there's malware at work. That's the big question, because they're going to do everything they can to prevent that. They'll make sure that whatever they're running is well hidden, so that when you scan your hard drives, your antivirus or whatever won't find it. They'll make sure they don't spike your CPUs and don't spawn suspicious-looking processes all over the place. They're going to make sure that they hide.
Now, that doesn't mean they don't do anything. They have to do something, so there has to be some behavior somewhere that we can detect during the incubation period, and that would help us find out that something bad is going on. The behavior we look for has to be something they depend on, because they'll make sure that any behavior that is not mandatory for their operation is not detectable.
The one thing they have to do is encrypt your data, so the encryption process has to happen. Another thing we'd like to see in this behavioral analysis is an anomaly. If it's business as usual, if we don't detect anything different from the usual operation of our system, then we have no way to know that an attack is going on. So we're looking for anomalies, we're looking for something that they have to do, and ideally, and that's the last point, we'd like the detection mechanism to run somewhere other than the infected machines. Usually the infected machines are the desktops and maybe some other servers in the system. I won't say always, but in most cases the storage servers are not infected. Storage servers are administered by fewer people, they have stronger passwords, nobody is clicking on the wrong email attachment on a storage server, and the attack surface is much smaller. So we can hopefully assume that the storage server itself is not infected and is not controlled by the malware. These are the assumptions behind the behavioral analysis in storage that we're going to leverage here.
Okay, so I'm going to switch to a different topic and then tie it back in: entropy. By definition, entropy is the amount of information, or uncertainty, that a random variable has. A random variable is just something that takes different values with certain probabilities: it has different events, and each event has a probability. When we talk about the information, or uncertainty, of one event of a random variable, the definition is minus the log of the probability of that event. The log is in base two. Why base two? Because the events we're talking about are either zero or one. Those are binary events.
So when you talk about an event, something is either a zero or a one, with a certain probability for zero and a certain probability for one. Minus the log of that probability is the information of the event. And the entropy is the average information of the random variable: I go through all the events, calculate the information of each, which is minus the log of its probability, and take the average, which means multiplying each by its probability. So it's a weighted average. That's the definition of entropy.
This is pretty theoretical, and as an engineer I don't like that kind of thing, so I'll take an actual example. Let's look at a random variable that is just a string. There is one thing that's a bit different here: the string is made up of the 26 letters of the English alphabet and a space. So my events are different letters or a space. Those are not bits, not zeros and ones, so I'm going to normalize. In theory I should use a log whose base is the number of possible events, but instead of using log base 27, I'm going to divide the whole thing by log base 2 of 27. This is a logarithmic identity. It works, trust me.
So let's look at one of those strings, "hello world". I can calculate the distribution for it: each letter, a to z, and the space has a certain probability of appearing in the string "hello world". That's pretty easy to calculate. Then I plug that into the formula and calculate the entropy of "hello world", which is about 0.6. Now I'm going to take a different string, a different random variable: "the quick brown fox jumped over the lazy dog". I picked this string because, as you probably know, it has all the letters of the English alphabet and some spaces. It's not that every letter appears exactly once.
Some letters appear more than once. But this is quite close to what we would consider a uniform distribution. In a uniform distribution, the probability of each event, each letter in this case, is the same. Here it's not exactly the same, but it's pretty close: a couple of letters appear more than once, and the space appears more than once. If you apply the same formula and calculate the entropy of this sentence, you see that it is much closer to 1. I hope this made more sense than the theoretical definition of entropy.
So how can we use it? We're not talking about strings, we're talking about files. The random variable is now a file, and the file contains bytes, so it's not 27 values but 256. That's my space of events: one of 256 values. But everything else is the same, all the calculation is pretty much the same. The closer the distribution is to uniform, the higher the entropy. Compressed files have higher entropy, and an encrypted file has even higher entropy. Why is that? When you encrypt something, you want whatever you've encrypted to convey as little information as possible about the content. If the information is visible, then it's clear what it says, and you don't want it to be clear. So you want to make the output as close to a uniform distribution as you can. This is why the output of strong encryption is close to a uniform distribution, which means a much higher entropy, an entropy much closer to 1.
Now, if you wrote some ransomware, you would want strong encryption. If your encryption is weak, the organization won't pay you; they'll pay somebody else to break your encryption. And if your encryption is strong, it means you have significantly increased the entropy of those files.
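To make the entropy calculation from the talk concrete, here is a minimal Python sketch of it. The function name `normalized_entropy` is made up for this illustration and is not taken from the speaker's code (which is in Lua). It computes the weighted average of minus log base 2 of each symbol's probability, then divides by log base 2 of the alphabet size, as described above.

```python
import math
from collections import Counter

def normalized_entropy(symbols, alphabet_size):
    """Shannon entropy of the observed symbol distribution, scaled to [0, 1].

    Hypothetical helper: the probabilities are the relative frequencies of the
    symbols in the input, and the result is divided by log2(alphabet_size).
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Weighted average of the information (-log2 p) of each symbol.
    bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return bits / math.log2(alphabet_size)

# 26 letters plus a space gives an alphabet of 27 symbols.
hello = normalized_entropy("hello world", 27)
fox = normalized_entropy("the quick brown fox jumped over the lazy dog", 27)

# For files the symbols are bytes, so the alphabet has 256 values.
uniform_bytes = normalized_entropy(bytes(range(256)), 256)

print(f"hello world: {hello:.2f}")
print(f"quick brown fox: {fox:.2f}")
print(f"every byte value exactly once: {uniform_bytes:.2f}")
```

The first string should come out at about 0.6, matching the number quoted in the talk, the second one noticeably closer to 1, and a perfectly uniform byte distribution at exactly 1.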
Now, the heart of the problem is telling compressed files from encrypted ones. A text file has very low entropy: you can see the log file here at 0.6. Compressed files, like the JPEGs and other files here, have pretty high entropy. And lots of the files we have, whether it's PDFs or docs or things like that, are compressed. So the main issue is how to distinguish compressed files from encrypted files. You could use file types, but the file type is difficult to detect, especially if the malware is making that harder for you. And an absolute entropy threshold, where above it a file is encrypted and below it the file is compressed or plain, is very hard to come up with.
So what I'm going to look at instead is the changes. It's just bytes: I chunk the file into bytes, and those can be binary values, it doesn't have to be a text file. So what happens when you encrypt? I've encrypted those files, and you can see that for most of them there's a significant increase. For the log file, of course, it's a huge increase. For the JPEGs and the PDF, there is an increase. For some files, you can hardly notice the increase.
This is where another thing comes into play: the fact that I'm looking at a directory. When I move to object storage, that would be called a bucket, but it's the same concept. The point is that I don't have to be right for each and every file. If something is going on that encrypts my whole directory, I'll see a difference. Maybe I won't detect it for all the files, but I will for a couple of them. So using the difference in entropy, and looking at an entire directory, really helps me here.
Okay, so that was entropy. We'll get back to that later.
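The effect described here, a large entropy jump for plain text and a barely visible one for already-compressed data, can be reproduced with a small Python sketch. Everything in it is an assumption made for illustration: the generated log text, the file names, and `fake_encrypt`, which simply returns random bytes of the same length as a stand-in for the output of a strong cipher.

```python
import math
import os
import random
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Entropy of the byte distribution, normalized by 8 bits (log2 of 256)."""
    if not data:
        return 0.0
    total = len(data)
    bits = -sum((c / total) * math.log2(c / total) for c in Counter(data).values())
    return bits / 8.0

def fake_encrypt(data: bytes) -> bytes:
    # Stand-in for real encryption: random bytes of the same length, which is
    # what the output of a strong cipher looks like to a byte histogram.
    return os.urandom(len(data))

# Synthetic "directory": a log-like text file and a compressed copy of it.
rng = random.Random(0)
log_text = "".join(
    f"2024-01-01 INFO request id={rng.randrange(10**6)} latency={rng.randrange(500)}ms\n"
    for _ in range(5000)
).encode()
files = {"app.log": log_text, "archive.zlib": zlib.compress(log_text)}

deltas = {}
for name, before in files.items():
    old = byte_entropy(before)
    new = byte_entropy(fake_encrypt(before))
    deltas[name] = new - old
    print(f"{name}: before={old:.3f} after={new:.3f} delta={deltas[name]:+.3f}")
```

The text file shows a large delta while the compressed file shows almost none, which is why the approach in the talk looks at changes across a whole bucket rather than trying to classify each object on its own.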
Ceph storage. Ceph is a free and open source storage system. It's free from vendor lock-in. It's software-defined: we don't have any specific hardware requirements for the actual storage devices, and it's not an appliance. It's open source, so you can read it, change it, and so on. And the last bullet is what I really want to focus on later: it's also open-ended. We're working towards making it open-ended. Why is that? In theory, you can fork the Ceph repo, and if you're an experienced C++ programmer and you're not afraid of a huge project with more than two million lines of C++ code, you can change it. But if you are afraid of that, you're right to be, especially if you're a user. You're good at writing Python code; you don't really want to mess with Ceph internals. So we as developers need to make sure that you can tweak our code. We have to provide mechanisms for changing what we've done, because we have to give you some power here. This is why we've worked towards being open-ended, and I'll elaborate on that in a second.
Another point about Ceph, just to focus on which area of Ceph we're discussing: Ceph is a unified storage system. The same back end, called RADOS, is used for file storage, object storage, and block storage. I'm not going to talk about file storage and block storage today, only object storage. There are certain characteristics of object storage that make the whole thing easier: they make the mechanism, the algorithm, easier to implement. So I'm going to focus on object storage. We have something called the RADOS Gateway. This is an S3- and Swift-compatible front end that gives Ceph its object storage functionality. Now, circling back to open-ended.
Okay, so we need a couple of things in Ceph to make it more open-ended. We created something called object classes: you can write C++ or Lua code, compile it or upload it, and it runs co-located with the storage itself. This works for all types of front ends, file, block, and object, because it really runs at the back end. We also have something called bucket notifications. Bucket notifications are more for out-of-band processing: whenever you upload something, a notification is sent to some external system, and this external system can do whatever it wants. It could be a serverless function in Knative, KEDA, or AWS Lambda, and it can go fetch the object and do whatever processing it wants. The thing is, this is out-of-band: by the time it gets the notification, the object is already in the store. So for the use case I'm going to talk about, which is ransomware, it might be a little too late. It's great for tons of applications, but we need something that is inline, running in-process, so that the object will not reach the actual storage system before it has passed through the execution of the script. And this is the Lua scripting in the object gateway.
Another thing here: Ceph is a big beast. It's complex to maintain and complex to install. But for that, especially if you're running in an OpenShift or Kubernetes environment, and I guess if you're at this conference you probably know about that, we have Rook. Rook is an operator with which you can install, deploy, and, even more important, manage Ceph. You can create buckets. You can even configure bucket notifications. There are lots of things you can do. Everything is as easy as YAML, everything is declarative: you just write the YAML, there's no order in which you need to deploy things, and eventually everything is configured and falls into place, except the Lua scripting.
Because it's a new feature, I usually first develop that in the actual C++ code, and then I switch gears and write some Golang code so you can use it nicely in Rook, but I haven't done that yet. So for now you have to manually upload the Lua script and do all that work. But I'll get to it.
Okay. So first of all, why Lua? Hopefully I've convinced you that we need some mechanism that is not C++, so that you don't need to recompile and test Ceph and are still able to change the inline, in-process behavior of the RADOS Gateway. Lua is a mature and powerful language. It's not a very common language; I didn't check the TIOBE ranking, but it's not ranked very high. It is very easy to learn, though: even if you don't know Lua, you can figure things out and write Lua code in no time. It is very lightweight, which is very important because of concurrency. We're handling tons of requests at the same time, and we don't want to create one Lua VM and have locks and all kinds of problems like that. So we spin up tons of Lua VMs. A VM is very lightweight, it spins up in no time, and its resource consumption is very low. It also integrates very efficiently with C and C++. I don't know if you've ever integrated between languages; with C++ and Java, say, through JNI, you have marshalling and unmarshalling and all kinds of problems like that. In Lua, you have none of that. You just pass pointers back and forth, with zero copies of any information.
Lua was first invented to do something in the energy industry, but in the past several years it has been very commonly used in the gaming industry. You can script World of Warcraft in Lua. And if you have young kids, or if you are young at heart, you know Roblox: it's a great game development platform that uses Lua.
And the reason it's used in gaming is the low-overhead, zero-copy nature of the interaction between Lua and C or C++.
This is an example of code. Later on I'll show the real code; this is just a simple example. It's all about the context. It's not as if the script can run anywhere: I have to create a certain context with certain bindings between my C++ code and Lua. The context I have here is the request context, which means that whenever I upload an object to the object store, that's a request, and it has fields. Some of those fields are read-only, some of them are writable. For example, I'm checking here whether the operation type of the request is a put object, and if it is, I can change something in the metadata of the request, as if the client had sent me some new piece of metadata, and write that. I have an RGW table. This is a global table in which I can share information between requests or between different contexts. Here I'm also showing that I can increment and decrement values in the table. I have a couple of tools here, and this is evolving.
Just as an example, a community user posted that he wanted to change the storage class based on the size of the object. I said, okay, you can do that in Lua. So he checks the size of the object, which is a field he can read, and he actually made the storage class field writable and modified it. If it's a large object, you put it in cold storage; a small object goes to regular storage. So this is a simple application that somebody wrote in about three lines of Lua. This is what we do with Lua scripting.
Now let's try to tie everything together. This is a description of the algorithm. Please read the yellow cloud at the side: it's probably wrong. Again, I'm not a security expert, and even if I were, it takes lots of research to get this right.
But the beauty of it is that changing the algorithm means writing a couple of lines of Lua and uploading a new script to the server. You don't need to recompile. You don't need to take the server down or bring it up again. You just upload a new script and it works.
So here's the idea for the ransomware detection. I have this global table, and for each bucket I hold a state in it saying whether the bucket is infected or not. I'll show later how we figure that out. For each bucket I also have a quarantine bucket. The reason for the quarantine bucket is that I have false positives. If somebody is uploading an object, I don't want to block that. Even if I think the bucket is infected, maybe I'm wrong, and this is information; I don't want to get rid of it. So if I think the bucket is infected, I put the object in the quarantine bucket, and later on somebody will figure out whether it's okay or not. The information is not lost, but I'm not overwriting the good object with the encrypted one.
If the bucket is not marked as infected, I calculate the entropy of the object being uploaded. Now, an object could be huge: it could be a VM image of several gigabytes, or a movie, or whatever. I don't need to calculate the entropy of the entire object. I just calculate it for the first chunk of the object, usually four megabytes, but it could be smaller. Then I need to see whether I already have this object. If it's a new object, I have no idea: the entropy could be high, it could be low, who knows? I'm just going to write the object, and if it is encrypted, well, tough luck. If I do have the object, then I saved its entropy before, and I can compare the two and see whether the difference crosses the threshold.
If the difference crosses the threshold, it still doesn't mean that I'm infected with ransomware, because it could be one object out of a hundred or a thousand or a million objects, and then it doesn't mean anything. So I update something in the global table and then check whether I've crossed a certain rate or percentage, whatever you choose to put in the algorithm. Say 50% of the new objects in this bucket are suspicious, meaning they had an increase in entropy: maybe something is wrong. I mark the bucket as infected, and the next object to be uploaded will be quarantined. In any case, I also store the entropy in an attribute of the object, so that next time I have something to compare against. That's the overall idea of the algorithm. As I said, it's probably wrong, but it's easy to change and easy to play around with, and that's the real power I want to show here.
One more step, maybe. Yes? I'll repeat the question: what if we're encrypting everything on purpose? If we're encrypting as part of the storage process, and that is possible, you can encrypt objects as they're being uploaded to storage, then the script sees all the information prior to encryption. So this is fine. The RADOS Gateway can encrypt data, but that happens after the script runs, so here we're okay. If somebody is encrypting everything on the desktops because they decided that everything has to be encrypted, well, somebody has to know that they're encrypting the organization, and that this is therefore not malware. I don't have any way to tell the two apart, right? But one thing I haven't mentioned is that you can send emails from Lua. You can do all kinds of things from Lua. So I can send an email to the sysadmin telling them something is wrong.
But then you say, oh my goodness, I'm running a full encryption on the whole system now, this is fine. So there has to be some human involvement in this case. Thanks for the question.
As for the next steps: as I said, those thresholds are very hard to figure out. One thing we can do is use bucket notifications, so that some external system reads the objects and does some offline processing to figure out, especially if you can feed the information into it, whether there is a real infection or not. So do some machine learning and figure out better thresholds or a better algorithm. But this is really outside the scope, just an idea.
I still have a couple of minutes, so I'll try to do a live demo. You can look at this slide later. I didn't invent all these ideas about using entropy; there are a couple of papers. It's under debate: some people say it's good, some say it's bad. There's research around that, so you're welcome to read those papers and decide for yourself.
Okay. I know that in the presentation I said everything is nicely done in Rook, but running a Kubernetes cluster on the demo machine looked too risky to me, so I do everything with Ceph without Rook. If anyone is interested, there is a repo with all the details of the demo, which I can send you, so you can reproduce it using Rook on your own machine. But I'm not going to do that now. I have a Ceph cluster running here on my laptop. The first thing I'm going to do is create a bucket, so I have a bucket called home. I also need to create another bucket, which is the quarantine one. Now I have the two buckets, and I want to put the scripts. First of all, let's have a look at the scripts. I hope this is visible. Maybe I'll increase that a little bit. Okay.
The first script, the quarantine script, checks whether the entry in the global table named after the bucket plus "-quarantine" is true, which means I set it to true previously in the other script. If it is, I change the bucket name to the new bucket, which is the same name plus "-quarantine", and the system then writes the object to that bucket. The whole thing works because this script runs in the pre-request context. So it's done somewhat backwards: I first need to figure out whether there is ransomware and set the flag, and then the next object coming in will be redirected. After I've already read everything and done all the processing, it's too late; I cannot switch buckets. That's why it's done this way. So this is the simple one.
Now let's look at the one that actually implements the detection. The first function here calculates the entropy. It's pretty much an implementation of the formula we've seen: it first builds the distribution, the probabilities, and then goes through them and calculates the average. That's the entropy calculation.
Then I have the more complex logic, which is detecting the ransomware. First of all, if I'm writing something to the quarantine bucket, I don't need to calculate anything. It's already quarantined. This is fine. I mean, not fine, but it's there already. If something is not a put object, I don't care; I only need to look at put object. I have something here called an upload ID. Objects are quite often not uploaded as a whole; they're uploaded in chunks, and I want to do the calculation only on the first chunk. That's why I check the upload ID: I only do the calculation on the first chunk being uploaded. I calculate the new entropy. Sometimes there's no data in the object or whatever; I don't care about that. Then I get the current entropy.
If there is a current entropy, which means this is an update to an existing object, I check the difference. If the difference crosses some threshold, I increment a count in the global table. And if the count crosses a threshold, it means that this share of the objects in the bucket is suspicious, so I mark the bucket as quarantined. If you look at line 86, this is what the other script is checking. If I don't have a current entropy, it's a new object, and I don't care: I just store the entropy, and that's it. So this is pretty much the code. It is 100 lines of code, but if you look at it later on, you'll see it's not that complex. Even if you don't know Lua at all, you can read the code and pretty much figure out what's going on.
So let's upload that. I'm going to upload the quarantine code into the pre-request context, and the ransomware code into the data context. Data is when I have the data of the object and can read it. Now I have everything set up.
Okay. I have a directory here called home. It has a couple of files, you can see them: images, docs, some text files, whatever, an Alpine Linux ISO. I'm going to upload them into the home bucket. Now, I have a WannaCry script. Not really, it's just a simple encryption. I'm going to encrypt all the files from the home directory. Usually the encryption would overwrite the files, but in this case I want to keep them, so I'm doing that in a different directory. And then I want to upload from the encrypted directory. Let's see what's going on there. So I'm in the encrypted directory, and I'm going to upload the same files, same names, but encrypted, into the same bucket. Let's see what happens. Okay.
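The two scripts walked through above can be modeled in a few lines of Python to see how the pieces interact. This is a sketch of the flow as described in the talk, not the speaker's Lua code: `ToyStore`, `put_object`, and the in-memory dictionaries are hypothetical stand-ins for the object store, the request contexts, and the shared global table. The 0.5% increase and the 20% bucket ratio are the values mentioned in the Q&A; `ABSOLUTE_MIN` and the way the ratio is computed (suspicious overwrites divided by all uploads to the bucket) are assumptions.

```python
import math
import os
from collections import Counter

CHUNK_SIZE = 4 * 1024 * 1024   # only the first chunk is inspected
RELATIVE_INCREASE = 0.005      # 0.5% increase, as mentioned in the talk
ABSOLUTE_MIN = 0.9             # assumed floor; the talk gives no exact value
BUCKET_RATIO = 0.2             # 20% of uploads suspicious, as in the talk

def byte_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    total = len(data)
    bits = -sum((c / total) * math.log2(c / total) for c in Counter(data).values())
    return bits / 8.0

class ToyStore:
    """In-memory stand-in for the object store plus the shared global table."""

    def __init__(self):
        self.objects = {}  # (bucket, key) -> (data, saved entropy or None)
        self.table = {}    # bucket -> upload counters and quarantine flag

    def put_object(self, bucket, key, data):
        # Writes that already target a quarantine bucket are stored as-is.
        if bucket.endswith("-quarantine"):
            self.objects[(bucket, key)] = (data, None)
            return bucket
        state = self.table.setdefault(
            bucket, {"uploads": 0, "suspicious": 0, "quarantine": False})
        # "Pre-request" step: a flagged bucket redirects to its quarantine twin.
        if state["quarantine"]:
            target = bucket + "-quarantine"
            self.objects[(target, key)] = (data, None)
            return target
        # "Data" step: compare the new entropy with the one saved earlier.
        state["uploads"] += 1
        new_entropy = byte_entropy(data[:CHUNK_SIZE])
        previous = self.objects.get((bucket, key))
        if previous is not None:
            old_entropy = previous[1]
            grew = new_entropy > old_entropy * (1 + RELATIVE_INCREASE)
            if grew and new_entropy >= ABSOLUTE_MIN:
                state["suspicious"] += 1
                if state["suspicious"] / state["uploads"] >= BUCKET_RATIO:
                    state["quarantine"] = True  # takes effect on the NEXT upload
        self.objects[(bucket, key)] = (data, new_entropy)
        return bucket

store = ToyStore()
names = [f"doc{i}.txt" for i in range(10)]
for name in names:
    store.put_object("home", name, (f"quarterly report {name}\n" * 200).encode())

# Overwrite every object with random bytes, standing in for encrypted content.
results = [store.put_object("home", name, os.urandom(4096)) for name in names]
print(results)
```

With these numbers the first three overwrites still land in `home` before the ratio reaches 20%, and the remaining seven are redirected to `home-quarantine`. That mirrors what the demo shows: a few objects are lost before detection kicks in, and the rest of the originals survive.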
So at some point here you can see, it could be that some objects went through, maybe the first ones, and then at some point I reached 22% of the objects that may be encrypted. That's when I switched the flag, and from then on you can see I'm skipping the entropy calculation, because I'm already in the quarantine bucket and don't need to calculate anymore. And just for the sake of it, this is the bucket with the encrypted files. It has the same files, but you can see there's no preview because they're encrypted.
And now the most important thing. Let's say I know that something bad happened and I want to recover. I have a recovery directory, and into this recovery directory I want to download from the good bucket. So I've downloaded, and you can see that in the recovery directory not everything is 100%. Some of the objects are still encrypted, because they went through before I figured out that I have a problem, but some of them aren't. And this is just a handful of files. At a larger scale, when you have hundreds of objects, you would lose some percentage of them. But overall, compared to the previous case, you can see that some of them are actually okay. So that's the demo, and that's the presentation. I'll be happy to take questions.
Yes? Just to repeat the question: did I play around with the threshold, and is it a percentage or an absolute value? Yes, I'm using a percentage of change. You can see that in the code, actually. As I said, I played around with the threshold so that it would work. It could be that in real life you need to play around a little more; you probably need to do some proper research to get the right threshold. So for example, here, I have the, yeah.
So I'm checking whether the increase is more than 0.5%, but I'm also checking whether the absolute value is above a certain level. If somebody had a text file and changed the text, the entropy could change, but overall it would still be small. So the entropy has to be over some threshold to even be considered encrypted, because encrypted files are usually at 0.99 at least. And you also need to see the change. You can also play around with the bucket threshold; it's 20% here. This is pretty arbitrary, and you should be able to find better numbers.
Yes? That works if you know your payloads. It's true for some payloads, but not for others. For example, many of our deployments are data lakes, right? You have tons of sensors and logs and things that gather data, and they're all poured into this huge data lake that Ceph implements. So you would see tons of uploads. Maybe not so many overwrites of files, but those could happen as well. If you know your payload quite well and it's pretty static, then you can write simpler rules. By the way, you can also implement those in Lua if you want. But especially in environments where the Ceph storage is a data lake, you have lots of very different users. It could be a data lake for an organization where different departments use it for completely different applications. It's very difficult to write one simple rule saying, okay, this is how it should behave, because it's all over the place.
So the question was how compute-intensive it is. It is. This is why I'm only calculating the entropy for the first chunk. But I must say, and I was really happy to hear all the things that went on here at the conference, that my next step would be to be able to run WebAssembly inside the RADOS Gateway.
Once I have WebAssembly, I can implement something really CPU-intensive, like this entropy calculation, in Rust or C++ or whatever language I like, and that would help with performance. Lua is easy, and it's pretty quick, but it's still a scripting language; there are faster languages than Lua.
Yes? I didn't do the proper research, I must admit. In my example I got it right, but that's just because I tuned everything so that it would work. One thing: because of the quarantine mechanism, it's not that bad to get false positives. To figure out the false negatives, I would probably use something like bucket notifications. I would shoot a notification to some external system that has much more horsepower, and that external system, probably some serverless function, would fetch the object, maybe crunch through the entire object, and run more processing that is not inline. So for the false negatives I would probably use an external system via bucket notifications, and as for false positives, because I'm using a quarantine mechanism, I'm not so worried about them. Any more questions? Okay. Thank you very much.