Can everybody hear me? All right. First I want to start by asking: how many of you are aware of what a distributed denial of service attack is? And how many of you have actually been hit by one? All right, we have people who know what it is. So the way the talk is going to proceed: I'm going to talk about what a distributed denial of service attack is and the different kinds, and then I'll talk about mitigation.

First, a bit about myself. Who am I? I'm an operations engineer at Flipkart. I've been doing the internet since 2005, and I've worked with a whole bunch of companies. If you want to reach me, this is my LinkedIn address.

So what is a DDoS? I think this picture is a very good representation: consider the train as the service or website you're hosting, and all the people trying to get on board. A DDoS can either cripple your service or cause a complete outage.

What are the different kinds of attacks? First we have the volumetric attack. It's an attack that tries to consume all the bandwidth you have. It's not necessarily causing an outage of your service; it's just making it unreachable. Very common methods are DNS amplification, SNMP and NTP amplification, SYN floods, and fragmentation attacks.

DNS, as you know, is based on UDP, so one of the biggest challenges is that there is no connection established — no three-way handshake. So what an attacker does is spoof the victim's IP address and send a request. ISC is a very common place these attacks originate from — not a target, but a source of the amplified responses. If you do a dig — I forget what the exact query is — the response you get is about 3 KB in size. The query you send is about 100 bytes, and the response is about 3 KB. So the attacker sends a request with, say, my IP address as the source, and I get 3 KB of data that I never requested.
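The asymmetry behind amplification can be put into a few lines of arithmetic. This is a minimal sketch; the byte sizes are the rough figures from the talk, not measurements of any particular resolver:

```python
# Illustrative arithmetic for DNS amplification: the attacker sends small
# spoofed queries, and the resolver's much larger responses land on the
# victim. Sizes are the approximate figures mentioned above.

QUERY_BYTES = 100        # approximate size of the spoofed DNS query
RESPONSE_BYTES = 3_000   # approximate size of the (DNSSEC-signed) response

def amplification_factor(query: int, response: int) -> float:
    """Bandwidth multiplier the attacker gains per spoofed packet."""
    return response / query

def victim_traffic_mbps(attacker_mbps: float) -> float:
    """Traffic arriving at the victim for a given attacker send rate."""
    return attacker_mbps * amplification_factor(QUERY_BYTES, RESPONSE_BYTES)

if __name__ == "__main__":
    print(f"amplification: {amplification_factor(QUERY_BYTES, RESPONSE_BYTES):.0f}x")
    print(f"10 Mbps of spoofed queries -> {victim_traffic_mbps(10):.0f} Mbps at the victim")
```

With these figures, an attacker pushing 10 Mbps of spoofed queries lands roughly 300 Mbps on the victim — which is why the pipe fills long before any server resource is exhausted.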
It does not hurt me directly — the requests reach my firewall and the firewall drops them — but they are choking the internet pipe.

Second is the SYN flood. How many of you know you can stop it using SYN cookies? People usually suggest using SYN cookies, but if the flood gets fairly large, you start having a problem even with SYN cookies: you run out of the limited backlog you have, and the server then starts randomly dropping sessions.

And fragmented packets — these are a problem for firewalls and systems because they require extra computing power. The system has to reassemble the packet, and only then can it look at the content and decide what to do with it.

Moving on to application layer attacks. These are slightly more complicated. Volumetric attacks were limited mostly to layer 4; application layer attacks are more towards layer 7. Recently, one of the attacks we've seen is a WordPress pingback. I'm not 100% sure of the mechanics, but from what I've gathered, it's a pingback from a WordPress server: if you link a site on your WordPress blog, WordPress sends a pingback request to that address. We saw a couple of thousand such requests coming in recently.

Next, exploiting HTTP compression — this is a hybrid attack. We compress HTTP responses very aggressively, and we saw users sending HTTP requests specifically with compression disabled. It caused a huge spike in our bandwidth — I think we got to 90%, 95% of our internet connection, and the ISP started dropping packets. Not a good experience. Excuse me.

The last one is the incomplete request. As you're aware, an HTTP 1.1 request needs at least two parts of data: the actual request line, like a GET or a POST, and the Host header. You can also have other details, like cookies, but those are optional. So in this case, what an attacker does is send just the first line of data.
He sends the GET line, and then he'll just wait for some time — say five or ten seconds — before he actually sends the Host header. In HTTP 1.1 you finish a request with a blank line, two newline characters in a row, so I can send one and wait another 10 or 15 seconds before sending the last one. The problem with this is that it's consuming sessions at the web server; it's not causing any other visible problem. This is actually a very sophisticated attack. When it hit us, we saw, I think, a couple of thousand requests, and the increase in bandwidth was less than or close to a megabyte. So it's a very hard attack to detect if you don't know what you're looking for, and it took us quite a while to figure out what the problem was.

What are the steps for mitigation? We'll start with mitigation of volumetric attacks. We own our own IP address space, and we advertise it to different ISPs using BGP. So — and this is actually a very specific use case — if the attack is localized to one particular ISP: we are multi-homed, peering with more than one ISP, so if the attack is coming via ISP A, what I do is withdraw my advertisements there. This only works if the attack is directed not at me but at the ISP — when you're establishing peering, the ISP's addresses are in the path as well.

The second is working with upstreams. Again, this is for large volumes. We contact our upstream and have them blackhole our IP addresses for specific regions. An example: say a lot of attack traffic is coming in from an ISP in Ukraine. We can have our upstream provider blackhole us for Ukraine, but it's a little dicey, because you might not get blocked just for Ukraine — it might be the entire Europe. So there's a gray area there.

And the last one is the use of scrubbing farms.
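Going back to the two application-layer patterns — compression disabled and incomplete requests — both can be spotted from aggregated logs. A minimal sketch; the log shapes, thresholds, and the 5-second header cutoff are assumptions for illustration, not Flipkart's actual rules:

```python
# Sketches of detecting the two application-layer patterns described above
# from log data. Input shapes and thresholds are illustrative assumptions.

from collections import defaultdict

def compression_refusers(entries, min_requests=100):
    """IPs repeatedly sending requests with compression disabled.
    entries: iterable of (client_ip, accept_encoding_header)."""
    counts = defaultdict(int)
    for ip, accept_encoding in entries:
        # No gzip in Accept-Encoding forces full-size responses on the wire.
        if "gzip" not in (accept_encoding or "").lower():
            counts[ip] += 1
    return {ip for ip, n in counts.items() if n >= min_requests}

def slow_header_ips(connections, max_header_seconds=5.0):
    """IPs whose header phase exceeds the cutoff (or never completes).
    connections: iterable of (client_ip, start_ts, headers_done_ts_or_None)."""
    flagged = set()
    for ip, started, headers_done in connections:
        if headers_done is None or headers_done - started > max_header_seconds:
            flagged.add(ip)
    return flagged

reqs = [("10.0.0.5", "identity")] * 150 + [("10.0.0.9", "gzip, deflate")] * 150
conns = [("203.0.113.7", 0.0, None),    # never finished sending headers
         ("203.0.113.7", 1.0, 14.0),    # 13 s between GET line and final blank line
         ("198.51.100.2", 2.0, 2.1)]    # normal client
print(compression_refusers(reqs))
print(slow_header_ips(conns))
```

The second check is essentially what a web server's own header timeout does; doing it in the log pipeline lets you block the source at the firewall rather than just dropping one connection at a time.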
So when all else fails — India has limited bandwidth connectivity, and the ISPs don't want to give it to you if you don't pay for it — what they'll do is blackhole you very, very fast. And we've had ISPs blackhole us a number of times. In that case, what we do is switch over to a scrubbing farm. This is basically a third-party DDoS mitigation service: we change our DNS and let our traffic flow via the scrubbing farm, where they mitigate the attack and send the clean traffic back to us.

The other category is the application layer attack. For this we have two solutions. One is a homegrown solution — I'll talk about it in more detail — and the other is, again, falling back to the scrubbing farm. The homegrown solution is limited by the amount of bandwidth we have. So we can handle attacks up to medium volumes — like the incomplete-request attack I talked about, which doesn't require a lot of bandwidth — but wherever bandwidth comes into play, we have to fall back to scrubbing farms.

Before I talk about the solution we have, I want to talk about build versus buy — why we built our own. We built our own because we know our application better than anybody else. We know what the flow is supposed to be and what the user's journey through our website is, so we are the best people to identify mal-intent, and we can do it very quickly.

Secondly, we want to understand the attack and evolve with it. This, again, is subject to the complexity of the attack. At times, we've had the attacker using only one vector — not really a sophisticated attack. It could be them just doing NTP amplification, and once you take care of the NTP amplification, they stop. But at times, the moment you stop NTP, they hop onto a different vector. In that case, we want to know exactly what is going on, and hence we want to evolve with it.
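The failover decision above — absorb locally or swing DNS to the scrubbing farm — can be sketched as a toy policy. The thresholds and record names here are invented for illustration; in practice this was an operational call, not necessarily automated:

```python
# A toy sketch of the scrubbing-farm failover decision: when attack volume
# approaches the pipe's capacity, point DNS at the scrubbing farm instead of
# the origin. Capacity, threshold, and hostnames are invented assumptions.

LINK_CAPACITY_MBPS = 10_000   # assumed size of the internet pipe
SCRUB_THRESHOLD = 0.8         # fail over at 80% utilisation

def dns_target(inbound_mbps: float) -> str:
    """Choose where the site's DNS record should point."""
    if inbound_mbps >= SCRUB_THRESHOLD * LINK_CAPACITY_MBPS:
        return "scrubbing-farm.example.net"   # hypothetical provider endpoint
    return "origin.example.net"               # our own data centre

print(dns_target(2_000))   # normal day -> origin
print(dns_target(9_500))   # under attack -> scrubbing farm
```

A switch like this only takes effect as fast as the DNS TTL allows, which is why a low TTL (discussed later in the Q&A) matters.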
The third is that the service providers have a very generic solution. They all want to sell you an application firewall. Application firewalls are good, but they only help when you talk about XSS or SQL injection. They're not aware of how our application works; most of them do not have enough intelligence to build a profile of my application. So it's not a feasible solution.

So what do we do? We do real-time log analysis. We aggregate the logs from all internet-facing applications and firewalls, and using Logstash we get them all into Elasticsearch. Once they're there, we identify standard patterns — things we've seen in the past — and we use those.

Then we also work with the applications: every application does its own request profiling. Within the Flipkart website you have checkout, a commenting service, different services. They profile the requests and see if the user is following a plausible pattern. If you are just continuously searching on the Flipkart website and we see 500 requests, you're not really a customer — a real user searches and then actually goes somewhere. 500 pure search requests is not a positive sign.

Once an application spots this, it gives us feedback using a custom error code. Instead of, say, HTTP 403 or 404, it will return something like 999. Since we're doing real-time log analysis, once we see that code, we pick the data up and go ahead and block the client on the firewall. There are a couple of options here — we use different error codes — so you can either block on the firewall, or if you want to start rate limiting, we can do that as well. And the third option is that the request profiler itself can present a CAPTCHA to the customer.
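The feedback loop above — applications flag bad clients with custom status codes, the log pipeline turns those into firewall actions — can be sketched as follows. The code values (999 = block, 998 = rate-limit), threshold, and log shape are assumptions for illustration:

```python
# A minimal sketch of the custom-error-code feedback loop: applications mark
# malicious requests with out-of-band status codes, and the real-time log
# analysis converts repeated flags into firewall actions. The specific code
# values and threshold are illustrative assumptions.

from collections import Counter

BLOCK_CODE = 999       # application says: block this client at the firewall
RATELIMIT_CODE = 998   # application says: start rate limiting this client

def firewall_actions(log_entries, threshold=10):
    """Map each flagged IP to the action the firewall should take.
    log_entries: iterable of (client_ip, status_code)."""
    blocks, limits = Counter(), Counter()
    for ip, status in log_entries:
        if status == BLOCK_CODE:
            blocks[ip] += 1
        elif status == RATELIMIT_CODE:
            limits[ip] += 1
    actions = {ip: "block" for ip, n in blocks.items() if n >= threshold}
    for ip, n in limits.items():
        if n >= threshold and ip not in actions:
            actions[ip] = "rate-limit"   # block takes precedence over limiting
    return actions

entries = ([("10.1.1.1", 999)] * 12 +    # repeatedly flagged for blocking
           [("10.2.2.2", 998)] * 12 +    # repeatedly flagged for rate limiting
           [("10.3.3.3", 200)] * 50)     # normal traffic, ignored
print(firewall_actions(entries))
```

Using status codes as the signalling channel means no extra infrastructure is needed: the same log pipeline that already ships access logs to Elasticsearch carries the verdicts.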
That actually happens mostly in the case of people behind a NAT — offices or college campuses, et cetera — where there is a large number of genuine users whom we do not want to lose by blocking the IP address.

What are the challenges currently associated with this? The first is response time. About two months ago, we had a very interesting attack pattern: attacks that would last about five minutes. By the time we got on VPN and logged in to figure out what was going on, it was all gone. It was a cat and mouse game, and just very difficult. So we wanted to automate it — but not everything can be automated. So response time, again, is a challenge.

The second is the maturity of mitigation solutions from ISPs. All the ISPs we've talked to, all the third-party mitigation providers we've talked to — everybody is reselling the same solution. It's basically an Arbor box, and they're just selling a subscription for it. The solution they have actually works very well at layer 4, but does not do anything at layer 7. And the other thing is that even with the Arbor gear the ISPs in India have, they do not have enough knowledge or hands-on experience with DDoS mitigation, so it's not really effective. An example of this: we were getting hit by an NTP amplification attack, and the mitigation provider decided to go ahead and enable all the defenses available in Arbor for our IP address. One of those mitigations is HTTP authentication using HTTP Digest, which is meant to differentiate between a machine and a real user. So while trying to mitigate an attack, they actually started one — about 70% or 80% of our users dropped off once that was enabled.

Then the other problem is that the Indian ISPs are not flexible. They are still operating in the past.
We have the cloud model, where we can go swipe our credit card and have machines ready, with everything available. I can't have anything like that with the ISPs. I can't have them provision fat pipes for me and only pay for them when I use them. If they're provisioning a fat pipe for me, they want me to pay for it regardless of whether or not I'm using it. So the cost is a very big barrier.

The fifth point is blackholing of our traffic. The ISPs are very quick to blackhole your IP address. They've tried their mitigation solution and it's not helping; their knowledge of DDoS has reached a level where they can't do anything more; and the attack on us is actually hurting our neighbors as well — even an attack as small as 10 Gbps hurts the neighbors. We've been victims, and I'm sure we've affected other people too. So they're very quick to blackhole traffic, and they do not inform you — communication is a very big problem. At times we found out 15 or 20 minutes later that we had been blackholed; that was not something we were watching for. Later on, we became more active in monitoring our availability from outside India.

And the last is that there are no scrubbing farms in India. When we use scrubbing farms — Cloudflare, UltraDNS, or even Arbor's own scrubbing farm — they're all located outside India. At Flipkart we very aggressively monitor end-user latency; that's why we're multi-homed, and why we decided to host in India rather than Singapore or the US. But once we go to scrubbing farms, the latencies are just out of the window, and there's no telling. At times we've been told we're going to a scrubbing farm in Singapore, but depending on the volume of the attack, the provider sometimes moves us to a scrubbing farm in Europe or America.
What that actually does is introduce roughly 2x latency for the end users: they first connect to the scrubbing farm, the scrubbing farm proxies the request on their behalf to us, gets the response, and then sends it back to the user. So these are the challenges we've had. And if you have any questions —

Q: Hi. Can you throw some more light on scrubbing farms — what services do these farms provide?

A: Scrubbing farms are mostly black boxes. They're essentially reverse proxies that also do pattern identification. If you want to think of them, just think of a farm of NGINX or HAProxy servers that take your requests in, and then you can apply filters there. Say I identified a particular bad user agent — in the WordPress case, we actually saw "WordPress" appearing in the user agent — they'll take that and start blocking it.

Q: So basically, with the DNS servers, you route all your flipkart.com traffic to them, and then they forward it to you?

A: Yes, that is how we move traffic. The other option is BGP, but the routing scene in Asia is very, very bad. To give an example: with one of the ISPs, we tried doing anycast in three locations — Delhi, Bombay, and Chennai — and what we saw was all the traffic just shifted to Chennai. Even though I had a box right in Delhi, right next to the server hosting the anycast node, it was all going to Chennai. So BGP is a very gray area, plus you also give away control of your BGP advertisements.

Q: Right, but the DNS servers are something that are not in your control, right?

A: No — we have a very low TTL. If you look up flipkart.com, it's about 60 to 90 seconds.

Q: Hello. We are currently contemplating working with Arbor. Can you throw some light on what your bad experience with Arbor has been?
A: It's not Arbor particularly — it's the people using the equipment. They don't know the capability of the equipment. It's like having a firewall: I can give you a firewall, but if you don't know how to use it and how to tweak it, it's not the capability of the firewall that's limiting you, it's the capability of the user. And the other thing is, I think the ISPs in India don't have enough capacity. I won't name the ISP, but one of them has about 5 to 10 Gbps of mitigation capacity, and it's advertised as about 100.

Q: I have a question about an API, not about websites. I have an API which is by nature public, because it's part of a search-as-a-service. I'm thinking about how it could be misused in a DDoS attack — because there is authentication, in most cases, on that API.

A: Even with authentication, you can't prevent a DDoS. The authentication is going to be done by the API, or by some other service for that matter. If we go back to this slide: there's actually nothing you can do here — the people are all trying to get on board.

Q: So how do you mitigate a DDoS attack for a public API?

A: See, again, it's about muscling it out. In this analogy, having maybe 10 more trains, or a longer train with more seating capacity, is the only way you can solve it. But there is a limit, right? If you're expecting 100 requests per second, you can plan for maybe 1,000. But if you start planning for a million to mitigate it, there's going to be a significant cost associated with it. So you could take the help of people like Cloudflare or UltraDNS, or Arbor itself — it's along the same lines.

Q: You mentioned a homegrown application solution for this, right? What kind of solution was it? You had some points about doing more analysis on logs — but what specifically was done in the application?
A: Like I mentioned, it's our application, so we know it best. We're aggregating logs from all web servers — actually all internet-facing applications — so we log as much activity as possible and then try to be proactive enough to recognize a pattern as it's developing. There are some patterns we already know of; there are some we can look at and start investigating. We are not at the artificial-intelligence level of automatically blocking things.

The other part is global feedback. Say you have an API for Flipkart, a seller platform for Flipkart Marketplace, and the customer-facing website. If I see an attack on the seller side, I can take that information and start blocking, or take preventive action, on the customer side before it starts happening there. So it's a global feedback system — I get that feedback from all applications.

And once I've blocked someone on the firewall — the firewall sits at the edge of the network — if I still see traffic from that user in my firewall logs, I do not unblock them. Being an internet company, we normally unblock our users after some amount of time, and it's small — 10 or 15 minutes, that sort of thing. But we have that feedback loop: if you're still attacking us even after being blocked, we will not unblock you. It's not a totally preventive solution — some initial attack has to come in for us to get the data, identify it, and then build on it.

Q: So for the homegrown solution, you have your own HTTP protocol decoders and everything in place?

A: No, no, no. We are doing log aggregation and analysis, that's all — using Logstash and putting all that data into Elasticsearch. You can compare it to Splunk.

Q: So when did you understand the first attack pattern? Did the site go down? How did you start doing the solutioning?
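The block/unblock policy described in that answer can be sketched as follows. The 15-minute window and data shapes are assumptions for illustration:

```python
# A sketch of the unblock policy described above: blocked IPs are normally
# released after a short window, but if firewall logs still show traffic
# from them while blocked, they stay blocked. The window length and the
# data shapes are illustrative assumptions.

UNBLOCK_AFTER = 15 * 60  # seconds a quiet IP stays blocked

def ips_to_unblock(blocked, firewall_hits, now):
    """blocked: {ip: block_or_last_activity_ts}; firewall_hits: {ip: last_hit_ts}.
    Returns IPs that have been quiet for the full window."""
    released = []
    for ip, last_seen in blocked.items():
        # Still hitting the firewall after being blocked? Keep the block.
        last_seen = max(last_seen, firewall_hits.get(ip, last_seen))
        if now - last_seen >= UNBLOCK_AFTER:
            released.append(ip)
    return released

blocked = {"10.9.9.9": 0, "10.8.8.8": 0}
hits = {"10.9.9.9": 800}                        # kept attacking after the block
print(ips_to_unblock(blocked, hits, now=1000))  # only the quiet IP is released
```

The key design point is that the firewall logs close the loop: the same pipeline that triggered the block keeps watching for traffic from the blocked source and extends the block as needed.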
A: I think it was the third or fourth time. In the last four months we've seen a rise in the number of attacks. The first attack we actually saw was in 2011; after that, the attacks started again in December, and as a whole they've been increasing. With the amplification attacks there wasn't anything we could do — it was just a muscle game, and we had to have the ISPs mitigate it. For the application layer attack: we have graphs for everything — the QPS, the amount of data we were pushing out through the website, and the network side as well. So we saw spikes in bandwidth and started investigating. We also saw an increase — not really a spike — in the number of incoming requests. The first thing we did was check with marketing whether they had something going on, but we didn't find anything there. It took about two or three hours to actually find out what the problem was. A very interesting thing — Nginx has something called a lingering timeout, which is when we learned about it — where if the rest of the request doesn't come in within five seconds, the default value, it drops the request. So we saw this pattern of requests dropping after five seconds, but we just had no clue why.

Q: In Logstash you must be capturing HTTP logs. What about other attacks, like NTP or DNS?

A: See, for NTP I don't have a port open on the firewall — I'm not hosting a server. But the thing is, it's choking my internet pipe. Even if I'm dropping it, I'm dropping it at my doorstep.

Q: But can you capture the logs for that?

A: I have firewall logs.

Q: So you are capturing the firewall logs?

A: Yes — the NetScreens give me the number of packets per IP and protocol match.

Q: Aren't you compromising your security by describing so many of the protections and checks that you are doing?

A: No, this is actually a very generic presentation. Thank you.