He will speak about proxies. Maybe everybody has already used a proxy, but he will tell you what you use them for and how to use them right. Thank you.

So, this is work that I've been doing over the last year. I'm a grad student in Seattle, and this came out of building proxies and trying to understand: okay, we've got this proxy, but what is the resulting market? Who's going to use it? What workload is it going to carry? There's really not a lot of data around that. There's more material at the web link, which is also linked from the talk page, with data. One of the outcomes is that if you're building a proxy system or a circumvention system, you want to be able to understand whether it's fast, and to do that you have to understand what it's going to be used for, and for that you need data about what the workload looks like.

Let's start by talking briefly, so that we're all on the same page, about what proxies are. For this talk I'm looking at HTTP proxies, and in particular at the open proxy ecosystem. These are public IP addresses you just find on the internet, and they'll relay traffic to other sites for you. People use them because they're really easy, and a lot of people understand how to use them. They're the baseline for "I can't get to a site, my school is blocking it, something is blocking it."

Normally HTTP looks like this: I ask for a resource and the server gives it back. In that top line I'm saying I want this path, this specific resource. I connect to the IP address after doing the DNS resolution, but I also tell the server, in the Host header, which host I expect to be getting the resource from, and that's how one server can serve the content for multiple different domains. This changes a little once we get to the SSL world we hope to live in, but that's the basic picture.

With an HTTP proxy, instead of asking for just a path, I ask the server for a full URL. I say, hey, can you give me the resource on this other server? And it goes out and fetches it for me, so it fits within the same model. I can also still set the Host header, but at this point it's a bit redundant. Now you might ask, does this work for SSL? It doesn't. Once we go up to SSL we have to do a totally different thing, and we add a new verb called CONNECT. I tell the proxy, hey, can you connect to this other domain? Then I do my SSL handshake, and the proxy just relays TCP bytes through.
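To make that concrete, here is a minimal sketch of those two request shapes on the wire. This is just an illustration, not anything from the measurement study; the proxy address is a placeholder.

```python
import socket

PROXY = ("203.0.113.10", 3128)  # placeholder address for some open proxy

# Plain HTTP through a proxy: the request line carries a full URL
# instead of just a path. The Host header is now redundant, but
# clients typically still send it.
s = socket.create_connection(PROXY, timeout=10)
s.sendall(b"GET http://example.com/index.html HTTP/1.1\r\n"
          b"Host: example.com\r\n"
          b"Connection: close\r\n\r\n")
print(s.recv(4096).decode(errors="replace"))  # first part of the response
s.close()

# HTTPS through a proxy: CONNECT opens a raw TCP relay, and TLS then
# runs end-to-end inside it, so the proxy only ever sees the domain.
s = socket.create_connection(PROXY, timeout=10)
s.sendall(b"CONNECT example.com:443 HTTP/1.1\r\n"
          b"Host: example.com:443\r\n\r\n")
print(s.recv(4096).decode(errors="replace"))  # "200 Connection established"
# From here you would wrap `s` in a TLS socket, e.g.
# ssl.create_default_context().wrap_socket(s, server_hostname="example.com"),
# and speak HTTPS as usual.
s.close()
```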
So we've got this baseline. Proxies are a thing that have existed for a long time; IRCache, back in '95, was one of the first. In the early days of proxies, when we were in this very HTTP world, they were used largely for performance: you didn't actually want to take the long, slow trip somewhere else when there could be one canonical resource and multiple requests for it. So there's a body of research from the late 90s asking: are caches effective? Do these proxies actually let an organization send less traffic out? And some of the results were very impressive.

This figure looks at how many resources are actually shared between organizations, as a percentage for each organization, and it's around 50 or 60%: that fraction of your requests was duplicated across different organizations, so you could save a lot of your bandwidth by having a proxy. So we had this initial rise where an organization or a school would run these gateways because they saved a lot of bandwidth. Then, as we got personalization and dynamic content, that number kept going down, because now everyone has their own version of Facebook and that's not shareable. And we got HTTPS, which you can't cache because it's encrypted to that specific browser. So the original reason these proxies were everywhere has kept eroding, and the subsequent papers looking at how cacheable the web is show the numbers going steadily down.

Now we've got things like CGI proxies, where I go to a page and it lets me type in another URL, and we've got these open proxies, and they're really more about circumvention: end users trying to get out in spite of the organization or the ISP they're behind. That's a very different use case, and one we really haven't thought about too much. So I wanted to try and understand the workload. I went out to things like Xroxy; there are a ton of sites that aggregate open proxy lists, and you try to understand where they're getting them and how many there are.

In that exploration, one of the things we noticed is that there's actually a lot of information a proxy is willing to tell you. Maybe you've been trapped in a hotel room that wants you to pay for internet, or on a plane with that thing that wants you to pay for internet. It turns out that's a great environment to get really frustrated and keep trying things until you figure out what you can actually do with the proxy. At one point I found myself stuck behind a squid proxy. Squid is one of the common proxies out there, used for caching and for transparent proxying, and one of the lesser known things in squid is the cache manager interface. It's an additional interface that lets you manage the proxy, and it's normally exported only on localhost: if you're on the local machine running the proxy, you can manage it and see statistics about how it's being used. The way that works is that instead of asking for http://google.com, I ask for a weird scheme called cache_object. If I ask that of a squid proxy that has this turned on, which it is by default for localhost, it gives me a sort of not-actual-HTTP response containing a menu, and then I can ask for other resources: what domains have you resolved recently? How much stuff is in the cache?
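Here is a minimal sketch of what such a probe can look like; menu, fqdncache, and info are standard squid cache manager report names, and the proxy address is a placeholder.

```python
import socket

PROXY = ("203.0.113.10", 3128)  # placeholder: a squid instance to query

def cachemgr(report: str) -> str:
    """Fetch one cache manager report, e.g. 'menu' or 'fqdncache'."""
    s = socket.create_connection(PROXY, timeout=10)
    # The cache manager lives behind its own URL scheme, cache_object://,
    # rather than http://.
    s.sendall(f"GET cache_object://localhost/{report} HTTP/1.0\r\n\r\n".encode())
    chunks = []
    while data := s.recv(4096):
        chunks.append(data)
    s.close()
    return b"".join(chunks).decode(errors="replace")

print(cachemgr("menu"))       # the list of available reports
print(cachemgr("fqdncache"))  # recently resolved domain names
print(cachemgr("info"))       # cache size, hit rates, and other statistics
```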
Now, I bet it doesn't happen that often that this is exposed on the internet, right? So we used ZMap to go through the IPv4 space on the default squid port, to see how many open squid proxies there are and how many of them have this cache manager interface. And this is a thing you could imagine configuration causing to be exposed: if I've got my squid proxy and some other thing also running on my machine, the proxy could see requests coming from the local machine, rather than from the remote client, if there's some other transparent relaying going on.

So how much did we find? Well, if you scan port 3128, the default squid port, you find roughly 2 million machines that respond, and that's common for a lot of weird ports: go out and run ZMap on some strange port and you'll probably find around 2 million, so that by itself isn't much signal. But if you then actually make requests to all of those as if they were open proxies, you find something like 28,000 that say they're squid. They don't all give you content; a couple thousand give actual HTTP responses and are open proxies. And this feels about right: it corresponds to a little more than what the open proxy lists find on a given port. The lists generally know about 5,000 open proxies in total at any given time, so if you do the scan yourself you find some new ones they haven't picked up yet. But then, of those, roughly a quarter actually had this cache manager interface.
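In outline, that verification step after the sweep might look something like the following sketch. This is not the actual measurement code, just an illustration of the idea; the candidate address and the classification labels are made up.

```python
import requests

def classify(ip: str) -> str:
    """Roughly classify one host that answered a scan of port 3128."""
    base = f"http://{ip}:3128"
    try:
        # A bare request to the port: squid answers even bad requests
        # with an error page carrying its banner in the Server header.
        banner = requests.get(base, timeout=10).headers.get("Server", "")
    except requests.RequestException:
        return "no-http"            # the port answered, but not with HTTP
    if "squid" not in banner.lower():
        return "not-squid"          # some other proxy or service
    try:
        # Now use it as a proxy for a page we know, and see if it relays.
        r = requests.get("http://www.google.com/",
                         proxies={"http": base}, timeout=10)
        if r.ok:
            return "open-squid"     # a functioning open proxy
    except requests.RequestException:
        pass
    return "squid-but-closed"       # e.g. denied by an access control list

print(classify("203.0.113.10"))     # placeholder candidate from the scan
```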
And that's scary, right? Because it means I can now look at the traffic of random people on the internet. I can see what URLs they're visiting; if it's HTTP, I can even start to see the query parameters, and I can see who's doing it. That's great for me, because I get a chance to understand the workload, but it's really not great for the users. So we ran for a week, querying this periodically, and dumped a couple million snapshots of what the proxies were seeing, and we're using that as a data set to understand what this world of weird unencrypted open proxies looks like.

So what's out there? The first question is whether the squid proxies running out there are run intentionally, or are just misconfigured. What's going on? We looked at how old the different proxies were. This plot shows the total traffic: the y-axis is what fraction of the traffic, against the age of the proxy serving it. You see that most of the traffic, that initial bar on the left, is served by proxies that are really new. There's some chunk served by proxies we saw that we believe were quite a bit older, but a lot of the traffic comes from machines that have been up for less than a week. And we saw this lifecycle pretty commonly: something shows up as a new proxy, the traffic on it ramps up as it gets discovered and put on the open lists, and at some point the person running it notices, either because of the bandwidth or because they get a complaint, and takes it down. That process seemed to take about a week in general. So that's the churn of how long these things last, and it's an indication that a lot of these were not intentional; it's not like someone was planning to run them.

So who is running them? We tracked down a bunch, starting from clusters that shared a country. We found this South American trend of proxies that all looked the same, the same version of squid in the same locale, and ended up tracking down a Linux distribution that shipped with squid, and an nginx proxy in front relaying to it, so it did this by default. We got in touch with them and told them that was a bad idea. And then, on the long-lived proxy side, I actually tracked down a guy at a Chinese university who was running one of them. I asked, hey, do you know you're running this? It turned out to be a grad student in the physics department whose desktop machine had been running an open proxy for over a year. He claimed no knowledge of why this was happening; someone else had set up his machine for him. It took a lot of emails, and help from Chinese friends in the lab, since I don't speak very good Chinese, to convince him that he probably didn't want to be running this, and to work out how to turn it off without knowing the root password on his machine. So: a lot of unintentional operators.

Maybe more interesting than that is the user profile, trying to understand the traffic. We saw a lot of traffic coming from China; that was the bulk of the requests, which is interesting. The US and Russia were also hotspots, but it's a pretty diverse mix geographically. We can look at the top searches, and they're generally these random Chinese search terms. It's hard to measure popularity directly, because the whole point of these things is that they're caches: once they get a search, further requests for it don't show up as additional requests, because it's already in the cache. So instead, as a proxy for popularity, we counted how many different servers we saw the same search on. By that measure, these weird Chinese search terms, which seemed semi-automated and semi-sketchy, came up even more than Google, which again lends credence to the bulk of the users being based in China.
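As a sketch, that de-duplicated popularity count could be computed like this, assuming hypothetical records of (proxy, search term) pairs extracted from the snapshots:

```python
from collections import defaultdict

def rank_terms(records):
    """Rank search terms by the number of distinct proxies they appear on.

    Counting distinct proxies per term, rather than raw requests, sidesteps
    the caching problem: repeat requests on one proxy are invisible, but
    the same term seen on many unrelated proxies is a real popularity
    signal. `records` is an iterable of (proxy_ip, search_term) pairs.
    """
    proxies_per_term = defaultdict(set)
    for proxy_ip, term in records:
        proxies_per_term[term].add(proxy_ip)
    return sorted(proxies_per_term,
                  key=lambda t: len(proxies_per_term[t]), reverse=True)

# Hypothetical example records, not real data from the study.
top = rank_terms([("10.0.0.1", "flight a to b"),
                  ("10.0.0.2", "flight a to b"),
                  ("10.0.0.1", "rare query")])
print(top)  # terms seen on the most distinct proxies come first
```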
So we've got this China-centric user base. We can also ask: how fast are these proxies? How much latency are users willing to have imposed on them? This was something the proxies themselves would report: how long the proxy believed it took to serve each request. It's again a somewhat odd depiction, but if you look at where that red line crosses 50% up the axis, that's the median, and the median is something like half a second, which is really long by our normal standards of how long a website should take to load. And roughly a quarter of the requests were taking over a second to load. So these things aren't fast, and yet they're still getting a ton of traffic. Overall we were seeing something like 200 terabytes a day going through open proxies, based on the volume of requests we observed.

There's also a lot of automated traffic going through them, and that was a bit of a surprise. It's hard to directly quantify how much of this is real users versus automated programs that are scraping things, or malicious malware, or the like. But from standard patterns in the URLs, one of the really interesting things we found was a bunch of people scraping flight prices: they'd be searching for the price of flights between city A and city B, and the cities would change, but they were these standardized queries, and there were tons of them. We found a sort of unfinished PHP thing that was relaying them back to itself. People are using open proxies as a way to get IP diversity, the same way they would use pretty much any other tool that gives them that.

So I think there are a couple of takeaways. One is that running an open proxy this way has a lot of problems. Certainly having your cache manager open, so that someone like me can come along and see what the users are doing, is really bad. And we're in the process of continuing to alert abuse contacts whenever we see a squid proxy with the cache manager interface open: hey, you're running this thing, the users aren't being informed that their traffic is recorded and visible to anyone, you should really stop that. But separately, if you're running this thing with no real target of what you want to get out of it as an operator, a lot of what's going to happen on it is this malicious and automated traffic: botnet traffic, price scraping, SEO spam, comment spam. So figuring out how you're going to provide access while restricting it to use you consider legitimate is worth doing: how you'll shape the traffic, or at least how you'll target the service at legitimate users instead of setting no bar at all for automated programs.

So what are the best practices? Well, you don't want to simply not provide access; that's the conundrum. But at the same time, I think open proxies, especially the plain HTTP ones, are something we really want to go away. HTTPS is a lot better: even with squid proxying HTTPS, you're leaking much less, only the domain rather than the actual resources. And we can look at these workloads and see that we're not getting much caching benefit anymore, so you might as well not try to cache and just relay each connection. I think I'm going to leave off there. Thank you. There's a paper linked from the Fahrplan that has the traffic distributions and more along those lines, which provides some more data if that's what you're interested in. But I want to leave some time for questions, so I'll take questions.

Thank you. Thank you very much, Will Scott. Are there any questions for him? Now is your chance. Please come up to the front; there are two microphones. Don't be shy. Okay, we can start right here.

If you search for "HTTP proxy" you find these sites that index a lot of open proxies. Did you find there was an overlap with the proxies that you found? And have you looked at the lists?

Yeah. So the question was: there are these index sites, I showed a screenshot of Xroxy, but there are a bunch of them that provide indexes of open proxies. Does that overlap with what you find when you scan? And it does.
We found more: if you're doing your own fresh scan, you'll find maybe twice as many as are generally listed, and the ones that are listed are generally also things you find when scanning.

They also list HTTP proxies that have a really trivial basic authentication, some simple username and password. Did you try brute forcing any of the ones you found? I guess that's bad.

Nope, we stuck with ones that were just listening on a direct port and would directly act as an HTTP proxy. One related thing: a relatively significant portion of the requests we saw through proxies were actually these index sites checking bandwidth and checking whether the proxy is still up. You'd see these constant pings back to lots of different proxy lists, checking that the proxy is still alive and how fast it is. That was almost as much traffic as the other automated stuff. Thank you.

Hello, and thanks for this talk. My question goes more in the direction of discovering a proxy. You showed us some numbers: you scan for port 3128, and you identified some 28,000 hosts as proxies. How do you identify a system with this open port as a proxy? Is it from a banner search or a behavior search?

That was from a banner search. So the question was how we identified the proxies, and then how we knew what they were. We ran ZMap on port 3128, which is the default port for squid: if you look at what squid runs on when it's run by default, it's 3128, and on the open proxy lists you find that it's one of the more common ports, along with 8080 and the like. There are other proxies on those lists too: MikroTik shows up a lot, and Polipo, and Apache gets used. Then we connected as an HTTP client to everything with port 3128 open, and those that had a squid banner, that said their server was squid, we counted as identifying as squid; that was the roughly 20 to 30 thousand. And then the ones that would actually function as open proxies, that would give us Google when we asked for it, were the ones we counted as actual open proxies; that was the couple thousand. Thank you.

Hello, thanks for the talk. For those of us who don't have much experience in this area, what exactly makes a proxy "open"? They're exposing this port 3128, and obviously anybody on the internet can connect to it, but how would you make it a closed squid proxy?

Sure, so the question is what makes these open. Squid gets used a lot by organizations. An ISP, especially a mobile one, may have all of the consumer devices going through a proxy on the way out. If they've got that configured correctly, then when I'm somewhere in another country and try to connect to that outbound proxy, it won't do anything for me: it's restricted to the organization's IP space, maybe a private NAT, restricted by the network. The people with open proxies have basically put squid into a network configuration that doesn't lock it down like that.
Alternatively, you can have it running on a totally public address but with a password. There are various levels of how well these passwords actually work; by default it uses an insecure scheme, but even having a password means a random person can't just probe and find the thing directly, and it won't end up on the lists.

Can I ask one more question? Of course. The cache manager thing: you said it's only available on localhost, so how could you actually access it when you're not on localhost?

Right, so the question is: this cache manager that squid exports is exposed only to localhost connections, so why is it reachable on roughly a quarter of the open proxies we found? The answer is that sometimes there's another program in front, or the proxy is configured so that requests first go to something else and are then passed to squid, and squid sees them as coming from localhost, even though the responses are relayed back to me. What I'm connecting to isn't actually squid; it's something else, which I can't necessarily identify, that hands the requests to squid.

So that was the nginx-and-squid setup, like the one from the Linux distribution?

Right. If you've got an nginx reverse proxy in front, or something else there, or even just a weird iptables configuration, this can potentially happen. Cool, thanks for the clarification.

There's a question to our left.

Please don't give open proxies such a bad rap; I depend on them. I use them to access the site of a public broadcaster, and if you consider that I'm German, and listen to the kind of accent I speak, you can take a guess at which broadcaster I'm talking about. If I weren't able to use open proxies from that broadcaster's country, I would not be able to access at least their archived video material, and it really helps me do the work that I do here, which is to translate. We didn't do your talk, but we do translate all German talks into English, so if you're interested in a German talk and don't speak German, try us. What I do is research the URL patterns that need to go via the proxy to access their videos. I did that myself, so I minimize my use of them, and I have to find a new proxy every few days or weeks, and it works rather well if you're patient. So they are doing a real service, which is to overcome geoblocking, and in the kind of scene we're in, we're quite opposed to geoblocking, so they do something that's really valuable.

So the point is that there's a lot of value to open proxies in terms of being able to modify your path to the internet and get around geoblocking and other restrictions on access, and that's totally true. I think the point here is that we can do that without having a lot of these downsides. We know how to do this in a way that isn't going to cause the traffic to be surveilled and potentially intercepted, and we should move to that as quickly as we can.

That's true, but we aren't there yet, so I need them.

Did you check whether any of the proxies modified the pages they sent, for example compressed them, or added exploits or something?

Yeah, so the question is whether we checked to see if the proxies themselves were malicious or were somehow tampering with the content, and we haven't done that. I was more interested in the traffic workload than in whether the proxies were behaving well. There were definitely a lot of flaky proxies, and proxies that would always give you the same response.

Next question. Hi, nice talk.
I wonder, is it possible to use ZMap to hit all of the different proxies that you found and actually ask for the cache_object directly, or to do a chained CONNECT to localhost and use squid against itself? In the past I've used that to own some squid servers: you connect the proxy to itself, and from there you ask for the cache_object. Did you try to automate that at internet scale, and also to expand it to some other ports?

So the question is whether you can recursively get the proxy to first connect back to itself and then use that. That sounds like it has the potential to work. We didn't try that, but it may make this even worse. What are you doing later?

Any more questions? No? Then please give Will Scott a big thank you. Thank you.