Okay, thank you very much for coming. I hope you have all found a comfortable seat. I'd like to welcome you to the next talk, "PrivacyMail: Towards Transparency in E-Mail Tracking". Well, what's this talk about? Website tracking, that's just today's jam. We all know that this happens. We all more or less know how it happens. We have more or less accepted it, live with it, and have established our defense measures, ad blockers and so on. But what is mostly unknown is that our e-mail traffic is tracked as well. And these guys started a project to make this e-mail tracking transparent for all. You will hear how it works and how you can all benefit from it. So please welcome our next speaker, Max Maass. Have fun.

Okay, thank you very much. Can you hear me okay? Welcome to my talk. I'm Max Maass, as was just announced. I'm giving this presentation, but this is joint work with Stephan Schwär, a student of mine, and Matthias Hollick, my supervisor at TU Darmstadt.

So maybe before we get started, a quick question for everyone here. Who of you actually receives e-mail newsletters? More or less everyone. Who of you actually wants to receive e-mail newsletters? That's about half of the people. Okay, not bad. And who of you was aware, before they saw the description of this talk, that these e-mail newsletters tend to be tracked? That is also about half of the room.

So, when I talk about online privacy, I usually get a reaction that is something like, oh no, it's another talk about tracking. Like, I get it, people are watching me from the shadows, and at some point it just becomes boring. So there are basically three reasons I want to give you to stick around for this talk. The first one is we're not talking about websites, we're talking about e-mails, as was already said. So it's a bit of new ground. The second is there's going to be a live demo and something for you to play with after this talk.
And the third is it's only a 30-minute talk, so you don't lose a lot of time even if it's terrible.

So, when we talk about e-mail privacy, we usually talk about stuff like e-mail security. So we have a message and we want to send it from person A to person B, and we say, oh no, we've got to encrypt everything because the NSA is looking at all the connections and all the servers. And I mean, yes, we should encrypt everything, but this is not what this talk is about. This talk is about the sender of the messages, who wants to know if the recipient is actually reading them. So it's about analytics here, similar to analytics on websites.

If you're familiar with the world of tracking on websites, you may know that it is possible to track views. So, if you just open a website, this can be tracked. Similarly, if you just open an e-mail and you have remote content enabled, then this can be tracked by, for example, embedding images or embedding style sheets. These can be personalized, and when you open the e-mail, they are loaded. And from the fact that this resource got loaded, the person that sent you the e-mail knows that you opened it. So this is, very basically, the way it works.

You can also track interactions. So if you click a link, this is usually done by personalizing the link so that you get a unique link that only belongs to you: to this specific e-mail address, to this specific newsletter that you got, to this specific link in the newsletter. And then the people sending it know exactly who was clicking which links and so on.

And finally, e-mail is very useful to link different identities.
So I might have a smartphone and I browse the web on the smartphone, and the advertisers get a lot of interesting data about me from the smartphone. And then I have a laptop and I browse the web with the laptop, and the advertisers get a lot of interesting data about me from the laptop, but they don't really have a way to determine that this laptop and this phone actually both belong to me. And this is where e-mail comes in, because if you read your e-mails on both your laptop and your phone, and you load remote content, then this can be used to link these two identities. They can tell that these two devices receive e-mails from the same e-mail account, so they probably belong to the same person, thereby merging your different profiles and making it much easier to sell you, I don't know, useless stuff you don't need.

So what's the big deal in terms of e-mail tracking? E-mail tracking is highly prevalent. Depending on who you ask, between 24 percent and 85 percent of all e-mails contain tracking. These two numbers come from two studies that had fairly different methodologies: the higher number looks at only top websites and the lower number looks at a larger set of websites. So the truth is probably somewhere in between and depends greatly on which e-mail newsletters you receive yourself.

If you open a tracked e-mail, the sender knows, of course, that you opened the e-mail, when you opened it, and which device you used. They also know which software you used, are you using Thunderbird, are you using Apple Mail, and they can also know very roughly where you were, based on the IP address that you used. And this data can then also be shared with others. We're going to see an example of how this works later, but this basically means that different tracking providers can work together to track you cooperatively.

So how can we detect this?
First off, there's static analysis, where you basically say, I get an e-mail, which is basically just a glorified HTML document, and then I just look at the HTML code. So I see, okay, there's an image embedded here and there's a link set here and so on, and that way you see very roughly what would happen if you were to open it in an e-mail viewer. The problem with this methodology is that maybe there's a link in there, or an image that is being embedded, but only when you open the e-mail and resolve this embedded image does it turn out that there is a chain of redirects that sends you via four or five different companies before you actually end up with the image. This is actually happening. So static analysis can only give you a lower bound on the amount of tracking that you will encounter by opening this e-mail.

What you would want to do is dynamic analysis. So you basically look at the system in the wild. This means that you take a system that can render an e-mail realistically. This could be Thunderbird, or it could be Firefox or something like that, which just evaluates all of this HTML and renders it, and also retrieves all of the third-party resources like the embedded images and so on, and ideally can also click a link. And this is what we're doing with our system.

So I've been talking about our system a lot now. Let me quickly tell you what this is actually about. What we wanted to build is a system that you can basically sign up a newsletter to, and then we would receive the emails, analyze everything for you, and then show the world a report of what is going on in these newsletters. That was the overall goal. That's why I'm talking about transparency in e-mail tracking. It gives everyone the ability to look at what is going on.

If you look at what other people have done in the past, it is very often just a snapshot analysis.
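To make the static analysis described above concrete, here is a minimal Python sketch. It is an illustration only and not the PrivacyMail code: the class name, the function names, and the hosts in the sample e-mail are all made up for this example. It lists the hosts that an e-mail body references directly, and it shows why this is only a lower bound, because redirects behind those URLs stay invisible.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class RemoteResourceParser(HTMLParser):
    """Collects URLs referenced in an HTML e-mail body (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.embedded = []  # fetched automatically if remote content is enabled
        self.links = []     # only requested when the reader clicks

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.embedded.append(attributes["src"])
        elif tag == "link" and attributes.get("href"):
            self.embedded.append(attributes["href"])
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])


def hosts(urls):
    """Return the sorted set of hostnames, skipping URLs without a host (e.g. cid:)."""
    found = {urlparse(url).hostname for url in urls}
    found.discard(None)
    return sorted(found)


def static_analysis(html_body):
    """List hosts that are visible in the markup; redirect chains are not resolved."""
    parser = RemoteResourceParser()
    parser.feed(html_body)
    parser.close()
    return {
        "embedded_hosts": hosts(parser.embedded),
        "link_hosts": hosts(parser.links),
    }


# Made-up sample e-mail body with one personalized image and one personalized link.
SAMPLE_MAIL = (
    '<html><body>'
    '<img src="https://tracker.example/open.gif?u=abc123">'
    '<img src="cid:logo">'
    '<a href="https://news.example/r/abc123/1">Read more</a>'
    '</body></html>'
)

if __name__ == "__main__":
    print(static_analysis(SAMPLE_MAIL))
```

A dynamic analysis would instead render the e-mail in a real browser engine and record every request, including the ones caused by redirects.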
So you say, if we look at the system right now, this is what we see, but it is not updated over time. It is a one-off thing. Of these one-off studies, there have been three.

One of them was done by Englehardt et al., and they actually also did a dynamic analysis. They are actually the inspiration for our system. They looked at, I think, the top 100,000 most popular websites at that time, analyzed their emails, and found that there is widespread tracking. I think the 85% number came from them, so that 85% of emails are tracked was from their study. And they also looked at defenses. We're going to talk a bit more about defenses later.

The second one was by Xu et al. In a static analysis, they looked at their own email accounts, all the emails they got over the last 10 years, and they checked what was going on there. They also signed up for the newsletters they could find on, I think, the top 10,000 websites or something like that. And they also did a user study. So they asked people, how would you feel if your emails were being tracked? And people said, oh, I wouldn't feel very good about it. And then they probably told them, well, your emails are being tracked, deal with it. And stuff like this.

And then the third one also did a static analysis. They actually have a fairly interesting paper, because you might know these disposable email services, like 10 Minute Mail and so on, where you basically just say, I want to sign up for a new account somewhere, and I don't want to give them my actual email address. So you get a temporary email address from them, you sign up, you click the confirmation link, and then you forget about it. And apparently for some of these services, it is actually possible to scrape these inboxes. So you basically say, I want to see all the emails that were received by some interesting identity somewhere, and they basically did a very large-scale scraping.
I think they had a total of over a million emails they collected that way, and they did a lot of interesting stuff with that. They also found that people were signing up for healthcare plans with these addresses, and had their social security numbers in there and stuff like this. So it never ceases to amaze me how stupid some people are. But okay, so much for that. It's a very entertaining and interesting paper to read. You will find it in the references.

If we look at similar systems, in the sense of systems that allow you to find out which specific services are violating your privacy, there are two in the area of websites. One of them is PrivacyScore.org, which I developed with a couple of other people. I think one of them is also in the audience, Henning. There we are basically doing dynamic analysis of websites. So we tell you, are there trackers embedded, all of this. The second system is Webbkoll by the Swedish NGO Dataskydd.net. I always mispronounce this. They do a very similar thing and also give you hints on how to address these problems if you happen to be the operator of the website.

So how does the system actually work? This is what the workflow looks like when you use our system. Let's say you're interested in the newsletter that is sent out by, I don't know, Example.com. You would basically tell our system, look, I'm interested in Example.com, and then we would see, okay, do we have it on file already? No. Okay, here's an email address for you. So we give you an automatically generated email address that looks like it belongs to an actual person. We also automatically generate a gender for this person, simply because some newsletters will ask you for the gender, so it's good to have it on file, and it might also allow us to analyze whether there is gender-based discrimination going on somewhere or whether different people are getting different content. So this is why we do this.
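As a rough illustration of what such an automatically generated identity could look like, here is a small Python sketch. Everything in it is an assumption made for the example: the name lists, the `mail.example` domain, the address format, and the function name are hypothetical and do not describe how PrivacyMail actually generates identities.

```python
import random

# Hypothetical name lists and mail domain, purely for illustration.
FIRST_NAMES = {
    "female": ["Anna", "Laura", "Sophie"],
    "male": ["Jonas", "Paul", "Felix"],
}
LAST_NAMES = ["Schmidt", "Weber", "Fischer"]
MAIL_DOMAIN = "mail.example"


def generate_identity(rng=random):
    """Create a plausible-looking person with a gender and a matching address."""
    gender = rng.choice(sorted(FIRST_NAMES))
    first_name = rng.choice(FIRST_NAMES[gender])
    last_name = rng.choice(LAST_NAMES)
    suffix = rng.randint(10, 99)  # makes collisions between identities less likely
    address = f"{first_name}.{last_name}{suffix}@{MAIL_DOMAIN}".lower()
    return {
        "first_name": first_name,
        "last_name": last_name,
        "gender": gender,
        "address": address,
    }


if __name__ == "__main__":
    print(generate_identity())
```

Storing the gender next to the address is what later allows comparing the content that different identities receive from the same newsletter.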
And we give this email address to you. Then you can use this email address to sign up for the newsletter. The newsletter will send the confirmation mail to us, and then I get a notification in my inbox that basically tells me, look, there's a new confirmation mail, please look at it. Then I look at it and make sure that you're not trying to, I don't know, open an eBay account with our email address. If everything is fine, I will click the confirmation link, and from then on the whole process is automated. So they will send their newsletters to us and we will analyze everything and put the results on our website.

So how does this analysis actually work? Let me very briefly dip into the technical details. We have a mail server. We retrieve data from the mail server using our system. We have a crawler in there. It's written in Python. It will of course first save the email to the database and afterwards pass it to OpenWPM. OpenWPM is basically a Firefox that has been changed in a way that you can remote control it from Python. So you can tell it, please visit the following website, and then it will log all of the requests that are happening and a lot of other stuff. It just writes all of that to a database, and afterwards we can look at this database and see which requests are generated if you open this email with remote content enabled.

We then also do a second round, where we randomly select a link from the email that we think is not the unsubscribe link, and we click it and see what happens. This process still has a couple of minor bugs that we need to work out, but that way we can also tell if clicking a link will forward you via tracking services. The results are saved to the database, and then we have a different job that basically just runs the analysis, checks what is going on, and prepares the data to be displayed on the website.

So I promised a live demo, so here we go. So this is our website.
Let's see, okay. And I don't know, does any one of you have a newsletter that they are interested in? No one. Okay, that's awkward. So then we just start with, I don't know, the newsletter from Finanztip.de. This is a German site that gives you advice on how to pick a good bank account and stuff like this. And you can see now that when you receive an email from them, just by viewing it you will establish a connection to YouTube, and to CRSend, which belongs to some marketing automation system. You'll retrieve some data from Amazon. You will give some data to Google. This is probably just Google APIs, so they host style sheets and stuff like this. More from Google, YouTube. This is probably also a tracking service. So there's a lot going on, just by opening this email.

The same goes for the section down here. As I said, this is somewhat buggy, so don't trust this too much. We are working on addressing this. This currently contains quite a few false positives. If you want to know more, go to this link when you visit the website yourself. And then you can also see that we look for A/B testing and third-party spam.

So if we say, okay, we want to register a new email address here, to have, I don't know, the 7th identity that also receives this newsletter, then we would click here on "add another identity", get our automatically generated person, copy that email address, and then we can just paste it in here. Hit subscribe. And yeah, at this point they will send a confirmation mail, and sometime during this talk my phone will probably vibrate and notify me about the new signup, and then I can confirm it.

So as you can see, it's actually fairly easy. If you want to sign up a service that is not yet in the system, it's also fairly easy. You will get a notification that says, look, we don't have the site yet, but you can create it here, and you get to the same page we just saw. So it's all fairly straightforward to work with, I would say. Okay.
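The per-newsletter view shown in the demo boils down to grouping the logged requests by the host that received them. The following Python sketch shows one simple way to do that, under stated assumptions: the request URLs are invented, the function name is hypothetical, and the first-party check is a naive suffix comparison rather than a proper public-suffix lookup. It is not taken from the PrivacyMail source.

```python
from collections import Counter
from urllib.parse import urlparse


def third_party_hosts(request_urls, first_party_domain):
    """Count requests per host, ignoring the newsletter's own domain."""
    counts = Counter()
    for url in request_urls:
        host = urlparse(url).hostname
        if host is None:
            continue  # not a network URL
        is_first_party = (
            host == first_party_domain
            or host.endswith("." + first_party_domain)
        )
        if not is_first_party:
            counts[host] += 1
    return counts


# Invented request log, as a browser might record it while rendering one e-mail.
LOGGED_REQUESTS = [
    "https://www.newsletter.example/logo.png",
    "https://cdn.thirdparty.example/pixel.gif?id=1",
    "https://cdn.thirdparty.example/style.css",
    "https://video.example/embed/1",
]

if __name__ == "__main__":
    for host, count in third_party_hosts(LOGGED_REQUESTS, "newsletter.example").items():
        print(host, count)
```

Whether a listed host actually tracks the reader cannot be decided from the host name alone, which is why such a list shows connections and leaves the judgment to the reader.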
So looking at the results in aggregate, you can see that from the newsletters we have in our system, which is 136 in total right now, and the 10,000 emails we have analyzed so far, 112 of these 136 newsletters actually have some sort of embedded content that is loaded when you just view the email. This might be tracked content. It might also not be tracked. It might be that they have resources like images that they just don't want to put into the email, so that the email is smaller, but they don't track it. But all of these will establish a connection if you open them with remote content enabled. In the same way, 104 of the services have some sort of third party involved if you click a link. Again, this is currently a bit buggy, so this is probably more like an upper bound, and the real number should be a bit lower. We're going to address that over the next few days. And if you combine the two, you will see that 85% of services have at least one of the two.

The third parties that we are seeing most are the following. MailChimp is an advertising and marketing automation company. The Google stuff is mostly CDNs. I don't know if anyone except Google knows whether these are actually being tracked or not. I mean, I don't think I ever saw any reports about them being tracked, but I also don't think that Google just hosts them out of the goodness of their heart, so I don't know. We have more marketing automation stuff. This is a German analytics company that is being used by a lot of German publishers, and Amazon content distribution networks are in there, so there's a lot going on here.

And yeah, all of these parties do receive some sort of information. At this point you might say, okay, well, yes, they receive information. This is not very nice, but it's not like super terrible, right?
So let's get to the next part, cookie syncing, which basically means that these companies also tend to share data between them. This is an example from a newsletter by FastCompany.com, which is a US-based news website. When you just open the email, these are the requests that are generated by just one of the embeds. So you see you get a link to FastCompany.com, which collects an impression. It contains your plaintext email address. Then they forward you to LIADM, which belongs to LiveIntent. It's an American marketing company that has a very, very nice website where they say very clearly how they violate your privacy. They receive the MD5 hash of your email address. They also receive the SHA-1 hash, and for good measure the SHA-256 hash and the domain that your email address belongs to. You know, just to make sure. They forward you a bunch inside of their infrastructure and then forward you to MathTag, which is another very nice company from the States that also receives some sort of identifier about you. And they exchange identifiers. So in the end, basically, LIADM knows what identifier MathTag has about you and MathTag knows what identifier LIADM has about you, and they can now very easily link their profiles about you and make sure that they can target their advertisements at you. So for example, MathTag, no, wait, LiveIntent, I think, were the people that actually allow you to host banner ads inside your newsletters. So you have a newsletter, you open it, and then in the newsletter an ad will pop up that is targeted at you. So this is the kind of stuff that is going on here.

So you can see here that your email address is actually being disclosed. Looking at this in aggregate, you can see that MD5 is for some reason very much favored here. So a lot of these services tend to use MD5. Some of them also use URL encoding, so this is actually not in any way a security measure.
This just basically means they don't want your email address to break on the way. Some of them use SHA-256 or SHA-1. Some of them use Base64. So there's a lot of creative stuff going on. We're also checking if there are nested hashes. So if they say, we take the MD5 hash of the SHA-256 hash of the uppercase version of your email address or whatever. So far we haven't found anyone doing that, but we're prepared.

This gets us to a very interesting point, because these advertising companies usually argue that, well, it's a hash, and hashes are not reversible, right? So this is actually a pseudonym, because we can't get back to your actual identity, and that means it's now pseudonymous data, so it's not covered by the GDPR. That's the argument that you usually hear from these people. This all rests on the assumption that it is impossible to invert MD5. I mean, it is hard to invert MD5, but it can't be impossible, since there is a company that actually tells you, hey, we'll do it for you for four cents an email. So this whole idea of hashes being in any way a good pseudonym that is not reversible pretty much goes out the window at this point.

But even if it didn't, we still have another problem. You might say, well, you only have a hash of the email address. What can you do with that? Well, you can go to Acxiom, which is a US-based data broker, and they basically just tell you, oh yeah, just give us an MD5 hash or SHA-256 hash of the email address, and we match it with the stuff we have on file, and then you get the whole data broker profile about a person. So at this point, I think no one needs to be convinced that this is not actually pseudonymous, because if you can invert it for four cents, this is not really a very good protection mechanism.

So a final thing we looked at was A/B testing.
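Coming back to the hashed and encoded e-mail addresses discussed above, the following Python sketch shows how such leaks can be spotted in a request URL, and why an unsalted hash of an e-mail address is a weak pseudonym. It is an illustration under assumptions: the function names, the address, and the URL are invented, nested hashes are left out for brevity, and none of this is the actual PrivacyMail detection code.

```python
import base64
import hashlib
from urllib.parse import quote

HEX_DIGESTS = {"md5", "sha1", "sha256"}


def encodings_of(address):
    """Return common encodings and hashes of an e-mail address."""
    raw = address.encode("utf-8")
    return {
        "plaintext": address,
        "urlencoded": quote(address, safe=""),
        "base64": base64.b64encode(raw).decode("ascii"),
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def find_leaks(url, address):
    """Name every encoding of the address that appears inside the URL."""
    found = []
    for name, value in encodings_of(address).items():
        # Hex digests may be written in upper or lower case, so compare lowercased.
        haystack = url.lower() if name in HEX_DIGESTS else url
        if value in haystack:
            found.append(name)
    return sorted(found)


def reverse_by_guessing(md5_digest, candidate_addresses):
    """Recover the address behind an MD5 hash by hashing known candidates."""
    for candidate in candidate_addresses:
        if hashlib.md5(candidate.encode("utf-8")).hexdigest() == md5_digest.lower():
            return candidate
    return None


if __name__ == "__main__":
    address = "alice@mail.example"  # invented address
    digest = hashlib.md5(address.encode("utf-8")).hexdigest()
    url = "https://tracker.example/sync?e=" + digest
    print(find_leaks(url, address))
    print(reverse_by_guessing(digest, ["bob@mail.example", address]))
```

The second function is the whole point of the argument: anyone who holds a list of candidate addresses can hash them and match the result, so the hash identifies the person just as well as the address itself.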
So companies may send different versions of their newsletters to different people, and that way they basically want to find out, look, do we generate more clicks if we put the price in here, or do we generate more clicks if we don't tell people the price, and then they just have to click on the link to the article. We detect that by basically having multiple identities, multiple email addresses, signed up for the same service, and then just comparing the emails they are receiving. Why is this interesting? Well, because it's another indicator that there's definitely analytics going on, because you don't do A/B testing and then not track it. Tracking is the entire point. So if you see A/B testing somewhere, you can be absolutely sure that there's tracking going on. So this is a bit of a sanity check. So far we only found three sites that are doing this, all of them e-commerce websites, as you might expect, because they usually have something to sell you, and they are very good at figuring out how to best sell it to you.

So what did we learn? In prior research, we saw that people are actually not aware that this is going on. We also saw in this room that about half the people did not raise their hand when I asked them if they had known that this was a thing before they read the talk description. So this still seems to be the case even in our community, that people are not 100% aware that this is even an option.

We're also lacking good defense mechanisms. Experience with online tracking has shown that asking nicely doesn't tend to work. So just telling them, please don't track us? Yeah, not going to be effective. So the standard defense mechanism now is to just say, okay, we use ad blocking. The problem with ad blocking is basically twofold. First off, ad blocking doesn't work in your Thunderbird. I don't think there's any ad blocker that works under Thunderbird.
I think it might be possible with a bit of work to get something like uBlock Origin into Thunderbird, but even so, it then doesn't work in Outlook and in Apple Mail and in K-9 Mail and all the apps. So you would need ad blocking in all of your email clients, which doesn't work. Second problem: prior research has shown that the lists that are being used by the ad blockers actually don't have very good coverage when it comes to email trackers. I think the number was something like 70% of email tracking companies were correctly identified by existing ad blocking lists, and the other 30% would keep on happily tracking you.

Now, in our community, there are very often two tips that are being given. The first is: just use plaintext emails, don't use HTML emails. If you have ever seen the plaintext version of a commercial newsletter, you know that at this point you might as well just not subscribe to it. So this is really not an option if you are actually interested in receiving these newsletters. The second tip is: don't load remote content. Again, a good idea in general, and also the default in many email clients right now. But again, if you have ever opened a commercial newsletter, you will have seen that in many cases you will then not have images. In some cases, the entire newsletter will just be one big image. So again, if you don't load remote content, you don't get any of the benefits you want to have when you sign up to a newsletter. So it doesn't really work.

We tried something in the web tracking domain. We basically did an experiment with PrivacyScore where we said, let's just take all the German health insurance companies, scan them with PrivacyScore, put them in a list, compare them, and then start ranking them. That way we could see which were the most privacy-friendly health insurance companies and which were the worst. And then we contacted them and told them, look, we ranked you.
And it's about the privacy properties of your websites. What do you think? Would you be willing to make changes and all of that? And in many cases, the responses were, please go away or we will sue you. So this is an approach we will probably try one or two more times in different contexts, and maybe also with emails. But yeah, basically, just like web privacy, privacy in emails is a bit of an unsolved problem.

So in conclusion, web tracking is only part of the problem. We also need to consider email tracking. You can consider email tracking yourself by going to this domain, privacymail.info, and trying it out there. You can find the source code on GitHub. It's under the GPL. If you want access to the data for research purposes or whatever, you can also write me an email. You can find the slides here, under this URL down here. The slides will also contain all of the papers I referenced, so you can find them there. And with that, thank you very much.

And we hopefully have a bit of time left for questions. So where do we have questions?

What about server-based filtering? So if you put it on your email server, basically with the same filter lists normal ad blockers use, could you just filter out most of the... Yeah, but then you can't receive the newsletters, maybe?

Yeah, so if it's just about not receiving them, then of course you can start discarding emails that look like they are tracked. But then again, since most newsletters are tracked, that is basically equivalent to just not signing up to newsletters. And changing the URLs server-side to remove information is probably possible, but in many cases this will also make links break and stuff like this. So you can try it, but it's probably not going to work very well.

Have you tried to track them back? Like, if you get a newsletter you have not signed up for, can you somehow find out who gave them your data?
Because they use the same hashes, or maybe some of them even salted them?

So what we're doing is, we're looking at all the emails that we're getting, and we're basically checking for each email that arrives for a specific identity, so for a specific virtual person that we created here, whether it is actually originating from the service that this identity is associated with. If it is not originating from that service, based on the From address in the email, then I again get a notification and I take a look at it, and if it turns out that this is some sort of spam, then this would also show up in the system. So you saw when we looked at the results here that down here there's a third-party spam category. So far we haven't seen any cases of third-party spam in this way, but we are prepared to show it if it happens.

Another question: if I have this kind of newsletter, there are tracking links in it, and I have to click them, so I can't really get around those tracking links. I can hide the images, but if I want the information behind the link, I will click it. Would there be a way, or maybe an API later, where you could sign up to the same newsletter on your service and have your service click the link for you, so you just get the result you want at the end, without your own email address being tracked?

So basically a proxy that transparently rewrites it to the final URLs for you. If we were to receive emails and then forward them to you and change them on the way, the first thing we would probably run into is copyright law, as always on the Internet. And I mean, it would probably be possible, but we've seen that in many cases, if you just copy the URL that you got in there and then look at things that look like identifiers and change a few numbers, they often still work. So this might be a way to try to defend yourself, but there's really no foolproof way. So it might end up not working.
So it's a bit of an unsolved problem. Yeah, sorry.

I'm just a bit curious what your end goal with this thing is, because what can you really change? You can create awareness, sure, but if I want to receive that newsletter and want to find out what's behind the link, I'll know they're tracking me, but I can't really do anything about it, can I?

Yeah, I mean, it's a valid point. There are two answers to this. The first answer is, why are there health notices on cigarettes, right? You smoke them, you know they're killing you, you're still smoking them. So most of all it's a case of, as you said, awareness. We want people to at least be aware that this is going on. And the second point is, if people are aware that this is going on, and we find cases of real abuse of these tracking powers, then we are able to tell people, look, all of these email newsletters of companies that you are using are using this terrible company that was just in the news. And then you can go to the company that sends you the newsletters and tell them, please remove these people from your newsletter, and you can try to push them to improve the privacy. But of course, in the end, we're scientists at TU Darmstadt, so we have to publish papers, so we are of course also interested in having a large data set. So by doing this you help us do our science and generate a large corpus, and mostly it's about awareness for you and the other end users.

In that case I'd like to follow up a bit, because you say abuse of tracking. Where exactly do you draw the line? Because I'd argue there are legitimate uses of trackers in newsletters. This is actually valuable data, and companies benefit from this. And if people are aware of this and are fine with this, I'd argue this can be okay. So how do you decide whether or not it's abuse, and how do you show this? Because people are going to go on this list and see, all these people are bad and they're tracking me.
So this is a valid point. We as the operators of PrivacyMail are not making any value judgments here. We're just saying, look, this is what is going on, and then you can make up your own mind. And if you say, I definitely don't want any website or any newsletter that embeds Facebook, then this is your personal decision, and you can say, okay, then I'm not using these. As for the argument that there are beneficial uses of tracking and it's good for the economy, to slightly rephrase you, and I know that's not how you meant it, so it's a slightly mean way to put it: of course there are good ways to use tracking, and everyone needs to decide what they are willing to deal with. I mean, I personally think that child labor was also good for the economy, but at some point we decided that it was not good for society and we decided to put a stop to it. And society has to figure out if email tracking is bad or not.

As we are running out of time, there's no more time for more questions. I hope you have enjoyed the talk as much as I did, and if so, please give a big hand for Max.