So this is a talk about archaeological studies in data waste, given by Katharina Nocun, a blogger and author, though many of you will know her as a privacy activist, and by Letty. A warm round of applause for these two speakers, and welcome to the English translation of this talk. We're always glad for your feedback under the hashtag C3T; our Twitter account is C3Lingo. Yeah, thank you, Pooper. Nice how all the heralds are performing something with the French announcement. A question I would like to ask first: how many people here bought their Christmas presents at Amazon? I would say roughly half. Please keep your hands up, because we'll continue. Who of you only did the research at Amazon and then bought elsewhere? Would more people raise their hands for that? So not many were added. I think it's now somewhat more than 50%, it's hard to see, but those who raised their hands will surely agree with me when I say that it's terribly comfortable to do your research, or even order everything, with one single supplier. And you imagine delivery as a very comfortable thing anyway. I don't know who at DHL made the decision to put their ads on Formula One cars, but I think that person has a great sense of humor, because in reality it's quite different, right? Really well-paid drivers and fast deliveries, that's not really what you get in practice. And with Amazon, I would say, not at all. Some of you will probably know these notices saying that the package was left with your neighbors. Or, as this note says, "I hid it in the trash." And that is, of course, no guarantee that it is really there. For those who can't read it: this was seen in the Tagesspiegel newspaper. "I hid the package in the trash, in the blue paper container." Yes, what could possibly go wrong?
And that really makes you happy when you think, how can we solve this problem? And of course, Amazon has a product for that, at least for its US customers. As a Prime customer, they can now decide to use Amazon Key. That is an intelligent lock system for your apartment door, and you can say: well, the Amazon delivery person, I trust them, they can unlock my door and place the packages in my hallway. And to go with that, there is a video surveillance system, so you can check when the children are coming home drunk or something. I don't know how you feel about it, but I think that's terribly creepy. I would never do that. I think very hard about who I give access to my apartment. But as a privacy activist, I wondered: isn't the insight that Amazon gets into our lives through our clicking behavior even more intimate than a harmless glance into the entrance of my flat? Those who look into my flat may learn something about how I live. But those who know my online behavior know when I click where, and they can glean from that the way I think. And that is much more intimate. Therefore, not last year but the year before, in 2016, I decided to conduct an experiment. I wanted to know what Amazon stores about its users. As part of that, I started to order all my Christmas presents and other presents from Amazon, and to do all my research there, to generate a record that would be as complete as possible. And my aim from the beginning was to get access to this data. I wanted to dig through it. I wanted to see exactly what Amazon stores about its customers, because, I don't know if you know this, according to Article 15 of the European General Data Protection Regulation, the GDPR, everyone in Europe has the right to go to their online service providers and ask them for a free copy of all the data they hold about you. And they have to deliver that.
In practice, hardly anyone does that. And with Amazon, as far as I knew, no one had really gone through with this to the point of getting something usable out of them. So I decided to go on a hunt for that data. So what did I do? I bought a lot. What did I buy? Well, almost 60 books within 14 months. And if you wonder: yes, by now I have read at least half of those. I also bought practical things, such as chalk in a spray can and a button machine. My purchase profile also contains strange things, such as the lavender spray that you spray onto your pillow before you go to sleep, you may have heard about this. What unfortunately is not in the picture is the home trainer that I bought. It was so successful that after three months, I sold it on. And I bought some useful things too: a mouse, some files and shoes. Through Amazon, I also became the owner of the most beautiful slippers on the planet. In 2017, I thought, well, this data set is now fairly well fed with clicks and purchases, so now I'm going to retrieve my data. And unfortunately, I have to say, that was the beginning of a long and intensive pen friendship with the departments at Amazon. Because at first, I did not receive what I wanted, but I kept asking questions. And at some point, they sent me CDs. So first, I had to go to my basement to see how I could read these antique data media. And on the first CD, unfortunately, I only found what I had expected: a copy of the data that I had already been able to access online, plus some extra information. Not really what I had wanted. I then kept nagging them and received a second CD-ROM. Actually, they sent three. The other one was lost in the post, in the trash, probably. And yeah, I looked at this last CD and wondered what was on it. There were things like PDFs where I could see what search queries I had used, what advertising emails I had responded to and when, to the second, and what ads on the website I had reacted to.
The interesting thing was an Excel sheet. And that sheet had the innocent name "clickstream". I opened that sheet, and it took quite a while to load. And at one point, I saw: all right, this sheet has 15,365 rows, and each of these rows has 50 columns, 50 items of extra information. I went to the supermarket to visualize this. That's the amount of paper it would take to print it out. I had intended to put this into one pile, but that would have been taller than me, at a height of 1.70 m, and it would have toppled over quickly, so I put it into two piles. And when Katharina asked me whether I wanted to evaluate data, I first thought of my database lecture: the relationship between customers and suppliers, what they buy. But the data at Amazon is not just purchases. They basically save everything we do on that website. It doesn't matter whether we just go to the start page, search for products, or look at a product in detail. They even store whether we enlarge an image, and every interaction that we have when we use our online profile with Amazon. Because this is a foundations talk, I thought I'd like to take you with me on the journey into the unknown data, and show you the things you can do with such data in an explorative analysis to really understand what's in it and what you can see. So as I said, there are 50 columns, also called dimensions. Because I cannot list all of them, I tried to broadly assign them to categories. We have time, of course. We have account details, such as whether you are a Prime customer or a business customer. We have location information: Amazon stores the IP address, not the last byte, but everything else. But they also store which country and which federal state you are in, and which service provider you use to visit the website. And of course, the URL that you visit is stored, and if it's a product, they also store the product ID. Oh, I forgot about the session details.
The session details are things like what is in the cookie, so that you can see which session you're in. There is also another ID that connects you across all the services from Amazon and identifies you with all these services. And regarding the navigation details, you can see where things are going. They see where you come from, where you go within the site, and where you leave to. All that is stored, not just the history, but also what you do, what kind of interactions you perform. Do you add something to the shopping cart? Do you look at an image? Or do you save something for later? And last, you have what we call Amazon internal. You see which web server the query was sent to and whether that query came from an internal IP address within Amazon. You also receive a long list of all these dimensions with explanations of what's behind them. And sometimes, Amazon does not store clear text but encodes things. Zero might mean that you put something into the shopping cart, so you think, oh great, there's not so much to parse there. But then I realized, when I looked at the data in more detail, that Amazon doesn't tell you everything they encode; some codes are simply unknown to us. So let's say that this pen friendship is continuing, because I had more questions regarding that. So if you look at this set of data with those 50 columns and 15,000 entries, I picked two entries to show you, as an example, how they are set up. You have the date and an action, such as search or purchase. You have the URL, the federal state the request came from, the internet service provider, and the time it took to load the page. And not just for those two entries, but for all 15,365 of them. Which time period are we looking at? It starts on the 1st of August, 2016 and ends on the 31st of October, 2017. Within that period, there are 196 days with activity, with about 78 entries per day.
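A minimal sketch of how the figure of about 78 entries per day can be reproduced. It assumes the timestamps have already been parsed into Python `datetime` objects; the function name and the toy sample are hypothetical, and only the totals 15,365 and 196 come from the talk.

```python
from collections import Counter
from datetime import datetime


def entries_per_active_day(timestamps):
    """Return (number of days with activity, mean entries per such day)."""
    per_day = Counter(ts.date() for ts in timestamps)
    active_days = len(per_day)
    return active_days, len(timestamps) / active_days


# Toy sample (made up): two entries on one day, one entry two days later.
sample = [
    datetime(2016, 8, 1, 9, 0, 0),
    datetime(2016, 8, 1, 9, 0, 5),
    datetime(2016, 8, 3, 20, 15, 0),
]
days, mean = entries_per_active_day(sample)
print(days, mean)

# Totals quoted in the talk: 15,365 entries on 196 days with activity.
talk_mean = 15365 / 196
print(round(talk_mean, 1))
```

Only days with at least one entry are counted, which is why the divisor is 196 and not the full length of the period from August 2016 to October 2017.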
If you have such an unknown data set and work with it, then I would always use Python, because that has more or less been accepted as a standard for analyzing data sets like that. So, whatever Amazon says, I look at those dimensions in detail: how many dimensions are there, and how often were they used? Some of them are always used, such as the date and time, but other dimensions aren't used often, and we had one dimension that was never actually used, something about images, no idea what's in there. I then looked at every dimension in detail, at what's in there and how frequently each value occurs. And then I looked at the time, because this was supposed to be stored with second precision. But I kind of don't understand how you can have 45 entries in a single second. So then I thought, okay, let's look at another column and see what's in there; maybe that's just an outlier or something, but you never know. And then I looked at single days. You can, of course, aggregate per day, and you get a distribution with three outliers which are very conspicuous, as they are high above everything else, and one of them has 710 entries in a day. I don't know how intensively Katta was using Amazon on that day, from early to late. Well, that was quite an achievement anyway. So I looked at that day in detail. We have 710 entries for the day, and I wondered, what's the time range? It's 20 minutes, 35 seconds. So that's one entry every 1.74 seconds, and that would have had to look like this. If I could manage that, I would perhaps consider a career as a pro gamer. And then I went further into the data. You have these value counts that you would use for a histogram or something, and this function and I have become really good friends by now. I looked at this very closely to see where things come from, and I realized that these actions are always well defined.
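The two plausibility checks described above can be sketched as follows. This is an illustration under assumptions, not the speakers' code: the function names and the toy timestamps are hypothetical, the calendar date is made up, and only the figures 710 entries and 20 minutes 35 seconds are taken from the talk.

```python
from collections import Counter
from datetime import datetime, timedelta


def busiest_second(timestamps):
    """Return (timestamp, count) for the second that holds the most entries.

    Assumes the timestamps carry second precision, so identical values
    mean "logged within the same second".
    """
    return Counter(timestamps).most_common(1)[0]


def seconds_per_entry(first, last, n_entries):
    """Average spacing in seconds between entries inside a time window."""
    return (last - first).total_seconds() / n_entries


# Toy sample (made up): three entries share one second, one follows later.
t0 = datetime(2016, 12, 1, 20, 0, 0)
sample = [t0, t0, t0, t0 + timedelta(seconds=1)]
peak_time, peak_count = busiest_second(sample)
print(peak_time, peak_count)

# Figures quoted in the talk: 710 entries within 20 minutes 35 seconds.
window = timedelta(minutes=20, seconds=35)
rate = seconds_per_entry(t0, t0 + window, 710)
print(round(rate, 2))
```

A rate this high over twenty minutes is hard to explain by manual clicking alone, which is a reason to look more closely at what kinds of entries the log actually contains.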
It's actually stated about 4,600 times, and I thought, maybe it's not always easy to classify. And then there were two other interesting things I noticed: request and lazy load. These things I didn't consider much of an interaction. I am a web developer, so I thought maybe something else is in that data. So I looked at the URLs that are connected to these entries, and they pointed at least to Ajax, which is a web technology. So let's look at the browser. As I said, I'm a web developer as well, and the browser tells us a lot about websites and the traffic that's happening on the network. I did that for Amazon as well, just selected an image and looked at what is loaded while you are on that page. And yes, of course, every page loads a lot of stuff. As soon as that had finished, I thought, okay, in that list I will search for the URLs that I wasn't quite able to understand earlier. And yes, sure enough, these are all images or things like buttons and reviews that are loaded later, which in my view isn't a user interaction, but it's all part of that clickstream data. So if you look for the real user interactions among those 15,000-and-something entries: I assume that a real interaction has to have a page action, otherwise it's not really a user interaction, so let's select those first. Then I don't want it to be a request either, because that doesn't look like a user interaction, and not a lazy load either. If you take all of that out of the data, you have eliminated about 75% of the data and you're left with only 3,747 entries that I would assume to be real user interactions. I don't know how many of you use GitHub, but I love this graph that shows you the interactions you have had with GitHub, and you could do the same thing with Amazon. I don't know if I would be so happy about that. In any case, every box represents one day, and the darker ones show the days with many interactions.
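The filtering rule described above (keep an entry only if it has a page action and is neither a request nor a lazy load) can be sketched like this. The field names `time` and `page_action`, the marker strings and the toy rows are assumptions for illustration; the talk does not say how the real columns are named or encoded.

```python
from collections import Counter
from datetime import datetime

# Hypothetical markers for entries the browser triggers on its own.
NOISE_MARKERS = ("request", "lazy load")


def is_user_interaction(row):
    """True if the row has a page action that is not a request or lazy load."""
    action = (row.get("page_action") or "").strip().lower()
    action = action.replace("-", " ").replace("_", " ")
    if not action:
        return False  # no page action recorded at all
    return not any(marker in action for marker in NOISE_MARKERS)


def interactions_per_day(rows):
    """Count the remaining interactions per calendar day (calendar graph input)."""
    return Counter(row["time"].date() for row in rows if is_user_interaction(row))


# Toy rows (made up), mixing real actions with background traffic.
rows = [
    {"time": datetime(2016, 12, 20, 19, 0, 1), "page_action": "Search"},
    {"time": datetime(2016, 12, 20, 19, 0, 1), "page_action": "ajax-request"},
    {"time": datetime(2016, 12, 20, 19, 0, 2), "page_action": "lazy_load"},
    {"time": datetime(2016, 12, 20, 19, 0, 2), "page_action": None},
    {"time": datetime(2016, 12, 21, 8, 30, 0), "page_action": "Add to cart"},
]
per_day = interactions_per_day(rows)
kept = sum(per_day.values())
print(kept, dict(per_day))
```

With the totals from the talk, 3,747 of 15,365 entries remain after filtering, which is roughly a quarter of the data set.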
For example, just before Christmas, right? There they are very pronounced. That doesn't mean that every interaction leads to a purchase, and I tried to find out how Amazon classifies purchases. For example, hey, we could look at page actions. You would expect Amazon to have nicely formatted data like orders or purchases, but nope. For some reason, you have to filter the data manually to get an idea of which interactions are purchases and which are not. I also could not find out from this enumeration of placed orders whether there are any interactions between them. So I could not find that out from the data set. If you compare interactions and purchases: of the 196 days on which I interacted with Amazon, there were only 24 days on which I purchased something. But you can also see that in December, Katta bought something four days in a row. And you can also see that it was a lot of books. For example, in December, she bought 32 books in four days. Okay, we'll leave the topic of interactions and purchases and turn to the locations where Katta was when she was browsing Amazon. The first thing I saw is that she was mostly in Berlin when she interacted with Amazon. Okay, maybe she lives there. You also have the states of Brandenburg and Schleswig-Holstein. I don't know what her connection to Schleswig-Holstein is. She also had interactions from North Rhine-Westphalia and Lower Saxony. And I don't know how you guys do it, but if you want to leave Berlin, you always have to leave via Brandenburg. And during the time when I did this experiment, I sometimes had to go to Schleswig-Holstein, so you can see that. North Rhine-Westphalia is where my family lives, and when you go there, you have to drive through Lower Saxony. And I think that I can see from this data set when I visited my parents, but we will see. So this is only the column with the states we were in. So how does Amazon know that my parents live in North Rhine-Westphalia? Well, think about it.
Just before Christmas, I visited them. So that's how Amazon knows. So you can find out very private things about me. Then I looked at the internet service providers, and there is something that I noticed: some of them tell me something about what she does. For example, Katta used the Freifunk network in Hamburg, and then I saw something from 2017. That's when she was on the Bahamas. Not what you think, that was a holiday. And in July 2017, she was in Poland. Yes, I was on a family holiday there. There was something else that I noticed, and I think that you have some connection to universities or libraries. Well, yes, I actually do like to write in libraries. Some of you will know the Deutsches Forschungsnetz, the German edu-realm. This is an internet service provider for universities and libraries, and you can see that in the internet service provider data as well. Amazon only allows 50 characters for storing the name of the internet service provider, and the name of the Deutsches Forschungsnetz is too long to be stored there. What you can also see in the data is how long you stay on Amazon in the library. Sometimes you're only there for a few seconds, maybe a bad conscience when you're in the library and go to Amazon. Sometimes you stay there for a minute, and sometimes you're on Amazon for 13 or 14 minutes. Maybe you're procrastinating there. But on some days, you've been on Amazon for over an hour. Well, that was only research, you know. Something that I've also noticed in the data set is that from some specific point in time, Amazon tries to find out whether the tab is in the foreground of the browser or in the background. Since Katta's data set is limited, I can't find out whether that is intentional or how good it is. With this limited data set, I can't find out right now whether Amazon is trying to improve this detection.
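One way to estimate "how long you stay" from raw click timestamps is to split them into visits wherever there is a long pause. The sketch below is an assumption about method, not something the talk specifies: the 30-minute idle threshold, the function name and the toy timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def session_durations(timestamps, max_gap=timedelta(minutes=30)):
    """Split timestamps into visits at idle gaps longer than max_gap.

    Returns one duration (last entry minus first entry) per visit.
    The 30-minute default is an arbitrary, assumed threshold.
    """
    ordered = sorted(timestamps)
    if not ordered:
        return []
    durations = []
    start = prev = ordered[0]
    for ts in ordered[1:]:
        if ts - prev > max_gap:
            durations.append(prev - start)  # close the current visit
            start = ts
        prev = ts
    durations.append(prev - start)  # close the final visit
    return durations


# Toy entries (made up) for one provider on one day.
visits = [
    datetime(2017, 3, 2, 10, 0, 0),
    datetime(2017, 3, 2, 10, 0, 8),
    datetime(2017, 3, 2, 10, 13, 30),
    datetime(2017, 3, 2, 15, 0, 0),
]
durations = session_durations(visits)
print(durations)
```

A visit consisting of a single entry gets a duration of zero, so short visits are underestimated; the result is a lower bound on the time actually spent on the site.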
And what's very interesting about the data set is also that you don't need that much technical understanding of how things work. You can, for example, look at column V, and column V is about referrers. That means: where are you coming from? What was the page you visited before you came to Amazon? And in my case, I found a specific page from the media website Spiegel Online, a specific article, and I had visited that page before I visited Amazon. So that was in my clickstream. So Amazon knows what news I read, maybe what political affiliation I have. I also found another page, on the Telepolis website. This is a very critical report about CETA. So think about it: Amazon knows which political articles I read, because that's in the referrer. So Amazon can know my political affiliation. And in that case, I actually ran a campaign against CETA in Schleswig-Holstein. And you can't overstress how drastic the difference between products bought and products clicked is. For this talk, I put everything I had bought in that time on my kitchen table. But if I were to put all the products I have looked at into one photo, I would not just have had to clear my kitchen table; they wouldn't even fit into my whole flat, because it's so much more. And what you also saw in that data is that Katta searched for certain items or terms more than 500 times, and that she visited product pages a lot of times. And it's actually a lot more, because these are just the more obvious things. If you go into that data set manually and look for certain patterns, you can see that there are even more products that were looked at. And I then wondered: what does someone see who doesn't know me personally but does know my data trail? What kind of person do these people see? And do I think that's all right? I then took certain things I had clicked on and looked at them from different viewpoints.
Let's take the issue of life planning. Suppose someone asks themselves, well, how does Katharina Nocun imagine her future? What are her plans? And then looks at her clickstream. What do they find? I looked at a book that deals with arguments in favor of having children, but also another one that deals with alternative partnerships and polyamory. And of course, also a book about someone who dropped out of the economic system and turned their back on the consumer society. And if you look at these three products, and maybe a few more that are in there, then you would think, okay, well, this is an extraordinary, very individual life plan, and it may be difficult to reconcile all of this. But what is it actually like? The book about children I looked at because I had been made aware of the author: I like the blog Spreeblick and wanted to know what else this guy is writing. The second book is one where I knew the author and was invited to the reading, and I wanted to know what the event was about. And the third book simply is one from the same publisher that I am with. So I wanted to know what else the publisher I'm going to work with publishes. So that means that the image generated from that clickstream and the person that I really am are probably two very different things. Let's take the issue of health. That's much clearer. I looked at schnapps. Amazon has a separate category for that: alcohol. And you could imagine why I have an interest in schnapps. Well, maybe I have some health issues and would like to self-medicate. I also looked at a book about arthritis, a very severe disease, and what's even more shocking, I didn't just look at one, two or three, but at many books about cancer. One example: healing cancer naturally. And if you look at the clickstream that way, you would imagine that Katharina Nocun is a wreck in terms of health. But what is the real picture? I have to say that this gin is really good. I really like drinking it.
The books about severe diseases I looked at because I was doing research. I wanted to know how large the share of esoteric nonsense in the Amazon bestseller list on health is. And this intention you cannot see if you just look at the clickstream. I am suffering neither from cancer nor from arthritis, and I have no signs of either. Another interesting thing is political opinion. In my clickstream, there's a huge number of books about the right-wing Alternative for Germany. There are books against the AfD, but also books by far-right authors and conspiracy theorists. And if you just look at the clickstream, you would probably consider me a very strange or even not very likable person. You would think this is a right-wing person. But what really happened: I deal with the AfD critically in my blog, so I did research. And the research you do, yeah, well, no one can really support the AfD, you really have to say that. And of course, I wanted to see what the other side writes. What do they publish? What do the hotheads publish, and how high does it rank on the bestseller lists? And if you just look at my data, you don't see that. You might consider me a far-right person myself. And this gets interesting if you consider who would have an interest in this data. Where there's data, there is interest. And of course, there are authorities that would be interested in querying such user data, everything that's there. And if the clickstream is there, it will be queried. Imagine a police officer who thinks this Katta person might be a cybercriminal, or that a potential threat might emanate from her. Let's see if there is any evidence to support that theory. What do these people see? Well, first of all, they see a so-called killer game on my list. And that looks extremely nice, doesn't it? Extremely likeable.
And if you look at this from an authority's point of view: I also looked at a black t-shirt with a print that says, "Chemist, only because superwoman is not an official job title." You could then conclude that I have interesting skills and hobbies, I think. And next, a suspicious cooking device, a pot, and a balaclava. And I don't know how you see this, but from the point of view of a government authority, that would not look good at all. If you look at it with that kind of mindset, you could consider this to be a dangerous person. And at that point, I think it would be high time for a visit. But of course, there's a completely simple and harmless explanation for every product. Exactly: "Anyone could say that," someone says in the audience. That would be the counter-argument, of course, and it would leave me in a very bad place. I don't know how you feel, but just having to think about what kind of consequences could follow from that kind of data waste in a worst-case scenario, I find that extremely threatening. And Katta then gave me her trust to look at the data and keep it confidential. But how does Amazon see this? Amazon is large, Amazon is very large, about 300 million users. And I did this analysis for just one person. Amazon can do this analysis for all those 300 million people, and they can see patterns in that. They can see which products end up together in the shopping cart, for example. And what does it mean for something to be in the shopping cart? Amazon knows: what else do people buy who buy this product? And the simplest example is the precision scale. As a cook, you may want to know how much goes in there and how much you have put in already, and if you buy this, then you immediately get these great suggestions.
That is a clear sign that there are other uses for precision scales. You are laughing, but that could really lead to serious consequences for someone if, without their knowledge, they are put in a certain category that has nothing to do with them. For example, if I look at a device to cut glass, a glass cutter, I get a recommendation for a balaclava. And what can you do with the right equipment? You can imagine that yourself. That is a problem, because, I don't know how you feel, but I want to know which categories I'm put into, and I would like to have a say if those categories are unpleasant to me, and I may think that some categories simply should not exist. And the problem is that when I request my data, I just get this minute mosaic piece of the whole. The actual information, the possible evaluations, the possible insights or conclusions, I could only fully understand if I knew the whole thing. But Amazon will never release this, and Amazon will never release the algorithms that they use on search queries and so on, because that, of course, is a trade secret. And that, of course, is the really interesting data that we would need to get a full insight into the way this company looks at us and how they manipulate us in a targeted way, to get us to buy more, for example. And I don't know what you think about this, but the name Amazon is very fitting. I don't know who thought up the name, but it's great, because the Amazon River is the biggest river in South America and it has a lot of small rivers that flow into it. It's similar with my experiment. I only used one Amazon product, but I could also have used a lot of Amazon products. For example, what if I had watched all my videos on Amazon Prime Video? What would have happened if I had decided to include Alexa, the Amazon Echo, for example, if I had put an Amazon Echo in my bedroom?
I think my data set would have been a lot more interesting, and I, for my part, consciously decided that I do not want to do that. That is too much. I will not put an Amazon Echo in my bedroom. And a few weeks ago, I was very happy that I had made that decision back then, because, I don't know if you noticed, it was in the news: another user requested his data from Amazon about his Amazon Echo, and he received an entirely different data set, from another person. So that's something that you definitely should not put into your bedroom. Just to reiterate: if you use a lot of these services, you use the same Amazon ID, and that is also what is stored in the Amazon data set. So if you use other Amazon services with the same login, it's definitely in the same cookie. The cookie says which ID you use. So now you think, well, it's great that you have your data, but what about my data? How do I get my data? There are different ways to get your data. I can tell you how I did it. First, I read through the AGB, the terms and conditions, and I read them in their entirety, which was a unique experience for me. I also looked through the data privacy statement. In both of these documents, categories of data are listed, so I can see what to expect. I also used my common sense to consider what I could actually expect from the point of view of logic. I'm at Amazon, and one week later, I get an email: do you really not want to buy this product? So they have to have stored for one week what I bought or didn't buy. And if they say "we don't do that", that's of course a lie, because otherwise it would technically be very hard to implement this. From that information, I built a checklist with my expectations of what a full disclosure should contain, and only then did I formulate my request to access my own data. And I cannot stress enough how important it is, when you request your data, to always set a deadline.
Without a deadline, nothing will move. The GDPR does have a timeframe in which they have to respond to you. That is one month, but it's never wrong to explicitly state this deadline in your request and to remind them when that deadline has passed. What can also be motivating is to put something in there like: if you don't reply, I will turn to the data protection authorities. And one request sadly is not enough. If you send that request, you have to expect a pen friendship. That might be a nice thing. When you send this request, the first answer you receive is: well, look into your profile, that has all your data. That of course is nonsense. It only contains a small fraction of the data you want. So send them a friendly reminder. Next, you may get a letter or an email saying, yeah, we have your data here, and you look in there and think, okay, someone just printed out the profile, or copied it and put it into a PDF. So not what you want to have. You shouldn't accept that at all. So, another friendly reminder. Then you have reached the next level, and at some point there will be some insight on their side: oh, we found some data here, and now we'll send it to you. That's the point where the CD arrives, and at that point it gets interesting. The probability that they play this kind of game, and that you have to repeat your request for a while, is quite high. But if you repeat it often enough, you will eventually have your whole data set, and that data set you can search and analyze at will. And believe me, it's worth it, because it's one thing to know abstractly that you are being surveilled with every step and click, and a completely different thing to see your own broken sleep rhythm for a whole year in front of you. And that is something I would not entrust to any retailer. And I then ask myself, do I want this?
And everyone who sees such a data set will surely ask themselves the same question. In my case, the decision was that in the future I will buy my books directly from the supplier and not via Amazon Marketplace, and now and then I even saved money this way. Knowing what is being stored will probably enable us to say: I'm going to use the service differently, or I may not use it at all. And if you don't use it at all, then I would recommend that you hand in a request for deletion. That might also lead to a pen friendship. In this talk, we did not want to say that this is an individual problem of the users who are using Amazon. No, the problem is that Amazon acts just like many other services. To be honest, surveillance has long since become the norm, and we wanted to make visible what that means for the individual. But we still think that we all have to fight together so that privacy, data protection and security become the standard, the default. Yes, and that gets us to the end. You will find all kinds of information on my blog on how to ask for your data. And if you're interested in the data analysis, I have published it on Amazon, no, on GitHub. There's a repository called Amazon where everyone can look at the analysis and see their own broken sleep patterns. Yes, thanks a lot. Applause. Huge applause.