Hello and welcome to the Tracking Website Users BoF. This BoF will be mostly about the social and technical limitations of tracking website users. I am Martin Zobel-Helas, and I'm working with the website team. I am also a member of the Debian System Administrators team and still work a bit within the Listmaster team, though I am not that involved anymore. During last year's web team meeting, which we had in Vienna in December, besides talking about the website we also started some discussions about which content to move from the wiki to the website and vice versa. So we wondered what effect it will have on the web pages if we start moving content around, within the website and not only from the wiki to the website, and we very quickly ran into social limitations on that problem. What shall we do? What can we do? What should we not do? Our current status is that we have very, very little information about our website users. The only thing I am currently aware of is the link over there, where we have the web logs from one of the www mirrors in a somewhat anonymized form, so we only get the link to the page and the count of views of that specific page. At the moment we cannot tell anything about how many page impressions we have per day, whether a release makes any difference, whether we should expect a user peak, or what will happen to the traffic if we change content on the main page. So we started thinking about what we can do. I think the easiest thing is having something like Webalizer or AWStats running on the web logs. But it is a big social problem if you start gathering all this information: should this information be public and available, and so on. Well, we would like to see how many users we have on the website.
We would like to see the effects of moving content around within the website, and from the wiki to the website and vice versa. There is also interest, especially from the translators, in seeing whether there is content on the website that is not yet translated but is viewed quite often, for example, let's say, by developers or website users from India. Then we could start prioritizing translations, so that pages which are viewed quite often but not yet translated are handled before other pages which are not that commonly used. For example, Debian news items from 2002 are probably viewed much less than, let's say, the www.debian.org/devel page, something like that. And we would like to gather information on whether it is helpful to have some pages translated with a higher priority. What we could do is, well, the obvious things. We could start analyzing the Apache log files. Nowadays all the web mirrors are on debian.org, on DSA-maintained hardware, so we can easily get the Apache logs. Another technique would be having some sort of icon, a one-by-one pixel, served via a CGI script or whatever, to gather information about, for example, user agents and so on. Most of that can also be done using JavaScript, with some sort of Ajax request, for example. We would then also get information about the screen size of users and other things. The problem is: do we want that information? If yes, which social limitations are we facing? That is actually what I would like to discuss here in this BoF, so we can gather input and then the web team can use it for setting up some sort of tracking, I think. The obvious question is, should we start tracking at all? And if yes, do we want the tracking information publicly available to all users, or is it something only the website team should be viewing? I would like to discuss that with all of you in here.
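As an illustration of the log analysis mentioned above, here is a minimal sketch in Python that counts successful page views per path from Apache access logs. It assumes the logs use the standard "combined" format; the function name `count_page_views` and the sample lines are made up for this example and do not come from the session.

```python
import re
from collections import Counter

# Matches the start of an Apache "combined" log line; only the request
# path and the status code are captured.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

def count_page_views(lines):
    """Count requests answered with status 200 per path, ignoring query strings."""
    views = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "200":
            views[match.group("path").split("?", 1)[0]] += 1
    return views

# Hypothetical sample data (documentation-range IP addresses).
sample = [
    '192.0.2.1 - - [01/Jan/2000:10:00:00 +0000] "GET /devel/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [01/Jan/2000:10:00:05 +0000] "GET /devel/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.3 - - [01/Jan/2000:10:00:09 +0000] "GET /News/2002/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '192.0.2.3 - - [01/Jan/2000:10:00:12 +0000] "GET /missing HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
]

for path, hits in count_page_views(sample).most_common():
    print(hits, path)
```

Only aggregate counts leave this function, which is the kind of purely statistical result that could be compared against the list of untranslated pages.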
But the people on the streams want to hear the question as well.
OK, so first of all, the information that you want to gather will probably be used for making the user experience better. So the information you gather will help you to better organize your website, and in that sense I think that kind of information gathering can be useful. Secondly, if I use the internet and make a request to your site, as soon as my request leaves my computer it is not private anymore. So this is something that you can't call private data anymore, I think. And that is the reason I think there is nothing wrong with gathering that kind of data if it will be used for improving the user experience. So that is my view.
OK. Yeah. I have some personal experience, because we introduced some tracking techniques at the institution I'm working for. It's a German-based institution, so I'm not sure how far that experience applies here. But we had major difficulties with privacy issues when we started doing this. We had, I think, three lawyers writing some kind of privacy statement, which we just put somewhere on our website; the lawyers said it would be OK, but we needed to publish that. So if we need to do something similar for the Debian project, it might be a little bit more difficult, because we might also have to add other sub-projects, or things like the lists, to our global privacy statement on our website. But I'd really like to see something like that.
You were talking about... do we actually track individual users, or should we just gather statistics about site usage? There's a big difference between the two. Do we actually have a use case at this point in time where we need to track individual users? For all the cases you've given so far, just statistical analysis of the site would be sufficient. If we don't actually have a use case for doing it, why even spend the time and effort doing it?
You've already stated that, from a statistical point of view, it would be good so we could see which translations need to happen. That implies we haven't got enough bandwidth to cope with the workload we have now. Why make more work for ourselves? Just give it to Razup.
One of the reasons why we want to track, or one of the useful things that might come out of tracking users, is that there are usually a lot of people complaining about the bad navigation, and we might want input on how we can improve this. If we know what paths people take and which directions they follow to get to the information they need, we can get actual information out of that on how to improve at that level. That's one of the main reasons why tracking users is useful for us: to improve the navigation.
The idea is not to have a one-to-one mapping in the log files, but to have some sort of anonymized IP, or however we do it; it will probably be on the DSA side in the end, or whoever is going to implement that. We will implement some sort of anonymization on the log files. But as Rhonda already said, tracking the paths a person takes over the website helps: for example, if almost everyone views both of two pages, merging their content into one page would mean one click less in the end.
Okay, so this will be boring to Zobel because I told him over lunch. But in general, we previously had a Webalizer analysis on klecker.debian.org when it was www.debian.org; that gave way to, I don't know, bit rot or something, so we don't have it anymore. So as far as statistical analysis goes, I don't think there's any downside to it. And we could even, if resources permit, publish the results, you know: tell the users what we gathered from the logs about them.
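The speakers do not say how the log files would be anonymized. One common and fairly coarse approach is to zero the host part of each client address before the logs are analyzed or published. The sketch below assumes that approach; the prefix lengths (/24 for IPv4, /48 for IPv6) and the function names are choices made for this example only, and truncation alone may not be sufficient anonymization.

```python
import ipaddress

def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the host part of an address so it no longer points at one client."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

def anonymize_log_line(line):
    """Replace the leading client field of an access-log line."""
    client, sep, rest = line.partition(" ")
    try:
        return anonymize_ip(client) + sep + rest
    except ValueError:
        # The first field is not an IP address (for example a resolved
        # hostname): drop it entirely rather than keep it.
        return "-" + sep + rest

print(anonymize_ip("192.0.2.77"))             # 192.0.2.0
print(anonymize_ip("2001:db8:1234:5678::1"))  # 2001:db8:1234::
```

Running the anonymization as a separate step before any statistics are computed matches the "collect, anonymize, then analyze or publish" order proposed later in the discussion.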
And as far as tracking goes, there's one use case; this is somewhat anecdotal, but other people have told me they have the same experience, so I'm going to lay it out. You go to a website, for example www.debian.org, and you can't find what you're looking for. Then you go back to Google and search for a string that leads you back into the debian.org domain, but at the right place. Now suppose that originally, when the user was at www, we had set a simple tracking cookie, so no actual information is relayed other than the existence of this cookie. Then, when we receive that user back on, for example, security.debian.org or packages.debian.org, we know: this user was already at one of our other websites in this session, but did not come with a Referer field saying he was linked here directly. So this is a useful bit of information. Obviously we don't want to keep the information about where the user was, spying on them and whatever; the essential point is that our navigation was broken. We did not lead this user through internal links on our website; we cannot see an internal link in the Referer field, because the Referer field says Google, or Yahoo, or Bing, or whatever. Or it says, I don't know, some random forum where another user who had found the same solution linked back to the right place in our hierarchy. So that sounds like one simple usage of a tracking cookie, and a very trivial tracking cookie: no extra information, just PHP sessions or whatever.
Okay, so if you think that collecting statistics or tracking users would be useful for you, then you probably should, we probably should do it and figure out how to do it. So that answers, at least from my point of view with no hat on, the question of whether you should do it at all.
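The cookie idea described above boils down to a simple check on each incoming request: the visitor already carries the session cookie, yet the Referer header points outside the project's own domain. A minimal sketch of that check follows, assuming the cookie is shared across the subdomains; the names `SITE_DOMAIN`, `is_internal` and `navigation_miss` are invented for this example.

```python
from urllib.parse import urlsplit

SITE_DOMAIN = "debian.org"  # assumption: one cookie shared by all subdomains

def is_internal(referer):
    """True if the Referer URL points at the site itself or one of its subdomains."""
    host = urlsplit(referer).hostname or ""
    return host == SITE_DOMAIN or host.endswith("." + SITE_DOMAIN)

def navigation_miss(referer, has_session_cookie):
    """True when a visitor who was already on the site comes back via an external link.

    Requests without a Referer are not counted, because a bookmark or a
    typed URL says nothing about the navigation.
    """
    return has_session_cookie and bool(referer) and not is_internal(referer)

print(navigation_miss("https://www.google.com/search?q=debian+security", True))
print(navigation_miss("https://www.debian.org/", True))
```

Only the yes/no result per landing page needs to be counted; neither the cookie value nor the earlier pages have to be stored, which fits the "no extra information" constraint stated by the speaker.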
One of the things I would like to see in the setup, however, is that privacy is preserved, in the sense that nobody should ever be able to figure out after the fact whether or not I have visited a website, and certainly not which sites I personally have visited. So collect the data, run some kind of post-processing on it, and then work on that data. And if we did the anonymizing step right, then there's no reason not to publish that data, because somebody might be able to do something useful with it. So collect the data, anonymize it and publish it, and then run whatever kind of statistics you want on that data.
Do you want to reply directly to that, or may I pass the microphone on?
Somebody previously mentioned a privacy statement. Modern-day websites are powered by lawyers; this is a realistic thing. Users have come to expect a privacy statement. In Debian, the users who know us expect that we won't invade their privacy, so they don't really look for a privacy statement. But it would be okay if we spelled it out. If we told users: okay, we have no financial interest in tracking you and selling you the latest version of Debian or whatever. Just spell it out. For example, simply have a page that says we won't do anything evil, because we will do this, this and this. For example, we do collect logs because Apache writes them to disk. Okay, we will not turn off that feature, but these logs will not be sold to adclick.net or whatever it's called, whatever agency. So maybe just spelling out the obvious would be our privacy statement. It is obvious to us, but not to random web users, who might be alarmed because they read on some random website: your privacy is endangered by any cookie, by any log, by anything.
You should sit down. Spelling out the obvious is fine, but then we should not use the existence of this privacy statement to rationalize things that we otherwise wouldn't do.
It shouldn't be a place where we tell people: oh, and we are going to be very evil, and it's okay because we just told you.
I wonder if some initial navigation tracking can already be done with the current logs, by tracking IPs and web browser identification strings. I mean, that won't be totally accurate, because obviously you may have thousands of people behind a NATed address, but it may already give some ideas about pages that need to be merged. That may be a first step, in the sense that you start getting some information and try to use it, so you can validate whether that information is actually useful; and you don't need any privacy statement or anything, because that's data we already have. From then on there could be a second step. Obviously the downside is that you do the first crappy step and never do it properly, because it becomes a long path with loads of steps.
I think we're back at the question: are we collecting statistics and paths for the sake of it, or do we actually have a genuine use case? If we've got things we want to do, let's go ahead and do them. I'm still trying to understand: are we saying, look, we could do some cool stuff with the statistics to make things better, or are we saying, I want to make the flow better, I want to merge pages together where it's obvious that this should be one page, or even split pages up? If you've got a specific task, go and do it. I'm just wary about this "let's do this globally, collect loads of information and then mine it later", because you'll always find that you never collected the right information. You always have to collect something slightly different later, when you need it. So half the time the exercise is pointless. As I say: give it a use case, have a specific task to do with it, and then go and do it.
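The "first step" suggested above, approximating navigation paths from the existing logs by grouping requests on IP address plus browser identification string, could look roughly like the following sketch. The input format (time-sorted tuples), the 30-minute gap and the function name `count_transitions` are assumptions made for this example; as noted in the discussion, clients behind a shared NAT address will be mixed together.

```python
from collections import Counter

def count_transitions(requests, max_gap=1800):
    """Count page-to-page transitions per (IP, user agent) pair.

    requests: iterable of (timestamp_seconds, client_ip, user_agent, path),
    sorted by timestamp. Two requests from the same pair count as one
    transition if they are at most max_gap seconds apart.
    """
    last_seen = {}
    transitions = Counter()
    for ts, ip, agent, path in requests:
        key = (ip, agent)
        previous = last_seen.get(key)
        if previous and ts - previous[0] <= max_gap and previous[1] != path:
            transitions[(previous[1], path)] += 1
        last_seen[key] = (ts, path)
    return transitions

# Hypothetical sample: two browsers behind the same address.
sample_requests = [
    (0, "192.0.2.1", "UA-A", "/"),
    (10, "192.0.2.1", "UA-A", "/security/"),
    (12, "192.0.2.1", "UA-B", "/"),
    (30, "192.0.2.1", "UA-A", "/security/faq"),
    (5000, "192.0.2.1", "UA-B", "/devel/"),
]

for (src, dst), n in count_transitions(sample_requests).most_common():
    print(n, src, "->", dst)
```

Page pairs with high transition counts in both directions would be candidates for the merging mentioned earlier; the output contains only page pairs and counts, not the client keys.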
I think to us in the web team, or to old guys in the web team like me, it's kind of obvious what needs to be done, because we have this diversity of web portals. For example, we have packages.debian.org and www.debian.org; these aren't the same virtual hosts, so the logs are separate, but really they're one and the same. They're the same website as far as the Debian user is concerned. They want the official Debian information, and packages.debian.org provides official Debian information. That's the kind of thing that needs to be analyzed. But yeah, definitely: as I said, we had Webalizer once before, so statistical analysis first, and then we can work from that.
More questions? Well, one of the things which annoyed me a bit: during, I think it was the Lenny release cycle, we had this big picture on the main page which used more or less half of my screen resolution. So do we know anything about the current screen resolutions of our website users, so we could optimize graphics and so on for, let's say, 50% of our users? Is that an appropriate thing to track, or is that tracking too much information?
The same thing I said at lunch. Okay. Well, at least we have to spell it out. If we discuss it in a big group, I think there's more consensus than with just the two of us. So with anything we do with the website, accessibility is always in the back of our heads. For any kind of new technology, or old technology that we are not using now, we have to ask: do we detract from the experience of the lowest-common-denominator user? So if the tracking cookie is not loaded by a user who doesn't like JavaScript or who uses, how do you call it, Adblock or NoScript? NoScript. If they still see all the content and are not harmed in the process, then there's no realistic reason to go against it. And for everyone else, if it brings a concrete enhancement, that's okay.
And this builds on the general ideas about what we do with the website. For example, one of the ideas I promulgated was that we need to have more RSS feeds and readers, and to be able to integrate them. For example, the page on the official website about the SPARC port should have an RSS feed from the SPARC mailing list, so as to integrate these things better. And if we do this with an iframe, we need to know: does anyone still have a browser that doesn't support iframes at all? If they see a giant gaping hole where they're supposed to see some content, that would be a bad thing. So we shouldn't do that, and we should instead use some other technical measure that accomplishes the same thing.
Yeah, so also collecting statistical information about browsers. No more comments or questions on the whole topic? Paul.
So the Debian wiki publishes the IP addresses of everyone who changes a page, in the page history. Yes. It's not that visible, but it's there. The other thing is that the wiki has several statistics pages about which pages are the most popular, which pages are the most edited, and all that sort of thing. I'm not sure of the URLs right now, but those pages are there. The other thing I'm thinking about is the publicity implications of this sort of thing happening. I'm just imagining a Slashdot headline: "Debian starts tracking users", or something like that.
Well, in that case we might really publish some privacy statement and hope that the headline on Slashdot is "Debian publishes privacy statement".
Yeah, my point is that we need to manage that release and make sure that it's positive rather than negative.
That's our job. The wiki makes things easier, here too. The page history is clickably available directly from the same page. All of our changes to the main website are logged in a version control system that has a web front end.
So basically we're already doing the same as far as that privacy aspect goes: who did what and when, tracking those users, actually publicly tracking them and telling everyone when they edited the website and so on. The wiki unobfuscates that process. For the website it's hidden in a CVS repository, so you are probably 15 clicks away from the main site, but it's there. So, you know, we shouldn't kid ourselves about this information really being private, but exposing it completely may not be really smart. And we haven't even spoken about bugs.debian.org yet.
Bugs?
I have my own small website where I look at the Apache logs directly; they're small enough. And it's kind of interesting that in the Google searches, it actually includes the search string. It's interesting to find how people reach my website searching for combinations of words where, yes, all of those words are on that page. I guess it does fit, but they're obviously not looking for me. Does anyone remember that? Yes. Dueling banjos. Yes.
So, more comments? You were saying something before about the bug tracking system.
I think it might be interesting to track which are the most popular bugs that people are finding and looking at on the web pages.
You want to comment on that? No? Yeah, oh sorry.
But isn't the bug tracking system something where we should take privacy even more seriously? Think of security-related bugs.
I'm thinking more of general statistics about which is the most popular bug, not necessarily revealing who's looking at that page.
Okay. The bug tracking system has had an outstanding bug report for something like 14 years now: please hide my email address in the web search. There's a battle of priorities there between two conflicting requirements. One is the requirement from the social contract: we will not hide problems in any shape or form. We will allow people to comment, so they need to have access in some form to those email addresses.
So you might as well publish them on the web, but then it's also a huge privacy issue, because an unsuspecting user uses our bug tracking system, reports a bug in good faith, and then we give his email address to a flurry of spam bots that abuse him for years and years to come. So that's really a use case where the privacy issues are huge and we don't have it fixed, you know. There's no fix, really. I'm not really authoritative; Blars is here, the active member of that team, but back when I was involved, the official line was that there is no official line. We're not going to change the status quo, because we would still be disadvantaging the old users who had previously posted their email addresses to the bug tracking system and got abused, while the new users are spared; but then again, a spammer could just subscribe to debian-bugs-dist at lists.debian.org and harvest all those emails anyway. So it's a multifaceted issue; there is no real solution, and it's a real hardcore privacy issue for actual users. Want to comment directly on that?
Our current policy is that we are publishing their email addresses. We frequently get requests from people: please remove this from the bug logs. And the canned answer we have for that (I didn't write it, but someone did) is that it has already been published, it has already been distributed on multiple mailing lists, removing it from our website probably won't do any good, and it does make responding to the user more difficult. And if we cannot easily email the user with a reply to their bug, it makes the bug fixing much more difficult. So it is a trade-off. Some people are requesting simple obfuscation, which some spammers won't be able to work around, but most spammers will work around it so easily that it's...
Coming back to your statement earlier, that we shouldn't change the status quo because it would disadvantage people who've already had... come on, we change things all the time, we make things better.
Should we leave everything as bad as it is? No, we improve things. That's not an argument.
Blars's argument was the retort to that question, because the trade-off goes both ways. If we obfuscate, we fix one problem but we create another problem, potentially just as disruptive to the development process. If the users are harmed by the inability to talk to each other directly, then that's a problem again.
Hi, for the... In general, well, there certainly is this trade-off between the accessibility of contacting the people who submit bugs and their privacy, and maybe their request not to be available to receive email. For the initial submitter, though, we do have a submitter alias, like bugs-submitter at the BTS. So it seems to me that at no loss of anything we could hide the initial submitter's email address. And for follow-ups, well, we can email them back, or maybe the page can just say: just so you know, your email address will be public; and then people hopefully understand for the follow-ups.
Calm down. I suppose we could set up email address redirects, but then the submitter addresses are definitely getting spammed. It goes through our spam filters, obviously. But I think we are also losing a bit of track of the...
Tracking? Tracking? Yeah.
You've got to relay it. Nobody catches those bounces. Yeah. Unless something changed in the back end since I last looked seven or eight years ago, nobody catches the bounces for NNN-submitter at bugs.debian.org. So you essentially mail a black hole, because the mail server over there will not deliver that. So it's not perfect. There's a lot of overhead in this.
I agree. This is now off-topic for website tracking, so I'll talk more about it later. Should have used a smaller room. Yeah, Dan? Oh, it's a really... Okay. So, more about tracking and social issues about tracking? Not really? Then we close this up.
I just want to point out that...
Well, from my point of view, the privacy statement is one issue, and whether you do the tracking, how you actually do it, and whether you record IP addresses is another issue. So one thing is the implementation, and the other is how you make it public and how you present it to the public. What I wanted to say about presenting it to the public is that there's a Mozilla approach to create some privacy icons, which can be nice. It's not really final yet, but they're trying to make some icons which will easily show the user how data is collected and used on the site, which could be nice for Debian too, I think. I can post a website URL on the copy in a minute.
So, no more comments?
Well, if I may go. Maybe I missed it, but I guess it makes sense to state it. From what I understand about the social issues, there seems to be a sort of consensus that as long as we don't surprise people by doing something that we don't say we do, we don't consider it controversial to start tracking users if we need it.
Yeah.
That seems to be a consensus, and I think that's an important one. I mean, I have no problem with it, and I am kind of happy to have it stated clearly.
Yeah, and I think we are also mostly out of time, so I take that as a... If there's no... No, good. Final words?
Yeah, well, just an idea, just a thought. We could... Well, obviously we will have this kind of DebConf announcement and something in the project news. So would it be worth summarizing the idea that we might consider tracking users and, well, using that opportunity to ask our users if it's okay, if we also say we will anonymize... Well, you know what I mean. Anonymize. And just ask for their opinion. Would that be a worthwhile idea?
I think the idea is good, and I think that before we actually activate that, if we actually start it, there will probably be an email on debian-project saying that we are doing so. And of course we need to take care that we don't end up on Slashdot with some negative views, but asking users might be a good idea, I think. Yeah, okay. Then I would say we close the discussion here, and thanks for coming.