Well, I'm Alain Schröder, I'm from Germany — I guess you can tell that from the name. I'm going to talk about data mining popcon, that's the popularity contest, which you probably know. This talk is going to be a bit technical after the beginning, so I don't know if I'll scare somebody off — I hope not. I'm doing this currently as my master thesis at the University of Paderborn, so it's still ongoing and not finished yet. Okay. What am I going to talk about? What is my motivation, why am I even doing this? What is data mining — I'm giving a really short overview of that — then my implementation up to now, and what is going to come; the privacy issues, which I guess you can all think of; and at the end a little conclusion and outlook. Okay, so as we all know, Debian is really great. I think you already heard that lots of times today. We have 11 architectures, we have a really easy installer — I even heard we have a graphical one now, even if I never used it before, but sounds good. We have localizations for loads of languages. We are highly customizable, too — think of Skolelinux or Debian-Med or whatever; all these sub-projects build on Debian, and that's one of the features of Debian. Well, we have the best packaging system — I always like to say that at least RPM is no comparison to us, and on the Windows or Mac OS side there is nothing which can even keep up. And with the last release we shipped more than 18,200 packages, which is quite a lot — and also one of the biggest problems. And well, that's why I'm saying Debian sucks. It doesn't suck any more than any other Linux distribution out there, or even Windows or Mac OS, as I just said — but just try to imagine you're a pretty new user to Linux and you're trying to look for software. It's pretty bad if you're trying that; it's a pain in the arse, whatever. It's very impersonal and it's totally complicated.
It's gotten better in the last few years, and I'm going to show what the possibilities are. That's a shell using apt-cache, searching for "burn CD". You get loads of hits: a KDE front-end for burning CDs, a backup manager, an audio management and playback application — I don't know what that really has to do with burning CDs, but it just hits these two words I'm giving it, and it displays it like this. The package installers — you can use the graphical ones now, aptitude or whatever — are basically displaying things like this. So the next, better thing is debtags. The problem is — yeah, Enrico — the problem is knowing these. Well, I know what ISO 9660 is, but I guess lots of people don't. And what the heck is x11 if you come from the Windows side, or are using a Linux PC for the first time? It's like: what the heck? So I'm showing the next version of this, the smart search here. It's much better, actually; I'm using this pretty often because I like it. You just put your keywords in and you get a list first, more or less like what you get with apt-cache, but then you also get wanted tags on the right side and available tags. The wanted ones are the ones which are already selected, and the available ones are ones you can still select to go deeper. But still there is the little problem — well, if there's a description now it's better, I guess — that you have to go deeper into the search to get better results. Okay, that's what has happened until now. So I guess you all know Amazon. Is Martin Krafft in here? I guess he would be very happy.
I'm making advertisements for him because it's his book. If you're buying some book at Amazon or any other big website, at the moment you get shown other books at the bottom which were also bought by those people — some even say that's one of the big features of Web 2.0. Whatever — I think it would be a really nice feature for Debian to do something like this: showing people, "those who already installed the packages you're having also liked this, or they also installed this and this". What you might also know is Last.fm — they've recently been bought by an American company, I think. They're indexing music: you can enter any artist you know of, and they're showing similar artists. So I said it would be very nice to get shown similar packages: if I have GNOME, if I have Gaim — which I'm using pretty often but am getting pissed off with — then I might like to see, well, alternatives to it. Well, the principle we need is KISS: keep it simple, stupid. It's highly debatable whether that's good or not, but I think it would be good to offer it. Well, the goals are: to present as few packages as possible to the user; to personalize the search, so that packages you already installed are not shown, and packages which are very likely to fit you are shown — not console tools if you're only using GNOME tools; and also, well-ranked results after that. The best thing would of course be to present the right package for everybody first, but I guess that's not really possible — maybe we can get nearer to it. Okay, what is data mining?
I don't know if anybody of you has ever heard of it — well, heard of it, I guess, but it's always like a big blurb; it's used for a huge amount of things. Basically it's just knowledge discovery in data, which is what it was called before: just algorithms for discovering connections in the data, or drawing conclusions, whatsoever. I'm going to give a little example for the few I'm going to present. Well, the algorithms you can find are either very simple or totally complicated, but they have to scale over a huge set of data, so usually they are very CPU- and IO-intensive, or at least one of the two. So you're usually going to run into problems if you're going to use them on real data. These algorithms are of course very sensitive to what you put in. That's the GIGO thing — garbage in, garbage out. If you're just throwing garbage at the algorithm, of course you're just going to get garbage out: some connection between, whatever, your grandfather and a red bicycle. It's just stupid — if you put the wrong things in, of course you're going to get wrong things out. You should understand the algorithms if you want to use data mining, because then you know what is garbage when you're putting it in, or can even find out what is garbage when it's coming out. Okay, there are several types of data mining, or typical categories: there's association analysis, which I will explain further later on; there is clustering, which I will explain shortly; classification, also very shortly; and regression, which I will just drop. There's also text mining, which is more or less just a collection of these, especially for mining text files and such — but "mining" is also one of these big buzzwords currently. Okay, clustering. Clustering is when you have a set of, let's say, users or — what's a good example? Maybe a phone company.
They have users which have landlines or mobile phones or whatsoever, and they're using SMS and so on. You just run these clustering algorithms over it and you get different clusters out of it. There is one set which is maybe mostly using landline telephones to call overseas; there may be the pure SMS texters. Well, you get these groups, and afterwards you have to look at what these groups are actually doing, but you're getting groups whose members are connected to each other, similar to each other. The three big algorithms for it, if you want to know, are hierarchical clustering, k-means clustering, and the Kohonen network, which is another word for self-organizing maps — and loads more. But we don't really need this for our problem, so just quickly. The next thing is classification, most famously used by, for example, banks. You know, they just put your data in and then decide: okay, you're creditworthy or not. Well, you live in the wrong area of London or whatever, and you're not creditworthy — that's the bad thing about it. It's not that easy, of course, but that's what these algorithms do. Well, there's k-nearest neighbors, decision trees, neural networks and Bayes networks. I don't want to go into that either. Okay, the association analysis — there are two algorithms for it. From the description point of view
it's very uninteresting. What you get out usually is something like: somebody who buys limes also likes to buy cachaça or brown sugar. Or: somebody who buys limes and brown sugar is very likely to buy cachaça. Or even: somebody who buys all three is likely to buy a very overpriced ready-made Caipirinha box. So these are the things we want to get out for Debian, or what I want to get out for Debian. Like: whoever installed OpenOffice also installed Gaim — okay, I can't come up with a really good example right now — but they are very likely to also install Gaim, so we can rank this package very high, and for example a package which is especially made for KDE very low, because they're not using KDE. Okay, the measures you're going to cope with are the so-called support, the confidence, and the improvement, which are actually pretty easy measures. I'm going to explain them, but first a bit about the statistical basics behind it. It's very easy, because it's only about dependency: if something is statistically dependent on something else, you can use this. One example would be your chance to roll two dice and come out with a result of at least 11. If you don't know anything, that's 3 in 36 — I guess everybody can see that. But if you already know the result of the first die (dice, whatever) — say it is a six — then you only have to roll a five or six, so the chance that you actually get to 11 is 2 in 6, or one in three. If, of course, you first rolled a one, your chance is zero. This is the exact thing: you look at what already happened, and I apply this to the future, to the new customer, to the new Debian user. So if somebody already has package A, he also likes package B. Okay, I'm going to go through a little calculation example — that's the most math you're going to see today, I guess. Say we have 100 installs in total — we have like 50,000 at the moment.
And 50 users of those installed Gaim: you get a support of 50% for this package — or even just 50, depending on what your exact definition is. You have a support of 40, or 40%, for GIMP. And the one below is the users who installed both Gaim and GIMP — so we actually have 25 users of Gaim who didn't install GIMP. That's just the intersection of both sets. Okay, from these you can calculate the confidence, which maybe looks a bit difficult, I don't know, but it's really simple. If somebody already installed Gaim, the chance that he also installs GIMP is 50%. The other way around you get different numbers: somebody who already installed GIMP is likely, with 62.5 percent, to install Gaim. Okay, that's the confidence — the confidence that your rule is correctly applied to this user. Basically, on the left side is what's history — the user already has this, this and this — and over the total user base, 62.5 percent of those also installed the package Gaim. So it's just seen over the whole user base; of course, you might have a user with a totally different set who is in the other 37.5%. Okay, there's another value, that's improvement. That's just how much better you are compared to before, when you didn't know what the user installed — like the empty set. That's the confidence of B if you have nothing installed before: the likelihood that he installs GIMP is 40%, which is the support I already told you about. So if you calculate this value, you get in both cases values above 100%, so you're actually doing better than if you don't consider that they already installed Gaim or GIMP. Is that clear to everybody? I'm getting loads of strange looks at the moment. Yeah. Okay. So that's actually the whole thing behind it. So let's come to my implementation part, the technical part. This is what it looks like.
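Before going on to the implementation: the whole worked example above fits in a few lines. This is only an illustrative sketch using the hypothetical counts from the slide (100 submissions, 50 Gaim users, 40 GIMP users, 25 who installed both) — it is not the actual PL/pgSQL implementation.

```python
# Association-rule measures from the worked example (toy numbers).
total  = 100   # total submissions
n_gaim = 50    # submissions containing gaim
n_gimp = 40    # submissions containing gimp
n_both = 25    # submissions containing both

support_gaim = n_gaim / total          # 0.50
support_gimp = n_gimp / total          # 0.40

# confidence(A -> B) = P(B installed | A installed)
conf_gaim_to_gimp = n_both / n_gaim    # 0.5   -> "50 %"
conf_gimp_to_gaim = n_both / n_gimp    # 0.625 -> "62.5 %"

# improvement: how much better the rule does than knowing nothing,
# i.e. the confidence divided by the plain support of the consequent
impr_gaim_to_gimp = conf_gaim_to_gimp / support_gimp   # 1.25, above 100 %
impr_gimp_to_gaim = conf_gimp_to_gaim / support_gaim   # 1.25, above 100 %
```

Both improvements come out at 125% — the "values above 100%" from the slide: knowing that Gaim is installed really does make GIMP a better suggestion, and vice versa.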
I left in the buzzwords for fun. You load the data — which is actually just the popcon data and the Packages data — into your own, well, database in my case, and then you run the analysis. The loading part is actually called ETL in the business computing world: extract, transform, load. It's just one of the typical buzzword-bingo words. The database which is optimized for analyzing data is called OLAP: online analytical processing. That's another one of these miracle words — I guess you can sell anything to businesses if you combine those two words with "business intelligence". So it's one of these really great buzzwords. This middle part is my computer at home at the moment, which does the analysis, runs through all the data, and produces a data set which is then exported to a web server — or later on maybe to libapt, which I heard of like two weeks ago. Okay. What about the popcon data? Well, it's harvested by Debian — the question was how the popcon data is harvested. There's a package in Debian called popularity-contest; if you install it, it sends all the data it collects once a week to a Debian server. I just download everything from the Debian server — I can do that because I'm a Debian developer, so lots of other people here in the room can do it too. It's all stored in one special directory. I download it and get like 50,000 files — one file for each submission — and then I put it into my database. It's really just downloading and importing. Okay, because you're not going to use all the data — there are privacy issues and some other issues — I'm filtering it, pre-processing it.
So I'm first dropping every package mentioned in the popularity contest data which is not mentioned in any of the Packages files, which you can get from every Debian mirror. So if you have in your popularity contest data a package like linux-kernel-2.6.15-kicks-ass, it won't show up. [Question:] Have you done any analysis of packages that are outside of Debian? Do you just drop them, or do you do some sort of general count, or anything like that at all? Well, a general count is already available on the popcon side. I didn't really drop them — my problem is this is still work in progress, so I'm going to show you how far I got. I'm trying to do a little analysis on that, but I guess — well, if you look over it, you get loads of kernel versions, and Opera, and Java, and like two or three other big commercial packages. Of course, it might be interesting to look into that deeper. Okay, dependencies. If you leave the dependencies in, you're going to get loads of rules afterwards which suggest: well, if you installed any package, you're also likely to install libc. Which is of course pretty illogical, or at least useless, because there are dependencies, and they are installed by default anyway — the package doesn't work without them. It might actually be a good hint for RPM, but I don't know how far they've got by now. So I'm filtering that out with SQL. I'm also dropping all the packages which are below a certain threshold. So if there are fewer than ten installations of a certain package, I drop it, because it would be statistically very doubtful whether the number you get is actually even mentionable. Also, I have a little whitelist and blacklist, but I'm currently using that just for base packages which are installed by default — otherwise you get loads of connections to hostname, which is also pretty useless, because everybody has that installed by default. Okay, my own implementation.
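The pre-processing just described can be sketched in Python with toy data structures — the real implementation is SQL over the popcon database, and all names and structures here are made up for illustration: keep only packages known from the Packages files, strip packages that are dependencies of something else in the same submission, and drop packages below the installation threshold.

```python
from collections import Counter

def preprocess(submissions, known_packages, depends_on, threshold):
    """submissions: list of sets of package names (one set per popcon submission).
    known_packages: names found in the Packages files.
    depends_on: package -> set of its (transitive) dependencies.
    Returns the cleaned submissions, following the filtering steps in the talk."""
    cleaned = []
    for pkgs in submissions:
        kept = pkgs & known_packages              # drop non-Debian packages
        implied = set()
        for p in kept:
            implied |= depends_on.get(p, set())   # dependency-pulled packages
        cleaned.append(kept - implied)
    # drop packages installed fewer than `threshold` times overall
    counts = Counter(p for pkgs in cleaned for p in pkgs)
    return [{p for p in pkgs if counts[p] >= threshold} for pkgs in cleaned]
```

In the talk the threshold is ten installations; the dependency filtering is what keeps useless rules like "everything implies libc" out of the result.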
[Question:] Yeah, sorry — with the whitelist, or... how many did you remove? You said less than ten, for example; how many do you remove? I mean, half of them, or just a tiny bit? Actually, I never recorded the number, but you can see for yourself by going to popcon.debian.org which packages are below the limit. Below ten — actually I'm not dropping them at the moment while I'm building my internal data, because I'm still in an experimental phase, but below a certain limit it's just meaningless. [Question:] And your number — sorry? If you go to popcon.debian.org, you get a list of all the installed packages, along with the numbers of how many users have installed them, so you can actually count it yourself. But well, I can also look it up later, I guess. [Comment:] A little problem: applying this kind of filter, and even using popcon, without knowing that most Debian developers installed popcon but most users of Debian did not, you will get things like "everybody installed dpkg-dev", for example, which is not the most common, right? That's correct. The point is, well, there are around 50,000 submissions right now, and we don't have 50,000 Debian developers — well, unless every Debian developer has 50 machines out there. Of course, that might only be showing the typical Debian developer, but still this data would be useful to Debian developers. Maybe we can then get users to use this system; then they use the popularity contest, and well, after more and more users actually begin to use it, it gets more meaningful for all the developers — or everybody, I mean. Okay. I think I covered that. My own implementation is actually in the database itself. There is a standard for implementing data mining in a database: it's called ISO/IEC 13249-6 and has the very sexy name of "SQL Multimedia and Application Packages, Part 6". Sounds like a monster. It is a monster.
Well, instead of always changing my own syntax, I thought it might be very helpful — at least for me, and maybe even for others who might want to use the same thing later on — if I just keep to a standard. The whole thing is implemented in PL/pgSQL, which is also a strange thing: it's a procedural scripting language built into Postgres itself. I just used it because it was there. What I'm doing here is more or less a proof of concept, and most of the time you're shoving around data rather than looping in the actual programming language, so it should not be a big performance hit — but actually I found out it is. Anyway, if somebody wants to go into details on that, he or she can come up later on. The algorithm takes loads of time to finish, actually. It first started to work like three days ago, and I'm now at 18,000 submissions processed with still like 10,000 to go — and of course, because the data set grows bigger, it gets slower and slower. So I expect it to finish in about a week, and then maybe I can actually get useful data out of it. The algorithm itself compresses the data in its data structure very highly: it turns something like 600 megabytes into just 60 megabytes, I would guess. So you can actually distribute it to a single user, to your user's computer. The rule calculation — that everybody who installed this, this and this will also install that — can then be performed locally. This is actually not working yet, so don't believe me. Well, I guess I'll have it working in a few weeks, but it's still not implemented, and I hope you won't rip my head off. Okay, the presentation side is pretty simple. It's currently a website based on Ruby on Rails. It uses PostgreSQL's tsearch2, a full-text index which was just at hand.
So I used it. You can actually upload your own popularity-contest data to filter the packages which are on the server at the moment — it just filters out the packages that you already installed and doesn't suggest them anymore. The plan for later would be to actually use this huge list of packages and then suggest the correct package. Problem is, I don't have the results yet. Okay, it shows you the neighbors of packages — it has actually done that for a few months, because that was pretty easy. And well, the suggestions, filtered and weighted — that would be the goal after using this huge list of packages. I can show you the website. Okay, I just put in Gaim at the moment. It has actually looked like this for a few months — I think Enrico has already seen it, but I don't know. On the right — well, you can put in the name of any package and it will come up with a list, or the actual package. It shows the topic-related packages. This data on the right, which you can see at the moment, is actually based on the debtags data — without it, it wouldn't work. It's really great that the debtags project exists; it's like the best result I've got up to now. It's a pretty simple algorithm: it shows the nearest neighbors of the package by calculating how many debtags are different from the package's. So if there are no debtags different, then for Gaim you would get all the instant messengers which are based on GNOME and which support, I guess, all the same protocols and so on. If you drop GNOME and, for example, add KDE, the distance would be two. So they go down that list. That thing is actually online.
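The distance just described — counting how many debtags differ between two packages — is simply the size of the symmetric difference of their tag sets. A minimal sketch, with made-up tag sets for illustration (the real tags come from the debtags database):

```python
def debtags_distance(tags_a, tags_b):
    # number of tags present in one package but not in the other
    return len(tags_a ^ tags_b)

# Hypothetical tag sets, illustration only:
gaim   = {"uitoolkit::gtk", "use::chatting", "network::client"}
kopete = {"uitoolkit::qt",  "use::chatting", "network::client"}

# Swapping the GNOME toolkit tag for the KDE one changes two tags in
# total, so the distance is 2 — the Gaim-to-KDE example from the talk.
```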
You can use it. So far it's dps.parkautomat.net, which is also in the announcement of this talk. Also, I calculated loads of relationships between the packages — those rules I showed you, just one package to another. You get something like this at the bottom. This is just a list of packages which are installed by everybody who installed GNOME, so it's pretty useless: hardly anybody deinstalled one of these, so each has a very high confidence of being a package you want to have. This is unpersonalized and pretty useless. That's why I'm still waiting for the, well, data which goes deeper. Okay, the privacy issue, as you can all imagine: you don't want anybody to know that there's actually one user who uses exactly this list of packages and nothing else. At least, I would say it's a bit debatable whether it's good or not if somebody knows that. That's why the popularity-contest data is still not freely available on the internet, but only to Debian developers on the server. But the so-called FP-tree — frequent-pattern tree — which is generated by this algorithm is quite distributable, and I will show why: first, because I did this pre-processing, all the packages which are not in Debian are not in this data structure; and second, you prune it afterwards — which is of course not working yet, because I don't have a real FP-tree yet. Okay, I'll just show what a frequent-pattern tree would look like. This is a list: it starts with openoffice.org at the top with 50 installs, then to the left Evolution with 20 installs, going down to GIMP with 5 installs, and then at the very bottom Gaim with 3 installs. So you would get a list like: well, there are three people who installed Gaim, GIMP, Evolution and openoffice.org. Also, you get a list of five people who installed GIMP, Evolution and openoffice.org. This is still below the threshold, and there could even be just one person who installed this, this and this — but since this is low,
since I want to enforce this limit, this threshold, you just prune the tree, which looks like this. Because Gaim has fewer installs than the threshold, you see that — well, you could cut away GIMP and then add its count to the left, to the others, so you get 13 installs who installed Gaim, Evolution and openoffice.org. You can't always do that: you could move it to the right, but then the Evolution part would be dropped anyway. This is what the tree looks like at the end: there are only paths through the tree which have at least the limit of ten users, so I don't think there's a privacy issue afterwards with this. Okay, now conclusion and outlook. My conclusion is: I can't say anything, because the last part is not working yet. But I can give you a little outlook on what we can do with this. What about hardware information in popularity-contest? It's a big privacy issue, but the possibility would be to say: everybody who installed a system using this RAID controller, or graphics card X, model Y, also needed this driver or this set of tools — maybe not necessarily everybody, but lots of people. You can also do this by hand, but I guess if it were automated it would be nicer. It's not a must, but possible. The same thing applies to partitions, the type of the file system: maybe somebody who has an NTFS partition likes to have tools for file rescuing and resetting Windows passwords. Well, the size of partitions might be interesting too: everybody who has just a very small partition, or small disk size overall, likes more text-based programs; those who have like 50 gigabytes free like GNOME and KDE and whatever. But that's just possible for the future and up for discussion. [Comment:] As far as I know — do you know the hwinfo package? hwinfo harvests a lot of things out of /proc, including partitions. On PCs it also runs,
I think, dmidecode, which gets information from the BIOS. It has lots of good information in it that you could parse and put in there. — Does it send it somewhere, or...? — By default it's just a command-line utility, but yeah. Okay, that would be up for inclusion in popularity-contest. I just can't remember the name — Petter Reinholdtsen, I think, is the one who maintains it, and he's very privacy-concerned over there. So I don't know whether that's actually up for discussion or not. Maybe we can talk about that later, or even when I finish — which is about now. Do I have time for one more slide? Because then I'm finished. Okay. Another one would of course be integrating into libapt, because then everything would be central. Also, my implementation of this ISO thing could be much faster if it were only memory-based — I implemented it disk-based because I expected everything to grow very huge. And maybe a streaming algorithm would be nice, but that's just for the future. Well, this is documentation if you want to actually look up all the things I told you and dig deeper. One of the big algorithms is Apriori. Don't use it — it's going to waste a huge amount of IO time and CPU time. I used it for the generation of the suggestions you just saw at the bottom, which are useless for us; it took like a week to generate and produced 22 gigabytes of data. Actually, if you draw this tree like before, but deeper, it's going to use two to the power of — well, the number of packages you're going to consider. So if you consider 10,000 packages, then it will actually be two to the 10,000th power. That's pretty undoable for any computer at the moment. The better one is the one below, that's FP-growth. Well, there are open-source tools you can use: Weka, and YALE — well, formerly known as YALE and now RapidMiner.
Those are the Java-based open-source data mining tools, if you want to look into that. They both don't have the FP-tree/FP-growth algorithm, so don't use them for this problem. Also, there is the standard, and if you don't want to shell out 250 bucks for the standard, you can go to the DB2 documentation from IBM, which actually has most of the standard in it — so it's pretty readable, this DB2 documentation. Okay, I think we can discuss now. [Comment:] Yeah, a small comment and question. First of all, for the hardware detection part, you should be aware that Ubuntu has already implemented something that will allow the user to submit information about their hardware, what's working or not... — I'm sorry, I'm basically understanding nothing here. I'm not sure if it's me or the speakers here. — Okay, I can try to speak louder. For the hardware detection part, you should be aware that Ubuntu already has a tool available to let users submit information about their hardware — with information like whether the video card is working, whether the sound is working, things like that — to a central point. The problem is that that data is not easily available unless you are maintaining that database, I think. So that might be a good entry point: either to port that software to Debian, or to get access to that database. [Comment:] My name is Petter Reinholdtsen and I'm one of the popularity-contest maintainers, and I have a question from us to you: is there anything we can do, within the limits of acceptable privacy, that can make it easier for you to build this service for the Debian users? Well, at the moment, I don't think so. Maybe I'll come up with some points in the near future. First, my problem —
my big problem — is just to finish my infrastructure, to actually generate the rules, so that somebody who's using this website or the application can actually use the data. Because, well, right after I finish that, I have to write my thesis and turn it in, and only afterwards can I think about anything else. But I will come back to that, and of course — yeah, thanks. [Question:] Hi, I was just wondering: you're keeping all sorts of positive data there, but when you're pruning the tree you are in fact throwing away information. Some users really don't want that utility — if three people out of 500 installed it, are you throwing that information away? Well, that's just a typical issue. I mean, most people are using Google, and that ranks the search results too. Of course, you don't have to use it. Of course, if you just get hits which don't apply to you, just because you once installed the wrong package — well, then you don't have to use it. It doesn't apply to everybody, and that's why you get results where the confidence is just 50% or 20% or whatever. Well, you can't really get much better than that. [Comment:] Hi — great that you're doing this, and I guess we are going to talk a lot in the next days of DebConf, because I've been trying to do the same, but I don't have your level of understanding of the theory, so you are actually doing it properly.
I got as far as finding a library for frequent itemset mining in C++ that's supposed to be very fast, but it crashes GCC if you try to compile it. But the code is in the public domain, so we can fix it, hopefully — or fix GCC. So I'm interested in actually finding tools, also because I will need things like this for other things in debtags. So we should definitely talk about that. One thing that I developed this morning, and the popcon people don't know about it, is a bit of shell script that implements a kind of queue that would allow popcon to notify — to write somewhere that it's got new submissions — and then afterwards we can read from this somewhere and only analyze the recent submissions that arrived. I need that because I want to maintain an index of that data, and this allows me to do incremental indexing. I don't know if you would need that as well; that's something else we can talk about. And libapt now has this ept cache thingy that kind of tries to put all sorts of data together. Of course, I'm interested in any data I could add, be it ways of searching things or ways of ranking results — I can already rank by popcon, and there's probably more we can do. So yeah, thanks a lot, and this will probably start a very productive moment. Okay, just one short note: I'm only staying here till Wednesday, because I have to leave at that point. So try to get to me early, not on Wednesday or Thursday — yeah, after this talk. [Question:] Hello, I have a question, because we in the CDD scope tried to use meta packages, which have just dependencies, and you said you're sorting out the dependencies. So for our scope I want to know: do you sort out these dependencies automatically, without any regard, or how are you doing it?
I currently just look at what depends on what and then throw it out. It doesn't consider meta packages specially at the moment. But if the popcon results say a meta package is installed 50 times... well, by meta packages, do you mean the ones which are not installed, or tasks, or something like that?

Yes, some kind of task. We have a meta package, for instance med-bio, which pulls in all the biological software in Debian. If that is actually installed...

I mean, if there is a package installed which has the dependencies, then these dependencies are thrown out. The meta package itself would stay in, but everything else not. So if you have the gnome or gnome-desktop package, all the depended-on packages would be out. Okay, thank you.

The database approach for rule mining collapses at some point, with a given size. Do you think you have reached that size already?

With the Apriori one it totally collapsed. My problem at the moment is: everything works fine while it grows bigger and bigger, but I can't do everything in one step, because PostgreSQL keeps a lock on what I'm doing, on every call I make, because it's transaction-safe. So it actually gets slower the more calls I do in my stored procedures, and after about 1,000 submissions I have to stop and start again from that point. That sort of breaks the algorithm, because FP-growth is based on scanning the database just twice: first to get the count of each package, and then to build one big FP-tree, this data structure, from that. My problem is that I'm doing the scanning, starting to put all the 10,000 submissions in, and then I have to stop and scan the database from the beginning again. So I'm doing loads of database scans; but with Apriori you would do millions or billions of database scans. There are other approaches to that too.
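The two scans of FP-growth, as described here, can be sketched as follows. This is a toy in-memory version to show the shape of the algorithm, not the PostgreSQL implementation from the thesis:

```python
from collections import defaultdict

class Node:
    """One node of the FP-tree: a package name with a prefix-path count."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count how often each package occurs across all submissions.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= min_support}

    # Scan 2: insert each submission with its items sorted by global
    # frequency, so common prefixes share branches (the tree compression
    # that lets FP-growth avoid Apriori's repeated database scans).
    root = Node(None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, counts

transactions = [{"k3b", "kdelibs"}, {"k3b", "kdelibs", "cdrdao"}, {"mutt"}]
root, counts = build_fp_tree(transactions, min_support=2)
# k3b and kdelibs share a single branch; mutt falls below min_support.
```

Mining then proceeds on the tree alone, which is exactly why being forced to restart and rescan the database after every 1,000 submissions hurts so much: it reintroduces the repeated scans the tree was built to avoid.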
For example, Google has a distributed method of scanning these kinds of datasets. Why not use something like that?

I don't know much about the distributed approaches. I know about the streaming algorithm, which I mentioned on one slide at least; it's called H-Mine. It just starts at some point and scans over the whole thing. Why didn't I use it? Well, I'm still trying to implement this one and didn't have the time to look into anything else; I have to finish my master thesis, that's the problem. I would like to do all the other things too, but yeah.

There's a Ruby implementation of that, and you seem to be Ruby-inclined, so you could try that Ruby implementation of this distributed approach. Oh, great.

Okay, hi, I think this idea is very interesting. I only see one problem: the default desktop task in the Debian installer installs a lot of packages, starting with GNOME and all its dependencies, Iceweasel, Gaim and all this stuff, so they will be ranked very highly. Do you treat packages which were installed manually afterwards differently? Do you rank them higher because of that? I mean, otherwise only Gaim and standard stuff will show up.

Well, as long as you don't personalize the whole thing, and that's the really big problem. My solution would be just to personalize everything: as soon as the program where this is integrated, libept or whatever, has your popularity data, or even direct access to your package database file, then those packages are automatically filtered out, so it's no problem anymore.

Okay. Do you also take into account packages that were removed manually afterwards?
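The personalization proposed here is essentially set subtraction against the local package state: suggestions the machine already has are dropped before display. A sketch with invented package names and confidence values:

```python
def personalize(recommendations, installed):
    """Drop suggestions the user already has, keeping the ranking order."""
    return [(pkg, conf) for pkg, conf in recommendations
            if pkg not in installed]

# Globally, GNOME packages rank very highly because the default desktop
# task installs them everywhere; locally they are simply filtered out.
recommendations = [("gnome-desktop", 0.9), ("iceweasel", 0.8),
                   ("gnumeric", 0.4)]
installed = {"gnome-desktop", "iceweasel"}
print(personalize(recommendations, installed))  # [("gnumeric", 0.4)]
```

With access to the local package database, the over-represented default-task packages stop showing up as recommendations, because the user already has them.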
That would be interesting too. I mean, there is no way for me to follow that at the moment, because I'm just based on the popcon data, which doesn't include removed packages. Actually, you could go to the popularity contest weekly, pull all the data out, put it in a big database, and look whether a package is still installed. But you're going to run into big size problems. Still, that's a possibility for later.

As for the implementation: maybe, if you know someone installed the Debian system with tasksel and selected the GNOME task, you could compare that list with the packages installed afterwards. Then you would know if packages were installed or removed from it later.

Well, you can't be sure; somebody might... well, if the task is installed, of course, it's already filtered out. I don't know if the newer tasks are still package-based. No? Okay, then I don't know whether the package is installed or not. So I could do a little filter: take this package list, see if all of these are installed, and just drop them if they all match the task exactly. Yeah.

Well, I see that we are talking about installed packages now, but I was thinking that in popcon you also have packages you voted for. This means that you have at least one binary inside the package which was used in the last... in the last day? No, I'm missing a point.

I didn't get your point, sorry.

Well, there are "installed" packages and packages you "voted" for, something like that, and the ones you voted for are in fact packages which are installed, but that you have also used in the last day or last month, I don't remember.
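The "little filter" proposed above can be sketched as follows. The task contents here are invented; a real implementation would read the package lists from tasksel:

```python
def filter_task(submission, task_packages):
    """Drop a task's packages from a submission, but only when the whole
    task is present; a partial match means the user deviated from the
    default installation, so everything is kept as a signal."""
    if task_packages <= submission:
        return submission - task_packages
    return submission

gnome_task = {"gnome-desktop", "iceweasel", "gaim"}  # hypothetical list
full = {"gnome-desktop", "iceweasel", "gaim", "gnumeric"}
partial = {"gnome-desktop", "gnumeric"}  # parts of the task were removed
print(filter_task(full, gnome_task))     # only {"gnumeric"} remains
print(filter_task(partial, gnome_task))  # unchanged
```

Dropping the task only on an exact full match is the conservative choice: a machine where task packages were removed is exactly the kind of manual intervention the miner wants to see.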
Yeah, in the popularity contest today a vote is, I think, when you used a binary or program in the last three weeks. That would be usable if you build two models, one over the used packages and one over the not-used ones, and then combine them into a score afterwards. Actually, I have a plan for that. But the problem is that, for everybody who develops, the -dev things would be left out, and those would be helpful to some people: somebody who develops with this package would also like to develop with that one.

Okay, I still have five minutes, so I guess I have to stop after this.

Actually, there was a discussion on the popcon list just recently about integrating PHP files and maybe Python files too, so it's getting broader: whether such files actually count as used. Well, I would still say: do two different models and combine them afterwards; it's statistically not that difficult.

Yeah, I guess since we're running out of time: I think there are enough people who are interested that maybe we should have a BoF on data mining. Could we do that before you leave town?

Yeah, that's no problem. Well, I'm leaving on Wednesday. Maybe we could gather right after this and talk about a good time for it, or tomorrow.

No, I mean just gather and pick a time.
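The two-model combination hinted at here could be as simple as a weighted average of the confidences from the "voted" (recently used) model and the plain "installed" model. The 0.7 weight below is an arbitrary illustration, not a statistically fitted value and not the method from the thesis:

```python
def combined_score(vote_conf, inst_conf, vote_weight=0.7):
    """Blend rule confidence from the 'voted' model with confidence from
    the 'installed' model. The weight is an invented placeholder; a real
    system would fit or tune it."""
    return vote_weight * vote_conf + (1 - vote_weight) * inst_conf

# A -dev package is installed but never 'voted' for, since no binary in
# it is ever executed, so the installed model has to carry it:
print(combined_score(vote_conf=0.0, inst_conf=0.6))  # about 0.18
```

This is exactly the case raised in the discussion: usage-based votes systematically miss library and development packages, so a pure vote model would never recommend them, while the blended score still can.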
Yeah. Okay, well, one more question.

Okay, I was at the wacky ideas BoF, and one of the wacky ideas was to integrate hardware information. That would allow us to put pressure on hardware vendors about drivers, because we would have some real number to put in front of them. Also, a few people have started to work again on packages.debian.org, to improve the web pages in general, and this kind of information, recommending packages the way Amazon does for books or CDs or whatever, would be really good to have on such pages. A third comment is that I would maybe like to have non-anonymous popcon data, because, I don't know, maybe I like the way joeyh uses his computer, and I would be interested in knowing which packages he uses, so I could try to install the same and see if that fits me.

Okay, Enrico. Because I didn't get everything, actually, a small comment on that last one first: it's an interesting idea, and I'm pretty sure it will fly, at least with some people, but I'm also pretty sure it's not called "popularity contest" then. So make another package and get that spread out.

One comment about the feasibility of using popularity contest data: one thing we need to keep in mind is that it's highly inaccurate and only provides a lower limit to the number of Debian installations, and the votes information is only useful to compare similar packages. So you can't compare the votes for a kernel package and a binary package with each other; it only makes sense to compare binary packages to each other, or development packages to each other, or kernel packages to each other, because the vote is a made-up number that doesn't make sense across types of packages. Also, the popularity contest votes are designed to not be affected by cron jobs. That's why most of the stuff in /lib is excluded: every night a
cron job re-indexes the ld.so cache, so every library would count as used every day if you actually took that access time into account. So we are open to suggestions on how to actually get votes for packages, but it has to make sense and not mark them as always used by everyone. So thank you.

The time is up; I'm getting a red card over there. So maybe you can shout.

Okay, just very quickly, about the idea of looking at what joeyh does: I implemented a very simple metric that shows you what packages you have on your system that other people usually don't. It's implemented in ept-cache. I could blog about a command line that gives you a list of packages you could talk about to your friends, because obviously you use them and you like them and your friends don't. And then you can just ask joeyh to run this command and post the results in his blog.

Okay. Well, thank you everybody, and I guess we'll meet and look for when we can get a time frame for...
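The rarity metric described at the end can be approximated as: of your installed packages, keep the ones that appear on only a small fraction of all popcon reports, rarest first. A sketch with invented numbers; the real ept-cache implementation may differ:

```python
def unusual_packages(my_packages, popcon_installs, total_reports,
                     cutoff=0.05):
    """Locally installed packages appearing on fewer than `cutoff` of all
    popcon reports: the things your friends probably don't have.
    popcon_installs maps package -> number of reports including it."""
    rare = [(popcon_installs.get(p, 0) / total_reports, p)
            for p in my_packages
            if popcon_installs.get(p, 0) / total_reports < cutoff]
    return [p for _, p in sorted(rare)]  # rarest first

# Invented counts: bash is on nearly every report, the others are rare.
popcon_installs = {"bash": 49000, "wmii": 300, "ledger": 120}
mine = {"bash", "wmii", "ledger"}
print(unusual_packages(mine, popcon_installs, total_reports=50000))
# ledger (0.24%) and wmii (0.6%) qualify; bash (98%) does not.
```

This inverts the usual popularity ranking: instead of surfacing what everyone has, it surfaces the packages that distinguish one machine, which is what makes the "ask joeyh to post his list" idea interesting.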