Okay, I'd like to start. I welcome these three people. This talk is led by Tássia; you know Enrico, and you perhaps know Alain from Edinburgh. They are all talking about an application recommender.

Good afternoon, everybody. The main goal of this BoF is putting this topic on the agenda, and since these people are here, it's a great opportunity to share experiences. I'm finishing my masters on this topic — information retrieval, recommendation — and I'm a follower of Enrico's work, so many things you'll see here are use cases of the infrastructure he showed. The three of us have, at some moment in time, worked on and tried to build a recommender for Debian, and I'd like them to speak a little about their experience. Both their services are not online anymore, and mine is almost there. They'll speak a little, then I'll speak about mine, and then we open the discussion — maybe you can give us advice and ideas to contribute to the work. So, Enrico? Most naive first. We prepared an agenda for this BoF; it's on gobby too, so if you can take notes there, it will be better than me doing it, I think. It's there as guidance. So, Enrico.

Okay, hi. About package recommenders — as in, what are the packages I should see but don't know about — it's an interesting and difficult problem. I tried a very naive approach, which is to index all the mails sent by Popcon, which are basically lists of the packages people have on their systems, and index them as if they were documents, with each package as a word. So basically each system is a document and each package is a word. Then I used a text search system that I like a lot.
And, I reckon you guessed it, that's Xapian — to, given my package list, look for similar documents. That's it, that's the naive approach I had: I send the list of packages on my system, the database gives me similar systems, and then I take the packages in those systems that I don't have, and those are the recommendations. It was a prototype, a naive approach. It kind of worked in several cases, but it's never been studied enough. Other people did something a bit more solid, so I'll pass the mic, I guess.

Well, hello. Yeah, I gave a talk in Edinburgh, as was mentioned just before, about my approach, so that video should still be online; if you're really into that approach, you could watch it. What I did is basically the Amazon approach: somebody who buys A also buys B. There's the famous example that somebody who buys diapers also buys beer, which is not the most intuitive connection, but it actually is like that. The downside of this approach is that it takes a huge amount of computing time, and I spent most of my time building a new algorithm that allowed answering this question in, let's say, more or less real time. I put a prototype up at Edinburgh and got great feedback at the conference itself; afterwards I never heard much, didn't really get feedback. And then, luckily or unluckily, I got hired and didn't have any time anymore. So I'm very proud, or think it's very good, that Tássia is now doing the proper follow-up work — or actually approaching it from another perspective, doing it properly, I guess. I hope.

I hope! Okay, so what I'm doing: I first tried the same approach Alain did, building association rules — whoever has A, B and C also has D, with a certain confidence and support — but it's really heavy and I didn't have the computing power for that. Then I started to work on another approach, which is really based on search.
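Enrico's "systems as documents, packages as words" idea can be sketched in a few lines. This is a stdlib-only illustration, not his actual code: the real prototype indexed Popcon mails with Xapian, while here plain Jaccard overlap stands in for Xapian's similarity search, and all the package data is invented.

```python
# Sketch of the naive approach: each Popcon submission is a "document"
# (a list of packages); we find the systems most similar to ours and
# suggest the packages they have that we don't. Jaccard overlap stands
# in for a proper search engine's ranking here.

def recommend(my_packages, popcon_submissions, n_neighbors=2):
    mine = set(my_packages)

    def jaccard(other):
        other = set(other)
        return len(mine & other) / len(mine | other)

    # Rank all submitted systems by similarity to ours.
    neighbors = sorted(popcon_submissions, key=jaccard, reverse=True)

    suggestions = {}
    for system in neighbors[:n_neighbors]:
        for pkg in set(system) - mine:
            suggestions[pkg] = suggestions.get(pkg, 0) + 1
    # Packages seen in more of the neighboring systems rank higher.
    return sorted(suggestions, key=suggestions.get, reverse=True)

submissions = [
    ["vim", "git", "mutt", "xapian-tools"],
    ["vim", "git", "mutt", "irssi"],
    ["gnome-games", "libreoffice"],
]
print(recommend(["vim", "git"], submissions))  # 'mutt' ranks first
```

With a real Popcon dump the neighbor search is the expensive part, which is exactly what a search engine like Xapian does efficiently.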
So all my strategies are based on search, but done in different ways. Basically, I can separate my strategies into content-based and collaborative: the content-based ones only look at what the system has installed and recommend packages based on that set of packages, while the collaborative ones really need a repository of users' histories — we use Popcon for that. I built an index of Popcon, and then I do the search in many different ways. What I'm working on right now is a survey to collect people's evaluations of the different strategies. I'm still doing experiments to really calibrate the algorithms, because there are many parameters: the size of the profile, the size of your neighborhood, the size of many things. We'll come back to that in the challenges — how do you really build this profile of users, and so on. So I'm finishing this round of experiments to choose the best parameters, and then I'll ask people whether the recommendations are good or not, and which strategy should be adopted for a service. I can show the survey and the interface for the recommender afterwards, but before that I'd like to talk about the challenges we face when developing this kind of system. Maybe you can help us — and as I haven't had much opportunity to talk with these two, this is a great opportunity, I think.

So, challenges — there are many. The first thing that comes to mind: it can't be something where the user has to go somewhere and look for the recommendation. It should be, in a certain way, integrated with application installers; it should be built so that it can be attached to an application installer. I was really excited about AppStream and was following all of that, and I'm still waiting — there's been no more movement about it on the list.
Heiko talked a little about how that went. Now, I don't know how we're going to do this — where is the microphone here? Okay, I'll show it right now. As I said, I was working on this interface, and then I stopped to build my survey, because the survey should be ready first, I think. I'll just show an example — this interface is really based on screenshots. Yes, the demo thing; okay, I'll show it afterwards. When they start to talk, I can't figure out what's happening.

Well, the survey — this should be working; I'd like to release it during DebConf. Sorry? Well, as Enrico said, it's really easy to get all this info about packages. I compute my recommendation, bring back the set of packages I'm going to recommend, look up their info in UDD, plus screenshots, and build this page. Here the user says: is this useful or not? Is this a poor recommendation, or were you pleasantly surprised? This is just the survey — as I said, I need it to validate and compare the different strategies. It won't be the final user interface; I'm not an interface designer, so that wasn't really the focus. These ten evaluations make up one complete response, and the user can do as many as they want — the more I get, the better. When they finish, they can choose whether to give more details about who they are, and whether they want their name published on the thanks page. And that's it, that's the survey.

The input file is a Popcon submission file — or any file with package names as the first field of each line. I have some examples I use; this one I got with dpkg-query.
So the first field of each line is one package name. It works with just a plain list of packages or with a Popcon submission file. This is what I have done right now, and I plan to release the survey before we leave DebConf, so that you can help me collect data. The interface for the recommender is coming; Valessio is helping me — he's part of our team: Valessio, Tiago and me. We are open to contributions; the code is on GitHub, at github.com/tass/app-recommender. The survey is at deb.li slash — as I said, it's not ready yet, but you'll be notified when I need your help.

Beyond people's feedback, the testing I can do without people telling me whether it's good or not uses a technique called cross-validation. I have the set of packages installed on one system — I can use any Popcon submission file. I partition this set of packages, take one partition out, give what is left to the recommender as the profile, and compute the recommendation. Then, against the held-out partition, I can check what was recommended or not, and compute lots of metrics: precision, recall, F1, many metrics. With that I can tell which strategy works better than the others, but it's not really a real validation — I think I really need real users validating what they get as recommendations.

Do you want to say anything about validation of results? Yeah, quick question. You talked about AppStream earlier, and I know the distributions have agreed to use some shared interfaces and some shared source code from software-center. But is any of the distributions that already has something like that publishing their algorithms — how they do the recommendations? Or is that secret sauce for everybody?
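The hold-out evaluation described here can be sketched as follows. This is a minimal illustration, assuming `recommend` is any function from a profile to a ranked list of package names; the real experiments plug in the app-recommender strategies instead.

```python
# Cross-validation sketch: partition the installed packages into folds,
# hold one fold out, recommend from the rest, and score how much of the
# held-out fold comes back, averaging precision/recall/F1 over folds.

def evaluate(installed, recommend, k_folds=5, top_n=10):
    folds = [installed[i::k_folds] for i in range(k_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        profile = [p for j, f in enumerate(folds) if j != i for p in f]
        recommended = set(recommend(profile)[:top_n])
        hits = recommended & set(held_out)
        precision = len(hits) / len(recommended) if recommended else 0.0
        recall = len(hits) / len(held_out) if held_out else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    # Average over folds so one lucky partition doesn't dominate.
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

installed = ["vim", "git", "mutt", "irssi", "gnuplot",
             "octave", "r-base", "ess", "auctex", "texlive"]
# A cheating recommender that returns exactly the missing packages
# scores (1.0, 1.0, 1.0); real strategies land somewhere below that.
cheat = lambda profile: [p for p in installed if p not in profile]
print(evaluate(installed, cheat))
```

As noted above, this only compares strategies against each other; it says nothing about whether users actually find the recommendations useful.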
Well, for software-center they do collect data — they have a ratings-and-reviews server, but I don't have access to that, and they don't have the recommendation algorithms ready yet. Maybe you can talk about what they do. I wanted to say that I'm not aware of anybody actually doing recommendations at all yet. Well, they sort, and they say "these are suggested packages", but they don't do this kind of recommendation using a database. They have ratings, but I don't think they recommend based on what packages you have installed on your system. They're not even using the ratings yet — I followed that work, and as far as I can see, the part that uses the ratings-and-reviews server isn't ready yet, I think.

DKG? No? I didn't really understand how you get the set of packages you're evaluating in the survey — how did you get these ten packages in this image? That's the result of the recommendation. The user uploads a file, I take that file as my input, and I infer a profile. I think that's the biggest challenge: inferring the profile from a set of packages — deciding which ones will be used to do the queries. Wouldn't it make sense to use Xapian to find something similar? But it is all Xapian — I use Xapian. Well, there are huge numbers of approaches to this; in this case it's Xapian, in my case it was the apriori algorithm. It's a matter of how you approach the problem. Yes, there are many strategies and approaches to building recommendations. You can, as he said, build association rules, but that's really heavy; as I told you, I couldn't do it — I tried apriori and other tree algorithms. But with search it's easy, it's really fast, and it's doable.
I also build indexes for Popcon — Xapian indexes — but I partition the index; I don't use the whole thing. It's all Xapian behind the scenes. What's exciting is that this work brings knowledge in around Xapian. I met Alain in Edinburgh when he was doing this apriori work; he pointed me at an apriori implementation and told me what apriori is — this thing where people who have this and this also tend to have that, with some probability. I was like, hey, that's interesting — and I used it on the Debtags tagging website to automatically calculate tips: if a package has tags A and B, then 90% of the time, chances are you'd want C as well. That worked really well, because the set of information is much smaller for tags than it is for packages co-installed on systems. So it's exciting to see new techniques being brought in. Do you still have your algorithm? Okay, let's talk about that.

Okay — Andreas, if you want more details about the implementation: as he said, the apt-xapian-index indexes each package as a document, with terms from its description, its tags, and several other sources — all terms related to one package. I can use that to do the search. I can also index Popcon, but there each document is a user — a submission — and the terms are the packages, or their tags; when I build the Popcon index I also index the tags of the packages. Then you search in that index. And you can combine sources from here and there, and that's how you get the different strategies.

Maybe we should go into high-level mode. This usually works like this: you have a set of data, you feed it into an algorithm, it is trained, so it builds a model. And then you come to the model with either a context or no context.
The context could be your Popcon file, or the three applications you use most — whatever your use case is — or something you want to search for, like we saw before. Then this engine, the model, kicks off, calculates something — a big black box — and spits out some recommendations. That's the simple version. You've already seen that with the demo Tássia did, because the ten recommendations were based on a model. I don't actually know what model. Yes — you mean the profile of the user? Not the profile of the user, but the engine, the model that calculated it. Yeah, it's different from your approach, where you really have a training phase and a classification model. For me, it's search. I build the index — that would be my model — and then you query the index with a search query, which is the user profile or the neighborhood, depending on which strategy you're using. So my model would be the index. Okay.

One major problem I had was what to use to teach my model — how to get a good model. If there are no questions, I'll just talk a bit about it, I guess; you can also see it in the gobby file. If you train your model with data that isn't relevant, it will of course produce really strange recommendations. So, for example, I filtered out the libs. Well, actually you did that — you limited everything to desktop programs, I guess. Actually, at first I only used whole programs — packages with the Debtags tag for whole programs — but after talking to Heiko I started also doing the profiling based on desktop applications, so only packages that have .desktop files. In my case I have many possible ways of profiling, and I can pick any of them. What I also did, and I think you're doing too, is taking the recently used packages as a reference.
So I usually use mutt and also Evolution, and those are actually used as the reference for several things. But I don't use Sylpheed, which may also be installed. (Actually I use it too, but anyway.) The point is to use what you really use, because anything merely installed on the computer can mean anything. But using only what was recently used — in my case, four years ago — the model turned out so sparse that there were quite few programs I could actually rate or give recommendations for. So for me it worked better as a secondary input: do a normal recommendation over everything, and then use recent usage on top to weight those programs as more important.

I'd like to raise the idea that maybe there are some apps that not a lot of people use, so they won't show up in the survey or anywhere, and we might not think about them and miss them. Did you think about that problem? Well, there are two things. One is that the packages with really low Popcon counts, used by really few people, can't go into the collaborative recommendation because of privacy issues with the Popcon data I'm using. So packages which are really rare I'm just discarding — they would effectively be discarded by the algorithm anyway, but I do it explicitly just to prevent problems, because this was really an issue in getting to use the Popcon data. But then, the terms that appear in the index in very few cases get a higher weight when you do the search; that's part of the search strategy. It depends on the weighting scheme you're using, but usually it works like this: if a package is really common and I have it in my profile, another user having that package doesn't link that user to me. But if I have one package that is really uncommon and another user has that same uncommon package, it makes a link.
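The point about rare packages making stronger links is what inverse document frequency (IDF) captures; Xapian's default weighting behaves in this spirit, though its exact formula differs. A toy illustration with invented systems:

```python
# A package almost everyone has says little about who is similar to
# whom; a rare package shared by two systems links them strongly.
# IDF encodes this: weight = log(total systems / systems having pkg).
import math

systems = [
    {"bash", "coreutils", "gnuplot"},
    {"bash", "coreutils", "gnuplot", "octave"},
    {"bash", "coreutils"},
    {"bash", "coreutils"},
]

def idf(pkg):
    n = sum(1 for s in systems if pkg in s)
    return math.log(len(systems) / n)

print(idf("bash"))     # 0.0 — everyone has it, carries no weight
print(idf("gnuplot"))  # positive — only half the systems have it
print(idf("octave"))   # higher still — only one system has it
```

In a similarity search, matches on high-IDF packages dominate the score, which is exactly the "uncommon package makes a link" behavior described above.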
So that's the reason why recommendation engines usually use several models for generating something based on actual installation behavior, if I can call it that. And also — that's actually a point on the list — if we just recommend the same things all the time, then new programs will never be noticed. So the thing is to get another component in, to at least put a new program somewhere in the top ten or so, so that people see it and also try it, instead of just listing the same ones. Yeah, this is usually treated as the new-item problem in recommendation: when you have an item that is new to the population, it will never be recommended, because it will be...

I don't want to stop this discussion, but I have a different topic. I'm quite interested in this because in the Blends we create metapackages, and I really want to know what else people have installed — that's why I'm specifically interested, to perhaps get a better design for the metapackages. But I'm wondering: if we create these metapackages, which just install a common set of packages on the machine, would that disturb your research somehow, because we force the user to have some packages installed? Well, that depends on your view, because — I don't know — a Debian Science task installs, let's say, SciComp or whatever, and R. So a user who doesn't use the task but installs R is then, because of this, more likely to be recommended that other package too. You have to ask whether that's a good or a bad thing. Okay. Yeah, I just wanted to say that my naive approach would certainly always suggest KDE games or GNOME games if you didn't have them but had games installed; a system that's built properly, and not just a prototype, would actually weight accordingly.
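One simple way to address the new-item problem just mentioned is to blend a flat novelty bonus into the ranking, so packages too new to appear in Popcon still surface occasionally. This is a sketch, not anything the actual services implement, and the bonus value is an arbitrary illustration, not a tuned number.

```python
# Re-rank collaborative scores with a novelty bonus: packages with no
# usage history yet get a small flat score instead of zero, so they
# can still reach the visible part of the list.

def rerank(scored, new_packages, bonus=0.2):
    """scored: {package: collaborative score}; new_packages: packages
    absent from the usage data, which would otherwise never appear."""
    merged = dict(scored)
    for pkg in new_packages:
        merged[pkg] = max(merged.get(pkg, 0.0), bonus)
    return sorted(merged, key=merged.get, reverse=True)

print(rerank({"mutt": 0.9, "irssi": 0.1}, ["brand-new-mua"]))
```

A production system would likely decay the bonus once real usage data accumulates; here it only shows the mechanism.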
Because if you compute association rules — people that have A and B also have C — we are basically auto-detecting the dependencies, because that's what they are, and you need to throw those away because they're obvious. Then you see what remains, and those are actual choices made by people. This is related to one of the approaches I use for profiling the user: if I'm running the system on my local machine, I know which packages were auto-installed and which were not, so I can remove all the auto-installed packages. If I don't have that information — if I run on a server and receive a Popcon submission as input — what I do is remove all the packages that are dependencies of other installed packages. It's not exactly the same thing, not as accurate, but it's something: if my input has 2,000 packages, it usually goes down to 400 or 500, and the results are really better when I do this. Libs too, usually. It also depends who the target user is: if we want to build a recommender system specifically for Blends, for science, we could calibrate it better than for a general-purpose one.

Actually, just a short comment on apriori: I figured out that everything with a 100% installation probability is junk, because it's a dependency — everybody who has, let's say, Xapian also has its library. So that was quite easy to figure out: if you take those away, at least with apriori, you don't have that problem. With other statistical approaches you still might.

Okay — just to say that I logged into your server — that is my server! — updated the Git tree and tried a recommendation, and it's working. I don't know why your local copy isn't. No, it's just Murphy. Yeah, you can try it on the server if you want; it's working there. So, this could be an interface.
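The "100% confidence means dependency" observation shows up clearly in a small sketch of association rules. Alain's system used the full apriori algorithm over larger itemsets; this toy version mines only single-antecedent rules, from invented data.

```python
# Mine pairwise rules "who has A also has B" with support and
# confidence, and drop rules at 100% confidence, since those are
# almost always just package dependencies rather than user choices.
from itertools import permutations
from collections import Counter

def mine_rules(systems, min_support=2, max_confidence=0.99):
    single = Counter()
    pair = Counter()
    for s in systems:
        for a in s:
            single[a] += 1
        for a, b in permutations(s, 2):
            pair[(a, b)] += 1
    rules = {}
    for (a, b), n in pair.items():
        confidence = n / single[a]
        if n >= min_support and confidence <= max_confidence:
            rules[(a, b)] = confidence
    return rules

systems = [
    {"xapian-core", "libxapian30", "mutt"},
    {"xapian-core", "libxapian30", "irssi"},
    {"mutt", "irssi"},
    {"mutt", "irssi", "vim"},
]
rules = mine_rules(systems)
# xapian-core -> libxapian30 has confidence 1.0 (it's a dependency),
# so it gets filtered out; mutt -> irssi (2/3) survives.
print(("xapian-core", "libxapian30") in rules)  # False
print(rules[("mutt", "irssi")])
```

With real data the pair counting is the expensive part, which is why apriori prunes candidate itemsets by support before counting larger ones.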
Yeah, I show the details of the recommendation: how the strategy was calibrated, which weighting scheme I'm using for Xapian. The list of packages — if it's more than five, I won't show them all, of course. The strategy, and that's it. For a real user interface, I should probably not show these details. At first we thought about showing the user all the possibilities, so you could choose the weighting scheme, the strategy, whether to cluster the input; you can also do hybrid strategies — use two strategies in parallel and show the results together, or use the results of one as input to the other. There are many ways of doing that too. But I think the end user doesn't care: I should really find which one I think is best, and then all this should be hidden. So, that's it. Thank you, Tiago. Yes — if I click here, I have all the details for the package, even the tags, and the image. These interfaces are also prototypes, but it's what we have for now. The code is available, and I really appreciate collaboration — anyone who is interested in getting into the code and helping with the interface, or programming the strategies, is welcome.

Okay, any other questions? Yes. I'm sorry, I missed the previous session, so maybe this was covered there, but I'm curious: I don't understand exactly what you mean when you say "we're just using search". I'm used to search starting from a string of words you might be searching for, rather than finding the thing that looks like the thing you've put in. So can you clarify what the input is when you're building the model as search? Okay.
And my second question: you were saying that collecting data about the user's most recently used applications is a more interesting way to get a cluster of terms to look for recommendations from, right? Because you might have a system that's used by five users, where the installed packages represent a union of all the packages they use. I'm wondering whether your systems were trained with per-user data or per-system data, because it seems like with per-user data you could get more interesting info. But I also don't know whether we have any sort of per-user surveillance turned on in Debian that we could actually take advantage of — not that I think we should have surveillance turned on.

Actually, what I call a user is a system. When you talk about recommender systems, you have users and items; the user here is the system, and the items are packages — or applications, if you use only the repository of applications. As for the workflow — I have a picture here, maybe down there, that shows the data flow. Basically, I use the installed packages as the profile. Here: the user, which is a system, gives me a profile, a set of packages. I treat it in different ways to really find the profile — as I said, I may discard lots of packages, depending on the strategy. Then I give it to the recommender. We have many different strategies: content-based, which uses as its only source the apt-xapian-index, and collaborative, which also uses Popcon — and the submissions can be clustered before going in, or used raw. Clustering is grouping things together. Okay. Then there are demographic profiles, which I'm actually not using for the survey because I'm not collecting that information.
But that would be good for Blends, for example. Demographic profiles are when the user tells you their fields of interest, their profession, and so on; then you can also filter the results with that. Item reputation is where I could use how many bugs a package has, how many RC bugs — the kind of info I can get from UDD — to refine the results further. I'm not using it yet, but it's a possibility. Then I output the recommendation.

About how the search works: I get this set of packages and I clean it down to a lot less, because the more data you have, the more garbage will probably end up in the result — you should really filter it down to what's important. Then, for the content-based strategy for example, I map this profile into tags or descriptions — terms, words — because in content-based recommendation, the recommendation is based on the content, so you need to know what the content is. You can represent a package by its tags or by its description, or both, and doing it differently gives you different strategies. Then I search the apt-xapian-index for those terms — the most relevant terms for this user — and I get back packages, because the packages are the documents in that index. So I get the documents that are relevant to me. That's the content-based part.

So what I used, and I think you're using too, is the Popcon data — per-machine usage data. Another source I used was Debtags, and I think you use that too. Otherwise, the apt-xapian-index basically provides that. We have a list in the gobby document of other sources we might want to use but don't have. For example, a social graph — or, like the demographics, what your typical tasks in a day are. You gave some good examples. What we wrote down, for example, is a social graph; you could base it on something like GnuPG. I was looking at Jonas — maybe Jonas, since we're talking about social data.
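The content-based flow just described — map the profile to its most relevant terms, then retrieve packages carrying those terms — can be sketched without Xapian. The tag data below is invented for illustration; the real system queries the apt-xapian-index, where each package is a document and its Debtags and description words are the terms.

```python
# Content-based sketch: the profile packages are mapped to their tags,
# the most frequent tags form the query, and candidate packages are
# scored by how many query tags they carry.
from collections import Counter

debtags = {  # invented tag data standing in for the apt-xapian-index
    "mutt":  {"mail::user-agent", "interface::text-mode"},
    "irssi": {"network::irc", "interface::text-mode"},
    "vim":   {"use::editing", "interface::text-mode"},
    "balsa": {"mail::user-agent", "interface::x11"},
}

def content_based(profile, top_terms=3):
    # Most relevant terms for this user = most frequent tags in profile.
    tag_counts = Counter(t for p in profile for t in debtags.get(p, ()))
    query = {t for t, _ in tag_counts.most_common(top_terms)}
    scores = {
        pkg: len(tags & query)
        for pkg, tags in debtags.items()
        if pkg not in profile          # never recommend what's installed
    }
    return sorted((p for p in scores if scores[p]),
                  key=scores.get, reverse=True)

print(content_based(["mutt", "irssi"]))
```

In the real index the scoring also weights terms (rare tags count for more, as discussed earlier), rather than just counting matches.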
We don't have that much time, but we were talking yesterday about using this kind of thing for FreedomBox, for example, where we could build a social network of people and exchange their package or application data. So instead of publishing your data to a central service like Popcon, you could share it with your friends, and then get recommendations based on what your friends have. Maybe that's more useful than the whole world, because your friends probably have similar tastes — and you say who they are, so you share your info only with those people. That would be more in line with the privacy issues FreedomBox wants to address. That would be something. But maybe Jonas...

What we also have, for example, is the architecture of the system — PowerPC, ARM, i386. So on ARM I shouldn't get a recommendation to install GRUB just because it's widely used; it doesn't fit my architecture. Another good piece of information would be, for example, the size of the hard disk drive — lightweight use or heavyweight, or whatever. That's just a comment; you covered it pretty well, so I have no more.

Okay, can I add one very, very quick question? I kind of got excited seeing that graph. You have a box called "content-based" that can take the list of packages installed by the user and the normal apt-xapian-index and produce a recommendation. Yes. Is that a command-line tool that can run on anybody's system today? I want it! Yes — it was just here. Okay, I think we should finish, because time is up.

One last problem we have is, of course, privacy. For example, with my system you can identify single users in several cases, and with easy filtering you can't undo that. So the data isn't actually distributable, and I can't just give a SOAP interface to anybody to query it. So maybe the next steps would be defining some API.
You seem to have quite a good system, so other recommenders could be built on it. Well, as Enrico said, you can do it with one command line. The core is not heavy, as I said; it's doable on a simple machine. But the biggest challenge for me, for example, is deciding what from the profile is important and what is not. And if we're going to combine strategies, how are we going to do that? The simple, pure strategies we can probably do on the command line, but we can probably get better results than a simple search in the apt-xapian-index. Okay. This is basically the point; it's in the gobby file, so look at it or ask us. And if we have a running system, we need people testing it and telling us something back. I would actually suggest setting up some new mailing list where we can communicate about it. And I think we should stop. Yes — thank you, everybody, for being here, and I hope I can release the survey during DebConf so that you can give feedback. That's it.