Okay, so good evening everybody. Actually, I have to apologize for the title, because only part of this talk will be about visualizing package clusters, as it was in the abstract. Since people have been asking me during this conference about how things are going in a research project I am involved with, I am going to have both parts. So basically it will be an update on what we have been doing in this research project called Mancoosi, and also an update on how we are using tools from our previous project, called EDOS, to do Debian quality assurance on a daily basis, and also for releases. So it will be three parts, basically: the first part will be a brief overview of some tools we are using which come from the EDOS project, which dates from some years ago; some will be about new tools; and some will be about presenting what is coming next.

So let's start with what was called the EDOS project. That is the word you might have seen in some tools like edos-debcheck or Debian Weather, and it is actually the name of another research project, which ran from 2004 to 2007. It was a European-funded project which involved various universities in Europe, among them Paris 7, which is the university I am currently working with. The project also involved various Linux distribution companies, like Mandriva and Caixa Mágica, which is a Portuguese distribution. The objective of the project was actually helping us, package maintainers, in creating a high-quality distribution. So the focus was on Debian maintainers, and in particular on how we produce and manage a distribution like we do every day, looking in particular at the formal part of packages, mainly inter-package dependencies, and at how a formal study of these kinds of things can help out in doing QA, quality assurance work. Debian was not officially involved, mainly because it is not a company, and when you go to the EU it is kind of hard to participate as a non-profit association. But one Debian developer, Ralf Treinen, was involved, and was actually working on contributing code back to Debian, and it was quite successful.

As with a lot of research projects, EDOS was split into various work packages according to subject. The part in which I have been mainly involved was actually the first one: formal management of software dependencies. Nowadays component-based software is becoming more and more a common concept, but we in Debian, and all Linux distributions, have been pioneering this concept for quite a while. And the main question of component-based software which was addressed in the EDOS project was this one: given a user selection of packages, is it possible to install them, when we consider the repository as closed? So basically you take a distribution like Debian, which is a set of packages, and you ask yourself: can I, in some configuration, whatever configuration, install each single package? Why are we interested in this? Well, because if you have packages in a distribution which are not installable, those packages are not really useful. But you still keep them around, you spend effort uploading them, you spend effort triaging their bugs and so on. To study this, in EDOS we formalized a simple mathematical model of a distribution. It is very simple and it is based on plain propositional logic, and the key point of the formalization is dependencies.
We all know this: it is a snippet of a control file, basically. You have a package, and you have its Depends line, sometimes written by us, sometimes written automatically by some tools we use. But still, in the end, the user receives a package with a specific dependency line like that one. The idea for formalizing this kind of relationship between packages is to interpret these lines as formulas, logic formulas, where you have the usual stuff from simple logic: conjunction and disjunction, basically. However, you cannot do that in a completely naive manner, because when we write the name of a package in a dependency line, we are not really naming a specific version of a package: we are naming whatever version of that package is in the archive. So if we just say libc6, it can be whatever version of libc6 is available, which usually, and hopefully, is just one, but it can be more than one. And we also have virtual packages: some packages in the archive depend on things like mail-transport-agent, but that is not a single package, it is a disjunction of packages. So there is a simple mapping between dependencies and formulas, but you have this kind of expansion, where you expand a single package name into a disjunction of various packages.

Given that, how do you model a repository? Well, a repository becomes a set of packages, where each package in the repository is associated with a formula, a logic formula like the one I have shown you before. In addition to that, globally, you define a set of conflicts: you identify, in a given repository, all pairs of packages which are not co-installable. Because we are used to declaring conflicts in only one direction, so a package declares a conflict towards another package, but the conflict is actually bidirectional: even if we only say in one direction that a package A conflicts with a package B, it is also true in the other direction.

Once you have done all that, the question of whether a package is installable or not is equivalent to a problem which is well known in computer science, called SAT, which stands for satisfiability, and which basically boils down to deciding whether a given formula is satisfiable or not. And how does the mapping work? Each package P we have in the archive becomes a single variable, which can be true or false, where true means that the package has to be installed and false means that it does not. Each dependency becomes an implication. In the first example we have seen, aterm depends on libc6 and on either libice6 or xlibs (yes, the example is quite old), and this gets translated into a logic implication, aterm → libc6 ∧ (libice6 ∨ xlibs): if you install aterm, then you also need to install libc6 and one of the two other packages. This way you define a mapping between dependencies and implications. Additionally, conflicts get translated to formulas which state that no two packages in conflict can be installed together. So on this slide you see ¬(A ∧ B), which means that you will not be able to have A and B installed at the same time while keeping the formula satisfiable, okay? A not-so-nice consequence of this mapping is that deciding whether a package is installable becomes an NP-complete problem, a kind of problem well known to be really hard, requiring in the worst case a complexity exponential in the size of the archive. And we have 20,000 packages in the archive.
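Before going on, let me make the mapping concrete with a minimal sketch; it assumes the third-party pycosat Python bindings, the toy repository and the conflict in it are invented for illustration, and it is of course not the actual EDOS implementation.

    # Toy SAT encoding of installability, assuming the pycosat bindings.
    # Variables: one boolean per package (true = installed).
    import pycosat

    ATERM, LIBC6, LIBICE6, XLIBS = 1, 2, 3, 4

    clauses = [
        [-ATERM, LIBC6],           # aterm -> libc6
        [-ATERM, LIBICE6, XLIBS],  # aterm -> libice6 | xlibs
        [-LIBICE6, -XLIBS],        # invented conflict: not (libice6 and xlibs)
        [ATERM],                   # the query: force aterm to be installed
    ]

    # Prints a model such as [1, 2, 3, -4] if aterm is installable,
    # or the string 'UNSAT' if there is no way to install it at all.
    print(pycosat.solve(clauses))

The positive literals in the returned model are exactly the packages to install, so a satisfying assignment doubles as an installation plan.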
Luckily, all instances we face in reality are quite easy to solve. If we consider the problem of installing a specific version of libc6 in a given package repository, like the one shown on the left, you get a big formula which is satisfiable only if a package manager has some way to install that specific package. This example is quite small: you just ask for one package and you already get a formula with a lot of literals, on the right. And as you can imagine, formulas can become quite big: for instance, if you try to install a KDE package and look at the corresponding formula, you get something like 32,000 literals. So it is a really, really big object, even if it is usually easy to solve.

Where does quality assurance come into all this? Well, you start looking at a repository, which is a set of packages, and you start thinking about good properties of that repository. What does a user see of a repository? He sees his own installation, a package status which is usually a subset of the packages coming from the repository. We say that this subset, this installation, is healthy, that is, in good shape, if basically all dependencies are satisfied; that is what we usually say in our terminology. And what does it mean for dependencies to be satisfied? It means two things. The first one is that all packages which are installed have their dependencies installed; we have called that abundance. And you also require that all packages are at peace, that is, there are no two installed packages which are in mutual conflict. Then we look at repositories, and we define the property of being trimmed. Being trimmed, for a repository, means that every single package in the archive is installable. That does not necessarily mean it is installable on your machine, because you may want to keep installed a package which conflicts with it, and then you will never be able to install the other one. But still, there should be at least one way to install that package; otherwise, the package is completely useless and it is pointless to distribute it in the distribution.

So, do you think we have that kind of package in our distribution? Well, actually we have quite a lot of them: it is quite common to find, in some of our distributions, packages which are not installable at all. And to help out, in the EDOS project we developed tools which help in finding these packages and possibly in avoiding releasing them. This is a sampling of some of the tools developed in that project. The most popular one is probably edos-debcheck, a command line tool you can use to check whether a given Packages file actually contains some non-installable packages; I will show some examples of it in a bit. Then there is pkglab, which is like an interactive console. It is not interactive in the sense of being graphical or anything like that, but it is a textual console with which you can play with packages. So you can create a situation the user may end up in, like having stable and testing at the same time, and in this artificial environment check whether packages are installable or not. Then there are other tools which have not been adopted as much. Ceve is an internal component which is used by pkglab and debcheck. And the last one is called tart, a tool to help split packages onto media. The idea is that it ensures some good properties: if you split Debian onto 10 different CDs, you have the guarantee that all packages on, say, the third CD can be installed without requiring packages from the following CDs. So basically you can install whatever package you want without having to do disc jockeying with your CDs; at worst you will need to insert all of them.
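To pin down the abundance and peace definitions from a moment ago, here is a naive checker over a toy repository; the data structures and package names are mine, purely for illustration, and real dependencies also carry version constraints, which I am ignoring here.

    # Naive "healthy installation" check: abundance + peace.
    # deps maps a package to a list of dependency clauses, each clause
    # being a set of acceptable alternatives (version constraints ignored).
    deps = {
        "aterm":   [{"libc6"}, {"libice6", "xlibs"}],
        "libc6":   [],
        "libice6": [],
        "xlibs":   [],
    }
    conflicts = {("libice6", "xlibs")}  # conflict pairs, bidirectional

    def abundant(installed):
        # every installed package has each of its clauses satisfied
        return all(clause & installed
                   for p in installed for clause in deps[p])

    def peaceful(installed):
        # no two installed packages are in mutual conflict
        return not any(a in installed and b in installed
                       for a, b in conflicts)

    def healthy(installed):
        return abundant(installed) and peaceful(installed)

    print(healthy({"aterm", "libc6", "libice6"}))  # True
    print(healthy({"aterm", "libc6"}))             # False: alternative unmet

Trimmed-ness is then just: for every package in the repository there exists at least one healthy installation containing it, which is exactly the SAT question from before.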
So, about edos-debcheck. It is a command line tool; you can get it by installing the edos-debcheck package, which has been in the archive for a couple of years now. Basically it consumes the same Packages files used by APT, binary package lists, and checks by default whether every single package you feed to the tool is installable or not. If a package is not installable, the tool usually tries to provide you with an explanation, which is not shown here because I did not pass the right option. Anyhow, the point is that it is a really fast tool: for example, checking the whole of testing some weeks ago on amd64 took something like five seconds for all packages in the archive.

We are using it quite a lot in Debian, actually, for quality assurance. The first use is a daily monitor that you can find on the web at edos.debian.net. Every day it checks our testing and unstable distributions to see whether they contain uninstallable packages or not, and there are reports which we use in the QA team to check why those packages are not installable. Sometimes there are transient reasons, like a buildd not catching up; sometimes there are coordination reasons, like an arch:all package which is not in sync with the corresponding arch:any packages; and sometimes there is a really serious packaging bug, and it should obviously be fixed.

Another interesting use of edos-debcheck has been made by Neil Williams in Emdebian. Basically, what they do in Emdebian is that before uploading a package to the Emdebian archive, even to the unstable archive, they check whether the upload of the package will lead to some uninstallable package in the archive. If it does, they refrain from uploading, and first try to understand what they need to fix before going ahead with the upload. That is kind of interesting, but it is not directly applicable to Debian itself, because from the moment we upload a package to the archive, it is not immediate that the package shows up in the archive. The obvious example is buildds: you upload for one architecture, but before the package gets rebuilt on the other architectures it can take a while, so a check done at upload time is not necessarily significant. Another reason is additional delays like the NEW queue: I can do the check right now on my laptop to see whether a new package is going to break something, but then the package takes two weeks to enter the archive, and so my test has actually been pointless. So what we are thinking about is whether it would be interesting to add some advisory hook to dput, only as a warning, you know? You try uploading a package and you see that it is going to screw up something; either it is expected, or you need to look into it.

Another application is edos-builddebcheck, which basically plays the same game as edos-debcheck, but on build dependencies. So you additionally provide a Sources file, and you check whether all build dependencies of packages in the archive are actually satisfiable. If they are not, it means there is a package in the archive which is not rebuildable, and that is something to be fixed as well.
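For the record, a typical edos-debcheck invocation looks something like the following; I am quoting the options from memory, so check the man page for the exact spelling:

    edos-debcheck -failures -explain < Packages

It reads a Packages file on standard input, lists the packages which are not installable, and tries to print an explanation for each of them.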
An interesting hack was actually implemented just this morning by nomeata. The idea is to make wanna-build use this kind of tool to check, before trying to build a package, whether its build dependencies are satisfiable or not. If they are not, it is actually pointless to even try the build, and you can automatically set the package to dep-wait.

Finally, another interesting use of this kind of technology has been in detecting conflicts. Basically, we want to avoid the kind of message we see quite often in unstable, which is: package blah is trying to overwrite a file which belongs to another package. What we do to detect this is basically to check whether two packages which can be co-installed actually share a file or not. So we scan the Contents file, and we check whether there is a way to install two packages which would share a file. Sometimes there are false positives, for example when the maintainer scripts use dpkg-divert or that kind of thing, so we check candidates manually before actually filing anything. And we have filed quite a lot of bugs about this; we do it routinely, and it helps squashing some bugs in Debian.

And of course you might have seen Debian Weather, which shows the forecast of how many packages are not installable in a given architecture. This is actually just for fun: it basically shows the number of uninstallable packages in a given architecture in the form of a weather report, so really bad weather means that a lot of packages are not installable on that architecture. For example, alpha is quite rainy.

Pkglab is another tool, a console-based tool where you can play with packages: you can, for instance, load current package lists and past package lists and play with them. You can run package installability checks on the installation you are building, and you have a kind of functional query language to check what is going on. So there is actually nothing more in it than the debcheck check, but you have an interactive environment where you can create your own distribution. I am sorry for the small font, but on the left there is basically an interactive session where you check unstable against itself, that is, you check whether all packages in a given distribution are installable, the same thing as edos-debcheck. But you can do the same as of, say, two months ago, if you have injected the historical information into the tool, which can keep a database of the evolution of a given suite. And you can also do co-installability tests: for instance, you can check whether PHP4 and PHP5 are co-installable in a given suite, and things like that.
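By the way, to give an idea of how the file-overlap detection I mentioned can be implemented, here is a rough sketch; the Contents parsing is simplified, and co_installable is a stub standing in for the SAT-based co-installability test from before.

    # Rough sketch of file-conflict detection from a Contents file.
    from collections import defaultdict

    def candidate_file_conflicts(contents_lines, co_installable):
        owners = defaultdict(set)
        for line in contents_lines:
            # each line is roughly "path/to/file   section/pkg1,section/pkg2"
            path, pkgs = line.rsplit(None, 1)
            for entry in pkgs.split(","):
                owners[path].add(entry.rsplit("/", 1)[-1])
        for path, pkgs in owners.items():
            pkgs = sorted(pkgs)
            for i, a in enumerate(pkgs):
                for b in pkgs[i + 1:]:
                    if co_installable(a, b):
                        # still only a candidate: dpkg-divert and friends
                        # produce false positives, so check by hand
                        yield a, b, path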
So this was the work done in the EDOS project, actually before I joined it, and it is work for the distribution side, okay? Tools to help out in doing quality assurance. Out of that started another project which involves more or less the same people, plus some other Linux distributions, called the Mancoosi project. Basically, Mancoosi shifts the attention from the side of those who make the distribution to the side of the users, and in particular the users we are concerned with are system administrators. So what we do, basically, is try to address a couple of problems that can show up when you do upgrades in a general sense, that is, when you use your package manager to change something on your machine. As before, Debian is not officially involved, for the same kind of bureaucratic problem, but there are now two DDs working on this project, me and Ralf Treinen, and this time too we try to contribute code back to Debian.

The focus of the project is what we call the upgrade problem, where problem is used in a generic sense: a problem is the kind of issue the package manager needs to solve when you ask it to resolve some dependency, to install a package, to remove a package, or anything else you can ask your package manager. Within this scenario, Mancoosi is working on two different parts. The first part is rollback support: we are trying to provide technology which will help in going back to the state your machine was in before you attempted an upgrade. This is being done at two levels. At a technological level, we are trying to integrate into package managers things like snapshotting techniques and that kind of stuff. At a more formal level, we are trying to develop a saner language for maintainer scripts than shell script, where you have instructions with a semantics that allows you to go back to the state from before.

But the part I am mostly involved with is, again, dependency solving. We are trying to improve the current state of dependency solving in package managers. For instance, I am not sure all of us are aware that apt-get, the package manager we all use, is not always able to solve dependencies even if a solution exists. There are a lot of cases in which there is some way to install a package, but apt-get is not able to find that solution and propose it to you. So we are trying to solve this kind of expressivity problem.

While doing that, we actually started studying an interesting object, which is the dependency graph of Debian. Imagine putting on a graph all the binary packages of Debian, which in unstable is about 25,000 packages, and drawing an edge between two packages each time one package has a dependency on the other. The graph obtained that way is quite huge: it has on the order of 20,000 nodes and about 400,000 edges. It is something really hard to grasp, to print, to manipulate with whatever tool we tried. Just for fun, this is the actual growth of this graph over the years: on the x axis we have the releases of Debian, starting from 0.93, and then we go on. Basically the growth of this graph was exponential in the early years, and now it is stabilizing, luckily for us, because we were not able to get exponential manpower to work on our distribution. So we are trying to give a meaning to the edges we have in this graph. Basically, we are asking ourselves whether the dependency between two packages is meaningful, in the sense of: what will happen when I change a package in the archive? In particular, we are asking the question: can I touch a package P which is in the archive without affecting another package Q which may depend on it?
This kind of question you cannot really answer on the basis of direct dependencies alone, because a package can depend on another package and yet the dependency can be meaningless, in the sense that there is an alternative dependency on something else, and users may always be using the alternative. Or you can have virtual packages: a package can depend on mail-transport-agent, but that does not tell you much about whether the package will have a problem with respect to sendmail, which is just one of the providers of that virtual package. So we introduced another notion, called strong dependency: we say that a package P strongly depends on a package Q if there is no way to install P without installing Q. These two packages are really intimately related: you cannot touch one without affecting the other.

What is the relationship between this notion and the usual notion of dependency, the one we declare in our control files? Well, sometimes they are related, sometimes they are not. Consider the first example: here we have a package P which depends on Q and R, and the fact that nothing special is written there means it is an AND dependency; and here we have a package A which depends on either B or C. The strong dependencies entailed by the alternative are none: looking only at that, you have no guarantee that B is installed whenever A is installed, or that C is installed whenever A is installed; any time you install A, you really do not know whether you get B or C. On the contrary, with the AND dependencies you have the guarantee that Q and R are always installed when P is. So the idea is that touching Q, like releasing a new version, or upgrading it on your machine, will for sure have an impact on P, because Q is always installed when P is.

These are the simple cases, but other cases get more complicated. For example, consider a case in which you have an alternative dependency of P on Q or R, which by itself gives you no information on whether Q or R is installed, but for some reason you also have a conflict between R and P, okay? And note that this conflict can come from far, far away: it can come from the fact that R depends on a lot of packages, and in the end one of them is in conflict with P. In this kind of scenario, each time you install P, Q is also forcibly installed, okay? So there is a strong correlation between the two which is not visible in the declared dependencies alone.

Don't be scared. I am just showing here that each time you have a package with a lot of reverse dependencies in the usual sense, that package usually also has a lot of reverse strong dependencies. So you have a lot of packages strongly depending on such packages, but there are exceptions, okay? Why is this interesting? Well, we tried to define a notion of how delicate a package is in a given distribution, and we defined it as the number of reverse strong dependencies of the package, that is, the number of packages which are forcibly installed when you install that one. On top of this, we define a notion of sensitivity. So, the question: what do you think is the most dangerous package in lenny, the package that will for sure affect the most packages if you touch it? libc6? Well, not really. There are quite a lot of other packages which will for sure have an impact on an upgrade, and libc6 is only in 13th position. The first package, the one which by this definition affects the largest number of packages in the archive, is a package called gcc-4.3-base, a package I discovered while doing this work; I did not even know it existed. And for some reason you can never install libc6 without installing that package, okay?
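In SAT terms, by the way, the strong dependency test is just an unsatisfiability check, and the strong conflict test I will come back to later is its dual; here is a sketch, reusing the hypothetical pycosat encoding from before, with repo_clauses standing for the translated archive:

    import pycosat

    # P strongly depends on Q  iff  "P installed, Q not installed" is
    # impossible against the repository formula. (The real definition
    # also requires that P be installable at all.)
    def strongly_depends(repo_clauses, p, q):
        return pycosat.solve(repo_clauses + [[p], [-q]]) == "UNSAT"

    # Dual test: P and Q are in strong conflict iff there is no way
    # whatsoever to have them installed together.
    def strong_conflict(repo_clauses, p, q):
        return pycosat.solve(repo_clauses + [[p], [q]]) == "UNSAT"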
So we are not sure yet that this notion is sensible for defining how delicate a package is, but it still has some interesting properties, okay? Still, this list does not give you any information about how these packages are related. Some packages in the list are related: for example, I can tell you that gcc-4.3-base is up there because it is a strong dependency of libc6. But there are packages which are totally unrelated; for instance, dpkg and perl-base are not necessarily related, okay? So we tried to see whether we could have a visual representation of the relationships between the packages which are really delicate for our archive, and we came up with a notion of dominance. We say that a package P dominates a package Q if the importance of Q is, so to speak, due to the dependency of P on it. Consider the example I gave you before, libc6 and gcc-4.3-base: why is gcc-4.3-base important? Really few packages depend on it directly, but there is a direct dependency from libc6 to it. So libc6, in some sense, dominates gcc-4.3-base, and that explains the importance of gcc-4.3-base.

Apparently this gives us some fun graphs, which I am going to show you in some more detail, and which show our distributions, our releases, in a graphical way that highlights some important clusters of our packages, like KDE, GNOME and so on. All the graphs I am going to show are available at this URL, where we also tend to have some JavaScript widgets to zoom into the SVG files. And I am trying to check whether Inkscape is able to show them... no, sorry.

Okay, so let's start from one of our old distributions, 0.93. Back then our distribution was quite simple. This is not all the packages in the distribution, but the clusters of important packages we had in Debian 0.93, which I think is 15 years old now. We used to have some dialog-related stuff, a small cluster of TeX stuff, and another cluster, still of TeX stuff, making up a good part of that distribution. Then let's go to something I remember a bit better, which is 2.1. Still nothing very big to see. This is Debian 3.0, where we start seeing some more interesting clusters. For instance, here we start to have a big GNOME cluster, with gnome, gnome-support, gnome-bin, gnorba and so on. Here we had, back then, phpgroupware, which I have no idea whether it still exists or not, and also still some GNOME. And then we come to something we know a bit better, loading... This is a visual representation of Debian 4.0, and we start seeing some well-known clusters of packages. SVG is a bit painful; there is a zoom tool here... no, that was the wrong one. Okay, I will just do it by hand. So this big cluster, for instance, is related to KDE, and this kind of stuff. We are getting bigger, we are getting more complex, but the point is that you can see these clusters change from distro to distro. For instance, we observed that there was a very silly dependency in past releases of KDE, where you were forced to install all the KDE games if you wanted some KDE amusement, and this kind of forced dependency tends to disappear from release to release, once the maintainer notices that it was too strict a dependency.
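To pin down the dominance idea from a minute ago: roughly, and glossing over the details of the real definition, P dominates Q when everything that strongly depends on Q also strongly depends on P, so Q's apparent importance is inherited from P. As a set computation, something like:

    # impact(x): the set of reverse strong dependencies of x.
    # Approximate dominance: q's impact is contained in p's (plus p itself).
    def dominates(impact, p, q):
        return impact[q] <= impact[p] | {p}

    # e.g. dominates(impact, "libc6", "gcc-4.3-base") should hold,
    # explaining away gcc-4.3-base's huge impact.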
So if you want to, name your favorite package and you can try to look it up in this kind of representation. It is still big, but it is a rather more reasonable representation than the full dependency graph, which is basically impossible to visualize.

Okay, so this is the kind of stuff we have been playing with, and from here we are going a bit further. We are doing the same kind of thing with conflicts: we run tests where we check when two packages are really not co-installable at all. You take two packages from the archive and you check whether there is no way whatsoever to co-install them. Doing this kind of experiment we found out, for instance, that there were about 1,600 packages which were not co-installable with ppmtofb. Looking a bit closer, you find the reason: basically ppmtofb was declaring a conflict with Python 2.4, and basically all the Python-related packages we have in the archive require a more recent version of Python. As a result, that package was effectively cut off. This gives you an idea that strong conflicts can be quite useful in discovering packages which have been neglected, where nobody noticed that they can no longer be installed in any reasonable installation. And we got rid of a couple of packages like that.

Finally, what we are going to do next in Mancoosi is basically trying to improve our dependency solving abilities in Debian. The first step is making all our package managers complete, that is, guaranteeing that each time there is a way to satisfy a user request, the package manager you use is able to find it. This is the first step. And we are trying to do that also by collaborating with package manager developers to factor out the dependency solving code, because we have a lot of dependency solvers in Debian: one in APT, one in aptitude, which does not directly use the one in APT, now we are going to have cupt, and all this kind of stuff. We are trying to create a common code base, to avoid getting unexpected results, for example when you run a buildd, which uses yet another dependency solver.

Another thing we are working on is providing a way for users to specify really precise optimization criteria when they want to install or upgrade something. For instance, we want users to be able to minimize the download size of packages when there are different possibilities to satisfy a request, or the same thing with used disk space: you would have a package manager which is always able to choose the solution that minimizes the installation size, which is quite interesting, for instance, in embedded scenarios. Or even things like: I do not want to install any package maintained by this guy, because I do not trust him, okay? We are working on this kind of preference language. And of course we want dependency solvers to be as fast as possible; I believe you have all stumbled upon aptitude starting to loop, trying to find a solution and not managing to. To push on that, we are planning to organize a solver competition, in which users will be able to submit their dependency solving problems, in the style of popcon: you can configure your package manager to dump the problem it tried to solve when you gave it a request, and we are going to collect those dumps to run the competition on them. To that end, we have already developed a format to exchange upgrade problems, a description of upgrade problems which is distribution independent and works both in the RPM world and in the Debian world.
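To give the flavour of that format: an upgrade problem is a plain-text document with package stanzas plus a request stanza, something like the following. This is reconstructed from memory of the spec (what became CUDF), so take the exact field names with a grain of salt.

    package: car
    version: 1
    depends: engine, wheel
    installed: true

    package: engine
    version: 2

    request:
    install: engine
    upgrade: car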
It has currently been implemented in cupt, so with cupt you can already dump this kind of dependency scenario. Cupt is basically an attempt to reimplement the APT stack while staying compatible with all the options of APT, so it should become a drop-in replacement for APT with, hopefully, a cleaner code base. And that's it. Do you have any questions?

Just a reminder to everybody to stand up, so we have a clear view of the front video camera, and I am about to hand you a mic.

Hi. As an intern I worked in 2006 on the Debmarshal project, which was facing a lot of the same issues with dependencies, and we did a simpler thing than what I think you did. When you are tracking etch or, you know, sarge, that sort of thing, we just pulled the packages as they were released, and then we tested to see whether they were good, and that let us collapse the version numbers down to what was actually in that particular distribution. It looks like you are attacking the whole problem of all the packages available, and then you find a set that would be consistent.

What do you mean by testing whether the packages were good or not? We are doing that just for the installability problem: we basically check whether they are installable or not in a given distribution.

Well, see, what we were looking at was: you know how on some days unstable is a perfectly good distribution, and you can install it and everything works great, and then there are really bad days where you just do not want to install or upgrade. Our mirror process would immediately process the new incoming release and go: I can't resolve some of the things in this. So while we would still pull it and still have it, we would not move a symlink on our file system up to it, and none of our systems would upgrade to it until the next day, when whatever package problem there was had been resolved.

So basically it looks like your target is different from ours, because we, well, not we, this project predates my involvement, they were trying to look at the final product, the stable releases. From that point of view, it is pointless to look at what happens daily, because users are not supposed to use unstable, okay? The idea is to look at the final product you want to release, check whether it is good or not, and if it is not, have the power to fix it, because with this kind of tool you can pinpoint a bad package, put your hands on it and look at what is wrong with it.

So are you intending this for targeting unstable, or for targeting stable releases?

Well, you can use it on whatever you want, but the point is not necessarily preventing bad things from happening, but rather finding them. You can integrate it into your package manager, and it can inform you that something bad is going on today in this distribution, so you can avoid it, but this kind of integration was not necessarily the focus.

Okay, on your right.

I just wanted to ask, for the solving issue, whether you intend to use some existing projects; for example, SUSE has the sat-solver tool, which does these things quite well. Do you intend to use this existing tool somehow?
So, actually, the SUSE solver descends from work which was done in EDOS, so it is basically the same technology. EDOS pioneered the idea of mapping the problem to SAT, and then various people started developing their own SAT solvers, and one of those is the SUSE one. So there is currently no shared technology, but once you have decided that this is the way to go, anybody can have their own implementation of the solving part. As for the competition, we actually really hope to have different implementations, because generic SAT solving is not enough: to get a really efficient solution you need to look into the precise formulas you get, and different people can use different heuristics to try to solve the problem as fast as possible.

Okay, thanks. Thank you very much, too.