So welcome to the talk on botch, which is how I ended up naming the software that is able to bootstrap Debian-based operating systems from scratch just by doing dependency analysis. Botch is short for bootstrap/build ordering toolchain. It started as a Debian Google Summer of Code project in 2012 and was continued as my master thesis at Jacobs University Bremen. At the end I give some links to where you can download it if you are interested. My mentors during the Google Summer of Code, as well as in the time afterwards, were Wookey and Pietro Abate, where Wookey provided me with the practical side of things, how cross-building and bootstrapping is actually done, and Pietro with all the theoretical, academic material that is needed to go through all of that.

So how does it work in Debian? We all know that in the common case, source packages are always natively compiled, and they are compiled with the knowledge that they have access to the full archive of binary packages. That is not the case during bootstrapping, where some core must be cross-compiled and only a few binary packages are available initially. This means you run into dependency cycles: you need a binary package built by a source package A, which can't be built because it needs some package from a source package B, which can't be built because it needs something from A. In fact, you don't run into dependency cycles so much as you run into strongly connected components: the part of a graph where all vertices belonging to the strongly connected component, or SCC, are in a cycle with each other. Those things can be pretty huge, like this one, which is how it looks right now, and every time Debian is to be ported, somebody has to solve that by hand. It's called a hairball, which makes sense because it looks pretty bad, and this one has about a thousand nodes and a few tens of thousands of edges.

Another thing we can harvest from snapshot.debian.org is this graph. The y-axis shows the number of vertices in the biggest strongly connected component, which I showed before, and the x-axis shows the time in years. We can see that the problem size, the size of the biggest SCC, grows over time. Software becomes more and more complex; it depends on more and more things, and that produces more and more cycles in the base packages.

Yes, what is contained in it? Well, the structure of the dependency graph for Debian, or Ubuntu for that matter, and so for most Debian-based distributions, has throughout all this time been: one big SCC of that size, and several smaller ones, like 10 or 12, that don't go above a size of six or eight. The biggest one contains all the interesting stuff, like browsers, email clients, GTK, GNOME, everything, because some package just depends on some GNOME thing, but to build that you need the rest of GNOME, so you end up with all of it. The full list is on one of the links I show in later slides, so you can have a look and see what's inside: there's Firefox and Thunderbird and half of KDE and all those kinds of funny things. Another funny thing is that once this big thing is solved, the rest is practically a linear build order, so once that is done, the rest is simple, apart from small cases like some Haskell stuff and so on.
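To make the strongly connected component idea concrete, here is a minimal sketch, assuming Python with networkx and made-up package names; botch itself is written in OCaml and works on real Packages/Sources metadata.

```python
# Illustration only: finding "hairballs" (non-trivial SCCs) in a build
# dependency graph with networkx. Package names are made up.
import networkx as nx

g = nx.DiGraph()
# An edge src -> dep means "src needs dep before it can be built".
g.add_edges_from([
    ("src:a", "src:b"),  # a needs b ...
    ("src:b", "src:a"),  # ... and b needs a: a two-cycle
    ("src:c", "src:a"),  # c depends on the cycle but is not inside it
])

# Every SCC with more than one vertex must be broken by hand before a
# linear build order can exist (self-loops would also count).
for scc in nx.strongly_connected_components(g):
    if len(scc) > 1:
        print("hairball:", sorted(scc))
```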
Yes? It looks like it's going down again at the end. Is that real?

I don't know how to interpret that, but that's the data. So yes, it is real. That's Debian Sid, so apparently it's going down for some reason or other; I didn't invest time into looking at why certain jumps happen, but generally this seems to be an upward trend, which also makes sense given the complexity argument.

Right, so the current bootstrapping practice is that people use Gentoo or OpenEmbedded to avoid the cross-compilation, because that's not working very well right now, and build a minimal system to compile on; then they do the dependency analysis manually. I heard from people that they were doing that kind of thing on paper, drawing the graphs. After finding those cycles, or finding stuff that doesn't build or that misses things, they manually hack source packages so that they build with fewer build dependencies than they would normally need, and thus break dependency cycles. That takes loads of time, and the goal of the Google Summer of Code last year was to have that automated, to avoid it being done repeatedly every time a port is done.

So what could we have if bootstrapping was easier? The most obvious thing is that porting to upcoming architectures would of course be easier. You could also see more custom ports optimized for a specific CPU: the Gentoo argument, that Debian is not as optimized for your CPU as it could be, would go away, because it would be easier to have Debian optimized for a certain target. It would remove the need for Gentoo or OpenEmbedded and make Debian more universal, because it would be able to bootstrap itself without needing anything else. But there is more. You can use it to update lagging architectures by creating the build order for them, or to build for targets that can't build themselves, once cross-building works better. You could have a QA tool which checks the archive regularly for bootstrappability. And you can also use it to order library transitions which include cycles, as for Haskell or OCaml; that's currently done using ben, but ben can't handle cycles, so for example the ben transition output for Haskell is all garbled because it includes cycles. Ah, right, that's great.

Right, so the essence of this talk is that the core algorithms necessary for the graph analysis exist, they are fast, and they seem to be correct as far as we can tell. What's missing are the decisions about the new dependency syntax, the multiarch and cross-building fixes, the practical plumbing, and trying it all out in practice, which is part of what is being done in a Google Summer of Code project by Alkman this year.

So the tools which I ended up writing are mainly written in OCaml, and there are a couple of Python and shell scripts to put them all together. It's all LGPL3. It uses dose3 as a helper library for the parser and solver, and it's a real solver: it will find a solution if one exists, unlike apt. The tools follow the UNIX philosophy, so there are multiple applications, each executing one algorithm, and they're all connected by pipes, using the Debian package description format, the deb822 format, as the exchange format. The graphs that are generated are in GraphML, so you can easily write a tool that consumes them and does other things with them. And it's all in git at the URL which you see there.
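A minimal sketch of that pipe idea, assuming Python with the python-debian library rather than botch's actual OCaml tools: each stage reads deb822 stanzas on stdin and writes stanzas on stdout, so stages compose with ordinary shell pipes.

```python
# Illustration only: a tiny "filter" stage in deb822-in, deb822-out
# style. Real botch tools do much more per stage.
import sys
from debian import deb822

for stanza in deb822.Deb822.iter_paragraphs(sys.stdin):
    # Keep only stanzas that declare build dependencies; everything
    # else is dropped from the stream.
    if "Build-Depends" in stanza:
        sys.stdout.write(stanza.dump() + "\n")
```

Hypothetical usage, chained with other stages: `./filter.py < Sources | ./next-stage.py`.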
So more specifically, you can now create a dependency graph, you can analyze it using several different methods, and you use these methods to find source packages to modify to make Debian bootstrappable. And after you've done that, and you've modified enough source packages, it allows you to create the build order from the then acyclic graph.

An important thing to mention is that it's all theory: at no point am I compiling or installing packages; it only works on the metadata. It only uses Packages and Sources files as input and assumes that if the dependencies are satisfied, then the package can be built. That of course hides loads of things that can go wrong in practice, especially when a port to a new architecture is done. But, well, that's what's been done so far.

So what I assume in practice to make it all work is that Debian has some sort of reduced build dependencies, or build profiles as we call them now, and that cross-compilation works for at least the base packages. The bootstrap workflow would be to first select binaries for a minimal build system and cross-compile them. That is of course its own task of breaking dependency cycles, but it turned out, at least for Ubuntu, as far as Wookey said, to be rather easy and not to involve much dependency analysis: if you cross-build a small base system, it's not necessary to go into full graph analysis mode. Then you have your base system which you can use to start compilation. So in the rest of the talk I just assume that we are doing native compilation, starting half magically from a cross-built minimal build system including just the essential packages and build-essential.

From that you create a graph. (Ah, I need to change that on the slide; "build graph" is a special term which I wanted to avoid.) So you create a graph and extract the strongly connected components, and you analyze them using some heuristics to find source packages to add build profiles to. You modify them and go back to step two until the graph is cycle-free. Then the algorithm selects the source packages to be built with a profile, which makes the graph acyclic, and gives you a build order. So far so good.

The hard part is to break those dependency cycles, and that can be done in multiple ways. The most obvious, of course, is to use build profiles to build source packages with fewer build dependencies than they would otherwise use. Another very helpful thing is Build-Depends-Indep: if your source package depends on stuff which could go into Build-Depends-Indep, that helps with bootstrapping, because during bootstrapping you of course don't build the architecture:all packages, so you can ignore the Build-Depends-Indep list. Another method is to choose different installation sets for non-strong dependencies: dependencies have disjunctions, so you can satisfy them using different sets, and sometimes one set is hard to bootstrap while another is easy, so it just means choosing another set. Another one is to make binary packages available through cross-compilation: once you're stuck and the other options don't work, cross-compilation might come to the rescue and give you some binary package which solves your cycle. You can maybe use existing Multi-Arch: foreign packages: maybe your new architecture allows you to run packages from an architecture that already exists, which can help you satisfy dependencies and break cycles that way. Or you can split source packages, such that they are split into the part which others depend upon and another part which depends on other things, which would again break cycles. A sketch of the overall loop follows.
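A minimal sketch of that loop, assuming a networkx digraph of build dependencies and a human-curated list of droppable build-dependency edges (the ones a build profile could remove); this is my own illustration, not botch's interface.

```python
# Illustration only: drop the human-approved edges, check that the
# result is acyclic, then read off a build order.
import networkx as nx

def build_order(g, droppable):
    g = g.copy()
    for u, v in droppable:           # step: apply the build profiles
        if g.has_edge(u, v):
            g.remove_edge(u, v)
    if not nx.is_directed_acyclic_graph(g):
        # back to the heuristics: more droppable edges are needed
        raise ValueError("still cyclic, find more droppable edges")
    # Edges point from a source to what it needs, so dependencies must
    # come first: reverse the topological order.
    return list(reversed(list(nx.topological_sort(g))))
```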
Yes? Multi-arch foreign packages?

Yes, I'm not sure whether that would work or whether it would be useful at all; that's just from a theoretical point of view, so you'd have to tell me.

I'm a bit confused what this applies to: the initial cross build, or?

No, for native. As I said, this is all native stuff. I thought that if a new platform technically, by CPU, allows running an existing architecture, then those packages could be used to satisfy dependencies and break build dependency cycles. So that might work; it might be less useful in practice, but at least it's something that might be considered.

I developed several heuristics to find those source packages to modify. Heuristics are needed and cannot be replaced by an automatism, because this work can only be done by humans: only humans are able to code and to analyze software, machines can't yet, so this is all heuristical work. It mostly uses the syntax of the dependency graph, so it's mostly ignoring the semantics of the dependencies, but it turns out that this already works surprisingly well, just taking the structure of the graph and not the meaning of the packages. I developed several kinds of heuristics: simple ones, component-based ones, cycle-based ones, and in the end a feedback arc set algorithm, and I'll shortly introduce them now.

This is the output of botch; it can be looked at under this URL. I generated the last one just today, so this website is the output of botch showing the results of all heuristics, all kinds of dependency cycles, and so on and so forth, in a huge HTML page which requires JavaScript to be pretty, with paged tables and stuff like that; without JavaScript you will also be able to see it, it will just be bigger.

Sure, so this table for example is a table of the edges with the most cycles through them. I will explain that a bit later, but you see an edge, defined as a source package build-depending on a binary package, and apparently there are 595 cycles through that one, which means that if you were able to build libgcrypt11 without that build dependency, already around 600 cycles would be broken. That's the idea.

So, simple heuristics. There are ratio-based heuristics; for example, you say: if I could build src:evolution without libmx, then I would easily lose the connection to 55 other source packages, or if I could build src:tracker without dia, then I would lose the connection to those 22. Another one just gives you the number of missing dependencies, so if you look at the graph and you see source packages which only have one build dependency missing, you might just say, well, that's an easy one, drop that one and build it. Another thing is what we call weak dependencies, which are a set of, well, user-picked dependencies which are commonly used for documentation generation; that is of course also only a heuristic and not always true.

Another thing is strong bridges. If we take this graph, a strong bridge would for example be this red edge: if we remove it, it splits the graph into more strongly connected components, which would look like this. It's all done automatically for you, so you don't need to look at this graph and search for the red edge, because you get it in those HTML pages I showed before. Or strong articulation points, which is not an edge but a vertex: once you remove this five, the graph is split into more strongly connected components, which are then easier to analyze.
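A minimal sketch of the strong bridge idea; there are proper linear-time algorithms for this, so the quadratic brute force below, written against networkx, only illustrates the definition used here: an edge whose removal makes the big SCC fall apart.

```python
# Illustration only: try every edge and see whether removing it shrinks
# the biggest strongly connected component.
import networkx as nx

def strong_bridges(g):
    base = max(len(c) for c in nx.strongly_connected_components(g))
    found = []
    for u, v in list(g.edges()):
        g.remove_edge(u, v)
        biggest = max(len(c) for c in nx.strongly_connected_components(g))
        if biggest < base:           # the hairball got smaller
            found.append((u, v))
        g.add_edge(u, v)
    return found
```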
You can also identify small cycles, which is particularly useful because self-cycles have only one way to be split. For example, here we see that udev build-depends on usbutils, and of course this two-cycle here can only be broken by removing this build dependency, because you can't break the binary dependency of usbutils on something that src:udev builds. And as a matter of fact, for this graph, removing that one edge even solves the whole graph. So the first thing you want to do is to look at all these small cycles, which leads to another page I generate, this time for all architectures, which can be accessed using the URL up there: the dependency cycles which are sometimes non-obvious self-cycles. Type two is the non-obvious kind. Type one covers all those self-cycles of source packages which build-depend on packages they themselves build; that's really easy to identify, you just search through all source packages and see whether any binary package they build is in their Build-Depends line. Type two is harder. It uses the concept of strong dependencies, which you can read up on: a dependency which is non-optional even though disjunctions might be involved. It is probably not very intuitive to know that pkg-config, the source package, can't be built because of libglib2.0-dev, or that libx11 can't be built because it strongly depends on graph; even though it's in a self-cycle, there is no other way to split it except for really building libx11 without graph, short of cross-compiling things, of course.

It builds the documentation.

Right, exactly, so if graph is, as it probably is, only used for documentation generation, you would put it in Build-Depends-Indep, and hooray, this cycle is solved.

Yes, another thing, which we already mentioned before, is the edges with the most cycles through them. It is probably hard to see even for this tiny graph with only 15 vertices, in contrast to the big graph with a thousand: for this tiny graph there is one edge out of all these 31 which, if removed, makes the whole thing acyclic. It's hard for a human to see that, so the heuristic of edges with the most cycles through them is used to identify that indeed this edge here has many cycles through it, and removing it indeed makes the whole thing acyclic. Yes?

So again, now adding some meta information: this one is used for testing, and marking this kind of dependency as a test dependency, which is not necessarily needed for the build, would be very useful.

What do you mean by test dependency?

So, Python build-depends on xorg.

Yes.

To run the tk tests.

Ah, right, yes. So if that were recorded somewhere, then it could easily be solved, and the algorithm could say: well, we can easily split it there.

That's a very good example of why humans are much better at this, right? Because you actually know what that dependency is for, whereas botch has no idea. And as you say, if we put profiles on them, then it becomes solvable.

Right, good. Yeah, good to know that that works out easily.

Yes, another thing is calculating a feedback arc set. A feedback arc set is a set of edges which, if removed from the graph, makes it acyclic, and we want a small feedback arc set because we want to modify little. I tested that with Debian Sid as of today, well, from the Debian snapshot. The graph has 28,000 vertices and a quarter of a million edges. The biggest strongly connected component has 1,000 vertices, as we saw before. And assuming every build dependency can be broken, all of Debian can be bootstrapped by modifying just 51 source packages. Which is of course a very optimistic assumption.
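For flavour, a minimal sketch of computing some feedback arc set (finding a minimum one is NP-hard, which is why heuristics are used to keep the set small): the back edges of a depth-first search always form one.

```python
# Illustration only: removing every DFS back edge leaves the graph
# acyclic, so the back edges are a (usually non-minimal) feedback arc
# set. Recursion depth limits make this a toy for big graphs.
import networkx as nx

def some_feedback_arc_set(g):
    fas, visited, on_stack = [], set(), set()

    def dfs(u):
        visited.add(u)
        on_stack.add(u)
        for v in g.successors(u):
            if v in on_stack:        # back edge: closes a cycle
                fas.append((u, v))
            elif v not in visited:
                dfs(v)
        on_stack.discard(u)

    for node in g.nodes():
        if node not in visited:
            dfs(node)
    return fas
```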
So we used more realistic data, which I got from Gentoo by analyzing USE flags, plus manual lists by Krossenglaser, Patrick, Mectamon, Daniel, Shetland and Wookey, to get a more realistic picture of what is actually breakable, and came up with the number of 57 source packages. It might still be bigger, but it's probably in that ballpark in the end.

A list of the packages, right now?

Yeah, sure, of course; it is also on the website. If you go to mister-muffin.de/bootstrap/stats, you will see the list of packages which this algorithm suggests, and not only the packages, but also the build dependencies that would have to be dropped to achieve that.

Right. Which of those is it? The minimal feedback arc set, only weak? It's not that one.

No, it is that one, yes: mister-muffin.de/bootstrap/stats.

There's a table of contents and I was wondering which of these various things is the set of 57 packages, because they're all called things like "strong articulation points" and "ratio binary heuristics". It's like it was written by a mathematician.

Yes. Yeah, feedback arc set, there it is. You would just uncheck the other things and then you are left with only that one. The website contains loads of things, which is why there are these checkboxes at the top, to disable what you don't want to see. And if you look at the feedback arc set: it is not complete, because of course we can't assume that the data we got from Gentoo is correct, or that the data we provided is correct, so it's still only a suggestion.

Okay, where were we? Yes, so 57, maybe a little bit more, but there is nothing more I can say right now, because we have only done this in theory, and until we do it in practice, we don't know how big the difference to the real number will be; at least theoretically it is in that area.

Right, I also want to show that the thing is relatively fast. The blue part here is the calculation of a subset of Debian which is smaller but still includes the problem, the big strongly connected component that we actually care about. Once you reduce Debian to that, all the algorithms of course get much faster, because they have to work on fewer packages. So if you reduce it, then even including the reduction, you get to an overall 62 seconds of execution time for the algorithms. And if you're a developer, and initially you care about this big blob, then you only run the blue stuff once and you repeatedly run only the green and red parts, which together take only a few seconds. The times get longer if you use full Debian, and they get even longer if you calculate strong dependencies, which is again something that, well, I can explain later, but it is needed to generate the HTML page I showed earlier, to get all the information from there. That takes a bit longer, but of course you don't have to regenerate it every time. What you care about, as somebody who would use this tool to figure out what to modify, is the green and red parts here, and that's only a few seconds. You would run the big thing probably as a daily build somehow that updates the website, which is not done yet, so there will be some time until it's updated next. But it only takes six minutes, so even doing it for all architectures will not take considerably long.
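A crude sketch of that reduction step, again with networkx; the real reduction keeps somewhat more context than just the SCC itself.

```python
# Illustration only: shrink the full ~28,000-vertex graph to the ~1,000
# vertices of the biggest SCC so the expensive heuristics run on the
# part that actually needs cycle breaking.
import networkx as nx

def reduce_to_biggest_scc(g):
    biggest = max(nx.strongly_connected_components(g), key=len)
    return g.subgraph(biggest).copy()
```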
Right, resources are here. First of all, these slides can be downloaded from there, so you don't need to type everything down. Then there's my blog, where I occasionally write about this kind of stuff; the mailing list, which we created because a cross mailing list was missing; our IRC channel; the git repositories with the software, there are three of them; and the wiki pages. The to-do page is of specific interest because it lists lots of things that need to be done for this to be possible in practice, things like build profiles and so on. There's my thesis, for a really in-depth explanation of all of this, and the threads about build profiles on debian-devel.

Right, so, conclusion. We could have easier porting, more custom ports, remove the need for Gentoo or OpenEmbedded, and other things, but we're missing a decision on the build profile format and we're missing several fixes to support cross-compilation better. So with this talk I wanted to convince you that, having all these algorithms in place, we are good to go to do all the needed plumbing to actually have this working in practice at some point. Right, questions?

Good, yeah. That means everybody answered everything, except for Doko.

Well, the thing is that I was involved, with Wookey, in the bootstrap of a new architecture and did continue that, and it's funny to see some of the recommendations here, for example, to break the cycle of building ecj by dropping gcj.

Right.

And things like that. Yes, of course, you can't always do that, but that is something that somehow has to be solved one way or the other, because they depend on each other. What I would be more interested in is: when I always see these dependencies on TeX Live, how can we make some automatic recommendations to avoid or to break these cycles in the first place, in the packaging? So that you don't need the documentation for a bootstrap build, and things like that.

So that's nearly always Build-Depends-Indep, isn't it? It's just... no, okay, because?

Because of bad packaging practices. What I see are, well, packages with packaging from the Stone Age that just have these build dependencies encoded for every build. And I think it is not clear, or we don't have any recipes, for configuring an arch-only build without documentation and an arch-all build with documentation: how to propagate the configure flags from the build-arch and build-indep targets down to the configure step. It can be done, but I think it's not obvious, and many maintainers are not aware of it.

So for many years we proposed having a DEB_BUILD_OPTIONS value, nodocs for example, which would just knobble that throughout the build. But that would not help the dependency resolution, because the package would still declare the build dependencies for that. Yes, that's right. Now we have a mechanism to do that which would be exposed to the analytical tools. But yeah, it still requires package modification to actually make it work.

Maybe the bright side is that those packages which need to be changed are core packages rather than leaf packages. In the big 1,000-vertex dependency graph which has to be solved, there are 1,000 vertices, as I said, but those come from only 360 source packages. So in the very worst case, well, you remember the number of fifty-something.
In the very worst case, you would have to modify 360, and it doesn't get more than that, because once you have modified all of them, everything is solved. And those 360 are well-known packages, so you would also, for other reasons, be interested in having them maintained and properly done. That's maybe a plus, because we don't have to care there about all these leaf thingies.

Yeah, well, I think it's all cool. Really great work. I was kind of looking for Perl, to see Perl in there, but I suppose you're excluding the essential set or something like that, build-essential. I think Perl is part of build-essential.

Yes, one assumption was to not only have the essential and build-essential set available, but also debhelper. And I think debhelper pulls in Perl, right? And that's because 73% of the archive build-depends on debhelper, so it would be hard to bootstrap it natively; rather, you cross-compile a bit more and then have debhelper and some other things available.

Thank you for this presentation, it's interesting work. I have just one remark, or maybe one question for you. Do you think it's possible to get better results, or to enhance the heuristics, using a labelled graph? For example, using a labelled graph for the dependencies.

Labelled how?

I mean, the name of the dependency, or labelling the dependencies by type, to know whether there is a strong dependency or...

Yes, that exists.

Ah, okay.

The graph is labelled with several properties, for example an edge being strong or not strong. We have two graph types, and it only makes sense for one of them to have the edges labelled as strong or not strong; that is used, for example, to calculate the strong self-cycles which some of the tables display. What other labels would you have suggested?

Well, I'm not specialized, but I thought that maybe we could classify the dependencies by type and then...

I would be happy if there was more that I could rely upon dependency-wise, where I get a bit more semantics, the meaning of what a dependency is for. But the annotations are missing. Build profiles would be what annotates dependencies as meaning something, and that's not there yet.

Okay, so you think it would be more helpful if you had these annotations and could use the semantics together with the syntax...

They are needed. And the heuristics are there to find which source packages to modify, and once this information is there and complete enough, then another algorithm can come and produce a linear build order. But of course, that annotation has to be done by a human at some point.

Okay, all right. Thank you.

So, presumably the set of type one cycles is the set of things that's fundamentally difficult, you know, compilers that build with themselves.

Yeah, yes, in that case, yes. It is also the easy-to-find type, which you could find even without any complicated graph machinery, because you would just go through each source package and check whether any binary package it builds is in its build dependency list, as sketched below. So it's the easy-to-find list, but probably the hard-to-solve list, because it's compilers depending on themselves and stuff like that. Yes.

I think we care about Haskell a bit.

Yeah, if there are any questions, or if you have any suggestions on how to improve this, I would be very happy to hear from you, because I'm only doing the theory stuff; I'm not actually bootstrapping or cross-compiling anything.
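Since the type one check has now come up twice, a minimal sketch of it with made-up data; real code would parse the Binary and Build-Depends fields of a Sources file instead of using hand-written sets.

```python
# Illustration only: a type one self-cycle is a source package whose
# build dependencies intersect the binary packages it itself builds.

def type_one_self_cycles(sources):
    for src, (build_deps, binaries) in sources.items():
        overlap = build_deps & binaries
        if overlap:
            yield src, overlap

sources = {  # hypothetical entries: (build-deps, binaries built)
    "gcc": ({"gcc", "binutils"}, {"gcc", "cpp"}),
    "hello": ({"gcc"}, {"hello"}),
}
for src, pkgs in type_one_self_cycles(sources):
    print(src, "build-depends on its own binaries:", sorted(pkgs))
```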
Yeah, so the next part of this is: if anyone wants to go around and actually fix any of this stuff. We have a student who is endeavoring to do some work, but he's mostly building scripts to help the actual process of building and uploading stuff. So yes, if anything in this list makes you feel like breaking connections, that will be handy. We could make the build order a bit more linear if anyone gets enthused, because it's only 57 packages in theory.

Yes, but at most 360; somewhere in between those two numbers.

Yes. Unfortunately, the problem is that nobody can implement anything yet, besides comments in their packaging, because we have no decision on the syntax yet; we couldn't get one yesterday.

Right, so I really think you can do two things. The first one is to split out the documentation build into its own binary packages, architecture:all, so you can move the build dependencies into Build-Depends-Indep. The second thing is to document the test dependencies, which packages are only needed to run the tests. I think you will have to do that work anyway.

I already see that many, or some packages at least, have sorted their build dependencies into sections with comments in between, saying: well, these are needed for that subset, these are needed for the documentation. That is of course most helpful, and it makes real sense to do that. Yeah, and as I said, going through that website: if I remember correctly, even the list of which source packages are included in the hard, big, strong component is available, so you can always check whether your package is in there, and whether you would like to at least comment something in your packaging.

For the purposes of moving this forward, comments in the Build-Depends field are nearly as good as the actual build profile syntax we'll get to one day. If someone works out what a dependency is for and writes it down in comments, it's a very simple problem to come along afterwards. And a maintainer can do that, right? Whereas a random person goes: I have no idea what this is, I don't know what any of these build dependencies is for, and it would be hard to work out. So, I'm talking to the internet now, just so that you all know: comments in your Build-Depends saying what the dependencies are for are really, really useful (an example fragment is shown at the end of this transcript), and there will be some syntax along very shortly, and some mechanism to make that all automated more lovely.

Is there any plan in the build profile stuff to have information given to the rules file that says which profile is being used? I'm thinking of GCC, which you'd want to bootstrap, where you do the stage one thing, and your rules file will be different if you're doing stage one, you do different things.

Well, where a profile exists as declared by a package, then you can use it during the build to do stuff, and it's kind of domain-specific knowledge. I mean, at the moment we've basically decreed that there are three profiles, stage1, stage2 and cross, because those are the only things we've thought of uses for; and test, sorry.

How is it set, sorry?

There will be a variable that you query in your rules file, and then you do the right thing. So I mean, it will always be domain-specific; any tool that does this has to have some kind of knowledge, so we define labels for purposes and then we use them. Did that answer your question?
Yeah, yeah, because the other possibility was just that sbuild would be able to drop stuff if it knew you were doing a certain profile, and then you'd have to figure out in the rules file what was actually installed and behave appropriately, yeah.

What is really missing is a way to find the packages which support some kind of staged build. Currently you have to look at the rules file, and otherwise you don't know: well, here's something which could break a dependency cycle. So I would like to see this kind of information in the control file, a supported-stages or supported-profiles field or something like that.

Yeah, so I think we're very close to actually having all that sorted. We had a demonstration, we've argued about it a bit, and I think we've pretty much agreed on a syntax, so we just have to do the new implementation, because we've changed it slightly, and start using it maybe next month. So yeah, we've got to argue with the dpkg people again, but it was them who said to change it, so hopefully we've pretty much changed it to what they asked for. So I think that should be okay now. Yeah, right. And we're done? Great.
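To make the "comments in your Build-Depends" advice from above concrete, here is a made-up debian/control fragment; the package names are hypothetical, and the assumption is that dpkg ignores lines starting with "#" in debian/control, so such annotations survive in the source package:

```
Source: hello
Build-Depends: debhelper (>= 9),
# needed only to build the documentation; a candidate for
# Build-Depends-Indep or a future "nodoc" build profile:
 texlive-base,
# needed only to run the test suite:
 xvfb
```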