Please welcome with me Simon Richter, who will tell us, hopefully not as loud as I'm speaking now, a little bit about an alternative approach to package management and auto-building tools, and how to reinvent the wheel.

Hi, good morning. Well, as you probably guessed from the title, I'm reinventing the wheel, and the basic question we should probably start out with is why. Wheel reinventions are pretty common throughout the history of mankind, and sometimes there has been a reason for them, sometimes not. Basically, about two years ago I wanted to start work on a new alternative auto-builder for Debian. This was originally supposed to be based on apt. During development, I found several shortcomings in apt which made me consider writing an alternative. About half a year ago, I tried to implement a small tool to generate CD images directly from a Debian mirror, one that unlike jigdo doesn't need any template files; it would just pick up the current files from the Packages file and be done with it, basically. Again, I failed because of some missing interfaces in apt and some shortcomings in its design. The big problem I see with apt is that apt is a tool, not a framework. It's designed for one specific task, and it's very good at that task, but it's very difficult to extend beyond that task. For example, one of the big problems I saw while writing the CD image generator was that apt is totally unable to handle multiple databases and multiple architectures at once, which would be a nice feature to have when building a CD that holds both Intel and PowerPC packages, for example. Also, it makes the assumption that only one single system is installed, and that this system is the one currently running. So it's very difficult to install packages to another system, to switch the system root after apt has been initialized.
You can only initialize apt once, so you basically have to start a new process and re-initialize every time you need to switch to another chroot, another architecture, anything.

What I started out with was a set of basic requirements for what I'd like to see in a package management tool. It should, of course, be extensible, and you should be able to strip it down to the bare minimum for use in space-constrained environments such as embedded systems, or maybe the boot floppies or the Debian installer. It should be able to handle different package types: Debian packages, RPM packages, LSB packages. It should handle all of them at the same time in a single process, and it should also be possible to declare dependencies between different types of packages. This is going to be interesting with the LSB transition, where an LSB package might be installed through the normal package manager, and that LSB package has some dependencies on Debian packages. So basically, if you install Oracle from an LSB package, it should pull in lsb-base. That is the goal. There were also some minor objectives, like being able to download half of an upgrade, apply that part, clean out the cache directory, and download the next part, again for space-constrained environments; and doing all changes on the file system in the form of transactions, so that you can always roll back in case there's an error.

So the basic question is how to achieve that. I've talked it over with some people in the IT department of my university, and the basic approach we chose to take was that of a relational database, stripped down accordingly. The main idea is that you have a set of packages and relationships between those packages. A package can be anything: it can be an RPM package, a Debian binary package, or a Debian source package. There is actually no fundamental difference between a source package and a binary package.
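The relational model described here could be sketched roughly as follows. This is my own illustration, not the speaker's actual schema: the struct and field names are invented, but they show the key point that packages of all types live in one table, and relations are typed rows that can point across package types.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One table for every kind of package: .deb binaries, Debian source
// packages, RPMs, LSB packages. The type is just a column.
enum class PackageType { DebBinary, DebSource, Rpm, Lsb };

struct Package {
    int         id;
    PackageType type;
    std::string name;
    std::string version;
    std::string architecture;
};

// Relations are rows too, and nothing stops a relation from crossing
// package types, e.g. an LSB package depending on a Debian package.
enum class RelationKind { Depends, BuildDepends, Conflicts, Recommends, Suggests };

struct Relation {
    int          from_id;             // package declaring the relation
    int          to_id;               // package it refers to
    RelationKind kind;
    std::string  version_constraint;  // e.g. "= 1.0", empty if unversioned
};
```

With this shape, "install Oracle from an LSB package and pull in lsb-base" is just a `Depends` row from an `Lsb`-typed package to a `DebBinary`-typed one, and features like Recommends handling can be dropped simply by ignoring rows of that kind.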
The only differences are in the details: a source package can be installed to an arbitrary directory, whereas a binary package has a fixed location, and a source package can generate other packages while a Debian binary package cannot. But there's no reason for me to make it a requirement that only some special source packages can generate other packages; I'm going to talk about that later. The relationships between packages are then expressed inside that relational database as relations as well. The basic idea is to expand those relations only when they are needed. You can leave out the handling for, say, Recommends or Suggests on an embedded system, you can leave out all the build-dependency handling on a system that will never build a package, you can leave out the RPM support on a Debian-only system, and so on.

I'm going to summarize this for a second and give a few examples. I think you can see that back there. Okay, I'm going to make this a bit bigger. Okay. One example of a specific challenge a package management system has to face nowadays is an upgrade where you have strictly versioned dependencies, which happens pretty often, especially with packages like OpenOffice. Say A version 1 depends on exactly B version 1, and you want to upgrade both package B from version one to two and package A from version one to two. The only way to keep a system like this consistent is to remove, or at least de-configure, package A, then upgrade B, and then install the new A. This is one of the big challenges in writing package management software, as all of this will need to happen inside a transaction, so you can roll back if anything goes wrong, like the system disk filling up, which is basically unhandled at the moment. So a big challenge was trying to map this case onto something sensible.
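The transactional remove/upgrade/install sequence could be sketched like this. This is a minimal illustration under my own assumptions (the `Transaction` class and its interface are invented, and real file-system state is replaced by a map): every step records an undo action, so a failure mid-sequence, like a full disk, rolls everything back.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of a transaction with an undo log: each successful
// step appends its inverse; on failure, the inverses run in reverse order.
struct Transaction {
    std::vector<std::function<void()>> undo_log;
    bool failed = false;

    bool run(std::function<bool()> step, std::function<void()> undo) {
        if (failed) return false;          // transaction already aborted
        if (!step()) {                     // step failed: roll everything back
            failed = true;
            rollback();
            return false;
        }
        undo_log.push_back(undo);
        return true;
    }

    void rollback() {
        for (auto it = undo_log.rbegin(); it != undo_log.rend(); ++it)
            (*it)();
        undo_log.clear();
    }
};
```

For the strict-dependency case from the talk, the three steps would be: remove A1 (undo: reinstall A1), upgrade B to 2 (undo: downgrade to 1), install A2 (undo: remove it). If the install of A2 fails, the undo log restores B1 and A1, leaving the system consistent.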
Well, I'm afraid I can't present any code at the moment. I have spent the last days implementing a lot of stuff for this, and basically I've re-implemented quinn-diff in about 20 lines of C++ code; if you leave out the parsers, which are part of a library now, the remainder is pretty small. The big problem I was facing was that the Packages file is not in its entirety UTF-8 encoded, so there needed to be a lot of really awful hacks to parse it in C++, because C++ iostreams cannot change encoding mid-stream. Well, that's an implementation detail, basically. Yes.

So, well, I'm a bit unsure now whether I should tell you more about the concepts or whether I should talk about auto-building at this point. No opinions? Great. Okay, I'll tell you a bit about my requirements for auto-building, and then I'll go back to the concepts and try to explain how to integrate all of this.

For auto-building, I have also set up a list of requirements, things I think would be nice to have but are currently very difficult to implement because of the structure the auto-builders use for representing their state. I've thought about two basic issues I've seen in the last years. The smaller one is ABI transitions where the soname changes: a set of applications needs rebuilding against the newer library, where just the ABI changed and the API did not. Currently, everyone who depends on the library that changed its soname needs to upload a new version in order to kick the auto-builders into compiling against the new version. But unless you specify an explicitly versioned build dependency, you cannot even be sure that this re-upload of the same source will actually build against the new library. So new libraries have to settle into the auto-builders for a bit, or you would have to upload yet another new version in order to have it rebuilt. Basically, there's no real need for that.
I think the auto-builders could automatically figure out that a package is going away. It's pretty easy to see: if a library package whose name carries the soname goes away, it will claim to be built from a specific source package, and that source has advanced to a new version while the library package did not advance with it. So it's easy to conclude that this package is probably no longer being built from source and should be removed, and the auto-builders can conclude that any package depending on this library should probably be rebuilt now. At the point where the auto-builder can see that the library package is outdated, it can also see the current version, so you can be sure the new version of the library will be picked up.

The actually bigger problem, which we've seen twice in the last years and which should also be handled by the auto-builders, is big ABI transitions like the C++ ABI transition. The idea I'd like to propose is an ABI tag on binary packages. The control file would list which tags should be present in the final binary package; then, while generating the final control file, dpkg-gencontrol would look up the respective settings currently in use on the host system and write them into the control file. The rule of thumb would be: no package can declare a dependency on another package that differs in an ABI tag that is present in both.
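The proposed rule of thumb is a small predicate, and can be sketched as follows. This is my own formalization of what the speaker describes, with invented names: a dependency is allowed unless some tag appears in both packages with different values; a tag present on only one side constrains nothing.

```cpp
#include <cassert>
#include <map>
#include <string>

// Each package carries a set of ABI tags, e.g. {"c++" -> "2"}.
using AbiTags = std::map<std::string, std::string>;

// A dependency from a package with tags `a` on one with tags `b` is legal
// iff every tag present in BOTH packages has the same value.
bool abi_compatible(const AbiTags& a, const AbiTags& b) {
    for (const auto& [tag, value] : a) {
        auto it = b.find(tag);
        if (it != b.end() && it->second != value)
            return false;  // tag present in both, values differ: forbidden
    }
    return true;
}
```

Under this rule, upgrading libqt from C++ ABI 1 to ABI 2 instantly makes every dependent package tagged with ABI 1 uninstallable, which is exactly the signal the auto-builders need to trigger rebuilds.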
So if we upgrade, say, libqt from C++ ABI 1 to C++ ABI 2, as will happen soonish or has already happened, I didn't follow that actually, then any package depending on libqt and specifying an ABI version of one will have its dependency broken, and the same mechanism that picks up soname changes will kick in: oh, this package is uninstallable, its dependencies are no longer satisfiable, let's try rebuilding it, and if the rebuild works, it's fine to upload. Note that at this point the auto-builders would need to generate new version numbers all by themselves, which would need to be addressed as well.

Okay. Those are the basic requirements for the entire project, and I've already explained why. Well, I've told you a bit about the concept, so I think I'm going to continue with the actual execution layer, the part that actually does things rather than just thinking about what to do, how to install a package. The basic problem one is facing here is that the package dependency graph is not a tree; it's not even acyclic. You have lots of cycles in the dependency graph which need to be broken in order to do anything, and basically, the idea is to compile the relevant portions of the dependency graph into an execution tree that can be executed one node at a time, and, well, when you're done, you're done. The basic idea for the tree is to have each node export two functions. One of them is execute, which is pretty easy to understand, and the other is split. Split says this action might need to be split up into sub-actions in order to break some loop in the dependency graph. Additionally, I'm going to add three meta actions, 'and', 'or', and 'parallel'; some of the GCC hackers might guess what 'and' and 'or' mean. 'And' means: do the first thing on the list, then the second thing, and so on, and if something fails, tell all the nodes that have already been executed.
'Or' means: do the first thing, and if that fails, do the second thing. 'Parallel' means these things are in no particular order and can be executed at any time while you are in the node. This is going to be important when we need to install packages right in the middle of a download, or in parallel with downloading other packages. Downloads themselves are actions, just like installing and upgrading and so on.

A simple typical action would be: upgrade package A. You start out with a single node, upgrade A, and you can either call execute on it, which would most likely split the node and call execute recursively, or you could split it yourself, at which point you get download A and install A. This is basically an 'and' node, because you need to download before you can install. The next step is to split up the download into acquiring the package and verifying the MD5 sum or the SHA1 sum. So this node is acquire, this one is verify, and acquire can in turn be split into an 'or': acquire a diff file, or acquire the full package. So after you've been through this part of the tree, you will have either a diff file or a package file; the verify part then checks that the file you've actually got matches the checksum recorded in the Packages file, the download is complete, and then you can go on installing. You don't need to expand the entire tree at once; you can do that as you visit each node. And in case your action concerns multiple packages, there's nothing that stops you from downloading while you're installing other packages, as those are basically running in parallel. That would be an example of the 'parallel' kind of action.

Well, those are basically the concepts. Are there any questions up to this point? Is anyone still awake? Well, I've seen one hand now. Okay.
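The execute/split node model with 'and' and 'or' meta actions could be sketched like this. The class names and interfaces are my own guesses at what the speaker describes, not his actual code; 'parallel' is omitted here since a faithful sketch would need a scheduler, and the failure notification in 'and' is simplified to an early return.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Each node in the execution tree can be executed, or split into
// finer-grained sub-actions (an empty result means the action is atomic).
struct Action {
    virtual ~Action() = default;
    virtual bool execute() = 0;
    virtual std::vector<std::unique_ptr<Action>> split() { return {}; }
};

// 'And': run children in order; stop on the first failure. A real
// implementation would notify the already-executed children for rollback.
struct AndAction : Action {
    std::vector<std::unique_ptr<Action>> children;
    bool execute() override {
        for (auto& c : children)
            if (!c->execute()) return false;
        return true;
    }
};

// 'Or': try children in order until one succeeds, e.g. "acquire a diff
// file, or else acquire the full package".
struct OrAction : Action {
    std::vector<std::unique_ptr<Action>> children;
    bool execute() override {
        for (auto& c : children)
            if (c->execute()) return true;
        return false;
    }
};

// A trivial leaf action for demonstration: succeeds or fails on command
// and counts how often it was run.
struct Step : Action {
    bool ok;
    int* counter;
    Step(bool ok, int* counter) : ok(ok), counter(counter) {}
    bool execute() override { ++*counter; return ok; }
};
```

The "upgrade A" example then becomes an `AndAction` of download-and-install, where the download is itself an 'and' of an `OrAction` (diff or full package) followed by a verify step.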
Well, yeah, the next thing I'm going to tell you about is the current state of the implementation. The best concepts, well, whether they are the best I don't know, but the best concepts are worth nothing without an actual implementation. Currently, what is working is the package list parser — well, almost working, because of the encoding issue, which is a call to all Debian developers in some strange sandy countries to actually use UTF-8 for their names. You can parse a package list, and you get a pair of C++ iterators over it, so you can use all the STL algorithms like sort and copy on package lists. There's a small container that can hold packages, and work is under way on the scheduler engine at the moment, which I can test independently. So that's the current state of the implementation. I hope to have something working very soon. Is this a question? Yeah.

One of the things that apt doesn't do, which would be really cute if you're playing around writing something to replace it, is this: there's no particular reason why you should download all the packages in one run and then unpack them all in a second run. The package model we use allows you to be unpacking the previous package while you're downloading the next one, leave it in an unpacked state, and then simply configure them all when you're finished. We've played around with this with apt, and it cut the upgrade time to about a third. Is that something you'd want to do with your redesign, to make it a lot more parallel?

There we go, that was the one I was talking about. This is basically what the 'parallel' node is about. It allows us to say that these packages are unrelated: if they are on the same level inside a parallel node, they are unrelated, so you can download one, and install it while the second one downloads. Yeah.
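The point about STL-compatible iterators can be illustrated briefly. The real parser's iterator and record types aren't shown in the talk, so this sketch stands in a `std::vector` for a parsed Packages list and invents a `PackageStanza` record; the point is only that once the parser exposes standard iterators, algorithms like `std::sort` and `std::copy` work on package lists unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Stand-in for one stanza of a Packages file (real parser not shown).
struct PackageStanza {
    std::string name;
    std::string version;
};

// Any range with standard iterators can be fed straight to STL algorithms.
std::vector<PackageStanza> sorted_by_name(std::vector<PackageStanza> pkgs) {
    std::sort(pkgs.begin(), pkgs.end(),
              [](const PackageStanza& a, const PackageStanza& b) {
                  return a.name < b.name;
              });
    return pkgs;
}
```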
But you don't even need that. Let's say you've got a dependency tree: there's no particular reason, other than pre-depends and conflicts, that you need to install that dependency tree in order from the leaves up. You can just unpack the whole lot onto disk, leave it all in the unpacked state, then run dpkg once to configure the lot, and dpkg will cope. So there's no particular reason to go through the process of "we'll download all of those 30 in one go and then install them"; you can download one at a time, unpacking it while you're downloading the next one, and so on.

Yes, this is also possible. I've not expanded the install part yet, but it's perfectly doable: the install part will split into unpack and configure, and there's no reason not to delay the configure step. Right. Okay.

Okay, another question? When I got myself out of bed this morning, I looked at your talk description and it said workflow. Does that refer to the way people interact with this entire system, or are you actually referring to the topological sort of the tasks? I'm afraid to say that I wrote that so long ago that I don't really remember; basically, I think I meant workflow inside the system. So this graph that you've drawn there, this decision tree or whatever it is — are you aware of workflow diagrams and topologically sorted task diagrams? Not really. I've talked with some people in our CS department, but only to the point where we had something that seemed to work. Okay, thanks.

Okay. I see I still have 10 minutes to go, and, well, I'm not overly sure what to tell you next. Hmm. This is going to be difficult. No, that would look ridiculous even this early in the morning. What else could be told about this?
Well, I could tell you about the current issues I'm facing; this is mainly a cry for help. I'm still working on the part where you go from the cycle-ridden dependency graph to the action tree, which is an actual tree. The algorithm for that is not entirely fixed yet. I think I have something that works, but I'm not sure about some corner cases, and I'd really appreciate it if someone spoke up and brought me a corner case that is not handled at the moment. So if you think you've found a problem, just tell me; I'm grateful for that.

The basic question is this: if you encounter a dependency cycle in this graph, how does the algorithm decide where to start? I thought this was pretty much a manual process. Breaking a dependency cycle can really only be done at random. A dependency cycle cannot be broken without violating assumptions the packages make, so you have to guess which edge you can break: print a warning, break one of them, and see if it works. This is also how dpkg handles it at the moment, as far as I know. And if it breaks, then with the help of the transactional system you can just roll back and try the next one — yes, that's the plan, good idea, thanks. So far it hasn't broken, so there's no real need; we can just print a warning, and as warnings annoy people, they will at some point perhaps degrade one of the Depends into a Recommends. There should actually be no particular reason for a dependency cycle; there was a long thread about this on debian-devel, by the way, in case anyone missed such a big thread.

Okay. The basic idea behind the algorithm: I start out with this example, because it's the problem case where you actually have to think about what you're doing next; you cannot simply go and install this, install this, install this, oh, we're done — like my last sarge upgrade went, which was fine, by the way. So how do I start on this? Basically, I start with a list.
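The "break an arbitrary edge, warn, and retry" strategy can be sketched as a variant of topological sorting. This is my own illustration, not the speaker's algorithm: a Kahn-style sort that, whenever no package is ready (i.e. the remaining graph is one or more cycles), simply drops the dependencies of an arbitrary remaining package and carries on; real code would print a warning there and rely on the transaction to roll back if the guess was wrong.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each package to the set of packages it depends on.
using Graph = std::map<std::string, std::set<std::string>>;

std::vector<std::string> install_order(Graph deps) {
    std::vector<std::string> order;
    while (!deps.empty()) {
        // Find a package whose dependencies are all already installed.
        auto ready = deps.end();
        for (auto it = deps.begin(); it != deps.end(); ++it)
            if (it->second.empty()) { ready = it; break; }
        if (ready == deps.end()) {
            // Nothing is ready: we're inside a cycle. Break it at an
            // arbitrary package (a warning would be printed here).
            ready = deps.begin();
            ready->second.clear();
        }
        order.push_back(ready->first);
        // Mark this package as satisfied for everyone else.
        for (auto& [pkg, ds] : deps) ds.erase(ready->first);
        deps.erase(ready);
    }
    return order;
}
```

With the transactional layer from earlier in the talk, a wrong guess is cheap: roll back and break the cycle at the next candidate edge instead.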
I want to install B2 and A2, in no particular order. This list can be retrieved pretty easily from the dependency graph — I've still been using the word tree inappropriately here. As soon as I know we want to install both of these packages, and they're both upgrades, then I know I have two actions: upgrade B, upgrade A. This is an 'and' node — I've practiced drawing this character a lot, but it still doesn't work. The order here is arbitrary at this moment, because this is something the dependency graph might actually tell us, but we will get to it later anyway, and the dependency graph might also spit out incoherent information at this point, which is basically what happens with dependency cycles. So we know we need to upgrade both these packages. Then we look at the dependencies again, since we still need to figure out the order; basically this is an 'and' with as-yet-unspecified order. We look at the dependency list of A, see that A depends on B, and see that B does not depend on A. At this point we can establish the order, which is: B, then A. And the upgrade of B has a further complication: through the strictly versioned dependency of A1 on B1, B2 now implicitly conflicts with A1. So we see that we actually need to split the operation into remove A1, upgrade B, and install A2, where the remove and the install are basically sub-actions of the upgrade of A. And that is how the big problem case is handled. So far I haven't found any other big problem cases that I would need to handle in a special way; I'd be glad to get some input on that. I'll try constructing some strange cases, but so far I haven't encountered any problems. And I can't quite believe that it would work this well; it cannot be that simple. Good. Any questions up to this point? There's not much more I can say about this.
So I guess this is going to be the end then. OK, thanks.