Hi, I'm Mike. Some of you were at my last talk. Anybody here not used Koji? I figured most of you have. Anybody here not actually built packages in Koji before? All right, that's about what I figured. So this is about reproducible builds in Koji, and if you were at my last talk, this is the same intro slide. I've been working on Koji since the beginning, and I've done quite a lot of the work of maintaining it. I work on the Release Configuration Management team inside of Red Hat, which is a big, fancy, long-winded name for release engineering. Except there's another team that isn't exactly release engineering either, so it gets confusing. A long, long time ago I used to work in QA, and even longer before that, on other things.

On Fedora, I'm mikem, and pretty much anywhere Fedora-related, I'm mikem. Other places I'm not mikem, so don't just ping random mikems on the internet and expect them to be me. On Freenode I'm mikem23; the original mikem didn't log in for a year, and I got it, so I have successfully sniped a Freenode account. But, for example, I'm not mikem on Twitter. I'd like to be; the guy who is mikem on Twitter doesn't seem to be using it, but I don't know. And there are actually a lot of Mikes at Red Hat, but I'm the one that's mikem. Sometimes I get their mail.

About Koji: you all know about Koji, but one thing people sometimes don't realize is that Fedora is not the only project that uses Koji. There are a bunch of others, and that's not even a complete list, just some prominent ones. So, I pulled this definition off of Wikipedia. Reproducibility, in general, is a term used for scientific studies, but it's the same basic idea: somebody did something, and we want to be able to do the same thing. In our case, the experiment is a build, and builds are definitely experiments. Frankenstein experiments, sometimes. What goes into a build is more than people sometimes think.
And yes, the source is very important, but that's not all the information that goes in when you initiate a build. There are also build parameters: anything you might pass to mock, any macro definitions or build options that get passed on the command line. Those are part of the ingredients, and the build environment is part of the ingredients. If you build the same source in a different environment, you'll get a different result. As an extreme example, if you use a completely different version of GCC, you will definitely get a different result. If you get a result at all. Good luck with that.

So, Koji's been around for a long time. As I said in the last talk, development started over ten years ago, and it's been a bit over eight years since 1.0. And reproducibility has been a concern for Koji from day one. Koji's approach has been not so much to focus on what folks are now worrying about, which is byte-for-byte reproducibility, but on reproducing the build environment and logging everything that goes into the build, so that you have all that data, and if you need to do it again, you can do it again.

So, just as a review, I think a lot of you know this, but let's go over what Koji does when it actually runs a build. First, it creates a fresh build root every time. That is generated using mock. Of course, when you have mock generate a build root, you've got to give it a repo, a yum repo, to use as the source for the RPMs. And that is a repo generated by Koji. When Koji generates a repo like that, it represents the content of the build tag at a specific point in time, and that's logged. So for every repo we've ever used in Koji, we have an event ID logged; we know exactly which tag content was in that repo. Yes? It's not a random question: are those the repos that mock makes? No. Mock doesn't make these repos.
So the input to mock is a repo? Right. Koji takes the build tag, with everything that's in it according to the hub, and generates the repo from that. Okay, but mock can do that too? Mock uses createrepo? Yes, mock can run createrepo over local package directories and combine them into a repo; it runs createrepo in place. Right, but we don't use mock for that. Mock, interestingly, didn't exist when we started writing Koji; it came along later, with plague. And mock's repo handling is a bit slower, not terribly slower, but a bit slower than just running createrepo by itself. Well, it depends; it can be really slow. If it can't hardlink the files, it copies them in place, and if you point it at an HTTP repo, it will download all the packages and cache them again in place. Okay, but we're going down a bit of a rabbit hole here, because it's not really pertinent. No, it's an interesting point, because I do want people to understand a little of the detail about what happens.

At the moment, Koji just runs a plain old createrepo on a single-arch repo. So there's no multilib going on in the Koji repos, which for 99.99% of builds is all you need. A couple of odd builds really want a couple of odd multilib things in there, for various arch bootstrapping reasons, so we have a couple of workarounds in Koji to handle that.

So, in order to make sure that we could do it again if we had to, Koji tracks a bunch of data, and I'll cast this in terms of the ingredient list we had before. In terms of source code... and I should apologize: Koji can build more things than RPMs, but this talk is really just about RPMs, which for Fedora is most of what we care about.
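To make the flow above concrete, here is a rough pseudo-shell sketch of what happens per build. The paths and config names are invented for illustration; the real work happens inside kojid and mock, not in these literal commands:

```shell
# 1. The hub resolves the build tag's content at a logged event ID and
#    runs createrepo over it (single arch, no multilib):
createrepo /mnt/koji/repos/f24-build/1234567/x86_64/

# 2. The builder writes a mock config pointing at that repo and creates
#    a fresh, throwaway buildroot:
mock -r f24-build-1234567 --init

# 3. The actual build runs inside that fresh root:
mock -r f24-build-1234567 --rebuild foo-1.0-1.fc24.src.rpm
```

The key property is that step 1 is pinned to an event ID, so the repo contents are a logged, replayable snapshot rather than "whatever is in the tag right now".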
We still care about image builds too, but that's slightly different. For source code, we have it captured two different ways: every RPM build saves the SRPM, so we have the source that way, and, as long as the developer has done things sanely, we also have the git ref stored in the task info. In Koji 2.0, I want to get that into a saner part of the database, where it's queryable and more of a first-class citizen, but we do have it. But it's possible to use a mutable reference, like a branch? Right, I said as long as the developer is doing things sanely. Another thing I'd like to have in Koji 2.0, and I mentioned this in my last talk, is policy checks, so that people running their own Koji instances can set rules about which sorts of references are valid or invalid. You might want all sorts of rules; you might want to say you can only build from this git branch for this tag, and things like that. Those rules are hard to implement now without writing a very complicated plugin. So anyway, source code: we've got it tracked. One way or the other, we know what the source code was.

Build parameters. Well, there generally aren't too many build parameters per se that go into Koji. The ones there are tend to be captured either somewhere inside the SRPM itself, because you're going to set macros in the SRPM and the spec file, and that's just already there, or as a task parameter. Some builds will have a task parameter that tweaks something; in particular, Maven builds, which again I'm not really talking about here, have lots of different options that get passed in, and we capture those. Or take the build tag itself: dist tags, the way we usually do them, are set when we build a package inside a build tag that defines a dist tag for the build root. That's sort of a build parameter, though really it shows up more in the build environment.
And lastly, we have the build environment. Koji sort of doubles up on this. For one, when we create the build environment, we log the repo ID that we used, as in: we built this build root out of this repo. So we know which tag, and which event in time, we used to generate that repo, and we can look at the history of the tag contents and say: this is what was in there. So we can always remake that repo for you; at least the data is there. But also, we've recorded all the RPMs that actually went into the build root, so we have that logged as well.

We don't track everything in the world. There are some things we should probably also track, and probably will track in the future. The software outside the build root we don't have a record of right now. The exact version of mock that we used to generate the build root, that could be important; we don't track it, but we will in the future. Likewise yum, and likewise Koji itself. And for that matter, the running kernel version. Theoretically it shouldn't affect the build, but somehow it does. There's the interface between the kernel and userspace, which is supposed to be stable, and it tends to be unidirectional for the most part. What usually happens is that if you have a kernel that's too old, some packages will fail or build weirdly, because they check: is this kernel feature around? No? Well, let's not build this whole other sub-module. Or packages that parsed the kernel version and assumed it would always start with a two, and then kernel 3.0 came along. Suffice it to say, we could do a better job here, and we will.

So let's talk about Debian. Any Debian fans in the house? It's okay, I won't judge. Who's familiar with the Debian reproducible builds project? These guys are doing great work.
And since some of you aren't familiar with it, I'll explain briefly. The Debian reproducible builds project has a very laudable goal: they want every package in Debian, which is a lot, to be byte-for-byte rebuildable from source. And the reason they want byte-for-byte is independent verification. They want people to have confidence that the binaries they're shipping do in fact correspond to the code they claim to correspond to. And from their perspective, the most direct way to prove that is to rebuild it again and get the same bytes.

In order to achieve that, and you can read about this on their project pages, there are two major areas of work as I see it. One, they've built a new toolchain that records the build environment, in .buildinfo files, and tooling that allows them to replicate that environment. Those files contain very similar data to what Koji has been tracking, including some, but not all, of the other stuff I said we should be tracking but aren't yet. And two, they've been fixing individual packages on a case-by-case basis when they find problems.

In an ideal world, if you have the same inputs, the same source, the same build parameters, and the same build environment, and you run the build again, why shouldn't you get exactly the same output? Well, you don't, in a lot of cases, and they've found a bunch of reasons why. There's a long list on their project website, but it boils down to two major groups. One is non-determinism. Right, timestamps. Timestamps everywhere. They put timestamps in strings, in binaries, in documents, in help strings, anything. And randomness.
Some builds use a random number generator to decide something or other. I think they talked about one build using fortune to generate some documentation at build time. So when they find cases like this, they go in and they patch the build, and they usually patch it upstream. And then there are other environmental factors, where a build is pulling in something from the environment that it really shouldn't. It shouldn't be pulling a particular UID or GID from the build environment, which is probably not an invariant number every time you make a build environment. It shouldn't be pulling the exact running kernel version, down to the very last bit, and sticking it in a file or a string, because that shouldn't matter at that fine a granularity. You definitely don't need to stick it in a comment somewhere.

Question: if a package were to save its own build info into some arbitrary file, would you just fix that in the package? Well, yes, in terms of packaging, you could possibly fix that at the spec level, sure.

So this is from the Debian website, and I know you can't read it; it's a very long list of the different types of issues they've found. Timestamps from C++ macros, things that vary with the umask, and so on. It goes on for a very long time, and most of it is timestamps. Timestamps, timestamps, timestamps. And randomness. Personally, I'm not a huge Debian fan; it's always seemed like a weird alien thing to me, but these guys are doing an awesome job. They really are.
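To make the timestamp and randomness problem concrete, here is a toy sketch, not taken from any real package, of why two runs of the same "build" can differ when the build embeds the clock or a random value:

```shell
# A toy "build" step that embeds the build time and a random value into
# its output, the way real packages embed __DATE__/__TIME__ or `date`
# output into binaries and documentation.
build() {
  echo "compiled on $(date +%s%N) seed $RANDOM" > "$1"
}

build out1.txt   # first "build" of identical "source"
build out2.txt   # second "build" of identical "source"

# The outputs are not byte-for-byte identical, even though the inputs were:
cmp -s out1.txt out2.txt && echo "identical" || echo "different"
```

The fix Debian applies in cases like this is to patch the package, usually upstream, to use a fixed reference date (the SOURCE_DATE_EPOCH convention) or a fixed seed instead of the live environment.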
It's a peculiar task. There are so many packages out there, and so much non-determinism. They have a pretty graph of their progress, and they've made a lot of progress, but it's really tapering off. I think it's going to take them a long time to get through the long tail. But a lot of the work they're doing, they're fixing upstream, so we'll get it too, eventually.

Have I been in touch with them? A little. I really need to get back in touch with them and see where they are. Because, and I'll get to this later, I don't really have the time right now to lead a reproducible builds initiative. And there's a question of to what extent we have the same motivations as them.

So anyway, back to Koji. As I said, Koji's been tracking build environments since day one, but we had somewhat different goals. When I put these features into Koji, I wasn't thinking, let's publish this data so that people can replicate our builds. I was writing this for the internal Red Hat build system. This was about reproducing failures, because our previous build system did not have any level of reproducibility, and that was a terrible situation to be in. If something went wrong and you needed to rebuild, you could introduce a bug from a changed build environment when you rebuilt with a simple one-line patch, and not have any idea why. Or you could have a failure that you couldn't explain, and then you couldn't replicate it to understand it. Do I mean that the build environment could be updated while, say, a kernel build was going on? Quite possibly. Would that mean your kernel would not work? It would not surprise me in the slightest. When I first started working with the CentOS guys, they had to reverse engineer all this information about the different beehive builders, because beehive had long-lived build roots.
beehive had build roots that lived for months, years, and just kept getting package updates. Imagine if you had a laptop that you had started with Fedora 1 and just kept yum-updating, without any of that fixing up of broken packages at every step, all the way up to 22. What kind of install environment would you have? Well, you'd have something like a beehive build root. So when we wrote Koji, we really wanted some sanity here. That's what we were going for: sanity, and the ability to reproduce builds.

And since we track all this stuff, one really nice thing is that if you discover that, say, there was a bug in glibc version XYZ that could affect things built with that version of glibc, you can actually go into the database and figure out exactly which builds used that version of glibc, so you know what you need to look at. Whereas with the previous system, you would have had to exhaustively examine every binary in the distro to figure out which were affected.

So that's why we have the reproducibility data we have in Koji. And by the way, a couple of people have asked questions; please do keep asking questions. This is a bit of a duplicate slide from before, just a reminder: for Koji build roots, each build task is given a source and a target; those are the two things you provide. When Koji does the build, it looks at the build tag for that target, gets the current repo, which references a particular point in time, generates a mock build root, runs the mock build, and leaves the long data trail we talked about before.

Yeah? So, given that we have the build records for each RPM, is there a way to reach back in time and access all this?
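As a sketch of that kind of "who built against this glibc" query, the koji CLI can answer it roughly like this (the NVRA is invented for illustration; I'm assuming the rpminfo command's buildroots option here):

```shell
# List the buildroots this exact glibc rpm was installed in; each
# buildroot maps back to the build that was produced in it.
koji rpminfo --buildroots glibc-2.23.1-7.fc24.x86_64
```

With the old system there was no equivalent; the component data simply wasn't recorded anywhere queryable.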
It's not that accessible, but I have, well, not a demo, because I'm not doing a live demo, but I have pictures of doing exactly that. It's possible, but parts of it aren't allowed for ordinary users; it's all in the API, but some of it is restricted by policy. Yeah, I'll get to that. So yes, you can do this.

Let's look at the places you would go to get this reproducibility data out of Koji. The task parameters come from the task info; you can get that on the command line, in the web UI, and in the API. Build root contents: there's the list-buildroot command, and also the web UI and the API, because there's nothing in the web UI or the CLI that you can't do with the API. And the yum repo: for a recent build, the yum repo may still be there, so if the build happened earlier today, you can probably just reference that repo and do the build again. For older builds, we still have the data, and yes, Koji can remake that repo, and I'll show you how. Unfortunately, that's a privileged action. You don't have to be a full admin to do that part, but you do have to have the repo permission, and we generally don't recommend handing that out willy-nilly. So not just anyone who can use Koji can regenerate the repository for an old repo ID.

But before we get to that, let's talk about local mock, because I know we have the fedpkg command that does mockbuild. I think by default that references the main release repos. Well, it calls koji mock-config, but by default I think it actually references the distro repos. Yes, exactly; there is also a Koji repo defined in there, which is not the same repo, but it's disabled by default, and with a command-line option you can enable it. So there is an option for fedpkg mockbuild, but by default it might not build from the Koji repo.
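The places listed above map onto a few CLI commands; a quick sketch, with the IDs and NVR invented for illustration:

```shell
# Task parameters, including the source URL, for a given task
koji taskinfo -v 14616133

# Everything that was installed into a given buildroot
koji list-buildroot 5451932

# Per-build summary, which links back to the tasks and buildroots
koji buildinfo foo-1.0-1.fc24
```

All three are thin wrappers over API calls, so anything you see here can also be pulled programmatically.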
Right, that uses the latest. But if you want to use a specific repo, say one that's not the latest but was the latest four hours ago and is still around, the koji mock-config command is a bit of a Swiss army knife for generating mock configs. This is an example I ran through last night. I went and grabbed a recent successful build, in this case it was ast. I grabbed the task ID for its x86_64 buildArch task, and I said: Koji, could you make a mock config from that task ID and spit it out. That's all the command does: produce the same mock config that this task used. Then I get the source RPM, move the mock config somewhere I can use it, and rerun the mock build.

Is it byte-for-byte? It isn't, and I should emphasize that's not because of using mock this way; that's a packaging thing, and it's going to vary by package. So this is sort of DIY: replicate on your own system. As long as the repo is still around, you can reference it, and in the next example you'll see what you can do with the right privileges. In the future, I really want to make a client-side command that would build the repo locally, referencing the content in Koji, so you could remake the original repo locally and use it, and then you wouldn't need any privileges on the Koji side. If you keep pulling on this thread, we have all the data, so it's totally doable, and no privilege would be required.

Now, can we make Koji do it? The answer is yes, but it does require some privileges. The approach we take: step one, extract the parameters we need from the original build: the source URL, the build tag, and the Koji event ID from the repo that was used. Step two, get Koji to replicate the repo. Step three, build using that repo. That last step is also privileged; it's just a --repo-id option to the build command, but if you don't have the right privileges, it will error and tell you you're not allowed to do that. That's actually something we can
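The DIY replication just described can be sketched as a short session. The task ID and package are made up, and the exact option spellings are my recollection of the mock-config command:

```shell
# Ask Koji to emit the same mock config the original buildArch task used
koji mock-config --task=14616134 --name=replicate -o replicate.cfg

# Grab the original source RPM for the build
koji download-build --arch=src ast-1.0-1.fc24

# Put the config where mock can find it and rerun the build locally
cp replicate.cfg /etc/mock/replicate.cfg
mock -r replicate --rebuild ast-1.0-1.fc24.src.rpm
```

No server-side privileges are needed for any of this, as long as the referenced repo hasn't been garbage-collected yet.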
adjust, depending on whether you want the people on your Koji instance to be able to do that. Part of it is garbage collection: the repo may already be gone. So there are really two issues. Do you want to support it, and do you want your users complaining to you when it doesn't work? And do you want your developers to be able to build from last week's version of the repo for your product? It may be important that they not do that. Yeah, but in some cases it's useful to do that. Oh, I agree; this is something you can adjust in the hub policy to suit your instance. Another reason you might not want everybody doing this is that it does put some load on the system. Repo jobs are not free, and I think these repo-at-an-event jobs tend to be a little heavier than most, because they can't make as much use of caching; I think they end up regenerating from scratch. I'm not sure about that; we should probably fix it.

So that's our approach. And as I said, no live demos, but I did a few dry runs; I ran this one last night. This is a piece of the build info page, just to show you where you can pull the data from. From the build info page, I mainly want to go straight to the task page; that's the link down there. You could do all of this in the API too. For the build task, the screenshot here is where we get the source URL, the first of the three ingredients we need, so just cut and paste that. There are other ways to get at it; I actually have this semi-automated in a really hackish script that will rerun a build task for you. And then we dive down from the build task to the buildArch task for a few more pieces of data. We need the build tag, which we could have guessed from the target, but it's right
here. So the build tag is f24-build, and here is the repo ID. That repo ID was chosen by the build task and passed to each of the buildArch tasks, so each buildArch task uses the same repo ID. So we have the repo ID, but that's not quite what we need. What we really need is the event ID associated with that repo ID. To get that, I had to go to the API; sorry, this is sort of low-level stuff and the web UI doesn't really expose it, but the repoInfo call for that repo ID gives us the event ID, and I've highlighted it in red.

Yeah, koji call is a really handy shortcut for when you're feeling lazy. Look at the help on it: the default behavior is to do a sort of weird auto-conversion of your bash args to XML-RPC types. If it looks like an integer, it turns it into an integer; if it looks like a boolean, true/false, yes/no, it turns it into a boolean; if you write None, it turns it into a None; and otherwise it stays a string. If you want to pass a list or a dictionary, there's a --python option where you can embed Python expressions on the command line, which get parsed with ast.literal_eval, if your Python is new enough to have ast. It's a fun little command, and sometimes you just don't want to fire up a Python shell and type all the magic session setup just to run one call.

So let's do this. We have all the data, so it's not a lot of commands. Here I'm using koji call again, because I can't just do regen-repo; I need to pass in that event ID argument. I need to say: regenerate this repo for this build tag at the event we pulled from the last slide. Interestingly, when you regenerate a repo in Koji at a specific event that is not current, Koji will not mark it latest. Which is very important, because I don't want to accidentally affect somebody's build in a weird way by having this non-current repo suddenly become the current one. Oh wow, I'm almost out of time.
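Putting those two API calls together, the round trip looks roughly like this. The repo and event IDs are made up for illustration, and I'm assuming koji call's name=value keyword syntax with the auto-conversion described above:

```shell
# Look up the event behind the repo the original build used; the result
# dict includes the event ID we need for the regeneration.
koji call repoInfo 4821015

# Regenerate a repo for the build tag at that event; 'event' is
# auto-converted to an integer keyword argument by koji call.
koji call newRepo f24-build event=19731245
```

Because the event is not current, the resulting repo is deliberately not marked latest, so it can't leak into anyone else's builds.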
Time flies when we're having fun. All right. So, watch that task and wait for it to finish successfully. We need the result of that task to get the new repo ID we just created, which is that first number; the other thing that task returns is the event ID, which is the same one we passed in. Then we say koji build. I added --scratch and an arch override, because I didn't want to do a real build and I didn't want to waste any ARM time, but it's the same source URL, with --repo-id equal to the repo ID from above. So we're rebuilding the same source using the repo we just regenerated. Watch that task, wait for it to finish, and it completes.

And in the end, is it identical? Well, after it finished, I did a download-build of the original, and, for the new one, a download-task, which we just added in 1.10, to fetch the scratch build results. And I'm cheating a little here: I'm running rpmdiff with timestamps ignored, because one thing we have in RPMs is timestamps. In the RPM header itself there are lots of timestamps, so even if you produce exactly the same files in your RPM, they will have different timestamps in the CPIO archive inside the RPM. So when you do the rpmdiff, you have to ignore those; that's what the ignore option does. And when you compare the two RPMs: no output, as you can see from the prompt. So in this case, we'll call that a reproduction, modulo timestamps.

So it is possible; we do track the data. Hopefully that wasn't too tedious. It shows what the data is and how you might use it. Open questions: I have no idea how much of Fedora would reproduce if we did this for everything. We could find out. Not something I have enough time right now to dive into. Do we have failure cases beyond what Debian is finding? Are there cases where Debian can get byte-for-byte from the same source that we can't? There is a case with multi-arch builders: some of our noarch packages can be built on different architectures. We do actually know
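The rebuild-and-compare steps can be sketched as one session. All IDs and the source URL are invented, the exact interaction of --repo-id with the target argument may differ by Koji version, and the rpmdiff flag for ignoring file mtimes is my assumption of what the slide showed:

```shell
# Wait for the repo regeneration task; its result carries the new repo ID
koji watch-task 14617253

# Rebuild the same source against the regenerated repo (privileged option)
koji build --scratch --arch-override=x86_64 --repo-id=4821100 \
    f24 'git://pkgs.fedoraproject.org/rpms/foo#abc123'
koji watch-task 14617260

# Fetch both results and compare, ignoring mtimes in the payload
koji download-build foo-1.0-1.fc24
koji download-task 14617260
rpmdiff -i T foo-1.0-1.fc24.x86_64.rpm scratch/foo-1.0-1.fc24.x86_64.rpm
```

Silence from the final comparison means the two RPM payloads match apart from the ignored timestamps, which is as close to a reproduction as the RPM format allows without further work.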
that, though, because we do know which architecture the noarch packages were built on at the time. Yeah, but then if we used an arch override, we would have a hard time forcing a particular architecture. You're right; we have the data, but we don't quite have the tools.

And lastly: is there interest in Fedora in a reproducibility effort? I don't know if any of you are interested in this. I'm happy to help; I'm happy to point you at scripts and tools and data. If somebody wants to take point on this, that would be awesome. But at the same time, I think we have somewhat different goals than Debian, so we may or may not really need to do this. Right, yeah, that was shocking to me when I went to their talk. In a sense, a lot of what Debian has been doing is because they weren't tracking this metadata before, and they weren't using any kind of sane build environments. Like I said, at least when I saw their talk at FOSDEM, they still had processes in place where a developer could build on their local system and upload it to Debian, and that was something that would ship. To their credit, the Debian folks behind this effort hate that, and they want to get rid of it.

So I sort of ran out of time for Q&A, but if anybody wants to catch me later, I'm happy to answer questions. Thank you.