Yes, so I'm here today to talk to you about diffoscope and how you can use it as a better diff, for quality assurance, and things like that.

Moin. Apparently that's a North German thing to say as a welcome; North Germany, north Denmark, Scandinavia, that kind of thing, I'm told. People aren't shaking their heads, so I'm going to assume that's true.

This is my first PC, an IBM 5155. Sometimes when you rebooted it, it would somehow revert from booting from the hard disk to booting from a BASIC ROM, as in the programming language, a ROM that was on the motherboard for some reason. So randomly you'd just get a chance to program in BASIC, and then sometimes you wouldn't. I don't know why. It was quite fun, with this kind of clicky keyboard that folded in, and it was this kind of big desk thing.

Anyway, this is my first Debian. At the time it was already old. What's this one? Is this slink, 2.2? Yeah, and this is when we had US and non-US, so that's really dating it. Remember that?

This is my first contribution to Debian, 19 December 2006, sending a patch to lilypond. Just kind of interesting, and the response was, "Oh yeah, rock on, many thanks, I'll upload this and it will go out into etch." This was super motivating because etch was just coming out, and it was like, great, I've got one line of tiny patch in a release. This is super cool, and Thomas's response was super motivating. So that Christmas I basically spent reading all through the Debian web pages and stuff. Very well timed. So that's kind of a good lesson: if someone sends a patch, be like, "Yeah, cool, thanks." I got a little notice in the changelog. It was, you know, so stupid, but yeah, do that kind of thing.

Cool. So, moving on: why diffoscope? Why did we write diffoscope? What's the background here?
So it comes from reproducible builds. A very quick outline of that is that whilst you can get the source code for free software, you can download the source code for, you know, nginx or whatever, pretty much everyone just runs binaries on their servers or their systems: apt install blah, yum install zip, whatever, the Android Play Store or whatever. And can you actually trust that these two things correspond with each other? You can look at the source code, yeah, it looks all right, and then you install this binary. What happened? Who generated that? Can you trust that process? Can you trust who generated it? Even if you could trust them, can you trust them not to be exploited, etc.?

This is a big problem, because you can exploit a build farm and add Trojans into it, so every single binary that comes out is compromised. Kind of problematic. You can also target individual developers' machines. So I could go off to, say, your machine and add a little backdoor to it, so every binary that you give to friends and things like that is compromised in some way, steals all your bitcoins or whatever. And I can also turn up at your door and blackmail you into producing software that has compromises, or extra features, shall we say, that don't exist in the source code. What would happen there is that you'd release your source, but the binaries that you produce would have this sort of backdoor that someone's forcing you into producing. So don't do that.

Anyway, enough of that. What you do for reproducible builds is ensure that every time you build a piece of software you get an identical result. Multiple people then compare their builds and check whether they all get the same results. This means that an attacker must either have infected everyone at the same time or they haven't infected anyone. So the point here is that
you have to ensure the builds have identical results. Okay, great. Identical results.

So we started the reproducible builds project, etc., and we build two debs. I'm sorry about the colors there; you probably can't see it, but that says sha1sum a.deb b.deb. Anyway, we're comparing the SHA-1 sums of two binary Debian files. Okay, so these two files differ; they're not reproducible. Why is that?

So we'll run a diff on them. What can we learn from this? Well, not very much. They're compressed, so as soon as we see one change we'll just see cascading changes, because that's how compression works. I guess we know it's a deb, and probably an ar format file. Not very useful. Okay, so we'll go one level in and do a binary diff. Well, again, that's not really telling us very much.

So let's go one level in: ar x. This is in the New Maintainers' Guide, how you unpack a deb; everyone remembers this, right? You unpack a.deb with ar x, we do the same to b.deb, and then we diff the results of that. Yeah, 7-zip, okay, compressed contents, not very useful. So let's unpack the control tar inside those debs, and then run diff on that. It's still not really telling us anything useful about how to make this package reproducible. So let's unpack the tar.xz into the tar.

Inside that tar there's a file called md5sums, and we're starting to see some differences between files in these two debs. That's something meaningful. So now we have some idea that it's something to do with this /usr/bin/pnmixer binary. Okay, interesting. We'll unpack that and then do a diff on pnmixer itself.
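The two manual checks in this walk-through, comparing SHA-1 sums and then reading the md5sums file to see which packaged path changed, can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and it assumes the md5sums text has already been extracted from each deb by hand.

```python
import hashlib


def sha1_of(path):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse 'checksum  path' lines into a {path: checksum} dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        checksum, path = line.split(None, 1)
        entries[path.strip()] = checksum
    return entries


def differing_paths(md5sums_a, md5sums_b):
    """List paths whose checksum differs, or that exist on one side only."""
    a = parse_md5sums(md5sums_a)
    b = parse_md5sums(md5sums_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Feeding it the two md5sums files would point at the one binary worth looking at, which is as far as the manual approach gets before hitting compiled code.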
Okay, well, now we're back into just binary gobbledygook. This isn't very helpful. And this is taking quite a while, and if I remember correctly, Debian has a lot of packages, so this might take a little while.

So basically, I don't know if you know this particular meme: I should build a better diff. You know, that's not quite true. It was actually Lunar that started this project, and it was originally called debbindiff, because we wanted to diff binary Debian packages. So this is the initial commit, 2014: "This version is successfully able to report differences in two changes files. Not with much interesting details, but it's a start." Yes, and it was a start.

So, fast-forwarding. Oh, sorry about these colors. I don't know if we can do anything about the lights? No? Okay, whatever.

Basically, diffoscope works kind of like diff does normally. You give it two files and it'll output a sort of unified diff. So: diffoscope a and b, where one file contains the word foo and one contains the word bar. Brilliant. Nothing out of the ordinary. It's colored by default, so that's why you can't see it, but whatever.

It also supports archive formats. So if we tar up our a file and our b file into an a.tar and a b.tar and then run diffoscope on those tar files, we get this kind of hierarchy here. It's saying that there are differences between these files in the file list, they have different timestamps because I made them at different times, and here are the contents. So we get foo there and bar there, and we can see the difference between them. Well, I can; I don't know if you can. You'll get the slides later.

If we gzip these tar files and then run diffoscope on those gzipped things, it'll say, okay.
What it's done is unpack it first, and here's the metadata about the gzip process, and inside that are our a.tar and b.tar from the previous slides, and then the a file and the b file. So it's already going two levels deep into this tar.gz file. That's pretty cool. It's completely recursive, so I think it'll actually blow out after, I think, a thousand. We could try.

Yeah, so I've just jumped back a bit, just in case. Thank you. So, foo, bar: those are the a and b files. We've tarred them up, and now you see the hierarchy, our foo and bar files there, and then we've gzipped them. So there's a gzip layer, there's the tar layer, and then there are the files themselves.

And this is run on a real deb from the archive. Inside this deb there's a data.tar.xz, and in that xz file there's obviously a data.tar, and inside that tar file there's a .aff file, and inside that there's a version string that's different, and that looks like a build date. So we know that if we went back to the source package we could, with a very quick grep, work out where this .aff file is being generated. It'll usually be quite obvious that it's using the current build time, and then we can just patch that, fix it, etc.
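The recursive idea, peel off a container layer, then compare what is inside, and repeat, can be illustrated with a small sketch. This is not diffoscope's code: it only understands gzip and uncompressed tar, works on bytes already in memory, and the function names and the depth limit of 10 are choices made for this example.

```python
import gzip
import io
import tarfile


def tar_members(data):
    """Return {member name: bytes} for regular files, or None if not a tar."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as archive:
            return {m.name: archive.extractfile(m).read()
                    for m in archive.getmembers() if m.isfile()}
    except tarfile.TarError:
        return None


def compare(name, a, b, depth=0, max_depth=10):
    """Return a list of human-readable differences between two byte strings.

    Containers (gzip, tar) are unpacked and their members compared
    recursively; anything else is compared as opaque bytes.
    """
    if a == b:
        return []
    if depth >= max_depth:
        return [f"{name}: differs (recursion limit reached)"]

    # gzip streams start with the magic bytes 1f 8b.
    if a[:2] == b"\x1f\x8b" and b[:2] == b"\x1f\x8b":
        inner = compare(name + " (gunzipped)", gzip.decompress(a),
                        gzip.decompress(b), depth + 1, max_depth)
        return inner or [f"{name}: only gzip metadata differs"]

    members_a, members_b = tar_members(a), tar_members(b)
    if members_a is not None and members_b is not None:
        found = []
        for member in sorted(members_a.keys() | members_b.keys()):
            if member not in members_a or member not in members_b:
                found.append(f"{name}/{member}: present on one side only")
            else:
                found += compare(f"{name}/{member}", members_a[member],
                                 members_b[member], depth + 1, max_depth)
        return found or [f"{name}: only tar metadata differs"]

    return [f"{name}: contents differ"]
```

Given two gzipped tars that each hold one file, the result names the innermost file that differs instead of reporting that the outer compressed blobs are different.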
It's great. So this has gone from two rather obscure binary debs all the way to the fix in probably about five minutes, and you could probably send a patch off in that time. Without diffoscope, without this sort of recursive unpacking, you'd be completely lost. You'd be there with tar and ar x all day, working out which files are different and trying to use xxd and all this kind of nonsense.

Diffoscope has got some other things as well. If you're trying to do reproducible packages and things are varying just in the line ordering, we detect whether a file differs only in its line ordering. So here's file a, whose lines are in order. File b has these lines out of order. That's very difficult to say, actually; it's like those tongue twisters. Run diffoscope on those two and it says it's got ordering differences only. That's interesting, so it probably needs a sort. You can go all the way back to the source code and work out very quickly, if you know it's just ordering differences and you know what the inputs are going to be, where the ordering comes from, and you get the right files. Add a sort in the right place, bam, send the patch off, everything's great. Oh, and send it upstream as well, because you're good.

It supports a lot more things. We've been showing the terminal text output here. Moving on, it's got an HTML output mode, which is really useful with the hierarchical thing when it gets a bit more complicated. Instead of the diff being layered on top of each other like a unified diff, you get the diff on the left and the right, and you get a sort of nested thing inside, with colors and lines, and you can link to various things in it. It includes bits of metadata here and other bits here, and what command it used. So that's the HTML output. We also support a lot of file formats.
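The ordering-only check mentioned a moment ago is simple to sketch: two files differ only in line order when their lines are not equal as sequences but are equal once sorted. The function name below is invented for illustration, and this is not necessarily how diffoscope implements it.

```python
def differs_only_in_line_order(text_a, text_b):
    """True when both texts hold the same lines, but in a different order."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    return lines_a != lines_b and sorted(lines_a) == sorted(lines_b)
```

When this returns True, the likely fix is a missing sort somewhere in whatever generates the file.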
So it's not just text. It supports all of these, so to very quickly run through some of them.

If you give it two Android APK files, which are kind of like zips, but magic, it'll know how to compare them. There's a manifest file that needs decoding. It supports Berkeley DB databases. Word documents: that's a Word document with a, that's a Word document with b, and it'll correctly do that. If you ran that through diff normally it would obviously be a binary mess, so completely useless. Ebooks: that's an EPUB, and it also supports Mobi. If you give it two EPUB files it'll say, oh, they just differ in this date. Brilliant. Normally that would be a completely useless binary diff again. So you can be like, okay, EPUB date, grep the source code for that, bam, make a patch very quickly.

Mono binaries. Git repositories, yeah, why not? Gnumeric spreadsheets. ISO images. Oh yeah, ISO images are really cool. It'll basically unpack the ISO, and inside that there might be, say, a squashfs image, and then it'll just go all the way down into that and work out any differences between the contents of the two ISO files, including any metadata. This is the squashfs metadata headers, I think. But say inside that ISO there was a PDF, and inside that PDF was a PNG file which varied: it would basically go all the way down and say, yeah, it's actually here in this PNG that the data differs. And that means you can again go all the way back to the source and say, okay, cool, we know how to fix this quite quickly. This was really valuable in getting the recent Tails distribution reproducible.
So their ISOs are reproducible: if you build one and I build one, we get the exact same one. That's kind of useful for something like Tails, because of all the projects you might want to compromise, you might want to go after that one, because of the kind of people who are using it.

We support comparing images. I think this is using sng to convert them to text and then just running that through diff. That is a Linux penguin and that is something else, I can't remember what now. Anyway, it supports images.

It supports JSON, and it'll pretty-print. So if you give it two JSON files with different values for a key, it'll do a nice diff of them. It will pretty-print first, before doing the diff, so it'll actually give you something clean. Otherwise, I don't know if you've ever diffed two very long JSON lines: if they differ in the middle, you just get a huge long unified diff. But here it's like, oh, just those two things have changed. Cool, brilliant.

OpenDocument text formats. Ogg audio files, because why not? tcpdump capture files; that's actually quite useful. PDFs: that PDF says hello world and this PDF says hello sad world. I don't know why that particular text is in the demo. But again, run that through a normal diff program: garbage. XML documents: again, it'll pretty-print them, so it's actually nice to read.

So if you want to get started with diffoscope, the very easiest and quickest way is to fire up your web browser and go to try.diffoscope.org. Select two files, press compare, and it'll upload them and run diffoscope, with support for all of the file formats, in the cloud for you, and give you a nice HTML page that you can then link people to.
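The pretty-print-before-diffing trick described for JSON is easy to reproduce with Python's standard library. This is a sketch of the idea only, with made-up function names and an arbitrary indent of four spaces; key order is left alone so that ordering differences stay visible.

```python
import difflib
import json


def normalised(text):
    """Re-serialise JSON one value per line, keeping the original key order."""
    return json.dumps(json.loads(text), indent=4).splitlines()


def json_diff(text_a, text_b):
    """Unified diff of two JSON documents after pretty-printing both."""
    return list(difflib.unified_diff(normalised(text_a), normalised(text_b),
                                     "a.json", "b.json", lineterm=""))
```

Two single-line documents that differ in one value then produce one removed line and one added line, rather than a diff of the whole line.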
So that's the very quickest way to get started. The next quickest way is to install trydiffoscope, and then you can just run that on two files. It'll basically do the same thing, run it in the same cloud service as try.diffoscope.org, but either give you the result on the command line or, if you pass the web browser option, give you a URL or load your web browser, I can't remember exactly which. You get the same results, and you don't have to install anything; this is one kilobyte of Python, nothing basically.

But you can also install diffoscope itself on your own machine. I recommend not installing Recommends, because all of those file formats might drag in extra things. That'll be all of TeX, I think all of OpenOffice, all of Mono, all of Java. The Android stuff gets quite big. I think there's another big one I can't think of. They're all optional, and it'll say, oh, by the way, I support TeX documents, or Mono, or whatever, but you need to install this package, and then you get full pretty-printed support. It'll tell you when something is missing. So if you start with install-recommends disabled and run it on your files, and it says please install this package, you can install them as you go along, as you want, rather than installing everything. And then you just pass it two files and it works as before.

So, how you can improve your own quality assurance and Debian packaging with diffoscope. The biggest value here is not necessarily for reproducible builds. It's for seeing the cases where you do want to have a diff: you're expecting a diff, of a particular type and in a particular place, and you can see those changes.
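For use in a packaging or QA script, a locally installed diffoscope can be driven from Python. The sketch below rests on two assumptions taken from diffoscope's documentation as I understand it, not from this talk: that `--html FILE` writes an HTML report, and that the exit status is 0 for identical inputs, 1 for differences and 2 for errors. The helper names are hypothetical.

```python
import shutil
import subprocess


def diffoscope_command(path_a, path_b, html_report=None):
    """Build the argument list; --html is assumed to write a report file."""
    command = ["diffoscope"]
    if html_report is not None:
        command += ["--html", html_report]
    return command + [path_a, path_b]


def describe_exit_code(code):
    """Map the assumed exit status convention (0/1/other) to a label."""
    if code == 0:
        return "identical"
    if code == 1:
        return "differences found"
    return "error"


def run_diffoscope(path_a, path_b, html_report=None):
    """Run diffoscope if it is installed and describe the outcome."""
    if shutil.which("diffoscope") is None:
        raise RuntimeError("diffoscope is not installed")
    result = subprocess.run(diffoscope_command(path_a, path_b, html_report))
    return describe_exit_code(result.returncode)
```

A script can then assert "differences found" for a security upload, or "identical" after a toolchain change, depending on which outcome is expected.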
And if you built two debs normally... well, I'll try a demo in a second. If you build a deb without a patch applied and then build the deb with the patch applied, you could obviously run a diff on the source package, but that's not very useful, because it's the binaries that are going to end up on people's machines. And if you run a plain diff on the binary itself, you're like, did my changes actually hit the binary? I don't really know.

So I'll just run through a very quick live demo. Of course, it's going to fail. So, switch to... is that big enough for everyone? Yeah. Let's find some... I'm a DD, I promise. So we'll get this libnetx-java, and we'll just build that once. Okay.

So let's say we are on the security team and we want to apply a patch, and we want to be really sure, because we're going to push it out to all our users. First we will make a changelog entry, closing a bug. Cool. And then we'll find some Java file to change. Let's pretend we have a real patch. Let's pretend that the security flaw is... right, okay, let's replace that equals-equals. Say that was the fix. I'm making it up; it probably isn't, and it might break. Okay, so that's the patch from upstream, the upstream-blessed patch. Yes, yes, probably.

So when we build this, what we want to see is just that change in the file. We don't want to see any other nonsense changes we've accidentally made. But we also definitely want to see that change, because if our security release doesn't have that change then we aren't actually fixing people's machines. We'll issue a DSA, everyone will install it and think, oh, we're nice and secure, but actually you aren't. I mean, yeah, you should do proper testing as well, but hey, multiple levels, right?

Cool. So we'll build that again. There's no test suite, so that's good. All right. And then we will... let's say that directory.
Oh, I don't want to do that. Right, so we want to output HTML. We want to diff the original one, 0.5, you probably can't see that, with our fake security one there. So you see a nice little progress bar, 100 percent. One, so there are differences. Okay, there should be some differences. So let's see what those differences are, in our web browser, using the nice HTML output. Is that big enough for everyone? Yeah.

Okay, so let's have a look. Are we seeing what we want to see? There are some changes in the data tar; we kind of expect that. What's changed in our control file? Well, the version's changed. We wanted that to change, perfect, and it's changed to the new one. Okay, cool, that's what we want to see. No other changes here, so there was no weird control-file magic going on. Cool.

Right, in our data tar. Oh, we've got a lot of timestamp changes. Okay, we'll ignore those for now. The changelog has changed. Well, I hope so, because I added a changelog entry. There's my CVE number.

Right, here's what we want to start seeing: a change in the jar file, which is the compiled Java class archive format. We're seeing some meaningless timestamp changes, but we can kind of ignore those, let's pretend, because that's just metadata. Maybe. Okay, here's part of our class.
Okay, so if you can see here, it's basically done a decompilation of the Java class file itself, and it's saying that it used to say ifnull and now says ifnonnull. These are the actual Java bytecode instructions. What's really useful here is that nothing else has changed. We see just the expected change between the two opcodes, ifnull to ifnonnull, which is good, because it hasn't made any other code changes. But also, crucially, we can see that it has actually made a change to the code; for example, it wasn't using some cached version or something like that.

So this is really useful, and just running a naive diff wouldn't have given you that, of course, because it would have just come up with binary garbage. And just seeing that the deb had changed wouldn't have told you anything either, because the changelog would have changed as well. So it's like, well, it's different, but is the meaningful change there, the one that actually fixes the flaw? Here we know it's there. So that's kind of cool. Shipping this deb out, assuming that was the actual bug, I'd be quite confident in pushing that out, because it's a very minimal amount of changes. You want that with security releases, etc. So that's the sort of live demo.

The other case is seeing no changes at all. If your build is reproducible, you could build once, change your compiler or some other part of your toolchain, build it again, and if you get the exact same results, well, great. That's what you intended.
You want to see no changes when you change some part of it, assuming that's what you want, and that's really useful. If there were changes, diffoscope would highlight them and show you exactly why they had changed. It might be some compiler optimizations, it might be some other thing. So you can use it both ways, when you expect changes and when you don't expect changes, and if the result doesn't match your expectation, diffoscope will tell you exactly why.

It's also useful when other companies are doing security releases. Naming no names whatsoever, but they like to release patches as just a new firmware for your router, in fairly large file system images. You basically have no idea what's changed between these two files. Again, if you ran them through a diff, completely useless. You could start to unpack them with squashfs and blah blah blah, but they're probably some sort of concatenated cpio archive, so it's nonsense. But diffoscope will just chew through those and give you what the difference actually is between these two files. And so you're like, okay, cool, they've changed this, they've removed or added some GPL-licensed code or something. It'd be quite interesting. So it's very useful for diffing those kinds of binary blobs that come from various people.

So, what's the current status of diffoscope? The development has been up and down. Again, it started around, what was it, May 2014, something like that. A bunch of work here; that's probably... that is Heidelberg, I think. No, no, '15. These are probably just DebConfs, basically. Maybe; these dates, anyway. It's kind of interesting.

It's used a lot in the Reproducible Builds project, of course.
So every time we do a build on the tests.reproducible-builds.org testing framework, we run diffoscope on the result. If it's reproducible, it just says, hey, the files are the same, cool. But if not, we publish the diffoscope output for all your packages that are unreproducible, so you can just go on there and see what the difference is between these two things. I don't know what the differences are here; I think it's some ordering. Whatever, doesn't matter.

I also did a lot of work optimizing diffoscope. It had some rather perverse, I think n-squared, loops inside it, so I managed to cut down some of the time here, and down here, and here. So there have been quite a few performance enhancements over the past while. These are the git tags, so this is version 80, I think, and this is version 50, and I just ran the same benchmark across them all. This shows where I've introduced some rather, well, I'm going to say rather clever optimizations, but it's more like removing rather stupid code. Embarrassing, but whatever, it's sped up now.

There's work being done right now on parallel processing. There have been quite a few attempts at it before, but adding it is kind of interesting, and difficult. Luckily we have an Outreachy student, Juliana. Is she in the room? She's hiding. She's here, and she will be talking tomorrow about her work on parallel processing in diffoscope. That'd be amazing, because a lot of it is sort of I/O bound or waiting for external processes, and with multiple-CPU machines, whilst I'm waiting for the result of a PDF being unpacked, I might as well be running something on another CPU. So I think we'll see some real performance wins once we get parallel processing merged and working.

You can check out our website, diffoscope.org. We recently migrated to salsa.
Yeah Um, yeah Um, and everything everything the run reproducing world is now on salsa. She's kind of cool. Um, that's quite recent You know cutting it a bit fine Um, so, yeah, um, thank you very much. Thank you shun. Have you got any questions about diffiscope? Just launch them out. Oh, we've got one over here. Yeah, but anyway, thank you very much Thank you Yeah buzzword question. Can it if uh container image format? Uh, depends which ones So if they are just, um Directory for directories Then yes, because it's just a directory Do you have ones particularly in mind like docker? Yeah The obvious one is docker and then there's this oci. I believe is the standard one And that could make it buzzword compliant. Okay. Well, we're all about buzzwords, right? Yeah, I mean It could probably diffiscope Blockchains as well and then run your diffoscope on Kubernetes and see the difference between updates of your container images Bam sold Where do I invest? um So I wasn't aware that oh oci. Oh, is that what it's called? Old ci oci. Um, so no it doesn't basically doesn't support that right now Um, but it wouldn't be too difficult Presumably there are tools to unpack it and as soon as we have a tool to unpack it It can then just go into into that. There is a open wishlist bug for docker um containers and To the point where I think it would be really nice if you could just give it say two image names or whatever the noun is So you can say oh, yeah, please diff these two docker images that are available and it could look at your Local thing and do a diff on them But currently it's not supported, but there is an open wishlist bug Yeah, shouldn't any company that releases binaries be interested in Supporting diffoscope and using it Uh, the response just for the microphone. 
Well, take the microphone. Basically, when a company releases binaries, they are not interested in users seeing the differences.

Yes. I'm actually surprised that the Docker bug was only opened two months ago, so I'm surprised that there hasn't been more interest in diffing container images. But if you would like to open one for OCI, that would be really appreciated, and we can get on to that. That'd be great.

Looking at the page for OCI, it says that it's based on Docker, basically. So I think you get OCI for free once you've sorted out Docker.

Nice. Good. If you're lucky.

The OCI image format is where they wrote down how Docker images work.

Oh, okay. Perfect. So yeah, we'll sort that out. And it seems like we're using Docker a little bit more in Debian, so this is really quite interesting. Cool. Any other questions?

Out of curiosity, which algorithm are you using inside? Are you using some bioinformatics algorithm to diff trees efficiently?

No, it's really naive. All it does is run the normal diff tool, but it will try to identify files and unpack them first, if you see what I mean. It will use the file utility, you know, the file identifier thing, that says this is a PDF, and it goes, ah, okay, we're a PDF, so we'll try to unpack it first. So it doesn't do any clever matching. The only clever matching it does do is some fuzzy matching: if you just rename a directory inside a container, it'll say, yeah, there's a massive fuzzy match between these two files, things like that. So that's kind of useful, but apart from that it's not that clever. Which is, I think, what you want, because if it was too clever it would start to be a little bit opaque.
You'd start to be like, well... I mean, I personally like quite dumb tools.

So one question to you would be this. If you want to do a release to stable or something like that, you get asked for the debdiff. When doing that myself, I've been submitting diffoscope output as well, because it's just slightly more readable and useful. So I'm wondering whether anyone would have any objection to people asking for those, instead of just running on the source. Yeah, we have at least one thumbs up. So I'll propose that to the release team and see what they say. Cool. Anyway, thank you very much. Any other questions?

No further questions. Then let's thank Chris again. Thank you very much.