My name is Marcin Owsiany and I'm here to talk about resurrecting cruft. First of all, how many of you have actually tried using cruft? It's okay to admit it. And how many of you found it completely unusable and gave up after a few moments? Ah, not bad. First, a short introduction for those who haven't tried it. cruft is a package containing a tool which finds things that should be on your system but aren't, or which are on your system but shouldn't be. As an example of how it works: you invoke it as root, because it needs to scan the whole file system, and after a shorter or longer while, depending on your system size, it outputs a report on the things it didn't like. If all is fine, the report is empty.

Now, how does it do it? cruft does its job in a few stages. First it needs to find out what files are on the system. Just to make it clear: when I say files, I also mean all kinds of other stuff like directories; cruft doesn't really care. So it invokes find on all the file systems it can find, in turn, ignoring special file systems like proc and sysfs and a few others, and it dumps a list of all the files on those file systems into a set of lists which tell what files exist.

Then it needs to find out what should be on the system, and that part is taken care of by so-called explain scripts, which could theoretically be provided by individual packages, but so far are only provided by the cruft package itself or created by the local sysadmin. By the way, I'm talking about the cruft which is currently in experimental, version 0.18 or so. The previous version also works, but it works a bit differently, so this is hopefully how it will work. Each explain script is invoked and produces three lists of files, in different files. First, files which may exist: it's okay if such a file exists on the system. Then files which must exist: it's an error if such a file doesn't exist on the system. And files which must not exist. Examples of files which may exist are things like logs or spool files, which probably exist on the system, but it's not an error if they're missing, because you deleted them or they haven't been created yet. Files which must exist are usually files which are installed by a package, and some other things as well. And the files which must not exist are somewhat special, mostly for use when a package purges files. In general a package is not allowed to delete files belonging to other packages, but this is one of the exceptional cases, so one explain script can say that a file must not exist while another says the same file must exist. Yes, I'll talk about that later.

Then cruft needs to compare these two sources of information: what is there, and what should be there. It produces another three lists of files: files which are unexplained (they are there but nothing accounts for them), missing (they should be there but aren't), or forbidden (they must not be there, but are). These are created in a single run by a clever modified merge algorithm which does the job quite efficiently.
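To make the comparison stage concrete, here is a minimal sketch in Python (cruft itself is not written in Python, and all names here are illustrative, not cruft's actual code): one merge-style pass over sorted lists yields the three report categories.

```python
# Illustrative sketch of cruft's comparison stage, not its real code:
# merge-style passes over sorted, de-duplicated file lists.

def merge_diff(a, b):
    """Given two sorted lists, return (only_in_a, in_both)."""
    only_a, both = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            both.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            only_a.append(a[i]); i += 1
        else:
            j += 1
    only_a.extend(a[i:])
    return only_a, both

def compare(present, may, must, must_not):
    explained = sorted(set(may) | set(must))              # allowed to exist
    unexplained, _ = merge_diff(present, explained)       # there, unaccounted for
    missing, _ = merge_diff(sorted(must), present)        # should be there, isn't
    _, forbidden = merge_diff(sorted(must_not), present)  # there, but must not be
    return unexplained, missing, forbidden

present = sorted(["/etc/passwd", "/etc/nologin",
                  "/usr/local/stray", "/var/log/syslog"])
may = ["/var/log/syslog"]                  # e.g. a log file
must = ["/etc/fstab", "/etc/passwd"]
must_not = ["/etc/nologin"]
print(compare(present, may, must, must_not))
# (['/etc/nologin', '/usr/local/stray'], ['/etc/fstab'], ['/etc/nologin'])
```

Once the lists are sorted, each comparison is a single linear pass, which is presumably why the speaker calls the approach efficient.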
And then these outputs are merged into a report, which cruft can either send by email or write to standard output so you can view it in a pager, or you can use a Midnight Commander VFS plugin which lets you browse it in a tree-like way. That is especially useful if the report contains something like 100,000 files in different directories, which does happen on real systems, because cruft isn't perfect yet.

That was the core functionality of cruft. There's more. First, there's an additional possibility to do some checking; I call it integrity checking, but it's not really integrity. It's just a way to take advantage of the fact that find already scans all the file systems: it can output some subset of the files to another stream and then check them in some way. I'll talk about that in a moment. And there's another layer, called filtering, which also makes cruft much more flexible in some cases.

Let's talk about checking first. Maybe it's easiest to describe how it works now. Basically there's only one such stream currently, and it takes care of symlinks. find, as it walks all the file systems, just outputs all symlinks to a file. Then a special check script is run on that file, which simply checks whether each symlink is broken or not, and the list of broken symlinks is appended to the report. You could imagine other reasons to do such checking, for things like directories, or, maybe it's nonsense, but I've heard that the Hurd uses additional metadata on some files to implement translators, so maybe we could do such checking as well. Anyway, it's now quite well abstracted, and it's easy to add another check script to this scheme, unlike before.

Now about filtering. Filtering is performed by a simple script which reads all its input from the previous stages and passes everything it doesn't filter out to its output. It's a single program which reads different filter files depending on which part of the filtering it's supposed to do. The filters themselves are pretty simple: they use an additional syntax to get rid of whole subtrees, similar to the one Apache uses, if you know it. Why do we use filtering at all? Theoretically it would be possible to do the whole job using explain scripts, but in real life there are problematic cases, like files moving, packages renaming files, packages deleting files from other packages, conflicting explain scripts, and other things I don't even want to think about. Having such an additional layer makes it easy for a sysadmin to just get rid of some reports, or parts of reports, when they don't want to care about the fact that some package is buggy or that there is a conflict.
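As a rough illustration of the two mechanisms just described, here is a sketch with assumed function names and a simplified pattern syntax (cruft's real filter syntax differs):

```python
# Illustrative sketches, not cruft's real code.

import os
from fnmatch import fnmatch

def broken_symlinks(paths):
    """The check-script idea: keep only symlinks whose target is gone."""
    return [p for p in paths
            if os.path.islink(p) and not os.path.exists(p)]

def apply_filter(report_lines, patterns):
    """The filter idea: drop every report line matching a pattern.
    '<dir>/**' excludes a whole subtree (an assumed syntax)."""
    def matches(path, pat):
        if pat.endswith("/**"):
            return path.startswith(pat[:-2])   # keeps the trailing '/'
        return fnmatch(path, pat)              # note: '*' crosses '/' here
    return [line for line in report_lines
            if not any(matches(line, p) for p in patterns)]

# apply_filter(report, ["/var/log/**"]) drops everything under /var/log.
```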
Filtering would also let cruft use the packages' extras files, when that is finally implemented. And in general, filtering is much faster than using explain scripts, because with explain scripts, first, an additional process has to be created, and second, what they usually do is just run another find command to list some subtree and write it to standard output, so that cruft can then remove those files from the report. Filters make that easier and faster.

Now a bit of history. cruft was written a long, long time ago by Anthony, and even as it was being developed, Anthony saw that there is a need to register some files which aren't registered anywhere currently. A prominent example: if you look for /etc/passwd using the package search, it doesn't find it. That's not ideal, especially for new users, and he thought that having an additional database of all the files which are installed by Debian would be good. So he proposed the extras file change to policy. Unfortunately his proposal was abandoned, dpkg was not changed to add this functionality, and so cruft slowly bit-rotted over the next few years, with just a couple of non-maintainer uploads. Some time ago, at DebConf in Finland I think, I asked whether it was okay for me to try to take care of it, so he could get it off his hands.

So I imported the source into a Subversion repository, which hopefully makes it easier for other people to submit changes to the numerous filter and explain files. I fixed many bugs, just a couple are left, and I did some refactoring of the code, some of which was just quite ugly, and some bugs were impossible to fix without changing it. So it's now mostly cleaned up, and cruft no longer reports false positives on a base install: that is, if you run debootstrap and run cruft inside the result, it doesn't report anything. But as soon as you start using the system, some files will probably appear, like log files or cache files. There is still a long way for cruft to go.

But why the additional words in the title, the subtitle of this talk? I found that having another database of the files which are installed by a certain package is a good way to test the package itself. I found a couple of cases where the postinst script was buggy and didn't do what it was supposed to do with alternatives or diversions; somehow the system worked, but the files were not as they should be, and cruft detected exactly such things. It also detects things like random files appearing on your system because of a package bug, or because an administrator didn't clean up after themselves. That's another reason why I think we should have cruft.

To conclude, here are links to the threads on debian-policy and debian-devel that I could find from the past, and to the mc package, which contains the virtual file system plugin that is useful for viewing reports. That's about it. Any questions?
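To make the explain-script idea concrete, a hypothetical sketch (the real ones are shell scripts, and the three-list output convention shown here is an assumption): list a runtime subtree into the "may" list, and name files that must or must not exist.

```python
# Hypothetical explain script for an imaginary "myservice" package.
# Assumed usage: explain-myservice MAY_FILE MUST_FILE MUSTNOT_FILE

import os, sys

may_out, must_out, mustnot_out = (open(p, "w") for p in sys.argv[1:4])

# Logs and spool files may exist; their absence is not an error.
for dirpath, _, files in os.walk("/var/log/myservice"):
    print(dirpath, file=may_out)
    for name in files:
        print(os.path.join(dirpath, name), file=may_out)

# The config file must exist once the package is set up.
print("/etc/myservice.conf", file=must_out)

# A file this package purges from a predecessor must be gone.
print("/etc/oldservice.conf", file=mustnot_out)
```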
Slightly unrelated, but what is the status of these extras files?

The status is: there was a proposal a long time ago, and that's it. I don't have a link handy; you'd have to look at the list archives to find it.

Is there any way to make this find run any faster? It takes ages, at least the first run.

Well, every run, actually; I never tried running it twice in a row. Possibly. But I recognize that this is just a beginning: make it work properly first, before optimizing it.

This is interesting to me because more than a decade ago I was managing a group of people who maintained several thousand Unix machines, mostly HP-UX systems but also some Solaris and other things. We had a system, put in place as a result of demands by our internal auditors, with a data collection engine that was very similar to this, in the sense that it was running big finds over the file systems, producing huge lists of files, and then paring those down to eliminate things that were understood to be part of software packages delivered with the original OS. Then we would take the resulting long list of files, along with their permissions, owners and all that other useful stuff, and deliver it to a central database, on which we had one of the very early sort of web interfaces to anything running. Where it got really interesting was when you could see patterns of the same behavior happening across multiple machines, or when you could discern a new rule that would be useful to have in processing the data. I wonder if you have thought at all about how to take this beyond the "here's a human-readable text-formatted report" stage. I wonder if the future of this class of analysis tool isn't somehow to figure out how to hook it into one or more of the interesting system frameworks, you know, come up with some reasonable database schema for what an exception, or a piece of data which might have exception rules and filter rules applied to it, should look like. And I think that goes, in my mind, to the question of what people think the point of a tool like this is, and why the other folks in the room would be interested in having it. Is it to help general systems administration? Is it pure curiosity? Is it because you'd like to see if your packages are broken and doing strange things to the system? Is it because you run a lot of machines that are subject to auditing requirements and you want another tool for that? So I'm curious whether any of this animates anybody else.

That was a really nice question. Is this more for developers, to sort of clean up their packages, or other people's packages, or is it more for system administrators, to clean up the system after it has run for a while?

It's not just about cleaning up the system, it's more about finding faults in systems.

Well, then it really ties into intrusion detection systems, which also do things like that: they also scan files and check whether they've changed. The problem with intrusion detection systems is that they're not aware of the dpkg database, which has the potential to produce a shorter report that's easier to read. But this sort of dances around in the middle of all these things.

Well, then I think you also have to ask the question of why something shows up in a cruft report. I don't remember what it was called, but someone did a backup system for Debian machines at one point that was interesting because it would only try to back up those files which weren't part of the packages. Right.
So in fact, if something is flagged as cruft and it's out in some weird system directory, it may be because there's a packaging error. If it's flagged as cruft but it's in a directory that contains the files that actually matter to you on that system, it may be a good indication that that's what's unique about that machine and needs to be backed up: an additional configuration file that results in some job running, or, I don't know, whatever it is. And this is where, to me... I'm really pleased to see that cruft has had some attention, because, as people here will tell you, it's something I've talked about at various times way back. This is the class of tool that I think is neat, because it takes advantage of things that are uniquely interesting about Debian systems. But I immediately start trying to figure out: okay, if we get all the bugs fixed and it's sort of working, what would you want to do with it, and how might that drive your decisions about what to do next? I don't have a lot of answers, I have a lot of questions.

One thing I thought of: currently, as the simplest possible design, we have lists of files with one filename per line and no metadata apart from that. So one direction would be to change the protocol somehow, to let the sub-processes communicate something more than just a filename. For example, the filter process could not just filter things out but categorize these cruft items, so not just one "keep it or drop it" set of rules. Is it a file or a directory? Is it a log file? Is it what?

Well, that really, I think, comes back to what you're trying to accomplish, because if you're doing security auditing, what you might really care about is not only that the file exists, but what its owner, group and permissions are, maybe ACLs or other stuff, if you've got other things going on on your system. It could be that you're trying to do intrusion detection, and then you don't only care about that: you'd like to do the tripwire thing, asking whether any of the files that were delivered by packages on your system differ in their cryptographic checksums from what they were when they were installed. And this is where, for those of us like me who run a notebook with a Debian image that was first installed many years ago, it can be very instructive to run a tool like this just on your own system and get some sense of how much true cruft has accumulated. But if you do that once every five years or something, it doesn't tell us what else we should do with it. It's the sort of thing where, if you're about to move to a larger hard disk, and a few minutes with cruft would save you transferring that handful of log files you didn't even know were there, representing six gigabytes or something, then it might be worth the effort. So, you know, I'm just saying that I didn't think of any more advanced things until the process got to the point it has reached; we'll get to it.

So what would you like to happen? Should we all go try the current version and see what we think? I mean, it's neat to see that you've been working on this, that's great, but what would you like from us, and what would you like to do next?
I think the best way would be to have it work properly on something more than a base install: a small standard system which people really use, not one just created by debootstrap and kept pristine. Possibly other things would come up that would need more refactoring of the design, and I think at some point we'd get to where the design is stable enough to think about other things, like adding metadata support, or categorizing, or anything else.

Could you talk a little bit about what these false positives normally are on a system?

Anything that isn't directly installed by a package but is created by a daemon or another process. I have an example of one of the longer reports here; just finding that stuff out is something, but the rest needs the system to be more regularized; for example, where an extras file is missing, perhaps there's a bug filed already.

One of the things that I was originally working on, before I completely dropped it because extras files weren't going anywhere, was reworking the code so that rather than just running find, it would do the directory traversal itself, and only descend into a directory if the explanations indicated that only some of the files underneath it, perhaps not all of them, were covered. That would let you say that everything under /home is explained without actually having to walk everything under /home, which is probably even more important for NFS sorts of setups.

And I've just been running cruft now for the last five minutes, I think; it hasn't got a result yet.

Yeah, that would be needed, but I just wanted to postpone any performance improvements until after I got it to run correctly.

Well, I thought the same, and it's still on the first file system; it's at around 70%.

For example, one thing that could be an answer to your question, maybe not as the main operation mode: besides taking the data from the real live file system, you could also be able to use locatedb, because you have every file on your system registered there. I mean, crossing information between sources that are already available on the system, instead of fetching it every time you run it.

That's a good idea. It would be similar to integration with an intrusion detection system.

Of course that would not work for intrusion detection, because maybe the first thing an intruder would do is modify locatedb or whatever; but it works for package maintainers, or for users just looking for cruft, or for people who have a long-living system.

I didn't mean integrating locatedb with intrusion detection; I meant integrating cruft with intrusion detection, to make use of their databases, which might be more data-rich than locatedb.

So, one of the things that I found a little challenging the last time I played with it, which was frankly before you started working on it, is that it is expensive to do a complete run. One of the things I missed a little from that old system of ours, which unfortunately I couldn't just go fetch again because I don't work there anymore, was the notion that the process of collecting the data about the system and the further processing of that data were quite separate. It meant that you didn't have to rerun the big find if you wanted to change exclusions or something like that, if you wanted to say, okay, forget /home, you know, I wasn't thinking when I first ran it and I included all that data.
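A sketch of the pruned-traversal idea mentioned above, with assumed names (per the speaker, it was never actually implemented in cruft): walk the tree directly and never descend into a directory that an explanation already covers completely.

```python
import os

def walk_pruned(root, fully_explained):
    """Yield files under root, skipping any directory whose entire
    contents are known to be explained (e.g. '/home')."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into covered trees.
        dirnames[:] = [d for d in dirnames
                       if os.path.join(dirpath, d) not in fully_explained]
        for name in filenames:
            yield os.path.join(dirpath, name)

# With fully_explained={'/home'}, nothing under /home is ever visited,
# which matters for big or NFS-mounted trees.
```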
Is that stuff easy to do now? Does it do that already?

Kind of. You see, there are three variables here: files, reports, cleanup. So if you can tell it not to rerun the bit which runs find, which is the expensive part, you can split it in two: you could scan the file systems once, and then rerun the expensive part because you changed a filter or something.

Okay, so it's been there from the beginning. I see.

People are talking about IDSes because of security, everyone has security in mind, but I have another point, because I tend to work like this. For example, I set up a complete system, and I edit a lot of configuration files, and I create configuration files and many other files in different locations on my disk. Once I have done this, I want to take all this data and know what is important, what I have added to the system to make it run. It's very important to me, because I take this data and put it into Puppet or cfengine, which helps me replicate my system over the network. For now I'm using debsums just to list the configuration files I have modified. It's efficient: I get a list, I know that I've modified this file and that one, and I just copy those files into the Puppet repository and begin editing them to integrate with Puppet. It's hard to do it starting from Puppet, changing the file and uploading it to the system: it's slow, it's complicated. If you edit the configuration file directly and try the service, it's faster, but at the end you must remember all the changes you have made. So, just to give you an example, if it's possible, I don't know if it's what you have in mind: when I have done something with Puppet, just to record that I made the changes using Puppet, I upload an explain file saying "this file has been uploaded by Puppet, it's normal that it has been modified, it's here because I put it here". Then I can see all the changes except the changes I made through Puppet to my configuration files and the other data files that I uploaded to the system. Do you think it would be useful to use Puppet to list these changes?

Currently cruft only cares about additions or removals, not modifications at all.

Yes, that's the reason why I use debsums, and I don't think you can copy what debsums does for configuration files. But I have this problem, just to give you an example: in /etc I can have a file which the packages say should not exist, so it is not listed in a debsums run, but I have added it to the system because I need it. So it's complementary.

Regarding the use of debsums, to know which files you changed so you can replicate configuration among many systems: so, some combination of debsums' analysis, to know which of the files cruft isn't triggering on have been changed, combined with what cruft finds for the files that are not expected at all?

Yes, I think cruft could be used for that. It will tell you about added files, but it doesn't know about files that have changed; if we add some checksum check support, I think it could be okay.

I didn't know it takes that long; that could be because I used it only on my chroot. Just to show you, I have more than one chroot on the system.

So we're all watching with great curiosity whether you wait for it to finish or not.

It's still running.
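The questioner's Puppet idea could be as simple as generating a filter (or explain) file from the list of paths the configuration-management tool owns; a hypothetical sketch, with the paths and output location assumed:

```python
# Hypothetical: turn a list of Puppet-managed paths into a cruft-style
# filter file so those additions stop showing up in reports.

import os

managed = [                 # in reality, extracted from the manifests
    "/etc/myservice.conf",
    "/var/lib/myservice",   # a directory Puppet owns entirely
]

with open("puppet-filter", "w") as out:     # destination path assumed
    for path in sorted(managed):
        if os.path.isdir(path):
            out.write(path + "/**\n")       # whole-subtree pattern
        else:
            out.write(path + "\n")
```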
I have an idea for this application, though this would help a lot of things: wouldn't it be cool if we had a file system that cached checksums somewhere?

Well, that's fine too, but you could do both, and then you could use it for rsync and debsums and cruft and a whole bunch of other things.

You could probably use ext3, because it has the ability to store extended attributes; we could use those.

Well, then you could do it like some of the journaling: you could have it, you know, cache that stuff in some other file somewhere; that would be really fast.

You could use the kernel inotify mechanism and have a daemon do this.

Yeah, so on file change it would redo it, right. So we've got a two-gigabyte file and we've changed one byte of it: it recomputes the checksum of the entire file for one byte.

No, you could do it incrementally. That's what we'll do with all that extra RAM. It's got to be pretty fast; we could actually compute, you know, checksums per chunk. We'll figure it out. You might as well just hook into inotify and check, as a file is created, whether it should be created. But then it's kind of crazy; it's kind of a new file system design, storing this along with the other file metadata. There's probably some little stuff you can do like that.

Well, anyway, here's an example with lots of false positives. Why would those be there? How can those be in the report?

Those would be hard links to a single binary that are created in postinst.

Yes, those are all artifacts of nasty things that maintainers do, but there's more of that. So the problem is: if a package is doing nasty things like those hard links, where do you store the information about this not being cruft? In the package, or in extras files? One possibility is for the package to deliver it; that's the way we'd do this. Oh yeah, it's finished running. Well, basically, if it's not installed by the package and it's not something like an alternative...

Why is it that python-support is getting hit?

Because that's python-support; it seems to me that pretty much everything there would have to be registered, unless you want to keep a huge database of everything. So in theory it just has a filter saying that anything under /var/lib/python-support is controlled by python-support, so it's all fine. But if you had something under, say, /var/lib/python-suport, then that would be a mistake, because it's a typo: obviously something put it there accidentally and it's not controlled by python-support, so you might want to know about it. Or maybe python-support doesn't do its job properly, there's some bug; otherwise you'd never know about it.

Have you looked at the way the filters work in logcheck? Supposedly intrusion detection, or what do you call it?

What do you mean?

It has this whole, admittedly complicated, system for marking that an oddity is the responsibility of some package. Logcheck uses regular expressions quite extensively, matching all those patterns against the logs, so at a certain point a good regular expression may actually be better than just trying to enumerate paths in cruft, I imagine.

Well, if the author had actually written his documentation at the time, it might have gone a bit better. There's definitely lots of room for improvement. Other ideas, or offers to help?

I have a machine that was installed in 1997, and it's been running ever since, and I've just been totally sloppy about everything that I've installed on there.
Is it useful to you if I do a cruft run, get all this garbage back, and send it to you? Will that help you fix it?

Okay, it might be interesting to figure out where people could send cruft reports, and what kinds of people are willing to do that.

Just a comment on that: the good thing about popcon is that it runs quickly; running cruft in the background, well, that kind of slows you down a bit.

Yeah, it would give us better data.

But when I'm not using the machine, that capacity is just wasted anyway.

It flushes out the cache, that's the tough part.

At the moment I would like to work on more packages.

So do you want to start collecting more rules about what sort of stuff is okay, or do you want to work on getting the speed better? What's next: is the ultimate goal to have more patterns, and is the current performance deterring people from trying to submit contributions, or is improving performance more of a priority?

I think it should be a bit of both, because if it still gives you lots of false positives on your own system, you probably don't need a lot of input from other people yet; you need to sort it out on your own system first, and then you can work on making it more appealing for more people.

What do you tell package maintainers now who want to deliver things to explain away the files that they create? The documentation doesn't say they should explain away things that are automatically created, like log files.

I'd say for now the best way is to just send me an explain script or a filter file; I'll file it in the BTS and put it in cruft, because I haven't really thought about the way that overrides should work. Say a package ships an explain script, and then the maintainer forgets about it and the explain script gets outdated after some time, so I ship another, improved version in cruft which overrides it; but then the maintainer makes more changes again, and at some point he's got the better version, so we need to think of a way to let one override the other. One thing I thought of was to have a one-line comment in the explain script or filter file which would say the last relevantly-changed version of the package. Though I think for most packages you don't need to worry about an explain script, just a filter file, which is simply a list of the files the package has to manually delete on removal or purge, with slash-star and slash-double-star patterns. I think having a file like that for each of a few packages would pretty much get rid of those false positives.
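The "last relevantly-changed version" comment just proposed could be checked against the installed package using dpkg's own version comparison. A sketch, with the header convention assumed (dpkg-query and dpkg --compare-versions are real interfaces):

```python
import subprocess

def installed_version(pkg):
    out = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${Version}", pkg],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

def filter_is_stale(pkg, written_for):
    """True if pkg is newer than the version the shipped filter file
    says it was written for (the assumed one-line header)."""
    rc = subprocess.run(["dpkg", "--compare-versions",
                         installed_version(pkg), "gt", written_for])
    return rc.returncode == 0   # dpkg exits 0 when the relation holds
```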
I guess my point is just the mentality of what we're assuming cruft is trying to accomplish, out of the different classes of use we talked about: finding junk on your system, and so on. These things are different. Depending on what you tell the maintainer, logs may or may not be interesting: maybe you want to back up logs, but if you're just trying to say "no, this isn't garbage, it should be here", then you would have that in the filter. So you may even want to define multiple classes of things; you'd have to come up with use models for cruft. For something like that you probably want a report, generate a report saying: these are the files I've got no idea about, and these are the ones that should be here. But that's more like: give me a list of the files that are going to be taken care of by dpkg, and then you're just excluding the dpkg files, all the files that are taken care of. So where you're going with this is "known about, but outside of dpkg" and "unknown" as the two categories, and then somebody choosing to use cruft to back up their system would want to back up the things that are known about but outside of dpkg.

Yeah, I think that's easier; it's just a matter of putting dpkg in front of it, because it knows everything in dpkg, all the packages you might have installed previously and all their files, and then all the files that cruft flags.

I've been sitting here thinking about this, though, and the thing that's happened in the last few years that makes me wonder how many people would really do that very much is the emergence of tools that do delta-based backups, where you're in effect rsyncing your tree over to a server, and the server only stores the files that differ from the last time you did it. That leaves you in a situation where you have a no-brainer way to put the system back into the state it was in at any point within the window you've configured it to keep history for. And I wonder if that, as a truly generic approach to system restoration or replication, isn't almost more interesting than something like this, where I just keep thinking of more special cases you'd have to think about. It makes me wonder if this isn't a more interesting idea in the context of either system cleanup or possibly intrusion-detection kinds of interfaces, and, at least in my mind, maybe less interesting as a way to drive a backup-and-restoration process. I don't know; different people value different things. As for putting this much energy into figuring out how to handle the stuff that's roughly associated with the packaging system: interesting servers and interesting machines tend to have the vast majority of their content in the other data, the stuff that's not on that list. I just wonder. I'm excited about this, I've certainly thought about it a bunch of times myself, and it's another one of those perennially appealing ideas, but I'm just not sure it really delivers.

On a tangent: where I work, for Windows PCs they use a proprietary tool called Connected Backup. It backs up a whole bunch of PCs, but it looks at MD5 sums of files to figure out what's different. So if it's backing up 100 PCs and they're all running the same version of Windows, it only has to back up one copy of most of those Windows files; likewise, if 20 of those machines are running a particular application, it still only has to back up that application once. I thought that's kind of a neat idea to see for Linux.

We should ask Keith, because he actually uses something like that, I think, maybe something dirvish does; he was kind of convincing me that I ought to be backing up my machines. It just uses rsync, but it's also maybe comparing the different machines; it does use rsync to a server, and it notices when files are almost the same, so you can actually go back, not just per machine but across machines.
It seems obvious and logical that it would work that way; whether the code actually does, Keith would know. I think the Connected thing is even smart enough that if the MD5 sum of a file is the same, even if it's at a different path, it still does the right thing and connects the backups. But in the end we wouldn't have to worry about that exact issue you were mentioning, because, well, on Windows it's quite usual for a low-privileged user to install an application into a directory he can write to; on a Unix system that's not that common, although of course in large installations it may be.

It's a really big deal when you get into production environments, because then typically people want to be very careful about how they let things change and migrate. For a bunch of individual developers or a mixed population of machines, you're right, they're going to be less similar; but if you get into any kind of office environment, or certain kinds of large production environments, people really do want to control the process by which things are updated.

No, I mean, of course, I understand that point, and I completely agree, for example, with the solution you mentioned at the beginning of the talk: the backup system that took only the differences between a regular Debian install and the working machine.

But you're saying it's less of a win on a Debian-based system than it would be on a Windows-based system. It's quite strange for a Unix system that's not a huge machine to have the same binaries in different places, specifically for a single environment. Although, for data, yes, there's data as well. Think about unpacked kernel trees: it's not purely hypothetical that you may have 20 different kernel trees that are mostly the same. You could really win a lot there. Ideally what it would do is MD5-sum the file...

Oh, that would be an easy thing to do first, that would be a really good idea. You just use the MD5 sum, just like Git does: use the MD5 sum as a file name in a repository, or put each file into a directory that's keyed by the first two digits of its MD5 sum, and then hard-link that back into the tree. So you create a tree that hard-links into MD5-named files, and then the files are all shared across multiple trees. It's pretty much just using the Git idea to store your entire file system. That would not be hard to hack up.

I work with a similar tool that just goes through an entire tree, finds where files are identical, and hard-links them. That can be evil, because sometimes that's not the behavior you want; I think he was doing it when he was building ISOs, where you weren't going to change the files anyway. Well, the dirvish notion of having multiple trees that hard-link the same file across them, for multiple generations, is a huge feature for backups. rsync makes it so easy: you just make a hard-link tree copy and then rsync onto it, so it's creating a hard-link tree on top of it; it's really, really simple. There are a lot of directory entries, but who cares? There was a point in history when I went and looked at a few of these and I couldn't find anything that I thought was better, but it's been a while since I looked, and I never got around to implementing any of them. I still have my wife's machine to back up, so that's something; actually I've been very lucky a couple of times since the previous backup.
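A minimal sketch of the content-addressed hard-link store described in this discussion (all names assumed; the store and the snapshot trees must live on the same filesystem for hard links to work):

```python
import hashlib, os, shutil

def md5sum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def store_file(path, store):
    """Copy path into the store once, keyed by the first two hex digits
    of its MD5 sum, and return the blob's location."""
    digest = md5sum(path)
    bucket = os.path.join(store, digest[:2])    # e.g. store/ab/
    os.makedirs(bucket, exist_ok=True)
    blob = os.path.join(bucket, digest)
    if not os.path.exists(blob):
        shutil.copy2(path, blob)
    return blob

def snapshot(src_root, store, snap_root):
    """Recreate src_root under snap_root, hard-linking every regular
    file to its blob, so identical files share storage across trees."""
    for dirpath, _, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(snap_root, rel), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.isfile(src) and not os.path.islink(src):
                os.link(store_file(src, store),
                        os.path.join(snap_root, rel, name))
```

Repeated snapshots of mostly-identical trees then cost little more than the directory entries, which is the property the speakers attribute to dirvish-style backups.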
So I think we're officially over time now. Are we done with the cruft topic?

Well, anyway, I know there's room for much more discussion, but I think it has to be my role to say this BoF is over. Okay, thank you.